What is LoRA?
Parameter-Efficient Fine-Tuning (PEFT) methods enable fine-tuning of large pre-trained language models with significantly fewer parameters. Low-Rank Adaptation (LoRA) is one of the most widely used PEFT techniques.
LoRA freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
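Concretely, for a frozen pre-trained weight matrix W_0, LoRA parameterizes the weight update as a product of two low-rank matrices, scaled by alpha / r:

```latex
h = W_0 x + \Delta W\, x = W_0 x + \frac{\alpha}{r}\, B A\, x,
\qquad A \in \mathbb{R}^{r \times d},\quad B \in \mathbb{R}^{d \times r},\quad r \ll d
```

Only A and B are trained, so each adapted d-by-d matrix contributes 2dr trainable parameters instead of d^2. Initializing B to zero makes Delta W zero at the start of training, so the adapted model begins exactly at the pre-trained one.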
[Diagram: the pre-trained weight matrix W is frozen; the low-rank matrices A and B are trainable.]
The Need for Parameter Efficiency
As models like GPT-3 reached 175B parameters, full fine-tuning became prohibitively expensive. LoRA drastically reduces the hardware barriers for fine-tuning while retaining performance on par with full fine-tuning.
In the original paper, full fine-tuning of GPT-3 175B consumed roughly 1.2 TB of GPU memory during training; LoRA reduced this to about 350 GB, and shrank the task-specific checkpoint from 350 GB to roughly 35 MB. For smaller models, this reduction makes fine-tuning feasible on a single consumer GPU.
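The source of the savings is easy to see by counting parameters for a single weight matrix. A quick sketch (the d_model and rank values here are illustrative assumptions, not figures from the text):

```python
# Parameter-count comparison for one d x d attention weight matrix.
# d_model = 4096 and rank = 8 are illustrative choices.
d_model = 4096
rank = 8

full_params = d_model * d_model   # every entry of W is trainable
lora_params = 2 * d_model * rank  # A (d x r) plus B (r x d)

reduction = full_params / lora_params
print(f"full: {full_params:,}  LoRA: {lora_params:,}  ~{reduction:.0f}x fewer")
# -> full: 16,777,216  LoRA: 65,536  ~256x fewer
```

The ratio d / (2r) grows with model width, which is why the savings are most dramatic for very large models.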
Code Implementation
Here is a conceptual implementation of LoRA.
import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, d_model, rank=8, alpha=16):
        super().__init__()
        self.d_model = d_model
        self.rank = rank
        self.scaling = alpha / rank
        # A: (d_model, rank), Gaussian init scaled by 1/sqrt(d_model)
        self.lora_A = nn.Parameter(torch.randn(d_model, rank) / math.sqrt(d_model))
        # B: (rank, d_model), initialized to zero so the update A @ B starts at zero
        self.lora_B = nn.Parameter(torch.zeros(rank, d_model))

    def forward(self, x):
        # Returns only the low-rank update (alpha / rank) * x @ A @ B;
        # the frozen pre-trained path W x is added elsewhere in the model.
        lora_update = (x @ self.lora_A @ self.lora_B) * self.scaling
        return lora_update
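A quick usage check confirms two properties of the layer above: the update has the same shape as the input, and it is exactly zero before any training because B starts at zero. (The class is repeated here so the snippet runs standalone; the dimensions are illustrative.)

```python
import math
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Same layer as above, repeated so this snippet is self-contained."""
    def __init__(self, d_model, rank=8, alpha=16):
        super().__init__()
        self.scaling = alpha / rank
        self.lora_A = nn.Parameter(torch.randn(d_model, rank) / math.sqrt(d_model))
        self.lora_B = nn.Parameter(torch.zeros(rank, d_model))

    def forward(self, x):
        return (x @ self.lora_A @ self.lora_B) * self.scaling

layer = LoRALayer(d_model=64, rank=8, alpha=16)
x = torch.randn(2, 10, 64)          # (batch, sequence, d_model)
update = layer(x)

print(update.shape)                 # torch.Size([2, 10, 64])
# Because lora_B is zero-initialized, the update is a no-op at initialization:
print(torch.allclose(update, torch.zeros_like(update)))  # True
```

In a real model, this update would be added to the output of a frozen nn.Linear (for example, the attention query and value projections, the choice used in the original paper).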