What is LoRA?

Parameter-Efficient Fine-Tuning (PEFT) methods enable fine-tuning of large pre-trained language models with significantly fewer parameters. Low-Rank Adaptation (LoRA) is one of the most widely used PEFT techniques.

LoRA freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
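To see how large the savings are, consider a single d × d projection matrix. Fine-tuning it directly trains d² parameters, while LoRA trains only the two low-rank factors, 2·d·r parameters. A minimal sketch with illustrative numbers (d = 4096, r = 8 are assumptions, not values from the text):

```python
# Illustrative numbers: a single d x d projection with d = 4096, LoRA rank r = 8.
d, r = 4096, 8

full_params = d * d          # trainable parameters if W itself were fine-tuned
lora_params = d * r + r * d  # trainable parameters in the A and B factors

print(full_params)                # 16777216
print(lora_params)                # 65536
print(full_params // lora_params) # 256x fewer trainable parameters
```

At rank 8, the adapter trains 256 times fewer parameters than the full matrix, and the savings grow as d increases or r shrinks.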

[Diagram: the frozen pre-trained weight matrix W (d × d) is summed with the LoRA update, the product of two trainable matrices B (d × r) and A (r × d). Final computation: h = Wx + BAx]
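The computation h = Wx + BAx can be checked with plain tensors. In the paper's initialization, A is Gaussian and B is zero, so the update BA starts at zero and the adapted model initially behaves exactly like the base model. A toy sketch (the dimensions are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
d, r = 16, 4  # toy dimensions for illustration

W = torch.randn(d, d)   # frozen pre-trained weight
A = torch.randn(r, d)   # Gaussian-initialized low-rank factor
B = torch.zeros(d, r)   # zero-initialized, so BA = 0 at step 0

x = torch.randn(d)
h = W @ x + (B @ A) @ x  # h = Wx + BAx

# Because B starts at zero, the LoRA path contributes nothing initially:
assert torch.allclose(h, W @ x)
```

Note that BA is a full d × d matrix but has rank at most r, which is what keeps the trainable parameter count low.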

The Need for Parameter Efficiency

As models like GPT-3 reached 175B parameters, full fine-tuning became prohibitively expensive. LoRA drastically reduces the hardware barriers for fine-tuning while retaining performance on par with full fine-tuning.

Crucial Hardware Fact

Full fine-tuning of GPT-3 175B requires roughly 1.2TB of GPU memory; the LoRA paper reports reducing this to about 350GB. Just as importantly, the task-specific checkpoint shrinks from 350GB to roughly 35MB (with r = 4, adapting only the query and value projections), making it cheap to store and swap many fine-tuned variants of one base model.

Code Implementation

Here is a conceptual PyTorch implementation of a single LoRA layer.

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
  def __init__(self, d_model, rank=8, alpha=16):
    super().__init__()
    self.d_model = d_model
    self.rank = rank
    # The update is scaled by alpha / rank, so changing the rank does not
    # require retuning the learning rate.
    self.scaling = alpha / rank

    # A matrix: d_model x rank, Gaussian-initialized (scaled for stability)
    self.lora_A = nn.Parameter(torch.randn(d_model, rank) / math.sqrt(d_model))
    # B matrix: rank x d_model, zero-initialized so the update BA is zero
    # at the start of training and the adapted model matches the base model
    self.lora_B = nn.Parameter(torch.zeros(rank, d_model))

  def forward(self, x):
    # Returns only the scaled low-rank update BAx; the caller adds it to
    # the frozen base output Wx to form h = Wx + BAx
    lora_update = (x @ self.lora_A @ self.lora_B) * self.scaling
    return lora_update
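A hypothetical usage sketch: pair the layer above with a frozen base projection (a plain `nn.Linear` standing in for the pre-trained weight W) and add the two outputs. The class is repeated here so the sketch is self-contained; the specific dimensions are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

# LoRALayer as defined above, condensed so this sketch runs on its own.
class LoRALayer(nn.Module):
  def __init__(self, d_model, rank=8, alpha=16):
    super().__init__()
    self.scaling = alpha / rank
    self.lora_A = nn.Parameter(torch.randn(d_model, rank) / math.sqrt(d_model))
    self.lora_B = nn.Parameter(torch.zeros(rank, d_model))

  def forward(self, x):
    return (x @ self.lora_A @ self.lora_B) * self.scaling

torch.manual_seed(0)
d = 32
base = nn.Linear(d, d, bias=False)   # stands in for the pre-trained projection
base.weight.requires_grad = False    # freeze the pre-trained weight W

lora = LoRALayer(d, rank=4, alpha=8)
x = torch.randn(2, d)
h = base(x) + lora(x)                # h = Wx + scaled BAx

# B is zero-initialized, so the adapted model matches the base model at step 0
assert torch.allclose(h, base(x))
```

During training, only `lora_A` and `lora_B` receive gradients; the frozen base weight never changes, so the adapter can be merged into W afterwards or kept as a small separate checkpoint.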