Primers • Parameter-Efficient Fine-Tuning
Overview
Fine-tuning large pretrained models on downstream tasks is called "transfer learning".
While full fine-tuning of pretrained models on downstream tasks is a common, effective approach, it is an inefficient approach to transfer learning.
The simplest way out for efficient fine-tuning could be to freeze the network's lower layers and adapt only the top ones to specific tasks.
In this article, we'll explore Parameter-Efficient Fine-Tuning (PEFT) methods that enable us to adapt a pretrained model to downstream tasks more efficiently – in a way that trains fewer parameters and hence saves cost and training time, while also yielding performance similar to full fine-tuning.
Parameter-Efficient Fine-Tuning (PEFT)
Let's start off by defining what parameter-efficient fine-tuning is and give some context on it.
Parameter-efficient fine-tuning is particularly used in the context of large-scale pretrained models (such as in NLP), to adapt a pretrained model to a new task without drastically increasing the number of trainable parameters.
The challenge is this: modern pretrained models (like BERT, GPT, T5, etc.) contain hundreds of millions, if not billions, of parameters. Fine-tuning all these parameters on a downstream task, especially when the available dataset for that task is small, can easily lead to overfitting. The model may simply memorize the training data instead of learning genuine patterns. Moreover, introducing additional layers or parameters during fine-tuning can drastically increase computational requirements and memory consumption.
As mentioned earlier, PEFT lets us fine-tune only a small number of model parameters while freezing most of the parameters of the pretrained LLM. This helps overcome the catastrophic forgetting issue that fully fine-tuned LLMs face, where the LLM forgets the original task it was trained on after being fine-tuned.
 The image below (source) gives a nice overview of PEFT and its benefits.
Advantages
Parameter-efficient fine-tuning is useful for the following reasons:
 Reduced computational costs (requires fewer GPUs and GPU time).
 Faster training times (finishes training faster).
 Lower hardware requirements (works with cheaper GPUs with less VRAM).
 Better modeling performance (reduces overfitting).
 Less storage (majority of weights can be shared across different tasks).
Practical use case
 Credits to the below section go to Pranay Pasula.
PEFT obviates the need for 40GB or 80GB A100s to make use of powerful LLMs. In other words, you can fine-tune 10B+ parameter LLMs for your desired task for free or on cheap consumer GPUs.
Using PEFT methods like LoRA, especially 4-bit quantized base models via QLoRA, you can fine-tune 10B+ parameter LLMs that are 30-40GB in size on 16GB GPUs. If it's out of your budget to buy a 16GB GPU/TPU, Google Colab occasionally offers a 16GB VRAM Tesla T4 for free. Remember to save your model checkpoints every now and then and reload them as necessary, in the event of a Colab disconnect/kernel crash.
If you're fine-tuning on a single task, the base models are already so expressive that you need only a few (~10s-100s of) examples to perform well on this task. With PEFT via LoRA, you need to train only a trivial fraction of the parameters (in this case, 0.08%), and though the weights are stored as 4-bit, computations are still done at 16-bit.
Note that while a good amount of VRAM is still needed for the fine-tuning process, using PEFT with a small enough batch size and a little gradient accumulation can do the trick while still retaining fp16 computation. In some cases, the performance on the fine-tuned task can be comparable to that of a fine-tuned 16-bit model.
Key takeaway: you can fine-tune powerful LLMs to perform well on a desired task using free compute. Use a <10B parameter model (which is still huge), apply quantization, PEFT, and checkpointing, provide a small training set, and you can quickly fine-tune this model for your use case (a code sketch follows below).
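As a concrete illustration, here is a minimal sketch of this recipe using the Hugging Face transformers and peft libraries. The model name and hyperparameter values are placeholders, not recommendations; it assumes bitsandbytes is installed for 4-bit loading.

```python
# Minimal QLoRA-style setup sketch: 4-bit base model + LoRA adapters.
# Model name and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "huggyllama/llama-7b"  # any ~7B-10B causal LM (placeholder)

# Load the frozen base model in 4-bit so it fits on a 16GB GPU.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config)

# Attach LoRA adapters; only these (a tiny fraction of weights) are trained.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 0.1% trainable
```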
PEFT methods
Below, we delve into the individual PEFT methods and their nuances.
Prompt Modifications
Soft Prompt Tuning
First introduced in The Power of Scale for Parameter-Efficient Prompt Tuning, this paper by Lester et al. introduces a simple yet effective method called soft prompt tuning, which prepends a trainable tensor to the model's input embeddings, essentially creating a soft prompt that conditions frozen language models to perform specific downstream tasks. Unlike discrete text prompts, soft prompts are learned through backpropagation and can be fine-tuned to incorporate signals from any number of labeled examples (a minimal code sketch follows below).
Soft prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pretrained model.
The authors show that prompt tuning outperforms few-shot learning by a large margin, and becomes more competitive with scale.
This is an interesting approach that can help to effectively use a single frozen model for multi-task serving.
Model tuning requires making a task-specific copy of the entire pretrained model for each downstream task, and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pretrained model. With a T5 "XXL" model, each copy of the tuned model requires 11 billion parameters. By contrast, the tuned prompts would only require 20,480 parameters per task (assuming a prompt length of 5 tokens) – a reduction of over five orders of magnitude.
Thus, instead of using discrete text prompts, prompt tuning employs soft prompts. Soft prompts are learnable and conditioned through backpropagation, making them adaptable for specific tasks.
Prompt tuning offers many benefits, such as:
Memory efficiency: prompt tuning dramatically reduces memory requirements. For instance, while a T5 "XXL" model necessitates 11 billion parameters for each task-specific model, prompt-tuned models need a mere 20,480 parameters (assuming a prompt length of 5 tokens).
Versatility: it enables the use of a single frozen model for multi-task operations.
Performance: it outshines few-shot learning and becomes more competitive as the scale grows.
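To make the mechanism concrete, here is a minimal PyTorch sketch of soft prompt tuning: a trainable tensor of "virtual token" embeddings is prepended to the frozen model's input embeddings. The class and parameter names are illustrative, and it assumes an HF-style model that accepts an `inputs_embeds` argument.

```python
# Minimal sketch of soft prompt tuning, assuming an HF-style base model.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, base_model, embed_dim, prompt_len=5):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # the LM stays frozen
        # The only trainable parameters: a (prompt_len x embed_dim) soft prompt.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds):
        # input_embeds: (batch, seq_len, embed_dim), e.g., from the LM's embedding layer.
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the soft prompt, then run the frozen model on raw embeddings.
        return self.base_model(inputs_embeds=torch.cat([prompt, input_embeds], dim=1))
```

During training, only `soft_prompt` receives gradients, so each task is stored as a tiny prompt tensor rather than a full model copy.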
Soft Prompt vs. Prompting
 Soft prompt tuning and prompting a model with extra context are both methods designed to guide a model’s behavior for specific tasks, but they operate in different ways. Here’s how they differ:
 Mechanism:
Soft Prompt Tuning: This involves introducing trainable parameters (soft prompts) that are concatenated or added to the model's input embeddings. These soft prompts are learned during the fine-tuning process and are adjusted through backpropagation to condition the model to produce desired outputs for specific tasks.
Prompting with Extra Context: This method involves feeding the model hand-crafted or predefined text prompts that provide additional context. There's no explicit fine-tuning; instead, the model leverages its pretrained knowledge to produce outputs based on the provided context. This method is common in few-shot learning scenarios where the model is given a few examples as prompts and then asked to generalize to a new example.
 Trainability:
Soft Prompt Tuning: The soft prompts are trainable. They get adjusted during the fine-tuning process to optimize the model's performance on the target task.
 Prompting with Extra Context: The prompts are static and not trainable. They’re designed (often manually) to give the model the necessary context for the desired task.
 Use Case:
Soft Prompt Tuning: This method is particularly useful when there's a need to adapt a pretrained model to various downstream tasks without adding significant computational overhead. Since the soft prompts are learned and optimized, they can capture nuanced information necessary for the task.
Prompting with Extra Context: This is often used when fine-tuning isn't feasible or when working with models in a zero-shot or few-shot setting. It's a way to leverage the vast knowledge contained in large pretrained models by just guiding their behavior with carefully crafted prompts.
 In essence, while both methods use prompts to guide the model, soft prompt tuning involves learning and adjusting these prompts, whereas prompting with extra context involves using static, handcrafted prompts to guide the model’s behavior.
Prefix Tuning
Proposed in Prefix-Tuning: Optimizing Continuous Prompts for Generation, prefix-tuning is a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen but optimizes a small continuous task-specific vector (called the prefix).
Instead of adding a soft prompt to the model input, it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM's original parameters are kept frozen while the prefix parameters are updated.
Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens".
The figure below from the paper shows that fine-tuning (top) updates all Transformer parameters (the red Transformer box) and requires storing a full model copy for each task. The authors propose prefix-tuning (bottom), which freezes the Transformer parameters and only optimizes the prefix (the red prefix blocks). Consequently, prefix-tuning only needs to store the prefix for each task, making prefix-tuning modular and space-efficient. Note that each vertical block denotes transformer activations at one time step.

They apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. They find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full-data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.

The image below (source) illustrates how, in prefix tuning, trainable tensors are added to each transformer block instead of only to the input embeddings.
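The sketch below illustrates the core idea in PyTorch: trainable prefix key/value vectors are prepended inside an attention layer while the pretrained projections stay frozen. It is a simplified single-head layer without masking, and all names and shapes are illustrative rather than taken from the paper's implementation.

```python
# Minimal sketch of prefix tuning inside one (single-head) attention layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    def __init__(self, embed_dim, prefix_len=10):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim)
        self.k = nn.Linear(embed_dim, embed_dim)
        self.v = nn.Linear(embed_dim, embed_dim)
        for p in self.parameters():
            p.requires_grad = False  # pretrained projections stay frozen
        # Trainable prefix key/value "virtual tokens" -- the only updated parameters.
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        b = x.size(0)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Subsequent tokens attend to the prefix as if it were real tokens.
        k = torch.cat([self.prefix_k.unsqueeze(0).expand(b, -1, -1), k], dim=1)
        v = torch.cat([self.prefix_v.unsqueeze(0).expand(b, -1, -1), v], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v
```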
Hard prompt tuning
Hard prompt tuning directly modifies the input prompt to the model. This can involve many things, such as:
Adding examples of the outputs we expect from the prompt.
Adding tags specifically relating to our task at hand.
In essence, it is just the modification of the string input, or prompt, to the model.
Adapters
Adapter layers, often termed "adapters", add minimal additional parameters to the pretrained model. These adapters are inserted between existing layers of the network.
Adapters are a PEFT technique shown to achieve performance similar to tuning the top layers while requiring about two orders of magnitude fewer parameters.
Adapter-based tuning simply inserts new modules called "adapter modules" between the layers of the pretrained network.
The image below (source) illustrates this concept for the transformer block:
During fine-tuning, only the parameters of these adapter layers are updated, while the original model parameters are kept fixed. This results in a model with a small number of additional parameters that are task-specific.
Keeping the full pretrained model frozen, these modules are the only optimizable ones while fine-tuning – this means only very few parameters are introduced per task, yielding "compact" models.
They offer many benefits such as:
Parameter efficiency: by keeping the main model frozen and only updating the adapter layers, a minimal number of parameters are added per task. This results in compact models that are memory-efficient.
Performance: despite the small parameter footprint, adapters often achieve performance comparable to conventional fine-tuning.

The adapter module consists of two fully connected layers with a bottleneck structure. This structure is inspired by autoencoders, which are designed to encode information into a compressed representation and then decode it back to its original form.
 Here’s how the parameter efficiency is achieved:

Bottleneck Structure: The first layer of the adapter reduces the dimensionality of the input (e.g., from 1024 to 24 dimensions). This drastic reduction means that the information from the original 1024 dimensions must be compressed into just 24 dimensions. The second layer then projects these 24 dimensions back to the original 1024 dimensions.

Reduction in Parameters: This bottleneck approach significantly reduces the number of parameters. In this example, the total number of parameters introduced by the adapter is 49,152 (from the computation 1024x24 + 24x1024). If we were to use a single fully connected layer to project a 1024-dimensional input to a 1024-dimensional output directly, it would require 1,048,576 parameters (1024x1024).

Efficiency Analysis: By using the adapter approach, the number of parameters is substantially lower. Comparing 49,152 parameters to 1,048,576 parameters shows a dramatic reduction, making the adapter much more efficient in terms of parameter usage.

Why is this Beneficial?: This efficiency is particularly beneficial when fine-tuning large pretrained models. Instead of retraining or adapting the entire network (which would be computationally expensive and memory-intensive), adapters allow for targeted adjustments with far fewer additional parameters. This makes the process more manageable and practical, especially when resources are limited.
The adapter's bottleneck structure allows it to achieve similar functionality (adapting the model to new tasks or data) as a full-sized layer would, but with a significantly reduced number of parameters. This efficiency makes adapters a popular choice for fine-tuning large pretrained models in a resource-effective manner.
What is an Adapter Module?
 Let’s look at the application of the adapter module in the transformer architecture in three points:
The adapter module (right) first projects the original \(d\)-dimensional features into a smaller \(m\)-dimensional vector, applies a non-linearity, and then projects it back to \(d\) dimensions.
As can be seen, the module features a skip-connection. With it in place, when the parameters of the projection layers are initialized to near-zero, the module starts out as a near-identity function. This is required for stable fine-tuning and is intuitive: it ensures we essentially do not disturb the learning from pretraining at the start of adaptation.
In a transformer block (left), the adapter is applied directly to the outputs of each of the sub-layers (attention and feed-forward); a minimal code sketch follows below.
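The following is a minimal sketch of such an adapter module, assuming the \(d = 1024\), \(m = 24\) sizes from the example above; the near-zero initialization makes the module start as a near-identity function thanks to the skip-connection. Names and the exact init scale are illustrative.

```python
# Minimal sketch of an adapter module (bottleneck + skip-connection).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d=1024, m=24):
        super().__init__()
        self.down = nn.Linear(d, m)  # d -> m bottleneck projection
        self.up = nn.Linear(m, d)    # m -> d up-projection
        # Near-zero init => the module starts as a near-identity function.
        nn.init.normal_(self.down.weight, std=1e-3)
        nn.init.zeros_(self.down.bias)
        nn.init.normal_(self.up.weight, std=1e-3)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # Skip-connection: output is approximately x at initialization.
        return x + self.up(torch.relu(self.down(x)))
```

With these sizes, the adapter introduces roughly 1024x24 + 24x1024 = 49,152 weights (plus biases), matching the parameter-efficiency analysis above.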
How do you decide the value of \(m\)?
The size \(m\) in the adapter module determines the number of optimizable parameters and hence poses a parameter-vs-performance trade-off.
The original paper finds experimentally that performance remains fairly stable across varying adapter sizes \(m\); hence, for a given model, a fixed size can be used for all downstream tasks.
LLaMA-Adapters

This paper introduces an efficient fine-tuning method called LLaMA-Adapter. This method is designed to adapt the LLaMA model into an instruction-following model with high efficiency in terms of resource usage and time. Key aspects of this paper include:

Parameter Efficiency: LLaMA-Adapter introduces only 1.2 million learnable parameters on top of the frozen LLaMA 7B model, which is significantly fewer than the full 7 billion parameters of the model. This approach leads to a more efficient fine-tuning process both in terms of computational resources and time, taking less than one hour on 8 A100 GPUs.

Learnable Adaption Prompts: The method involves appending a set of learnable adaption prompts to the input instruction tokens in the higher transformer layers of LLaMA. These prompts are designed to adaptively inject new instructions into the frozen LLaMA while preserving its pretrained knowledge, effectively guiding the subsequent contextual response generation.

Zero-initialized Attention Mechanism: To avoid disturbances from randomly initialized adaption prompts, which can harm fine-tuning stability and effectiveness, the paper proposes a zero-initialized attention mechanism with a learnable gating factor. This mechanism allows for a stable learning process and progressive incorporation of instructional signals during training. It ensures that the newly acquired instructional signals are effectively integrated into the transformer while retaining the pretrained knowledge of LLaMA.

Generalization and Multimodal Reasoning: LLaMA-Adapter is not only effective for language tasks but can also be extended to multimodal instructions, allowing for image-conditioned LLaMA models. This capability enables superior reasoning performance on benchmarks like ScienceQA and COCO Caption. Additionally, the approach has demonstrated strong generalization capacity in traditional vision and language tasks.

In summary, LLaMA-Adapter represents a significant advancement in the field of parameter-efficient fine-tuning of large language models. Its innovative use of learnable adaption prompts and a zero-initialized attention mechanism provides a highly efficient method for adapting pretrained models to new tasks and domains, including multimodal applications.

The image below (source) illustrates this concept.
Reparameterization
Low-Rank Adaptation (LoRA)
Background
Rank of a Matrix
 The rank of a matrix is a measure of the number of independent rows or columns in the matrix. In simple terms, it tells us the maximum number of linearly independent rows or columns (i.e., vectors) the matrix has.
 If a matrix has rank 1, it means all rows or all columns can be represented as multiples of each other, so there’s essentially only one unique “direction” in the data.

A full-rank matrix has rank equal to the smaller of its two dimensions (number of rows or columns), meaning its rows or columns (whichever is fewer) are all independent.

Example:
Consider the following 3x3 matrix \(A\):

\[A = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 5 & 9 \\ 2 & 4 & 6 \end{bmatrix}\]

Step-by-Step to Determine the Rank:

Row Reduction: To find the rank, we can use Gaussian elimination to transform the matrix into its row echelon form, making it easier to see the linearly independent rows. After row-reducing \(A\) (subtracting the first row from the second, and twice the first row from the third), we get:
\[\begin{bmatrix} 1 & 2 & 3 \\ 0 & 3 & 6 \\ 0 & 0 & 0 \end{bmatrix}\]
Count Independent Rows: Now we look at the rows with nonzero entries:
 The first row \([1, 2, 3]\) is nonzero.
 The second row \([0, 3, 6]\) is also nonzero and independent of the first row.
 The third row is all zeros, which does not contribute to the rank.
Since there are two nonzero, independent rows in the row echelon form, the rank of \(A\) is 2.


Explanation:
The rank of 2 indicates that only two rows or columns in \(A\) contain unique information, and the third row (or column) can be derived from a combination of the other two. Essentially, this matrix can be thought of as existing in a 2-dimensional space rather than the full 3-dimensional space, despite its 3x3 size.
 In summary:
 The rank of matrix \(A\) is 2.
 This rank tells us the matrix’s actual dimensionality in terms of its independent information.
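We can sanity-check this with NumPy, which computes the matrix rank numerically (illustrative snippet, using the example matrix above):

```python
# Quick numerical check of the rank example above.
import numpy as np

A = np.array([[1, 2, 3],
              [1, 5, 9],
              [2, 4, 6]])
print(np.linalg.matrix_rank(A))  # -> 2 (third row is 2x the first)
```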
Related: Rank of a Tensor
While LoRA injects trainable low-rank matrices, it is important to understand rank in the context of tensors as well.

The rank of a tensor refers to the number of dimensions, also called the order of the tensor. This is different from matrix rank in linear algebra, which relates to the number of linearly independent rows or columns. For tensors, rank simply tells us how many dimensions or axes the tensor has.

Explanation with Examples:
Scalar (Rank 0 Tensor):
A scalar is a single number with no dimensions.
Example: `5` or `3.14`
Shape: `()` (no dimensions)
Rank: 0
Vector (Rank 1 Tensor):
A vector is a one-dimensional array of numbers.
Example: `[3, 7, 2]`
Shape: `(3,)` (one dimension with 3 elements)
Rank: 1
Matrix (Rank 2 Tensor):
A matrix is a two-dimensional array of numbers, like a table.
Example: \(\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}\)
Shape: `(2, 3)` (two dimensions: 2 rows, 3 columns)
Rank: 2
3D Tensor (Rank 3 Tensor):
A 3D tensor can be thought of as a "stack" of matrices, adding a third dimension.
Example: \(\begin{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \begin{bmatrix} 7 & 8 & 9 \\ 10 & 11 & 12 \end{bmatrix} \end{bmatrix}\)
Shape: `(2, 2, 3)` (three dimensions: 2 matrices, each with 2 rows and 3 columns)
Rank: 3
4D Tensor (Rank 4 Tensor):
A 4D tensor might represent multiple "stacks" of 3D tensors.
Example: In deep learning, a 4D tensor is commonly used to represent batches of color images, with dimensions `[batch size, channels, height, width]`.
Shape: `(10, 3, 64, 64)` for a batch of 10 images, each with 3 color channels and a resolution of 64x64.
Rank: 4
 General Rule:
 Rank = Number of dimensions (or axes) of the tensor.
 Why Rank Matters:
 The rank of a tensor tells us about its structural complexity and the data it can represent. Higherrank tensors can represent more complex data structures, which is essential in fields like deep learning, physics simulations, and data science for handling multidimensional data.
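A short illustrative check of this rank-as-number-of-dimensions convention, using the image-batch example above:

```python
# A tensor's rank (order) is simply its number of dimensions.
import torch

images = torch.randn(10, 3, 64, 64)  # batch of 10 RGB 64x64 images
print(images.ndim)   # -> 4 (a rank-4 tensor)
print(images.shape)  # -> torch.Size([10, 3, 64, 64])
```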
Overview
 Essence:
Low-Rank Adaptation (LoRA) simplifies the fine-tuning of large models by decomposing complex, high-dimensional weight matrices into lower-dimensional forms. This technique, akin to methods like PCA and SVD, allows for the retention of critical information while significantly reducing the size and complexity of the weights, thus enhancing fine-tuning efficiency in resource-constrained settings.
 Application:
LoRA identifies key dimensions in the original weight matrix of neural networks, optimizing these reduced dimensions to maintain the model's learning capabilities with less computational cost. It adds trainable low-rank matrices to the model's architecture, specifically to the Transformer layers, and optimizes these matrices instead of the entire model, leading to fewer trainable parameters and reduced memory requirements.
 Benefits:
This approach offers considerable time and memory efficiency, as a large portion of the model's parameters is kept frozen, reducing both training time and GPU memory requirements. It also avoids additional inference latency and facilitates easy task-switching during deployment, requiring changes only in a small subset of weights.
 In Summary:
LoRA represents a smart balance in model fine-tuning, preserving the core strengths of large pretrained models while adapting them efficiently for specific tasks or datasets. It's a technique that redefines efficiency in the world of massive language models.
A matrix is said to be rank-deficient if it does not have full rank. The rank deficiency of a matrix is the difference between the lesser of its number of rows and columns, and its rank. For more, refer to Wikipedia: Rank.
Before we continue, let's recap by taking a quick look at traditional fine-tuning vs. LoRA with the images (source) below:
LoRA efficiently fine-tunes large-scale neural networks by introducing trainable low-rank matrices, simplifying the model's complexity while retaining its robust learning capabilities.
Advantages
Parameter Efficiency
Compared to full fine-tuning of GPT-3 175B with Adam, LoRA can reduce the number of trainable parameters by 10,000 times. Specifically, this means that LoRA only fine-tunes approximately 0.01% of the parameters of the original model.
The table below from the LoRA paper indicates that for GPT-3 with LoRA, we only fine-tune \(\frac{4.7\text{M}}{175{,}255\text{M}} \times 100 \approx 0.003\%\) and \(\frac{38\text{M}}{175{,}255\text{M}} \times 100 \approx 0.02\%\) of the parameters.
GPU Memory (and Storage) Savings
Compared to full fine-tuning of GPT-3 175B with Adam, LoRA can reduce the GPU memory requirement by 3 times. Specifically, this means that LoRA fine-tunes the original model with 33% of the memory.
For a large Transformer trained with Adam, LoRA reduces VRAM usage by up to two-thirds by avoiding the need to store optimizer states for the frozen parameters. On GPT-3 175B, VRAM consumption during training drops from 1.2TB to 350GB. When adapting only the query and value projection matrices with a rank \(r = 4\), the checkpoint size decreases significantly from approximately 350GB to 35MB. This efficiency allows training with significantly fewer GPUs and avoids I/O bottlenecks.
Efficient Task Switching
Task switching is more cost-effective as only the LoRA weights need swapping, enabling the creation of numerous customized models that can be dynamically swapped on machines storing the pretrained weights in VRAM.
Faster Training Speed
Training speed also improves by 25% compared to full fine-tuning, as gradient calculation for the vast majority of the parameters is unnecessary.
No additional inference latency
LoRA ensures no additional inference latency when deployed in production by allowing explicit computation and storage of the combined weight matrix \(W = W_0 + BA\). During inference, this approach uses the precomputed matrix \(W\), which includes the original pretrained weights \(W_0\) and the low-rank adaptation matrices \(B\) and \(A\). This method eliminates the need for dynamic computations during inference.
When switching to another downstream task, the pretrained weights \(W_0\) can be quickly restored by subtracting the current low-rank product \(BA\) and adding the new task-specific low-rank product \(B' A'\). This operation incurs minimal memory overhead and allows for efficient task switching without impacting inference speed. By merging the low-rank matrices with the pretrained weights in advance, LoRA avoids the extra computational burden during real-time inference (unlike adapters), ensuring latency remains on par with that of fully fine-tuned models. A code sketch of this merge-and-swap procedure follows below.
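Here is a minimal sketch of the merge-and-swap arithmetic just described, using random tensors in place of actually-trained adapters; the dimensions and values are illustrative only.

```python
# Merging LoRA weights for zero-latency inference and cheap task switching.
import torch

d, k, r = 1024, 1024, 4
W0 = torch.randn(d, k)        # frozen pretrained weight matrix
B = torch.randn(d, r) * 0.01  # task 1 low-rank factors (as if trained)
A = torch.randn(r, k) * 0.01

# Merge once, before deployment: inference then uses W directly,
# with no extra latency from a separate low-rank path.
W = W0 + B @ A

# Task switch: subtract the old product, add the new task's product.
B2 = torch.randn(d, r) * 0.01
A2 = torch.randn(r, k) * 0.01
W = W - B @ A + B2 @ A2

# The pristine pretrained weights can be recovered at any time.
W = W - B2 @ A2
assert torch.allclose(W, W0, atol=1e-4)
```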
Limitations
While LoRA offers significant advantages in terms of parameter efficiency and memory savings, it also has some limitations. One notable limitation is the complexity involved in batching inputs for different tasks when using distinct low-rank matrices \(A\) and \(B\). If the goal is to absorb \(A\) and \(B\) into the combined weight matrix \(W\) to avoid additional inference latency, it becomes challenging to batch inputs from different tasks in a single forward pass. This is because each task would require a different set of \(A\) and \(B\) matrices, complicating the batching process.
Additionally, although it is possible to avoid merging the weights and dynamically select the appropriate LoRA modules for each sample in a batch, this approach is more feasible in scenarios where latency is not a critical concern. This workaround does not fully address the need for seamless integration when low-latency inference is required across multiple tasks.
In summary, while LoRA provides a highly efficient adaptation method, the complexity of handling multiple tasks simultaneously and the need for careful management of low-rank matrices during batching are important considerations for its practical deployment.
Hyperparameters
LoRA-specific hyperparameters include rank (\(r\)) and alpha (\(\alpha\)). Others, while still used for LoRA-based fine-tuning, such as learning rate (lr), dropout probability (\(p\)), and batch size (bs), are more generic to deep learning-based model training/fine-tuning. Here's a detailed explanation of each:
Rank (\(r\))

Description: In LoRA, instead of fine-tuning the full weight matrix, a low-rank approximation is used, where a weight matrix \(W_0\) is decomposed into two smaller matrices, \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times k}\), where \(r\) is much smaller than \(d\) or \(k\). The rank (\(r\)) of matrices \(A\) and \(B\) – one of the core hyperparameters in LoRA – represents the rank of the low-rank decomposition applied to the weight matrices. The new weight is then modeled as:

\[W = W_0 + \Delta W = W_0 + A \cdot B\]

Role: The rank controls the dimensionality of the low-rank matrices and hence the number of additional parameters introduced during fine-tuning.

Interpretation: Lower values of \(r\) will impose stronger restrictions on how much the weight matrices can adapt, potentially limiting the model’s flexibility but greatly reducing the computational and memory footprint. Higher values of \(r\) allow for more expressive updates but increase the number of parameters and computation required.

Equation: In matrix form, for any original weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), the adapted weight update is expressed as:

\[\Delta W = A \cdot B\]

where \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times k}\), and \(r \ll d, k\).
 Typical Values: 1–16, depending on the size of the model and the complexity of the task. In most tasks, a small rank (e.g., 4 or 8) provides a good tradeoff between performance and efficiency.
Scaling Factor (\(\alpha\))

Description: \(\alpha\) is a scaling factor applied to the LoRA updates. Specifically, it scales the low-rank updates \(A \cdot B\) before adding them to the base weight matrix \(W_0\). The weight update rule becomes:

\[W = W_0 + \frac{\alpha}{r} \cdot (A \cdot B)\]

Role: The purpose of \(\alpha\) is to control the magnitude of the low-rank updates to prevent the model from diverging too far from the pretrained weights. By dividing \(\alpha\) by the rank \(r\), LoRA ensures that the update magnitude is normalized according to the size of the low-rank decomposition. This is crucial because a larger rank would introduce more freedom for updates, and the division by \(r\) keeps the updates in check.

Interpretation: A higher \(\alpha\) means that the low-rank updates will have a larger impact on the final weight, while a smaller \(\alpha\) means the low-rank updates will contribute less to the adapted model. The division by \(r\) helps keep the effect of the low-rank update consistent across different choices of rank.

Equation: The weight update is now written as:
\[\Delta W = \frac{\alpha}{r} \cdot (A \cdot B)\] 
Typical Values: Common values for \(\alpha\) are in the range of 1–32. A common heuristic is to set \(\alpha\) to \(r\) or \(2r\), so that the effective scaling \(\frac{\alpha}{r}\) stays at 1 or 2.
Dropout Probability (\(p\))

Description: Dropout is a regularization technique used to prevent overfitting, and it is applied in the LoRA framework as well. The dropout probability (\(p\)) refers to the probability with which a particular element in the low-rank matrices \(A\) and \(B\) is randomly set to zero during training. Dropout is typically used to reduce overfitting by introducing noise during training.

Role: The role of dropout in LoRA is to regularize the low-rank weight updates and ensure they do not overfit to the fine-tuning data. By randomly zeroing out parts of the matrices, the model learns more robust and generalizable updates.

Interpretation: Higher values of dropout probability \(p\) imply more aggressive regularization, which can reduce overfitting but also may slow down learning. Lower values of \(p\) imply less regularization and could potentially lead to overfitting on small datasets.

Equation: The dropout operation is typically represented as:
\[A_{dropped} = A \odot \text{Bernoulli}(1-p)\]

where \(\odot\) denotes element-wise multiplication, and \(\text{Bernoulli}(1-p)\) is a binary mask where each element is independently drawn from a Bernoulli distribution with probability \(1 - p\).

Typical Values: Dropout probabilities \(p\) are typically set between 0.0 (no dropout) and 0.3 for LoRA tasks.

Learning Rate (lr)

Description: The learning rate is a fundamental hyperparameter in any optimization process, and it determines the step size at which the model's parameters are updated during training. In the context of LoRA, it controls the updates of the low-rank matrices \(A\) and \(B\) rather than the full model weights.

Role: The learning rate governs how fast or slow the low-rank matrices adapt to the new task. A high learning rate can lead to faster convergence but risks overshooting the optimal solution, while a small learning rate can provide more stable convergence but might take longer to adapt to the new task.

Interpretation: A higher learning rate might be used in the early stages of fine-tuning to quickly adapt to the new task, followed by a lower rate to refine the final performance. However, too high a learning rate may destabilize training, especially when \(\alpha\) is large.

Equation: The update to the lowrank parameters follows the standard gradient descent update rule:
\[\theta_{t+1} = \theta_t - lr \cdot \nabla_{\theta} L\]

where \(L\) is the loss function, \(\nabla_{\theta} L\) is the gradient of the loss with respect to the low-rank parameters \(\theta\), and \(lr\) is the learning rate.

Typical Values: Learning rates for LoRA typically range from \(10^{5}\) to \(10^{3}\), depending on the model, the task, and the scale of adaptation needed.

Batch Size (bs)

Description: The batch size is the number of examples that are passed through the model at one time before updating the weights. It is a crucial hyperparameter for stabilizing the training process.

Role: In LoRA, the batch size affects how stable and efficient the low-rank adaptation process is. A larger batch size can stabilize the gradient estimates and speed up convergence, while smaller batches introduce more noise into the gradient, which may require a smaller learning rate to maintain stability.

Interpretation: Smaller batch sizes allow for faster updates but with noisier gradients, whereas larger batch sizes reduce noise but require more memory. Finding the right balance is important for both computational efficiency and effective adaptation.

Equation: The loss for a given batch of size \(bs\) is averaged over the batch:
\[L_{\text{batch}} = \frac{1}{bs} \sum_{i=1}^{bs} L_i\]

where \(L_i\) is the loss for the \(i\)-th example in the batch.

Typical Values: Batch sizes can vary widely depending on the available hardware resources. Typical values range from 8 to 64.

Summary
The main hyperparameters involved in LoRA – rank (\(r\)), alpha (\(\alpha\)), dropout probability (\(p\)), learning rate (lr), and batch size (bs) – are crucial for controlling the behavior and effectiveness of LoRA. By adjusting these parameters, LoRA can offer an efficient way to fine-tune large pretrained models with significantly reduced computational costs and memory usage while maintaining competitive performance. Each of these hyperparameters impacts the trade-off between model flexibility, computational efficiency, and training stability.
 These hyperparameters are interconnected, especially scaling factor and rank; changes in one can require adjustments in others; more on this in the section on Is There a Relationship Between Setting Scaling Factor and Rank in LoRA?. Effective tuning of these parameters is critical for leveraging LoRA’s capabilities to adapt large models without extensive retraining.
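To tie these hyperparameters together, here is a minimal sketch of a LoRA linear layer following this section's notation (\(W = W_0 + \frac{\alpha}{r} \cdot A \cdot B\), with \(A\) Gaussian-initialized and \(B\) zero-initialized as discussed later). It is a simplified illustration, not a reference implementation; the init scale and default values are assumptions.

```python
# Minimal sketch of a LoRA linear layer with rank r, scaling alpha, dropout p.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, k, d, r=8, alpha=16.0, p=0.05):
        super().__init__()
        # Frozen pretrained weight W_0 in R^{d x k} (random here for illustration).
        self.weight = nn.Parameter(torch.randn(d, k), requires_grad=False)
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(r, k))         # zero init => Delta W = 0 at start
        self.scale = alpha / r                           # normalizes the update magnitude
        self.dropout = nn.Dropout(p)

    def forward(self, x):  # x: (batch, k)
        base = x @ self.weight.T
        # Low-rank path: x -> B -> A, scaled by alpha / r.
        update = self.dropout(x) @ self.B.T @ self.A.T
        return base + self.scale * update
```

Because `B` starts at zero, the layer's output initially equals the frozen layer's output, and only the two small matrices receive gradients during fine-tuning.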
How does having a low-rank matrix in LoRA help the fine-tuning process?
In LoRA, a low-rank matrix is a matrix with a rank significantly smaller than its full dimensionality, which enables efficient and focused adjustments to model parameters. This lightweight adaptation mechanism allows large language models to learn new tasks without overfitting by capturing only the most essential adjustments, thus optimizing both information representation and parameter efficiency.
What is a Low-rank Matrix?
A matrix is considered low-rank when its rank (the number of independent rows or columns) is much smaller than its dimensions. For example, a 1000x1000 matrix with rank 10 is low-rank because only 10 of its rows or columns contain unique information, and the others can be derived from these. This smaller rank indicates that the matrix contains a limited variety of independent patterns or directions, meaning it has a reduced capacity to capture complex relationships.
Low-Rank in LoRA Context
In LoRA, low-rank matrices are introduced to fine-tune large language models with fewer trainable parameters. Here's how it works:
Adding Low-Rank Matrices: LoRA adds small, low-rank matrices to the model's layers (typically linear or attention layers). These matrices serve as "adaptation" layers that adjust the original layer's output.
Freezing the Original Weights: The original model weights remain frozen during fine-tuning. Only the low-rank matrices are trained, which reduces the number of parameters to update.
By limiting the rank of these new matrices, LoRA effectively limits the number of patterns they can represent. For instance, a rank-5 matrix in a high-dimensional space can only capture 5 independent directions, which forces the model to learn only essential, low-dimensional adjustments without becoming too complex.
Example
Suppose we have a pretrained model layer represented by a 512x512 matrix (common in large language models). Instead of fine-tuning this large matrix directly, LoRA adds two low-rank matrices, \(A\) and \(B\), with dimensions 512x10 and 10x512, respectively. Here:
The product \(A \times B\) has a rank of 10, much smaller than 512.
This product effectively adds a low-rank adaptation to the original layer, allowing it to adjust its output in just a few key directions (10 in this case), rather than making unrestricted adjustments (see the parameter-count check below).
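The arithmetic behind this example is easy to verify (illustrative snippet; bias terms are ignored):

```python
# Trainable parameters: full 512x512 update vs. rank-10 LoRA factors.
full = 512 * 512            # 262,144 parameters for a dense update
lora = 512 * 10 + 10 * 512  # 10,240 parameters for A (512x10) + B (10x512)
print(full, lora, full / lora)  # ~25.6x fewer trainable parameters
```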
Why rank matters
The rank of the LoRA matrices directly affects the model's ability to learn task-specific patterns:
Lower Rank: Imposes a strong constraint on the model, which helps it generalize better and reduces the risk of overfitting.
Higher Rank: Provides more flexibility but also increases the risk of overfitting, as the model can learn more complex adjustments that may fit the fine-tuning data too closely.
How does the low-rank constraint introduced by LoRA inherently act as a form of regularization, especially for the lower layers of the model?
In LoRA, the low-rank constraint serves as a built-in regularization mechanism by limiting the model's flexibility during fine-tuning. This constraint especially impacts lower layers, which are designed to capture general, foundational features. By further restricting these layers, LoRA minimizes their adaptation to task-specific data, reducing the risk of overfitting. This regularization preserves the model's foundational knowledge in the lower layers, while allowing the higher layers – where task-specific adjustments are most beneficial – to adapt more freely.
Low-Rank Constraint as Regularization

Low-Rank Matrices Limit Complexity: By adding only low-rank matrices to the model's layers, LoRA restricts the model's capacity to represent highly complex, task-specific patterns. A low-rank matrix has fewer independent "directions" or dimensions in which it can vary. This means that the model, even when fine-tuned, can only make adjustments within a constrained range, learning broad, generalizable patterns rather than memorizing specific details of the training data. This limited capacity serves as a form of regularization, preventing the model from overfitting.

Reduced Sensitivity to Noisy Patterns: Low-rank matrices inherently ignore minor or highly detailed variations in the training data, focusing only on dominant, overarching patterns. This makes LoRA less sensitive to the idiosyncrasies of the fine-tuning dataset, enhancing the model's robustness and generalization ability.
Effect on Lower Layers
The lower layers of a neural network, especially in a transformer model, are primarily responsible for extracting general-purpose features from the input data. In language models, for example:
Lower layers capture basic syntactic relationships, such as sentence structure and word dependencies.
These layers learn representations that are widely applicable across tasks and domains.
Because these lower layers are already optimized to represent broad, generalizable patterns from pretraining, they are naturally less flexible and more constrained in what they capture compared to higher layers, which focus on more task-specific details. Adding a low-rank constraint in LoRA further reinforces this effect:

Enhanced Regularization on Lower Layers: Since lower layers are already constrained to capture only general patterns, the addition of a low-rank constraint essentially adds a second layer of regularization. This means that these layers become even less likely to adapt in ways that would compromise their general-purpose functionality. The low-rank constraint reinforces their role as foundational feature extractors, preserving their generalization capability while preventing overfitting on the specific details of the fine-tuning data.

Minimal Disruption of Pre-Trained Knowledge: The low-rank adaptation in LoRA ensures that lower layers maintain the knowledge they acquired during pretraining. Because these layers are regularized by the low-rank constraint, they are less likely to overfit to new data patterns introduced during fine-tuning. This preservation of pretrained knowledge is crucial for maintaining the model's transferability to other tasks or domains, as lower layers retain their broad, foundational representations.
Why This Matters for Generalization
When fine-tuning with LoRA:
Higher Layers Adapt More Easily: Higher layers, being closer to the output, are more adaptable and can more readily accommodate task-specific changes introduced during fine-tuning.
Lower Layers Remain Generalized: Lower layers, reinforced by the low-rank constraint, retain their focus on general patterns. This balanced approach helps the model generalize well to unseen data because the lower layers still provide robust, general-purpose representations while the higher layers adapt to the new task.
How does LoRA help avoid catastrophic forgetting?

LoRA helps prevent catastrophic forgetting by fine-tuning large pretrained models in a way that preserves their foundational knowledge while allowing for task-specific adaptations. Catastrophic forgetting occurs when fine-tuning neural networks, particularly large pretrained models, causes them to overwrite or disrupt previously learned information, reducing performance on earlier tasks. LoRA mitigates this risk through a few key strategies:
Freezing Original Weights: The core model parameters remain untouched, preserving the base knowledge and preventing interference.
Introducing Low-Rank Matrices: These matrices have limited capacity, focusing solely on task-specific adjustments, which allows the model to adapt to new tasks without losing general knowledge.
Targeting Specific Layers: LoRA typically modifies higher attention layers, avoiding disruption to fundamental representations in lower layers.
Parameter-Efficient, Modular Adaptation: LoRA's modular design allows for reversible, task-specific adjustments, making it suitable for flexible multi-task and continual learning.

Through this approach, LoRA enables large models to adapt efficiently to new tasks while retaining previously learned information, which is especially valuable for applications requiring retention of prior knowledge.
Freezing the Original Weights
One of the core aspects of LoRA is that it freezes the original model weights and adds new, low-rank matrices that handle the fine-tuning process:
The frozen original weights retain the model's general knowledge from pretraining. This means that core information, patterns, and representations acquired from extensive pretraining on large datasets remain unaffected.
Since only the low-rank matrices are adjusted for the new task, there is no direct modification of the original weights. This minimizes the risk of overwriting or disrupting the knowledge captured in those weights.
By keeping the original parameters intact, LoRA avoids catastrophic forgetting in a way that typical fine-tuning (where the original weights are updated) does not.
Low-Rank Adaptation Layers for Task-Specific Adjustments
LoRA introduces low-rank matrices as additional layers to the model, which have the following properties:
Limited Capacity: Low-rank matrices have a constrained capacity to represent new information, which forces them to focus only on essential, task-specific adaptations. This means they cannot significantly alter the underlying model's behavior, preserving the broader general knowledge.
Focused Adaptation: By adding task-specific information via low-rank matrices rather than altering the model's entire parameter space, LoRA ensures that the new task-specific changes are confined to these auxiliary matrices. This helps the model adapt to new tasks without losing its prior knowledge.
Layer-Specific Impact
LoRA often targets specific layers in the model, commonly the attention layers:
Higher Attention Layers: These layers (closer to the output) are responsible for more task-specific representations and are typically the ones modified by LoRA. This selective adaptation means that the deeper, more task-general features in lower layers are left intact, reducing the risk of catastrophic forgetting.
Minimal Lower-Layer Impact: Since lower layers (closer to the input) remain unchanged or minimally affected, the model retains the general-purpose, foundational features learned during pretraining, which are crucial for generalization.
This selective impact allows LoRA to introduce new, task-specific representations while preserving fundamental information, balancing new task learning with knowledge retention.
Parameter-Efficient Fine-Tuning
LoRA is designed for parameter-efficient fine-tuning, meaning it uses a fraction of the parameters that traditional fine-tuning would require:
LoRA adds only a small number of new parameters through the low-rank matrices. This efficiency keeps the model changes lightweight, making it less likely to interfere with the original model's representations.
The low-rank constraint also regularizes the fine-tuning process, helping to prevent overfitting to the new task, which can indirectly support retention of general knowledge. Overfitting can cause catastrophic forgetting if the model becomes too specialized, as it loses flexibility in dealing with tasks beyond the fine-tuning data.
Easy Reversibility
Since LoRA's approach is to add new matrices rather than alter the original model's weights, it is easy to revert the model to its original state or apply it to different tasks:
The low-rank matrices can be removed or swapped out without affecting the base model. This modularity allows for rapid switching between tasks or models, making it easy to adapt the model to different tasks while maintaining the pretrained knowledge.
This adaptability is particularly useful for multi-task learning or continual learning, as it allows LoRA-enhanced models to apply distinct low-rank adaptations for different tasks without compromising the model's underlying pretrained knowledge.
Modular and Reusable Adapters
With LoRA, fine-tuning for different tasks can be achieved by creating different low-rank matrices for each new task:
These modular, reusable matrices enable task-specific tuning without overwriting previous adaptations or the original model. This is especially valuable for applications where the model needs to perform multiple tasks or domains interchangeably.
By associating each task with its own set of low-rank matrices, LoRA enables the model to maintain knowledge across tasks without interference, effectively circumventing catastrophic forgetting.
How does the multiplication of two low-rank matrices in LoRA lead to lower attention layers being impacted less than higher attention layers?
In LoRA, the use of low-rank matrices enables efficient, controlled updates by selectively applying them to specific layers – primarily the higher attention layers rather than the lower ones. This targeted approach allows the model to adjust effectively to task-specific nuances in these higher layers, which capture more complex patterns and contextual information, while preserving the general features encoded in the lower layers. By focusing fine-tuning efforts on the higher layers, LoRA minimizes overfitting and retains foundational knowledge from pretraining, making it an efficient and effective fine-tuning strategy.
Role of Low-Rank Matrices in LoRA
LoRA adds two low-rank matrices, \(A\) and \(B\), to certain layers, typically in the form:
\(W_{\text{new}} = W + A \times B\)
where:
\(W\) is the original (frozen) weight matrix in the model layer.
\(A\) and \(B\) are low-rank matrices (with ranks much smaller than the original dimensionality of \(W\)), creating a low-rank adaptation.
The product \(A \times B\) has a limited rank and thus introduces only a restricted adjustment to \(W\). This adjustment constrains the layer to learn only a few independent patterns rather than a full set of complex, task-specific transformations.
Higher Attention Layers: Task-Specific Focus
In large models, higher attention layers (closer to the output) tend to capture task-specific, abstract features, while lower attention layers (closer to the input) capture general, reusable patterns. By applying LoRA-based fine-tuning primarily to higher attention layers:
The model's low-rank adaptation focuses on high-level, task-specific adjustments rather than modifying general representations.
Higher layers, which already deal with more specific information, are more sensitive to the small adjustments made by \(A \times B\) since they directly influence task-related outputs.
In practice, LoRA-based fine-tuning modifies these higher layers more significantly because these layers are more directly responsible for adapting the model to new tasks. Lower layers, in contrast, require less task-specific adjustment and retain their general-purpose features.
Limited Capacity of Low-Rank Matrices and Layer Impact
The low-rank matrices \(A\) and \(B\) have limited expressive power (due to their low rank), meaning they can only introduce a small number of directional adjustments in the weight space. This limited capacity aligns well with higher layers because:
Higher layers don't need drastic changes but rather subtle adjustments to fine-tune the model to specific tasks.
The constraint imposed by low-rank matrices helps avoid overfitting by restricting the number of learned patterns, which is ideal for the high-level, abstract representations in higher layers.
For lower layers, which capture broad, general-purpose features, such limited adjustments don't significantly impact the model. Lower layers still operate with the general features learned during pretraining, while higher layers adapt to task-specific details.
Why Lower Layers are Less Affected
Lower layers in the attention stack are less impacted by LoRA's low-rank updates because:
They are often not fine-tuned at all in LoRA-based setups, preserving the general features learned during pretraining.
Even when fine-tuned with low-rank matrices, the limited capacity of \(A \times B\) is not sufficient to drastically alter their broader, foundational representations.
In LoRA, why is \(A\) initialized using a Gaussian and \(B\) set to 0?
 In LoRA, the initialization strategy where matrix \(A\) is initialized with a Gaussian distribution and matrix \(B\) is set to zero is crucial for ensuring a smooth integration of the adaptation with minimal initial disruption to the pretrained model. This approach is designed with specific goals in mind:
Preserving Initial Model Behavior
 Rationale: By setting \(B\) to zero, the product \(\Delta W = BA\) initially equals zero. This means that the adapted weights do not alter the original pretrained weights at the beginning of the training process.
Impact: This preserves the behavior of the original model at the start of fine-tuning, allowing the model to maintain its pretrained performance and stability. The model begins adaptation from a known good state, reducing the risk of drastic initial performance drops.
Gradual Learning and Adaptation
 Rationale: Starting with \(\Delta W = 0\) allows the model to gradually adapt through the updates to \(B\) during training. This gradual adjustment is less likely to destabilize the model than a sudden, large change would.
 Impact: As \(B\) starts updating from zero, any changes in the model’s behavior are introduced slowly. This controlled adaptation is beneficial for training dynamics, as it allows the model to incrementally learn how to incorporate new information effectively without losing valuable prior knowledge.
Ensuring Controlled Updates
 Rationale: Gaussian initialization of \(A\) provides a set of initial values that, while random, are statistically regularized by the properties of the Gaussian distribution (such as having a mean of zero and a defined variance). This regularity helps in providing a balanced and predictable set of initial conditions for the adaptation process.
 Impact: The Gaussian distribution helps ensure that the values in \(A\) are neither too large nor too biased in any direction, which could lead to disproportionate influence on the updates when \(B\) begins to change. This helps in maintaining a stable and effective learning process.
Focused Adaptation
Rationale: The low-rank matrices \(A\) and \(B\) are intended to capture the most essential aspects of the new data or tasks relative to the model's existing capabilities. By starting with \(B = 0\) and \(A\) initialized randomly, the learning focuses on identifying and optimizing only those aspects that truly need adaptation, as opposed to relearning aspects that the model already performs well.

Impact: This focus helps optimize training efficiency by directing computational resources and learning efforts towards making meaningful updates that enhance the model’s capabilities in specific new areas.
 This initialization strategy supports the overall goal of LoRA: to adapt large, pretrained models efficiently with minimal resource expenditure and without compromising the foundational strengths of the original model. This approach ensures that any new learning builds on and complements the existing pretrained model structure.
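A quick standalone check of this initialization property, following this subsection's convention \(\Delta W = BA\) with \(A\) Gaussian-initialized and \(B\) zero-initialized (dimensions are illustrative):

```python
# With Gaussian A and zero B, Delta W = B A is exactly zero at the start
# of fine-tuning, so the pretrained weights are initially untouched.
import torch

d, k, r = 64, 64, 4
A = torch.randn(r, k) * 0.01  # Gaussian init
B = torch.zeros(d, r)         # zero init
delta_w = B @ A
print(delta_w.abs().max())    # -> tensor(0.)
```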
For a given task, how do we determine whether to fine-tune the attention layers or feed-forward layers?
Deciding whether to fine-tune the attention layers or the feed-forward (MLP) layers in a model adapted using LoRA involves several considerations. These include the nature of the task, the model architecture, and the distribution of parameters between attention and feed-forward layers.
Note that the LoRA paper originally only adapted the attention weights for downstream tasks and froze the MLP modules (so they are not trained in downstream tasks), both for simplicity and parameter efficiency. Thus, the number of attention weights relative to feed-forward weights can impact the choice of which modules to adapt.
 Here are some key factors to guide this decision:
Nature of the Task
Task Requirements: Attention mechanisms are particularly effective for tasks that benefit from modeling relationships between different parts of the input, such as sequence-to-sequence tasks or tasks requiring contextual understanding. If the task demands strong relational reasoning or context sensitivity, fine-tuning attention layers might be more beneficial.
Feed-Forward Layer Role: MLPs generally focus on transforming the representation at individual positions without considering other positions. They are effective for tasks requiring more substantial non-linear transformation of features. If the task demands significant feature transformation at individual positions, MLPs may need adaptation.
Model Architecture
 Proportion of Parameters: In transformer architectures, MLPs typically contain a larger number of parameters compared to attention mechanisms (of the order of 2x to 5x). For example, in standard configurations like those seen in BERT or GPT models, the MLPs can contain around three times more parameters than the attention layers.
 Impact on Efficiency: Because MLPs are parameterheavy, finetuning them can significantly increase the number of trainable parameters, impacting training efficiency and computational requirements. If parameter efficiency is a priority, you might opt to adapt only the attention layers, as originally done in the LoRA approach.
Computational Constraints
 Resource Availability: The decision can also be influenced by available computational resources. Adapting attention layers only can save computational resources and training time, making it a preferable option when resources are limited.
 Balance of Adaptation and Performance: If computational resources allow, experimenting with both components can be useful to understand which contributes more to performance improvements on specific tasks.
Empirical Testing
 A/B Testing: One effective way to determine the optimal strategy for a specific model and task is to conduct empirical tests where you finetune the attention layers alone, the MLP layers alone, and both together in different experiments to compare the performance impacts.
 Performance Metrics: Monitoring key performance metrics specific to the task during these tests will guide which components are more critical to finetune.
TaskSpecific Research and Insights
 Literature and Benchmarks: Insights from research papers and benchmarks on similar tasks can provide guidelines on what has worked well historically in similar scenarios. For example, tasks that require nuanced understanding of input relationships (like question answering or summarization) might benefit more from tuning attention mechanisms.

In summary, the choice between tuning attention or MLP layers depends on the specific demands of the task, the model’s architecture, the balance of parameters, and empirical results. Considering these aspects can help in making a decision that optimizes both performance and efficiency.
Assuming we’re finetuning attention weights, which specific attention weight matrices should we apply LoRA to?
 The question of which attention weight matrices in the transformer architecture should be adapted using LoRA to optimize performance on downstream tasks is central to maximizing the effectiveness of parameter usage, especially when dealing with large models like GPT3. Based on the findings reported in the LoRA paper and the specific experiment mentioned, here’s a detailed explanation and recommendation:
Context and Setup
 The LoRA paper explores the adaptation of various weight matrices within the selfattention module of GPT3 under a limited parameter budget. With a constraint of 18 million trainable parameters, the authors tested different configurations of adapting the weights associated with the query (\(W_q\)), key (\(W_k\)), value (\(W_v\)), and output (\(W_o\)) matrices. This setup allows for a comparison of the effectiveness of adapting different combinations of weights at varying ranks.
Experimental Findings
 Parameter Allocation: The experiment considered adapting individual weight types at a rank of 8 and combinations of weights at lower ranks (4 and 2) due to the fixed parameter budget. This arrangement allowed assessing whether it is more beneficial to distribute the available parameters across multiple weight types or concentrate them on fewer weights at a higher rank.
 Performance Metrics: The validation accuracies on the WikiSQL and MultiNLI datasets served as the primary performance indicators. The results show varying degrees of success depending on which weights were adapted and how the ranks were distributed. The table below from the LoRA paper shows validation accuracy on WikiSQL and MultiNLI after applying LoRA to different types of attention weights in GPT3, given the same number of trainable parameters. Adapting both \(W_q\) and \(W_v\) gives the best performance overall. They find the standard deviation across random seeds to be consistent for a given dataset, which they report in the first column.
Key Results and Recommendations
 Single vs. Multiple Weight Adaptation: Adapting single weight matrices (\(W_q\), \(W_k\), \(W_v\), or \(W_o\) individually) at a higher rank generally resulted in lower performance compared to adapting combinations of weights at a reduced rank. Specifically, putting all parameters in \(\Delta W_q\) or \(\Delta W_k\) alone did not yield optimal results.
 Optimal Combination: The combination of adapting both \(W_q\) and \(W_v\) at a rank of 4 emerged as the most effective strategy, achieving the highest validation accuracies on both datasets. This suggests a balanced approach to distributing the parameter budget across multiple types of attention weights, rather than focusing on a single type, leads to better performance.
 Effectiveness of Rank Distribution: The result indicates that a lower rank (such as 4) is sufficient to capture essential adaptations in the weights, making it preferable to spread the parameter budget across more types of weights rather than increasing the rank for fewer weights.
Conclusion and Strategy for Applying LoRA
 Based on these findings, when applying LoRA within a limited parameter budget, it is advisable to:
 Distribute Parameters Across Multiple Weights: Focus on adapting multiple types of attention weights (such as \(W_q\) and \(W_v\)) rather than a single type, as this approach leverages the synergistic effects of adapting multiple aspects of the attention mechanism.
 Use Lower Ranks for Multiple Weights: Opt for a lower rank when adapting multiple weights to ensure that the parameter budget is used efficiently without compromising the ability to capture significant adaptations.
 This strategy maximizes the impact of the available parameters by enhancing more dimensions of the selfattention mechanism, which is crucial for the model’s ability to understand and process input data effectively across different tasks.
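As a concrete illustration, the configuration below sketches this strategy with the HuggingFace peft library. The module names q_proj/v_proj apply to LLaMAstyle models and vary by architecture, and the base model name is a placeholder.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
config = LoraConfig(
    r=4,                                   # lower rank, spread over two weight types
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # adapt W_q and W_v, as in the paper's best setup
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()         # confirms the small trainable-parameter budget
```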
Is there a relationship between setting scaling factor and rank in LoRA?
 In the LoRA framework, the relationship between the scaling factor \(\alpha\) and the rank \(r\) of the adaptation matrices \(A\) and \(B\) is an important consideration for tuning the model’s performance and managing how adaptations are applied to the pretrained weights. Both \(\alpha\) and \(r\) play significant roles in determining the impact of the lowrank updates on the model, and their settings can influence each other in terms of the overall effect on the model’s behavior.
Understanding \(\alpha\) and \(r\)
 Scaling Factor \(\alpha\): This parameter scales the contribution of the lowrank updates \(\Delta W = BA\) before they are applied to the original model weights \(W\). It controls the magnitude of changes introduced by the adaptation, effectively modulating how aggressive or subtle the updates are.
 Rank \(r\): This determines the dimensionality of the lowrank matrices \(A\) and \(B\). The rank controls the expressiveness of the lowrank updates, with higher ranks allowing for more complex adaptations but increasing computational costs and potentially the risk of overfitting.
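A useful anchor for reasoning about this interaction: in the reference LoRA implementation (and libraries such as peft), the update is scaled by \(\alpha / r\), so the effective step size of the adaptation depends on both hyperparameters jointly. A minimal sketch:

```python
# Effective LoRA update: h = W x + (alpha / r) * B A x.
# With alpha fixed, doubling r halves the scale of each low-rank direction;
# this is why alpha is often re-tuned (or co-tuned) whenever r changes.
def lora_forward(x, W, A, B, alpha, r):
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```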
Relationship and Interaction
 Balancing Impact: A higher rank \(r\) allows the model to capture more complex relationships and nuances in the adaptations, potentially leading to more significant changes to the model’s behavior. In such cases, \(\alpha\) might be adjusted downward to temper the overall impact, ensuring that the modifications do not destabilize the model’s pretrained knowledge excessively.
 Adjusting for Subtlety: Conversely, if the rank \(r\) is set lower, which constrains the flexibility and range of the updates, \(\alpha\) may need to be increased to make the limited updates more impactful. This can help ensure that the adaptations, though less complex, are sufficient to achieve the desired performance improvements.
 Experimental Tuning: The optimal settings for \(\alpha\) and \(r\) often depend on the specific task, the dataset, and the desired balance between adapting to new tasks and retaining generalizability. Experimentation and validation are typically necessary to find the best combination.
Practical Considerations
 Overfitting vs. Underfitting: Higher ranks with aggressive scaling factors can lead to overfitting, especially when the model starts fitting too closely to nuances of the training data that do not generalize well. Conversely, too low a rank and/or too conservative an \(\alpha\) might lead to underfitting, where the model fails to adapt adequately to new tasks.
 Computational Efficiency: Higher ranks increase the number of parameters and computational costs. Balancing \(\alpha\) and \(r\) can help manage computational demands while still achieving meaningful model improvements.
Conclusion
 The relationship between \(\alpha\) and \(r\) in LoRA involves a delicate balance. Adjusting one can necessitate compensatory changes to the other to maintain a desired level of adaptation effectiveness without sacrificing the model’s stability or performance. Understanding how these parameters interact can significantly enhance the strategic deployment of LoRA in practical machine learning tasks.
How do you determine the optimal rank \(r\) for LoRA?
 The optimal rank \(r\) for LoRA is influenced by the specific task and the type of weight adaptation. Based on the results reported in the paper from the experiments on the WikiSQL and MultiNLI datasets:
 For WikiSQL:
 When adapting only \(W_q\), the optimal rank is \(r = 4\), with a validation accuracy of 70.5%.
 When adapting \(W_q\) and \(W_v\), the optimal rank is \(r = 8\), with a validation accuracy of 73.8%.
 When adapting \(W_q, W_k, W_v, W_o\), the optimal ranks are \(r = 4\) and \(r = 8\), both achieving a validation accuracy of 74.0%.
 For MultiNLI:
 When adapting only \(W_q\), the optimal rank is \(r = 4\), with a validation accuracy of 91.1%.
 When adapting \(W_q\) and \(W_v\), the optimal rank is \(r = 8\), with a validation accuracy of 91.6%.
 When adapting \(W_q, W_k, W_v, W_o\), the optimal ranks are \(r = 2\) and \(r = 4\), both achieving a validation accuracy of 91.7%.
 The table below from the paper shows the validation accuracy on WikiSQL and MultiNLI with different rank \(r\) by adapting \(\left\{W_q, W_v\right\}\), \(\left\{W_q, W_k, W_v, W_o\right\}\), and just \(W_q\) for comparison. To the authors’ surprise, a rank as small as one suffices for adapting both \(W_q\) and \(W_v\) on these datasets, while training \(W_q\) alone needs a larger \(r\).
 In summary, while the optimal rank \(r\) varies depending on the dataset and the type of weight adaptation, a rank of \(r = 4\) or \(r = 8\) generally yields the best performance. Specifically, a rank of \(r = 4\) is often sufficient for single weight types like \(W_q\), and a rank of \(r = 8\) is more effective for adapting multiple weight types such as \(W_q\) and \(W_v\).
 However, a small \(r\) cannot be expected to work for every task or dataset. Consider the following thought experiment: if the downstream task were in a different language than the one used for pretraining, retraining the entire model (similar to LoRA with \(r = d_{model}\)) could certainly outperform LoRA with a small \(r\).
 In summary, selecting a rank that is too high can counteract the benefits of the lowrank adaptation by allowing the model to become overly complex and fit the training data too precisely. Conversely, choosing a rank that’s too low may limit the model’s ability to capture necessary information, leading to underfitting. Therefore, setting the rank in LoRA finetuning involves finding a balance: enough capacity to adapt to new data without overfitting.
How do LoRA hyperparameters interact with each other? Is there a relationship between LoRA hyperparameters?
 There is a significant relationship among the hyperparameters in the LowRank Adaptation (LoRA) technique, particularly how they interact and influence each other to affect the adaptation and performance of the model. Understanding the interactions between these hyperparameters is crucial for effectively tuning the model to achieve desired behaviors and performance improvements. Here’s a detailed breakdown of the primary hyperparameters in LoRA and how they are interrelated:
 Rank and Scaling Factor:
 Higher ranks allow \(A\) and \(B\) to capture more detailed and complex modifications. However, with increased rank, the potential for overfitting and destabilizing the original model’s behavior also rises. The scaling factor \(\alpha\) often needs to be adjusted in response to the rank; a higher rank might require a smaller \(\alpha\) to moderate the effect of these more complex updates.
 Rank and Regularization:
 As the rank increases, the number of parameters in \(A\) and \(B\) also increases, which can lead to overfitting. Regularization becomes more critical with higher ranks to ensure that the model generalizes well and does not just memorize the training data.
 Learning Rate and Scaling Factor:
 The learning rate for \(A\) and \(B\) can influence how quickly the model adapts the lowrank updates. If \(\alpha\) is high, leading to stronger updates, a lower learning rate might be necessary to prevent training instability. Conversely, with a lower \(\alpha\), a higher learning rate might be feasible to ensure that the updates are sufficiently impactful.
 Regularization and Learning Rate:
 Regularization settings might need adjustment based on the learning rate. A higher learning rate can cause larger updates, which might increase the risk of overfitting unless balanced by stronger regularization.
Practical Considerations
 Tuning Strategy:
 Tuning these hyperparameters requires careful experimentation and validation. Often, changes to one parameter necessitate adjustments to others to maintain a balanced and effective training regime.
 Tradeoffs:
 There are tradeoffs between model flexibility, training stability, computational efficiency, and the risk of overfitting. Effective management of LoRA’s hyperparameters is key to navigating these tradeoffs.
 ApplicationSpecific Adjustments:
 Depending on the specific requirements of the task and characteristics of the data, the optimal settings for these hyperparameters can vary significantly. Taskspecific performance metrics and validation are essential to guide these adjustments.
 In summary, understanding and managing the relationships between these LoRA hyperparameters enables practitioners to finely tune their models to specific tasks without extensive retraining while leveraging pretrained model architectures efficiently.
Why does a higher rank make it easier to overfit?
 In LoRAbased finetuning, a higher rank can indeed lead to easier overfitting. To understand why, let’s break down the mechanics of LoRA and how rank affects model capacity and overfitting.
 The rank in LoRA determines the dimensions of the lowrank matrices \(A\) and \(B\) that it adds alongside the frozen weights, effectively controlling their capacity to capture information:
 Low Rank: Small matrices that can represent only limited information.
 High Rank: Larger matrices with greater capacity to capture complex patterns.

In mathematical terms, a higher rank means more degrees of freedom in the lowrank matrices, allowing them to approximate more complex relationships in the data.
 Here’s why a higher rank increases overfitting in LoRA:

Increased Capacity to Capture Training Noise: A higher rank increases the expressive power of the LoRA matrices. This means they can capture not only meaningful patterns in the training data but also noise or spurious correlations. This added capacity can lead the model to “memorize” the training data rather than generalize from it, making it prone to overfitting.

Less Regularization Effect: Lowrank matrices act as a form of regularization by constraining the model’s capacity to learn only the most essential patterns. When the rank is increased, this regularization effect diminishes. The model can then adjust more parameters, fitting closely to the training data distribution, which can hurt its performance on unseen data.

Reduced Ability to Generalize: The initial idea behind LoRA is to adapt large models with minimal parameter changes to preserve generalization. By increasing the rank, we deviate from this minimalist adaptation, moving toward a more specialized adaptation to the training data. This specialization makes it harder for the model to generalize to different data distributions.

Higher Variance in Learned Features: With higherrank matrices, the LoRAbased adjustments might capture a wider variety of features from the training data, leading to high variance in the learned representations. This increased variance can cause the model’s predictions to vary more significantly with small changes in the input, reducing its robustness and making it overfit the nuances of the training set.
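The capacity growth is easy to quantify: each LoRAadapted matrix adds \(r \times (d_{in} + d_{out})\) trainable parameters, so capacity (and with it the ability to fit noise) grows linearly in the rank. A quick check, assuming a hypothetical hidden size of 4096:

```python
d_in = d_out = 4096                      # hypothetical hidden size
for r in [1, 4, 16, 64]:
    print(f"r={r:<3} trainable params per matrix: {r * (d_in + d_out):,}")
# r=1  ->   8,192 parameters; r=64 -> 524,288 parameters:
# a 64x growth in the degrees of freedom available to memorize training noise.
```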

Does LoRA adapt weights in all layers?

LoRA does not typically adapt weights across all layers of a neural network; instead, it targets specific layers, often the attention layers in large language models. This selective adaptation is a design choice aimed at balancing the effectiveness of finetuning with computational efficiency and minimizing the risk of overfitting. By modifying only key layers, like attention layers, LoRA efficiently focuses on layers where taskspecific information is most impactful while preserving the generalpurpose features of the lower layers.

Layers Typically Adapted in LoRA:
 In the original LoRA implementation:
 Attention Layers: LoRA primarily targets attention layers (such as the query and value projections in transformers) because they play a critical role in capturing contextual information. By adapting only these layers, LoRA can achieve significant taskspecific improvements without needing to modify the entire model.
 Few Additional Layers (if necessary): Sometimes, LoRA may extend adaptation to a few other layers (like feedforward layers in transformers) if the new task requires it. However, this is usually done with caution to avoid overfitting and to keep the parameter footprint low.

Why not all layers?:
 Computational Efficiency: Adapting all layers would introduce a large number of lowrank matrices throughout the model, greatly increasing the memory and computation requirements, which LoRA is specifically designed to reduce.
 Risk of Overfitting: Adapting all layers, especially the lower (more general) layers, could lead the model to overfit to the finetuning dataset, particularly if the dataset is small. Lower layers tend to capture general features, and adapting them might make the model too specialized, losing generalization ability.
 Focus on TaskSpecific Information: The upper (or top) layers of a model generally capture taskspecific features, while lower layers handle more general, reusable features. LoRA’s selective adaptation focuses on adjusting only those layers where taskspecific learning is most beneficial.
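In practice, layerselective adaptation can be expressed directly in the HuggingFace peft library; the sketch below (with hypothetical layer indices for a 32block model) restricts LoRA to the upper blocks only:

```python
from peft import LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],      # attention projections only
    layers_to_transform=list(range(24, 32)),  # hypothetical: top 8 of 32 blocks
)
# Lower blocks keep their pretrained, general-purpose features untouched.
```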
Does LoRA impact lower attention layers less than higher attention layers?

Yes, in practice, LoRA impacts higher attention layers more than lower ones. This is because LoRA selectively adapts layers, targeting the taskspecific adaptability of higher layers while preserving the generalpurpose features in lower layers. This design enables effective task adaptation with minimal overfitting, allowing the model to retain broad applicability.

Why higher attention layers are more affected:

Function of Higher Attention Layers: Higher attention layers (those closer to the output) tend to capture more taskspecific, abstract information. During finetuning, LoRA modifies these layers to incorporate new taskrelated features. Adjustments here have a greater impact on the model’s output because these layers process information in a way that directly influences final predictions.

Less Impact on Lower Layers: Lower layers (closer to the input) generally focus on extracting basic, general features. For example, in language models, they capture fundamental linguistic structures like syntax and word relationships. Since these lower layers capture foundational patterns, they benefit less from taskspecific adaptations. Finetuning these lower layers with LoRA could lead to a loss of generalizable features, which would reduce the model’s ability to transfer across tasks.

LoRA’s Selective Impact: LoRA is typically implemented on a subset of attention heads or specific projections within the attention mechanism (e.g., the query and value projections). Even when applied across all layers, the taskspecific nature of finetuning tends to have a more pronounced effect on the higher layers, which adapt more flexibly to new data patterns.

Regularization Effect in Lower Layers: Because LoRA introduces a lowrank constraint, it inherently acts as a form of regularization. Lower layers, which are already constrained to represent general features, become even more regularized. This further reduces the likelihood of significant changes in these layers and minimizes the effect of LoRA on them.


Practical Implications:

In many cases, finetuning with LoRA results in:
 Major adjustments to higher layers, allowing the model to learn specific features of the finetuning task.
 Minimal impact on lower layers, preserving general knowledge from pretraining and preventing overfitting.
Quantized LowRank Adaptation (QLoRA)
 Proposed in QLoRA: Efficient Finetuning of Quantized LLMs.
 This paper by Dettmers et al. from UW presents QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16bit finetuning task performance. Put simply, QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. QLoRA backpropagates gradients through a frozen, 4bit quantized pretrained language model into Low Rank Adapters (LoRA).
 In other words, QLoRA is a method designed to efficiently finetune large pretrained language models (LLMs), like a 65B parameter model, on limited GPU memory without sacrificing performance. It combines the principles of LowRank Adaptation (LoRA) with innovative 4bit NormalFloat (NF4) quantization and Double Quantization techniques, optimizing parameter efficiency and computational resource utilization.
 At a toplevel, QLoRA operates based on the following steps:
 Quantize the pretrained model to 4 bits and freeze it.
 Attach small, trainable adapter layers (similar to LoRA).
 Finetune only the adapter layers while using the frozen quantized model for context.
 Key Components:
 LowRank Adaptation: QLoRA follows LoRA’s strategy of injecting trainable lowrank matrices into the architecture of pretrained LLMs, specifically targeting Transformer layers. This selective finetuning strategy focuses on optimizing these lowrank matrices rather than the entire model, reducing the number of trainable parameters and computational costs.
 Quantization: The distinguishing aspect of QLoRA lies in its quantization approach, which includes:
 NF4 Quantization: Model weights are quantized to 4bit NormalFloat (NF4), a data type tailored to the approximately normal distribution of pretrained weights, compressing them efficiently without requiring complex quantization algorithms.
 Double Quantization: This secondary quantization further reduces memory overhead by quantizing the quantization constants themselves, using 8bit floats with a 256 block size, achieving significant memory savings without affecting model performance.
 Operation:
 QLoRA employs a frozen, 4bit quantized pretrained language model and finetunes it by backpropagating gradients into the low rank adapters. This method optimizes computation through lowbit quantization and reduces the number of parameters by using lowrank structures, striking a balance between efficiency and performance.
 Their best model family, which they name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes.
 They use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models).
 Their results show that QLoRA finetuning on a small highquality dataset leads to stateoftheart results, even when using smaller models than the previous SoTA. They provide a detailed analysis of chatbot performance based on both human and GPT4 evaluations showing that GPT4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, they find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemonpicked analysis demonstrates where Guanaco fails compared to ChatGPT.
 The figure below from the paper shows different finetuning methods and their memory requirements. QLoRA improves over LoRA by quantizing the transformer model to 4bit precision and using paged optimizers to handle memory spikes.
In the QLoRA approach, it is the original model’s weights that are quantized to 4bit precision. The newly added Lowrank Adapter (LoRA) weights are not quantized; they remain at a higher precision and are finetuned during the training process. This strategy allows for efficient memory use while maintaining the performance of large language models during finetuning.
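A typical QLoRAstyle setup with the HuggingFace transformers and peft libraries looks like the sketch below (the base model name is a placeholder): the base weights are loaded in NF4 with double quantization, while the LoRA adapters stay in higher precision.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # placeholder base model
    quantization_config=bnb_config,
)
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)  # adapters remain in bf16, only they train
```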
 To learn more about QLoRA and how it works, this Hugging Face blog post is highly recommended.
QuantizationAware LowRank Adaptation (QALoRA)
 Proposed in QALoRA: QuantizationAware LowRank Adaptation of Large Language Models.
 Recent advancements in large language models (LLMs) have significantly improved their capabilities in various languageunderstanding tasks. However, the deployment of these models, especially on edge devices, is hampered by their substantial computational requirements.
 This paper by Xu et al. from Huawei seeks to address the aforementioned issue and proposes a quantizationaware lowrank adaptation (QALoRA) algorithm, a technique that aims to mitigate the computational burden by efficiently finetuning lowbit large language models without compromising accuracy. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use groupwise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation. QALoRA is easily implemented with a few lines of code, and it equips the original LoRA with twofold abilities: (i) during finetuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after finetuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy.
 Put simply, QALoRA stands out by incorporating a quantizationaware approach that merges and quantizes the weights of LowRank Adaptation (LoRA) with the fullprecision model weights. This process not only optimizes memory usage and computational efficiency during inference but also ensures the seamless integration of LoRA and auxiliary weights into a quantized model. Notably, QALoRA allows for the reduction of weight precision (e.g., to INT4, INT3, and INT2) during finetuning, significantly decreasing time and memory usage while maintaining accuracy levels, as it eliminates the need for posttraining quantization.
 The algorithm operates through several key steps:
 Quantizing the Pretrained Weights: The LLM’s weights are quantized (e.g., into INT4) using groupwise operators and remain quantized throughout finetuning, reducing time and memory usage.
 Adding LoRA Weights: Trainable LoRA weights are introduced on top of the quantized model, giving it the capacity to adapt.
 FineTuning LoRA Weights: Only the LoRA weights are updated, while the quantized model weights remain unchanged.
 Merging: After finetuning, the LoRA weights are naturally integrated into the quantized model, so no posttraining quantization (and hence no associated accuracy loss) is needed.
 The following figure from the paper shows an illustration of the goal of QALoRA. Compared to prior adaptation methods, LoRA and QLoRA, QALoRA is computationally efficient in both the finetuning and inference stages. More importantly, it does not suffer an accuracy loss because posttraining quantization is not required. They display INT4 quantization in the figure, but QALoRA can be generalized to INT3 and INT2.
 QALoRA’s effectiveness has been validated across different finetuning datasets and downstream scenarios, particularly with the LLaMA and LLaMA2 model families. Its unique integration of quantizationaware techniques with lowrank adaptation principles marks a significant advancement in the finetuning of LLMs for lowbit settings. This approach not only addresses the challenges posed by the computational demands of LLMs but also opens up new possibilities for deploying these models more efficiently and effectively on a wider range of devices.
 The implementation of QALoRA is straightforward and can be achieved with a minimal addition to the existing codebase, showcasing its practicality for widespread adoption. Further details, including code examples, are available on their GitHub repository, making it accessible for researchers and practitioners aiming to leverage the benefits of this innovative adaptation technique.
 Code
Refined LowRank Adaptation (ReLoRA)
 Proposed in Stack More Layers Differently: HighRank Training Through LowRank Updates by Lialin et al. from UMass Lowell.
 Refined LowRank Adaptation (ReLoRA) is a lowrank training technique proposed as an alternative approach to training large neural networks. ReLoRA utilizes lowrank updates to train highrank networks. Put simply, the authors explore whether LoRA can be used for pretraining (as opposed to finetuning) LLMs in a parameterefficient manner.
 They apply ReLoRA to pretraining transformer language models with up to 350M parameters and demonstrate comparable performance to regular neural network training.
 Furthermore, they observe that the efficiency of ReLoRA increases with model size, making it a promising approach for training multibillionparameter networks efficiently. Their findings shed light on the potential of lowrank training techniques and their implications for scaling laws.
 A caveat worth mentioning is that the researchers only pretrained models up to 350 M parameters for now (the smallest Llama 2 model is 7B parameters, for comparison).
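The core mechanism is a mergeandrestart loop: train a lowrank update, merge it into the base weights, reinitialize the lowrank factors, and repeat, so that the sum of several rank\(r\) updates reaches a much higher total rank. Below is a toy, selfcontained sketch with a madeup objective; the actual method additionally uses a jagged learningrate schedule and partial optimizer resets at each restart, omitted here for brevity.

```python
import torch

d, r, cycles, steps = 64, 4, 5, 100
W = torch.randn(d, d)                                  # stand-in "pretrained" weight
target = torch.eye(d)                                  # made-up regression target
for _ in range(cycles):
    A = (0.01 * torch.randn(r, d)).requires_grad_()    # fresh Gaussian A
    B = torch.zeros(d, r, requires_grad=True)          # fresh B = 0
    opt = torch.optim.Adam([A, B], lr=1e-2)
    for _ in range(steps):                             # low-rank inner training
        loss = ((target - (W + B @ A)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    W = (W + B @ A).detach()                           # merge the rank-r update into W
# After all cycles, W has absorbed an update of rank up to cycles * r = 20.
```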
 The following figure (source) presents an overview of their results:
SLoRA: Serving Thousands of Concurrent LoRA Adapters
 This paper by Sheng et al. from UC Berkeley, Stanford, and Shanghai Jiao Tong focuses on the scalable serving of LoRA (LowRank Adaptation) adapters for large language models (LLMs).
 The “pretrainthenfinetune” paradigm, widely adopted in deploying LLMs, leads to numerous finetuned variants, presenting significant opportunities for batched inference during serving. The paper introduces SLoRA, a system designed for this purpose.
 SLoRA addresses memory management challenges by storing all adapters in main memory and fetching them to GPU memory as needed. The system employs Unified Paging, a unified memory pool managing dynamic adapter weights and KV cache tensors, to reduce memory fragmentation and I/O overhead.
 The paper presents a novel tensor parallelism strategy and customized CUDA kernels for efficient heterogeneous batching of LoRA computations, enabling the serving of thousands of adapters on a single or multiple GPUs with minimal overhead.
 The following image from the paper shows separated batched computation for the base model and LoRA computation. The batched computation of the base model is implemented by GEMM. The batched computation for LoRA adapters is implemented by custom CUDA kernels which support batching various sequence lengths and adapter ranks.
 The following image from the paper shows an overview of memory allocation in SLoRA. SLoRA stores all adapters in the main memory and fetches the active adapters for the current batch to the GPU memory. The GPU memory is used to store the KV cache, adapter weights, base model weights, and other temporary tensors.
 SLoRA’s performance is evaluated against stateoftheart libraries like HuggingFace PEFT and vLLM, showing up to 4 times higher throughput and the capability to serve significantly more adapters.
 The system is effective in reducing the training and communication costs in Federated Learning, making it a promising approach for deploying large language models in resourceconstrained environments.
 This paper contributes significantly to the field of machine learning by presenting a novel and efficient method for serving a large number of LoRA adapters, a crucial aspect in the deployment of largescale language models.
 Code
Predibase
 Similar to SLoRA, Predibase, a startup, offers a unique serving infrastructure – LoRAX – which lets you costeffectively serve many finetuned adapters on a single GPU in dedicated deployments.
WeightDecomposed LowRank Adaptation (DoRA)
 Proposed in DoRA: WeightDecomposed LowRank Adaptation by Liu et al. from NVIDIA and HKUST.
 WeightDecomposed LowRank Adaptation (DoRA) is a novel ParameterEfficient FineTuning (PEFT) method that surpasses existing techniques like LoRA by decomposing pretrained weights into magnitude and directional components for efficient finetuning. This method is designed to bridge the accuracy gap between LoRAbased methods and full finetuning, without increasing inference costs.
 The authors’ weight decomposition analysis reveals fundamental differences between full finetuning and LoRA, showing that directional updates play a crucial role in learning capability. DoRA employs LoRA for directional updates and introduces trainable magnitude components, enhancing learning capacity and stability.
 DoRA demonstrates superior performance across a range of tasks, including commonsense reasoning, visual instruction tuning, and image/videotext understanding, across models like LLaMA, LLaVA, and VLBART. It achieves this by effectively managing the tradeoff between the number of trainable parameters and learning capacity, without adding inference overhead.
 The following figure from the paper illustrates an overview of DoRA, which decomposes the pretrained weight into magnitude and direction components for finetuning, using LoRA to efficiently update the direction component. Note that \(\|\cdot\|_c\) denotes the vectorwise norm of a matrix across each column vector. (A compact formulation is given after this list.)
 Experiments show that DoRA not only outperforms LoRA but also matches or exceeds the performance of full finetuning across different tasks, with significant improvements in commonsense reasoning tasks and multimodal understanding, illustrating its effectiveness and efficiency.
 The paper also explores DoRA’s compatibility with other LoRA variants, such as VeRA, and demonstrates its adaptability across different training sizes and rank settings, further establishing its utility as a versatile and powerful finetuning method.
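As referenced above, the decomposition can be written compactly as follows; this is a paraphrase of the paper’s formulation, where \(W_0\) is the pretrained weight, \(BA\) the standard LoRA lowrank update, and \(m\) a trainable magnitude vector initialized to \(\|W_0\|_c\):

\[ W' = m \, \frac{W_0 + BA}{\|W_0 + BA\|_c} \]

Because the directional part is normalized columnwise, \(m\) alone controls the scale of each column while the LoRA factors adjust only the direction.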
 Blog
Summary of LoRA Techniques
 The following section is inspired from Cameron Wolfe’s (source) post.
 Here’s an overview of some prevalent variants of LoRA techniques:
 LoRA models the update derived for a model’s weights during finetuning with a low rank decomposition, implemented in practice as a pair of linear projections. LoRA leaves the pretrained layers of the LLM fixed and injects a trainable rank decomposition matrix into each layer of the model.
 QLoRA is (arguably) the most popular LoRA variant and uses model quantization techniques to reduce memory usage during finetuning while maintaining (roughly) equal levels of performance. QLoRA uses 4bit quantization on the pretrained model weights and trains LoRA modules on top of this. In practice, QLoRA saves memory at the cost of slightlyreduced training speed.
 QALoRA is an extension of LoRA/QLoRA that further reduces the computational burden of training and deploying LLMs. It does this by combining parameterefficient finetuning with quantization (i.e., groupwise quantization applied during training/inference).
 LoftQ studies a similar idea to QALoRA – applying quantization and LoRA finetuning on a pretrained model simultaneously.
 LongLoRA attempts to cheaply adapt LLMs to longer context lengths using a parameterefficient (LoRAbased) finetuning scheme. In particular, we start with a pretrained model and finetune it to have a longer context length. This finetuning is made efficient by:
 Using sparse local attention instead of dense global attention (optional at inference time).
 Using LoRA (authors find that this works well for context extension).
 SLoRA aims to solve the problem of deploying multiple LoRA modules that are used to adapt the same pretrained model to a variety of different tasks. Put simply, SLoRA does the following to serve thousands of LoRA modules on a single GPU (or across GPUs):
 Stores all LoRA modules in main memory.
 Puts modules being used to run the current query into GPU memory.
 Uses unified paging to allocate GPU memory and avoid fragmentation.
 Proposes a new tensor parallelism strategy to batch LoRA computations.
 ReLoRA refines neural network training by iteratively applying lowrank updates to achieve highrank performance, streamlining the process for large models.
 DoRA surpasses existing techniques like LoRA by decomposing pretrained weights into magnitude and directional components for efficient finetuning. This method is designed to bridge the accuracy gap between LoRAbased methods and full finetuning, without increasing inference costs. It employs LoRA for directional updates and introduces trainable magnitude components, enhancing learning capacity and stability.
 Many other LoRA variants exist as well:
 LQLoRA: uses a more sophisticated quantization scheme within QLoRA that performs better and can be adapted to a target memory budget.
 MultiLoRA: extension of LoRA that better handles complex multitask learning scenarios.
 LoRAFA: freezes half of the lowrank decomposition matrix (i.e., the A matrix within the product AB) to further reduce memory overhead.
 TiedLoRA: leverages weight tying to further improve the parameter efficiency of LoRA.
 GLoRA: extends LoRA to adapt pretrained model weights and activations to each task in addition to an adapter for each layer.
Lowrank Linear Subspace ReFT (LoReFT)
 Proposed in ReFT: Representation Finetuning for Language Models by Wu et al. from Stanford and the Pr(Ai)2R Group.
 Representation Finetuning (ReFT) is a suite of methods to modify the hidden representations of language models (LMs) for taskspecific adaptation. Unlike traditional parameterefficient finetuning (PEFT) methods that adapt by modifying weights, ReFT manipulates a small fraction of model representations, enhancing the interpretability and flexibility of the interventions.
 A key variant within ReFT, named Lowrank Linear Subspace ReFT (LoReFT), leverages a lowrank projection matrix to edit representations in a linear subspace. This approach is demonstrated to be 10\(\times\)–50\(\times\) more parameterefficient compared to existing stateoftheart PEFTs like LoRA.
 The ReFT methodology, specifically Lowrank Linear Subspace ReFT (LoReFT), operates by editing hidden representations in a linear subspace. LoReFT modifies these representations using a projection matrix \(R\), which redefines them in a lowdimensional subspace for efficiency. The matrix \(R\) has orthonormal rows, which are crucial for maintaining the quality of the intervention without adding much complexity.
 The core intervention of LoReFT, as per the distributed interchange intervention (DII) formula \(\operatorname{DII}(b, s, R) = b + R^\top(Rs - Rb)\), leverages the projection matrix to adjust the hidden states \(b\) towards a target state \(s\) by the application of \(R\). This intervention is designed to manipulate the model output towards desired behaviors or answers subtly and effectively.
 LoReFT employs a linear transformation defined by the parameters \(W\) and \(b\) (here \(b\) is the bias of this linear map, not the base representation \(b\) in the DII formula), which projects the representation into the subspace before it is edited. This transformation helps in aligning the representation more closely with the taskspecific features that are crucial for performance.
 Practically, LoReFT is implemented as a set of nonoverlapping interventions across multiple layers of a Transformerbased model. These interventions are strategically placed to modify the behavior of the model without extensive retraining of the underlying parameters.
 Each intervention is applied after the computation of layer \(L\) representations, meaning it directly affects the computation of subsequent layers \(L+1\) to \(L+m\). This placement ensures that the interventions have a cascading effect, enhancing their impact on the final model output.
 The hyperparameter tuning for LoReFT focuses on the number and placement of interventions across the layers, optimizing both the effectiveness of each intervention and the overall computational overhead. This involves selecting the appropriate number of prefix and suffix positions in the input where interventions are most beneficial, as well as deciding on the layers where these modifications will have the most impact.
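Putting these pieces together, a minimal PyTorch sketch of a single LoReFT intervention, following \(\Phi(h) = h + R^\top(Wh + b - Rh)\) with \(R\) kept (semi)orthonormal via a parametrization, could look like this:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class LoReFTIntervention(nn.Module):
    """One intervention: edit the hidden state h inside an r-dim linear subspace."""
    def __init__(self, d: int, r: int):
        super().__init__()
        self.R = orthogonal(nn.Linear(d, r, bias=False))  # rows of R kept orthonormal
        self.proj = nn.Linear(d, r)                       # the learned W and b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        Rh = self.R(h)                                    # R h: project into the subspace
        # h + R^T (W h + b - R h): nudge the subspace component toward the learned target
        return h + (self.proj(h) - Rh) @ self.R.weight
```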
 The figure below from the paper shows an illustration of ReFT. (1) The left panel depicts an intervention I: the intervention function \(\Phi\) is applied to hidden representations at positions \(P\) in layer \(L\). (2) The right panel depicts the hyperparameters we tune when experimenting with LoReFT. Specifically, the figure depicts application of LoReFT at all layers with prefix length \(p\) = 2 and suffix length \(s\) = 2. When not tying layer weights, we train separate intervention parameters at each position and layer, resulting in 16 interventions with unique parameters in this example.
 The authors evaluate LoReFT across multiple domains, including commonsense reasoning, arithmetic reasoning, instructionfollowing, and natural language understanding. It is shown that LoReFT achieves competitive or superior performance on all tasks, especially shining in commonsense reasoning benchmarks.
 Implementation details reveal that LoReFT interventions are applied at selected layers and positions within the LM, optimizing both the number of interventions and their locations through hyperparameter tuning. This targeted approach allows for minimal additional computational overhead at inference.
 LoReFT is implemented in a publicly available Python library, pyreft, which facilitates the adoption of ReFT methods by providing tools to apply these interventions on any pretrained LM from the HuggingFace model hub.
 The paper establishes the potential of representationfocused finetuning as a more effective alternative to weightbased methods, setting new standards for efficiency and performance in adapting largescale LMs to diverse tasks.
Stratified Progressive Adaptation Finetuning (SPAFIT)
 Proposed in SPAFIT: Stratified Progressive Adaptation Finetuning for Pretrained Large Language Models by Arora and Wang from Simon Fraser University, Stratified Progressive Adaptation Finetuning (SPAFIT) is a novel ParameterEfficient FineTuning (PEFT) method aimed at optimizing the finetuning process of Transformerbased large language models by localizing the finetuning to specific layers according to their linguistic knowledge importance. This addresses issues like catastrophic forgetting and computational inefficiency common in full finetuning methods.
 SPAFIT organizes the model into three groups of layers, with increasing complexity of finetuning allowed as the layers progress from basic linguistic processing to more taskspecific functions. Group 1 layers remain completely frozen, Group 2 layers undergo finetuning only on bias terms, and Group 3 layers are finetuned using both BitFit for simple parameters and LowRank Adaptation (LoRA) for more significant weight matrices.
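A hedged sketch of this stratification for BERTlarge (24 encoder layers) is shown below; the 8/8/8 group split is hypothetical, since the paper experiments with different boundaries, and the LoRA attachment for Group 3 is omitted (see the LoRA sketches earlier).

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-large-cased")
for i, layer in enumerate(model.encoder.layer):
    if i < 8:                                      # Group 1: completely frozen
        for p in layer.parameters():
            p.requires_grad = False
    elif i < 16:                                   # Group 2: bias terms only (BitFit)
        for name, p in layer.named_parameters():
            p.requires_grad = name.endswith("bias")
    else:                                          # Group 3: BitFit on biases, plus
        for name, p in layer.named_parameters():   # LoRA on the large weight matrices
            p.requires_grad = name.endswith("bias")  # (LoRA attachment omitted here)
```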
 The authors conducted experiments using the BERTlargecased model across nine tasks from the GLUE benchmark. Their results demonstrate that SPAFIT can achieve or exceed the performance of full finetuning and other PEFT methods like Full BitFit and Full LoRA while finetuning significantly fewer parameters.
 The figure below from the paper illustrates an example implementation of SPAFIT on BERT.
 Notable results include SPAFIT models achieving the best performance on tasks involving sentence similarity, like MRPC and STSB, and showing a substantial reduction in the number of parameters finetuned—highlighting SPAFIT’s efficiency.
 The research suggests that different types of linguistic knowledge can indeed be localized to specific layers of a language model, potentially leading to more targeted and efficient finetuning strategies.
 The paper raises points for future investigation, including the application of SPAFIT to more complex tasks like summarization and to models that contain both encoder and decoder architectures. The study also acknowledges the need for further analysis on the optimal balance of parameter efficiency against task performance and the extent of adaptation necessary at different layers.
BitFit
 Proposed in BitFit: Simple Parameterefficient Finetuning for Transformerbased Masked Languagemodels by BenZaken et al. from Yoav Goldberg’s group at Bar Ilan University and the Allen Institute for Artificial Intelligence, BitFit is a finetuning method for pretrained BERT models that updates only the biasterms of the model, which are a minimal fraction of the model’s parameters, effectively reducing the memory footprint and computational demands typically associated with full model finetuning.
 BitFit’s methodology leverages the observation that finetuning often doesn’t require extensive retraining of all parameters. Instead, finetuning only the bias terms achieves competitive results compared to full model finetuning, especially with small to mediumsized datasets. In scenarios permitting slight performance degradation, the method can be constrained to adjust only two specific types of bias terms, representing just 0.04% of the total model parameters.
 Implementation details include freezing the transformerencoder’s main weights and training only the bias terms along with a taskspecific classification layer. This approach allows the model to handle multiple tasks efficiently in a streaming fashion without requiring simultaneous access to all task datasets.
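Because only bias terms (and the task head) are trained, BitFit reduces to a few lines in practice. A minimal sketch, assuming a HuggingFace classification model whose head is named classifier:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
for name, param in model.named_parameters():
    # Train only the bias terms plus the task-specific classification head.
    param.requires_grad = name.endswith("bias") or name.startswith("classifier")
```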
 Experimental results on the GLUE benchmark show that BitFit is comparable or superior to full finetuning in several NLP tasks. It also outperforms other parameterefficient methods like DiffPruning and Adapters in terms of the number of parameters modified, showcasing its effectiveness in achieving high performance with significantly fewer trainable parameters.
 The findings underscore the potential of focusing finetuning efforts on a small subset of parameters, specifically bias terms, to maintain or even enhance performance while minimizing computational costs. This approach also prompts further exploration of the role of bias terms in neural networks and their impact on model behavior and task transferability.
NOLA
 Proposed in NOLA: Compressing LoRA Using Linear Combination of Random Basis by Koohpayegani et al. in ICLR 2024, NOLA is a novel method for compressing large language models (LLMs) that addresses the limitations of LowRank Adaptation (LoRA). NOLA reparameterizes the rankdecomposition matrices used in LoRA through linear combinations of randomly generated basis matrices, significantly reducing the parameter count by optimizing only the mixture coefficients.
 NOLA decouples the number of trainable parameters from both the rank choice and network architecture, unlike LoRA, where parameters are inherently dependent on the matrix dimensions and rank, which must be an integer. This method not only preserves the adaptation quality but also allows for extreme compression, achieving up to 20 times fewer parameters than the most compressed LoRA models without loss of performance.
 The method’s implementation includes using a pseudorandom number generator for creating basis matrices, where the generator’s seed and the linear coefficients are stored, greatly reducing storage requirements. Quantization of these coefficients further minimizes storage needs without impacting model performance.
 The figure below from the paper shows the process that NOLA follows. After constraining the rank of \(\Delta W\) by decomposing it to \(A \times B\), we reparametrize A and B to be a linear combination of several random basis matrices. We freeze the basis and W and learn the combination coefficients. To reconstruct the model, we store the coefficients and the seed of the random generator which is a single scalar. NOLA results in more compression compared to LoRA and more importantly decouples the compression ratio from the rank and dimensions of W. One can reduce the number of parameters to 4 times smaller than rank=1 of LoRA which is not possible with LoRA due to rank being an integer number.
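A minimal sketch of this reparameterization for one adapted layer is shown below (dimensions and basis size are hypothetical): the basis matrices are regenerated from a stored seed, and only the \(2k\) mixture coefficients are trainable.

```python
import torch

d_out, d_in, r, k, seed = 512, 512, 4, 64, 0
g = torch.Generator().manual_seed(seed)           # the single stored scalar seed
A_basis = torch.randn(k, r, d_in, generator=g)    # frozen random basis for A
B_basis = torch.randn(k, d_out, r, generator=g)   # frozen random basis for B
alpha = torch.zeros(k, requires_grad=True)        # trainable mixture coefficients
beta = torch.zeros(k, requires_grad=True)

A = torch.einsum("k,krd->rd", alpha, A_basis)     # A = sum_k alpha_k A_k
B = torch.einsum("k,kdr->dr", beta, B_basis)      # B = sum_k beta_k B_k
delta_W = B @ A                                   # rank <= r update, only 2k trainable scalars
```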
 Detailed experimental evaluations across several tasks and models, including GPT2 and LLaMA2, showcase NOLA’s effectiveness. It maintains or exceeds benchmark metrics such as BLEU and ROUGEL while using significantly fewer parameters compared to both LoRA and full model finetuning.
 The approach’s versatility is demonstrated through its application not only in natural language processing tasks but also in adapting Vision Transformer (ViT) models for image classification, indicating its potential widespread applicability across different types of deep learning architectures.
 Code
Matrix of Rank Adaptation (MoRA)
 Proposed in MoRA: HighRank Updating for ParameterEfficient FineTuning by Jiang et al. from Beihang University and Microsoft, MoRA (Matrix of Rank Adaptation) is a parameterefficient finetuning (PEFT) technique for LLMs. The authors identify limitations in existing PEFT methods, particularly LowRank Adaptation (LoRA), which may restrict LLMs’ ability to learn and retain new knowledge. To address these issues, MoRA employs a highrank updating mechanism using a square matrix to achieve greater flexibility and effectiveness without increasing the number of trainable parameters.
 MoRA utilizes nonparameterized operators to adjust input and output dimensions, ensuring the weight can be integrated back into LLMs like LoRA. The method involves the following steps:
 Reduction of Input Dimension: Nonparameter operators reduce the input dimension for the square matrix.
 Increase of Output Dimension: Corresponding operators increase the output dimension, maintaining the number of trainable parameters while achieving highrank updates.
 The figure below from the paper illustrates an overview of the method compared to LoRA under the same number of trainable parameters. \(W\) is the frozen weight from the model. \(A\) and \(B\) are trainable lowrank matrices in LoRA. \(M\) is the trainable matrix in MoRA. Gray parts are nonparameter operators for reducing the input dimension and increasing the output dimension. \(r\) represents the rank in the two methods.
 The authors comprehensively evaluate MoRA across five tasks—instruction tuning, mathematical reasoning, continual pretraining, memory, and pretraining—demonstrating that MoRA outperforms LoRA in memoryintensive tasks and achieves comparable performance in other areas.
 Technical Details and Implementation:
 LowRank Limitation in LoRA: LoRA uses lowrank matrices to approximate fullrank updates, limiting its capacity to store new information, especially in memoryintensive tasks. The lowrank matrices A and B in LoRA struggle to fully capture the complexity needed for tasks requiring substantial knowledge enhancement.
 HighRank Updating in MoRA: MoRA replaces the lowrank matrices with a square matrix, significantly increasing the rank and thus the capacity for updates. For example, LoRA with rank 8 employs matrices \(A \in \mathbb{R}^{4096 \times 8}\) and \(B \in \mathbb{R}^{8 \times 4096}\), while MoRA uses a square matrix \(M \in \mathbb{R}^{256 \times 256}\), achieving a higher rank with the same number of parameters.
 Compression and Decompression Functions: MoRA employs various methods to implement compression and decompression functions, including truncation, sharing rows/columns, reshaping, and rotation. These methods help reduce the input dimension and increase the output dimension effectively (see the sketch after this list).
 Rotation Operators: Inspired by RoPE (Rotary Position Embedding), MoRA introduces rotation operators to differentiate inputs, enhancing the expressiveness of the square matrix.
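As referenced above, the sketch below illustrates the compress/decompress pipeline using truncation and zeropadding, the simplest of the listed operators; the hidden size and square matrix size follow the rank8 example above.

```python
import torch
import torch.nn.functional as F

d, r_hat = 4096, 256
M = torch.zeros(r_hat, r_hat, requires_grad=True)     # trainable square matrix

def mora_delta(x: torch.Tensor) -> torch.Tensor:      # x: (..., d)
    x_c = x[..., :r_hat]                              # compress: truncate 4096 -> 256
    y = x_c @ M.T                                     # update with rank up to 256
    return F.pad(y, (0, d - r_hat))                   # decompress: zero-pad 256 -> 4096

# Same trainable-parameter budget as LoRA with rank 8 on this layer:
assert r_hat * r_hat == 8 * (4096 + 4096)             # 65,536 parameters each
```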
 Evaluation and Results:
 Memory Task: In memorizing UUID pairs, MoRA showed significant improvements over LoRA with the same number of trainable parameters. MoRA required fewer training steps to achieve high accuracy compared to LoRA, demonstrating its effectiveness in memoryintensive tasks.
 FineTuning Tasks: MoRA was evaluated on instruction tuning (using Tülu v2 dataset), mathematical reasoning (using MetaMath, GSM8K, MATH), and continual pretraining (in biomedical and financial domains). It matched LoRA’s performance in instruction tuning and mathematical reasoning but outperformed LoRA in continual pretraining tasks, benefiting from highrank updating.
 Pretraining: MoRA and a variant, ReMoRA (which merges updates back into the model during training), were evaluated on pretraining transformers from scratch on the C4 dataset. MoRA showed better pretraining loss and perplexity metrics compared to LoRA and ReLoRA, further validating the advantages of highrank updating.
 MoRA addresses the limitations of lowrank updates in LoRA by employing highrank matrices, significantly enhancing the model’s capacity to learn and memorize new knowledge. This method shows promise for improving parameterefficient finetuning of LLMs, especially in memoryintensive and domainspecific tasks. The authors provide comprehensive implementation details and empirical evaluations, establishing MoRA as an effective advancement in the field of PEFT.
Which PEFT Technique to Choose: A Mental Model
 Choosing a PEFT technique involves matching each method’s strengths to your objectives, as shown in the figure below.
Soft Prompt Tuning

What: Soft prompt tuning involves prepending a small set of trainable prompt embeddings to the input of the pretrained LLM during finetuning, which steers the representation computed by the frozen pretrained model to better suit the downstream task.

When to use: Prompt tuning is a good choice when you have a large pretrained LLM and want to adapt it to many different downstream tasks with minimal computational resources, since each task needs only a small learned prompt while the base model stays frozen and shared. It is also useful when you want to generate diverse and highquality text outputs based on specific prompts.
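A minimal sketch with the HuggingFace peft library, where only the virtual prompt token embeddings are trainable (the base model is a placeholder):

```python
from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,          # length of the trainable soft prompt
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the prompt embeddings are trainable
```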
Prefix Tuning

What: Prefix Tuning learns a set of trainable prefix vectors that are prepended to the model’s hidden states (keys and values) at every layer; the pretrained LLM’s own parameters stay frozen, and the learned prefixes steer its activations toward the downstream task.

When to use: When you want to finetune a pretrained LLM for a specific downstream task and have limited computational resources, or when you want to modify the representation learned by the pretrained model for a particular task.
Adapters

What: Adapters are tiny modules that are added to pretrained LLMs, typically between the pretrained layers, to adapt the model to new downstream tasks. During finetuning, only the weights of the adapter are learned, while the pretrained model’s parameters remain fixed.

When to use: When you need to finetune multiple downstream tasks on the same pretrained model. Additionally, Adapters are flexible and can be quickly and easily plugged into different parts of the pretrained model without requiring major modifications.
BitFit

What: BitFit simplifies the finetuning process by only updating the bias terms of the model, reducing the number of parameters that need to be modified.

When to use: BitFit is an excellent choice when computational resources are a constraint or when working with smaller datasets. It’s especially suited for tasks where slight performance compromises are acceptable in exchange for greater efficiency.
 Key Features:
 Bias-Only Training: By focusing on updating only the bias terms, BitFit significantly lowers the computational demands and memory usage.
 Efficient Adaptability: This method achieves comparable results to more extensive finetuning methods with far fewer parameter updates, making it ideal for rapid deployment and iterative development.
 Process:
 Freezing Main Weights: The main weights of the Transformer encoder are frozen, preserving the pretrained knowledge.
 Bias Term Training: Only the bias terms are finetuned along with a taskspecific classification layer, providing an efficient way to adapt the model to new tasks.
 Evaluation Across Tasks: BitFit’s efficacy is tested on various NLP tasks, showing its capability to maintain high performance with minimal parameter adjustments.
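A minimal sketch of the BitFit recipe; the model and the classifier-head naming are placeholder assumptions:

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module, head_prefix: str = "classifier") -> None:
    """Freeze everything except bias terms (plus a task head, per the BitFit recipe)."""
    for name, param in model.named_parameters():
        is_bias = name.endswith("bias")
        is_head = name.startswith(head_prefix)  # task-specific layer stays trainable
        param.requires_grad = is_bias or is_head
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```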
LoRA

What: LoRA (Low-Rank Adaptation) modifies the pretrained LLM's weights during finetuning by introducing a low-rank factorization of the weight update, typically applied to the attention layers, which learns task-specific attention patterns while the original weights stay frozen.

When to use: LoRA is a good choice when you want to finetune a pretrained LLM for a specific downstream task that requires task-specific attention patterns. It is also useful when you have limited computational resources and want to reduce the number of trainable parameters in the model (see the sketch following this list). Specifically:
 Memory Efficiency is Desired but Not Critical: LoRA offers substantial savings in parameters and computational requirements. If you want a balanced reduction in trainable parameters without diving into the complexities of quantization, LoRA is an ideal choice.
 Real-Time Applications: because the low-rank update can be merged into the base weights, LoRA adds no inference latency, making it suitable for real-time applications.
 Task-Switching is Required: LoRA can share the pretrained model across multiple tasks, reducing the need to maintain separate models for each task.
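A minimal LoRA linear layer following the standard \(h = Wx + \frac{\alpha}{r}BAx\) formulation, with \(A\) initialized from a Gaussian and \(B\) set to zero as in the paper; the wrapper class itself is an illustrative sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

After training, \(W + \frac{\alpha}{r}BA\) can be precomputed and swapped in for \(W\), which is why no inference latency is added.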
QLoRA

What: QLoRA (Quantized Low-Rank Adaptation) is an advanced finetuning technique that integrates quantization with low-rank adaptation, allowing for efficient finetuning of large language models with significantly reduced memory usage (a setup sketch follows the process list below).

When to use: QLoRA is ideal for scenarios where memory and computational efficiency are paramount, particularly when finetuning very large models on limited hardware. It is especially useful in low-bit environments or when full 16-bit finetuning would be prohibitively expensive.
 Key Features:
 4-bit Quantization: QLoRA uses a novel 4-bit NormalFloat (NF4) quantization, optimized for normally distributed weights, to reduce the memory footprint.
 Double Quantization: This technique further reduces memory usage by quantizing the quantization constants.
 Paged Optimizers: These manage memory spikes during gradient checkpointing, enabling stable finetuning on a single GPU.
 Process:
 Model Quantization: The pretrained model is quantized to 4-bit precision using NF4.
 Adding LoRA Weights: LoRA weights are integrated into the quantized model.
 FineTuning: The LoRA weights are finetuned, with gradients backpropagated through the frozen quantized model.
 Double Quantization: Quantization constants are further quantized to minimize memory usage.
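With the Hugging Face stack, this recipe maps onto a short setup like the following (a sketch assuming recent transformers/peft/bitsandbytes versions; the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base model with double quantization (steps 1 and 4 above).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    quantization_config=bnb_config,
)

# Attach trainable LoRA weights on top of the frozen quantized model (steps 2 and 3).
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))
model.print_trainable_parameters()
# Paged optimizers are typically selected in the trainer,
# e.g., TrainingArguments(optim="paged_adamw_8bit", ...).
```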
QA-LoRA

What: QA-LoRA (Quantization-Aware Low-Rank Adaptation) is a specialized technique for finetuning low-bit models. It integrates quantization-aware strategies with Low-Rank Adaptation (LoRA) principles, providing an efficient way to handle low-bit model environments.

When to use: Ideal for scenarios where the primary goal is to optimize memory usage and computational efficiency in low-bit settings. This method is particularly effective when traditional finetuning approaches fall short due to the constraints of low-bit environments.
 Key Features:
 Quantization-Aware Approach: QA-LoRA combines LoRA weights with full-precision model weights and then jointly quantizes them, enhancing memory and computational efficiency during inference.
 Efficient for Low-Bit Models: Tailored to low-bit models, it addresses the specific challenges posed by these environments, making it a standout choice in such contexts.
 Process:
 Adding LoRA Weights: QALoRA begins by integrating LoRA weights into the pretrained model.
 Fine-Tuning LoRA Weights: These weights are then finetuned, updating only the LoRA weights while keeping the original model weights unchanged.
 Merging Weights: Post-finetuning, the LoRA and original model weights are merged (sketched below).
 Quantization: The merged weights are quantized to a lower-bit format, crucial for reducing memory and computational costs.
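The merge-then-quantize tail of this pipeline can be sketched with a toy symmetric quantizer; real quantization-aware pipelines use group-wise quantization learned during training, so treat this purely as an illustration:

```python
import torch

def merge_and_quantize(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                       scale: float, n_bits: int = 4):
    """Merge a LoRA update into W, then quantize the result (toy illustration)."""
    W_merged = W + scale * (B @ A)                    # merge step
    qmax = 2 ** (n_bits - 1) - 1                      # e.g., 7 for 4 bits
    s = W_merged.abs().max() / qmax                   # per-tensor scale (simplified)
    W_q = torch.clamp((W_merged / s).round(), -qmax - 1, qmax)
    return W_q.to(torch.int8), s  # int8 container, for lack of a native int4 dtype
```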
ReLoRA

What: ReLoRA is an innovative approach for training high-rank networks efficiently. It revises the Low-Rank Adaptation method by iteratively applying low-rank updates to gradually increase the model's effective rank.

When to use: Best suited for training large-scale models, particularly when the objective is to achieve high-rank training outcomes with less computational expenditure. ReLoRA is especially valuable for large transformer language models where resource efficiency is critical.
 Key Features:
 Iterative Low-Rank Updates: Unlike traditional low-rank methods, ReLoRA applies updates iteratively, each cycle incrementally raising the effective rank of the accumulated update and enabling more efficient high-rank network training (see the sketch below).
 Resource Efficiency: Allows training of large, high-performing models while significantly reducing computational demands.
 Differentiation from Other Techniques:
 ReLoRA stands out from previous techniques like standard LoRA through its unique iterative process. By merging each low-rank update into the base weights and restarting with fresh factors, it incrementally increases the rank of the total weight change, enabling more dynamic and refined training for large-scale models.
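A minimal sketch of one ReLoRA restart, reusing the LoRALinear sketch from above; the jagged learning-rate schedule and partial optimizer reset from the paper are noted but elided:

```python
import torch

def relora_restart(lora_layers) -> None:
    """One ReLoRA restart: merge each low-rank update into the frozen base weight,
    then re-initialize the factors so the next cycle adds a fresh rank-r update."""
    with torch.no_grad():
        for layer in lora_layers:
            layer.base.weight += layer.scale * (layer.B @ layer.A)  # merge update
            torch.nn.init.normal_(layer.A, std=0.01)  # fresh Gaussian A
            torch.nn.init.zeros_(layer.B)             # fresh zero B
    # The paper additionally prunes optimizer state and re-warms the learning
    # rate after each restart so training remains stable.
```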
S-LoRA

What: S-LoRA is a scalable system for serving multiple LoRA (Low-Rank Adaptation) adapters concurrently in large language models (LLMs). It manages memory efficiently by storing all adapters in main memory and dynamically fetching them to GPU memory. The system uses customized CUDA kernels for batch processing, optimizing both memory usage and computational efficiency.

When to use: S-LoRA is ideal for scenarios where many finetuned variants of LLMs need to be served simultaneously with high throughput. It significantly reduces memory fragmentation and I/O overhead, making it suitable for large-scale deployments in resource-constrained environments.
 Key Features:
 Efficient Memory Management: Utilizes a unified memory pool to manage adapter weights dynamically, reducing memory fragmentation.
 High Throughput Serving: Custom CUDA kernels enable efficient heterogeneous batching of LoRA computations, allowing the serving of thousands of adapters with minimal overhead.
 Reduced Training and Communication Costs: Offers an effective solution in federated learning scenarios by lowering the costs associated with training and data communication.
 Process:
 Storage of Adapters: All adapters are stored in the main memory, ready for dynamic retrieval.
 Dynamic Fetching: Adapters required for current computations are fetched into GPU memory as needed.
 Batch Processing: Customized CUDA kernels facilitate batch processing, ensuring efficient computation across various sequence lengths and adapter ranks.
DoRA

What: DoRA is an advanced finetuning method that decomposes pretrained model weights into magnitude and directional components. This decomposition facilitates efficient finetuning by employing LoRA for directional updates and introducing trainable magnitude components to enhance the learning capacity and stability.

When to use: DoRA is particularly effective when there is a need to bridge the performance gap between LoRA-based methods and full finetuning without increasing inference costs. It's suitable for tasks that require high performance, such as commonsense reasoning, visual instruction tuning, and multimodal understanding (a sketch of the decomposition follows the process list below).
 Key Features:
 Weight Decomposition: Separates weights into magnitude and direction, allowing for targeted updates that enhance learning capability without additional inference overhead.
 Enhanced Learning Capacity: Integrates trainable magnitude components with directional updates, providing a balanced approach to finetuning that improves both stability and learning capacity.
 Versatility Across Tasks: Demonstrates superior performance across various tasks and models, proving its adaptability and effectiveness in different settings.
 Process:
 Decomposition of Weights: Begins with the decomposition of pretrained model weights into their magnitude and directional components.
 Directional Updates Using LoRA: Employs LoRA specifically for updating directional components during finetuning.
 Training of Magnitude Components: Trainable magnitude components are finetuned separately, enhancing the overall learning capacity of the model.
 Performance Evaluation: The effectiveness of DoRA is validated across multiple tasks, showcasing significant performance improvements compared to other finetuning methods.
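A minimal sketch of DoRA's decomposition, where the frozen weight is split into a trainable per-column magnitude and a direction updated via LoRA; the class and initialization details are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Illustrative DoRA layer: W' = m * (W0 + s * B A) / ||W0 + s * B A||_col,
    where the direction is updated with LoRA and the magnitude m is trained directly."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.register_buffer("W0", base.weight.detach().clone())  # frozen weight
        self.m = nn.Parameter(self.W0.norm(dim=0, keepdim=True))  # per-column magnitude
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        V = self.W0 + self.scale * (self.B @ self.A)       # directional update (LoRA)
        W = self.m * (V / V.norm(dim=0, keepdim=True))     # renormalize, then rescale
        return x @ W.T
```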
SPAFIT

What: SPAFIT (Stratified Progressive Adaptation Fine-Tuning) is a Parameter-Efficient Fine-Tuning (PEFT) method that targets specific layers of a Transformer-based large language model according to their contribution to linguistic knowledge (a sketch follows below).

When to use: SPAFIT is effective when you want to avoid the pitfalls of catastrophic forgetting and computational inefficiency typical in full model finetuning. It’s particularly useful for tasks that require different levels of linguistic processing, allowing for tailored adaptation.
 Key Features:
 LayerSpecific FineTuning: SPAFIT divides the model into three groups, allowing each group of layers to be finetuned to varying extents based on their importance to task performance.
 Efficiency and Performance: By finetuning fewer parameters, SPAFIT achieves competitive or superior results compared to full finetuning, particularly on tasks involving sentence similarity.
 Process:
 Layer Grouping: Model layers are categorized into three groups based on their function and linguistic contribution.
 Adaptive FineTuning: Group 1 layers remain frozen, Group 2 layers are finetuned only on bias terms, and Group 3 layers undergo a more comprehensive finetuning using BitFit and LoRA for different components.
 Performance Evaluation: SPAFIT’s effectiveness is validated across multiple NLP tasks, showing strong results with fewer finetuned parameters.
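A minimal sketch of the stratified scheme; the three-way split by layer index and the group boundaries are illustrative assumptions, and Group 3's LoRA modules are only indicated in comments:

```python
import torch.nn as nn

def apply_spafit(layers: list[nn.Module], g1_end: int, g2_end: int) -> None:
    """Stratified adaptation: Group 1 frozen, Group 2 bias-only (BitFit),
    Group 3 bias-only plus LoRA on its projection matrices."""
    for i, layer in enumerate(layers):
        for name, param in layer.named_parameters():
            if i < g1_end:                     # Group 1: frozen entirely
                param.requires_grad = False
            else:                              # Groups 2 and 3: BitFit-style bias tuning
                param.requires_grad = name.endswith("bias")
        if i >= g2_end:
            # Group 3: additionally attach LoRA modules to this layer's
            # attention/feedforward projections (e.g., via the LoRALinear
            # sketch above); omitted here to keep the sketch short.
            pass
```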
NOLA

What: NOLA is a novel method for compressing large language models that reparameterizes the matrices used in Low-Rank Adaptation (LoRA) as linear combinations of randomly generated basis matrices, drastically reducing the parameter count (a sketch follows below).

When to use: Ideal for situations where extreme model compression is necessary without sacrificing performance, making it suitable for deployment in resource-constrained environments or when model storage costs need to be minimized.
 Key Features:
 Parameter Compression: Achieves up to 20 times fewer parameters than the most compressed LoRA models.
 Decoupling Parameter Count: Separates the number of trainable parameters from the rank choice and network architecture, allowing for more flexible and efficient model compression.
 Process:
 Matrix Reparameterization: Decomposes weight changes into two matrices, \(A\) and \(B\), which are then reparameterized using a linear combination of random basis matrices.
 Learning Combination Coefficients: Focuses on optimizing the mixture coefficients for these basis matrices while keeping the original matrices frozen.
 Storage Optimization: Stores only the coefficients and the seed of the random number generator used for creating the basis matrices, significantly reducing storage requirements.
 Evaluation on Multiple Tasks: Demonstrates effectiveness across various tasks and models, maintaining or exceeding benchmark metrics while significantly reducing the parameter count.
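A minimal sketch of the reparameterization, where only the mixture coefficients (plus the RNG seed) are trained and stored; the basis count and shapes are illustrative:

```python
import torch
import torch.nn as nn

class NOLAFactor(nn.Module):
    """Illustrative NOLA factor: A = sum_i alpha_i * A_i, where the A_i are
    random, frozen basis matrices regenerated on demand from a stored seed."""

    def __init__(self, rows: int, cols: int, k: int = 64, seed: int = 0):
        super().__init__()
        self.rows, self.cols, self.k, self.seed = rows, cols, k, seed
        self.alpha = nn.Parameter(torch.zeros(k))  # the only trainable parameters

    def materialize(self) -> torch.Tensor:
        # The basis never needs to be stored: it is reproducible from the seed.
        g = torch.Generator().manual_seed(self.seed)
        basis = torch.randn(self.k, self.rows, self.cols, generator=g)
        return torch.einsum("k,krc->rc", self.alpha, basis)

# Hypothetical usage: A = NOLAFactor(8, 4096).materialize(); B likewise;
# delta_W = B @ A, as in standard LoRA, but only the coefficients are stored.
```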
MoRA

 What: MoRA (Matrix of Rank Adaptation) is an advanced finetuning technique designed to enhance the capacity of large language models (LLMs) to learn and retain new knowledge. It replaces the low-rank matrices used in LoRA with a high-rank square matrix, significantly increasing the model's update capacity without increasing the number of trainable parameters. This method introduces non-parameterized operators to adjust the input and output dimensions, ensuring efficient integration with existing LLMs.
 When to use: MoRA is particularly effective for tasks that require substantial knowledge enhancement and memory capacity. It is well-suited for scenarios where:
 Memory-Intensive Tasks: The task demands significant memorization and retention of new knowledge, such as continual pretraining and memory tasks.
 Limited Resources: You need to maximize performance while keeping computational and memory overheads low.
 Performance Matching or Exceeding LoRA: MoRA outperforms LoRA on memory-intensive tasks and achieves comparable performance on other tasks, making it a versatile choice across various applications.
 Key Features:
 High-Rank Updates: Utilizes a square matrix to achieve high-rank updates, significantly increasing the model's capacity to learn and retain new information.
 Efficient Parameter Use: Maintains the same number of trainable parameters as LoRA by employing non-parameterized operators for input and output dimension adjustments.
 Versatility Across Tasks: Demonstrates superior performance on memory-intensive tasks and matching performance in other finetuning scenarios, proving its effectiveness across diverse applications.
 Process:
 Input Dimension Reduction: Non-parameterized operators reduce the input dimension to match the high-rank square matrix.
 Output Dimension Increase: Corresponding operators expand the output back to the original dimension, maintaining parameter efficiency.
 Integration with LLMs: The high-rank matrix and its operators can be merged back into the LLM, similar to LoRA, ensuring seamless deployment.
 Empirical Evaluation: Comprehensive evaluation across multiple tasks, including instruction tuning, mathematical reasoning, and continual pretraining, demonstrates significant improvements on memory-intensive tasks and comparable performance on others.
Comparative Analysis of Popular PEFT Methods
| PEFT Method | Description | When to Use | Computational Overhead | Memory Efficiency | Versatility Across Tasks | Performance Impact |
|---|---|---|---|---|---|---|
| Prompt Tuning | Prepends a small trainable prefix (soft prompt) to the model's input to steer it toward the downstream task. | Large pretrained LLM; adaptation to multiple tasks. | Low | Moderate | High | Depends on prompt quality |
| Prefix Tuning | Learns trainable parameters that modify the LLM's hidden states in response to task-specific prompts. | Task-specific adaptation; limited resources. | Low | Moderate | Moderate | Can vary, but usually positive with proper tuning |
| Adapters | Inserts small neural modules between LLM layers; only adapter weights are updated during finetuning. | Multiple tasks on one LLM; flexibility required. | Moderate | Good (only adapters are finetuned) | High (can be added for multiple tasks) | Typically positive if adapters are well-tuned |
| LoRA | Introduces a low-rank matrix factorization into the attention mechanism to learn task-specific patterns. | Tasks with specialized attention requirements; limited resources. | Low-Moderate | Good | Moderate | Generally positive with good training |
| QLoRA | Builds on LoRA with 4-bit quantization for enhanced memory efficiency. | Strict memory constraints; emphasis on performance and efficiency. | Low | Excellent | High | Comparable or better than full finetuning |
| QA-LoRA | Enhances LoRA with quantization-aware techniques for finetuning low-bit models. | Optimizing efficiency in low-bit settings; resource-constrained environments. | Low | Excellent | Moderate | Enhanced efficiency and effectiveness in specific settings |
| ReLoRA | Iteratively applies low-rank updates for efficient training of high-rank networks. | Large-scale models requiring high-rank training with reduced resources. | Moderate | Good | Moderate | Achieves high-rank training efficiency and performance |
| S-LoRA | Scalable serving of many LoRA adapters via unified memory management and custom CUDA kernels for batched computation. | Deploying many finetuned LLM variants; high-throughput serving. | Moderate | Good (efficient memory management) | High (supports thousands of concurrent adapters) | Increases throughput; reduces costs in federated settings |
| DoRA | Decomposes pretrained weights into magnitude and directional components, using LoRA for the directional updates. | Improving learning capacity without adding inference overhead; high performance across diverse tasks. | Low | Good | High (adaptable across various models and tasks) | Matches or exceeds full finetuning performance |
| SPAFIT | Stratifies layer finetuning by linguistic importance, selectively applying adaptations. | Optimal resource allocation; high performance with reduced parameter tuning. | Low to moderate | High (finetunes fewer parameters) | High (effective across multiple tasks) | Matches or exceeds full model tuning |
| BitFit | Updates only the bias terms of the pretrained model, reducing finetuning overhead. | Small to medium datasets; minimal performance degradation acceptable. | Low | High (minimal parameters are updated) | Moderate (depends on the importance of bias terms) | Comparable or superior to full finetuning |
| NOLA | Compresses LoRA using linear combinations of random basis matrices, minimizing parameter counts. | Extreme model compression without losing performance; resource-constrained environments. | Low | Excellent (up to 20x fewer parameters) | High (effective across NLP and vision tasks) | Maintains or exceeds benchmark metrics |
| MoRA | Employs a high-rank square matrix for updates, enhancing the model's capacity to learn and retain new knowledge at the same parameter count. | Tasks requiring substantial knowledge enhancement and memory capacity; limited resources. | Low-Moderate | Good | High | Outperforms LoRA on memory-intensive tasks; matches it on others |
Practical Tips for Finetuning LLMs Using LoRA

This section is inspired by the findings in Sebastian Raschka's blog on practical tips for finetuning LLMs with LoRA.

Consistency in LLM Training: Despite the inherent randomness in training models on GPUs, the outcomes of LoRA experiments remain consistent across multiple runs, which is promising for comparative studies.

QLoRA Compute-Memory Trade-offs: Quantized LoRA (QLoRA) offers a 33% reduction in GPU memory usage at the cost of a 33% increase in runtime, making it a viable alternative to regular LoRA when facing GPU memory constraints.

Learning Rate Schedulers: Using learning rate schedulers like cosine annealing can optimize convergence during training and avoid overshooting the loss minima. While it has a notable impact on SGD optimizer performance, it makes less difference when using Adam or AdamW optimizers.

Choice of Optimizers: The optimizer choice (Adam vs. SGD) doesn’t significantly impact the peak memory demands of LLM training, and swapping Adam for SGD may not provide substantial memory savings, especially with a small LoRA rank (r).

Impact of Multiple Training Epochs: Iterating multiple times over a static dataset in multi-epoch training may not be beneficial and can deteriorate model performance, likely due to overfitting.

Applying LoRA Across Layers: Enabling LoRA across all layers, not just the Key and Value matrices, can significantly increase model performance, though it also increases the number of trainable parameters and memory requirements.

LoRA Hyperparameters: Adjusting the LoRA rank (r) and selecting an appropriate alpha value are crucial. A heuristic that yielded good results was setting alpha at twice the rank’s value, with r=256 and alpha=512 being the best setting in one particular case.
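In Hugging Face peft terms, the last two tips translate into a configuration along these lines (a sketch; the exact module names depend on the model architecture, and Llama-style names are assumed here):

```python
from peft import LoraConfig

# LoRA on all major projection matrices (not just Key/Value), with alpha = 2 * r.
config = LoraConfig(
    r=256,
    lora_alpha=512,  # heuristic from above: alpha set to twice the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # Llama-style names
    lora_dropout=0.05,
)
```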

Finetuning Large Models: LoRA allows for finetuning 7 billion parameter LLMs on a single GPU with 14 GB of RAM within a few hours. However, optimizing an LLM to excel across all benchmark tasks may be unattainable with a static dataset.


Additionally, the article addresses common questions related to LoRA:

Importance of Dataset: The dataset used for finetuning is critical, and quality matters more than quantity. Experiments showed that a curated dataset with fewer examples (like LIMA) can yield better performance than larger datasets (like Alpaca).

LoRA for Domain Adaptation: LoRA's effectiveness for domain adaptation requires further investigation. Including task-specific examples in the finetuning process is recommended.

Selecting the Best Rank: Choosing the best rank for LoRA is a hyperparameter that needs to be explored for each LLM and dataset. A larger rank could lead to overfitting, while a smaller rank may not capture diverse tasks within a dataset.

Enabling LoRA for All Layers: Exploring the impact of enabling LoRA for different combinations of layers is suggested for future experiments.

Avoiding Overfitting: To prevent overfitting, one could decrease the rank or increase the dataset size, adjust the weight decay rate, or consider increasing the dropout value for LoRA layers.

Other Optimizers: Exploring other optimizers, such as Sophia, which promises faster training and better performance than Adam, is suggested for future research.

Factors Influencing Memory Usage: Model size, batch size, the number of trainable LoRA parameters, and dataset size can influence memory usage. Shorter training sequences can lead to substantial memory savings.

Related: Surgical finetuning
 While not exactly a PEFT method, surgical finetuning, by Lee et al. from Chelsea Finn's group at Stanford, selectively updates specific layers of a neural network based on how the finetuning dataset differs from the original pretraining dataset, rather than retraining every layer.
 Motivation:
 Layer Specificity: Early layers in a neural network capture fundamental features of inputs (e.g., edges or shapes in images), while deeper layers combine these features for predictions (e.g., classifying images).
 Efficiency: Rather than universally finetuning every layer, selectively updating specific layers can achieve better performance, especially when the finetuning dataset has notable differences from the pretraining dataset.
 Approaches:
 Manual Approach:
 Finetune each layer individually and create a distinct model for each layer.
 Compare the performance of each model to identify the best layers for finetuning.
 Automated Approach (a minimal sketch follows this list):
 Calculate gradients for each layer.
 Derive relative gradients by dividing each layer's gradient norm by its weight magnitude.
 Normalize these relative gradients across layers so they fall between 0 and 1.
 Assign per-layer learning rates based on the normalized relative gradient values during training.
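A minimal sketch of the relative-gradient criterion (the function name and exact normalization are illustrative assumptions; it presumes a backward pass has already populated the gradients):

```python
import torch
import torch.nn as nn

def relative_gradient_scores(model: nn.Module) -> dict[str, float]:
    """Score each parameter tensor by ||grad|| / ||weight||, then min-max
    normalize the scores to [0, 1] across layers."""
    raw = {
        name: (p.grad.norm() / (p.norm() + 1e-12)).item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }
    lo, hi = min(raw.values()), max(raw.values())
    return {name: (v - lo) / (hi - lo + 1e-12) for name, v in raw.items()}

# Hypothetical usage: scale each layer's learning rate by its normalized score,
# e.g., one optimizer param group per layer with lr = base_lr * scores[name].
```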
 Based on the findings in this paper, here are some tips for determining which layers to finetune when adapting a pretrained model to a new target distribution:
 Consider the type of distribution shift between the source and target data:
 For input-level shifts like image corruptions, finetuning earlier layers (the first conv block) tends to work best. This allows the model to adapt to changes in the input while preserving higher-level features.
 For feature-level shifts, where the feature representations differ between source and target, finetuning middle layers (the middle conv blocks) tends to work well. This tunes the mid-level features without distorting low-level or high-level representations.
 For output-level shifts like label distribution changes, finetuning later layers (the fully connected classifier) tends to be most effective. This keeps the feature hierarchy intact and adapts only the output mapping.
 Try finetuning only a single contiguous block of layers while freezing others. Systematically test first, middle, and last blocks to find the best one.
 Use criteria like relative gradient norms to automatically identify layers that change the most for the target data. Finetuning those with higher relative gradients can work better than full finetuning.
 When in doubt, finetuning only the classifier head is a solid default that outperforms no finetuning. But for shifts related to inputs or features, surgical finetuning of earlier layers can improve over this default.
 If possible, run quick validation experiments to directly compare different surgical finetuning choices on a small held-out set of target data.
 The key insight is that different parts of the network are best suited for adapting to different types of distribution shifts between the source and target data.
 Results:
 CIFAR-C Dataset:
 Manual approach yielded an accuracy of 82.8%.
 Finetuning the entire network resulted in 79.9% accuracy.
 The automated approach achieved an accuracy of 81.4%.
 Significance: Surgical finetuning is rooted in understanding how neural networks process input. This enhanced understanding can drive the discovery of more efficient methods to improve machine learning models.
 Consideration: For more complex datasets, discerning differences between the pretraining and finetuning datasets can be challenging. This complexity may make automated approaches like the one proposed more valuable, even though it didn't yield the best performance on CIFAR-C.
LoRA vs. QLoRA experimentation by Sebastian Raschka
 This section is taken from Sebastian Raschka's post on LoRA and QLoRA experiments for finetuning open-source LLMs, and presents his learnings:
 Despite embracing the inherent randomness of LLM training (or when training models on GPUs in general), the outcomes remain remarkably consistent across multiple runs.
 QLoRA presents a tradeoff that might be worthwhile if you’re constrained by GPU memory. It offers 33% memory savings at the cost of a 33% increase in runtime.
 When finetuning LLMs, the choice of optimizer shouldn’t be a major concern. While SGD on its own is suboptimal, there’s minimal variation in outcomes whether you employ AdamW, SGD with a scheduler, or AdamW with a scheduler.
 While Adam is often labeled a memoryintensive optimizer due to its introduction of two new parameters for every model parameter, this doesn’t significantly affect the peak memory demands of the LLM. This is because the majority of the memory is allocated for large matrix multiplications rather than retaining extra parameters.
 For static datasets, iterating multiple times as done in multiepoch training might not be beneficial. It often deteriorates the results, probably due to overfitting.
 If you’re incorporating LoRA, ensure it’s applied across all layers, not just to the Key and Value matrices, to maximize model performance.
 Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is setting alpha at twice the rank’s value.
 7B-parameter models can be finetuned efficiently within a few hours on a single GPU with 14 GB of RAM.
 With a static dataset, optimizing an LLM to excel across all benchmark tasks is unattainable. Addressing this requires diverse data sources, or perhaps LoRA might not be the ideal tool.
References
 Finetuning LLMs Efficiently with Adapters
 Srishti Gureja on LinkedIn
 Sebastian Raschka on LinkedIn
 Prithivi Da on LinkedIn
 🤗 PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware
 Hugging Face: PEFT
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledPEFT,
  title   = {Parameter Efficient Fine-Tuning (PEFT)},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}