• Fine-tuning of large pre-trained models on downstream tasks is called “transfer learning”.
  • While full fine-tuning pre-trained models on downstream tasks is a common, effective approach, it is an inefficient approach to transfer learning.
  • The simplest way out for efficient fine-tuning could be to freeze the networks’ lower layers and adapt only the top ones to specific tasks.
  • In this article, we’ll explore Parameter Efficient Fine-Tuning (PEFT) methods that enable us to adapt a pre-trained model to downstream tasks more efficiently – in a way that trains lesser parameters and hence saves cost and training time, while also yielding performance similar to full fine-tuning.

Parameter-Efficient Fine-Tuning (PEFT)

  • Let’s start off by defining what parameter-efficient fine-tuning is and give some context on it.
  • Parameter-efficient fine-tuning is particularly used in the context of large-scale pre-trained models (such as in NLP), to adapt that pre-trained model to a new task without drastically increasing the number of parameters.
  • The challenge is this: modern pre-trained models (like BERT, GPT, T5, etc.) contain hundreds of millions, if not billions, of parameters. Fine-tuning all these parameters on a downstream task, especially when the available dataset for that task is small, can easily lead to overfitting. The model may simply memorize the training data instead of learning genuine patterns. Moreover, introducing additional layers or parameters during fine-tuning can drastically increase computational requirements and memory consumption.
  • As mentioned earlier, PEFT allows to only fine-tune a small number of model parameters while freezing most of the parameters of the pre-trained LLM. This helps overcome the catastrophic forgetting issue that full fine-tuned LLMs face where the LLM forgets the original task it was trained on after being fine-tuned.
  • The image below (source) gives a nice overview of PEFT and its benefits.


  • Parameter-efficient fine-tuning is useful due the following reasons:
    1. Reduced computational costs (requires fewer GPUs and GPU time).
    2. Faster training times (finishes training faster).
    3. Lower hardware requirements (works with cheaper GPUs with less VRAM).
    4. Better modeling performance (reduces overfitting).
    5. Less storage (majority of weights can be shared across different tasks).

Practical use-case

  • Credits to the below section go to Pranay Pasula.
  • PEFT obviates the need for 40 or 80GB A100s to make use of powerful LLMs. In other words, you can fine-tune 10B+ parameter LLMs for your desired task for free or on cheap consumer GPUs.
  • Using PEFT methods like LoRA, especially 4-bit quantized base models via QLoRA, you can fine-tune 10B+ parameter LLMs that are 30-40GB in size on 16GB GPUs. If it’s out of your budget to buy a 16GB GPU/TPU, Google Colab occasionally offers a 16GB VRAM Tesla T4 for free. Remember to save your model checkpoints every now and then and reload them as necessary, in the event of a Colab disconnect/kernel crash.
  • If you’re fine-tuning on a single task, the base models are already so expressive that you need only a few (~10s-100s) of examples to perform well on this task. With PEFT via LoRA, you need to train only a trivial fraction (in this case, 0.08%), and though the weights are stored as 4-bit, computations are still done at 16-bit.
  • Note that while a good amount of VRAM is still needed for the fine-tuning process, using PEFT, with a small enough batch size, and little gradient accumulation, can do the trick while still retaining ‘fp16’ computation. In some cases, the performance on the fine-tuned task can be comparable to that of a fine-tuned 16-bit model.
  • Key takeaway: You can fine-tune powerful LLMs to perform well on a desired task using free compute. Use a <10B parameter model, which is still huge, and use quantization, PEFT, checkpointing, and provide a small training set, and you can quickly fine-tune this model for your use case.

PEFT methods

  • Below, we will delve into individual PEFT methods and delve deeper into their nuances.

Prompt Modifications

Soft Prompt Tuning

  • First introduced in the The Power of Scale for Parameter-Efficient Prompt Tuning; this paper by Lester et al. introduces a simple yet effective method called soft prompt tuning, which prepends a trainable tensor to the model’s input embeddings, essentially creating a soft prompt to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts, soft prompts are learned through backpropagation and can be fine-tuned to incorporate signals from any number of labeled examples.
  • Soft prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pre-trained model.
  • The authors show that prompt tuning outperforms few-shot learning by a large margin, and becomes more competitive with scale.
  • This is an interesting approach that can help to effectively use a single frozen model for multi-task serving.
  • Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pretrained model. With a T5 “XXL” model, each copy of the tuned model requires 11 billion parameters. By contrast, our tuned prompts would only require 20,480 parameters per task—a reduction of over five orders of magnitude – assuming a prompt length of 5 tokens.
  • Thus, instead of using discrete text prompts, prompt tuning employs soft prompts. Soft prompts are learnable and conditioned through backpropagation, making them adaptable for specific tasks.

  • Prompt Tuning offers many benefits such as:
    • Memory-Efficiency: Prompt tuning dramatically reduces memory requirements. For instance, while a T5 “XXL” model necessitates 11 billion parameters for each task-specific model, prompt-tuned models need a mere 20,480 parameters (assuming a prompt length of 5 tokens).
    • Versatility: Enables the use of a single frozen model for multi-task operations.
    • Performance: Outshines few-shot learning and becomes more competitive as the scale grows.

Soft Prompt vs. Prompting

  • Soft prompt tuning and prompting a model with extra context are both methods designed to guide a model’s behavior for specific tasks, but they operate in different ways. Here’s how they differ:
  1. Mechanism:
    • Soft Prompt Tuning: This involves introducing trainable parameters (soft prompts) that are concatenated or added to the model’s input embeddings. These soft prompts are learned during the fine-tuning process and are adjusted through backpropagation to condition the model to produce desired outputs for specific tasks.
    • Prompting with Extra Context: This method involves feeding the model with handcrafted or predefined text prompts that provide additional context. There’s no explicit fine-tuning; instead, the model leverages its pre-trained knowledge to produce outputs based on the provided context. This method is common in few-shot learning scenarios where the model is given a few examples as prompts and then asked to generalize to a new example.
  2. Trainability:
    • Soft Prompt Tuning: The soft prompts are trainable. They get adjusted during the fine-tuning process to optimize the model’s performance on the target task.
    • Prompting with Extra Context: The prompts are static and not trainable. They’re designed (often manually) to give the model the necessary context for the desired task.
  3. Use Case:
    • Soft Prompt Tuning: This method is particularly useful when there’s a need to adapt a pre-trained model to various downstream tasks without adding significant computational overhead. Since the soft prompts are learned and optimized, they can capture nuanced information necessary for the task.
    • Prompting with Extra Context: This is often used when fine-tuning isn’t feasible or when working with models in a zero-shot or few-shot setting. It’s a way to leverage the vast knowledge contained in large pre-trained models by just guiding their behavior with carefully crafted prompts.
  • In essence, while both methods use prompts to guide the model, soft prompt tuning involves learning and adjusting these prompts, whereas prompting with extra context involves using static, handcrafted prompts to guide the model’s behavior.

Prefix Tuning

  • Proposed in Prefix-Tuning: Optimizing Continuous Prompts for Generation, prefix-tuning is a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix).
  • Instead of adding a soft prompt to the model input, it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.
  • Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”.
  • The figure below from the paper shows that fine-tuning (top) updates all Transformer parameters (the red Transformer box) and requires storing a full model copy for each task. They propose prefix-tuning (bottom), which freezes the Transformer parameters and only optimizes the prefix (the red prefix blocks). Consequently, prefix-tuning only need to store the prefix for each task, making prefix-tuning modular and space-efficient. Note that each vertical block denote transformer activations at one time step.

  • They apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. They find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.

  • The image below (source) illustrate how in prefix tuning, trainable tensors are addted to each transformer block instead of only in the input embedding.

Hard prompt tuning

  • Hard prompt tuning directly modifies the input prompt to the model. This can involve a vast multitude of things such as:
    • We can add examples of outputs we expect from the prompt
    • We can add tags specifically relating to our task at hand
  • In essence, it is just the modification of the string input, or prompt, to the model.


  • Adapter layers, often termed “Adapters”, add minimal additional parameters to the pretrained model. These adapters are inserted between existing layers of the network.
  • Adapters is a PEFT technique shown to achieve similar performance as compared to tuning the top layers while requiring as fewer parameters as two orders of magnitude.
  • Adapter-based tuning simply inserts new modules called “adapter modules” between the layers of the pre-trained network.
  • The image below (source) illustrates this concept for the transformer block:

  • During fine-tuning, only the parameters of these adapter layers are updated, while the original model parameters are kept fixed. This results in a model with a small number of additional parameters that are task-specific.
  • Keeping the full PT model frozen, these modules are the only optimizable ones while fine-tuning – this means only a very few parameters are introduced per task yielding “compact” models.
  • They offer many benefits such as:
    • Parameter-Efficiency: By keeping the main model frozen and only updating the adapter layers, a minimal number of parameters are added per task. This results in compact models that are memory-efficient.
    • Performance: Despite the small parameter footprint, adapters often achieve performance comparable to conventional fine-tuning.
  • The adapter module consists of two fully connected layers with a bottleneck structure. This structure is inspired by autoencoders, which are designed to encode information into a compressed representation and then decode it back to its original form.

  • Here’s how the parameter efficiency is achieved:
  1. Bottleneck Structure: The first layer of the adapter reduces the dimensionality of the input (e.g., from 1024 to 24 dimensions). This drastic reduction means that the information from the original 1024 dimensions must be compressed into just 24 dimensions. The second layer then projects these 24 dimensions back to the original 1024 dimensions.

  2. Reduction in Parameters: This bottleneck approach significantly reduces the number of parameters. In your example, the total number of parameters introduced by the adapter is 49,152 (from the computation 1024x24 + 24x1024). If we were to use a single fully connected layer to project a 1024-dimensional input to a 1024-dimensional output directly, it would require 1,048,576 parameters (1024x1024).

  3. Efficiency Analysis: By using the adapter approach, the number of parameters is substantially lower. Comparing 49,152 parameters to 1,048,576 parameters shows a dramatic reduction, making the adapter much more efficient in terms of parameter usage.

  4. Why is this Beneficial?: This efficiency is particularly beneficial when fine-tuning large pre-trained models. Instead of retraining or adapting the entire network (which would be computationally expensive and memory-intensive), adapters allow for targeted adjustments with far fewer additional parameters. This makes the process more manageable and practical, especially when resources are limited.

  • The adapter’s bottleneck structure allows it to achieve similar functionality (adapting the model to new tasks or data) as a full-sized layer would, but with a significantly reduced number of parameters. This efficiency makes adapters a popular choice for fine-tuning large pre-trained models in a resource-effective manner.

What is an Adapter Module?

  • Let’s look at the application of the adapter module in the transformer architecture in three points:
    • The adapter module (right) first projects the original \(d\)-dimensional features into a smaller \(m\)-dimensional vector, applies a non-linearity, and then projects it back to \(d\) dimensions.
    • As can be seen, the module features a skip-connection - With it in place, when the parameters of the projection layers are initialized to near-zero which eventually leads to near identity initialization of the module. This is required for stable fine-tuning and is intuitive as with it, we essentially do not disturb the learning from pre-training.
    • In a transformer block (left), the adapter is applied directly to the outputs of each of the layers (attention and feedforward).

How to decide the value of \(m\)?

  • The size \(m\) in the Adapter module determines the no. of optimizable parameters and hence poses a parameter vs performance tradeoff.
  • The original paper experimentally investigates that the performance remains fairly stable across varying Adapter sizes m and hence for a given model a fixed size can be used for all downstream tasks.


  • This paper introduces an efficient fine-tuning method called LLaMA-Adapter. This method is designed to adapt the LLaMA model into an instruction-following model with high efficiency in terms of resource usage and time. Key aspects of this paper include:

    1. Parameter Efficiency: LLaMA-Adapter introduces only 1.2 million learnable parameters on top of the frozen LLaMA 7B model, which is significantly fewer than the full 7 billion parameters of the model. This approach leads to a more efficient fine-tuning process both in terms of computational resources and time, taking less than one hour on 8 A100 GPUs.

    2. Learnable Adaption Prompts: The method involves appending a set of learnable adaption prompts to the input instruction tokens in the higher transformer layers of LLaMA. These prompts are designed to adaptively inject new instructions into the frozen LLaMA while preserving its pre-trained knowledge, effectively guiding the subsequent contextual response generation.

    3. Zero-initialized Attention Mechanism: To avoid disturbances from randomly initialized adaption prompts, which can harm fine-tuning stability and effectiveness, the paper proposes a zero-initialized attention mechanism with a learnable gating factor. This mechanism allows for a stable learning process and progressive incorporation of instructional signals during training. It ensures that the newly acquired instructional signals are effectively integrated into the transformer while retaining the pre-trained knowledge of LLaMA.

    4. Generalization and Multi-modal Reasoning: LLaMA-Adapter is not only effective for language tasks but can also be extended to multi-modal instructions, allowing for image-conditioned LLaMA models. This capability enables superior reasoning performance on benchmarks like ScienceQA and COCO Caption. Additionally, the approach has demonstrated strong generalization capacity in traditional vision and language tasks.

  • In summary, the LLaMA-Adapter represents a significant advancement in the field of parameter-efficient fine-tuning of large language models. Its innovative use of learnable adaption prompts and zero-initialized attention mechanism provides a highly efficient method for adapting pre-trained models to new tasks and domains, including multi-modal applications.

  • The image below (source) illustrates this concept below.


Low Rank Adaptation (LoRA)

  • Essence:
    • Low Rank Adaptation (LoRA) simplifies the fine-tuning of large models by decomposing complex, high-dimensional weight matrices into lower-dimensional forms. This technique, akin to methods like PCA and SVD, allows for the retention of critical information while significantly reducing the size and complexity of the weights, thus enhancing fine-tuning efficiency on resource-constrained settings.
  • Application:
    • LoRA identifies key dimensions in the original weight matrix of neural networks, optimizing these reduced dimensions to maintain the model’s learning capabilities with less computational cost. It adds trainable low-rank matrices to the model’s architecture, specifically to the Transformer layers, and optimizes these matrices instead of the entire model, leading to fewer trainable parameters and reduced memory requirements.
  • Benefits:
    • This approach offers considerable time and memory efficiency, as a large portion of the model’s parameters are kept frozen, reducing both training time and GPU memory requirements. It also avoids additional inference latency and facilitates easy task-switching during deployment, requiring changes only in a small subset of weights.
  • In Summary:
    • LoRA represents a smart balance in model fine-tuning, preserving the core strengths of large pre-trained models while adapting them efficiently for specific tasks or datasets. It’s a technique that redefines efficiency in the world of massive language models.

  • A matrix is said to be rank-deficient if it does not have full rank. The rank deficiency of a matrix is the difference between the lesser of the number of rows and columns, and the rank. For more, refer Wikipedia: Rank.

  • Before we continue, let’s recap by taking a quick look at traditional finetuning vs. LoRA with the images (source) below:

LoRA efficiently fine-tunes large-scale neural networks by introducing trainable low-rank matrices, simplifying the model’s complexity while retaining its robust learning capabilities.

LoRA Hyperparameters
  • LoRA is particularly useful in situations where fine-tuning all parameters of a large model is computationally expensive or when the amount of available training data is limited. The core idea behind LoRA is to introduce trainable low-rank matrices that adapt the pre-trained weights of a model without directly modifying them. Here are the key hyperparameters associated with LoRA:
    1. Rank (\(r\)): This is arguably the most important hyperparameter in LoRA. The rank of the adaptation matrices determines the balance between the model’s capacity and the number of additional parameters introduced. A higher rank increases the model’s capacity to learn the desired task (i.e., domain adapt) from new data but also increases the number of parameters and computational cost. Finding the optimal rank usually requires experimentation.

    2. Learning Rate: While not unique to LoRA, the learning rate for the LoRA parameters is crucial. It might be beneficial to set a different learning rate for the LoRA parameters compared to the rest of the model, especially if the base model’s parameters are frozen.

    3. Weight Decay: This hyperparameter controls the regularization of the LoRA parameters. Adjusting weight decay can help prevent overfitting, especially when the dataset is small or the model is particularly large.

    4. Initialization Scale (\(\alpha\)): The scale at which the low-rank matrices are initialized can affect the adaptation process. Proper initialization helps in starting the training process from a good point, which can lead to better performance.

    5. Training Epochs: The number of epochs to train the LoRA adaptation. Since LoRA involves training fewer parameters than the entire model, it might converge faster, so the optimal number of epochs might be different from standard full-model fine-tuning.

    6. Batch Size: The size of the batches used during training can influence the model’s performance and convergence speed. Batch size can affect the stability and quality of the gradient estimates, which are crucial when adapting a model using LoRA.

  • These hyperparameters can significantly impact the effectiveness of LoRA in adapting pre-trained models to new tasks or datasets. The optimal settings for these hyperparameters can vary depending on the specific task, the size of the dataset, and the architecture of the base model being adapted. Experimentation and validation on a development dataset are often required to find the best combination of hyperparameters for a given application.

Quantized Low-Rank Adaptation (QLoRA)

  • Proposed in QLoRA: Efficient Finetuning of Quantized LLMs.
  • This paper by Dettmers et al. from UW presents QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. Put simply, QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA).
  • Put simply, QLoRA is a method designed to efficiently fine-tune large pre-trained language models (LLMs), like a 65B parameter model, on limited GPU memory without sacrificing performance. It combines the principles of Low-Rank Adaptation (LoRA) with innovative 4-bit NormalFloat (NF4) quantization and Double Quantization techniques, optimizing parameter efficiency and computational resource utilization.
  • At a top-level, QLoRA operates based on the following steps:
    • Quantize the pre-trained model to 4 bits and freeze it.
    • Attach small, trainable adapter layers (similar to LoRA).
    • Finetune only the adapter layers while using the frozen quantized model for context.
  • Key Components:
    1. Low-Rank Adaptation: QLoRA follows LoRA’s strategy of injecting trainable low-rank matrices into the architecture of pretrained LLMs, specifically targeting Transformer layers. This selective fine-tuning strategy focuses on optimizing these low-rank matrices rather than the entire model, reducing the number of trainable parameters and computational costs.
    2. Quantization: The distinguishing aspect of QLoRA lies in its quantization approach, which includes:
      • NF4 Quantization: This technique involves quantizing the model weights to 4-bit NormalFloat (NF4), efficiently compressing them to fit a specific distribution suitable for NF4 without complex algorithms.
      • Double Quantization: This secondary quantization further reduces memory overhead by quantizing the quantization constants themselves, using 8-bit floats with a 256 block size, achieving significant memory savings without affecting model performance.
  • Operation:
    • QLoRA employs a frozen, 4-bit quantized pretrained language model and fine-tunes it by backpropagating gradients into the low rank adapters. This method optimizes computation through low-bit quantization and reduces the number of parameters by using low-rank structures, striking a balance between efficiency and performance.
  • Their best model family, which they name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimziers to manage memory spikes.
  • They use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models).
  • Their results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. They provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, they find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT.
  • The figure below from the paper shows different finetuning methods and their memory requirements. QLORA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.

In the QLoRA approach, it is the original model’s weights that are quantized to 4-bit precision. The newly added Low-rank Adapter (LoRA) weights are not quantized; they remain at a higher precision and are fine-tuned during the training process. This strategy allows for efficient memory use while maintaining the performance of large language models during finetuning.

  • To learn more about QLoRA and how it works, the Hugging Face blog post is highly recommended.

Quantization-Aware Low-Rank Adaptation (QA-LoRA)

  • Proposed in QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models.
  • Recent advancements in large language models (LLMs) have significantly improved their capabilities in various language-understanding tasks. However, the deployment of these models, especially on edge devices, is hampered by their substantial computational requirements.
  • This paper by Xu et al. from Huawei seeks to address the aforementioned issue and proposes a quantization-aware low-rank adaptation (QA-LoRA) algorithm, a technique that aims to mitigate the computational burden by efficiently fine-tuning low-bit diffusion models without compromising accuracy. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy.
  • Put simply, QA-LoRA stands out by incorporating a quantization-aware approach that merges and quantizes the weights of Low-Rank Adaptation (LoRA) with the full-precision model weights. This process not only optimizes memory usage and computational efficiency during inference but also ensures the seamless integration of LoRA and auxiliary weights into a quantized model. Notably, QA-LoRA allows for the reduction of weight precision (e.g., to INT4, INT3, and INT2) during fine-tuning, significantly decreasing time and memory usage while maintaining accuracy levels, as it eliminates the need for post-training quantization.
  • The algorithm operates through several key steps:
    1. Adding LoRA Weights: LoRA weights are introduced to the pre-trained model, enhancing its adaptability.
    2. Fine-Tuning LoRA Weights: These weights are then specifically fine-tuned, which involves updating the LoRA weights while the original model’s weights remain unchanged.
    3. Merging LoRA and Original Model Weights: Subsequently, the fine-tuned LoRA weights are merged with the model’s original weights.
    4. Quantization: Finally, the combined weight set is quantized to a lower-bit format, which is essential for reducing both memory and computational costs.
  • The following figure from the paper shows an illustration of the goal of QA-LoRA. Compared to prior adaptation methods, LoRA and QLoRA, QA-LoRA is computationally efficient in both the fine-tuning and inference stages. More importantly, it does not suffer an accuracy loss because post-training quantization is not required. They display INT4 quantization in the figure, but QA-LoRA can be generalized to INT3 and INT2.

  • QA-LoRA’s effectiveness has been validated across different fine-tuning datasets and downstream scenarios, particularly with the LLaMA and LLaMA2 model families. Its unique integration of quantization-aware techniques with low-rank adaptation principles marks a significant advancement in the fine-tuning of diffusion models for low-bit settings. This approach not only addresses the challenges posed by the computational demands of LLMs but also opens up new possibilities for deploying these models more efficiently and effectively on a wider range of devices.
  • The implementation of QA-LoRA is straightforward and can be achieved with a minimal addition to the existing codebase, showcasing its practicality for widespread adoption. Further details, including code examples, are available on their GitHub repository, making it accessible for researchers and practitioners aiming to leverage the benefits of this innovative adaptation technique.
  • Code

Refined Low-Rank Adaptation (ReLoRA)

  • Proposed in Stack More Layers Differently: High-Rank Training Through Low-Rank Updates by Lialin et al. from UMass Lowell.
  • Refined Low-Rank Adaptation (ReLoRA) is a low-rank training technique as an alternative approach to training large neural networks. ReLoRA utilizes low-rank updates to train high-rank networks. Put simply, they explore whether LoRA can be used for pretraining (as opposed to finetuning) LLMs in a parameter-efficient manner.
  • They apply ReLoRA to pre-training transformer language models with up to 350M parameters and demonstrate comparable performance to regular neural network training.
  • Furthermore, they observe that the efficiency of ReLoRA increases with model size, making it a promising approach for training multi-billion-parameter networks efficiently. Their findings shed light on the potential of low-rank training techniques and their implications for scaling laws.
  • A caveat worth mentioning is that the researchers only pretrained models up to 350 M parameters for now (the smallest Llama 2 model is 7B parameters, for comparison).
  • The following figure (source) presents an overview of their results:

Weight-Decomposed Low-Rank Adaptation (DoRA)

  • Proposed in DoRA: Weight-Decomposed Low-Rank Adaptation by Liu et al. from NVIDIA and HKUST.
  • Weight-Decomposed Low-Rank Adaptation (DoRA) is a novel Parameter-Efficient Fine-Tuning (PEFT) method that surpasses existing techniques like LoRA by decomposing pre-trained weights into magnitude and directional components for efficient fine-tuning. This method is designed to bridge the accuracy gap between LoRA-based methods and full fine-tuning, without increasing inference costs.
  • The authors’ weight decomposition analysis reveals fundamental differences between full fine-tuning and LoRA, showing that directional updates play a crucial role in learning capability. DoRA employs LoRA for directional updates and introduces trainable magnitude components, enhancing learning capacity and stability.
  • DoRA demonstrates superior performance across a range of tasks, including commonsense reasoning, visual instruction tuning, and image/video-text understanding, across models like LLaMA, LLaVA, and VL-BART. It achieves this by effectively managing the trade-off between the number of trainable parameters and learning capacity, without adding inference overhead.
  • The following figure from the paper illustrates an overview of DoRA, which decomposes the pre-trained weight into magnitude and direction components for fine-tuning, especially with LoRA to efficiently update the direction component. Note that \(\|\cdot\|_c\) denotes the vector-wise norm of a matrix across each column vector.

  • Experiments show that DoRA not only outperforms LoRA but also matches or exceeds the performance of full fine-tuning across different tasks, with significant improvements in commonsense reasoning tasks and multimodal understanding, illustrating its effectiveness and efficiency.
  • The paper also explores DoRA’s compatibility with other LoRA variants, such as VeRA, and demonstrates its adaptability across different training sizes and rank settings, further establishing its utility as a versatile and powerful fine-tuning method.

Summary of LoRA Techniques

  • The following section is inspired from Cameron Woulfe’s [(source])(https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/) post.
  • Here’s an overview of some prevalent variants of LoRA techniques:
    • LoRA models the update derived for a model’s weights during finetuning with a low rank decomposition, implemented in practice as a pair of linear projections. LoRA leaves the pretrained layers of the LLM fixed and injects a trainable rank decomposition matrix into each layer of the model.
    • QLoRA is (arguably) the most popular LoRA variant and uses model quantization techniques to reduce memory usage during finetuning while maintaining (roughly) equal levels of performance. QLoRA uses 4-bit quantization on the pretrained model weights and trains LoRA modules on top of this. In practice, QLoRA saves memory at the cost of slightly-reduced training speed.
    • QA-LoRA is an extension of LoRA/QLoRA that further reduces the computational burden of training and deploying LLMs. It does this by combining parameter-efficient finetuning with quantization (i.e., group-wise quantization applied during training/inference).
    • LoftQ studies a similar idea to QA-LoRA—applying quantization and LoRA finetuning on a pretrained model simultaneously.
    • LongLoRA attempts to cheaply adapt LLMs to longer context lengths using a parameter-efficient (LoRA-based) finetuning scheme. In particular, we start with a pretrained model and finetune it to have a longer context length. This finetuning is made efficient by:
      • Using sparse local attention instead of dense global attention (optional at inference time).
      • Using LoRA (authors find that this works well for context extension).
    • S-LoRA aims to solve the problem of deploying multiple LoRA modules that are used to adapt the same pretrained model to a variety of different tasks. Put simply, S-LoRA does the following to serve thousands of LoRA modules on a single GPU (or across GPUs):
      • Stores all LoRA modules in main memory.
      • Puts modules being used to run the current query into GPU memory.
      • Uses unified paging to allocate GPU memory and avoid fragmentation.
      • Proposes a new tensor parallelism strategy to batch LoRA computations.
    • ReLoRA refines neural network training by iteratively applying low-rank updates to achieve high-rank performance, streamlining the process for large models.
    • Many other LoRA variants exist as well:
      • LQ-LoRA: uses a more sophisticated quantization scheme within QLoRA that performs better and can be adapted to a target memory budget.
      • MultiLoRA: extension of LoRA that better handles complex multi-task learning scenarios.
      • LoRA-FA: freezes half of the low-rank decomposition matrix (i.e., the A matrix within the product AB) to further reduce memory overhead.
      • Tied-LoRA: leverages weight tying to further improve the parameter efficiency of LoRA.
      • GLoRA: extends LoRA to adapt pretrained model weights and activations to each task in addition to an adapter for each layer.

Representation Finetuning (ReFT)

  • Proposed in ReFT: Representation Finetuning for Language Models by Wu et al. from Stanford and the Pr(Ai)2R Group.
  • Representation Finetuning (ReFT) is a suite of methods to modify the hidden representations of language models (LMs) for task-specific adaptation. Unlike traditional parameter-efficient finetuning (PEFT) methods that adapt by modifying weights, ReFT manipulates a small fraction of model representations, enhancing the interpretability and flexibility of the interventions.
  • A key variant within ReFT, named Low-rank Linear Subspace ReFT (LoReFT), leverages a low-rank projection matrix to edit representations in a linear subspace. This approach is demonstrated to be 10\(\times\)–50\(\times\) more parameter-efficient compared to existing state-of-the-art PEFTs like LoRA.
  • The ReFT methodology, specifically Low-rank Linear Subspace ReFT (LoReFT), operates by editing hidden representations in a linear subspace. LoReFT modifies these representations using a projection matrix \(R\), which redefines them in a low-dimensional subspace for efficiency. The matrix \(R\) has orthonormal rows, which are crucial for maintaining the quality of the intervention without adding much complexity.
  • The core intervention of LoReFT, as per the distributed interchange intervention (DII) formula \(DII(b, s, R) = b + R^\top(Rs - Rb)\), leverages the projection matrix to adjust the hidden states \(b\) towards a target state \(s\) by the application of \(R\). This intervention is designed to manipulate the model output towards desired behaviors or answers subtly and effectively.
  • LoReFT employs a linear transformation defined by the parameters \(W\) and \(b\) (not to be confused with the bias term), which projects the representation into the subspace before it is edited. This transformation helps in aligning the representation more closely with the task-specific features that are crucial for performance.
  • Practically, LoReFT is implemented as a set of non-overlapping interventions across multiple layers of a Transformer-based model. These interventions are strategically placed to modify the behavior of the model without extensive retraining of the underlying parameters.
  • Each intervention is applied after the computation of layer \(L\) representations, meaning it directly affects the computation of subsequent layers \(L+1\) to \(L+m\). This placement ensures that the interventions have a cascading effect, enhancing their impact on the final model output.
  • The hyperparameter tuning for LoReFT focuses on the number and placement of interventions across the layers, optimizing both the effectiveness of each intervention and the overall computational overhead. This involves selecting the appropriate number of prefix and suffix positions in the input where interventions are most beneficial, as well as deciding on the layers where these modifications will have the most impact.
  • The figure below from the paper shows an illustration of ReFT. (1) The left panel depicts an intervention I: the intervention function \(\Phi\) is applied to hidden representations at positions \(P\) in layer \(L\). (2) The right panel depicts the hyperparameters we tune when experimenting with LoReFT. Specifically, the figure depicts application of LoReFT at all layers with prefix length \(p\) = 2 and suffix length \(s\) = 2. When not tying layer weights, we train separate intervention parameters at each position and layer, resulting in 16 interventions with unique parameters in this example.

  • The authors evaluate LoReFT across multiple domains, including commonsense reasoning, arithmetic reasoning, instruction-following, and natural language understanding. It is shown that LoReFT achieves competitive or superior performance on all tasks, especially shining in commonsense reasoning benchmarks.
  • Implementation details reveal that LoReFT interventions are applied at selected layers and positions within the LM, optimizing both the number of interventions and their locations through hyperparameter tuning. This targeted approach allows for minimal additional computational overhead at inference.
  • LoReFT is implemented in a publicly available Python library, pyreft, which facilitates the adoption of ReFT methods by providing tools to apply these interventions on any pretrained LM from the HuggingFace model hub.
  • The paper establishes the potential of representation-focused finetuning as a more effective alternative to weight-based methods, setting new standards for efficiency and performance in adapting large-scale LMs to diverse tasks.

Which PEFT Technique to Choose: A Mental Model

  • Choosing a PEFT involves simply matching them with your objectives as shown in the figure below.

Soft Prompt Tuning

  • What: Soft Prompt tuning involves adding a small trainable prefix to the input of the pre-trained LLM during fine-tuning, which modifies the representation learned by the pre-trained model to better suit the downstream task.

  • When to use: Prompt Tuning is a good choice when you have a large pre-trained LLM but want to fine-tune it for multiple different downstream tasks at inference time with minimal computational resources. It is also useful when you want to generate diverse and high-quality text outputs based on specific prompts.

Prefix Tuning

  • What: Prefix Tuning involves learning a set of trainable parameters that modify the pre-trained LLM’s hidden states in response to task-specific prompts during inference, effectively fine-tuning the model at inference time.

  • When to use: When you want to fine-tune a pre-trained LLM for a specific downstream task and have limited computational resources when you want to modify the representation learned by the pre-trained model for a particular task.


  • What: Adapters are tiny modules that are added to pre-trained LLMs, typically between the pre-trained layers, to adapt the model to new downstream tasks. During fine-tuning, only the weights of the adapter are learned, while the pre-trained model’s parameters remain fixed.

  • When to use: When you need to fine-tune multiple downstream tasks on the same pre-trained model. Additionally, Adapters are flexible and can be quickly and easily plugged into different parts of the pre-trained model without requiring major modifications.


  • What: LoRA (Low-Rank Adaptation) is a technique that modifies the pre-trained LLM’s attention mechanism during fine-tuning by introducing a low-rank matrix factorization that learns task-specific attention patterns.

  • When to use: LoRA is a good choice when you want to fine-tune a pre-trained LLM for a specific downstream task that requires task-specific attention patterns. It is also useful when you have limited computational resources and want to reduce the number of trainable parameters in the model. Specifically:

    • Memory Efficiency is Desired but Not Critical: LoRA offers substantial savings in terms of parameters and computational requirements. If you’re looking to achieve a balanced reduction in trainable parameters without diving into the complexities of quantization, LoRA is an ideal choice.
    • Real-time Application: LoRA ensures no added inference latency, making it suitable for real-time applications.
    • Task-Switching is Required: LoRA can share the pretrained model across multiple tasks, reducing the need for maintaining separate models for each task.


  • What: QA-LoRA is a specialized technique for fine-tuning low-bit diffusion models. It integrates quantization-aware strategies with Low-Rank Adaptation (LoRA) principles, providing an efficient way to handle low-bit model environments.

  • When to use: Ideal for scenarios where the primary goal is to optimize memory usage and computational efficiency in low-bit settings. This method is particularly effective when traditional fine-tuning approaches fall short due to the constraints of low-bit environments.

  • Key Features:
    • Quantization-Aware Approach: QA-LoRA uniquely combines LoRA weights with full-precision model weights, then jointly quantizes them, enhancing memory and computational efficiency during inference.
    • Efficient for Low-Bit Models: Tailored for low-bit diffusion models, it addresses the specific challenges posed by these environments, making it a standout choice in such contexts.
  • Process:
    1. Adding LoRA Weights: QA-LoRA begins by integrating LoRA weights into the pre-trained model.
    2. Fine-Tuning LoRA Weights: These weights are then fine-tuned, focusing solely on the LoRA weights while keeping the original model weights unchanged.
    3. Merging Weights: Post-fine-tuning, the LoRA and original model weights are merged.
    4. Quantization: The merged weights undergo quantization to a lower-bit format, crucial for reducing memory and computational costs.


  • What: ReLoRA is an innovative approach for training high-rank networks efficiently. It revises the Low-Rank Adaptation method by iteratively applying low-rank updates to gradually increase the model’s effective rank.

  • When to use: Best suited for training large-scale models, particularly when the objective is to achieve high-rank training outcomes with less computational expenditure. ReLoRA is especially valuable for large transformer language models where resource efficiency is critical.

  • Key Features:
    • Iterative Low-Rank Updates: Unlike traditional low-rank methods, ReLoRA applies updates in an iterative manner, each time incrementally enhancing the model’s rank, leading to more efficient high-rank network training.
    • Resource Efficiency: Allows for training of large, high-performing models while significantly reducing computational demands.
  • Differentiation from Other Techniques:
    • ReLoRA stands out from previous techniques like standard LoRA by its unique iterative process. This method incrementally increases the rank of the model through successive low-rank updates, enabling more dynamic and refined training for large-scale models.
PEFT Methods Description When to Use Computational Overhead Memory Efficiency Versatility across Tasks Performance Impact
Prompt Tuning Modifies LLM's hidden states with trainable parameters in response to task-specific prompts. Large pre-trained LLM.
Adaptation to multiple tasks.
Low Moderate High Depends on prompt quality
Prefix Tuning Adds a trainable prefix to modify LLM's learned representation. Task-specific adaptation.
Limited resources.
Low Moderate Moderate Can vary, but usually positive with proper tuning
Adapters Inserts neural modules between LLM layers; only adapter weights are updated during fine-tuning. Multiple tasks on one LLM.
Flexibility required.
Moderate Good (only adapters are fine-tuned) High (can be added for multiple tasks) Typically positive if adapters are well-tuned
LoRA Introduces a low-rank matrix into the attention mechanism to learn task-specific patterns. Tasks with specialized attention requirements.
Limited resources.
Low-Moderate Good Moderate Generally positive with good training
QLoRA Builds on LoRA with quantization for enhanced memory efficiency. Strict memory constraints.
Emphasis on performance & efficiency.
Low Excellent High Comparable or better than full fine-tuning
QA-LoRA Enhances LoRA with quantization-aware techniques for fine-tuning low-bit diffusion models. Optimizing efficiency in low-bit settings.
Resource-constrained environments.
Low Excellent Moderate Enhanced efficiency and effectiveness in specific settings
ReLoRA Iteratively applies low-rank updates for efficient training of high-rank networks. Large-scale models requiring high-rank training with reduced resources. Moderate Good Moderate Achieves high-rank training efficiency and performance

Practical Tips for Finetuning LLMs Using LoRA

  • This section is inspired by the findings of Sebastian Raschka’s blog talking about practical tips for finetuning.

    1. Consistency in LLM Training: Despite the inherent randomness in training models on GPUs, the outcomes of LoRA experiments remain consistent across multiple runs, which is promising for comparative studies.

    2. QLoRA Compute-Memory Trade-offs: Quantized LoRA (QLoRA) offers a 33% reduction in GPU memory usage at the cost of a 33% increase in runtime, proving to be a viable alternative to regular LoRA when facing GPU memory constraints.

    3. Learning Rate Schedulers: Using learning rate schedulers like cosine annealing can optimize convergence during training and avoid overshooting the loss minima. While it has a notable impact on SGD optimizer performance, it makes less difference when using Adam or AdamW optimizers.

    4. Choice of Optimizers: The optimizer choice (Adam vs. SGD) doesn’t significantly impact the peak memory demands of LLM training, and swapping Adam for SGD may not provide substantial memory savings, especially with a small LoRA rank (r).

    5. Impact of Multiple Training Epochs: Iterating multiple times over a static dataset in multi-epoch training may not be beneficial and could deteriorate model performance, possibly due to overfitting.

    6. Applying LoRA Across Layers: Enabling LoRA across all layers, not just the Key and Value matrices, can significantly increase model performance, though it also increases the number of trainable parameters and memory requirements.

    7. LoRA Hyperparameters: Adjusting the LoRA rank (r) and selecting an appropriate alpha value are crucial. A heuristic that yielded good results was setting alpha at twice the rank’s value, with r=256 and alpha=512 being the best setting in one particular case.

    8. Fine-tuning Large Models: LoRA allows for fine-tuning 7 billion parameter LLMs on a single GPU with 14 GB of RAM within a few hours. However, optimizing an LLM to excel across all benchmark tasks may be unattainable with a static dataset.

  • Additionally, the article addresses common questions related to LoRA:

    • Importance of Dataset: The dataset used for fine-tuning is critical, and data quality is very important. Experiments showed that a curated dataset with fewer examples (like LIMA) could yield better performance than larger datasets (like Alpaca).

    • LoRA for Domain Adaptation: LoRA’s effectiveness for domain adaptation requires further investigation. Including task-specific examples in the fine-tuning process is recommended.

    • Selecting the Best Rank: Choosing the best rank for LoRA is a hyperparameter that needs to be explored for each LLM and dataset. A larger rank could lead to overfitting, while a smaller rank may not capture diverse tasks within a dataset.

    • Enabling LoRA for All Layers: Exploring the impact of enabling LoRA for different combinations of layers is suggested for future experiments.

    • Avoiding Overfitting: To prevent overfitting, one could decrease the rank or increase the dataset size, adjust the weight decay rate, or consider increasing the dropout value for LoRA layers.

    • Other Optimizers: Exploring other optimizers, such as Sophia, which promises faster training and better performance than Adam, is suggested for future research.

    • Factors Influencing Memory Usage: Model size, batch size, the number of trainable LoRA parameters, and dataset size can influence memory usage. Shorter training sequences can lead to substantial memory savings.

Surgical fine-tuning

  • A quick note here before we start, the Surgical fine-tuning paper, whose contents are mentioned below, focuses on computer vision and does not directly deal with large language models (LLMs). The techniques are applied to convolutional neural networks and vision transformers for image classification tasks.
  • Authors: Yoonho Lee, Annie S. Chen, Fahim Tajwar, Ananya Kumar, Huaxiu Yao, Percy Liang, Chelsea Finn
  • Definition: Surgical fine-tuning is a method of selectively updating specific layers in a neural network based on how a fine-tuning dataset differs from the original pretraining dataset, rather than retraining every layer.
  • Motivation:
    1. Layer Specificity: Early layers in a neural network capture fundamental features of inputs (e.g., edges or shapes in images), while deeper layers combine these features for predictions (e.g., classifying images).

    2. Efficiency: Rather than universally fine-tuning every layer, selectively updating specific layers can achieve better performance, especially when the fine-tuning dataset has notable differences from the pretraining dataset.

  • Approaches:
    1. Manual Approach:
      • Fine-tune each layer individually and create a distinct model for each layer.
      • Compare the performance of each model to identify the best layers for fine-tuning.
    2. Automated Approach:
      • Calculate gradients for each layer.
      • Derive relative gradients by dividing the layer’s gradient by its weight magnitude.
      • Normalize these relative gradients across layers, ranking them between 0 to 1.
      • Assign learning rates for layers based on their normalized relative gradient value during training.
    3. Based on the findings in this paper, here are some tips for determining which layers to fine-tune when adapting a pretrained model to a new target distribution:
      • Consider the type of distribution shift between the source and target data: 1) For input-level shifts like image corruptions, fine-tuning earlier layers (first conv block) tends to work best. This allows the model to adapt to changes in the input while preserving higher-level features. 2) For feature-level shifts where the feature representations differ between source and target, fine-tuning middle layers (middle conv blocks) tends to work well. This tunes the mid-level features without distorting low-level or high-level representations. 3) For output-level shifts like label distribution changes, fine-tuning later layers (fully connected classifier) tends to be most effective. This keeps the feature hierarchy intact and only adapts the output mapping.
      • Try fine-tuning only a single contiguous block of layers while freezing others. Systematically test first, middle, and last blocks to find the best one.
      • Use criteria like relative gradient norms to automatically identify layers that change the most for the target data. Fine-tuning those with higher relative gradients can work better than full fine-tuning.
      • When in doubt, fine-tuning only the classifier head is a solid default that outperforms no fine-tuning. But for shifts related to inputs or features, surgical fine-tuning of earlier layers can improve over this default.
      • If possible, do some quick validation experiments to directly compare different surgical fine-tuning choices on a small held-out set of target data.
      • The key insight is that different parts of the network are best suited for adapting to different types of distribution shifts between the source and target data.
  • Results:
    • CIFAR-C Dataset:
      • Manual approach yielded an accuracy of 82.8%.
      • Fine-tuning the entire network resulted in 79.9% accuracy.
      • The automated approach achieved an accuracy of 81.4%.
  • Significance: Surgical fine-tuning is rooted in understanding how neural networks process input. This enhanced understanding can drive the discovery of more efficient methods to improve machine learning models.
  • Consideration: For more complex datasets, discerning differences between pretraining and fine-tuning datasets can be challenging. This complexity might make automated approaches like the one proposed more valuable, even if it didn’t yield the best performance on CIFAR-C.

Tracing Model Outputs to the Training Data by Anthropic

  • This article summarizes research by Anthropic on using influence functions to study how large language models generalize from their training data. While this is not specifically a fine-tuning methodology, it does provide some insight into the LLM’s generalization capabilities. The key points are:
  • Influence functions let them identify which training examples most influence a model’s outputs. This provides clues about how models generalize.
    • Influence functions are a technique from statistics used to determine which training examples most influence a machine learning model’s parameters and predictions.
    • The key idea is to approximate how much adding or removing a training example would affect the model’s learned parameters. This lets you identify the most “influential” examples in the training set.
  • They scaled up influence functions to very large models (up to 52B parameters), which is necessary to study behaviors that emerge at scale.
  • Influence patterns become more abstract and conceptual with increasing model scale. Smaller models’ influences tend to involve token-level overlaps, while larger models’ influences are more thematically related.
  • Influence also becomes more cross-lingual with scale. Translated queries have higher influence from English training examples in larger models.
  • Influences are distributed across many training examples, not just a few. So models do not seem to be memorizing and reciting specific examples.
  • Influence can be localized to specific layers. Lower layers capture detailed wording while higher layers capture more abstract concepts.
  • Next steps are extending influence functions to fine-tuned models, connecting influence to mechanistic interpretability, and using it for alignment research.

LoRA vs. QLoRA experimentation by Sebastian Raschka

  • This section is taken from Sebastian Raschka’s post on LoRA & QLoRA experiments to finetune open-source LLMs, and presents his learnings:
    1. Despite embracing the inherent randomness of LLM training (or when training models on GPUs in general), the outcomes remain remarkably consistent across multiple runs.
    2. QLoRA presents a trade-off that might be worthwhile if you’re constrained by GPU memory. It offers 33% memory savings at the cost of a 33% increase in runtime.
    3. When finetuning LLMs, the choice of optimizer shouldn’t be a major concern. While SGD on its own is suboptimal, there’s minimal variation in outcomes whether you employ AdamW, SGD with a scheduler, or AdamW with a scheduler.
    4. While Adam is often labeled a memory-intensive optimizer due to its introduction of two new parameters for every model parameter, this doesn’t significantly affect the peak memory demands of the LLM. This is because the majority of the memory is allocated for large matrix multiplications rather than retaining extra parameters.
    5. For static datasets, iterating multiple times as done in multi-epoch training might not be beneficial. It often deteriorates the results, probably due to overfitting.
    6. If you’re incorporating LoRA, ensure it’s applied across all layers, not just to the Key and Value matrices, to maximize model performance.
    7. Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is setting alpha at twice the rank’s value.
    8. 7B models can be finetuned efficiently within a few hours on a single GPU possessing 14 Gb of RAM.
      • With a static dataset, optimizing an LLM to excel across all benchmark tasks is unattainable. Addressing this requires diverse data sources, or perhaps LoRA might not be the ideal tool.



If you found our work useful, please cite it as:

  title   = {Parameter Efficient Fine-Tuning (PEFT)},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}