• Fine-tuning is a critical step in deploying language models for specific Natural Language Processing (NLP) tasks. Large language models like BERT or GPT are pre-trained on a large text corpus without task labels; fine-tuning further trains these pre-trained models on a specific task using a smaller amount of task-specific data. This allows the model to adapt its general language understanding to the task at hand.
  • In NLP, fine-tuning is commonly used for tasks such as text classification, named entity recognition, sentiment analysis, and question answering.
  • Let’s look at how fine-tuning is done as illustrated in the image below (source).

The Fine-tuning Process:

  • Fine-tuning is carried out using a task-specific dataset. In this process, the weights of the pre-trained model are slightly adjusted to improve performance on the specific task.
  • The process can be illustrated with the following steps:
    1. Choose a pre-trained model: Start with a pre-trained model such as BERT or GPT, which have been trained on a large text corpus.
    2. Prepare your task-specific data: Your data should be in a format that aligns with the model architecture. For example, for a text classification task, your data should include text and its corresponding labels.
    3. Update model architecture if needed: For certain tasks, you might need to slightly alter the model architecture. For instance, for a classification task, you would typically add a dense layer on top of the model’s output to get predictions for each of your classes.
    4. Fine-tune the model: Train the model on your task-specific data. During fine-tuning, all of the model’s weights are usually updated, but with a small learning rate, because large updates can wipe out the pre-learned representations.

Finetuning - Updating the output layers

  • The method described here, referred to as Finetuning I, is a variant of the feature-based approach. Like the feature-based approach, this method leverages a pretrained large language model (LLM) to extract useful representations from the input data. However, instead of using these representations to train a separate model, the pretrained model itself is extended with additional output layers and these layers are trained on the task at hand.
  • The steps involved in this approach (source) are as follows:
  1. Loading a Pretrained Model: The pretrained LLM (in this case, DistilBERT) is loaded, but this time with additional output layers for sequence classification.

     model = AutoModelForSequenceClassification.from_pretrained(
         "distilbert-base-uncased", num_labels=2)  # arguments assumed: DistilBERT checkpoint, binary labels
  2. Freezing the Pretrained Layers: All the parameters of the pretrained model are frozen, meaning they will not be updated during training. This ensures that the knowledge captured by the model during pretraining is preserved.

     for param in model.parameters():
         param.requires_grad = False
  3. Unfreezing the Output Layers: The parameters of the added output layers (in this case, the pre_classifier and classifier layers) are unfrozen, so they will be updated during training.

     for param in model.pre_classifier.parameters():
         param.requires_grad = True
     for param in model.classifier.parameters():
         param.requires_grad = True
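  4. Finetuning the Model: The model is wrapped in a custom LightningModule and trained on the task-specific data with the Trainer from the PyTorch Lightning framework.

     lightning_model = CustomLightningModule(model)
     trainer = L.Trainer(max_epochs=3)  # original arguments were cut off; max_epochs=3 is only a placeholder
     trainer.fit(lightning_model, train_dataloaders=train_loader, val_dataloaders=val_loader)  # loaders assumed, named like test_loader below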
  5. Evaluating the Model: The performance of the model is evaluated on the test data.

     trainer.test(lightning_model, dataloaders=test_loader)
  • This method is expected to perform similarly to the feature-based approach in terms of both speed and model performance because the same pretrained model is used as the base in both cases. However, the feature-based approach might be more convenient in some cases, as it allows for the pre-computation and storage of the feature embeddings, which can be a benefit in some practical scenarios.
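
The freezing and unfreezing steps above can be checked on a small stand-in network without downloading a checkpoint. The sketch below assumes PyTorch is installed; `TinyClassifier`, its layer sizes, the helper `count_trainable`, and the learning rate are hypothetical and only mirror the `pre_classifier` and `classifier` attribute names used above.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a pretrained encoder with a classification head."""

    def __init__(self, hidden=16, num_labels=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())  # "pretrained" body
        self.pre_classifier = nn.Linear(hidden, hidden)  # first added output layer
        self.classifier = nn.Linear(hidden, num_labels)  # second added output layer

    def forward(self, x):
        return self.classifier(torch.relu(self.pre_classifier(self.encoder(x))))


def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


model = TinyClassifier()
total = sum(p.numel() for p in model.parameters())

# Freeze everything, then unfreeze only the two output layers.
for param in model.parameters():
    param.requires_grad = False
for layer in (model.pre_classifier, model.classifier):
    for param in layer.parameters():
        param.requires_grad = True

trainable = count_trainable(model)

# Only parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

print(f"trainable parameters: {trainable} of {total}")
```

Printing the trainable count before training is a cheap way to confirm that the intended layers, and only those, will be updated.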

Finetuning - Updating all the layers

  • This approach, referred to as Finetuning II, uses the LLM with additional output layers and updates the parameters of all layers (i.e., no layers are frozen). In the movie review classification example, it yielded a test accuracy of 92%.
  • The original BERT paper (Devlin et al.) suggested that finetuning only the output layer can result in modeling performance comparable to finetuning all layers, even though the latter is substantially more expensive computationally because of the larger number of parameters involved. For example, a BERT base model has around 110 million parameters, but the final layer for binary classification consists of only about 1,500 parameters. The last two layers of a BERT base model account for roughly 600,000 parameters, which is only around 0.5% of the total model size.
  • However, the performance of these approaches can vary based on how similar the target task and domain are to the data the model was pretrained on. In practice, it’s often found that finetuning all layers (Finetuning II) results in superior modeling performance.
  • The Python code for Finetuning II is almost identical to that for Finetuning I, but it lacks the step where the pretrained model’s parameters are frozen. In other words, all parameters of the model are updated during the training process. This approach generally gives better performance, as seen in the movie review classification example, but at the cost of increased computational resources.
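
A back-of-the-envelope count makes the parameter proportions quoted above concrete. The sketch assumes a hidden size of 768 for BERT base, a binary classification layer, and a 768×768 dense layer directly before it; the 110 million total is the rounded figure from the text.

```python
hidden_size = 768            # BERT base hidden dimension (assumed)
num_classes = 2              # binary classification
total_params = 110_000_000   # rounded total quoted in the text

# Final classification layer: one weight per (hidden unit, class) plus one bias per class.
classifier_params = hidden_size * num_classes + num_classes

# Dense layer feeding the classifier: a square weight matrix plus a bias vector.
dense_params = hidden_size * hidden_size + hidden_size

last_two_params = classifier_params + dense_params
share = 100 * last_two_params / total_params

print(f"classifier layer: {classifier_params:,} parameters")
print(f"last two layers:  {last_two_params:,} parameters ({share:.2f}% of the model)")
```

Under these assumptions the classification layer has 1,538 parameters and the last two layers together have about 592,000, roughly half a percent of the model.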


  • Test accuracies in the movie review classification example:
  1. Feature-based approach with logistic regression: 83%
  2. Finetuning I, updating the last two layers: 87%
  3. Finetuning II, updating all layers: 92%

  • These results are consistent with the general rule of thumb that finetuning more layers often results in better performance, but it comes with increased cost.
  • (Source) for this subsection.

Benefits of Fine-tuning:

  • Efficiency: Fine-tuning a pre-trained model is much faster and requires less computational resources than training a model from scratch, as the pre-trained model has already learned a good amount of language representations.
  • Performance: Fine-tuned models usually outperform models trained from scratch, especially when the task-specific data is limited. This is because pre-trained models already have a general understanding of the language, so they only need to adapt to the specifics of the task.

Challenges in Fine-tuning:

  • Overfitting: If the task-specific dataset is small, there’s a risk that the model might overfit to the training data during fine-tuning. Regularization techniques like dropout or weight decay can help to mitigate this.
  • Catastrophic forgetting: This is a phenomenon where the model tends to forget the previously learned knowledge while fine-tuning. It’s often observed when the fine-tuning task is very different from the original pretraining task. Using a smaller learning rate during fine-tuning can alleviate this issue.

Models and how they finetune

  1. BERT: BERT (Bidirectional Encoder Representations from Transformers) is designed as a deeply bidirectional model. When fine-tuning BERT for a specific task, we add an additional output layer so that the number of outputs matches the number of classes in the specific task. All the parameters of BERT and the additional output layer are trained on the task-specific dataset. For example, for a text classification task, we add a dense layer with a softmax activation function to get the probability distribution over the classes. For a named entity recognition task, the model is fine-tuned to predict an entity class for each token in the input.

  2. GPT: GPT (Generative Pretrained Transformer) is an autoregressive model that uses the decoder of the transformer architecture. For text generation, no new layer is needed: the pre-trained GPT model is fine-tuned on the task-specific dataset with its existing language-modeling objective and is then used to generate new text from an input prompt. For text classification, a linear classification head is typically added on top of the transformer outputs to map the hidden states to the target classes.

  3. RoBERTa: RoBERTa is a variant of BERT and follows a similar procedure for fine-tuning. An additional layer is added based on the task, and all parameters are trained on the task-specific data.

  4. XLNet: XLNet is another transformer-based model which outperforms BERT on several benchmarks. For fine-tuning, similar to BERT and GPT, an additional layer is added based on the task, and all parameters are trained on the task-specific data.

  5. DistilBERT: DistilBERT is a smaller and faster version of BERT. It follows a similar procedure for fine-tuning as BERT, where an additional output layer is added, and all parameters are trained on the task-specific data.

  • For each of these models, fine-tuning involves training the model on a task-specific dataset. The process may vary slightly depending on the model architecture and the specific task. Also, each model has its own strengths and weaknesses, so the choice of model for fine-tuning would depend on the specific requirements of the task.
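
The difference between a sequence-level head (text classification) and a token-level head (named entity recognition) can be sketched with NumPy. The hidden states below are random stand-ins for encoder outputs, and the sizes, weights, and the `softmax` helper are illustrative assumptions rather than any library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, num_classes, num_entity_tags = 6, 8, 3, 5

# Random stand-in for the encoder output of one sentence: one vector per token.
hidden_states = rng.normal(size=(seq_len, hidden))


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Text classification: a dense layer on the first token's vector, then softmax.
W_cls = rng.normal(size=(hidden, num_classes))
b_cls = np.zeros(num_classes)
class_probs = softmax(hidden_states[0] @ W_cls + b_cls)   # shape (num_classes,)

# Named entity recognition: the same kind of layer applied to every token.
W_tok = rng.normal(size=(hidden, num_entity_tags))
b_tok = np.zeros(num_entity_tags)
token_probs = softmax(hidden_states @ W_tok + b_tok)      # shape (seq_len, num_entity_tags)

print(class_probs.shape, token_probs.shape)
```

In both cases the added layer is small compared with the encoder; what changes is whether it produces one distribution per input or one per token.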

Feature-Based Approach

  • The Feature-Based Approach is a method for applying pre-trained large language models (LLMs) to specific tasks. In this approach, instead of retraining or fine-tuning the entire model, we use it to generate output embeddings for our training set, which can then be used as input features to train a separate, typically simpler, model for the task we are interested in. This is typically done with embedding-focused models like BERT, but can also be applied to generative models like GPT.

In the provided Python code snippet (source), the following steps are carried out:

  1. Loading a Pretrained Model: The pretrained LLM (in this case, DistilBERT) is loaded.

     model = AutoModel.from_pretrained("distilbert-base-uncased")
  2. Tokenizing the Dataset: The text data in the dataset is broken down into tokens that the model can process. This step is indicated in the comments but the code isn’t shown.

  3. Generating Embeddings: Each batch of tokenized data is passed through the model to generate output embeddings. Here, the first token’s vector is taken from the model’s last_hidden_state output (the [:, 0] index) and used as a fixed-size feature vector for each input.

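     def get_output_embeddings(batch):
         # Arguments assumed: the standard fields produced by the tokenizer
         output = model(
             batch["input_ids"],
             attention_mask=batch["attention_mask"]
         ).last_hidden_state[:, 0]  # keep only the first token's embedding
         return {"features": output}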
  4. Creating Feature Sets: The embeddings are then stored in a dataset.

     dataset_features = dataset_tokenized.map(
       get_output_embeddings, batched=True, batch_size=10)
  5. Splitting the Dataset: The features are split into training, validation, and test sets, along with their corresponding labels.

     import numpy as np
     X_train = np.array(dataset_features["train"]["features"])
     y_train = np.array(dataset_features["train"]["label"])
     X_val = np.array(dataset_features["validation"]["features"])
     y_val = np.array(dataset_features["validation"]["label"])
     X_test = np.array(dataset_features["test"]["features"])
     y_test = np.array(dataset_features["test"]["label"])
  6. Training a Classifier: A simple logistic regression model is trained on the feature vectors.

     from sklearn.linear_model import LogisticRegression
     clf = LogisticRegression()
     clf.fit(X_train, y_train)
  7. Evaluating the Classifier: The accuracy of the classifier is then evaluated on the training, validation, and test sets.

     print("Training accuracy", clf.score(X_train, y_train))
     print("Validation accuracy", clf.score(X_val, y_val))
     print("Test accuracy", clf.score(X_test, y_test))
  • This feature-based approach is especially useful when you have a smaller target dataset or when computational resources are limited. By transforming the data using a pretrained LLM, you can leverage the knowledge captured by the model during its pretraining, allowing a simpler model to potentially achieve high performance on the task at hand.

Parameter Efficient Finetuning

  • Parameter-Efficient Finetuning (PEFT) is a method for adapting Large Language Models (LLMs) to specific tasks. This technique allows for reusing pretrained models while minimizing the computational and resource footprints. Some of the key advantages of PEFT include:
  1. Reduced computational costs: PEFT methods require fewer GPUs and less GPU time.
  2. Faster training times: PEFT methods complete training more quickly than methods that involve training all layers.
  3. Lower hardware requirements: PEFT methods work with smaller GPUs and require less memory.
  4. Better modeling performance: By limiting the number of parameters that need to be adjusted, PEFT methods can reduce the risk of overfitting.
  5. Less storage: The majority of the model weights can be shared across different tasks.
  • In previous sections, it was established that finetuning more layers often leads to better results. But what if we need to finetune larger models that only just fit into GPU memory? In such cases, PEFT methods can be very useful.
  • Some of the most widely used PEFT techniques include prefix tuning, adapters, and low-rank adaptation. While the specifics of these techniques vary, they all involve introducing a small number of additional parameters that are then finetuned. In contrast to the Finetuning II method described above (where all layers are finetuned), these methods adjust only a small number of parameters, thereby saving on computational resources. And even though they update fewer parameters, they can achieve predictive performance that is comparable to or even better than methods that finetune all layers.
  • To give a brief overview:
  • Prefix Tuning: A small set of trainable vectors (the prefix) is prepended to the inputs of the model’s transformer layers, while the pretrained parameters stay frozen. Because only the prefix is trained, a separate prefix can be learned for each task.
  • Adapters: Small feed-forward neural networks (adapters) are inserted into each transformer layer of the pretrained model. Only the adapters are trained; the main parameters of the model are frozen.
  • Low-rank Adaptation: A low-rank update, expressed as the product of two small matrices, is added to selected weight matrices of the model. Only these small matrices are finetuned; the main parameters of the model are kept frozen.
  • Note that each of these methods offers its own balance between computational efficiency and modeling performance. The choice of which method to use can depend on factors like the specific task, the computational resources available, and the size of the pretrained model.
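
To make the low-rank idea concrete, the sketch below compares parameter counts for a single weight matrix. It assumes a 768×768 layer and rank 8, both chosen only for illustration, and follows the common convention of initializing one factor to zero so that the adapted layer starts out identical to the frozen one. The function name `lora_forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 768, 768, 8   # illustrative sizes, not taken from the text

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable low-rank factor
B = np.zeros((d_out, rank))                     # trainable, zero-initialized


def lora_forward(x):
    """Frozen projection plus the low-rank update (B @ A) applied to x."""
    return x @ W.T + (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
frozen_out = x @ W.T
adapted_out = lora_forward(x)

full_params = W.size
lora_params = A.size + B.size
print(f"full matrix:      {full_params:,} parameters")
print(f"low-rank factors: {lora_params:,} parameters "
      f"({100 * lora_params / full_params:.1f}% of the full matrix)")
```

During finetuning only `A` and `B` would receive gradient updates, so the number of trained parameters grows with the rank rather than with the size of `W`.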

In-Context Learning and Indexing

  • In-context learning can be used as an alternative to finetuning, especially when direct access to the model is not provided. The image below (source) shows in-context learning.
  • In-Context Learning and Indexing refer to strategies for using large language models (LLMs) like GPT-3 and GPT-4 to perform specific tasks that the model wasn’t originally trained on.
  • In-Context Learning is a method where we provide a few examples of a task via the input prompt, and the model uses this context to generate appropriate responses. This ability comes from the model’s extensive pretraining on a general corpus, which gives it a large amount of knowledge to draw on. An example of in-context learning might be providing the model with a couple of examples of a specific type of sentence, and then the model being able to generate more sentences of that type.
  • Hard Prompt Tuning is a strategy related to in-context learning. Here, we modify the inputs with the goal of getting better outputs. This is known as ‘hard’ prompt tuning because it involves directly changing the input words or tokens. Although this method is more resource-efficient compared to parameter finetuning (where the model’s parameters are updated to better perform a task), it often falls short in performance, as it doesn’t adapt the model’s parameters to the specific nuances of a task. This can limit its adaptability. Also, the process of hard prompt tuning can require significant human involvement to compare the quality of different prompts and decide which works best.
  • The image below (source) shows hard prompting.
  • Soft Prompt Tuning is a differentiable version of prompt tuning: instead of editing discrete input tokens, trainable embedding vectors are prepended to the input and optimized with gradient descent, which allows prompts to be adjusted in a more automated manner.
  • Indexing is another method that utilizes in-context learning. With this technique, an LLM can be turned into an information retrieval system, meaning it can draw on data from external resources and websites. An indexing module breaks down a document or website into smaller pieces and converts these into vectors, which are stored in a vector database. When a user asks a question or makes a query, the query is converted into a vector as well, and the indexing module calculates the similarity between it and each vector in the database. The text chunks behind the top k most similar vectors are then added to the prompt, and the LLM generates its response from them. This makes it possible for the LLM to provide information from an extensive range of sources, beyond what it was explicitly trained on.
  • The image below (source) shows indexing.
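
The indexing flow described above can be sketched end to end with the standard library. A bag-of-words count vector stands in for a learned embedding and a plain list stands in for the vector database. The example chunks, the helpers `embed`, `cosine`, `retrieve`, and `build_prompt`, and the prompt wording are all hypothetical; a real system would use an embedding model and send the resulting prompt to an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Indexing: split the source into chunks and store (vector, chunk) pairs.
chunks = [
    "LoRA adds trainable low-rank matrices to frozen weights.",
    "Adapters insert small feed-forward layers into each transformer block.",
    "The IMDB dataset contains movie reviews labeled as positive or negative.",
]
index = [(embed(chunk), chunk) for chunk in chunks]


def retrieve(query, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(query, k=2):
    """Place the retrieved chunks in the prompt so the LLM can use them in context."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


prompt = build_prompt("What does LoRA add to the frozen weights?", k=1)
print(prompt)
```

The model's parameters are never touched in this flow: the retrieved text reaches the LLM only through the prompt, which is why indexing counts as a form of in-context learning.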