Introduction

  • LLaMA (paper; blog) is a collection of language models released by Meta AI (FAIR).
  • Unlike ChatGPT or GPT-4, LLaMA is open source, with the code available to the public on this GitHub repository.
    • It’s important to note, however, that while the code is publicly available, the weights are only released once you confirm that your use case does not involve commercial use.
  • Inspired by the Chinchilla scaling-laws paper, the LLaMA paper proposes a set of “small” large language models: the 13B model outperforms GPT-3 (175B) with more than 10x fewer parameters, and the larger 65B version is competitive with PaLM-540B. In sum, the paper proposes smaller open-source models trained on publicly available data that match or outperform some of the proprietary LLMs created in recent years.
  • The LLaMA models, which outperform GPT-3, are a welcome alternative to previous open-source models like OPT and BLOOM, which are said to underperform GPT-3.

Methods LLaMA used

  • Below are some of the methods LLaMA uses to improve performance and outpace recent LLMs; the smallest (7B) model is on par with GPT-3 on many language tasks.
  • The LLaMA architecture adopts four architectural modifications compared to the original Transformer:
    1. RMSNorm for pre-normalization
    2. Rotary embeddings
    3. SwiGLU activation function
    4. Attention optimizations

Pre-normalization

  • To improve training stability, LLaMA normalizes the input of each transformer sub-layer instead of normalizing the output, using the RMSNorm normalizing function introduced by Zhang and Sennrich (2019).
  • Pre-normalization here means normalizing the activations that enter each sub-layer (attention and feed-forward) before the sub-layer is applied, rather than normalizing its output as in the original post-norm Transformer.
  • The aim is to improve the efficiency and stability of training by keeping the activations flowing into each sub-layer at a consistent scale, so gradients remain well behaved throughout the deep stack.
  • Standard LayerNorm achieves this by subtracting the mean and dividing by the standard deviation of each token’s activation vector; RMSNorm simplifies this by dropping the mean subtraction and rescaling only by the root mean square of the activations, followed by a learned per-feature gain, which is cheaper to compute.
  • Keeping each feature at a consistent scale makes it easier for the network to learn the relationships between features without being overly affected by their magnitudes; this matters especially when features have vastly different scales, which would otherwise cause the network to overemphasize some features and underemphasize others.
  • By normalizing the input to every sub-layer in this way, these differences are accounted for, and the network can better learn the underlying patterns and relationships in the data (a minimal sketch of RMSNorm follows this list).
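  • As a concrete illustration, below is a minimal RMSNorm sketch in PyTorch (a hypothetical module written for exposition, not Meta's exact implementation): each token's activation vector is rescaled by its root mean square, with no mean subtraction, and then multiplied by a learned per-feature gain.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm sketch: rescale by the root mean square, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); normalize over the feature dimension only,
        # so each token is rescaled independently of the rest of the batch.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight
```

  • In a pre-norm Transformer block, a module like this would be applied to the input of the attention and feed-forward sub-layers rather than to their outputs.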

SwiGLU activation function (Swish-Gated Linear Unit)

  • LLaMA replaces the ReLU non-linearity with the SwiGLU activation function, introduced by Shazeer (2020) in GLU Variants Improve Transformer, to improve performance.
  • The SwiGLU activation function is based on the Swish activation function, which is a smooth and non-monotonic function that has been shown to outperform other commonly used activation functions such as ReLU and sigmoid in certain neural network architectures.
\[\mathrm{Swish}_\beta(x) = x \cdot \sigma(\beta x)\]
\[\mathrm{SwiGLU}(x, W, V) = \mathrm{Swish}_\beta(xW) \otimes (xV)\]
  • Here \(\sigma\) is the sigmoid function, \(W\) and \(V\) are learned linear projections, and \(\otimes\) denotes element-wise multiplication; LLaMA’s feed-forward blocks use \(\beta = 1\) (i.e., the SiLU function) for the gating branch.
  • Experimental results have shown that SwiGLU can outperform other activation functions such as ReLU, Swish, and GELU (Gaussian Error Linear Units) on certain image classification and language modeling tasks.
  • However, the effectiveness of SwiGLU can depend on the specific architecture and dataset used, so it may not always be the best choice for every application.
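  • To make the gating concrete, here is a minimal sketch of a SwiGLU feed-forward block in PyTorch; the names (`w_gate`, `w_up`, `w_down`) and the choice of hidden size are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Sketch of a SwiGLU feed-forward block: SiLU(x @ W_gate) * (x @ W_up), then a down-projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # gating branch
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)    # value branch
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The Swish/SiLU of the gate modulates the value branch element-wise.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

  • The key design choice is that the non-linearity acts as a learned gate on a second linear projection, rather than being applied directly to a single projection as in a standard ReLU feed-forward block.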

Rotary Positional Embeddings

  • LLaMA does not utilize absolute positional embeddings to encode the order of the sequence as in the original Transformer; instead, it applies Rotary Positional Embeddings (RoPE), introduced by Su et al. (2021) in RoFormer: Enhanced Transformer with Rotary Position Embedding, at each layer of the network.
  • The basic idea behind rotary embeddings is to introduce additional structure into the position embeddings used in deep learning models. Position embeddings are used to encode the position of each element in a sequence (such as a word in a sentence) as a vector, which is then combined with the corresponding element embedding to form the input to the model.
  • In traditional Transformers, these position vectors are added to the token embeddings once at the input, so the model only ever sees absolute positions and the position signal is entangled with the token representation before any attention is computed.
  • Rotary embeddings instead rotate the query and key vectors inside each attention layer, pairing up embedding dimensions and rotating each pair by an angle proportional to the token’s position. Because the dot product of two rotated vectors depends only on the difference between their rotation angles, attention scores become a function of the relative distance between tokens, which encodes absolute position while making the attention pattern relative-position aware and more expressive.
  • Experimental results have shown that rotary embeddings can improve the performance of deep learning models on certain tasks, such as machine translation and language modeling.
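  • The sketch below shows the core rotation applied to a single sequence of query or key vectors; the function name and the `base` constant are assumptions for exposition, and the half-split pairing convention used here (as in some open-source implementations) differs in detail from Meta's complex-number formulation, but it illustrates the same rotation.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions of x (shape: seq_len, dim) by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2  # one rotation angle per pair of dimensions (assumes dim is even)
    # Geometrically decaying frequencies, as in sinusoidal position embeddings
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Apply a 2-D rotation to each (x1, x2) pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

  • In an attention layer, this rotation would be applied to the query and key vectors before their dot products are computed, so the resulting attention scores depend on relative positions.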

Attention Optimizations

  • LLaMA uses both memory-efficient attention and FlashAttention, which offer an efficient implementation of causal multi-head attention that reduces memory usage and runtime. The former presents a very simple algorithm for attention that requires \(O(1)\) memory with respect to sequence length, and an extension to self-attention that requires \(O(\log{n})\) memory.
  • This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task. This, in turn, helps improve the training efficiency and time-to-convergence.
  • This also means that it would likely be possible to extend the context length to something much larger.
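  • As a rough illustration of what a fused, memory-efficient causal attention call looks like (this is not Meta's xformers-based training code), PyTorch 2.x exposes `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to FlashAttention-style kernels when they are available:

```python
import torch
import torch.nn.functional as F

# Toy tensor shapes for illustration: (batch, heads, seq_len, head_dim)
batch, n_heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

# The fused kernel applies the causal mask internally and never materializes
# the full (seq_len x seq_len) attention matrix in memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```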

Visual Summary

  • The following visual summary by Sebastian Raschka details the methods LLaMA used to achieve this performance: pre-normalization, SwiGLU activations, and Rotary Embeddings. Sebastian Raschka also points out that the plots of training loss versus the number of training tokens still show a steep negative slope, suggesting the models were undertrained and could have benefited from training beyond 1-2 epochs.

Model Variants

  • LLaMA is available in several sizes (7B, 13B, 33B, and 65B parameters).

Training Protocol

  • LLaMA 65B and LLaMA 33B are trained on 1.4 trillion tokens, while LLaMA 7B and LLaMA 13B are trained on 1 trillion tokens.
  • LLaMA was trained like most language models: it takes a sequence of words as input and is trained to predict the next word.
  • It was trained on text in 20 different languages, with a focus on languages that use the Latin and Cyrillic alphabets.

Results

  • LLaMA-13B outperforms GPT-3 (175B) on most benchmarks.
  • LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.