Large Language Models

  • Large Language Models (LLMs), such as GPT-3 or BERT, are deep neural networks.
  • They consist of many layers of neurons connected by billions of weighted links.
  • “Given an input text ‘prompt’, at essence what these systems do is compute a probability distribution over a ‘vocabulary’—the list of all words (or actually parts of words, or tokens) that the system knows about. The vocabulary is given to the system by the human designers. GPT-3, for example, has a vocabulary of about 50,000 tokens.” (Source)
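The idea of a fixed vocabulary can be illustrated with a toy tokenizer. This is a hand-made sketch, not how GPT-3's tokenizer actually works (real vocabularies of ~50,000 tokens are learned from data via byte-pair encoding); the `vocab` mapping and `<unk>` token here are invented for illustration.

```python
# Toy vocabulary: each token the model knows is mapped to an integer id.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "<unk>": 5}

def tokenize(text):
    """Map each whitespace-separated word to its vocabulary id,
    falling back to the unknown token for out-of-vocabulary words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```

Real tokenizers split text into sub-word pieces rather than whole words, which is why the vocabulary contains "parts of words" as the quote above notes.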

How do LLMs work?

  • LLMs are usually given a specific task, such as completing a sentence or answering a question.
  • They start by converting the prompt they receive into numerical vectors (token embeddings).
  • They then perform layer-by-layer computations, which ultimately assign a number, called a logit, to each token in the vocabulary.
  • Finally, the logits are converted into a probability distribution over the vocabulary, which determines which token should come next in the text.
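The final step above — turning logits into a probability distribution — is typically done with the softmax function. Below is a minimal sketch; the vocabulary and logit values are made up for illustration, and a real model would produce logits over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a tiny 4-token vocabulary after the final layer.
vocab = ["mat", "dog", "moon", "cheese"]
logits = [3.2, 1.1, 0.4, -1.0]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy decoding: pick the argmax
print(next_token)  # "mat"
```

In practice the next token is often sampled from this distribution rather than chosen greedily, which is what makes generated text vary from run to run.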

Similarity Computation

  • The natural next step is to determine whether two sentences are similar to or different from each other.
  • Sentence similarity measures the degree to which two sentences are equivalent in meaning.
  • Below are the two most common measures of sentence similarity:

Dot Product
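For two vectors \(\mathbf{u}\) and \(\mathbf{v}\) of dimension \(n\), the dot product is the sum of the products of their components:

\[\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i\]

It grows when the vectors point in similar directions and when their magnitudes are large; unlike cosine similarity below, it is not normalized by vector length.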

Cosine Similarity

\[\text{cosine\_similarity}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}\]

  • where:
  • \(\mathbf{u}\) and \(\mathbf{v}\) are the two vectors being compared,
  • \(\cdot\) represents the dot product,
  • \(\lVert\mathbf{u}\rVert\) and \(\lVert\mathbf{v}\rVert\) represent the magnitudes (or norms) of the vectors, and
  • \(n\) is the number of dimensions in the vectors.
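The formula above translates directly into code. This is a minimal sketch using plain Python; the example vectors are made up (real sentence embeddings would come from a model and have hundreds of dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction,
    0.0 means orthogonal, -1.0 means opposite direction."""
    dot = sum(a * b for a, b in zip(u, v))          # u . v
    norm_u = math.sqrt(sum(a * a for a in u))       # ||u||
    norm_v = math.sqrt(sum(b * b for b in v))       # ||v||
    return dot / (norm_u * norm_v)

# Toy embedding vectors: v is a scaled copy of u, so they are parallel.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
print(cosine_similarity(u, v))  # 1.0
```

Because the dot product is divided by both norms, scaling a vector does not change its cosine similarity — only direction matters, which is why it is usually preferred over the raw dot product for comparing sentence embeddings.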