Primers • Token Sampling Methods: Greedy, Beam Search, Temperature, top-k, top-p
Overview
 In this article, let’s go over the methods used to sample tokens from a model’s output distribution: greedy decoding, beam search, temperature, top-k, and top-p.
Background: Logits and Softmax

Neural networks typically produce class probabilities by passing a logit vector \(\mathbf{z}=\left(z_{1}, \ldots, z_{n}\right)\) through the softmax function, which yields a probability vector \(\mathbf{q}=\left(q_{1}, \ldots, q_{n}\right)\) by comparing each \(z_{i}\) with the other logits:
\(q_{i}=\frac{\exp \left(z_{i} / T\right)}{\sum_{j} \exp \left(z_{j} / T\right)}\)
 where \(T\) is the temperature parameter, normally set to 1.

The softmax function normalizes the candidates at each iteration of the network based on their exponential values, ensuring that the network outputs all lie between zero and one and sum to one at every timestep.
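As a minimal sketch of the formula above, the following uses only the Python standard library. The function name `softmax_with_temperature` and the example logits are assumptions made for illustration and do not come from any particular library. Subtracting the largest scaled logit before exponentiating is a common numerical-stability step that leaves the result unchanged.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Scale every logit by 1/T, as in q_i = exp(z_i / T) / sum_j exp(z_j / T).
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for three candidates
probs = softmax_with_temperature(logits)  # T = 1, the usual default
print([round(p, 3) for p in probs])
print(sum(probs))
```

With \(T=1\), every output lies strictly between zero and one, the outputs sum to one, and the ordering of the logits is preserved.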
Temperature

Temperature is a hyperparameter of LSTMs (and neural networks generally) used to control the randomness of predictions by scaling the logits before applying softmax. For example, in TensorFlow’s Magenta implementation of LSTMs, temperature represents how much to divide the logits by before computing the softmax.
 When the temperature is 1, we compute the softmax directly on the logits (the unscaled output of earlier layers). With a temperature of 0.6, the model computes the softmax on \(\frac{\text{logits}}{0.6}\), which increases the magnitude of each logit and widens the gaps between them. Performing softmax on these larger values makes the LSTM more confident (less input is needed to activate the output layer) but also more conservative in its samples (it is less likely to sample from unlikely candidates).
 Using a higher temperature produces a softer probability distribution over the classes and makes the RNN more “easily excited” by samples, resulting in more diversity/randomness in its tokens (thus enabling it to get out of repetitive loops more easily), but it also leads to more mistakes.
 A higher temperature therefore increases the sensitivity to low-probability candidates. In LSTMs, the candidate, or sample, can be a letter, a word, or a musical note, for example. From the Wikipedia article on the softmax function:
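The effect described in the paragraphs above can be sketched by computing the same distribution at several temperatures and then sampling from it. The vocabulary, the logits, and the helper names `softmax_with_temperature` and `sample_token` are hypothetical and chosen only for illustration. This is not Magenta's implementation.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be positive)."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["the", "a", "cat", "zebra"]  # hypothetical vocabulary
logits = [3.0, 2.0, 1.0, -1.0]         # hypothetical logits, most likely first

top_prob, tail_prob = {}, {}
for t in (0.6, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    top_prob[t] = probs[0]    # probability of the most likely candidate
    tail_prob[t] = probs[-1]  # probability of the least likely candidate
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
print([sample_token(tokens, logits, 2.0, rng) for _ in range(10)])
```

In this sketch, lowering the temperature to 0.6 moves probability mass toward the top candidate, while raising it to 2.0 moves mass toward the unlikely candidates, which is the source of both the extra diversity and the extra mistakes.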
For high temperatures \((\tau \rightarrow \infty)\), all [samples] have nearly the same probability and the lower the temperature, the more expected rewards affect the probability. For a low temperature \(\left(\tau \rightarrow 0^{+}\right)\), the probability of the [sample] with the highest expected reward tends to \(1\).
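One way to see why both limits in the quoted passage hold, written in this article's notation (with \(T\) playing the role of \(\tau\)), is to look at the ratio of two probabilities. It follows directly from the softmax definition, because the normalizing sum cancels:
\(\frac{q_{i}}{q_{j}}=\frac{\exp \left(z_{i} / T\right)}{\exp \left(z_{j} / T\right)}=\exp \left(\frac{z_{i}-z_{j}}{T}\right)\)
 As \(T \rightarrow \infty\), the exponent tends to \(0\) for every pair, so \(\frac{q_{i}}{q_{j}} \rightarrow 1\) and each \(q_{i} \rightarrow \frac{1}{n}\), i.e., the distribution becomes uniform.
 As \(T \rightarrow 0^{+}\), if \(z_{i}>z_{j}\) then \(\frac{q_{i}}{q_{j}} \rightarrow \infty\), so, assuming a unique largest logit, all of the probability mass concentrates on it and its probability tends to \(1\).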
References
 Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015)
 What is Temperature in LSTM (and neural networks generally)?
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledTokenSamplingMethods,
title = {Token Sampling Methods},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}