Generative learning algorithms

  • We’ve mainly looked at learning algorithms that model \(P(y \mid x ; \theta)\), the conditional distribution of \(y\) given \(x\). For instance, logistic regression modeled \(P(y \mid x ; \theta)\) as \(h_{\theta}(x)=g(\theta^{T} x)\) where \(g(\cdot)\) is the sigmoid function. In this section, we’ll talk about a different type of learning algorithm.
  • Let’s use a binary classification problem as the motivation behind our discussion. Consider a classification problem in which we want to learn to distinguish between malignant \((y=1)\) and benign \((y=0)\) tumors.
    • Given a training set, an algorithm like logistic regression, the perceptron algorithm or a generalized linear model starts with randomly initialized parameters. Over the course of learning, the algorithm performs gradient descent, and the decision boundary (a straight line, or more generally a hyperplane) evolves until it separates the positive and negative examples – in this case, the malignant and benign tumors.
    • Then, to classify a new sample as either malignant or benign, the algorithm checks which side of the decision boundary the new sample falls on, and makes its prediction accordingly.
  • There’s a different class of algorithms that don’t try to maximize the conditional likelihood \(P(y \mid x)\), or look at both classes at once in search of a separating boundary. Instead, these algorithms look at the classes one at a time.
    • First, looking at the malignant tumors, we can learn features of what malignant tumors look like. Then, looking at benign tumors, we can build a separate model of what benign tumors look like.
    • Finally, to classify a new tumor, we can match the new sample against the malignant tumor model and the benign tumor model, to see whether the new tumor looks more like a malignant or benign tumor we had seen in the training set.
  • Algorithms that try to learn \(P(y \mid x)\) directly (such as logistic regression), or algorithms that try to learn the mapping from the input space \(\mathcal{X}\) to the labels (such as the perceptron algorithm) are called discriminative learning algorithms. Here, we’ll talk about algorithms that instead try to model \(P(x \mid y)\) (and \(P(y)\)). These algorithms are called generative learning algorithms.
    • For instance, if \(y\) indicates whether an example is a benign tumor \((0)\) or a malignant tumor \((1)\), then \(P(x \mid y=0)\) models the distribution of benign tumors’ features, while \(P(x \mid y=1)\) models the distribution of malignant tumors’ features.
    • The model also learns the class prior \(P(y)\). To illustrate the concept of a class prior – when a patient walks into your office, before you’ve even seen them, the probability that their tumor is malignant versus benign is referred to as the class prior.
    • Thus, a generative learning algorithm builds a model of features for each of the classes in isolation. At test time, the algorithm evaluates a new sample against each of these models, identifies the model it matches most closely against and returns that as the prediction.
  • After modeling \(P(y)\) (called the class priors) and \(P(x \mid y)\), our algorithm can then use Bayes rule to derive the posterior distribution on \(y\) given \(x\):
\[P(y \mid x)=\frac{P(x \mid y) P(y)}{P(x)}\]
  • Here, the denominator is given by \(P(x)=P(x \mid y=1) P(y=1)+P(x \mid y=0) P(y=0)\), which is a function of the quantities \(P(x \mid y)\) and \(P(y)\). Note that both of these quantities were learned as part of the learning process of generative learning algorithms.
  • The above equation represents the underlying framework we’ll use to build generative learning algorithms such as Gaussian discriminant analysis and Naive Bayes.
  • Actually, if we are calculating \(P(y \mid x)\) only in order to make a prediction, then we don’t need to compute the denominator (a quick sketch follows the derivation below), since,
\[\begin{aligned} \operatorname*{arg\,max} _{y} P(y \mid x) &=\operatorname*{arg\,max} _{y} \frac{P(x \mid y) P(y)}{P(x)} \\ &=\operatorname*{arg\,max} _{y} P(x \mid y) P(y) \end{aligned}\]
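  • To make the prediction rule concrete, here is a minimal sketch (not from the notes; the `predict` helper, the class models, and the parameter values are illustrative) of the generative decision rule: score each class by \(P(x \mid y) P(y)\) and return the arg max, skipping the shared denominator \(P(x)\).

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict(x, class_priors, class_models):
    """class_priors: dict y -> P(y); class_models: dict y -> frozen density with a .pdf() method."""
    # Score each class by P(x | y) P(y); P(x) is the same for every class, so it is dropped.
    scores = {y: class_models[y].pdf(x) * class_priors[y] for y in class_priors}
    return max(scores, key=scores.get)  # arg max_y P(x | y) P(y)

# Hypothetical example: two Gaussian class-conditional densities with identity covariance.
models = {0: multivariate_normal(mean=[0, 0]), 1: multivariate_normal(mean=[2, 2])}
priors = {0: 0.7, 1: 0.3}
print(predict(np.array([1.8, 2.1]), priors, models))   # -> 1
```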
  • Key takeaways
    • Discriminative learning algorithms learn \(P(y \mid x)\), i.e., the distribution of the output given an input. In other words, they learn a hypothesis \(h_{\theta}(x) \in \{0, 1\}\) directly.
    • Generative learning algorithms learn \(P(x \mid y)\), i.e., the distribution of features given a class, and \(P(y)\) which is called the class prior. In the tumor-identification setting, where a tumor may be malignant or benign, these models learn the features for each case and use them to make a prediction.

Gaussian discriminant analysis

  • The first generative learning algorithm that we’ll look at is Gaussian discriminant analysis (GDA), which can be used when the input features are continuous-valued, as in, say, tumor classification.
  • In this model, we’ll assume that \(P(x \mid y)\) is distributed according to a multivariate normal distribution. Let’s talk briefly about the properties of multivariate normal distributions before moving on to the GDA model itself.

The multivariate Gaussian distribution

  • The multivariate Gaussian distribution (also called the multivariate normal distribution) is the generalization of the Gaussian distribution from a \(1\)-dimensional random variable to an \(n\)-dimensional random variable. In other words, rather than modeling a single univariate random variable, the multivariate Gaussian models a vector of random variables jointly.
  • Assume \(X\) is distributed as a multivariate Gaussian, i.e., \(X \in \mathbb{R}^{n}\), parameterized by a mean vector \(\mu \in \mathbb{R}^{n}\) and a covariance matrix \(\Sigma \in \mathbb{R}^{n \times n}\), where \(\Sigma \succeq 0\) is symmetric and positive semi-definite. Formally, this is written as,
\[X \sim \mathcal{N}(\mu, \Sigma)\]
  • The probability density function (PDF) of \(X\) is given by:

    \[P(X ; \mu, \Sigma)=\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(X-\mu)^{T} \Sigma^{-1}(X-\mu)\right)\]
    • where \(|\Sigma|\) denotes the determinant of the matrix \(\Sigma\). A quick numerical check of this formula is sketched below.
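  • Below is a sketch (not from the notes; the parameter values are made up) that evaluates the density formula with plain NumPy and compares it against SciPy’s reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    # Direct implementation of the density formula above.
    n = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, 0.2])
print(gaussian_pdf(x, mu, Sigma))              # manual formula
print(multivariate_normal(mu, Sigma).pdf(x))   # should agree to numerical precision
```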
  • The expected value of \(X\) is given by \(\mu\):

\[\mathrm{E}[X]=\int_{x} x P(x ; \mu, \Sigma) d x=\mu\]
  • The covariance of a vector-valued random variable \(X\) is defined as \(\operatorname{Cov}(X)=\) \(\mathrm{E}\left[(X-\mathrm{E}[X])(X-\mathrm{E}[X])^{T}\right]\). Covariance generalizes the notion of the variance of a real-valued random variable to an \(n\)-dimensional setting. The covariance can also be defined as \(\operatorname{Cov}(X)=\) \(\mathrm{E}\left[XX^{T}\right]-(\mathrm{E}[X])(\mathrm{E}[X])^{T}\). Since \(X\) is distributed as a multivariate Gaussian, i.e., \(X \sim \mathcal{N}(\mu, \Sigma)\), \(\operatorname{Cov}(X)=\Sigma\).
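  • As a sanity check on the two facts above, one can sample from \(\mathcal{N}(\mu, \Sigma)\) and confirm empirically that the sample mean approaches \(\mu\) and the sample covariance approaches \(\Sigma\) (a sketch with made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # shape (200000, 2)
print(X.mean(axis=0))            # ≈ mu
print(np.cov(X, rowvar=False))   # ≈ Sigma
```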
  • Here are some examples of what the PDF of a Gaussian distribution looks like. Recall that the \(1\)-dimensional Gaussian is the familiar bell-shaped curve. Similarly, the multivariate Gaussian is the same bell shape in \(n\) dimensions, with the two parameters \(\mu\) and \(\Sigma\) controlling the mean and the covariance of the PDF. For example, for a \(2\)-dimensional Gaussian (or a Gaussian over \(2\) random variables), \(\mu\) is \(2\)-dimensional, while the covariance matrix is of size \(2 \times 2\).
  • The below figure shows a Gaussian with zero mean (i.e., the \(2 \times 1\) zero-vector) and covariance matrix \(\Sigma=I\) (the \(2 \times 2\) identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution.

  • The figure below shows the PDF of a Gaussian with zero mean and \(\Sigma=0.6 I\). We’ve essentially taken the covariance matrix and multiplied it by a number \(< 1\), which has shrunk the variance, i.e., reduced the variability of our distribution.

  • The figure below shows one with \(\Sigma=2 I\).

  • From the above images, we see that as \(\Sigma\) becomes larger, the Gaussian becomes wider and shorter, and as \(\Sigma\) becomes smaller, the distribution becomes more compressed and taller. This is because a PDF always integrates to \(1\), i.e., the total probability mass under the density is always \(1\), so the Gaussian trades off spread against height accordingly (the sketch below checks this numerically).
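  • Here is a crude grid-integration sketch (illustrative, not from the notes) showing that the \(2\)-dimensional Gaussian density integrates to approximately \(1\) whether \(\Sigma\) is shrunk (taller peak) or enlarged (wider spread):

```python
import numpy as np
from scipy.stats import multivariate_normal

xs = np.linspace(-8, 8, 401)
X1, X2 = np.meshgrid(xs, xs)
grid = np.dstack([X1, X2])         # shape (401, 401, 2): a grid of 2-D points
cell_area = (xs[1] - xs[0]) ** 2

for scale in (0.6, 1.0, 2.0):
    pdf = multivariate_normal(mean=[0, 0], cov=scale * np.eye(2)).pdf(grid)
    # Riemann-sum approximation of the integral; each is ≈ 1.0 even though the
    # peak height and the spread change with the scale of Sigma.
    print(scale, pdf.sum() * cell_area)
```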
  • The figures below show Gaussians with mean \(0\) and their corresponding covariance matrices \(\Sigma\):
\[\Sigma=\left[\begin{array}{ll} 1 & 0 \\ 0 & 1\end{array}\right]\]

  • The figure above shows the standard normal distribution.
\[\Sigma=\left[\begin{array}{cc} 1 & 0.5 \\ 0.5 & 1 \end{array}\right]\]

\[\Sigma=\left[\begin{array}{cc} 1 & 0.8 \\ 0.8 & 1 \end{array}\right]\]

  • From the above figures, we see that as we increase the off-diagonal entries in \(\Sigma\), the PDF becomes more “compressed” towards the \(45^{\circ}\) line (given by \(x_{1}=x_{2}\)). Geometrically speaking, this implies that the two random variables are positively correlated. We can see this more clearly when we look at the contours of the same three Gaussian densities instead of these 3-D bumps:
\[\Sigma=\left[\begin{array}{ll} 1 & 0 \\ 0 & 1\end{array}\right]\]

  • Note that the above figure should show perfectly round circles, but the aspect ratio of the image probably makes them look a little bit fatter :)
\[\Sigma=\left[\begin{array}{cc} 1 & 0.5 \\ 0.5 & 1 \end{array}\right]\]

\[\Sigma=\left[\begin{array}{cc} 1 & 0.8 \\ 0.8 & 1 \end{array}\right]\]

  • Thus, as the off-diagonal entries of \(\Sigma\) grow, the PDF places more probability mass on the two random variables taking similar values, i.e., on their being positively correlated.

  • Here’s another set of examples along with their corresponding covariance matrices \(\Sigma\):

\[\Sigma=\left[\begin{array}{cc}1 & -0.5 \\ -0.5 & 1\end{array}\right]\]

\[\Sigma=\left[\begin{array}{cc}1 & -0.8 \\ -0.8 & 1\end{array}\right]\]

  • From the above two figures, we see that by making the off-diagonal elements of the covariance matrix negative, the PDF becomes “compressed” again, but in the opposite direction, i.e., towards the \(135^{\circ}\) line (given by \(x_{1}=-x_{2}\)). Geometrically, this endows the two random variables with negative correlation.
\[\Sigma=\left[\begin{array}{cc}3 & 0.8 \\ 0.8 & 1\end{array}\right]\]

  • From the above figures, we see that as we vary the entries of \(\Sigma\), the contours form ellipses of varying shape and orientation (a plotting sketch is given below).
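  • The following sketch (the plotting code is illustrative, not from the notes) draws the contours of \(\mathcal{N}(0, \Sigma)\) for a few covariance matrices: round contours for \(\Sigma = I\), and tilted ellipses when the off-diagonal entries are non-zero.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

xs = np.linspace(-3, 3, 200)
X1, X2 = np.meshgrid(xs, xs)
grid = np.dstack([X1, X2])

Sigmas = [np.eye(2),
          np.array([[1.0, 0.8], [0.8, 1.0]]),
          np.array([[1.0, -0.8], [-0.8, 1.0]])]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, Sigma in zip(axes, Sigmas):
    ax.contour(X1, X2, multivariate_normal([0, 0], Sigma).pdf(grid))
    ax.set_aspect("equal")   # keeps the Sigma = I contours perfectly round
    ax.set_title(f"off-diagonal = {Sigma[0, 1]:.1f}")
plt.show()
```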
  • Another set of examples with \(\Sigma=I\) are as follows. By varying \(\mu\), we can shift the center of the Gaussian density around:
\[\mu=\left[\begin{array}{l} 1 \\ 0 \end{array}\right]\]

\[\mu=\left[\begin{array}{c} -0.5 \\ 0 \end{array}\right]\]

\[\mu=\left[\begin{array}{c} -1 \\ -1.5 \end{array}\right]\]

  • Note that if you compute the eigendecomposition of the covariance matrix (as in principal component analysis), the eigenvectors point along the principal axes of the ellipses defined by the contours (see the sketch below).
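  • A small sketch (values illustrative) showing that the eigenvectors of \(\Sigma\) give the directions of the principal axes of the elliptical contours, and the eigenvalues give the spread along those axes:

```python
import numpy as np

Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
print(eigvals)    # [0.2, 1.8]: variances along the two principal axes
print(eigvecs)    # columns are (up to sign) the 135° and 45° directions
```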
  • Key takeaway
    • As you vary the mean \(\mu\) and the covariance matrix \(\Sigma\) of the Gaussian density, you change the center of the PDF and its spread, height, and orientation, respectively.

The Gaussian Discriminant Analysis model

  • When we have a classification problem in which the input features \(x\) are continuous-valued random variables, we can use the Gaussian Discriminant Analysis (GDA) model, which models \(P(x \mid y)\) using a multivariate normal distribution.
  • Let’s consider the tumor-identification task for this discussion, and model the features of each class as a Gaussian. This gives two densities, \(P(x \mid y = 0)\) and \(P(x \mid y = 1)\), which model the probability density of a tumor’s features given that it is benign or malignant, respectively. The label \(y\), on the other hand, is a Bernoulli random variable, which takes on values \(\{0, 1\}\). Formally, the model is given by,
\[\begin{aligned} y & \sim \operatorname{Bernoulli}(\phi) \\ x \mid y=0 & \sim \mathcal{N}\left(\mu_{0}, \Sigma\right) \\ x \mid y=1 & \sim \mathcal{N}\left(\mu_{1}, \Sigma\right) \end{aligned}\]
  • Thus, the parameters of our GDA model are \(\phi\), \(\mu_0\), \(\mu_1\), and \(\Sigma\). Note that we used the same covariance matrix \(\Sigma\) for both classes, but different mean vectors \(\mu_0\) and \(\mu_1\). Put simply, we’re assuming that the two Gaussians representing the positive and negative classes have the same covariance matrix but have different means. You can also use different covariance matrices \(\Sigma_0\) and \(\Sigma_1\), but that is not commonly seen.
    • Thus, if you can fit \(\phi\), \(\mu_0\), \(\mu_1\) and \(\Sigma\) to your data, then these parameters will define \(P(x \mid y)\) and \(P(y)\) for your data.
  • Note that \(\phi \in \mathbb{R}\), \(\mu_0 \in \mathbb{R}^n\), \(\mu_1 \in \mathbb{R}^n\), and \(\Sigma \in \mathbb{R}^{n \times n}\).
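  • To make the generative story concrete, here is a minimal sketch (with made-up parameter values) that samples data exactly as the model describes: first draw \(y \sim \operatorname{Bernoulli}(\phi)\), then draw \(x \mid y \sim \mathcal{N}(\mu_{y}, \Sigma)\).

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.3                                      # P(y = 1), e.g. fraction of malignant tumors
mu = {0: np.array([1.0, 1.0]), 1: np.array([3.0, 3.5])}
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])     # shared by both classes

def sample(m):
    y = rng.binomial(1, phi, size=m)                                     # y ~ Bernoulli(phi)
    x = np.stack([rng.multivariate_normal(mu[yi], Sigma) for yi in y])   # x | y ~ N(mu_y, Sigma)
    return x, y

X, y = sample(500)
print(X.shape, y.mean())   # (500, 2); the fraction of 1s is ≈ phi
```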
  • Writing out the distributions, this is:
\[\begin{aligned} P(y) &=\phi^{y}(1-\phi)^{1-y} \\ P(x \mid y=0) &=\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x-\mu_{0}\right)^{T} \Sigma^{-1}\left(x-\mu_{0}\right)\right) \\ P(x \mid y=1) &=\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x-\mu_{1}\right)^{T} \Sigma^{-1}\left(x-\mu_{1}\right)\right) \end{aligned}\]
  • Note that the exponential notation used for \(P(y)\) is similar to what we saw earlier with logistic regression.
  • Plugging the quantities \(P(x \mid y=0)\), \(P(x \mid y=1)\), \(P(y=0)\) and \(P(y=1)\) into the Bayes rule expression from the section on generative learning algorithms, you can classify a tumor as benign or malignant in our particular example.
  • Suppose you have a training set, \(\{(x^{(i)}, y^{(i)})\}_{i=1}^m\). To fit the aforementioned parameters to our training set, we’re going to maximize our joint likelihood \(L(\cdot)\),
\[\begin{aligned} L\left(\phi, \mu_{0}, \mu_{1}, \Sigma\right) &= \prod_{i=1}^{m} P(x^{(i)}, y^{(i)} ; \phi, \mu_{0}, \mu_{1}, \Sigma) \\ &=\prod_{i=1}^{m} P(x^{(i)} \mid y^{(i)} ; \mu_{0}, \mu_{1}, \Sigma) P(y^{(i)} ; \phi) \end{aligned}\]
  • The big difference between the cost functions of a generative learning algorithm compared to a discriminative learning algorithm is that in case of a generative learning algorithm like GDA, you are trying to choose parameters \(\phi, \mu_{0}, \mu_{1}, \Sigma\), that maximize the joint likelihood \(P(x, y; \phi, \mu_{0}, \mu_{1}, \Sigma)\). On the other hand, for discriminative learning algorithms such as linear regression, logistic regression or generalized linear models, you are trying to choose parameters \(\theta\), that maximize the conditional likelihood \(P(y \mid x; \theta)\).
  • The log-likelihood of the data \(\ell(\cdot)\) is simply the log of the likelihood \(L(\cdot)\). Formally,
\[\begin{aligned} \ell\left(\phi, \mu_{0}, \mu_{1}, \Sigma\right) &=\log \prod_{i=1}^{m} P(x^{(i)}, y^{(i)} ; \phi, \mu_{0}, \mu_{1}, \Sigma) \\ &=\log \prod_{i=1}^{m} P(x^{(i)} \mid y^{(i)} ; \mu_{0}, \mu_{1}, \Sigma) P(y^{(i)} ; \phi) \end{aligned}\]
  • To maximize \(\ell(\cdot)\) with respect to the parameters, we take the derivative of \(\ell(\cdot)\), set the derivative equal to \(0\) and then solve for the values of the parameters that maximize the expression for \(\ell(\cdot)\). This yields the maximum likelihood estimate of \(\phi\) to be,
\[\phi =\frac{\sum_{i=1}^{m} y^{(i)}}{m}\]
  • An intuitive explanation of the value of \(\phi\) that maximizes the expression is as follows. Recall that \(\phi\) is the estimated probability of \(y\) being equal to \(1\); in our particular example, \(\phi\) is the chance that the next patient who walks into the doctor’s office has a malignant tumor. Just as the maximum likelihood estimate of the bias of a coin is the fraction of heads observed, the maximum likelihood estimate of \(\phi\) is just the fraction of training examples with label \(y = 1\).
  • Another way to write the above expression for \(\phi\) is using the indicator notation, as follows:
\[\phi =\frac{\sum_{i=1}^{m} \mathbb{1}\left\{y^{(i)}=1\right\}}{m}\]
  • Note that the indicator notation \(\mathbb{1}\{\cdot\}\) represents a function which returns \(1\) if its argument is true, otherwise \(0\). In other words, the indicator of a true statement is equal to \(1\) while that of a false statement is equal to \(0\).
  • The maximum likelihood estimate of the mean vectors \(\mu_0, \mu_1\) is,
\[\begin{aligned} \mu_{0} &=\frac{\sum_{i=1}^{m} \mathbb{1}\left\{y^{(i)}=0\right\} x^{(i)}}{\sum_{i=1}^{m} \mathbb{1}\left\{y^{(i)}=0\right\}} \\ \mu_{1} &=\frac{\sum_{i=1}^{m} \mathbb{1}\left\{y^{(i)}=1\right\} x^{(i)}}{\sum_{i=1}^{m} \mathbb{1}\left\{y^{(i)}=1\right\}} \end{aligned}\]
  • To build intuition around the value of \(\mu_0\) that maximizes the expression, think about what a reasonable estimate of the mean of the benign tumors’ features would be: take all the benign tumors in the training set and average their feature vectors. The formula above writes out exactly that intuition. The denominator, \(\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=0\}\), counts the number of benign examples in the training set, since every example with \(y^{(i)} = 0\) contributes a \(1\) to the sum. In the numerator, the indicator multiplies each feature vector \(x^{(i)}\) by \(1\) if the example is benign and by \(0\) if it is malignant, so the numerator sums up the feature vectors of all the benign examples.
  • The ratio therefore divides the sum of the feature vectors of the benign tumors by the number of benign tumors in the training set, which is just the mean of the feature vectors of the benign examples. The estimate of \(\mu_1\) is analogous, averaging over the malignant examples instead.
  • The maximum likelihood estimate of the covariance matrix \(\Sigma\) is,

\[\Sigma =\frac{1}{m} \sum_{i=1}^{m}\left(x^{(i)}-\mu_{y^{(i)}}\right)\left(x^{(i)}-\mu_{y^{(i)}}\right)^{T}\]

  • Intuitively, the covariance estimate determines the shape and orientation of the elliptical contours: we fit a Gaussian to each class with its own mean, but use a single shared covariance matrix for both classes, computed from each example’s deviation from the mean of its own class. A minimal implementation sketch of these estimators is given below.
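  • Here is a minimal NumPy sketch (variable and function names are mine, not from the notes) implementing the closed-form maximum likelihood estimates above, together with the \(\operatorname{arg\,max}_{y} P(x \mid y) P(y)\) prediction rule, checked on synthetic data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gda(X, y):
    m = X.shape[0]
    phi = y.mean()                                    # fraction of examples with y = 1
    mu0 = X[y == 0].mean(axis=0)                      # mean of the negative examples
    mu1 = X[y == 1].mean(axis=0)                      # mean of the positive examples
    diffs = X - np.where(y[:, None] == 1, mu1, mu0)   # x^(i) - mu_{y^(i)}
    Sigma = diffs.T @ diffs / m                       # shared covariance matrix
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    p0 = multivariate_normal(mu0, Sigma).pdf(X) * (1 - phi)
    p1 = multivariate_normal(mu1, Sigma).pdf(X) * phi
    return (p1 > p0).astype(int)                      # arg max_y P(x | y) P(y)

# Quick check on synthetic data drawn from the model itself (made-up parameters).
rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=1000)
X = np.stack([rng.multivariate_normal([3.0, 3.5] if yi else [1.0, 1.0],
                                      [[1.0, 0.3], [0.3, 1.0]]) for yi in y])
phi, mu0, mu1, Sigma = fit_gda(X, y)
print(phi, mu0, mu1)                                        # ≈ 0.3, ≈ [1, 1], ≈ [3, 3.5]
print((predict_gda(X, phi, mu0, mu1, Sigma) == y).mean())   # training accuracy
```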
  • Pictorially, the algorithm’s operation can be seen as follows:

  • Shown in the figure is the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours that are the same shape and orientation, since they share a covariance matrix \(\Sigma\), but they have different means \(\mu_{0}\) and \(\mu_{1}\). Also shown in the figure is the straight line giving the decision boundary at which \(P(y=1 \mid x)=0.5\). On one side of the boundary, we’ll predict \(y=1\) to be the most likely outcome, and on the other side, we’ll predict \(y=0\).

Discussion: GDA and logistic regression

  • The GDA model has an interesting relationship to logistic regression. If we view the quantity \(P(y=1 \mid x ; \phi, \mu_{0}, \mu_{1}, \Sigma)\) as a function of \(x\), we’ll find that it can be expressed in the form

    \[P(y=1 \mid x ; \phi, \Sigma, \mu_{0}, \mu_{1})=\frac{1}{1+\exp (-\theta^{T} x)}\]
    • where \(\theta\) is some appropriate function of \(\phi, \Sigma, \mu_{0}, \mu_{1}\). Note that this uses the convention of redefining the \(x^{(i)}\)’s on the right-hand side to be \((n+1)\)-dimensional vectors by adding the extra coordinate \(x_{0}^{(i)}=1\). This is exactly the form that logistic regression (a discriminative algorithm) uses to model \(P(y=1 \mid x)\). A numerical check of this claim is sketched below.
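  • The sketch below verifies the claim numerically: with a shared \(\Sigma\), the posterior computed via Bayes rule matches a sigmoid applied to a linear function of \(x\). The closed-form expression for \(\theta\) used here is a standard identity, not something derived in these notes, and the parameter values are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up GDA parameters.
phi = 0.3
mu0, mu1 = np.array([1.0, 1.0]), np.array([3.0, 3.5])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sinv = np.linalg.inv(Sigma)

# theta as a function of (phi, Sigma, mu0, mu1); theta0 plays the role of the x_0 = 1 intercept.
theta = Sinv @ (mu1 - mu0)
theta0 = 0.5 * mu0 @ Sinv @ mu0 - 0.5 * mu1 @ Sinv @ mu1 + np.log(phi / (1 - phi))

x = np.array([2.0, 2.5])
p1 = multivariate_normal(mu1, Sigma).pdf(x) * phi
p0 = multivariate_normal(mu0, Sigma).pdf(x) * (1 - phi)
print(p1 / (p0 + p1))                            # posterior P(y = 1 | x) via Bayes rule
print(1 / (1 + np.exp(-(theta @ x + theta0))))   # the same value via the logistic form
```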
  • When would we prefer one model over another? GDA and logistic regression will, in general, give different decision boundaries when trained on the same dataset. Which is better?
  • We just argued that if \(P(x \mid y)\) is multivariate Gaussian (with shared \(\Sigma\)), then \(P(y \mid x)\) necessarily follows a logistic function. The converse, however, is not true; i.e., \(P(y \mid x)\) being a logistic function does not imply \(P(x \mid y)\) is multivariate Gaussian. This shows that GDA makes stronger modeling assumptions about the data than does logistic regression. It turns out that when these modeling assumptions are correct, then GDA will find better fits to the data, and is a better model. Specifically, when \(P(x \mid y)\) is indeed Gaussian (with shared \(\Sigma\)), then GDA is asymptotically efficient.
  • Informally, this means that in the limit of very large training sets (large \(m\)), there is no algorithm that is strictly better than GDA (in terms of, say, how accurately they estimate \(P(y \mid x)\)).
    • In particular, it can be shown that in this setting, GDA will be a better algorithm than logistic regression; and more generally, even for small training set sizes, we would generally expect GDA to do better.
  • In contrast, by making significantly weaker assumptions, logistic regression is also more robust and less sensitive to incorrect modeling assumptions. There are many different sets of assumptions that would lead to \(P(y \mid x)\) taking the form of a logistic function.
    • For example, if \(x \mid y=0 \sim \operatorname{Poisson}(\lambda_{0})\) and \(x \mid y=1 \sim \operatorname{Poisson}(\lambda_{1})\), then \(P(y \mid x)\) will be logistic. Logistic regression will also work well on Poisson data like this. But if we were to use GDA on such data, fitting Gaussian distributions to non-Gaussian data, then the results will be less predictable, and GDA may (or may not) do well.
  • Key takeaways
    • Compared to, say, logistic regression for classification, GDA is a simpler and, in some cases, more computationally efficient algorithm to implement, since its parameters have closed-form maximum likelihood estimates. On small datasets, GDA may be a better choice than logistic regression.
    • GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to learn “well”) when the modeling assumptions are correct or at least approximately correct.
    • Logistic regression makes weaker assumptions, and is significantly more robust to deviations from modeling assumptions. Specifically, when the data is indeed non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA. For this reason, in practice logistic regression is used more often than GDA.
    • Some related considerations about discriminative vs. generative models also apply for the Naive Bayes algorithm that we discuss next, but the Naive Bayes algorithm is still considered a very good, and is certainly also a very popular, classification algorithm.

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledGaussianDiscriminantAnalysis,
  title   = {Gaussian Discriminant Analysis},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS229: Machine Learning},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}