Primers • Naive Bayes
- Overview
- From Bayes’ theorem \(\rightarrow\) Naive Bayes
- Example: Applying Bayes’ Theorem with All Terms
- Key Concepts
- Loss Function
- Applications
- Pros and Cons
- Example
- Citation
Overview
- Naive Bayes is a family of probabilistic algorithms based on Bayes’ theorem, with the “naive” assumption of conditional independence between every pair of features given the class label. It is widely used for classification tasks due to its simplicity, efficiency, and interpretability.
- Naive Bayes algorithms are primarily used in supervised learning tasks for classification problems. The approach applies Bayes’ theorem to calculate the probability of a given sample belonging to each possible class and assigns the sample to the class with the highest posterior probability.
- Despite its simplifying assumptions, Naive Bayes remains a robust and efficient tool for a wide range of practical applications, particularly in domains like text analysis and medical diagnostics. Its simplicity and interpretability make it a popular choice for baseline models and quick analyses.
From Bayes’ theorem \(\rightarrow\) Naive Bayes
- Bayes’ theorem is expressed as:

\[P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}\]

- where:
- \(P(C \mid X)\) is the posterior probability of the class \(C\) given the data \(X\).
- \(P(X \mid C)\) is the likelihood of data \(X\) given class \(C\).
- \(P(C)\) is the prior probability of class \(C\).
- \(P(X)\) is the marginal probability of data \(X\).
- The “naive” assumption simplifies the likelihood \(P(X \mid C)\) by assuming that the features are conditionally independent given the class. For \(n\) features \(x_1, x_2, \dots, x_n\), this assumption allows us to express:

\[P(X \mid C) = P(x_1, x_2, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)\]

- The classifier then assigns a sample to the class that maximizes the resulting posterior, i.e., the class \(C\) with the largest \(P(C) \prod_{i=1}^{n} P(x_i \mid C)\).
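- To make the factorization concrete, the minimal sketch below scores a binary feature vector against two classes; the priors, per-feature likelihood tables, and input are hypothetical illustrative numbers, not values from this primer.

```python
import numpy as np

# Hypothetical priors and per-feature likelihood tables P(x_i = 1 | C)
# for a two-class problem with three binary features.
priors = {"C1": 0.4, "C2": 0.6}
likelihoods = {
    "C1": [0.8, 0.3, 0.5],
    "C2": [0.2, 0.6, 0.4],
}

def posterior_scores(x, priors, likelihoods):
    """Unnormalized posteriors P(C) * prod_i P(x_i | C) for a binary vector x."""
    scores = {}
    for c, prior in priors.items():
        per_feature = [p if xi == 1 else 1 - p for xi, p in zip(x, likelihoods[c])]
        scores[c] = prior * np.prod(per_feature)
    return scores

x = [1, 0, 1]                                   # observed feature vector
scores = posterior_scores(x, priors, likelihoods)
total = sum(scores.values())                    # plays the role of P(X)
print({c: s / total for c, s in scores.items()})
print(max(scores, key=scores.get))              # predicted class
```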
Example: Applying Bayes’ Theorem with All Terms
- To understand the terms in Bayes’ theorem, consider a simple spam email classification problem. Suppose we want to classify an email as either “Spam” (S) or “Not Spam” (\(\neg S\)) based on a feature \(X\), which is the presence of the word “offer” in the email.
Given Data
- Prior probabilities:
- \(P(S) = 0.2\) (20% of emails are spam).
- \(P(\neg S) = 0.8\) (80% of emails are not spam).
- Likelihoods:
- \(P(X \mid S) = 0.9\) (90% of spam emails contain the word “offer”).
- \(P(X \mid \neg S) = 0.1\) (10% of non-spam emails contain the word “offer”).
- Marginal probability:
- \(P(X) = P(X \mid S)P(S) + P(X \mid \neg S)P(\neg S) = (0.9 \cdot 0.2) + (0.1 \cdot 0.8) = 0.18 + 0.08 = 0.26.\)
Calculating Posterior Probabilities
- Using Bayes’ theorem:
- For Spam: \(P(S \mid X) = \frac{P(X \mid S)P(S)}{P(X)} = \frac{0.9 \cdot 0.2}{0.26} = \frac{0.18}{0.26} \approx 0.692.\)
- For Not Spam: \(P(\neg S \mid X) = \frac{P(X \mid \neg S)P(\neg S)}{P(X)} = \frac{0.1 \cdot 0.8}{0.26} = \frac{0.08}{0.26} \approx 0.308.\)
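- As a quick sanity check, the short sketch below reproduces these numbers directly from the stated priors and likelihoods.

```python
# Numerical check of the worked example above.
p_spam, p_not_spam = 0.2, 0.8              # priors P(S), P(not S)
p_offer_given_spam = 0.9                   # likelihood P(X | S)
p_offer_given_not_spam = 0.1               # likelihood P(X | not S)

# Marginal probability P(X) via the law of total probability.
p_offer = p_offer_given_spam * p_spam + p_offer_given_not_spam * p_not_spam

# Posteriors via Bayes' theorem.
posterior_spam = p_offer_given_spam * p_spam / p_offer
posterior_not_spam = p_offer_given_not_spam * p_not_spam / p_offer

print(p_offer)              # 0.26
print(posterior_spam)       # ~0.692
print(posterior_not_spam)   # ~0.308
```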
Conclusion
- The posterior probabilities indicate that the email is more likely to be spam (\(P(S \mid X) \approx 0.692\)) than not spam (\(P(\neg S \mid X) \approx 0.308\)). Thus, the email would be classified as spam based on this model.
- This example demonstrates the roles of all terms in Bayes’ theorem:
- Prior probabilities (\(P(S)\) and \(P(\neg S)\)) represent the base rates of each class.
- Likelihoods (\(P(X \mid S)\) and \(P(X \mid \neg S)\)) quantify the probability of observing the feature given each class.
- Marginal probability (\(P(X)\)) ensures proper normalization of the posterior probabilities.
- Posterior probability (\(P(S \mid X)\)) provides the final classification decision.
Key Concepts
Bayesian Probability
- The foundation of Naive Bayes lies in Bayesian probability, where we update our prior beliefs based on new evidence.
Conditional Independence Assumption
- The algorithm assumes that features are independent given the class. This assumption simplifies computation but may not always hold in real-world data.
Types of Naive Bayes Classifiers
- Depending on the type of data, different variations of Naive Bayes are used:
- Gaussian Naive Bayes: Assumes continuous data is normally distributed.
- Multinomial Naive Bayes: Suitable for discrete count data, such as word frequencies in text classification.
- Bernoulli Naive Bayes: Suitable for binary/boolean features.
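- As an illustration, the sketch below runs the three variants from scikit-learn on small synthetic datasets; the data-generating choices are arbitrary and only meant to match each variant’s assumed feature type.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                      # binary class labels

# Gaussian NB: continuous features, assumed normally distributed per class.
X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(200, 3))
print(GaussianNB().fit(X_cont, y).score(X_cont, y))

# Multinomial NB: non-negative count features (e.g., word counts).
X_counts = rng.poisson(lam=2 + 3 * y[:, None], size=(200, 5))
print(MultinomialNB().fit(X_counts, y).score(X_counts, y))

# Bernoulli NB: binary presence/absence features.
X_bin = (rng.random((200, 4)) < 0.3 + 0.4 * y[:, None]).astype(int)
print(BernoulliNB().fit(X_bin, y).score(X_bin, y))
```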
Parameter Estimation
- Parameters \(P(C)\) (prior) and \(P(x_i \mid C)\) (likelihood) are estimated using Maximum Likelihood Estimation (MLE), which counts the occurrences in the training data to compute probabilities.
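- A minimal counting-based sketch of this estimation on a hypothetical categorical toy dataset might look as follows.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (categorical feature values, class label).
data = [
    ({"weather": "sunny", "wind": "weak"},   "play"),
    ({"weather": "sunny", "wind": "strong"}, "stay"),
    ({"weather": "rainy", "wind": "weak"},   "play"),
    ({"weather": "rainy", "wind": "strong"}, "stay"),
    ({"weather": "sunny", "wind": "weak"},   "play"),
]

# MLE for the prior P(C): relative frequency of each class.
class_counts = Counter(label for _, label in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# MLE for the likelihood P(x_i | C): relative frequency of each feature
# value among the samples of that class.
value_counts = defaultdict(Counter)
for features, label in data:
    for name, value in features.items():
        value_counts[(label, name)][value] += 1

likelihoods = {
    key: {v: n / sum(counts.values()) for v, n in counts.items()}
    for key, counts in value_counts.items()
}

print(priors)                            # {'play': 0.6, 'stay': 0.4}
print(likelihoods[("play", "weather")])  # {'sunny': ~0.67, 'rainy': ~0.33}
```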
Loss Function
- The goal of Naive Bayes is to maximize the posterior probability. Alternatively, we can minimize a loss function based on the log-likelihood; the negative log-likelihood (NLL) is commonly used as the loss function:

\[\text{NLL} = - \sum_{i=1}^{N} \log P(C_i \mid X_i)\]

- where \(N\) is the number of training samples, \(C_i\) is the true class label for sample \(i\), and \(P(C_i \mid X_i)\) is the predicted probability for the correct class.
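- A short sketch of computing the NLL from predicted class probabilities; the probability values below are hypothetical stand-ins for a model’s output.

```python
import numpy as np

# Hypothetical predicted class probabilities for 4 samples over 2 classes,
# e.g., as produced by a Naive Bayes model after normalization.
probs = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
    [0.8, 0.2],
    [0.4, 0.6],
])
y_true = np.array([0, 1, 0, 0])   # true class indices C_i

# NLL = -sum_i log P(C_i | X_i): pick each sample's probability of its true class.
nll = -np.sum(np.log(probs[np.arange(len(y_true)), y_true]))
print(nll)   # -(log 0.9 + log 0.7 + log 0.8 + log 0.4)
```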
Applications
Text Classification and Sentiment Analysis
- Naive Bayes is frequently used in Natural Language Processing (NLP) tasks such as spam detection, sentiment analysis, and document classification due to its efficiency with high-dimensional data.
Medical Diagnosis
- Useful for predicting diseases based on symptoms, assuming conditional independence between symptoms.
Email Spam Filtering
- Detects spam emails by analyzing the frequency of specific words in the email content.
Recommendation Systems
- Predicts user preferences based on past behaviors.
Real-time Prediction
- Due to its computational efficiency, Naive Bayes is effective in real-time applications.
Pros and Cons
Pros
- Simplicity: Easy to implement and interpret.
- Efficiency: Computationally fast, especially with large datasets.
- Scalability: Handles high-dimensional data well, such as text data.
- No Iterative Parameter Estimation: Training involves simple computations.
Cons
- Conditional Independence Assumption: Rarely holds in practice, which may degrade performance.
- Limited Expressiveness: Assumes simple relationships between features and the target variable.
- Data Imbalance: Sensitive to imbalanced datasets, as probabilities can be dominated by frequent classes.
- Feature Engineering: May require careful preprocessing and feature selection for optimal performance.
Example
Problem: Spam Detection
- Suppose we have two classes: \(C_1\) (Spam) and \(C_2\) (Not Spam). The features are words in an email (\(X = \{x_1, x_2, x_3, \dots, x_n\}\)).
Step 1: Compute Priors
\[P(C_1) = \frac{\text{Number of Spam Emails}}{\text{Total Emails}}\]

\[P(C_2) = \frac{\text{Number of Not Spam Emails}}{\text{Total Emails}}\]

Step 2: Compute Likelihoods
- For each word \(x_i\), compute \(P(x_i \mid C_1)\) and \(P(x_i \mid C_2)\) using word frequencies.
Step 3: Compute Posterior
- For a new email with words \(x_1, x_2, \dots, x_n\), compute the (unnormalized) posterior for each class:

\[P(C_k \mid X) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k), \quad k \in \{1, 2\}\]
Step 4: Classification
- Assign the email to the class with the highest posterior probability.
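- Putting the four steps together, a from-scratch sketch might look like the following. The toy emails are hypothetical, the posterior is computed in log space for numerical stability, and add-one (Laplace) smoothing is an added assumption (not discussed above) used so that unseen word-class pairs do not zero out the product.

```python
import math
from collections import Counter

# Hypothetical toy training corpus: (email text, label).
train = [
    ("win money now offer", "spam"),
    ("limited offer win prize", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule for tomorrow", "not_spam"),
    ("project deadline and meeting notes", "not_spam"),
    ("lunch tomorrow with the team", "not_spam"),
]

# Step 1: priors P(C) from class frequencies.
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Step 2: likelihoods P(x_i | C) from word frequencies (with add-one smoothing).
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def likelihood(word, c):
    return (word_counts[c][word] + 1) / (sum(word_counts[c].values()) + len(vocab))

# Steps 3 and 4: log-posterior for each class, then pick the argmax.
def classify(text):
    scores = {}
    for c in priors:
        log_score = math.log(priors[c])
        for word in text.split():
            if word in vocab:                  # ignore out-of-vocabulary words
                log_score += math.log(likelihood(word, c))
        scores[c] = log_score
    return max(scores, key=scores.get), scores

print(classify("free money offer now"))        # expected to lean toward "spam"
print(classify("team meeting tomorrow"))       # expected to lean toward "not_spam"
```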
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledNaiveBayes,
title = {Naive Bayes},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}