Introduction

  • Linear regression is a fundamental and widely used statistical technique in both machine learning and statistical analysis. It models the relationship between a dependent (or target) variable and one or more independent (or explanatory) variables, and uses that relationship to predict the target. The method remains popular due to its simplicity, interpretability, and its foundational role in understanding more complex models; despite the growing sophistication of machine learning algorithms, linear regression is still an essential tool for predictive modeling and explanatory analysis across many domains.
  • For linear regression to produce valid and reliable results, several assumptions must be met. The main assumptions are listed below.

Assumptions of Linear Regression

  1. Linearity:
    • The relationship between the independent variables (predictors) and the dependent variable (outcome) is assumed to be linear. Specifically, the expected value of the dependent variable is a linear function of the independent variables.
  2. Independence of Errors:
    • The residuals (the differences between the observed and predicted values) should be independent of each other. This assumption means there should be no autocorrelation (particularly important in time series data).
  3. Homoscedasticity (Constant Variance of Errors):
    • The residuals should have constant variance at all levels of the independent variables. This means that the spread of the residuals should be similar across the range of predicted values. If there is heteroscedasticity, it suggests that the model is performing better for some values than others.
  4. Normality of Errors:
    • The residuals should be normally distributed. This assumption is especially important when conducting hypothesis testing (e.g., calculating p-values or confidence intervals).
  5. No Multicollinearity:
    • The independent variables should not be highly correlated with each other. If multicollinearity exists, it becomes difficult to determine the individual effect of each predictor.
  6. No Endogeneity (No Correlation Between Errors and Predictors):
    • The independent variables should not be correlated with the error term. This assumption ensures that the predictors are truly independent of the error term and are not capturing any omitted variable bias.
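In practice, several of these assumptions can be checked from the fitted model's residuals. Below is a minimal, illustrative sketch using statsmodels and scipy on synthetic data (the data, variable names, and chosen tests are assumptions of this example, not prescribed by the text): the Durbin-Watson statistic probes independence of errors, the Breusch-Pagan test probes homoscedasticity, and a Shapiro-Wilk test probes normality of the residuals.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# Illustrative data: X is an (n, p) feature matrix, y an (n,) target vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_const = sm.add_constant(X)           # add the intercept column
results = sm.OLS(y, X_const).fit()
resid = results.resid

# Independence of errors: Durbin-Watson near 2 suggests little autocorrelation.
print("Durbin-Watson:", durbin_watson(resid))

# Homoscedasticity: Breusch-Pagan test (a small p-value hints at heteroscedasticity).
_, bp_pvalue, _, _ = het_breuschpagan(resid, X_const)
print("Breusch-Pagan p-value:", bp_pvalue)

# Normality of errors: Shapiro-Wilk test on the residuals.
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
```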

The General Equation of Linear Regression

  • At the heart of linear regression is the equation that represents the relationship between the target variable and one or more predictors:

    \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon\]
    • where:
      • \(y\): Dependent (or target) variable we aim to predict.
      • \(x_1, x_2, \dots, x_p\): Independent (or explanatory) variables.
      • \(\beta_0\): Intercept term.
      • \(\beta_1, \beta_2, \dots, \beta_p\): Coefficients (or weights) that quantify the impact of each predictor \(x_i\) on \(y\).
      • \(\epsilon\): Error term representing the irreducible noise or unmodeled aspects of the data.
  • The goal of linear regression is to estimate the coefficients (\(\beta\)) that minimize the prediction error for the dependent variable based on the given set of independent variables.
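To make the equation concrete, the short NumPy sketch below generates data from a known linear model with two predictors plus Gaussian noise. The coefficient values are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two illustrative predictors.
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 5, size=n)

# "True" (in practice unknown) parameters: beta_0, beta_1, beta_2.
beta0, beta1, beta2 = 3.0, 1.5, -2.0

# y = beta_0 + beta_1*x1 + beta_2*x2 + epsilon
epsilon = rng.normal(scale=1.0, size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + epsilon
```

Fitting a linear regression to this data should recover coefficient estimates close to the values used to generate it.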

Key Concepts

  • Supervised Learning Algorithm: Linear regression is a supervised learning technique, meaning it learns the relationship between input variables and a labeled output variable from training data.

  • Prediction: Once a linear regression model is fitted with appropriate coefficients, it can be used to predict \(y\) for new data by simply plugging in the values of the predictors.

  • Simple vs. Multiple Regression:

    • Simple linear regression involves a single independent variable.
    • Multiple linear regression involves two or more independent variables.

For example, predicting house prices based on square footage is a case of simple linear regression, while predicting weight based on both height and age would be a multiple regression scenario.

Model Fitting: Estimating the Coefficients

The primary task in linear regression is to find the set of coefficients (\(\beta_0, \beta_1, \dots, \beta_p\)) that minimize the difference between predicted and actual values. This is achieved by minimizing the sum of squared errors (SSE), i.e., the sum of the squared residuals (the differences between the observed values and the values predicted by the regression line).

  • The fitted model can then be used to make predictions using the following equation:

    \[\hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 + \hat{\beta_2} x_2 + \dots + \hat{\beta_p} x_p\]
    • where \(\hat{y}\) represents the predicted values and \(\hat{\beta_i}\) are the estimated coefficients.
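A minimal sketch of fitting and predicting with scikit-learn's LinearRegression; the feature matrix and target values here are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: house size (sq ft), number of bedrooms, and sale price.
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]])
y = np.array([245000, 312000, 279000, 308000, 405000])

model = LinearRegression().fit(X, y)
print("Intercept (beta_0 hat):", model.intercept_)
print("Coefficients (beta hats):", model.coef_)

# Prediction: plug new predictor values into the fitted equation.
print("Predicted price:", model.predict(np.array([[2000, 4]])))
```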

Evaluation of Linear Regression Models

Evaluating a linear regression model involves assessing how well the model fits the data and how accurate its predictions are.

Mean Squared Error (MSE)

  • MSE is a common loss function for regression models. It is calculated by taking the average of the squared differences between the actual and predicted values:
\[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2\]
  • Interpretation: A lower MSE indicates that the model is making predictions closer to the actual values.

R-Squared (Coefficient of Determination)

R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1, where 1 indicates that the model perfectly explains the variance in the data.

\[R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\]
  • Interpretation: An \(R^2\) of 0.8, for example, means that 80% of the variance in the target variable is explained by the model.
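Both metrics can be computed by hand or via scikit-learn's metrics module, as in the sketch below (the actual and predicted values are made-up numbers for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# MSE: average squared difference between actual and predicted values.
mse = mean_squared_error(y_true, y_pred)

# R^2: 1 - (residual sum of squares / total sum of squares).
r2 = r2_score(y_true, y_pred)

# Equivalent manual computations of the two formulas above.
mse_manual = np.mean((y_true - y_pred) ** 2)
r2_manual = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, r2)
```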

Optimization Techniques: Finding the Best Coefficients

Gradient Descent

  • Gradient descent is an iterative optimization algorithm used to minimize the cost function (MSE). It works by calculating the gradient (or slope) of the error function and updating the coefficients in the direction that minimizes this error. It is particularly useful when working with large datasets or many features, as it is computationally efficient.

  • The update rule in gradient descent for linear regression is:

    \[\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} MSE\]
    • where:
      • \(\alpha\) is the learning rate (a small positive number that controls the step size).
      • \(\frac{\partial}{\partial \beta_j} MSE\) represents the derivative of the cost function with respect to the coefficient \(\beta_j\).
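A minimal NumPy sketch of batch gradient descent on the MSE for linear regression; the learning rate, iteration count, and synthetic data are arbitrary choices for this example.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent on MSE; X is (n, p), y is (n,).
    Returns (intercept, coefficients)."""
    n, p = X.shape
    Xb = np.column_stack([np.ones(n), X])   # prepend a column of ones for the intercept
    beta = np.zeros(p + 1)
    for _ in range(n_iters):
        residuals = Xb @ beta - y           # predicted minus actual values
        grad = (2 / n) * Xb.T @ residuals   # gradient of MSE with respect to beta
        beta -= alpha * grad                # step opposite to the gradient
    return beta[0], beta[1:]

# Illustrative usage on synthetic data; should land near intercept 4.0, coefficients [3.0, -2.0].
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4.0 + X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y, alpha=0.05, n_iters=5000))
```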

Normal Equation

  • For smaller datasets or problems where computational complexity is not a concern, the normal equation offers a closed-form solution to finding the best coefficients. The normal equation is derived by minimizing the residual sum of squares (RSS) and is given by:

    \[\hat{\beta} = (X^T X)^{-1} X^T y\]
    • where \(X\) is the matrix of input features, and \(y\) is the vector of target values.
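The closed-form solution translates almost directly into NumPy; this sketch uses np.linalg.solve on the normal equations rather than an explicit matrix inverse, which is the numerically safer way to evaluate the same expression. The synthetic data is illustrative.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) beta = X^T y for the OLS coefficients (intercept included)."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add the intercept column
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Illustrative usage: should recover roughly [4.0, 3.0, -2.0].
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4.0 + X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=200)
print(normal_equation(X, y))
```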

Extensions and Interpretations

  • Linear regression can be extended in various ways, including:
    • Regularized Regression: Adds penalty terms to the cost function to prevent overfitting (e.g., Lasso and Ridge regression).
    • Polynomial Regression: Models non-linear relationships by adding polynomial terms of the independent variables.
    • Multivariate Regression: Used when there are multiple dependent variables modeled jointly from the same set of predictors (in contrast to multiple regression, which has several independent variables but a single dependent variable).
  • Interpreting the coefficients in these extended models follows the same principles, but the presence of multiple predictors or transformations may require more nuanced interpretation of the model’s behavior.
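A brief scikit-learn sketch of two of these extensions, regularized regression (Ridge and Lasso) and polynomial regression; the penalty strengths, polynomial degree, and synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Illustrative data with a mildly non-linear (quadratic) relationship.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.3, size=100)

# Regularized regression: penalty strength controlled by alpha.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Polynomial regression: expand the features, then fit an ordinary linear model.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(ridge.coef_, lasso.coef_, poly_model.predict([[1.5]]))
```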

FAQs

Does Linear Regression Assume a Linear Relationship Between the Features and the Outcome?

  • Yes, linear regression assumes that the relationship between the independent variables (features) and the outcome variable is linear. However, this does not mean that the predictors must enter the model in exactly the form in which they were collected (i.e., without transformation). You can include non-linear transformations of the predictors in a linear regression model.
  • For example:
    • You can include polynomial terms (like \(x^2\), \(x^3\)) or interaction (cross) terms (like \(x_1 \times x_2\)) in your model. As long as the model is linear in the coefficients, it remains a “linear regression” model.
  • For instance, if the relationship is quadratic (e.g., \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon\)), this is still considered a linear regression model because it is linear in the parameters \(\beta_0, \beta_1, \beta_2\).

How Does The Linearity Assumption Hold with Non-Linear (Cross) Features?

  • Non-linear terms or cross features (interaction terms) like \(x_1 \times x_2\) do not violate the assumptions of linear regression. Even though the individual features may have non-linear relationships with the outcome variable (e.g., through interactions or polynomial terms), the model is still considered linear as long as it remains linear in the coefficients.
  • For example:
    • Suppose you have an interaction term \(x_1 \times x_2\). The resulting model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) + \epsilon\) is still linear in terms of the parameters \(\beta_0, \beta_1, \beta_2, \beta_3\), so the linear regression framework still applies.
  • What matters is that the outcome is modeled as a linear combination of the predictors, even if those predictors themselves are non-linear functions of the original data.
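To make this concrete, the sketch below builds the interaction term explicitly and fits it with an ordinary linear model; the model stays linear in its coefficients even though \(x_1 \times x_2\) is a non-linear function of the raw features. The coefficient values and synthetic data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)

# Data generated with a true interaction effect (illustrative coefficients).
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * (x1 * x2) + rng.normal(scale=0.2, size=300)

# Design matrix with the cross feature added as just another column.
X = np.column_stack([x1, x2, x1 * x2])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # approximately 1.0 and [2.0, -1.0, 3.0]
```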

Summary

  • Linearity assumption refers to the linearity in the model parameters, not necessarily the raw data. You can introduce transformations or non-linear features (e.g., cross terms, polynomial terms) as long as the relationship between the outcome and the parameters remains linear.
  • The inclusion of non-linear features (like cross features or polynomial terms) does not violate the assumptions of linear regression, as the model is still “linear” in its parameters.

What is Multicollinearity?

  • Multicollinearity arises when two or more independent variables in a linear regression model are highly correlated, resulting in overlapping or redundant information. In such cases, the model struggles to isolate the individual effects of each predictor on the dependent variable, leading to unreliable coefficient estimates, inflated standard errors, and challenges in interpretation.
  • This presents a significant limitation in linear regression, as it compromises the model’s capacity to accurately estimate the effects of the predictors.
  • Multicollinearity can be identified using diagnostic tools such as the Variance Inflation Factor (VIF) and addressed through strategies like eliminating correlated variables, applying regularization techniques, or employing dimensionality reduction methods, such as Principal Component Analysis (PCA).

How Does Multicollinearity Affect Linear Regression?

  1. Unstable Coefficients (Inflated Standard Errors):
    • Multicollinearity causes the estimates of the regression coefficients to become very sensitive to small changes in the model or data. This leads to inflated standard errors, making the coefficients unreliable. As a result, a variable that might have a significant relationship with the dependent variable could appear statistically insignificant.
  2. Difficult Interpretation:
    • When predictors are highly correlated, it becomes harder to interpret the individual coefficients. For instance, if two variables are almost perfectly correlated, the model struggles to assign appropriate weight to each variable, making the individual coefficients unreliable or counterintuitive.
  3. Reduces Statistical Power:
    • The presence of multicollinearity reduces the precision of the estimated coefficients, which can decrease the statistical power of hypothesis tests. This makes it harder to detect significant effects that truly exist.
  4. High Variance Inflation Factor (VIF):
    • The Variance Inflation Factor (VIF) is often used to detect multicollinearity. If the VIF for any predictor is large (usually greater than 5 or 10), it indicates the presence of multicollinearity. A high VIF means the variable is highly correlated with other predictors.

Is Multicollinearity a Limitation of Linear Regression?

  • Yes, multicollinearity is considered a limitation of linear regression. However, it is not an inherent limitation of the model itself but rather a problem in the data that can affect the performance and interpretability of the linear regression model.

  • Here’s how it limits linear regression:

    • Unreliable Coefficients: High multicollinearity can make the regression coefficients unreliable. This undermines the predictive power of the model and leads to difficulties in drawing meaningful conclusions.

    • Difficulty in Interpretation: When variables are highly correlated, it becomes hard to tell which one is driving the effect on the dependent variable. This makes it difficult to understand the relationship between predictors and the outcome.

    • Inflated Variance of Predictions: Multicollinearity increases the variability of the coefficient estimates, which in turn can increase the variability in predictions, making the model less generalizable.

How to Detect and Address Multicollinearity?

  1. Variance Inflation Factor (VIF):
    • A common diagnostic tool is the VIF, which measures how much the variance of a regression coefficient is inflated due to multicollinearity. A VIF greater than 5 or 10 indicates high multicollinearity.
  2. Correlation Matrix:
    • Examining a correlation matrix of the independent variables can reveal pairs of variables that are highly correlated (correlation close to 1 or -1). This is an indication of potential multicollinearity.
  3. Drop One of the Correlated Variables:
    • If two or more variables are highly correlated, consider dropping one of them. This can simplify the model and reduce multicollinearity.
  4. Principal Component Analysis (PCA):
    • PCA can transform the correlated variables into a set of uncorrelated components, which can be used in regression analysis. This reduces the dimensionality of the data and avoids multicollinearity.
  5. Ridge Regression or Lasso Regression:
    • These are regularization techniques that can help mitigate the effects of multicollinearity. Ridge regression adds a penalty to the size of the coefficients, which reduces their sensitivity to multicollinearity. Lasso regression goes a step further and can shrink some coefficients to zero, effectively selecting a subset of the predictors.
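A minimal sketch of the first two diagnostics, a correlation matrix and VIF via statsmodels, applied to deliberately collinear synthetic data (the data and variable names are assumptions of this example):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative data in which x2 is nearly a copy of x1.
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Correlation matrix: values near +/-1 flag potential multicollinearity.
print(X.corr())

# VIF per predictor, computed on the design matrix with an intercept.
X_const = sm.add_constant(X)
for i, name in enumerate(X.columns, start=1):
    print(name, variance_inflation_factor(X_const.values, i))
```

With this data, x1 and x2 would show large VIFs, while x3 stays close to 1.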

Logistic Regression

  • Logistic regression is a fundamental and powerful supervised learning algorithm widely used for binary classification tasks. In machine learning, supervised learning involves training a model on input-output pairs to learn patterns that enable predictions on unseen data. Logistic regression specifically predicts the probability of a categorical outcome based on input features, where the outcome belongs to one of two classes (e.g., “rainy” vs. “sunny” or “success” vs. “failure”). Although logistic regression can be extended to multiple classes, its most common application is in binary classification.
  • At its core, logistic regression estimates the probability that a given observation falls into a particular class. This probability is derived using a sigmoid function, which converts the output of a linear equation into a probability value constrained between 0 and 1. The relationship between the input features and the binary response variable is modeled through this sigmoid function, allowing for the output to be interpreted as the likelihood of an observation belonging to a particular class.
  • Logistic regression finds the optimal parameters for this relationship using techniques like gradient descent or maximum likelihood estimation (MLE), both of which aim to minimize prediction errors. These optimization methods adjust the model’s coefficients to best fit the data, making logistic regression both a flexible and interpretable tool for classification tasks in machine learning.

The Logistic Regression Equation

  • For a binary classification task, the logistic regression model predicts the probability \(P(y = 1 \mid X)\), where \(y\) is the binary outcome (0 or 1) and \(X = (x_1, x_2, \dots, x_k)\) represents the input features.
  • The logistic regression model is based on the following equation:

    \[P(y = 1 \mid X) = \text{sigmoid}(z) = \frac{1}{1 + e^{-z}}\]
    • where:
      • \[z = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + \dots + \hat{\beta_k}x_k\]
      • \(\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_k}\) are the coefficients (parameters) of the model.
      • The sigmoid function ensures that the output is a probability, constrained between 0 and 1.
  • The linear predictor \(z\) is a linear combination of the input features, much like in linear regression. However, instead of predicting continuous outcomes, logistic regression applies the sigmoid function to ensure the output can be interpreted as a probability.
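A few lines of NumPy make the mapping from linear predictor to probability explicit; the coefficient and feature values here are arbitrary, purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map the linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients and one observation with two features.
beta = np.array([-1.0, 0.8, 0.3])      # [beta_0, beta_1, beta_2]
x = np.array([1.0, 2.0, -0.5])         # leading 1 corresponds to the intercept

z = beta @ x                           # linear predictor
print("P(y = 1 | X) =", sigmoid(z))    # probability between 0 and 1
```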

How Logistic Regression Works: A Practical Example

  • To understand how logistic regression works in practice, consider a scenario where you want to predict the weather in Seattle, specifically whether a day will be sunny or rainy. The outcome can be represented as a binary variable: assign 1 to a sunny day and 0 to a rainy day. Suppose the feature you use to make this prediction is the temperature.
  • Fitting a linear regression model to this data would be inappropriate because the predicted values could fall outside the 0 to 1 range, leading to nonsensical results (e.g., a prediction of 1.5 for sunny days). Instead, logistic regression fits a logistic (sigmoid) function, ensuring that predicted probabilities lie between 0 and 1. You can interpret these probabilities as the likelihood of a sunny day, given a specific temperature.
  • Once the model outputs a probability, a classification threshold (often 0.5) is applied. For example, if the predicted probability is greater than 0.5, the model will predict a sunny day. If it’s less than 0.5, it will predict a rainy day. Adjusting the threshold allows for more cautious or risk-tolerant predictions depending on the context.
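A sketch of this weather example with scikit-learn's LogisticRegression, using made-up temperatures and labels; predict_proba yields the probability of a sunny day, and the 0.5 threshold is applied explicitly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: temperature (F) and whether the day was sunny (1) or rainy (0).
temps = np.array([[45], [50], [55], [60], [65], [70], [75], [80]])
sunny = np.array([0, 0, 0, 1, 0, 1, 1, 1])

clf = LogisticRegression().fit(temps, sunny)

# Probability of a sunny day at 63 degrees, then a 0.5 classification threshold.
p_sunny = clf.predict_proba([[63]])[0, 1]
prediction = int(p_sunny > 0.5)
print(p_sunny, prediction)
```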

Model Evaluation: The Log-Loss Function

  • Logistic regression models are evaluated using a loss function that measures how well the model predicts the true outcomes. For binary classification, the appropriate loss function is the Log-Loss (also known as binary cross-entropy). The Log-Loss for a given dataset with \(n\) samples is defined as:

    \[\text{Log-Loss} = - \sum_{i=1}^{n} \left( y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right)\]
    • where:
      • \(y_i\) is the true label for the \(i\)-th observation.
      • \(p_i\) is the predicted probability for the \(i\)-th observation.
  • The Log-Loss penalizes large deviations between the predicted probability and the actual label, with larger penalties when the model is confident but wrong (i.e., when a predicted probability is near 1 for an outcome of 0, or vice versa). Minimizing the Log-Loss function leads to more accurate predictions.
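The sketch below computes the summed Log-Loss by hand, matching the formula above, and cross-checks it against sklearn.metrics.log_loss (which reports the mean over samples unless normalize=False is passed). The labels and probabilities are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.4])

# Summed binary cross-entropy, matching the formula above.
total = -np.sum(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# sklearn reports the mean by default; normalize=False gives the sum.
print(total, log_loss(y_true, p_pred, normalize=False))
```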

Estimating Coefficients: Gradient Descent and Maximum Likelihood Estimation

  • In logistic regression, the goal is to find the optimal values of the coefficients \(\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_k}\) that minimize the Log-Loss. There are two primary methods to estimate these coefficients: gradient descent and maximum likelihood estimation (MLE).

Gradient Descent

  • Gradient descent is an iterative optimization algorithm used to minimize the Log-Loss function. The process begins by selecting initial values for the parameters and then updating them iteratively. The direction and magnitude of each update are determined by the gradient of the Log-Loss function, which points in the direction of the steepest increase. By moving in the opposite direction of the gradient, the algorithm seeks to minimize the loss function.

  • Each iteration updates the parameters according to the following rule:

    \[\beta_j^{(t+1)} = \beta_j^{(t)} - \alpha \frac{\partial \text{Log-Loss}}{\partial \beta_j}\]
    • where:
      • \(\alpha\) is the learning rate (step size).
      • \(\frac{\partial \text{Log-Loss}}{\partial \beta_j}\) is the partial derivative of the Log-Loss with respect to \(\beta_j\).
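A compact NumPy sketch of this update rule for logistic regression: the gradient of the (averaged) Log-Loss with respect to the coefficient vector works out to \(X^T(p - y)/n\), where \(p\) holds the current predicted probabilities. The learning rate, iteration count, and synthetic data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, alpha=0.1, n_iters=2000):
    """Estimate logistic regression coefficients by gradient descent on the mean Log-Loss."""
    n, p = X.shape
    Xb = np.column_stack([np.ones(n), X])   # intercept column
    beta = np.zeros(p + 1)
    for _ in range(n_iters):
        probs = sigmoid(Xb @ beta)          # current predicted probabilities
        grad = Xb.T @ (probs - y) / n       # gradient of the mean Log-Loss
        beta -= alpha * grad                # move against the gradient
    return beta

# Illustrative usage on synthetic binary data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
true_z = -0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=300) < sigmoid(true_z)).astype(float)
print(fit_logistic_gd(X, y))
```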

Maximum Likelihood Estimation (MLE)

  • MLE is another method to estimate the coefficients in logistic regression. It involves maximizing the likelihood that the observed data occurred given the model’s predictions. The log-likelihood function is the log of the likelihood function, and maximizing the log-likelihood is equivalent to minimizing the Log-Loss. This can be achieved by setting the partial derivatives of the log-likelihood function with respect to the parameters equal to zero and solving the resulting system of equations.

Interpreting Logistic Regression Coefficients

  • Interpreting the coefficients in logistic regression differs from linear regression because they are expressed in terms of log-odds rather than probabilities. The odds are calculated as:

    \[\text{Odds} = \frac{p}{1 - p}\]
    • where \(p\) is the probability of the positive class (e.g., sunny day). The logistic regression coefficients represent the change in the log-odds of the outcome for a one-unit increase in the corresponding predictor variable.
  • To make the coefficients easier to interpret, we can exponentiate them, transforming them from the log-odds scale to the odds scale. For example, if the coefficient \(\hat{\beta_1}\) for a feature is 0.7, then \(e^{0.7} \approx 2\), meaning that a one-unit increase in that feature multiplies the odds of the positive class by approximately 2.
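Continuing this interpretation, a fitted model's coefficients can be exponentiated into odds ratios in one line; the sketch below reuses scikit-learn's LogisticRegression on synthetic data (the generating coefficients are illustrative, and the regularized estimates will not match them exactly).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data generated from a logistic model with two features.
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2))
p = 1.0 / (1.0 + np.exp(-(0.7 * X[:, 0] - 0.4 * X[:, 1])))
y = (rng.uniform(size=500) < p).astype(int)

clf = LogisticRegression().fit(X, y)

# Coefficients live on the log-odds scale; exponentiating gives odds ratios.
print("log-odds coefficients:", clf.coef_[0])
print("odds ratios:", np.exp(clf.coef_[0]))
```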

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledLinear&LogisticRegression,
  title   = {Linear and Logistic Regression},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}