Primers • Linear and Logistic Regression
 Introduction
 FAQs
 Does Linear Regression Assume that Features are Linearly Related to the Outcome Variable?
 How Does The Linearity Assumption Hold with Non-Linear (Cross) Features?
 Summary:
 What is Multicollinearity?
 How to Detect and Address Multicollinearity?
 Logistic Regression
 The Logistic Regression Equation
 How Logistic Regression Works: A Practical Example
 Model Evaluation: The LogLoss Function
 Estimating Coefficients: Gradient Descent and Maximum Likelihood Estimation
 Interpreting Logistic Regression Coefficients
 Further Reading
 Citation
Introduction
 Linear regression is a fundamental and widely used statistical technique in both machine learning and statistical analysis. Its purpose is to predict a dependent (or target) variable based on one or more independent (or explanatory) variables. This method is popular due to its simplicity, interpretability, and its foundational role in understanding more complex models. Despite the growing complexity of machine learning algorithms, linear regression remains an essential tool for predictive modeling and explanatory analysis across various domains.
 For linear regression to produce valid and reliable results, several assumptions must be met. The main ones are listed below.
Assumptions of Linear Regression
 Linearity:
 The relationship between the independent variables (predictors) and the dependent variable (outcome) is assumed to be linear. Specifically, the expected value of the dependent variable is a linear function of the independent variables.
 Independence of Errors:
 The residuals (the differences between the observed and predicted values) should be independent of each other. This assumption means there should be no autocorrelation (particularly important in time series data).
 Homoscedasticity (Constant Variance of Errors):
 The residuals should have constant variance at all levels of the independent variables, so their spread should look similar across the range of predicted values. Under heteroscedasticity the coefficient estimates remain unbiased, but the usual standard errors, and therefore the p-values and confidence intervals built on them, become unreliable.
 Normality of Errors:
 The residuals should be normally distributed. This assumption is especially important when conducting hypothesis testing (e.g., calculating p-values or confidence intervals).
 No Multicollinearity:
 The independent variables should not be highly correlated with each other. If multicollinearity exists, it becomes difficult to determine the individual effect of each predictor.
 No Endogeneity (No Correlation Between Errors and Predictors):
 The independent variables should not be correlated with the error term. This assumption ensures that the predictors are truly independent of the error term and are not capturing any omitted variable bias.
The General Equation of Linear Regression

At the heart of linear regression is the equation that represents the relationship between the target variable and one or more predictors:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon\] where:
 \(y\): Dependent (or target) variable we aim to predict.
 \(x_1, x_2, \dots, x_p\): Independent (or explanatory) variables.
 \(\beta_0\): Intercept term.
 \(\beta_1, \beta_2, \dots, \beta_p\): Coefficients (or weights) that quantify the impact of each predictor \(x_i\) on \(y\).
 \(\epsilon\): Error term representing the irreducible noise or unmodeled aspects of the data.

The goal of linear regression is to estimate the coefficients (\(\beta\)) that minimize the prediction error for the dependent variable based on the given set of independent variables.
Key Concepts

Supervised Learning Algorithm: Linear regression is a supervised learning technique, meaning it learns the relationship between input variables and a labeled output variable from training data.

Prediction: Once a linear regression model is fitted with appropriate coefficients, it can be used to predict \(y\) for new data by simply plugging in the values of the predictors.

Simple vs. Multiple Regression:
 Simple linear regression involves a single independent variable.
 Multiple linear regression involves two or more independent variables.
For example, predicting house prices based on square footage is a case of simple linear regression, while predicting weight based on both height and age would be a multiple regression scenario.
Model Fitting: Estimating the Coefficients
The primary task in linear regression is to find the set of coefficients (\(\beta_0, \beta_1, \dots, \beta_p\)) that minimize the difference between predicted and actual values. This is achieved by minimizing the sum of squared errors (SSE), that is, the sum of the squared residuals, where each residual is the difference between an observed data point and the value predicted by the regression line.

The fitted model can then be used to make predictions using the following equation:
\[\hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 + \hat{\beta_2} x_2 + \dots + \hat{\beta_p} x_p\] where \(\hat{y}\) represents the predicted values and \(\hat{\beta_i}\) are the estimated coefficients.
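The prediction equation above amounts to a weighted sum, which the following minimal sketch makes concrete. The function name `predict` and all numeric values are hypothetical and chosen only for illustration; they are not fitted to any real data.

```python
def predict(intercept, coefficients, features):
    """Return y_hat = b0 + b1*x1 + ... + bp*xp for a single observation."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical fitted values (illustration only).
beta_0 = 50.0            # estimated intercept
betas = [0.2, -1.5]      # estimated coefficients for x1 and x2
new_row = [1000.0, 10.0] # predictor values for a new observation

y_hat = predict(beta_0, betas, new_row)  # 50 + 0.2*1000 - 1.5*10
print(y_hat)
```

Each coefficient scales its own predictor, so changing one predictor by one unit shifts the prediction by exactly that coefficient while the others are held fixed.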
Evaluation of Linear Regression Models
Evaluating a linear regression model involves assessing how well the model fits the data and how accurate its predictions are.
Mean Squared Error (MSE)
 MSE is a common loss function for regression models. It is calculated by taking the average of the squared differences between the actual and predicted values:
\[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2\] Interpretation: A lower MSE indicates that the model is making predictions closer to the actual values.
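As a small worked example, the average of squared differences can be computed directly. The helper name `mean_squared_error` and the three data points are assumptions made for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0]   # toy actual values
y_pred = [2.5, 5.0, 8.0]   # toy predictions

mse = mean_squared_error(y_true, y_pred)  # (0.25 + 0.0 + 1.0) / 3
print(mse)
```

Because the errors are squared, the single miss of 1.0 contributes four times as much as the miss of 0.5, which is why MSE is sensitive to outliers.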
R-Squared (Coefficient of Determination)
R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables. For a model with an intercept evaluated on its training data, it ranges from 0 to 1, where 1 indicates that the model perfectly explains the variance in the data. On held-out data it can be negative when the model predicts worse than simply using the mean of the target.
\[R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\] Interpretation: An \(R^2\) of 0.8, for example, means that 80% of the variance in the target variable is explained by the model.
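The ratio of residual to total sum of squares can be sketched as follows. The function name `r_squared` and the toy values are assumptions for illustration only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_true, y_pred)               # close to 1: good fit
r2_baseline = r_squared(y_true, [2.5] * 4)   # always predicting the mean
print(r2, r2_baseline)
```

Predicting the mean for every observation gives an R-squared of exactly 0, which is the baseline the metric compares against.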
Optimization Techniques: Finding the Best Coefficients
Gradient Descent

Gradient descent is an iterative optimization algorithm used to minimize the cost function (MSE). It works by calculating the gradient (or slope) of the error function and updating the coefficients in the direction that reduces this error. It is particularly useful when working with large datasets or many features, where solving for the coefficients in closed form becomes expensive.

The update rule in gradient descent for linear regression is:
\[\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} MSE\] where:
 \(\alpha\) is the learning rate (a small positive number that controls the step size).
 \(\frac{\partial}{\partial \beta_j} MSE\) represents the derivative of the cost function with respect to the coefficient \(\beta_j\).
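A minimal batch gradient descent sketch for linear regression is shown below. It repeatedly subtracts the learning rate times the MSE gradient from each coefficient. The function name, the default learning rate, the iteration count, and the toy data are all assumptions made for illustration; real problems usually need feature scaling and a tuned learning rate.

```python
def gradient_descent(X, y, learning_rate=0.05, n_iters=5000):
    """Fit [beta_0, beta_1, ..., beta_p] by minimizing MSE with batch gradient descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(n_iters):
        # Prediction errors (y_hat - y) under the current coefficients.
        errors = [
            beta[0] + sum(b * x for b, x in zip(beta[1:], row)) - target
            for row, target in zip(X, y)
        ]
        # dMSE/dbeta_0 = (2/n) * sum(errors)
        # dMSE/dbeta_j = (2/n) * sum(errors * x_j)
        grad = [2.0 / n * sum(errors)]
        grad += [2.0 / n * sum(e * row[j] for e, row in zip(errors, X)) for j in range(p)]
        # Step against the gradient.
        beta = [b - learning_rate * g for b, g in zip(beta, grad)]
    return beta


# Toy data generated from y = 1 + 2x with no noise.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

beta = gradient_descent(X, y)
print(beta)  # approaches [1.0, 2.0]
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, many more iterations are needed to reach the same coefficients.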
Normal Equation

For smaller datasets or problems where computational complexity is not a concern, the normal equation offers a closed-form solution for the best coefficients. The normal equation is derived by minimizing the residual sum of squares (RSS) and is given by:
\[\hat{\beta} = (X^T X)^{-1} X^T y\] where \(X\) is the matrix of input features (with a leading column of ones for the intercept), and \(y\) is the vector of target values.
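The closed-form solution can be sketched in plain Python. Note one substitution: instead of forming the inverse explicitly, this sketch solves the equivalent linear system \((X^T X)\beta = X^T y\) by Gaussian elimination, which gives the same coefficients. The function names and toy data are assumptions for illustration.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; predictors may be perfectly collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def normal_equation(X, y):
    """Return [beta_0, ..., beta_p] solving (X^T X) beta = X^T y."""
    design = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 - 1*x2.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
y = [1.0 + 2.0 * a - 1.0 * b for a, b in X]

beta_hat = normal_equation(X, y)
print(beta_hat)  # approximately [1.0, 2.0, -1.0]
```

The singularity check is where perfect multicollinearity shows up in practice: when one predictor is an exact linear combination of the others, \(X^T X\) cannot be inverted and no unique solution exists.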
Extensions and Interpretations
 Linear regression can be extended in various ways, including:
 Regularized Regression: Adds penalty terms to the cost function to prevent overfitting (e.g., Lasso and Ridge regression).
 Polynomial Regression: Models nonlinear relationships by adding polynomial terms of the independent variables.
 Multivariate Regression: Models two or more dependent variables simultaneously. It should not be confused with multiple regression, which uses several independent variables to predict a single dependent variable.
 Interpreting the coefficients in these extended models follows the same principles, but the presence of multiple predictors or transformations may require more nuanced interpretation of the model’s behavior.
FAQs
Does Linear Regression Assume that Features are Linearly Related to the Outcome Variable?
 Yes, linear regression assumes that the relationship between the independent variables (features) and the outcome variable is linear. However, this does not necessarily mean that the individual predictors themselves must always appear in the form they are collected (i.e., without transformation). You can include nonlinear transformations of the predictors in a linear regression model.
 For example:
 You can include polynomial terms (like \(x^2\), \(x^3\)) or interaction (cross) terms (like \(x_1 \times x_2\)) in your model. As long as the model is linear in the coefficients, it remains a “linear regression” model.
 For instance, if the relationship is quadratic (e.g., \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon\)), this is still considered a linear regression model because it is linear in the parameters \(\beta_0, \beta_1, \beta_2\).
How Does The Linearity Assumption Hold with Non-Linear (Cross) Features?
 Nonlinear terms or cross features (interaction terms) like \(x_1 \times x_2\) do not violate the assumptions of linear regression. Even though the individual features may have nonlinear relationships with the outcome variable (e.g., through interactions or polynomial terms), the model is still considered linear as long as it remains linear in the coefficients.
 For example:
 Suppose you have an interaction term \(x_1 \times x_2\). The resulting model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) + \epsilon\) is still linear in terms of the parameters \(\beta_0, \beta_1, \beta_2, \beta_3\), so the linear regression framework still applies.
 What matters is that the outcome is modeled as a linear combination of the predictors, even if those predictors themselves are nonlinear functions of the original data.
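The distinction between linearity in the parameters and linearity in the raw inputs can be checked numerically. In this sketch the helper names and both coefficient vectors are hypothetical, chosen only to illustrate the interaction model above.

```python
def expand_features(x1, x2):
    """Map raw inputs to the design row [1, x1, x2, x1*x2]."""
    return [1.0, x1, x2, x1 * x2]


def predict(beta, x1, x2):
    """Linear combination of the expanded features with coefficients beta."""
    return sum(b * f for b, f in zip(beta, expand_features(x1, x2)))


beta_a = [1.0, 2.0, -1.0, 0.5]  # hypothetical coefficients
beta_b = [0.0, 1.0, 3.0, -2.0]  # a second hypothetical set

# Linear in the parameters: predicting with (beta_a + beta_b)
# equals the sum of the two separate predictions.
combined = [a + b for a, b in zip(beta_a, beta_b)]
lhs = predict(combined, 2.0, 3.0)
rhs = predict(beta_a, 2.0, 3.0) + predict(beta_b, 2.0, 3.0)

# Not linear in the raw inputs: the effect of raising x1 by one unit
# depends on x2, because the slope is beta_1 + beta_3 * x2.
effect_when_x2_is_0 = predict(beta_a, 1.0, 0.0) - predict(beta_a, 0.0, 0.0)
effect_when_x2_is_3 = predict(beta_a, 1.0, 3.0) - predict(beta_a, 0.0, 3.0)
print(lhs, rhs, effect_when_x2_is_0, effect_when_x2_is_3)
```

Because the model stays a dot product between coefficients and (transformed) features, the same least-squares machinery applies once the features have been expanded.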
Summary:
 Linearity assumption refers to the linearity in the model parameters, not necessarily the raw data. You can introduce transformations or nonlinear features (e.g., cross terms, polynomial terms) as long as the relationship between the outcome and the parameters remains linear.
 The inclusion of nonlinear features (like cross features or polynomial terms) does not violate the assumptions of linear regression, as the model is still “linear” in its parameters.
What is Multicollinearity?
 Multicollinearity arises when two or more independent variables in a linear regression model are highly correlated, resulting in overlapping or redundant information. In such cases, the model struggles to isolate the individual effects of each predictor on the dependent variable, leading to unreliable coefficient estimates, inflated standard errors, and challenges in interpretation.
 This presents a significant limitation in linear regression, as it compromises the model’s capacity to accurately estimate the effects of the predictors.
 Multicollinearity can be identified using diagnostic tools such as the Variance Inflation Factor (VIF) and addressed through strategies like eliminating correlated variables, applying regularization techniques, or employing dimensionality reduction methods, such as Principal Component Analysis (PCA).
How Does Multicollinearity Affect Linear Regression?
 Unstable Coefficients (Inflated Standard Errors):
 Multicollinearity causes the estimates of the regression coefficients to become very sensitive to small changes in the model or data. This leads to inflated standard errors, making the coefficients unreliable. As a result, a variable that might have a significant relationship with the dependent variable could appear statistically insignificant.
 Difficult Interpretation:
 When predictors are highly correlated, it becomes harder to interpret the individual coefficients. For instance, if two variables are almost perfectly correlated, the model struggles to assign appropriate weight to each variable, making the individual coefficients unreliable or counterintuitive.
 Reduces Statistical Power:
 The presence of multicollinearity reduces the precision of the estimated coefficients, which can decrease the statistical power of hypothesis tests. This makes it harder to detect significant effects that truly exist.
 High Variance Inflation Factor (VIF):
 The Variance Inflation Factor (VIF) is often used to detect multicollinearity. If the VIF for any predictor is large (usually greater than 5 or 10), it indicates the presence of multicollinearity. A high VIF means the variable is highly correlated with other predictors.
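For predictor \(j\), the VIF is defined as \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. With only two predictors, \(R_j^2\) is the squared Pearson correlation between them, which allows a short sketch. The function names and the three toy columns are assumptions for illustration.

```python
import math


def pearson_correlation(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_correlation(a, b)
    return 1.0 / (1.0 - r ** 2)


# Toy columns: height in inches is almost a rescaling of height in cm.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
age = [30.0, 22.0, 41.0, 25.0, 36.0]

vif_redundant = vif_two_predictors(height_cm, height_in)  # very large
vif_unrelated = vif_two_predictors(height_cm, age)        # close to 1
print(vif_redundant, vif_unrelated)
```

A VIF near 1 means a predictor shares little variance with the other; values above the usual cutoffs of 5 or 10 signal that its coefficient variance is being inflated by redundancy.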
Is Multicollinearity a Limitation of Linear Regression?

Yes, multicollinearity is considered a limitation of linear regression. However, it is not an inherent limitation of the model itself but rather a problem in the data that can affect the performance and interpretability of the linear regression model.

Here’s how it limits linear regression:

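Unreliable Coefficients: High multicollinearity can make the regression coefficients unreliable. Predictions on data that resembles the training set are often only mildly affected, but the individual coefficients become hard to trust, which makes it difficult to draw meaningful conclusions about each predictor.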

Difficulty in Interpretation: When variables are highly correlated, it becomes hard to tell which one is driving the effect on the dependent variable. This makes it difficult to understand the relationship between predictors and the outcome.

Inflated Variance of Predictions: Multicollinearity increases the variability of the coefficient estimates, which in turn can increase the variability in predictions, making the model less generalizable.

How to Detect and Address Multicollinearity?
 Variance Inflation Factor (VIF):
 A common diagnostic tool is the VIF, which measures how much the variance of a regression coefficient is inflated due to multicollinearity. A VIF greater than 5 or 10 indicates high multicollinearity.
 Correlation Matrix:
 Examining a correlation matrix of the independent variables can reveal pairs of variables that are highly correlated (correlation close to 1 or -1). This is an indication of potential multicollinearity.
 Drop One of the Correlated Variables:
 If two or more variables are highly correlated, consider dropping one of them. This can simplify the model and reduce multicollinearity.
 Principal Component Analysis (PCA):
 PCA can transform the correlated variables into a set of uncorrelated components, which can be used in regression analysis. This reduces the dimensionality of the data and avoids multicollinearity.
 Ridge Regression or Lasso Regression:
 These are regularization techniques that can help mitigate the effects of multicollinearity. Ridge regression adds a penalty to the size of the coefficients, which reduces their sensitivity to multicollinearity. Lasso regression goes a step further and can shrink some coefficients to zero, effectively selecting a subset of the predictors.
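The stabilizing effect of a ridge penalty can be seen on a tiny example with two nearly collinear predictors. This sketch assumes centered features and target (so no intercept is needed) and solves the 2x2 system \((X^T X + \lambda I)\beta = X^T y\) with Cramer's rule. The data, the penalty value of 1.0, and the function name are all invented for illustration.

```python
def fit_two_features(x1, x2, y, ridge_penalty=0.0):
    """Solve (X^T X + penalty * I) beta = X^T y for two centered features."""
    a11 = sum(v * v for v in x1) + ridge_penalty
    a22 = sum(v * v for v in x2) + ridge_penalty
    a12 = sum(u * v for u, v in zip(x1, x2))
    b1 = sum(u * t for u, t in zip(x1, y))
    b2 = sum(u * t for u, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("system is singular; increase the penalty")
    # Cramer's rule for the 2x2 system.
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Toy centered data: x2 is almost a copy of x1.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.0, 1.1, 1.9]
y = [-4.2, -1.7, 0.1, 2.3, 3.5]

ols = fit_two_features(x1, x2, y)                       # no penalty
ridge = fit_two_features(x1, x2, y, ridge_penalty=1.0)  # hypothetical penalty
print(ols, ridge)
```

Without the penalty the two coefficients take large values of opposite sign, a typical symptom of multicollinearity. With the penalty they are shrunk toward similar, moderate values, at the cost of some bias.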
Logistic Regression
 Logistic regression is a fundamental and powerful supervised learning algorithm widely used for binary classification tasks. In machine learning, supervised learning involves training a model on input-output pairs to learn patterns that enable predictions on unseen data. Logistic regression specifically predicts the probability of a categorical outcome based on input features, where the outcome belongs to one of two classes (e.g., “rainy” vs. “sunny” or “success” vs. “failure”). Although logistic regression can be extended to multiple classes, its most common application is in binary classification.
 At its core, logistic regression estimates the probability that a given observation falls into a particular class. This probability is derived using a sigmoid function, which converts the output of a linear equation into a probability value constrained between 0 and 1. The relationship between the input features and the binary response variable is modeled through this sigmoid function, allowing for the output to be interpreted as the likelihood of an observation belonging to a particular class.
 Logistic regression finds the optimal parameters for this relationship by minimizing the LogLoss with an iterative method such as gradient descent or, equivalently, by maximum likelihood estimation (MLE). Both adjust the model’s coefficients to best fit the data, which makes logistic regression a flexible and interpretable tool for classification tasks in machine learning.
The Logistic Regression Equation
 For a binary classification task, the logistic regression model predicts the probability \(P(y = 1 \mid X)\), where \(y\) is the binary outcome (0 or 1) and \(X = (x_1, x_2, \dots, x_k)\) represents the input features.

The logistic regression model is based on the following equation:
\[P(y = 1 \mid X) = \text{sigmoid}(z) = \frac{1}{1 + e^{-z}}\] where:
 \[z = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + \dots + \hat{\beta_k}x_k\]
 \(\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_k}\) are the coefficients (parameters) of the model.
 The sigmoid function ensures that the output is a probability, constrained between 0 and 1.
 The linear predictor \(z\) is a linear combination of the input features, much like in linear regression. However, instead of predicting continuous outcomes, logistic regression applies the sigmoid function to ensure the output can be interpreted as a probability.
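The mapping from the linear predictor to a probability can be sketched as follows. The two-branch form of `sigmoid` is a common numerical-stability choice rather than something prescribed by the text, and the coefficient values in the example are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # safe: z is negative, so exp(z) <= 1
    return exp_z / (1.0 + exp_z)


def predict_proba(intercept, coefficients, features):
    """P(y = 1 | X) for one observation: sigmoid of the linear predictor z."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical coefficients; here z = -1.0 + 0.5*2.0 + 0.25*0.0 = 0.
p = predict_proba(-1.0, [0.5, 0.25], [2.0, 0.0])
print(p)
```

A linear predictor of zero corresponds to a probability of one half, and the function is symmetric in the sense that `sigmoid(z) + sigmoid(-z)` equals 1.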
How Logistic Regression Works: A Practical Example
 To understand how logistic regression works in practice, consider a scenario where you want to predict the weather in Seattle, specifically whether a day will be sunny or rainy. The outcome can be represented as a binary variable: assign 1 to a sunny day and 0 to a rainy day. Suppose the feature you use to make this prediction is the temperature.
 Fitting a linear regression model to this data would be inappropriate because the predicted values could fall outside the range of 0 and 1, leading to nonsensical results (e.g., a prediction of 1.5 for sunny days). Instead, logistic regression fits a logistic (sigmoid) function, ensuring that predicted probabilities are between 0 and 1. You can interpret these probabilities as the likelihood of a sunny day, given a specific temperature.
 Once the model outputs a probability, a classification threshold (often 0.5) is applied. For example, if the predicted probability is greater than 0.5, the model will predict a sunny day. If it’s less than 0.5, it will predict a rainy day. Adjusting the threshold allows for more cautious or risk-tolerant predictions depending on the context.
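The thresholding step for the weather example can be sketched like this. The intercept, the temperature coefficient, and the temperatures themselves are hypothetical values invented for illustration, and treating a probability exactly equal to the threshold as "rainy" is an assumption, since the text does not specify the tie case.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def classify(probability, threshold=0.5):
    """Return 1 (sunny) if the probability exceeds the threshold, else 0 (rainy)."""
    return 1 if probability > threshold else 0


# Hypothetical model: z = -6.0 + 0.1 * temperature.
INTERCEPT, TEMP_COEF = -6.0, 0.1
temperatures = [45.0, 62.0, 75.0]

probs = [sigmoid(INTERCEPT + TEMP_COEF * t) for t in temperatures]
labels_default = [classify(p) for p in probs]                 # threshold 0.5
labels_cautious = [classify(p, threshold=0.8) for p in probs] # stricter threshold
print(probs, labels_default, labels_cautious)
```

Raising the threshold makes the model predict "sunny" only when it is more confident, trading missed sunny days for fewer false sunny forecasts.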
Model Evaluation: The LogLoss Function

Logistic regression models are evaluated using a loss function that measures how well the model predicts the true outcomes. For binary classification, the appropriate loss function is the LogLoss (also known as binary cross-entropy). The LogLoss for a given dataset with \(n\) samples is defined as:
\[\text{LogLoss} = - \sum_{i=1}^{n} \left( y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right)\] where:
 \(y_i\) is the true label for the \(i\)-th observation.
 \(p_i\) is the predicted probability for the \(i\)-th observation.

The LogLoss penalizes large deviations between the predicted probability and the actual label, with larger penalties when the model is confident but wrong (i.e., when a predicted probability is near 1 for an outcome of 0, or vice versa). Minimizing the LogLoss function leads to more accurate predictions.
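The penalty for confident mistakes can be seen in a small computation. This sketch follows the summed form of the LogLoss used in this section; the optional `average` flag, the clipping constant `eps`, and the toy predictions are assumptions added for illustration (many libraries report the mean rather than the sum).

```python
import math


def log_loss(y_true, p_pred, eps=1e-15, average=False):
    """Binary cross-entropy, summed over observations (or averaged if requested)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true) if average else total


y_true = [1, 0, 1]
confident_and_right = [0.9, 0.1, 0.9]
confident_and_wrong = [0.1, 0.9, 0.1]

loss_good = log_loss(y_true, confident_and_right)
loss_bad = log_loss(y_true, confident_and_wrong)
print(loss_good, loss_bad)
```

Each correct prediction at 0.9 confidence costs about 0.105, while each wrong prediction at the same confidence costs about 2.3, so a few confident errors dominate the total.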
Estimating Coefficients: Gradient Descent and Maximum Likelihood Estimation
 In logistic regression, the goal is to find the values of the coefficients \(\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_k}\) that minimize the LogLoss. This is usually presented from two closely related angles: gradient descent, an algorithm that minimizes the LogLoss iteratively, and maximum likelihood estimation (MLE), the statistical principle that leads to the same objective.
Gradient Descent

Gradient descent is an iterative optimization algorithm used to minimize the LogLoss function. The process begins by selecting initial values for the parameters and then updating them iteratively. The direction and magnitude of each update are determined by the gradient of the LogLoss function, which points in the direction of the steepest increase. By moving in the opposite direction of the gradient, the algorithm seeks to minimize the loss function.

Each iteration updates the parameters according to the following rule:
\[\beta_j^{(t+1)} = \beta_j^{(t)} - \alpha \frac{\partial \text{LogLoss}}{\partial \beta_j}\] where:
 \(\alpha\) is the learning rate (step size).
 \(\frac{\partial \text{LogLoss}}{\partial \beta_j}\) is the partial derivative of the LogLoss with respect to \(\beta_j\).
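A minimal sketch of this update rule for logistic regression follows. It uses the fact that the gradient of the summed LogLoss with respect to \(\beta_j\) is \(\sum_i (p_i - y_i) x_{ij}\). The function names, learning rate, iteration count, and toy data are assumptions for illustration; the data is deliberately not perfectly separable so that the coefficients stay finite.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_probs(beta, X):
    """Predicted P(y = 1) for every row; beta[0] is the intercept."""
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row))) for row in X]


def log_loss(y, probs):
    """Summed binary cross-entropy."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y, probs))


def fit_logistic(X, y, learning_rate=0.1, n_iters=2000):
    """Minimize the summed LogLoss with batch gradient descent."""
    p = len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iters):
        errors = [pi - yi for pi, yi in zip(predict_probs(beta, X), y)]
        # Gradient: sum_i (p_i - y_i) for the intercept,
        # sum_i (p_i - y_i) * x_ij for each feature j.
        grad = [sum(errors)]
        grad += [sum(e * row[j] for e, row in zip(errors, X)) for j in range(p)]
        beta = [b - learning_rate * g for b, g in zip(beta, grad)]
    return beta


# Toy one-feature data with some overlap between the classes.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 1, 0, 1, 1]

initial_loss = log_loss(y, predict_probs([0.0, 0.0], X))
beta = fit_logistic(X, y)
final_loss = log_loss(y, predict_probs(beta, X))
print(beta, initial_loss, final_loss)
```

Starting from all-zero coefficients every prediction is 0.5, so the initial loss is \(n \log 2\); each update then moves the coefficients in the direction that lowers it.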
Maximum Likelihood Estimation (MLE)
 MLE is another way to frame the estimation of the coefficients in logistic regression. It involves maximizing the likelihood of the observed data under the model’s predicted probabilities. The log-likelihood function is the log of the likelihood function, and maximizing the log-likelihood is equivalent to minimizing the LogLoss. The maximum is characterized by setting the partial derivatives of the log-likelihood with respect to the parameters equal to zero. Unlike the normal equation in linear regression, the resulting system has no closed-form solution, so it is solved numerically with iterative methods such as gradient-based optimization or Newton’s method.
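The equivalence between the two views can be written out explicitly. The sketch below assumes independent observations and uses the summed form of the LogLoss from this section, with \(p_i = \text{sigmoid}(z_i)\) and the convention \(x_{i0} = 1\) for the intercept (a notational assumption introduced here). The likelihood of the observed labels is:
\[L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}\]
Taking the logarithm turns the product into a sum, which is exactly the negative of the LogLoss:
\[\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right) = -\text{LogLoss}\]
Differentiating with respect to each coefficient gives the score equations that the maximum must satisfy:
\[\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} (y_i - p_i) \, x_{ij} = 0\]
Because \(p_i\) depends on \(\beta\) through the sigmoid, these equations are nonlinear in the coefficients, which is why an iterative solver is needed.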
Interpreting Logistic Regression Coefficients

Interpreting the coefficients in logistic regression differs from linear regression because they are expressed in terms of log-odds rather than probabilities. The odds are calculated as:
\[\text{Odds} = \frac{p}{1 - p}\] where \(p\) is the probability of the positive class (e.g., sunny day). The logistic regression coefficients represent the change in the log-odds of the outcome for a one-unit increase in the corresponding predictor variable, holding the other predictors fixed.

To make the coefficients easier to interpret, we can exponentiate them, transforming them from the log-odds scale to the odds scale. For example, if the coefficient \(\hat{\beta_1}\) for a feature is 0.7, then \(e^{0.7} \approx 2\), meaning that a one-unit increase in that feature multiplies the odds of the positive class by approximately 2.
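The arithmetic behind this interpretation can be checked in a few lines. The coefficient 0.7 comes from the example above; the baseline probability of 0.2 and the helper names are assumptions for illustration.

```python
import math


def odds(p):
    """Convert a probability to odds p / (1 - p)."""
    return p / (1.0 - p)


def prob_from_odds(o):
    """Convert odds back to a probability o / (1 + o)."""
    return o / (1.0 + o)


beta_1 = 0.7
odds_ratio = math.exp(beta_1)  # multiplicative change in odds per unit increase

p_before = 0.2                 # hypothetical baseline probability
odds_after = odds(p_before) * odds_ratio
p_after = prob_from_odds(odds_after)
print(odds_ratio, p_after)
```

Doubling the odds does not double the probability: a baseline of 0.2 (odds of 0.25) moves to roughly 0.33, and the size of the shift in probability depends on where the baseline sits.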
Further Reading
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledLinear&LogisticRegression,
title = {Linear and Logistic Regression},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}