ML Algorithms Comparative Analysis
 Overview
 Classification Algorithms
 Regression Algorithms
 Either Classification or Regression Algorithms
 References
 Citation
Overview
 Here’s a deep dive on the pros and cons of the most commonly used regression and classification algorithms and the usecases for each.
Classification Algorithms
Logistic Regression
 To start off here, Logistic Regression is a misnomer as it does not pertain to a regression problem at all.
 Logistic regression estimates the probability of an event occurring based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
 You can use your returned value in one of two ways:
 You may just need an output of 0 or 1 and use it “as is”.

Say your model predicts the probability that your baby will cry at night as:
\[p(crynight)= 0.05\] 
Let’s leverage this information to see how many times a year you’ll have to wake up to soothe your baby:
\[\begin{align} wakeUp &= p(crynights)*nights \\ &= 0.05 * 365 \\ &= 18 \text{ days} \end{align}\]

 Or you may want to convert it into a binary category such as: spam or not spam and convert it to a binary classification problem.
 In this scenario, you’d want an accurate prediction of 0 or 1 and no probabilities in between.
 The best way to obtain this is by leveraging the sigmoid function as it guarantees a value between 0 and 1.
 Sigmoid function is represented as:
 where,
 \(y^{\prime}\) is the output of the logistic regression model for a particular example.
 The \(w\) values are the model’s learned weights, and \(b\) is the bias.
 The \(x\) values are the feature values for a particular example.
 You may just need an output of 0 or 1 and use it “as is”.
 Pros:
 Algorithm is quite simple and efficient.
 Provides concrete probability scores as output.
 Cons:
 Bad at handling a large number of categorical features.
 It assumes that the data is free of missing values and predictors are independent of each other.
 Use case:
 Logistic regression is used when the dependent variable (target) is categorical as in binary classification problems.
 For example, To predict whether an email is spam (1) or (0) or whether the tumor is malignant (1) or not (0).
 Logistic regression is used when the dependent variable (target) is categorical as in binary classification problems.
Naive Bayes Classifier
 Naive Bayes is a supervised learning algorithms based on Bayes’ theorem which can serve as either a binary or multiclass classifier.
 It is termed “naive” because it makes the naive assumption of conditional independence between every pair of features given the value of the class variable.

Bayes Theorem is represented in the image below:

The thought behind naive Bayes classification is to try to classify the data by maximizing:
\[P(O \mid C_i) P(C_i)\] where,
 \(O\) is the object or tuple in a dataset.
 \(i\) is an index of the class.
 where,
 Pros:
 Underthehood, Naive Bayes involves a multiplication (once the probability is known) which makes the algorithm simplistic and fast.
 It can also be used to solve multiclass prediction problems.
 This classifier performs better than other models with less training data if the assumption of independence of features holds.
 Cons:
 It assumes that all the features are independent. This is actually a big con because features in reality are frequently not fully independent.
 Use case:
 When the assumption of independence holds between features, Naive Bayes classifier typically performs better than logistic regression and requires less training data.
 It performs well in case of categorical input variables compared to continuous/numerical variable(s).
Regression Algorithms
Linear Regression
 Linear regression analysis is a supervised machine learning algorithm that is used to predict the value of an output variable based on the value of an input variable.
 The output variable we’re looking to predict is called the dependent variable.
 The input variable we’re using to predict the output variable’s value is called the independent variable.
 Assumes a linear relationship between the output and input variable(s) and fits a linear equation on the data.
 The goal of Linear Regression is to predict output values for inputs that are not present in the data set, with the belief that those outputs would fall on the fitted line.
 Pros:
 Performs very well for linearly separated data.
 Easy to implement and is interpretable.
 Cons:
 Prone to noise and overfitting.
 Very sensitive to outliers.
 Use case:
 Linear regression is commonly used for predictive analysis and modeling.
Either Classification or Regression Algorithms
kNearest Neighbors
 Based on the ageold adage “birds of a feather flock together”.
 The \(k\)nearest neighbors algorithm, also known as \(k\)NN, is a nonparametric, supervised machine learning algorithm, which uses proximity to make classifications or predictions about the grouping of an individual data point.
 While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.
 The value of \(k\) is a hyperparameter which represents the number of neighbors you’d like the algorithm to refer as it generates its output.
 \(k\)NN answers the question that given the current data, what are the \(k\) most similar data points to the query.
 \(k\)NN calculates distance typically using either Euclidean or Manhattan distance:

Euclidean distance:
\[d(x, y)=\sqrt{\sum_{i=1}^{n}\left(y_{i}x_{i}\right)^{2}}\] 
Manhattan distance:
\[d(x, y)=\sum_{i=1}^{m}\leftx_{i}y_{i}\right\]
 This is the highlevel view of how the algorithm works:
 For each example in the data:
 Calculate distance between query example and current example from the data.
 Add the distance and index to an ordered collection.
 Sort in ascending order by distance.
 Pick first \(k\) from sorted order.
 Get labels of selected \(k\) entries.
 If regression, return the mean of \(k\) labels.
 If classification, return the majority vote (some implementations incorporate weights for each vote) of the labels.
 For each example in the data:
 Pros:
 Easy to implement.
 Needs only a few hyperparameters which are:
 The value of \(k\).
 Distance metric used.
 Cons:
 Does not scale well as it takes too much memory and data storage compared with other classifiers.
 Prone to overfitting if the value of \(k\) is too low and will underfit if the value of \(k\) is too high.
 Use case:
 While \(k\)NNs can be used for regression problems, they are typically used for classification.
 When labelled data is too expensive or impossible to obtain.
 When the dataset is relatively smaller and is noise free.
Support Vector Machines
 The objective of Support Vector Machines (SVMs) is to find a hyperplane in an \(N\)dimensional space (where \(N\) is the number of features) that distinctly classifies the data points.
 Note that a hyperplane is a decision boundary that helps classify the data points. If the number of input features is 2, hyperplane is just a line, if input features is 3, it becomes a 2D plane.
 The subset of training data points utilized in the decision function are called “support vectors”, hence the name Support Vector Machines.
 In the instance that the data is not linearly separable, we need to use the kernel trick (also called the polynomial trick). The SVM kernel is a function that takes low dimensional input space and transforms it into higherdimensional space, i.e., it enables learning a nonlinear decision boundary to separate datapoints. The gist of the kernel trick is that learning a linear model in the higherdimensional space is equivalent to learning a nonlinear model in the original lowerdimensional input space. More on this in the article on SVM Kernel/Polynomial Trick
 The image below displays the linear hyperplane separating the two classes such that the distance from the hyperplane to the nearest data point on each side is maximized. This hyperplane is known as the maximummargin hyperplane/hard margin.
 Pros:
 Effective in high dimensional spaces.
 Effective in cases where the number of dimensions is greater than the number of samples.
 Works well when there is a clear margin of separation between classes.
 Memory efficient as it uses a subset of training points in the decision function (“support vectors”).
 Versatile since different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
 Cons:
 Doesn’t perform well when we have large datasets because the required training time is higher.
 If the number of features is much greater than the number of samples, avoiding overfitting in choosing kernel functions and regularization term is crucial.
 SVMs do not directly provide probability estimates, these are calculated using an expensive fivefold crossvalidation.
 Doesn’t perform very well when the dataset has more noise, i.e., when target classes are overlapping.
 Use case:
 While SVMs can be used for regression problems, they are typically used for classification (similar to \(k\)NN) and outliers detection.
 Works great if the number of features is high and they occur in high dimensional spaces.
Decision Trees
 Like the name suggests, a Decision Tree is a tree with a flowchart like structure consisting of 3 elements:
 The internal node denotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node (terminal node) holds a class label.
 The objective of a Decision Tree is to create a training model that can to predict the class of the target variable by learning simple decision rules inferred from prior data (training data).
 Pros:
 Interpretability is high due to the intuitive nature of a tree.
 Cons:
 Decision trees are susceptible to overfitting (random forests are a great way to fix this issue).
 Small changes in data can lead to large structural changes on the tree.
 Use case:
 When you want to be able to lay out all the possible outcomes of a problem and work on challenging each option.
Random Forests
 A random forest is a robust ML algorithm that relies on having different models and making them vote for the prediction.
 An essential feature of random forest is to encourage diversity in the models; that way, we ensure the models have different predictions that will improve their model performance.
 Random forest encourages diversity by using random sampling with a replacement but also changing the root node, which is the first feature in which we split our data.

The results are a set of decision trees with different root nodes taken from similar (not equal) datasets, and each has a vote in the prediction of the model.
 Pros:
 Less prone to overfitting compared to decision trees.
 Cons:
 Interpretability is low compared to decision trees.
 Use case:
 When the model is overfitting and you want better generalization.
XGBoost
 XGBoost algorithm is a gradient boosting algorithm that is highly efficient and scalable.
 Here’s a highlevel overview of the XGBoost algorithm:
 Initialize the model with a weak learner, usually a decision tree stump (a decision tree with a single split)
 Compute the negative gradient of the loss function with respect to the current prediction
 Fit a decision tree to the negative gradient to make a new prediction
 Add the prediction from this tree to the current prediction
 Repeat steps 24 for a specified number of trees, or until a stopping criterion is met
 Combine the predictions from all trees to get the final prediction
 The following figure summarizes parallelizing XGBoost image source: AiEdge.io.
Gradient Boosting
 Gradient Boosting is an ensemble machine learning algorithm that combines multiple weak models to create a strong model.
 It is an iterative process where each iteration, a new model is fit to the residual errors made by the previous model, with the goal of decreasing the overall prediction error.
 The algorithm works as follows:
 Initialize the model with a weak learner, typically a decision tree with a single split.
 Compute the negative gradient of the loss function with respect to the current prediction.
 Fit a new model to the negative gradient.
 Update the prediction by adding the prediction from the new model.
 Repeat steps 24 for a specified number of iterations, or until a stopping criterion is met.
 Combine the predictions from all models to get the final prediction.
References
 IBM tech blog: What is logistic regression?
 Google Crash Course
 Towards Data Science
 Analytics Vidhya
 Science Direct
 SciKit Learn
 KD Nuggets
 Geeks for Geeks
 KD Nuggets Trees
 Edureka
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledMLAlgorithmsCompAnalysis,
title = {ML Algorithms Comparative Analysis},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}