Overview

  • Here’s a deep dive on the pros and cons of the most commonly used regression and classification algorithms and the use-cases for each.

Classification Algorithms

Logistic Regression

  • To start off here, Logistic Regression is a misnomer as it does not pertain to a regression problem at all.
  • Logistic regression estimates the probability of an event occurring based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
  • You can use your returned value in one of two ways:
    • You may just need an output of 0 or 1 and use it “as is”.
      • Say your model predicts the probability that your baby will cry at night as:

        \[p(cry|night)= 0.05\]
      • Let’s leverage this information to see how many times a year you’ll have to wake up to soothe your baby:

        \[\begin{align} wakeUp &= p(cry|nights)*nights \\ &= 0.05 * 365 \\ &= 18 \text{ days} \end{align}\]
    • Or you may want to convert it into a binary category such as: spam or not spam and convert it to a binary classification problem.
      • In this scenario, you’d want an accurate prediction of 0 or 1 and no probabilities in between.
      • The best way to obtain this is by leveraging the sigmoid function as it guarantees a value between 0 and 1.
      • Sigmoid function is represented as:

      \[y^{\prime}=\frac{1}{1+e^{-z}}\]
      • where,
        • \(y^{\prime}\) is the output of the logistic regression model for a particular example.
        \[z=b+w_{1} x_{1}+w_{2} x_{2}+\ldots+w_{N} x_{N}\]
        • The \(w\) values are the model’s learned weights, and \(b\) is the bias.
        • The \(x\) values are the feature values for a particular example.
  • Pros:
    • Algorithm is quite simple and efficient.
    • Provides concrete probability scores as output.
  • Cons:
    • Bad at handling a large number of categorical features.
    • It assumes that the data is free of missing values and predictors are independent of each other.
  • Use case:
    • Logistic regression is used when the dependent variable (target) is categorical as in binary classification problems.
      • For example, To predict whether an email is spam (1) or (0) or whether the tumor is malignant (1) or not (0).

Naive Bayes Classifier

  • Naive Bayes is a supervised learning algorithms based on Bayes’ theorem which can serve as either a binary or multi-class classifier.
  • It is termed “naive” because it makes the naive assumption of conditional independence between every pair of features given the value of the class variable.
  • Bayes Theorem is represented in the image below:

  • The thought behind naive Bayes classification is to try to classify the data by maximizing:

    \[P(O \mid C_i) P(C_i)\]
    • where,
      • \(O\) is the object or tuple in a dataset.
      • \(i\) is an index of the class.
  • Pros:
    • Under-the-hood, Naive Bayes involves a multiplication (once the probability is known) which makes the algorithm simplistic and fast.
    • It can also be used to solve multi-class prediction problems.
    • This classifier performs better than other models with less training data if the assumption of independence of features holds.
  • Cons:
    • It assumes that all the features are independent. This is actually a big con because features in reality are frequently not fully independent.
  • Use case:
    • When the assumption of independence holds between features, Naive Bayes classifier typically performs better than logistic regression and requires less training data.
    • It performs well in case of categorical input variables compared to continuous/numerical variable(s).

Regression Algorithms

Linear Regression

  • Linear regression analysis is a supervised machine learning algorithm that is used to predict the value of an output variable based on the value of an input variable.
  • The output variable we’re looking to predict is called the dependent variable.
  • The input variable we’re using to predict the output variable’s value is called the independent variable.
  • Assumes a linear relationship between the output and input variable(s) and fits a linear equation on the data.
  • The goal of Linear Regression is to predict output values for inputs that are not present in the data set, with the belief that those outputs would fall on the fitted line.
  • Pros:
    • Performs very well for linearly separated data.
    • Easy to implement and is interpretable.
  • Cons:
    • Prone to noise and overfitting.
    • Very sensitive to outliers.
  • Use case:
    • Linear regression is commonly used for predictive analysis and modeling.

Either Classification or Regression Algorithms

k-Nearest Neighbors

  • Based on the age-old adage “birds of a feather flock together”.
  • The \(k\)-nearest neighbors algorithm, also known as \(k\)-NN, is a non-parametric, supervised machine learning algorithm, which uses proximity to make classifications or predictions about the grouping of an individual data point.
  • While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.
  • The value of \(k\) is a hyperparameter which represents the number of neighbors you’d like the algorithm to refer as it generates its output.
  • \(k\)-NN answers the question that given the current data, what are the \(k\) most similar data points to the query.
    • \(k\)-NN calculates distance typically using either Euclidean or Manhattan distance:
    • Euclidean distance:

      \[d(x, y)=\sqrt{\sum_{i=1}^{n}\left(y_{i}-x_{i}\right)^{2}}\]
    • Manhattan distance:

      \[d(x, y)=\sum_{i=1}^{m}\left|x_{i}-y_{i}\right|\]
  • This is the high-level view of how the algorithm works:
    • For each example in the data:
      • Calculate distance between query example and current example from the data.
      • Add the distance and index to an ordered collection.
      • Sort in ascending order by distance.
      • Pick first \(k\) from sorted order.
      • Get labels of selected \(k\) entries.
      • If regression, return the mean of \(k\) labels.
      • If classification, return the majority vote (some implementations incorporate weights for each vote) of the labels.
  • Pros:
    • Easy to implement.
    • Needs only a few hyperparameters which are:
      • The value of \(k\).
      • Distance metric used.
  • Cons:
    • Does not scale well as it takes too much memory and data storage compared with other classifiers.
    • Prone to overfitting if the value of \(k\) is too low and will underfit if the value of \(k\) is too high.
  • Use case:
    • While SVMs can be used for regression problems, they are typically used for classification (similar to \(k\)-NN).
    • When labelled data is too expensive or impossible to obtain.
    • When the dataset is relatively smaller and is noise free.

Support Vector Machines

  • The objective of Support Vector Machines (SVM) is to find a hyperplane in an \(N\)-dimensional space (where \(N\) is the number of features) that distinctly classifies the data points.
    • Note that a hyperplane is a decision boundary that helps classify the data points. If the number of input features is 2, hyperplane is just a line, if input features is 3, it becomes a 2D plane.
  • The subset of training data points utilized in the decision function are called “support vectors”, hence the name Support Vector Machines.
  • In the instance that the data is not linearly separable, we need to use the kernel trick (also called the polynomial trick). The SVM kernel is a function that takes low dimensional input space and transforms it into higher-dimensional space, i.e., it enables learning a non-linear decision boundary to separate datapoints. The gist of the kernel trick is that learning a linear model in the higher-dimensional space is equivalent to learning a non-linear model in the original lower-dimensional input space. More on this in the article on SVM Kernel/Polynomial Trick
  • The image below displays the linear hyperplane separating the two classes such that the distance from the hyperplane to the nearest data point on each side is maximized. This hyperplane is known as the maximum-margin hyperplane/hard margin.
  • Pros:
    • Works well when there is a clear margin of separation between classes.
    • It is easy to customize the kernel function depending on your dimensionality.
    • It is memory efficient as it uses a subset of training points in the decision function (“support vectors”).
  • Cons:
    • It doesn’t perform well when we have a relatively large dataset because the required training time is higher.
    • It also doesn’t perform very well when the dataset has more noise, i.e., when target classes are overlapping.
  • Use case:
    • While SVMs can be used for regression problems, they are typically used for classification (similar to \(k\)-NN).
    • Works great if the number of features is high and they occur in high dimensional spaces.

Decision Trees

  • Like the name suggests, a Decision Tree is a tree with a flowchart like structure consisting of 3 elements:
    • The internal node denotes a test on an attribute.
    • Each branch represents an outcome of the test.
    • Each leaf node (terminal node) holds a class label.
  • The objective of a Decision Tree is to create a training model that can to predict the class of the target variable by learning simple decision rules inferred from prior data (training data).
  • Pros:
    • Interpretability is high due to the intuitive nature of a tree.
  • Cons:
    • Decision trees are susceptible to overfitting (random forests are a great way to fix this issue).
    • Small changes in data can lead to large structural changes on the tree.
  • Use case:
    • When you want to be able to lay out all the possible outcomes of a problem and work on challenging each option.

Random Forests

  • A random forest is a robust ML algorithm that relies on having different models and making them vote for the prediction.
  • An essential feature of random forest is to encourage diversity in the models; that way, we ensure the models have different predictions that will improve their model performance.
  • Random forest encourages diversity by using random sampling with a replacement but also changing the root node, which is the first feature in which we split our data.
  • The results are a set of decision trees with different root nodes taken from similar (not equal) datasets, and each has a vote in the prediction of the model.

  • Pros:
  • Cons:
  • Use case:
    • When the model is overfitting and you want better generalization.

References

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledMLAlgorithmsCompAnalysis,
  title   = {ML Algorithms Comparative Analysis},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}