What are some drawbacks of the Transformer?

  • Computing all pairs of interactions during self-attention means our computation grows quadratically with the sequence length, i.e., \(O(T^2 d)\), where \(T\) is the sequence length and \(d\) is the dimensionality. Note that for recurrent models, the computation grows only linearly in \(T\)!
  • Say \(d = 1000\). For a single (shortish) sentence, \(T \le 30 \Rightarrow T^2 \le 900 \Rightarrow T^2 d \approx 900\text{K}\). Note that in practice, we set a bound such as \(T = 512\). Imagine working on long documents with \(T \ge 10{,}000\)!
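
  • To make the quadratic cost concrete, here is a minimal sketch (toy random tensors, single-head attention) showing that the score matrix alone is \(T \times T\):

import numpy as np

T, d = 512, 1000                  # sequence length, dimensionality
Q = np.random.randn(T, d)         # queries
K = np.random.randn(T, d)         # keys
V = np.random.randn(T, d)         # values

scores = Q @ K.T / np.sqrt(d)     # (T, T) matrix: T^2 pairwise interactions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
out = weights @ V                 # (T, d)

print(scores.shape)               # (512, 512) -- grows quadratically with T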

Why do we initialize weights randomly? / What if we initialize the weights with the same values?

  • If all weights are initialized with the same values, all neurons in each layer give you the same outputs (and thus redundantly learn the same features), which implies the model will never learn anything useful. This is the reason the weights are initialized with random numbers.
  • Detailed explanation:
    • The optimization algorithms we usually use for training neural networks are deterministic. Gradient descent, the most basic algorithm and the basis for the more sophisticated ones, is defined in terms of partial derivatives:
    \[\theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\Theta)\]
    • A partial derivative tells you how the optimized function changes when the parameter \(\theta_j\) changes. If all the parameters are identical, they all have the same impact on the result, so they all change by the same amount, and after the update they remain identical. In that case, every neuron does the same thing; the neurons are redundant, and there is no point in having more than one of them or in wasting compute repeating exactly the same operations. In other words, the model does not learn because the error is propagated back through the weights in proportion to the values of the weights: all hidden units connected directly to the output units receive identical error signals, and since the weight changes depend on the error signals, the weights from those units to the output units stay identical (a concrete sketch follows at the end of this answer).

    • When you initialize the neurons randomly, each of them will hopefully evolve in a different “direction” during optimization and learn to detect different features from the data. You can think of the early layers as doing automatic feature engineering for you by transforming the data that the final layer of the network uses. If all the learned features were the same, it would be wasted effort.

    • The Lottery Ticket Hypothesis paper by Frankle and Carbin explores the idea that big neural networks are so effective because randomly initializing a large number of parameters increases the chance of drawing lucky “lottery ticket” sub-networks whose initial weights happen to work well for the problem.
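
  • A minimal sketch (toy two-layer network, squared-error loss, made-up data) showing that identical initialization yields identical gradients, so the hidden units never differentiate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # toy inputs
y = rng.normal(size=(8, 1))          # toy targets

W1 = np.full((3, 4), 0.5)            # every hidden weight identical
W2 = np.full((4, 1), 0.5)

h = np.tanh(X @ W1)                  # hidden activations (all columns identical)
err = h @ W2 - y

grad_W2 = h.T @ err / len(X)         # identical for every hidden unit
grad_W1 = X.T @ ((err @ W2.T) * (1 - h**2)) / len(X)

# Each column of grad_W1 corresponds to one hidden unit; all columns are equal,
# so after any number of updates the hidden units remain exact copies of each other.
print(np.allclose(grad_W1, grad_W1[:, [0]]))   # True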

Describe learning rate schedule/annealing.

  • An optimizer is typically used with a learning rate schedule that involves a short warmup phase, a constant hold phase, and a decay/annealing phase. The decay/annealing is typically done using a cosine learning rate schedule over a number of cycles (Loshchilov & Hutter, 2016).
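
  • A minimal sketch of such a schedule (linear warmup, constant hold, cosine decay); the step counts and learning rate values are illustrative, not from the source:

import math

def lr_schedule(step, peak_lr=1e-3, warmup_steps=1000,
                hold_steps=4000, decay_steps=45000, min_lr=1e-5):
    """Linear warmup, constant hold, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    if step < warmup_steps + hold_steps:
        return peak_lr
    progress = min(1.0, (step - warmup_steps - hold_steps) / max(1, decay_steps))
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

print([round(lr_schedule(s), 6) for s in (0, 500, 2000, 25000, 50000)])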

Explain mean/average in terms of attention.

  • Averaging is equivalent to uniform attention.

Linear Regression

Pros

  • Simple to implement and efficient to train.
  • Overfitting can be reduced by regularization.
  • Performs well when the relationship between the features and the target is (approximately) linear.

Cons

  • Assumes that the observations are independent, which is rare in real life.
  • Prone to noise and overfitting.
  • Sensitive to outliers.

Logistic Regression

Pros

  • Less prone to over-fitting but it can overfit in high dimensional datasets.
  • Efficient when the dataset has features that are linearly separable.
  • Easy to implement and efficient to train.

Cons

  • Should not be used when the number of observations is less than the number of features.
  • Assumption of linearity which is rare in practice.
  • Can only be used to predict discrete outcomes (classes).

Support Vector Machines

Pros

  • Good at high dimensional data.
  • Can work on small dataset.
  • Can solve non-linear problems.

Cons

  • Inefficient on large data.
  • Requires picking the right kernel.

Decision Trees

  • Decision Trees can be used for both classification and regression.
  • For classification, the prediction is the majority class of the training samples in the leaf a point falls into.
  • For regression, the prediction is the average target value of the training samples in that leaf.

Pros

  • Can solve non-linear problems.
  • Can work on high-dimensional data with excellent accuracy.
  • Easy to visualize and explain.

Cons

  • Overfitting. Might be resolved by random forest.
  • A small change in the data can lead to a large change in the structure of the optimal decision tree.
  • Calculations can get very complex.

k-Nearest Neighbor

  • k-Nearest Neighbor (kNN) can be used for both classification and regression.
  • For classification, you can simply return the majority vote of the nearest neighbors.
  • For regression, you can return the averaged values of the nearest neighbors.

Pros

  • Can make predictions without training.
  • Prediction time complexity is \(O(n)\) per query (brute force).
  • Can be used for both classification and regression.

Cons

  • Does not work well with large datasets.
  • Sensitive to noisy data, missing values, and outliers.
  • Needs feature scaling.
  • Requires choosing the correct \(K\) value.

k-Means Clustering

  • k-Means Clustering (kMC) is an unsupervised clustering algorithm, not a classifier: it partitions the data into \(k\) clusters.

Pros

  • Simple to implement.
  • Scales to large data sets.
  • Guarantees convergence.
  • Easily adapts to new examples.
  • Generalizes to clusters of different shapes and sizes.

Cons

  • Sensitive to outliers.
  • Choosing the value of k manually is tough.
  • Dependent on initial values.
  • Scalability decreases when dimensionality increases.

Principal Component Analysis

  • Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated features (features that show covariance) into a smaller set of uncorrelated components and projects the data onto a lower-dimensional space.

Pros

  • Reduce correlated features.
  • Improve performance.
  • Reduce overfitting.

Cons

  • Principal components are less interpretable.
  • Information loss.
  • Must standardize data before implementing PCA.

Naive Bayes

Pros

  • Short training time.
  • Better suited for categorical inputs.
  • Easy to implement.

Cons

  • Assumes that all features are independent, which rarely happens in real life.
  • Suffers from the zero-frequency problem: a category not seen in training gets zero probability unless smoothing is applied.
  • Estimations can be wrong in some cases.

ANN

Pros

  • Have fault tolerance.
  • Have the ability to learn and model non-linear and complex relationships.
  • Can generalize on unseen data.

Cons

  • Long training time.
  • Non-guaranteed convergence.
  • Black box. Hard to explain solution.
  • Hardware dependence.
  • Requires the user to translate the problem into a suitable network representation.

Adaboost

Pros

  • Relatively robust to overfitting.
  • High accuracy.
  • Easy to understand and to visualize.

Cons

  • Sensitive to noisy data.
  • Affected by outliers.
  • Not optimized for speed.

Explain the central limit theorem and give examples of when you can use it in a real-world problem?

  • The central limit theorem states that if you take sufficiently large samples of any random variable, regardless of its underlying distribution, the sample mean will be approximately normally distributed. This allows studying the properties of any statistical distribution as long as the sample size is large enough.

Describe the motivation behind random forests and mention two reasons why they are better than individual decision trees?

  • The motivation behind random forests, or ensemble models in general, in layman’s terms: say we have a question/problem to solve. We bring 100 people, ask each of them the question, and record their solutions. Next, we prepare a solution that is a combination of all the solutions provided by these 100 people, and we will find that the aggregated solution is close to the actual solution. This is known as the “Wisdom of the Crowd”, and it is the motivation behind random forests: we take weak learners (ML models), specifically decision trees in the case of random forest, and aggregate their results to get good predictions while removing dependency on any particular set of features. For regression we take the mean, and for classification we take the majority vote of the classifiers.

A random forest is generally better than a decision tree; however, you should note that no algorithm is universally better than another, and it will always depend on the use case and the dataset [check the No Free Lunch Theorem in the first comment]. Reasons why random forests allow for stronger predictions than individual decision trees:

1) Decision trees are prone to overfit whereas random forest generalizes better on unseen data as it is using randomness in feature selection as well as during sampling of the data. Therefore, random forests have lower variance compared to that of the decision tree without substantially increasing the error due to bias.

2) Generally, ensemble models like Random Forest perform better as they are aggregations of various models (Decision Trees in the case of Random Forest), using the concept of the “Wisdom of the crowd.”

Mention three ways to make your model robust to outliers?

Solution (Summary of all of the mentioned answers):

  1. Investigating the outliers is always the first step in understanding how to treat them. After you understand the nature of why the outliers occurred you can apply one of the several methods mentioned below.

  2. Add regularization that will reduce variance, for example, L1 or L2 regularization.

  3. Use tree-based models (random forest, gradient boosting) that are generally less affected by outliers.

  4. Winsorize the data. Winsorizing (or winsorization) is the transformation of statistics by limiting extreme values in the data to reduce the effect of possibly spurious outliers. For numerical data, if the distribution is almost normal, we can detect outliers using the Z-score and treat them by either removing or capping them with some value; if the distribution is skewed, we can detect and treat them using the IQR in the same way. For categorical data, check the value counts as percentages: if we have very few records in some category, we can either remove it or cap it with some catch-all categorical value like “other” (see the sketch after this list).

  5. Transform the data, for example, you do a log transformation when the response variable follows an exponential distribution or is right-skewed.

  6. Use more robust error metrics such as MAE or Huber loss instead of MSE.

  7. Remove the outliers, only do this if you are certain that the outliers are true anomalies that are not worth adding to your model. This should be your last consideration since dropping them means losing information.
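
  • A minimal sketch of Z-score and IQR based capping (winsorizing) on toy data; the thresholds (3 standard deviations, 1.5 × IQR) are common conventions, not prescribed above:

import numpy as np

x = np.append(np.random.normal(50, 5, size=1000), [120, -40])   # toy data plus two outliers

# Z-score capping: clip values more than 3 standard deviations from the mean
mu, sigma = x.mean(), x.std()
x_z = np.clip(x, mu - 3 * sigma, mu + 3 * sigma)

# IQR capping: clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
x_iqr = np.clip(x, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(x.max(), x_z.max(), x_iqr.max())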

Given two arrays, write a Python function to return the intersection of the two. For example, for X = [1,5,9,0] and Y = [3,0,2,9] it should return [9,0].

  • A1 (The most repeated one):
set(X).intersection(set(Y))
  • A2:
set(X) & set(Y)
  • Using sets is a very good way to do it since it utilizes a hash map implementation underneath it.

  • A3:

def common_func(X, Y):
    Z = []
    for i in X:
        for j in Y:
            if i == j and i not in Z:
                Z.append(i)
    return Z
  • This is also a simple way to do it, however, it leads to the time complexity of O(N*M) so it is better to use sets.
  • Some other answers were mentioned that will work for the mentioned case but will return duplicates for other cases, for example, if X = [1,0,9,9] and Y = [3,0,9,9] it will return [0, 9, 9] not [0,9].

A1:

Res = [i for i in X if i in Y]

A2:

Z = [value for value in X if value in Y]
print(Z)

A3:

d = {}
for value in Y:
    if value not in d:
        d[value] = 1

intersection = []
for value in X:
    if value in d:
        intersection.append(value)

print(intersection)
  • The time complexity of this is O(n + m) and the space complexity is O(m); the problem with it is that it returns duplicates.

Given an array, find all the duplicates in this array for example: input: [1,2,3,1,3,6,5] output: [1,3]

  • Answer:

Approach 1:

arr = [1, 2, 3, 1, 3, 6, 5]
set1 = set()
res = []
for i in arr:
    if i in set1:
        res.append(i)
    else:
        set1.add(i)
print(res)
  • Approach 2:
import numpy as np

arr1 = np.array([1, 2, 3, 1, 3, 6, 5])
nums, index, counts = np.unique(arr1, return_index=True, return_counts=True)
print(nums, index, counts)
print(nums[counts != 1])
  • Approach 3:
from collections import Counter

a = [1, 2, 3, 1, 3, 6, 5]
j = [i for (i, v) in Counter(a).items() if v > 1]
print(j)

Approach 4: Use a map (dict) to get the frequency count of each element, then iterate over the map and print all keys whose counts are > 1, as in the sketch below.
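
A minimal sketch of Approach 4 (plain dictionary counting; variable names are illustrative):

a = [1, 2, 3, 1, 3, 6, 5]

freq = {}
for value in a:
    freq[value] = freq.get(value, 0) + 1      # frequency count of each element

duplicates = [key for key, count in freq.items() if count > 1]
print(duplicates)   # [1, 3]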

What are the differences and similarities between gradient boosting and random forest? and what are the advantage and disadvantages of each when compared to each other?

  • Similarities:
    1. Both these algorithms are decision-tree based algorithms
    2. Both these algorithms are ensemble algorithms
    3. Both are flexible models and do not need much data preprocessing.
  • Differences:
    1. Random forests (use bagging): Trees are built in a parallel fashion, and the results of all trees are aggregated at the end through averaging or majority vote. Gradient boosting (uses boosting): Trees are built in a sequential fashion, where every tree tries to minimize the error of the previous tree.
    2. Random forests: Every tree is constructed independently of the other trees. Gradient boosting: Every tree is dependent on the previous tree.
  • Advantages of gradient boosting over random forests:
    1. Gradient boosting can be more accurate than Random forests because we train them to minimize the previous tree’s error.
    2. Gradient boosting is capable of capturing complex patterns in the data.
    3. Gradient boosting is better than random forest when used on unbalanced data sets.
  • Advantages of random forests over gradient boosting :
    1. Random forests are less prone to overfitting compared to gradient boosting.
    2. Random forests train faster, as the trees are created in parallel and independently of each other.
  • Disadvantages of GB compared to RF:
    1. Gradient boosting is more prone to overfitting than random forests due to their focus on mistakes during training iterations and the lack of independence in tree building.
    2. If the data is noisy the boosted trees might overfit and start modeling the noise.
    3. In GB training might take longer because every tree is created sequentially.
    4. Tuning the hyperparameters of gradient boosting is harder than tuning those of random forests.
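
  • A minimal scikit-learn sketch comparing the two on a synthetic dataset (hyperparameters are illustrative defaults, not from the source):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)       # bagging: independent trees
gb = GradientBoostingClassifier(n_estimators=200, random_state=0)   # boosting: sequential trees

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))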

Small file and big file problem in Big data

  • The “small file problem” is kind of notorious in the big data space.
  • Did you know there’s also the “Big/large file problem”?
  • Say you have a billion records. The small file problem would be something like 10 records per file and 100 million files. Combining all these files is slow, terrible, and has made many data engineers cry.
  • The large file problem would be the opposite problem. 1 billion records in 1 file. This is also a huge problem because how do you parallelize 1 file? You can’t without splitting it up first.
  • To avoid crying, the solution is sizing your files the right way. Aiming for between 100-200 MB per file is usually best. In this contrived example, you’d have 1,000 files, each with 1 million records.
  • It is worth seeing the spread of files and the size and understanding what optimal file size works out best.
  • Too low and you have the risk of more files, too high and the parallelism isn’t going to be effective.
  • It is recommended to understand the parallelism and block size, and to look at how your data is distributed across files, before settling on an arbitrary default file size value.

What are L1 and L2 regularization? What are the differences between the two?

  • Regularization is a technique used to avoid overfitting by trying to make the model simpler. One way to apply regularization is to add a penalty on the weights to the loss function, which pushes the model to keep unimportant weights small. In L1 regularization we add the sum of the absolute values of the weights to the loss function. In L2 regularization we add the sum of the squares of the weights to the loss function.
  • So both L1 and L2 regularization are ways to reduce overfitting, but to understand the difference it’s better to know how they are calculated:
    • Loss (L2): Cost function + \(\lambda \sum_i w_i^2\)
    • Loss (L1): Cost function + \(\lambda \sum_i |w_i|\)
      • Where \(\lambda\) is the regularization parameter
  • L2 regularization penalizes large parameters, preventing any single parameter from getting too large, but the weights never become exactly zero. It adds the squares of the parameters to the loss, preventing the model from overfitting on any single feature.
  • L1 regularization penalizes weights by adding a term to the loss function which is the sum of the absolute values of the weights. This shrinks small parameter values until they hit zero and stay there for the rest of the epochs, removing those variables from the model entirely. It therefore helps simplify the model, and it is also useful for feature selection, since it shrinks to zero the coefficients that are not significant in the model.
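
  • A minimal scikit-learn sketch contrasting the two on synthetic data (the regularization strength alpha is illustrative); notice that Lasso (L1) drives most coefficients exactly to zero, while Ridge (L2) only shrinks them:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)   # only 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: zeroes out the useless coefficients

print(np.round(ridge.coef_, 3))
print(np.round(lasso.coef_, 3))      # most entries exactly 0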

What are the Bias and Variance in a Machine Learning Model and explain the bias-variance trade-off?

  • The goal of any supervised machine learning model is to estimate the mapping function (f) that predicts the target variable (y) given input (x). The prediction error can be broken down into three parts:

    • Bias: The bias is the simplifying assumption made by the model to make the target function easier to learn. Low bias suggests fewer assumptions made about the form of the target function; high bias suggests more assumptions made about the form of the target function. The smaller the bias error, the better the model. If the bias error is high, the model is underfitting the training data.

    • Variance: Variance is the amount that the estimate of the target function will change if different training data was used. The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance. Ideally, it should not change too much from one training dataset to the next, meaning that the algorithm is good at picking out the hidden underlying mapping between the inputs and the output variables. If the variance error is high this indicates that the model overfits the training data.

    • Irreducible error: It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable. The irreducible error cannot be reduced regardless of what algorithm is used.

  • The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn, the algorithm should achieve good prediction performance. The parameterization of machine learning algorithms is often a battle to balance out bias and variance.

    • For example, suppose you want to predict housing prices given a large set of potential predictors. A model with high bias and low variance, such as linear regression, will be easy to implement, but it will oversimplify the problem. In this context, high bias and low variance mean that the predicted house prices are frequently off from the market value, but the variance of these predicted prices is low.
    • On the other hand, a model with low bias and high variance, such as a neural network, will lead to predicted house prices closer to the market value, but with predictions varying widely based on the input features.

Briefly explain the A/B testing and its application? What are some common pitfalls encountered in A/B testing?

  • A/B testing helps us to determine whether a change in something will cause a change in performance significantly or not. So in other words you aim to statistically estimate the impact of a given change within your digital product (for example). You measure success and counter metrics on at least 1 treatment vs 1 control group (there can be more than 1 XP group for multivariate tests).

  • Applications:
    1. Consider the example of a general store that has sold bread but not butter for a year. If we want to check whether the sale of bread depends on butter, suppose the store starts selling butter and sales are observed for the next year. Now we can determine whether selling butter significantly increases, decreases, or does not affect the sale of bread.

    2. While developing the landing page of a website, you create 2 different versions of the page. You define a criterion for success, e.g., conversion rate. Then define your hypotheses:

      • Null hypothesis (H): No difference between the performance of the 2 versions.
      • Alternative hypothesis (H’): version A will perform better than B.
  • Note that you will have to split your traffic randomly (to avoid sample bias) into 2 versions. The split doesn’t have to be symmetric, you just need to set the minimum sample size for each version to avoid undersample bias.

  • Now if version A gives better results than version B, we still have to statistically prove that the results derived from our sample represent the entire population. One of the most common tests used to do so is the two-sample t-test, where we use the significance level (alpha) and the p-value to see which hypothesis is right: if p-value < alpha, H is rejected (see the sketch after this list).

  • Common pitfalls:
    1. Wrong success metrics inadequate to the business problem
    2. Lack of a counter metric: you might add friction to the product along with the positive impact
    3. Sample mismatch: heterogeneous control and treatment, unequal variances
    4. Underpowered test: too small a sample or the XP running too short
    5. Not accounting for network effects (which introduce bias into the measurement)
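
  • A minimal sketch of the two-sample t-test mentioned above, using SciPy on simulated conversion data (sample sizes, conversion rates, and alpha are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.binomial(1, 0.10, size=5000)     # version B: ~10% conversion
treatment = rng.binomial(1, 0.12, size=5000)   # version A: ~12% conversion

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)   # Welch's t-test

alpha = 0.05
print(p_value, "reject H" if p_value < alpha else "fail to reject H")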

Mention three ways to handle missing or corrupted data in a dataset?

  • In general, real-world data often has a lot of missing values. The cause of missing values can be data corruption or failure to record data. The handling of missing data is very important during the preprocessing of the dataset as many machine learning algorithms do not support missing values. However, you should start by asking the data owner/stakeholder about the missing or corrupted data. It might be at the data entry level, because of file encoding, etc. which if aligned, can be handled without the need to use advanced techniques.

  • There are different ways to handle missing data, we will discuss only three of them:

    1. Deleting the row with missing values

      • The first method to handle missing values is to delete the rows or columns that have null values. This is an easy and fast method and leads to a robust model, however, it will lead to the loss of a lot of information depending on the amount of missing data and can only be applied if the missing data represent a small percentage of the whole dataset.
    2. Using learning algorithms that support missing values

      • Some machine learning algorithms are robust to missing values in the dataset. The K-NN algorithm can ignore a column from a distance measure when there are missing values. Naive Bayes can also support missing values when making a prediction. Another algorithm that can handle a dataset with missing values or null values is the random forest model and Xgboost (check the post in the first comment), as it can work on non-linear and categorical data. The problem with this method is that these models’ implementation in the scikit-learn library does not support handling missing values, so you will have to implement it yourself.
    3. Missing value imputation

      • Data imputation means the substitution of estimated values for missing or inconsistent data in your dataset. There are different ways to estimate the values that will replace the missing value. The simplest one is to replace the missing value with the most repeated value in the row or the column. Another simple way is to replace it with the mean, median, or mode of the rest of the row or the column. The advantage of this is that it is an easy and fast way to handle missing data, but it might lead to data leakage and does not account for the covariance between features. A better way is to use a machine learning model to learn the pattern in the data and predict the missing values; this is a very good method that will not lead to data leakage and will account for the covariance between the features. The drawback of this method is its computational complexity, especially if your dataset is large (see the sketch below for a simple imputation example).
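
  • A minimal scikit-learn sketch of simple (mean) imputation on a toy array; the strategy is illustrative and could also be "median" or "most_frequent":

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # replace NaNs with the column mean
X_filled = imputer.fit_transform(X)
print(X_filled)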

How do you avoid #overfitting? Try one (or more) of the following:

  1. Training with more data, which makes the signal stronger and clearer, and can enable the model to detect the signal better. One way to do this is to use #dataaugmentation strategies

  2. Reducing the number of features in order to avoid the curse of dimensionality (which occurs when the amount of data is too low to support highly-dimensional models), which is a common cause for overfitting

  3. Using cross-validation. This technique works because the model is unlikely to make the same mistake on multiple different samples, and hence, errors will be evened out

  4. Using early stopping to end the training process before the model starts learning the noise

  5. Using regularization and minimizing the adjusted loss function. Regularization works because it discourages learning a model that’s overly complex or flexible

  6. Using ensemble learning, which ensures that the weaknesses of a model are compensated by the other ones

Data science / #MLinterview are hard - regardless of which side of the table you are on.

  • As a jobseeker, it can be really hard to shine, especially when the questions asked have little to no relevance to the actual job. How are you supposed to showcase your ability to build models when the entire interview revolves around binary search trees?

  • As a hiring manager, it’s close to impossible to evaluate modeling skills by just talking to someone, and false positives are really frequent. A question that dramatically reduces the noise on both sides:

“What is the most complex machine learning concept you have come across, and how would you explain it to your past self in a way that would have made it easier to understand before you learned it?”

  • The answer will tell you a lot more about the candidate than you might think:

    • 90% of candidates answer “overfitting”. If they’re junior and explain it really well, it means they’re detail-oriented and try to gain a thorough understanding of the field, but they sure could show more ambition; if they don’t explain it well, it means their understanding of the fundamentals is extremely basic.
    • If they answer back-propagation, and they can explain it well, it means they’re more math-oriented than average and will probably be a better fit for a research role than an applied DS role.
    • If their answer has something to do with a brand-new ML concept, and they can explain it well, it means they’re growth-oriented and well-read.
    • Generally speaking, if they answer something overly complicated and pompous, but can’t explain it well, it means they’re trying to impress but have an overall shallow understanding; a good rule of thumb is not to hire them.
  • Now, if you are a candidate, or an ML professional, keep asking yourself that question: “What is the most sophisticated concept, model or architecture you know of?” If you keep giving the same answer, maybe you’ve become complacent, and it’s time for you to learn something new.
  • How would you explain it to a newbie? As Einstein said, if “you can’t explain it simply, you don’t understand it well enough”.

Order of execution of an SQL Query in Detail

  • Each query begins with finding the data that we need in a database, and then filtering that data down into something that can be processed and understood as quickly as possible.

  • Because each part of the query is executed sequentially, it’s important to understand the order of execution so that you know what results are accessible where.

  • Consider the below mentioned query :

SELECT DISTINCT column, AGG_FUNC(column_or_expression), …
FROM mytable
JOIN another_table
ON mytable.column = another_table.column
WHERE constraint_expression
GROUP BY column
HAVING constraint_expression
ORDER BY column ASC/DESC
LIMIT count OFFSET count;
  • Query order of execution:

1. FROM and JOINs

  • The FROM clause, and subsequent JOINs, are first executed to determine the total working set of data that is being queried. This includes subqueries in this clause, and can cause temporary tables to be created under the hood containing all the columns and rows of the tables being joined.

2. WHERE

  • Once we have the total working set of data, the first-pass WHERE constraints are applied to the individual rows, and rows that do not satisfy the constraint are discarded. Each of the constraints can only access columns directly from the tables requested in the FROM clause. Aliases in the SELECT part of the query are not accessible in most databases since they may include expressions dependent on parts of the query that have not yet executed.

3. GROUP BY

  • The remaining rows after the WHERE constraints are applied are then grouped based on common values in the column specified in the GROUP BY clause. As a result of the grouping, there will only be as many rows as there are unique values in that column. Implicitly, this means that you should only need to use this when you have aggregate functions in your query.

4. HAVING

  • If the query has a GROUP BY clause, then the constraints in the HAVING clause are applied to the grouped rows, discarding the grouped rows that don’t satisfy the constraint. Like the WHERE clause, aliases are also not accessible from this step in most databases.

5. SELECT

  • Any expressions in the SELECT part of the query are finally computed.

6. DISTINCT

  • Of the remaining rows, rows with duplicate values in the column marked as DISTINCT will be discarded.

7. ORDER BY

  • If an order is specified by the ORDER BY clause, the rows are then sorted by the specified data in either ascending or descending order. Since all the expressions in the SELECT part of the query have been computed, you can reference aliases in this clause.

8. LIMIT / OFFSET

  • Finally, the rows that fall outside the range specified by the LIMIT and OFFSET are discarded, leaving the final set of rows to be returned from the query.

Explain briefly the logistic regression model and state an example of when you have used it recently?

  • Logistic regression is used to calculate the probability of occurrence of an event in the form of a dependent output variable based on independent input variables. Logistic regression is commonly used to estimate the probability that an instance belongs to a particular class. If the probability is bigger than 0.5 then it will belong to that class (positive) and if it is below 0.5 it will belong to the other class. This will make it a binary classifier.
  • It is important to remember that logistic regression itself isn’t a classification model; it is an ordinary regression algorithm that was developed and used before machine learning, but it can be used for classification when we apply a threshold to its output to determine specific categories.
  • There are a lot of classification applications for it: classifying email as spam or not, identifying whether a patient is healthy or not, etc.

Describe briefly the hypothesis testing and p-value in layman’s terms? And give a practical application for them?

  • In Layman’s terms:
    • Hypothesis testing is where you have a current state (the null hypothesis) and an alternative state (the alternative hypothesis). You assess the results under both states and observe some difference; you want to decide whether that difference is due to the alternative approach or just to chance.
    • You use the p-value to decide this, where the p-value is the probability of observing results at least as extreme as those obtained by the alternative approach if you kept using the existing approach (i.e., if the null hypothesis were true).
    • The rule of thumb is to reject the null hypothesis if the p-value < 0.05, which means that the probability of getting results at least this extreme under the existing approach is less than 5%. But this threshold changes according to the task and domain.
    • To explain the hypothesis testing in layman’s term with an example, suppose we have two drugs A and B, and we want to determine whether these two drugs are the same or different. This idea of trying to determine whether the drugs are the same or different is called hypothesis testing. The null hypothesis is that the drugs are the same, and the p-value helps us decide whether we should reject the null hypothesis or not.
    • p-values are numbers between 0 and 1, and in this particular case, it helps us to quantify how confident we should be to conclude that drug A is different from drug B. The closer the p-value is to 0, the more confident we are that the drugs A and B are different.

What is an activation function and discuss the use of an activation function? Explain three different types of activation functions?

  • In mathematical terms, the activation function serves as a gate between the current neuron input and its output, going to the next level. Basically, it decides whether neurons should be activated or not. It is used to introduce non-linearity into a model.
  • Activation functions are added to introduce non-linearity to the network, it doesn’t matter how many layers or how many neurons your net has, the output will be linear combinations of the input in the absence of activation functions. In other words, activation functions are what make a linear regression model different from a neural network. We need non-linearity, to capture more complex features and model more complex variations that simple linear models can not capture.
  • There are a lot of activation functions:
    • Sigmoid function: \(f(x) = 1/(1+exp(-x))\).
      • Its output value is between 0 and 1, so we can use it for classification. It has some problems, like the gradient vanishing at the extremes, and it is computationally expensive since it uses exp.
    • ReLU: \(f(x) = max(0,x)\).
      • it returns 0 if the input is negative and the value of the input if the input is positive. It solves the problem of vanishing gradient for the positive side, however, the problem is still on the negative side. It is fast because we use a linear function in it.
    • Leaky ReLU:
    \[f(x) = \begin{cases} a x, & x < 0 \\ x, & x \ge 0 \end{cases}\]
  • It solves the problem of the vanishing gradient on both sides by returning a small slope \(a\) (output \(ax\)) on the negative side, and it does the same thing as ReLU for the positive side.
    • Softmax: it is usually used at the last layer for a classification problem because it returns a set of probabilities, where the sum of them is 1. Moreover, it is compatible with cross-entropy loss, which is usually the loss function for classification problems.
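
  • A minimal NumPy sketch of the activation functions discussed above (the leaky-ReLU slope a=0.01 is a common default, not specified above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), leaky_relu(z), softmax(z))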

If you roll a dice three times, what is the probability to get two consecutive threes?

  • The answer is 11/216.
  • There are different ways to answer this question:
    • If we roll a dice three times we can get two consecutive 3’s in three ways:
      1. The first two rolls are 3s and the third is any other number with a probability of 1/6 * 1/6 * 5/6.
      2. The first one is not three while the other two rolls are 3s with a probability of 5/6 * 1/6 * 1/6.
      3. The last one is that the three rolls are 3s with probability 1/6 ^ 3.
        • So the final result is \(2 \times (5/6 \times (1/6)^2) + (1/6)^3 = 11/216\).
    • By Inclusion-Exclusion Principle:
      • Probability of at least two consecutive threes = P(two consecutive threes in the first two rolls) + P(two consecutive threes in the last two rolls) − P(three consecutive threes) = \(2 \times (1/6 \times 1/6) - (1/6 \times 1/6 \times 1/6) = 2/36 - 1/216 = 11/216\).
      • It can be seen also like this:
        • The sample space is made of (x, y, z) tuples where each letter can take a value from 1 to 6, so the sample space has 6 × 6 × 6 = 216 outcomes. The outcomes with two consecutive threes are of the form (3, 3, x) or (x, 3, 3): 6 for the first pattern, (3, 3, 1) through (3, 3, 6), and 6 for the second, (1, 3, 3) through (6, 3, 3). Subtracting the duplicate (3, 3, 3), which appears in both, leaves 11 outcomes, i.e., a probability of 11/216.
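
  • A brute-force enumeration sketch to verify the 11/216 result:

from fractions import Fraction
from itertools import product

hits = sum(1 for a, b, c in product(range(1, 7), repeat=3)
           if (a == 3 and b == 3) or (b == 3 and c == 3))
print(hits, Fraction(hits, 6 ** 3))   # 11, 11/216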

You and your friend are playing a game with a fair coin. The two of you will continue to toss the coin until the sequence HH or TH shows up. If HH shows up first, you win, and if TH shows up first your friend win. What is the probability of you winning the game?

  • If a T is ever flipped before you have reached HH, you can no longer reach HH before your friend reaches TH (the T will eventually be followed by an H, completing TH first). Therefore, the only way for you to win is to flip HH on the first two tosses. The sample space for the first two tosses is {HH, HT, TH, TT}, so the probability of you winning is 1/4 and the probability of your friend winning is 3/4.
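
  • A quick Monte Carlo sketch to sanity-check the 1/4 result (the number of trials is arbitrary):

import random

def you_win():
    """Flip until HH or TH appears; return True if HH appears first."""
    prev = random.choice("HT")
    while True:
        cur = random.choice("HT")
        if prev == "H" and cur == "H":
            return True     # HH appeared first: you win
        if prev == "T" and cur == "H":
            return False    # TH appeared first: your friend wins
        prev = cur

trials = 100_000
print(sum(you_win() for _ in range(trials)) / trials)   # ~0.25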

Dimensionality reduction techniques

  • Dimensionality reduction techniques help deal with the curse of dimensionality. Some of these are supervised learning approaches whereas others are unsupervised. Here is a quick summary:

    • PCA - Principal Component Analysis is an unsupervised learning approach and can handle skewed data easily for dimensionality reduction.
    • LDA - Linear Discriminant Analysis is also an eigenvector-based dimensionality reduction technique, but it also maximizes class separation while doing so. It is a supervised learning approach and performs better with uniformly distributed data.
    • ICA - Independent Component Analysis aims to maximize the statistical independence between variables and is an unsupervised learning approach.
    • MDS - Multi-dimensional Scaling aims to preserve the Euclidean pairwise distances. It is an unsupervised learning approach.
    • ISOMAP - also known as Isometric Mapping, is another dimensionality reduction technique, which preserves geodesic pairwise distances. It is an unsupervised learning approach. It can handle noisy data well.
    • t-SNE - t-distributed Stochastic Neighbor Embedding preserves local structure and is an unsupervised learning approach.

Active learning

  • Active learning is a semi-supervised ML training paradigm which, like all semi-supervised learning techniques, relies on the usage of partially labeled data.
  • Active Learning consists of dynamically selecting the most relevant data by sequentially:
    • selecting a sample of the raw (unannotated) dataset (the algorithm used for that selection step is called a querying strategy).
    • getting the selected data annotated.
    • training the model with that sample of annotated training data.
    • running inference on the remaining (unannotated) data.
  • That last step is used to evaluate which records should be then selected for the next iteration (called a loop). However, since there is no ground truth for the data used in the inference step, one cannot simply decide to feed the data where the model failed to make the correct prediction, and has instead to use metadata (such as the confidence level of the prediction) to make that decision.
  • The easiest and most common querying strategy used for selecting the next batch of useful data consists of picking the records with the lowest confidence level; this is called the least-confidence querying strategy, which is one of many possible querying strategies.

What is the independence assumption for a Naive Bayes classifier?

  • Naive Bayes assumes that the features are conditionally independent given the class \(c\), i.e., once the class is known, the features do not depend on each other and are totally uncorrelated.
  • This is why the Naive Bayes algorithm is called “naive”.
  • Mathematically, the features are independent given class:

    \[\begin{aligned} P\left(X_{1}, X_{2} \mid Y\right) &=P\left(X_{1} \mid X_{2}, Y\right) P\left(X_{2} \mid Y\right) \\ &=P\left(X_{1} \mid Y\right) P\left(X_{2} \mid Y\right) \end{aligned}\]
    • More generally: \(P\left(X_{1} \ldots X_{n} \mid Y\right)=\prod_{i} P\left(X_{i} \mid Y\right)\)

Explain briefly batch gradient descent, stochastic gradient descent, and mini-batch gradient descent? List the pros and cons of each.

  • Gradient descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of gradient descent is to tweak parameters iteratively in order to minimize a cost function.
    • Batch Gradient Descent:
      • In Batch Gradient Descent, the whole training set is used to compute the gradient (the direction of descent) and take a step toward the nearest minimum of the loss function at each iteration.
      • Pros:
        • Since the whole data set is used to calculate the gradient it will be stable and reach the minimum of the cost function without bouncing around the loss function landscape (if the learning rate is chosen correctly).
      • Cons:
        • Since batch gradient descent uses all the training set to compute the gradient at every step, it will be very slow especially if the size of the training data is large.
    • Stochastic Gradient Descent:
      • Stochastic Gradient Descent picks up a random instance in the training data set at every step and computes the gradient-based only on that single instance.
      • Pros:
        • It makes the training much faster as it only works on one instance at a time.
        • It becomes easier to train on large datasets.
      • Cons:
        • Due to its stochastic (random) nature, this algorithm is much less stable than batch gradient descent. Instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around rather than settling. So once the algorithm stops, the final parameters are likely good but not optimal. For this reason, it is important to use a learning rate schedule (gradually reducing the learning rate) to overcome this randomness.
    • Mini-batch Gradient:
      • At each step instead of computing the gradients on the whole data set as in the Batch Gradient Descent or using one random instance as in the Stochastic Gradient Descent, this algorithm computes the gradients on small random sets of instances called mini-batches.
      • Pros:
        • The algorithm’s progress through parameter space is less erratic than with Stochastic Gradient Descent, especially with large mini-batches.
        • You can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
      • Cons:
        • It might be difficult to escape from local minima.
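
  • A minimal NumPy sketch of mini-batch gradient descent for linear regression (toy data; learning rate, batch size, and epochs are illustrative). Setting batch_size = len(X) recovers batch gradient descent, and batch_size = 1 recovers stochastic gradient descent:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size, epochs = 0.1, 32, 20

for _ in range(epochs):
    idx = rng.permutation(len(X))                      # shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # MSE gradient on the mini-batch
        w -= lr * grad

print(np.round(w, 3))   # close to [2.0, -1.0, 0.5]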

Explain what is information gain and entropy in the context of decision trees?

  • Entropy and Information Gain are two key metrics used in determining the relevance of decision making when constructing a decision tree model and to determine the nodes and the best way to split.
  • The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until we reach a small enough set that contains data points that fall under one label.
  • Entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples. Entropy controls how a Decision Tree decides to split the data. Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way. It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.
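
  • A minimal sketch computing entropy and the information gain of a candidate split (toy binary labels; base-2 logarithm, a common convention):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])
left, right = parent[:4], parent[4:]     # one candidate split
print(entropy(parent), information_gain(parent, left, right))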

What are some applications of RL beyond gaming and self-driving cars?

  • Reinforcement learning is NOT just used in gaming and self-driving cars, here are three common use cases you should know in 2022:
  1. Multi-arm bandit testing (MAB)

    • A little bit about reinforcement learning (RL): you train an agent to interact with the environment and figure out the optimum policy which maximizes the reward (a metric you select).
    • MAB is a classic reinforcement learning problem that can help you find the best option out of many treatments in an experiment.
    • Unlike A/B tests, MAB tries to maximize a metric (reward) during the course of the test, and it usually has a lot of treatments to select from. The trade-off is that you can draw causal inferences through traditional A/B testing, but it’s hard to analyze each treatment through MAB; however, because it’s dynamic, it might select the best treatment faster than A/B testing.
  2. Recommendation engines

    • While traditional matrix factorization works well for recommendation engines, using reinforcement learning can help you maximize metrics like customer engagement and metrics that measure downstream impact.
    • For example, social media can use RL to maximize ‘time spent’ or ‘review score’ when recommending content; so this way, instead of just recommending similar content, you might also help customers discover new content or other popular content they like.
  3. Portfolio Management

    • RL has been used in finance recently as well. Data scientists can train an agent to interact with a trading environment to maximize the return of the portfolio. For example, if the agent selects an allocation of 70% stock, 10% cash, and 20% bonds, the agent gets a positive or negative reward for this allocation. Through iteration, the agent finds the best allocation.
    • Robo-advisers can also use RL to learn investors’ risk tolerance.
    • Of course, self-driving cars, gaming, robotics use RL heavily, but I’ve seen data scientists from industries mentioned above (retail, social media, finance) start to use more RL in their day-to-day work.
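
  • A minimal epsilon-greedy multi-arm bandit sketch (made-up conversion rates and epsilon; a real test would use the observed rewards of live treatments):

import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.05, 0.08, 0.12]       # hidden conversion rate of each treatment
counts = np.zeros(3)
values = np.zeros(3)                  # running reward estimate per arm
epsilon = 0.1

for _ in range(10_000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))    # explore a random arm
    else:
        arm = int(np.argmax(values))  # exploit the best-looking arm
    reward = float(rng.random() < true_rates[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print(counts.astype(int), np.round(values, 3))   # most traffic flows to the best arm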

You are using a deep neural network for a prediction task. After training your model, you notice that it is strongly overfitting the training set and that the performance on the test isn’t good. What can you do to reduce overfitting?

  • To reduce overfitting in a deep neural network changes can be made in three places/stages: The input data to the network, the network architecture, and the training process:
    1. The input data to the network:
    • Check if all the features are available and reliable
    • Check if the training sample distribution is the same as the validation and test set distributions. If the validation set distribution differs, it is hard for the model to predict, because those patterns are unknown to it.
    • Check for train / valid data contamination (or leakage)
    • The dataset size is enough, if not try data augmentation to increase the data size
    • The dataset is balanced
    2. Network architecture:
    • Overfitting could be due to model complexity. Question each component:
      • can fully connect layers be replaced with convolutional + pooling layers?
      • what is the justification for the number of layers and number of neurons chosen? Given how hard it is to tune these, can a pre-trained model be used?
    • Add regularization: lasso (L1), ridge (L2), or elastic net (both)
    • Add dropouts
    • Add batch normalization
    3. The training process:
    • Improvements in validation loss should decide when to stop training. Use callbacks for early stopping when there are no significant changes in the validation loss, and restore the best weights (e.g., restore_best_weights), as in the sketch below.
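
  • A minimal Keras-style sketch of early stopping with restored best weights (toy data and model; patience and architecture are illustrative):

import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 20)), rng.integers(0, 2, size=800)
X_val, y_val = rng.normal(size=(200, 20)), rng.integers(0, 2, size=200)

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dropout(0.3),                        # dropout regularization
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100, callbacks=[early_stop], verbose=0)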

Explain the linear regression model and discuss its assumption?

  • Linear regression is a supervised statistical model used to predict a dependent variable (a quantity) based on independent variables.
  • Linear regression is a parametric model; its objective is to learn coefficients from the training data and then predict the target value given only the independent variables.
  • Some of the linear regression assumptions and how to validate them:
    1. Linear relationship between independent and dependent variables
    2. Independent residuals and constant residual variance (homoscedasticity) at every \(x\): We can check for 1 and 2 by plotting the residuals (error terms) against the fitted values (upper left graph). Generally, we should look for a lack of patterns and a consistent variance across the horizontal line.
    3. Normally distributed residuals: We can check for this using a couple of methods:
      • Q–Q plot (upper right graph): if the data is normally distributed, points should roughly align with the 45-degree line.
      • Boxplot: it also helps visualize outliers.
      • Shapiro–Wilk test: if the p-value is lower than the chosen threshold, then the null hypothesis (the data is normally distributed) is rejected.
    4. Low multicollinearity
      • You can calculate the VIF (Variable Inflation Factors) using your favorite statistical tool. If the value for each covariate is lower than 10 (some say 5), you’re good to go.
  • The figure below summarizes these assumptions.

Explain briefly the K-Means clustering and how can we find the best value of K?

  • K-Means is a well-known clustering algorithm, often used because it is easy to interpret and implement. It starts by partitioning a set of data into \(K\) distinct clusters with arbitrarily chosen initial centroids, then iteratively updates the partition by assigning points to the closest centroid, recomputing the centroids, and repeating this process until convergence. The process essentially minimizes the total within-cluster variation across all clusters.
  • The elbow method is a well-known method to find the best value of \(K\) in K-Means clustering. The intuition behind this technique is that the first few clusters will explain a lot of the variation in the data, but past a certain point, the amount of information added is diminishing. Looking at a graph of the explained variation (on the y-axis) versus the number of clusters \(K\) (on the x-axis), there should be a sharp change in the y-axis at some level of \(K\); for example, the drop-off might be at \(k=3\).
  • The explained variation is quantified by the within-cluster sum of squared errors. To calculate this error, we compute, for each cluster, the sum of squared Euclidean distances between its points and its centroid.
  • Another popular alternative to find the value of \(K\) is the silhouette method, which aims to measure how similar each point is to its own cluster compared to other clusters. It can be calculated with the equation \((x-y)/\max(x,y)\), where \(x\) is the mean distance to the examples of the nearest other cluster and \(y\) is the mean distance to the other examples in the same cluster. The coefficient varies between -1 and 1 for any given point: a value of 1 implies that the point is in the right cluster, and a value of -1 implies that it is in the wrong cluster. By plotting the silhouette coefficient (on the y-axis) versus each \(K\), we can get an idea of the optimal number of clusters. However, it is worth noting that this method is more computationally expensive than the elbow method (see the sketch below for both).
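
  • A minimal scikit-learn sketch of both methods on synthetic blobs (the range of candidate K values and the data parameters are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k,
          round(km.inertia_, 1),                       # within-cluster sum of squares (elbow method)
          round(silhouette_score(X, km.labels_), 3))   # silhouette method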

Given an integer array, return the maximum product of any three numbers in the array.

  • For example:
A = [1, 5, 3, 4] it should return 60
B = [-2, -4, 5, 3] it should return 40
  • If all the numbers are positive, the answer is the product of the three largest numbers. If the array contains negative numbers, the answer is the maximum of the product of the three largest numbers and the product of the two smallest (most negative) numbers with the largest number.
  • We can use the heapq library to find the largest and smallest numbers in one step, as in the sketch below.
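
  • A minimal implementation of this idea with heapq (sketched here since the original image is not reproduced):

import heapq

def max_product_of_three(nums):
    three_largest = heapq.nlargest(3, nums)
    two_smallest = heapq.nsmallest(2, nums)
    return max(three_largest[0] * three_largest[1] * three_largest[2],
               two_smallest[0] * two_smallest[1] * three_largest[0])

print(max_product_of_three([1, 5, 3, 4]))      # 60
print(max_product_of_three([-2, -4, 5, 3]))    # 40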

What are joins in SQL and discuss its types?

  • A JOIN clause is used to combine rows from two or more tables based on a related column between them. It is used to merge two tables or retrieve data from them. There are 4 types of joins: inner join, left join, right join, and full join.
  • Inner join: Inner Join in SQL is the most common type of join. It is used to return all the rows from multiple tables where the join condition is satisfied.
  • Left Join: Left Join in SQL is used to return all the rows from the left table but only the matching rows from the right table where the join condition is fulfilled.
  • Right Join: Right Join in SQL is used to return all the rows from the right table but only the matching rows from the left table where the join condition is fulfilled.
  • Full Join: Full join returns all the records when there is a match in any of the tables. Therefore, it returns all the rows from the left-hand side table and all the rows from the right-hand side table.

Why should we use Batch Normalization?

  • Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch.
  • Usually, a dataset is fed into the network in batches, and the distribution of the data can differ from batch to batch. Because of this, there can be vanishing or exploding gradients during backpropagation. To combat these issues, we can add a BN layer, most commonly on the inputs to a layer before the activation function, and after fully connected layers.
  • Batch Normalisation has the following effects on the Neural Network:
    • More robust training of the deeper layers of the network.
    • Makes the NN architecture more robust to covariate shift.
    • Has a slight regularization effect.
    • Centered and controlled activation values.
    • Helps prevent exploding/vanishing gradients.
    • Faster training/convergence.

What is weak supervision?

  • Weak Supervision (which most people know as the Snorkel algorithm) is an approach designed to help annotate data at scale, and it’s a pretty clever one too.
  • Imagine that you have to build a content moderation system that can flag LinkedIn posts that are offensive. Before you can build a model, you’ll first have to get some data. So you’ll scrape posts. A lot of them, because content moderation is particularly data-greedy. Say, you collect 10M of them. That’s when trouble begins: you need to annotate each and every one of them - and you know that’s gonna cost you a lot of time and a lot of money!
  • So you want to use autolabeling (basically, you want to apply a pre-trained model) to generate ground truth. The problem is that such a model doesn’t just lie around, as this isn’t your vanilla object detection for autonomous driving use case, and you can’t just use YOLO v5.
  • Rather than seek the budget to annotate all that data, you reach out to subject matter experts you know on LinkedIn, and you ask them to give you a list of rules of what constitutes, according to each one of them, an offensive post.
Person 1's rules:
- The post is in all caps
- There is a mention of Politics

Person 2's rules:
- The post is in all caps
- It uses slang
- The topic is not professional

...

Person 20's rules:
- The post is about religion
- The post mentions death
  • You then combine all rules into a mega processing engine that functions as a voting system: if a comment is flagged as offensive by at least X% of those 20 rule sets, then you label it as offensive. You apply the same logic to all 10M records and are able to annotate them in minutes, at almost no cost.
  • You just used a weakly supervised algorithm to annotate your data.
  • You can of course replace people’s inputs with embeddings, or some other automatically generated information, which comes in handy in cases where no clear rules can be defined (for example, try coming up with rules to flag a cat in a picture).
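
  • A minimal sketch of the rule-voting idea (toy rules and threshold; real weak-supervision frameworks such as Snorkel learn to weight and denoise the labeling functions rather than taking a simple majority vote):

def rule_all_caps(post):
    return post.isupper()

def rule_mentions_politics(post):
    return "politics" in post.lower()

def rule_mentions_death(post):
    return "death" in post.lower()

RULES = [rule_all_caps, rule_mentions_politics, rule_mentions_death]

def weak_label(post, threshold=0.5):
    votes = sum(rule(post) for rule in RULES)          # how many rule sets flag the post
    return "offensive" if votes / len(RULES) >= threshold else "not offensive"

posts = ["THIS IS OUTRAGEOUS POLITICS!!!", "Had a great day at work today."]
print([weak_label(p) for p in posts])   # ['offensive', 'not offensive']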

What is active learning?

  • When you don’t have enough labeled data and it’s expensive and/or time consuming to label new data, active learning is the solution. Active learning is a semi-supervised ML training paradigm which, like all semi-supervised learning techniques, relies on the usage of partially labeled data. Active Learning helps to select unlabeled samples to label that will be most beneficial for the model, when retrained with the new sample.
  • Active Learning consists of dynamically selecting the most relevant data by sequentially:
    • selecting a sample of the raw (unannotated) dataset (the algorithm used for that selection step is called a querying strategy)
    • getting the selected data annotated
    • training the model with that sample of annotated training data
    • running inference on the remaining (unannotated) data.
  • That last step is used to evaluate which records should be selected for the next iteration (called a loop). However, since there is no ground truth for the data used in the inference step, one cannot simply pick the records where the model failed to make the correct prediction, and instead has to use metadata (such as the confidence level of the prediction) to make that decision.
  • The easiest and most common querying strategy used for selecting the next batch of useful data consists of picking the records with the lowest confidence level; this is called the least-confidence querying strategy, which is one of many possible querying strategies. (Technically, those querying strategies are usually brute-force, arbitrary algorithms which can be replaced by actual ML models trained on metadata generated during the training and inference phases for more sophistication).
  • Thus, the most important criterion is selecting samples with maximum prediction uncertainty. You can use the model’s prediction confidence to identify uncertain samples; entropy is another way to measure such uncertainty. Another criterion could be the diversity of the new sample with respect to the existing training data. You could also select samples close to labeled training samples on which the model performs poorly, or samples from regions of the feature space where better performance is desired. You can combine all of these strategies in your active learning decision-making process.
  • Training is an iterative process: with active learning you select a new sample to label, label it, and retrain the model. Adding one labeled sample at a time and retraining could be expensive, so there are techniques to select a whole batch of samples to label per loop. For deep learning, the most popular active learning technique is entropy over prediction probabilities estimated with Monte Carlo dropout.
  • The process of deciding which samples to label could also be framed as a multi-armed bandit problem, with the reward function defined in terms of prediction uncertainty, diversity, etc.
  • Let’s go deeper and explain why the vanilla form of active learning, “uncertainty-based”/“least-confidence” active learning, can actually perform poorly on real-life datasets:
    • Let’s take the example of a binary classification model identifying toxic content in tweets, and let’s say we have 100,000 tweets as our dataset.
    • Here is how uncertainty-based AL would work:
      1. We pick 1,000 (or another number, depending on how we tune the process) records - at that stage, randomly.
      2. We annotate that data as toxic / not-toxic.
      3. We train our model with it and get a (not-so-good) model.
      4. We use the model to infer the remaining 99,000 (unlabeled) records.
      5. We don’t have ground truth for those 99,000, so we can’t select which records are incorrectly predicted, but we can use metadata, such as the confidence level, as a proxy to detect bad predictions. With least confidence Active Learning, we would pick the 1,000 records predicted with the lowest confidence level as our next batch.
      6. Go to (2) and repeat the same steps, until we’re happy with the model.
    • What we did here is assume that confidence is a good proxy for usefulness: low-confidence records are presumed to be the hardest for the model to learn, and hence the ones the model most needs to see in order to learn efficiently.
    • Let’s consider a scenario where it is not. Assume now that this training data is not clean, and 5% of the data is actually in Spanish. If the model (and the majority of the data) was meant to be for English, then chances are, the Spanish tweets will be inferred with a low confidence: you will actually pollute the dataset with data that doesn’t belong there. In other words, low confidence can happen for a variety of different reasons. That’s what happens when you do active learning with messy data.
      • To resolve this, one solution is to stop using confidence level alone: confidence levels are just one meta-feature to evaluate usefulness.
  • In a nutshell, active learning is an incremental, semi-supervised learning paradigm where training data is selected loop after loop and the model is sequentially retrained, until either the model reaches a target performance or the labeling budget is exhausted.
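  • A minimal sketch of this pool-based, least-confidence loop, assuming a scikit-learn-style classifier with a predict_proba method; the annotate() helper (the human labeling step), the batch size, and the number of loops are hypothetical placeholders.

import numpy as np

def least_confidence_loop(model, X_pool, annotate, batch_size=1000, n_loops=5):
    # Loop 0: the first batch is picked at random.
    labeled_idx = np.random.choice(len(X_pool), batch_size, replace=False)
    for _ in range(n_loops):
        y_labeled = annotate(X_pool[labeled_idx])               # human-in-the-loop labeling
        model.fit(X_pool[labeled_idx], y_labeled)               # retrain on all labeled data so far
        remaining = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
        proba = model.predict_proba(X_pool[remaining])          # infer the unlabeled remainder
        confidence = proba.max(axis=1)                          # top-class probability as a proxy
        query = remaining[np.argsort(confidence)[:batch_size]]  # least-confident records
        labeled_idx = np.concatenate([labeled_idx, query])
    return model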

What are the types of active learning?

  • There are many different “flavors” of active learning, but did you know that active learning could be broken down into two main categories, “streaming active learning”, and “pooling (batch) active learning”?
  • Pooling active learning is when all records available for training have to be evaluated before a decision can be made about which ones to keep. For example, if your querying strategy is least-confidence, your goal is to select the N records that were predicted with the lowest confidence level in the previous loop, which means all records have to be ranked according to their confidence level. Pooling active learning hence requires more compute resources for inference (the entire remainder of the dataset needs to be inferred at each loop), but provides better control of loop sizes and of the process as a whole.
  • Streaming active learning is when a decision is made “on the fly”, record by record. If your selection strategy were to select all records predicted with a confidence level lower than X% in the previous loop, you’d be doing streaming AL. This technique obviously requires less compute, and can be used in combination with online learning, but it comes with a significant risk: there is no guarantee regarding the amount of data that will be selected. Set the threshold too low, and you won’t select any data for the next loop; set the threshold too high, and all the remaining data gets selected, and you lose the benefit of AL.
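  • A minimal sketch contrasting the two selection modes, assuming `confidence` holds the top-class probabilities from the previous loop’s inference (the numbers below are illustrative):

import numpy as np

confidence = np.random.rand(99_000)  # stand-in for real per-record confidence scores

# Pooling: rank the entire remaining pool and keep exactly N records (fixed loop size).
N = 1000
pool_selection = np.argsort(confidence)[:N]

# Streaming: decide record by record against a threshold (loop size is not guaranteed).
THRESHOLD = 0.6
stream_selection = [i for i, c in enumerate(confidence) if c < THRESHOLD]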

What is the difference between online learning and active learning?

  • Online learning is essentially the concept of training a machine learning model on streaming data. In that case, data arrives little by little, sequentially, and the model is updated as opposed to being retrained entirely from scratch.
  • Active learning also consists of training a model sequentially, but the difference is that the training dataset is already fully available. Active learning simply selects small samples of data incrementally; the model is either retrained with the totality of the selected records at a given point in time, or updated with the newly selected data.
  • Online learning is required when models are to be trained at the point of collection (e.g., on an edge device), while active learning, just like ordinary supervised learning, usually involves the model being trained offline.
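  • A minimal online-learning sketch using scikit-learn’s SGDClassifier, which supports incremental updates via partial_fit (the simulated stream and batch sizes are illustrative assumptions):

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # the full label set must be declared for partial_fit

for _ in range(100):                       # each iteration mimics a new mini-batch arriving
    X_batch = np.random.randn(32, 10)
    y_batch = np.random.randint(0, 2, size=32)
    model.partial_fit(X_batch, y_batch, classes=classes)  # update, don't retrain from scratch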

Why is active learning not frequently used with deep learning?

  • Active Learning was relatively popular among ML scientists during the pre-Deep Learning era, and somehow fell out of favor afterwards.

  • The reason is actually relatively simple: active learning usually doesn’t work as well with deep learning models (at least the most common querying strategies don’t), so people gave up on deep active learning pretty quickly. The two most important reasons are the following:

  1. Least-confidence, by far the most popular querying strategy, requires the computation of a confidence score. However, the softmax output that most ML scientists rely on is a relatively unreliable (poorly calibrated) confidence estimate (see https://arxiv.org/pdf/1706.04599.pdf for details and a better way to compute confidence).

  2. Active learning, as a process, is actually meant to “grow” a better dataset dynamically. At each loop, more records are selected, which means the same model is retrained with incrementally larger data. However, many hyperparameters in neural nets are very sensitive to the amount of data used. For example, a certain number of epochs might lead to overfitting with early loops and underfitting later on. The proper way of doing Deep Active Learning would be to do hyperparameter tuning dynamically, which is rarely done.
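  • As a concrete alternative to raw softmax confidence, here is a minimal PyTorch sketch of the Monte Carlo dropout entropy strategy mentioned earlier (the toy model, the number of stochastic passes T, and the pool data are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2))

def mc_dropout_entropy(model, x, T=20):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        # Average the softmax over T stochastic forward passes.
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)]).mean(dim=0)
    # Predictive entropy: higher means more uncertain.
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

x_pool = torch.randn(1000, 10)
scores = mc_dropout_entropy(model, x_pool)
query = scores.topk(100).indices  # select the 100 most uncertain records for labeling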

What does active learning have to do with explore-exploit?

  • Using the “uncertainty-based”/“least-confidence” querying strategy as a selection criterion in an active learning process can cause issues when working with a real-life (messy) dataset, as indicated above.
  • Uncertainty-based active learning aims at selecting records based on how “certain” (or confident) the model already is about what it knows. Assuming the model can be trusted to self-evaluate properly, then:
    • Selecting low confidence records is about picking what the model seems not to know yet; it is a pure exploration process.
    • Selecting high confidence records is about picking what the model seems to already know, and that would be about reinforcing that knowledge; it is a pure exploitation process.
  • While the “uncertainty-based”/“least-confidence” querying strategy is the most commonly used in active learning, it might be better to balance exploration and exploitation; in fact, active learning can and arguably should be formulated as a reinforcement learning problem.
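  • A simple way to mix the two is an epsilon-greedy-style querying strategy (the epsilon value and batch size below are illustrative assumptions): a fraction of the batch is spent exploring low-confidence records, and the rest exploiting high-confidence ones.

import numpy as np

def epsilon_greedy_query(confidence, batch_size=1000, epsilon=0.7):
    order = np.argsort(confidence)                   # ascending: least confident first
    n_explore = int(epsilon * batch_size)
    explore = order[:n_explore]                      # uncertain records (exploration)
    exploit = order[::-1][:batch_size - n_explore]   # confident records (exploitation)
    return np.concatenate([explore, exploit])

confidence = np.random.rand(99_000)                  # stand-in for real confidence scores
next_batch = epsilon_greedy_query(confidence)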

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledInterviewQuestions,
  title   = {Interview Questions},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}