• Let’s talk about concept drift (also known as “shift”), an important topic in MLOps and key to your model’s success in production.

  • A predictive ML model learns parameters \(\theta\) to output \(P(Y \mid X; \theta)\).

    1. Data drift/shift refers to changes in \(P(X)\): different data distributions, different feature space. Data drift usually hurts model quality because the data the model was trained on no longer matches the examples it sees in the real world.
      • Ex: service launched in a new country, expected features missing.
    2. Label schema drift/shift refers to changes in \(Y\): new classes, outdated classes, finer-grained classes. Especially common with high-cardinality tasks.
      • Ex: there’s a new disease to categorize.
    3. Model drift/shift refers to changes in \(P(Y \mid X)\): the same inputs now expect different outputs.
      • Ex: when users searched for Wuhan pre-Covid, they expected different results from what they do now.
      • Model drift can be cyclic, e.g. ride-share demand on weekdays vs. weekends.
  • Concept drift doesn’t happen only because of drastic events. It happens all the time, either:
    • suddenly (new competitors, competitors changing their pricing policies, new memes, new events, new fads) or
    • gradually (changes in social norms, cultures, languages)
  • Different features drift at different rates.
    • Ex: an app’s ranking might be useful for predicting whether it gets downloaded, but rankings drift quickly.
  • In production, you might sometimes prefer weaker features if they’re more stable.

  • Not all drifts require model retraining or relabeling, but all require monitoring.
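To make the distinction concrete, here is a minimal sketch (the spam task, the exclamation-mark feature, and all the probabilities are invented for illustration) contrasting data drift, where \(P(X)\) changes but \(P(Y \mid X)\) does not, with model drift, where \(P(Y \mid X)\) itself changes:

```python
import random

random.seed(42)

# Toy spam model: X = number of exclamation marks, Y = spam (1) or not (0).
# Original world: P(Y=1 | X) rises with X; P(X) favors small counts.
def sample_original():
    x = random.choice([0, 0, 0, 1, 1, 2, 3])
    p_spam = min(1.0, 0.2 * x)
    return x, 1 if random.random() < p_spam else 0

# Data drift: P(X) changes (users suddenly love exclamation marks),
# but P(Y | X) is unchanged, so the old input/output rule still holds.
def sample_data_drift():
    x = random.choice([2, 3, 3, 4, 4, 5])
    p_spam = min(1.0, 0.2 * x)
    return x, 1 if random.random() < p_spam else 0

# Model drift: P(Y | X) changes (spammers stop using "!"), so the
# same input now expects a different output.
def sample_model_drift():
    x = random.choice([0, 0, 0, 1, 1, 2, 3])
    p_spam = max(0.0, 0.3 - 0.1 * x)
    return x, 1 if random.random() < p_spam else 0

# The average feature value shifts under data drift; the label behaviour
# for a fixed feature value shifts under model drift.
orig = [sample_original() for _ in range(2000)]
drifted = [sample_data_drift() for _ in range(2000)]
print(sum(x for x, _ in orig) / 2000, sum(x for x, _ in drifted) / 2000)
```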

Monitoring for drifts

  • Model drift is not uncommon. A few things that can be done for monitoring:
    1. Have data quality and integrity checks on feature collection and inference services to detect unexpected inputs/outcomes.
    2. Have a held-out set reviewed manually by stakeholder teams on a monthly/quarterly basis to detect changes in business models.
  • You can also monitor automatically using a classifier: take your entire training data \(X\) and apply pseudo-label 1, then take a sample of production data, drawn with a sampling scheme similar to the one used for the training data, and apply pseudo-label 0. Fit a classifier on the combined set and measure how well it’s able to separate the two classes (MCC is a good metric). The space has drifted when it’s very easy for the classifier to separate the two. This measures data drift and scales well.
  • Supposedly Netflix monitors drift by applying the Kolmogorov–Smirnov test, according to a presenter at the Spark + AI Summit.
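The pseudo-labeling check above can be sketched in pure Python. A one-dimensional decision stump stands in for the classifier, and the Gaussian features and sample sizes are invented for illustration:

```python
import math
import random

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0 if the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def stump_mcc(train_x, prod_x):
    """Pseudo-label training data 1 and production data 0, fit a 1-D
    decision stump to separate them, and return the best MCC found."""
    data = [(x, 1) for x in train_x] + [(x, 0) for x in prod_x]
    best = 0.0
    for thr, _ in data:
        for sign in (1, -1):  # predict 1 on either side of the threshold
            tp = tn = fp = fn = 0
            for x, y in data:
                pred = 1 if sign * (x - thr) >= 0 else 0
                if pred == 1 and y == 1: tp += 1
                elif pred == 0 and y == 0: tn += 1
                elif pred == 1 and y == 0: fp += 1
                else: fn += 1
            best = max(best, mcc(tp, tn, fp, fn))
    return best

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(300)]
same  = [random.gauss(0.0, 1.0) for _ in range(300)]  # production, no drift
moved = [random.gauss(2.0, 1.0) for _ in range(300)]  # production, drifted P(X)

print("no drift MCC:", round(stump_mcc(train, same), 2))   # hard to separate: low MCC
print("drifted  MCC:", round(stump_mcc(train, moved), 2))  # easy to separate: high MCC
```

In a real pipeline you would swap the stump for a multivariate classifier (e.g. gradient-boosted trees over all features) and alert when the MCC on a held-out split exceeds some threshold.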
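A two-sample KS check along those lines can also be sketched in pure Python. The statistic itself is standard; the reference/production windows, sample sizes, and the rough 5% critical value for equal-size samples are illustrative:

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov–Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

random.seed(1)
n = 500
ref     = [random.gauss(0.0, 1.0) for _ in range(n)]  # reference (training) window
same    = [random.gauss(0.0, 1.0) for _ in range(n)]  # production, no drift
shifted = [random.gauss(0.5, 1.0) for _ in range(n)]  # production, shifted mean

# Rough critical value at alpha = 0.05 for two equal-size samples of n points.
crit = 1.36 * math.sqrt(2 / n)

# Drift is flagged when the statistic exceeds the critical value.
print("no drift:", round(ks_statistic(ref, same), 3), "vs crit", round(crit, 3))
print("drifted: ", round(ks_statistic(ref, shifted), 3), "vs crit", round(crit, 3))
```

Note the KS test is per-feature and univariate, which is one reason the classifier-based check scales better to many features.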