• In this topic, we will go over an open-ended real world problem. Here is the outline:

Case Study: Build your own Speech Recognition System

Goal: As a CEO of a new startup, you want to build an embedded smart chip so that lamp makers could purchase them and make their products activated by voice. For example: “Hey lamp, turn on” or “Hey lamp, turn off”.

Question: What would you do first to build a speech recognition system capable of recognizing these trigger sentences?

Answer:

  • 1) Conduct an in-depth literature review, read posts and search for open-source implementations. Don’t read papers sequentially. First skim them and read in detail just a subset of them.
  • 2) Get pointers from experts. Clarify doubts of the most relevant papers by emailing authors.
  • 3) Collect data and label it. For trigger word detection, you can label with 1 the end of the trigger word and with 0s the rest.

Question: After training your model, you obtain a 99.5% accuracy in the dev set. While conducting error analysis, you realize your model is always predicting 0s. What would you do next?

Answer:

  • 1) In order to balance the amount of 1s with respect 0s, label the last 0.5s of the trigger word as 1s.
  • 2) Formulate a metric that awards more predicting positive examples correctly than negative examples.

Question: More generally, suppose you have binary classification problem where 99% of the data has positive labels. In this case your model will tend to predict everything as positive since in this way the model will easily reach a high accuracy of 99%. However, we are not looking for a naive model like this. How you will deal with this problem?

Answer:

  • 1) Change the way you evaluate your algorithm. One way is to change your dev set into a more balanced one. Option two is to change the metric to be a more balanced one, e.g., use recall score to replace accuracy score.

Question: We find that the model works well on the training set, but it doesn’t generalize well on dev set. What would you do next?

Answer:

  • 1) Data augmentation. For audio data, you may consider add background noise into your clean data. This could be done easily by recording some random noise in a cafe or bus stop.

Question: What if your model still doesn’t perform well on dev set even with data augmentation. And it might even takes quite a long time to train your model.

Answer:

  • 1) Early stopping would be an option. It works pretty well in real word applications.