Terminology

Phoneme

  • A phoneme is “the smallest unit that distinguishes meaning between sounds in a given language.” What does that mean? Let’s look at a word using the IPA (International Phonetic Alphabet), a transcription system created by the International Phonetic Association.
  • Let’s look at the word puff. We use broad transcription when describing phonemes. When we are using broad transcription we use slashes (/ /). So the word puff in broad transcription is:
\[/pʌf/\]
  • Here we see that puff has three phonemes /p/, /ʌ/, and /f/. When we store the pronunciation of the word puff in our head, this is how we remember it. What happens if we change one phoneme in the word puff? If we change the phoneme (not the letters) /f/ to the phoneme /k/ we get another word. We get the word puck which looks like this in broad transcription:
\[/pʌk/\]
  • This is a type of test that we can do to see if /f/ and /k/ are different phonemes. If we swap these two phonemes we get a new word so we can say that in English /f/ and /k/ are different phonemes. We’re going to discuss phones now, but keep this in the back of your head, because we are going to come back to it.

Phone

  • Now that we’ve covered what a phoneme is, we can discuss phones. Remember that we defined a phoneme as “the smallest unit that distinguishes meaning between sounds in a given language.” However, a phoneme is really the mental representation of a sound, not the sound itself. The phoneme is the part that is stored in your brain. When you actually produce a sound you are producing a phone. To give an example, let’s say you want to say the word for a small four-legged animal that meows, a cat. Your brain searches for the word in your lexicon to see if you know the word. You find the lexical entry. You see that the phonemic representation of the word is /kæt/. Then you use your vocal tract to produce the sounds [k], [æ], and [t] and you get the word [kæt].
  • Phones, the actual sound part that you can hear, are marked with brackets ([]) and the phonemes, the mental representation of the sound, are marked with slashes (/ /).

Why do we need phonemes and phones?

  • Recap: Phonemes are the mental representation of how a word sounds and phones are the actual sounds themselves. If we take an example from above, the word puff, we can write out the phonemic representation (with phonemes, using slashes) and the phonetic representation (with phones, using brackets).
\[/pʌf/\]
\[[pʌf]\]
  • What does the above show us? Not a whole lot really. So why do we need two different versions? Recall that the transcription that uses phonemes is called broad transcription while the transcription that uses phones is called narrow transcription. These names can give us a clue about the differences.

  • By looking at the broad transcription, /pʌf/, we can know how to pronounce the word puff. I can pronounce the word, you can pronounce the word, and a non-native English speaker can all pronounce the word. We should all be able to understand what we are saying. However, what if we wanted more information about how the word actually sounds? Narrow transcription can help us with that.

  • Narrow transcription just gives us extra information about how a word sounds. So the word puff can be written like this in narrow transcription:

\[[pʰʌf]\]
  • Well, that’s new. This narrow transcription of the word puff gives us a little more information about how the word sounds. Here, we see that the [p] is aspirated. This means that when pronouncing the sound [p], we have an extra puff of air that comes out. We notate this by using the superscript ʰ. So you are probably asking yourself, why don’t we just put the ʰ in the broad transcription? Remember that broad transcription uses phonemes and by definition, if we change a phoneme in a word, we will get a different word. Look at the following:
\[/pʌf/\]
\[*/pʰʌf/\]
  • An asterisk denotes that the above is incorrect. But why? Because in English, an aspirated p and an unaspirated p don’t change the meaning of a word. That is, you can pronounce the same sound two different ways, but it wouldn’t change the meaning. And by definition, if we change a phoneme, we change the meaning of a word. That means there’s only one /p/ phoneme in English. If we were speaking a language where aspiration does change the meaning of a word, then that language could have two phonemes, /p/ and /pʰ/. Since it doesn’t change the meaning in English, we just mark it in narrow transcription.

Allophones

  • We can pronounce the /p/ phoneme in at least two different ways: [p] and [pʰ]. This means that [p] and [pʰ] are allophones of the phoneme /p/. The prefix allo- comes from the Greek állos meaning “other,” so you can think of allophones as just “another way to pronounce a phoneme.”

  • This really helps us when we talk about different accents. Take the word water for example. I’m American, so the phonemic representation that I have for the word water is:

\[/wɑtəɹ/\]
  • But if you’ve ever heard an American pronounce the word water before, you know that many Americans don’t pronounce a [t] sound. Instead, most Americans will pronounce that [t] similar to a [d] sound. It’s not pronounced the same way as a [d] sound is pronounced, though. It’s actually a “flap,” written like this:
\[[ɾ]\]
  • So the actual phonetic representation of the word water for many Americans is:
\[[wɑɾɚ]\]
  • So, in the phonemic representation (broad transcription), we have the /t/ phoneme, but most Americans will produce a [ɾ] here. That means we can say that in American English [ɾ] is an allophone of the phoneme /t/.

  • But sometimes the /t/ phoneme does use a [t] sound like in the name Todd:

\[/tɑd/\]
  • and in narrow transcription:
\[[tʰɑd]\]
  • With these examples, we can see that the phoneme /t/ has at least two allophones: [tʰ] and [ɾ]. We can even look at the word putt and see that the [t] can be pronounced as a “regular” [t] sound:
\[[pʌt]\]
  • Fantastic! So now we know that the phoneme /t/ has at least three allophones, [t], [tʰ], and [ɾ]. But how do we know when to say each one?

Environments of phonemes

  • When we talk about the environments of a phoneme we are talking about where the phoneme occurs, usually in relation to other phonemes in a word. We can use this information to predict how a phoneme will be pronounced. Take for example the name Todd:
\[/tɑd/\]
  • If this was a word that we had never heard before, how would we know how to pronounce the /t/ phoneme? Well, we can already narrow it down to [t], [tʰ], or [ɾ] because we’ve seen in past examples that we can pronounce these when we have /t/ phoneme. But how do we know which phone is the correct one?

  • If you combed through many many words in English, you would find out that the phoneme /t/ is often aspirated when it’s at the beginning of a word. By looking at other words that start with /t/ like tap, take, tack, etc. you’ll find that /t/ becomes [tʰ] when at the beginning of a word and that the narrow transcription of the name Todd would be:

\[[tʰɑd]\]
  • We can use the same process to find out how to pronounce other words in an American accent. If we look at the words eating, little, latter, etc., we can see that in American English the /t/ in each of these words is pronounced as [ɾ]. Working out the full set of conditions for realizing /t/ as [ɾ] is beyond the scope of this post, but you can see that in similar environments /t/ becomes [ɾ].

Graphemes

  • In linguistics, a grapheme is the smallest functional unit of a writing system.
  • The name grapheme is given to the letter or combination of letters that represents a phoneme. For example, the word ‘ghost’ contains five letters and four graphemes (‘gh,’ ‘o,’ ‘s,’ and ‘t’), representing four phonemes.

Key takeaways

  • A phoneme is a mental representation of a sound, not necessarily a letter. Also, when we swap a phoneme we change the word.
  • A phone is the phonetic representation of a phoneme (the actual sound).
  • Allophones are different ways to pronounce the same phoneme while keeping the same meaning.
  • Sometimes allophones are predictable depending on their environment and who is speaking.

Oscillogram

  • An oscillogram (also called a “time-domain waveform” or simply a “waveform” in the context of speech) is a plot of amplitude vs. time. It is a record produced by an oscillograph or oscilloscope.
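  • As a quick illustration (a minimal sketch, not from the original post), here is how one might plot an oscillogram with NumPy and Matplotlib; the synthetic signal and parameter values are just placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 1-second "speech-like" signal at 16 kHz (placeholder for real audio).
sr = 16000                                   # sample rate in Hz
t = np.arange(sr) / sr                       # time axis in seconds
x = 0.5 * np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)  # decaying 220 Hz tone

# An oscillogram is simply amplitude plotted against time.
plt.plot(t, x)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Oscillogram (time-domain waveform)")
plt.show()
```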

Spectrum

  • Taking the Fourier transform of a slice of a speech signal yields the spectrum/spectral vector for that slice. A sequence of these spectral vectors yields the plot of frequency vs. time as shown below.
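  • For concreteness, here is a minimal sketch of computing the spectrum of one short slice of a signal (assuming a NumPy array x of samples and a sample rate sr, as in the oscillogram example above; the 25 ms slice length is an illustrative choice):

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 220 * t)        # placeholder signal

# Take a 25 ms slice, apply a window, and compute its magnitude spectrum.
frame_len = int(0.025 * sr)                  # 400 samples at 16 kHz
frame = x[:frame_len] * np.hanning(frame_len)
spectrum = np.abs(np.fft.rfft(frame))        # amplitude vs. frequency for this slice
freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
print(freqs[np.argmax(spectrum)])            # dominant frequency (~220 Hz, within the 40 Hz bin resolution)
```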

Spectrogram

  • Frequency vs. time representation of a speech signal is referred to as a spectrogram.
  • To obtain a spectrogram, first obtain the spectrum:

  • Rotate it by 90 degrees:

  • Color-code the amplitude:

  • Horizontally tiling the color-coded spectrums yields a spectrogram:
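  • The tiling procedure above can be sketched in a few lines of NumPy (a minimal STFT-style sketch; the window/hop sizes are typical but arbitrary choices, and a real implementation would normally use a library routine):

```python
import numpy as np
import matplotlib.pyplot as plt

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * (200 + 400 * t) * t)     # placeholder chirp signal

win, hop = int(0.025 * sr), int(0.010 * sr)     # 25 ms window, 10 ms hop
window = np.hanning(win)

# Slice the signal into overlapping frames, take the spectrum of each frame,
# and stack ("tile") the spectra side by side to form the spectrogram.
frames = [x[i:i + win] * window for i in range(0, len(x) - win, hop)]
spec = np.stack([np.abs(np.fft.rfft(f)) for f in frames], axis=1)
log_spec = 20 * np.log10(spec + 1e-10)          # dB scale for color-coding

plt.imshow(log_spec, origin="lower", aspect="auto",
           extent=[0, len(x) / sr, 0, sr / 2])
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram")
plt.show()
```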

Why spectrograms?

  • Dark regions indicate peaks (formants) in the spectrum:

  • Phones and their properties are better observed in a spectrogram:

  • Sounds can be identified much better by their formants and formant transitions. Hidden Markov Models implicitly model these spectrograms to perform speech recognition.

  • Key takeaways
    • Time-Frequency representation of the speech signal.
    • Spectrogram is a tool to study speech sounds (phones).
    • Phones and their properties are visually studied by phoneticians.
    • Hidden Markov Models implicitly model spectrograms for speech to text systems.
    • Useful for evaluation of text-to-speech systems: a high-quality text-to-speech system should produce synthesized speech whose spectrograms nearly match those of natural sentences.

Mel-Filterbanks and MFCCs

  • Empirical studies have shown that the human auditory system resolves frequencies non-linearly, and the non-linear resolution can be approximated using the Mel-scale which is given by

    \[M(f)=1127.01048 \cdot \log _{e}\left(1+\frac{f}{700}\right)\]
    • where \(f\) is the frequency in Hz (Volkman, Stevens, & Newman, 1937).
  • This indicates that the human auditory system is more sensitive to frequency difference in lower frequency band than in higher frequency band.
  • The figure below illustrates the process of extracting Mel-frequency cepstrum coefficients (MFCCs) with triangular filters that are equally-spaced in Mel-scale. In the linear scale, note that as the frequency increases, the width of the filters increases.

  • The input speech is transformed using the Discrete Fourier Transform, and the filter-bank energies (also called the Mel filter-bank energies or Mel spectrogram) are computed using the triangular filters mentioned above. The log-values of the filter-bank energies (also called the log-Mel filterbank energies or log-Mel spectrogram) are then decorrelated using the discrete cosine transform (DCT). Finally, M-dimensional MFCCs are extracted by keeping the first M DCT coefficients (see the sketch after this list).
  • However, deep learning models are able to exploit spectro-temporal correlations, making the log-Mel spectrogram an equivalent or better choice than MFCCs in terms of ASR and KWS performance. As a result, a good number of deep KWS works consider log-Mel or Mel filterbank speech features with temporal context.
  • Research has reported that using MFCCs is beneficial for both speaker and speech recognition, and MFCCs are still commonly fed as input speech features to neural nets. However, newer neural-net architectures typically rely on the log-Mel filterbank energies (LMFBEs), which can be used to generate MFCCs after DCT-based decorrelation, as indicated above.
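  • A minimal NumPy sketch of this pipeline (Mel scale → triangular filters → log-Mel energies → DCT → MFCCs). The filter construction follows the standard textbook recipe; the exact parameter values (number of filters, FFT size, etc.) are illustrative assumptions, not from the original post:

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: M(f) = 1127.01048 * ln(1 + f / 700)
    return 1127.01048 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.01048) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    # Triangular filters whose center frequencies are equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fbank

def mfcc(power_spectrum, n_mfcc=13):
    # power_spectrum: shape (n_fft // 2 + 1,) for one frame.
    fbank = mel_filterbank()
    mel_energies = fbank @ power_spectrum                 # Mel filter-bank energies
    log_mel = np.log(mel_energies + 1e-10)                # log-Mel filter-bank energies
    # DCT-II to decorrelate; keep the first n_mfcc coefficients as the MFCCs.
    n = len(log_mel)
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_mfcc)[:, None])
    return basis @ log_mel

# Example: one frame of a placeholder signal.
frame = np.hanning(512) * np.random.randn(512)
ps = np.abs(np.fft.rfft(frame)) ** 2
print(mfcc(ps).shape)   # (13,)
```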

Perceptual Linear Prediction

  • Apart from MFCCs, perceptual linear prediction (PLP) analysis offers another set of features that are fed into neural nets as input speech features. PLP is a combination of spectral analysis and linear prediction analysis. It uses concepts from the psychophysics of hearing to compute a simple auditory spectrum.
  • Read more in H. Hermansky (1990).

Prosodic Features

  • Prosodic features include pitch and its dynamic variations, inter-pause statistics, phone duration, etc. (Shriberg, 2007). Prosodic features are often extracted with a larger frame size than acoustic features since prosodic phenomena span longer speech segments, such as syllables. The pitch and energy contours change slowly compared to the spectrum, which implies that their variation can be captured over a long speech segment. Many studies have reported that prosodic features usually do not outperform acoustic features, but incorporating prosodic features in addition to acoustic features can improve speaker recognition performance (Shriberg, 2007; Sönmez, Shriberg, Heck, & Weintraub, 1998; Peskin et al., 2003; Campbell, Reynolds, & Dunn, 2003; Reynolds et al., 2003).

Idiolectal Features

  • Idiolectal features are motivated by the fact that people often use idiolectal information to recognize speakers. On a telephone-conversation corpus, Doddington (2001) reported that enrolled speakers can be verified not by using acoustic features extracted from the speech signal but by using idiolectal features observed in the true underlying transcription of the speech. Phonetic speaker verification, motivated by Doddington (2001)’s work, models a speaker using his/her phone n-gram probabilities obtained using multiple-language speech recognizers (Andrews, Kohler, & Campbell, 2001).

Example spectrogram and oscillogram

  • This is an oscillogram and spectrogram of the boatwhistle call of the toadfish Sanopus astrifer. The upper, blue-colored plot is an oscillogram presenting the waveform and amplitude of the sound over time; the X-axis is time (sec) and the Y-axis is amplitude. The lower figure is a plot of the sound’s frequency over time; the X-axis is time (sec) and the Y-axis is frequency (kHz). The amount of energy present at each frequency is represented by the intensity of the color: the brighter the color, the more energy is present in the sound at that frequency.

Speech Processing

Architectural overview

  • Note that this is an oversimplification of a complex system. Usually, the input data (pre-processed into speech features such as Mel filterbanks or MFCCs) is fed to the keyword spotting module, which in turn gates two modules: (i) the speaker recognition module, and (ii) the automatic speech recognition (ASR) module. The encoder part of the ASR module can run in parallel with a language identification (LID) module, which selects the language model corresponding to the detected language and feeds the ASR decoder. The ASR output then feeds the natural language understanding (NLU) block, which performs intent classification and slot filling. Dialog acts are generated next. Finally, the text-to-speech synthesis (TTS) block yields the end result: a response from the voice assistant. Note that the detected language also helps select which NLU and TTS blocks to use.
  • Also note that the first phase of the above flow, keyword spotting, can be gated by a voice activity detector (VAD) which is usually a simple neural network with a couple of layers whose job is to just figure out if there’s speech or not in the audio it is listening to. This is usually an always-on on-device block. This helps save power by not having a more complex model like the keyword spotter run as an always-on system.
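  • A highly simplified control-flow sketch of this gating (Python pseudocode; the function names such as vad, keyword_spotter, asr_encoder, etc. are hypothetical placeholders, not an actual API):

```python
def process_audio_frame(frame, models):
    """Simplified voice-assistant pipeline: each stage gates the next."""
    # Always-on, tiny model: is there speech at all?
    if not models.vad(frame):
        return None
    # Keyword spotting gates the heavier modules.
    if not models.keyword_spotter(frame):
        return None
    # Speaker recognition and ASR encoding run once the keyword fires.
    speaker = models.speaker_recognizer(frame)
    encoded = models.asr_encoder(frame)
    language = models.language_id(encoded)            # LID runs alongside the encoder
    text = models.asr_decoder(encoded, models.lm[language])
    intent, slots = models.nlu[language](text)        # intent classification + slot filling
    response_text = models.dialog_manager(intent, slots, speaker)
    return models.tts[language](response_text)        # synthesized audio response
```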

Fundamental speech tasks

  • There are four fundamental speech tasks, as applied to digital voice assistants:
    • Wake word detection/Keyword Spotting (KWS): On the device, detect the wakeword/trigger keyword to get the device’s attention;
    • Automatic speech recognition (ASR): Upon detecting the wake word, convert audio streamed to the cloud into words;
    • Natural-language understanding (NLU): Extract the meaning of the recognized words so that the assistant can take the appropriate action in response to the customer’s request; and
    • Text-to-speech synthesis (TTS): Convert the assistant’s textual response to the customer’s request into spoken audio.

Automatic Speech Recognition

What is automatic speech recognition?

  • Research in ASR (Automatic Speech Recognition) aims to enable computers to “understand” human speech and convert it into text. ASR is the next frontier in intelligent human-machine interaction and also a precondition for perfecting machine translation and natural language understanding. Research into ASR can be traced back to the 1950s, with early isolated-word speech recognition systems. Since then, with the persistent efforts of numerous scholars, ASR has made significant progress and can now power large-vocabulary continuous speech recognition systems.
  • Especially in the emerging era of big data and the application of deep neural networks, ASR systems have achieved notable performance improvements. ASR technology has also been gradually gaining practical use, becoming more product-oriented. Smart speech recognition software and applications based on ASR are increasingly entering our daily lives, in the form of voice input methods, intelligent voice assistants, and interactive voice recognition systems for vehicles.

Framework of an ASR System

  • The purpose of ASR is to map input waveform sequences to their corresponding word or character sequences. Therefore, implementing ASR can be considered a channel decoding or pattern classification problem. Statistical modeling is a core ASR method, in which, for a given speech waveform sequence \(O\), we can use a “maximum a posteriori” (MAP) estimator, based on the mode of a posterior Bayesian distribution, to estimate the most likely output sequence \(W^{*}\), as shown below.
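  • In symbols, the MAP decision rule (using only the quantities defined next) can be written as:
\[W^{*}=\underset{W}{\arg\max}\, P(W \mid O)=\underset{W}{\arg\max}\, \frac{P(O \mid W)\,P(W)}{P(O)}=\underset{W}{\arg\max}\, P(O \mid W)\,P(W)\]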

    • where \(P(O \mid W)\) is the probability of generating the observation sequence \(O\) given the word sequence \(W\); it corresponds to the acoustic model (AM) of the ASR system. \(P(W)\) is the a priori probability of the word sequence \(W\) occurring; it is called the language model (LM).
  • The figure below shows the structure diagram of a typical ASR system, which mainly comprises a front-end processing module, an acoustic model, a language model, and a decoder. The decoding process primarily uses the trained acoustic model and language model to obtain the optimal output sequence.

Acoustic Model (Encoder)

  • An acoustic model’s task is to compute \(P(O \mid W)\), i.e., the probability of generating the speech observation sequence given the word sequence. The acoustic model is an important part of the ASR system: it accounts for a large part of the computational overhead and largely determines the system’s performance. GMM-HMM-based acoustic models are widely used in traditional speech recognition systems.
  • In this model, GMM is used to model the distribution of the acoustic characteristics of speech and HMM is used to model the time sequence of speech signals. Since the rise of deep learning in 2006, deep neural networks (DNNs) have been applied in speech acoustic models. In Mohamed et al. (2009), Hinton and his students used feedforward fully-connected deep neural networks in speech recognition acoustic modeling.

Encoder-Decoder Architectures: Past vs. Present

  • Traditional speech recognition systems adopt a GMM-HMM architecture where the GMM is the acoustic model that computes the state/observation probabilities and the HMM decoder combines these probabilities using dynamic programming (DP). With the advent of deep learning, the role played by the GMM as the acoustic model is now replaced by a DNN. Rather than the GMM modeling the observation probabilities, a DNN is trained to output them. The HMM still acts as a decoder. The figure below illustrates both approaches side-by-side.

  • The DNN computes the observation probabilities and outputs a probability distribution over as many classes as there are HMM states for each speech frame using a softmax layer. The number of HMM states depends on the number of phones, with a typical setup of 3 target labels for each phone (for the beginning, middle, and end of the segment), 1 state for silence, and 1 state for background. The DNN is typically trained to minimize the average cross-entropy loss over all frames between the predicted and the ground-truth distributions. The HMM decoder computes the word detection score using the observation, state transition, and prior probabilities. An architectural overview of a DNN-HMM is shown in the diagram below.

  • Compared to traditional GMM-HMM acoustic models, DNN-HMM-based acoustic models perform better on benchmarks such as the TIMIT database. Compared with a GMM, a DNN is advantageous in the following ways:
    • No assumption about the distribution of the input features is required when a DNN models the posterior probabilities of the acoustic characteristics of speech.
    • GMMs require decorrelated input features, whereas DNNs can use various forms of input features.
    • GMMs can only use single frames of speech as inputs, whereas DNNs can capture useful context information by splicing adjacent frames.
  • Once the HMM is built, the maximum a posteriori estimate of the most likely sequence of hidden states, i.e., the Viterbi path, can be found with the Viterbi algorithm, a graph-traversal algorithm typically implemented using dynamic programming (a minimal sketch is given after this list).
  • Given that speech comprises sequential data, RNNs (and their variants, GRUs and LSTMs) are natural choices for the DNN. However, for tasks such as speech recognition, where the alignment between the inputs and the labels is unknown, RNNs have so far been limited to an auxiliary role. The problem is that the standard training methods require a separate target for every input, which is usually not available. The traditional solution — the so-called hybrid approach — is to use Hidden Markov Models (HMMs) to generate targets for the RNN, then invert the RNN outputs to provide observation probabilities (Bourlard and Morgan, 1994). However, the hybrid approach does not exploit the full potential of RNNs for sequence processing, and it also leads to an awkward combination of discriminative and generative training.
  • The connectionist temporal classification (CTC) output layer (Graves et al., 2006) removes the need for HMMs for providing alignment altogether by directly training RNNs to label sequences with unknown alignments, using a single discriminative loss function. CTC can also be combined with probabilistic language models for word-level speech and handwriting recognition. Note that the HMM-based decoder is still a good idea for cases where the task at hand involves a limited vocabulary and as a result, a smaller set of pronunciations, such as keyword spotting (vs. speech recognition).
  • Using the CTC loss enables all-neural encoder-decoder (seq2seq) architectures in which the encoder emits phonemes as an intermediate step; these are consumed by a decoder that uses a language model and a pronunciation model to generate transcriptions.
  • Recent models in this area such as Listen Attend Spell (LAS) forgo the intermediate phoneme labels altogether and train an end-to-end architecture that directly emits transcriptions at its output.
  • For more in this area, Alex Graves’s book on Supervised Sequence Labelling with Recurrent Neural Networks is a great reference.
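  • As referenced above, here is a minimal sketch of the Viterbi algorithm over an HMM (log-domain dynamic programming; the toy transition/emission matrices are placeholders standing in for real DNN posteriors):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence for an HMM.

    log_init:  (S,)    log prior over states
    log_trans: (S, S)  log transition probabilities, log_trans[i, j] = log P(j | i)
    log_emit:  (T, S)  log observation probabilities per frame (e.g., from a DNN)
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]              # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)          # backpointers for path recovery
    for t in range(1, T):
        cand = score[:, None] + log_trans       # cand[i, j]: come from state i, go to state j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_emit[t]
    # Trace back the best path.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example with 3 left-to-right states and 5 frames of log-scaled posteriors.
rng = np.random.default_rng(0)
log_emit = np.log(rng.dirichlet(np.ones(3), size=5))
log_trans = np.log(np.array([[0.8, 0.2, 0.0],
                             [0.0, 0.8, 0.2],
                             [0.0, 0.0, 1.0]]) + 1e-10)
log_init = np.log(np.array([1.0, 0.0, 0.0]) + 1e-10)
print(viterbi(log_init, log_trans, log_emit))
```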

Keyword spotting

  • To achieve always-on keyword spotting, running a complex neural network all the time is a recipe for battery drain.
  • The trick is to use a two-pass system by breaking the problem into two pieces: (i) a low-compute pass (running on an always-on processor) and (ii) a more accurate pass (running on the main processor), as shown below:

  • The first, low-compute piece runs on a special always-on processor (AOP), so named because it does not enter sleep states (unlike the main processor) and is always on. The trade-off is that it is much less powerful than the main processor, so you can only do a small number of computations if you don’t want to impact battery life.
  • The second more accurate piece running on the main processor consumes high power but only uses that power for a very limited amount of time when you actually want it to compute something, i.e., when it is not sleeping. The rest of the time it goes to several levels of sleep and most of the time the main processor is completely asleep.
  • Thus, we have a two-layer system similar to a processor cache hierarchy. The first level runs all the time with a very simple DNN, and if it is confident in its predictions, it does not wake up the main system. However, if it is unsure of its predictions, it wakes up the main processor, which runs a bigger, fancier DNN that is more accurate.
  • Note that both passes share the same set of computed speech features which are typically generated on the AOP.
  • The level-1/first-stage thresholds are tuned to trade off a higher false-accept rate for a lower false-reject rate, so that we do not miss an event where the user intended to interact with the device simply because the second stage was never woken up (negatives/rejects do not wake up stage 2). Since the level-2/second-stage model is a beefier model, it can mitigate false accepts easily. It is also important that the first-stage precision not be so low that the level-2/second-stage model (also known as the “checker” model) wakes up frequently and eventually drains the battery. In short, the level-1 stage needs high recall with modest precision to ensure that the voice assistant doesn’t miss a true positive, while the level-2 stage has high precision with modest recall to ensure a smooth user experience by minimizing false positives (and avoiding frequent wake-ups in the event of a false positive). A minimal sketch of this two-pass cascade is given below.
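  • A minimal sketch of the two-pass cascade logic referenced above (the model objects and threshold values are hypothetical placeholders, not an actual API):

```python
def two_pass_keyword_spotting(features, stage1, stage2, t1=0.4, t2=0.9):
    """Two-pass KWS: a tiny always-on model gates a larger checker model.

    stage1: small DNN running on the always-on processor (high recall, modest precision).
    stage2: larger "checker" DNN on the main processor (high precision, modest recall).
    t1, t2: detection thresholds; t1 is deliberately permissive so stage 1 rarely
            misses a true keyword, at the cost of more false accepts for stage 2 to filter.
    """
    score1 = stage1(features)          # cheap, always-on scoring
    if score1 < t1:
        return False                   # reject: the main processor stays asleep
    # Only now do we wake the main processor and run the expensive model
    # on the same buffered speech features.
    score2 = stage2(features)
    return score2 >= t2
```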

Handling dynamic language switching and code switching

  • In order to support such dynamic language switching from one interaction to the next (which is different from code switching, where words from more than one language are used within the same interaction or utterance), the backend intelligence must perform language identification (LID) along with automatic speech recognition (ASR) so that the detected language can be used to trigger appropriate downstream systems such as natural language understanding, text-to-speech synthesis, etc., which usually are language specific.
  • To enable dynamic language switching between two or more pre-specified languages, a commonly adopted strategy is to run several monolingual ASR systems in parallel along with a standalone acoustic LID module, and pass only one of the recognized transcripts downstream depending on the outcome of language detection. While this approach works well in practice, it is neither cost-effective for more than two languages, nor suitable for on-device scenarios where compute resources and memory are limited. To this end, Joint ASR and Language Identification Using RNN-T: An Efficient Approach to Dynamic Language Switching (2021) proposed all-neural architectures that can jointly perform ASR and LID for a group of pre-specified languages and thereby significantly improve the cost effectiveness and efficiency of dynamic language switching. Joint ASR-LID modeling, by definition, involves multilingual speech recognition. Multilingual ASR models are trained by pooling data from the languages of interest, and it is often observed that languages with smaller amounts of data (i.e., low-resource languages) benefit more from this.
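  • A minimal sketch of the “parallel monolingual ASR + standalone LID” strategy described above (the recognizer and LID callables are hypothetical placeholders; the joint RNN-T approach in the cited paper replaces this with a single model):

```python
def transcribe_with_language_switching(audio_features, asr_systems, lid_model):
    """Run one monolingual ASR per supported language and keep only the
    transcript of the language detected by the standalone LID module.

    asr_systems: dict mapping language code -> monolingual ASR callable
    lid_model:   callable returning a score per language code
    """
    # In production these would run in parallel; shown sequentially for clarity.
    transcripts = {lang: asr(audio_features) for lang, asr in asr_systems.items()}
    lid_scores = lid_model(audio_features)        # e.g., {"en-US": 0.9, "hi-IN": 0.1}
    detected = max(lid_scores, key=lid_scores.get)
    # Only the transcript of the detected language is passed downstream
    # (to language-specific NLU, TTS, etc.).
    return detected, transcripts[detected]
```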

Online training with live data to improve robustness

  • To improve the robustness of a speech model such as ASR or keyword spotting, we can train it on data gathered from the field. To do so, we run the classifier with a very low threshold; for instance, in the case of keyword spotting, if the usual detection threshold is 0.8, we run it at a threshold of 0.5. This leads to the keyword spotter firing on a lot of audio segments, most of them false positives and the rest true positives. This is because, for a classifier, the lower the classification threshold, the more false positives; the higher the classification threshold, the more false negatives.
  • Next, we manually sift through the results and identify the true positives and use them for re-training the classifier. The false positives, in turn, are used as hard negatives to also re-train the classifier and thus, make it more robust to erroneous scenarios involving false accepts.
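  • A minimal sketch of this low-threshold mining loop (the scoring function, the human-labeling callable, and the threshold value are placeholders):

```python
def mine_training_data(audio_clips, keyword_spotter, human_label, mining_threshold=0.5):
    """Run the spotter at a deliberately low threshold so it over-triggers,
    then split the firings into true positives and hard negatives."""
    candidates = [clip for clip in audio_clips
                  if keyword_spotter(clip) >= mining_threshold]
    true_positives, hard_negatives = [], []
    for clip in candidates:
        # human_label() stands in for the manual sifting step described above.
        if human_label(clip) == "keyword":
            true_positives.append(clip)
        else:
            hard_negatives.append(clip)
    # Both sets are fed back into re-training: positives reinforce detections,
    # hard negatives teach the model to suppress its own false accepts.
    return true_positives, hard_negatives
```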

Speaker Recognition

  • Speaker recognition can be classified into either (1) speaker verification, or (2) speaker identification. Speaker verification aims to verify whether an input speech corresponds to the claimed identity. Speaker identification aims to identify an input speech by selecting one model from a set of enrolled speaker models. In some cases, speaker verification follows speaker identification in order to validate the identification result.
  • Speaker verification is gauged using the equal error rate (EER) metric on the DET curve, while speaker identification is gauged using accuracy (a minimal sketch of computing the EER from trial scores appears at the end of this section).
  • i-vectors by Dehak et al. in 2010 were the leading technology behind speaker recognition, up until DNNs took over with d-vectors by Variani et al. in 2014. The latest in speaker recognition is x-vectors by Snyder et al. in 2018 which proposed using data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The figure below shows the embedding extraction process with d-vectors:

  • Read more on speaker recognition in this book chapter by Jin and Yoo from KAIST.
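  • As referenced above, a minimal sketch of computing the equal error rate (EER) from verification trial scores (the toy scores and labels are placeholders):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-accept rate equals the
    false-reject rate, swept over all decision thresholds.

    scores: similarity scores for trials (higher = more likely the same speaker)
    labels: 1 for target (same-speaker) trials, 0 for impostor trials
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])   # false rejects
    idx = np.argmin(np.abs(far - frr))          # threshold where the two rates cross
    return (far[idx] + frr[idx]) / 2.0

# Toy example: target trials tend to score higher than impostor trials.
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   1,    0,   1,   0,   0,   0  ]
print(equal_error_rate(scores, labels))
```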

Data augmentation

Gain

  • Gain augmentation can be applied by multiplying the raw audio by a scalar.
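  • A minimal sketch (the gain range in dB is an arbitrary illustrative choice):

```python
import numpy as np

def apply_random_gain(waveform, min_gain_db=-6.0, max_gain_db=6.0):
    """Gain augmentation: scale the raw audio by a random scalar."""
    gain_db = np.random.uniform(min_gain_db, max_gain_db)
    gain = 10.0 ** (gain_db / 20.0)            # convert dB to a linear amplitude factor
    return waveform * gain

augmented = apply_random_gain(np.random.randn(16000))   # 1 s of placeholder audio
```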

SpecAugment

  • This paper by Park et al. (2019) from Google presents SpecAugment, a simple data augmentation method for speech recognition.
  • SpecAugment greatly improves the performance of ASR networks. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. They apply SpecAugment on Listen, Attend and Spell (LAS) networks for end-to-end speech recognition tasks.
  • They achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks on end-to-end LAS networks by augmenting the training set using simple handcrafted policies, surpassing the performance of hybrid systems even without the aid of a language model. SpecAugment converts ASR from an over-fitting to an under-fitting problem, and they are able to gain performance by using bigger networks and training longer. On LibriSpeech, they achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, they achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5’00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
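  • A minimal sketch of the frequency- and time-masking steps described above (time warping is omitted; the mask counts and widths are illustrative assumptions, not the paper’s exact settings):

```python
import numpy as np

def spec_augment(log_mel, n_freq_masks=2, freq_mask_width=8,
                 n_time_masks=2, time_mask_width=20):
    """Apply SpecAugment-style frequency and time masking to a log-Mel spectrogram.

    log_mel: array of shape (n_mel_channels, n_frames)
    """
    augmented = log_mel.copy()
    n_mels, n_frames = augmented.shape
    for _ in range(n_freq_masks):
        f = np.random.randint(0, freq_mask_width + 1)       # mask height in channels
        f0 = np.random.randint(0, max(n_mels - f, 1))
        augmented[f0:f0 + f, :] = 0.0                        # mask a band of frequency channels
    for _ in range(n_time_masks):
        t = np.random.randint(0, time_mask_width + 1)        # mask width in frames
        t0 = np.random.randint(0, max(n_frames - t, 1))
        augmented[:, t0:t0 + t] = 0.0                        # mask a block of time steps
    return augmented

augmented = spec_augment(np.random.randn(80, 300))   # placeholder 80 x 300 log-Mel spectrogram
```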

Types of noise

  • Reverberation sounds: Reverberation is produced when a sound source stops emitting within an enclosed space; the sound waves continue to reflect off the ceiling, walls, and floor surfaces until they eventually die out. These reflected sound waves are known as reverberation, and for augmentation they are commonly simulated by convolving clean speech with a room impulse response (RIR).
  • Babble noise: considered one of the most effective noises for masking speech. Refer to Babble Noise: Modeling, Analysis, and Applications.

References

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledSpeechProcessing,
  title   = {Speech Processing},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}