The pronunciation of a word consists of a series of phonemes, and the same word can be pronounced in different ways. The Pronunciation Model (also known as a Pronunciation Dictionary or Lexicon) lists all pronunciations of all words that the system will be able to recognize.
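As a rough illustration, a pronunciation dictionary can be pictured as a mapping from each word to one or more phoneme sequences. The sketch below is only an assumption about how such a lexicon might be stored: the name `LEXICON`, the helper `pronunciations`, and the ARPAbet-style phoneme symbols are chosen for the example and do not come from a specific system.

```python
# Hypothetical mini-lexicon: each word maps to a list of pronunciations,
# and each pronunciation is a list of phoneme symbols (ARPAbet-style, illustrative).
LEXICON = {
    "tomato": [
        ["T", "AH", "M", "EY", "T", "OW"],  # "tuh-MAY-toh"
        ["T", "AH", "M", "AA", "T", "OW"],  # "tuh-MAH-toh"
    ],
    "you": [["Y", "UW"]],
}


def pronunciations(word):
    """Return every listed pronunciation of a word (empty list if the word is unknown)."""
    return LEXICON.get(word.lower(), [])


for phonemes in pronunciations("Tomato"):
    print(" ".join(phonemes))
```

The same word appears once in the dictionary but carries two phoneme sequences, which is how alternative pronunciations are covered.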
To account for how fast a word is spoken, a mathematical model is used to handle the variable durations of phonemes. The Hidden Markov Model (HMM) is a probabilistic automaton that takes the temporal nature of the audio signal into account, in particular through transitions that loop back onto the same phoneme. The probability attached to each state comes from the phoneme recognition we have seen before.
The combination of the acoustic model and the pronunciation model is called the acoustic-phonetic model. It assigns an HMM to each word. During the training phase, the transition probabilities between states (here, phonemes) are computed and stored. During the decoding phase, these precomputed probabilities are used.
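To see why a transition onto the same phoneme models speaking speed, consider a single state with a self-loop probability. The sketch below assumes one frame of audio per time step and uses hypothetical function names; it shows that the number of frames spent on a phoneme follows a geometric distribution, so a higher self-loop probability corresponds to a longer (slower) phoneme.

```python
def duration_probability(self_loop, frames):
    """Probability of staying exactly `frames` frames in a state.

    The state is entered once, loops on itself (frames - 1) times,
    then leaves with probability (1 - self_loop).
    """
    if frames < 1:
        raise ValueError("a state lasts at least one frame")
    return self_loop ** (frames - 1) * (1.0 - self_loop)


def expected_duration(self_loop):
    """Mean number of frames spent in the state (mean of a geometric law)."""
    return 1.0 / (1.0 - self_loop)


# A phoneme with a 0.75 self-loop lasts 4 frames on average.
print(expected_duration(0.75))
print(duration_probability(0.5, 3))
```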
Figure 1: HMM of the word "Tomato"
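The idea of one HMM per word can be sketched as follows. This is a simplified illustration under several assumptions, not the implementation of any particular recognizer: every phoneme gets a single state, all states share the same self-loop probability, and the acoustic model is represented by a list of per-frame dictionaries giving a score for each phoneme. The names `build_word_hmm` and `word_likelihood` are hypothetical.

```python
def build_word_hmm(phonemes, self_loop=0.6):
    """Left-to-right transition matrix: each state repeats or moves to the next phoneme."""
    n = len(phonemes)
    trans = [[0.0] * n for _ in range(n)]
    for i in range(n):
        trans[i][i] = self_loop                # stay on the same phoneme
        if i + 1 < n:
            trans[i][i + 1] = 1.0 - self_loop  # move on to the next phoneme
    # The remaining mass of the last state is the implicit exit from the word.
    return trans


def word_likelihood(phonemes, frame_scores, self_loop=0.6):
    """Forward algorithm over the word HMM.

    frame_scores[t][p] is the (assumed) acoustic score of phoneme p at frame t.
    The path must start in the first phoneme and end in the last one.
    """
    trans = build_word_hmm(phonemes, self_loop)
    n = len(phonemes)
    alpha = [0.0] * n
    alpha[0] = frame_scores[0].get(phonemes[0], 0.0)
    for scores in frame_scores[1:]:
        new_alpha = [0.0] * n
        for j in range(n):
            incoming = alpha[j] * trans[j][j]
            if j > 0:
                incoming += alpha[j - 1] * trans[j - 1][j]
            new_alpha[j] = incoming * scores.get(phonemes[j], 0.0)
        alpha = new_alpha
    return alpha[-1]


# Three frames: "Y" held for two frames, then "UW".
frames = [{"Y": 1.0}, {"Y": 1.0}, {"UW": 1.0}]
print(word_likelihood(["Y", "UW"], frames, self_loop=0.5))
```

During decoding, a score of this kind would be computed for each candidate word, with the transition probabilities taken from training rather than fixed by hand as they are here.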
The advantage is that, thanks to this list of HMMs built from the pronunciation dictionary, the system can only output actual words rather than arbitrary phoneme sequences. However, there is a flaw: acronyms and proper names that do not belong to the pronunciation dictionary cannot be recognized.
Acoustic-phonetic decoding alone cannot recognize a sentence. At this stage, the system can still predict word sequences that are incorrect. For example, "you whereas he or but until however" is a possible output, even though the sentence does not make sense.
We have seen how to reduce the Phoneme Error Rate (PER) by using a pronunciation dictionary.
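The limitation can be made concrete with a small check for out-of-vocabulary words. The vocabulary and the function name below are hypothetical and only serve the illustration: any word absent from the dictionary has no HMM, so the decoder has nothing to match it against.

```python
# Hypothetical vocabulary: the set of words that have an entry in the lexicon.
VOCABULARY = {"you", "speak", "we", "write", "tomato"}


def out_of_vocabulary(words):
    """Return the words that have no entry in the pronunciation dictionary."""
    return [w for w in words if w.lower() not in VOCABULARY]


# A proper name and an acronym are flagged as unrecognizable.
print(out_of_vocabulary(["You", "speak", "Authôt", "HMM"]))
```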
In a future article, we will look at the language model, which allows us to add consistency to the predicted word sequences.
Authôt. You speak. We write