In our previous articles we saw how the system recognizes words and why pronunciation matters. We also concluded that, on its own, the system could output a sequence of words that do not form a coherent sentence.
In this article we will focus on the language model and describe how to force the detection of coherent sentences.
The language model, just like the acoustic model, is built on a statistical study. There are many methods, but here we will focus specifically on the n-gram model.
During the learning phase, large quantities of text are analyzed in order to estimate the conditional probability of each word. This means that for each word in a text, we estimate the probability that it appears given the previous n − 1 words.
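To make the counting concrete, here is a minimal Python sketch of the learning phase for an n-gram model. It rests on simplifying assumptions: a three-sentence toy corpus invented for illustration, whitespace tokenization, and plain relative frequencies with no smoothing. The function name train_ngram and the <s> and </s> boundary markers are our own choices for this example, not part of any particular toolkit.

```python
from collections import Counter

def train_ngram(sentences, n=2):
    """Estimate P(word | previous n-1 words) by relative frequency."""
    context_counts = Counter()  # how often each (n-1)-word context occurs
    ngram_counts = Counter()    # how often each context + word occurs
    for sentence in sentences:
        # Pad with hypothetical boundary markers so the first word has a context.
        tokens = ["<s>"] * (n - 1) + sentence.lower().split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    # Conditional probability = count(context + word) / count(context)
    return {ng: c / context_counts[ng[:-1]] for ng, c in ngram_counts.items()}

# Toy corpus, invented for illustration only.
corpus = [
    "i want to eat a tomato",
    "i want to eat a carrot",
    "i want to sleep",
]
probs = train_ngram(corpus, n=2)
print(probs[("to", "eat")])
```

In this toy corpus, "eat" follows "to" in two cases out of three, so the estimated probability is 2/3. A real system would use far more text, and some form of smoothing so that word sequences never seen during learning do not receive a probability of zero.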
This analysis, when carried out on a large amount of text, makes it possible to model the links between words. For example, it is more likely that a verb will be preceded by a subject, or that an adjective will be preceded or followed by a noun. These patterns occur more often in the texts used during the learning process, so the model gives them a higher probability.
During the decoding phase, we use the pre-calculated statistics to predict the next word from the previous words. For example, the sequence "I want to eat a tomato" is more likely than "I want to hit a tomato", and much more likely than "I want to eat a carpet". All these probabilities are modelled in the form of a graph called a Word Lattice.
Figure 1: Example of a Word Lattice
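A word lattice of this kind can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the edge list, the node numbering and all probabilities are made up, each edge carries a single combined probability, and nodes are assumed to be numbered in topological order. The function name best_path is ours. It searches for the most likely path through the graph, working with log probabilities to avoid multiplying many small numbers.

```python
import math

# Hypothetical lattice: (start_node, end_node, word, probability).
# Probabilities are invented; nodes are numbered in topological order.
EDGES = [
    (0, 1, "i", 1.0),
    (1, 2, "want", 1.0),
    (2, 3, "to", 1.0),
    (3, 4, "eat", 0.7),
    (3, 4, "hit", 0.3),
    (4, 5, "a", 1.0),
    (5, 6, "tomato", 0.9),
    (5, 6, "carpet", 0.1),
]

def best_path(edges, start, end):
    """Return (words, probability) of the most likely path from start to end."""
    best = {start: (0.0, [])}  # node -> (log probability, words so far)
    # Sorting by start node processes every edge into a node before
    # any edge leaving it, given the topological numbering.
    for src, dst, word, prob in sorted(edges):
        if src not in best or prob <= 0.0:
            continue  # unreachable node or impossible edge
        score = best[src][0] + math.log(prob)
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [word])
    log_prob, words = best[end]
    return words, math.exp(log_prob)

words, prob = best_path(EDGES, 0, 6)
print(" ".join(words), round(prob, 2))
```

With these invented values, the path "i want to eat a tomato" wins with a probability of about 0.63, ahead of "hit a tomato" (0.27) and "eat a carpet" (0.07). In an actual decoder the edge scores would combine the acoustic model and the language model rather than being set by hand.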
Thanks to the language model we are able to create probabilistic links between words, which allows us to obtain a more logical sequence of words. Its limitation is that speech contains syntax errors, hesitations and formulations specific to spoken language, whereas the model is learned from texts. We simply do not speak in the same way as we write. For example, it is common to say "Went to Barcelona for the weekend. Lots to tell you.", while it is more common to write "We went to Barcelona for the weekend. We have a lot of things to tell you.". These differences are more challenging to model.
Figure 2: Summary of the different models
This concludes our R&D series, which aimed to show you how the system is able to make the link between:
– an audio file containing speech,
– the text that was spoken.
In these three articles, we looked at phonemes, the importance of pronunciation, and finally how the language model allows the system to make word sequences coherent.
All these steps are essential to understanding how speech recognition technology works.
Authot. You speak. We write.