Speech Production Acquisition Chain
Following the series of articles published in summer 2016 presenting the fundamental components of the Speech Production Acquisition Chain, we will discuss in more detail how the system makes the link between:
– an audio file containing speech, and
– the text that was spoken.
For a better understanding, we invite you to read the previous articles on training and testing.
The acoustic model is first estimated during the training phase; it is then used during the decoding phase to transcribe the audio utterance into text.
During this training phase, we use large volumes of audio (several hundred hours) that have been transcribed beforehand. These data make it possible to link an acoustic realization to a phoneme. For each phoneme, a large number of acoustic realizations is studied: these realizations vary because of noise, reverberation, different speakers, different phonetic contexts (the preceding and following phonemes), etc.
Take, for example, the phoneme [æ] (as in cat). Analyzing the energy behaviour in the time-frequency domain of a very large number of occurrences of [æ], pronounced under different conditions, allows the creation of a "general" model of [æ] using a Gaussian Mixture Model (GMM).
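The idea of modeling a phoneme from many acoustic realizations with a GMM can be sketched as follows. This is a toy illustration with synthetic two-dimensional "acoustic" features; a real system would fit the mixture on MFCC vectors extracted from hundreds of hours of transcribed audio, and the cluster locations below are invented for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for acoustic feature vectors extracted from many
# recorded occurrences of the phoneme [ae]: three clusters of realizations
# (e.g. different speakers, noise conditions, phonetic contexts).
realizations = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2)),
    rng.normal(loc=[1.5, 0.5], scale=0.3, size=(200, 2)),
    rng.normal(loc=[0.5, 1.5], scale=0.3, size=(200, 2)),
])

# Fit a Gaussian Mixture Model: the "general" model of [ae].
gmm_ae = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm_ae.fit(realizations)

# The fitted model can score how well a new observation matches [ae]
# (log-likelihood of the observation under the mixture).
new_frame = np.array([[0.1, 0.1]])
print(gmm_ae.score_samples(new_frame))
```

Higher scores mean the observed frame is more plausibly a realization of [æ]; this is what the decoding phase relies on.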
Figure 1: Modeling phoneme [ae] using multiple occurrences of phoneme [ae].
As can be seen in Figure 1, the realizations of [æ] pronounced by different speakers differ slightly. This is due to variations in the "vowel space", which is specific to each speaker.
To make the best use of our general [æ] model, we have to adapt it to unknown speakers during decoding (the phase that automatically transcribes an audio file into text). Since there are many adaptation methods, we will only look at the basic principle.
The model of [æ], previously estimated during the training phase, undergoes a mathematical transformation of its parameters, such as translations and rotations, so that its parameter space comes as close as possible to that of the unknown speaker. Once this transformation is completed, our general model is specialized to better recognize the phonemes of the unknown speaker.
Figure 2: Adaptation of the general model [ae] to the speaker x
Once our acoustic model is adapted, it is ready to use.
Frame by frame, we analyze the behavior of the energy in the time-frequency domain of the audio file whose pronounced phonemes we want to determine. If observation n is closest to the model of phoneme [æ], then [æ] is the most likely pronounced phoneme.
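This "closest model wins" principle can be sketched as a likelihood comparison. The two single-Gaussian phoneme models below are toy stand-ins (real systems compare GMMs over MFCC features, and the means and covariances here are invented):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy single-Gaussian "models" for two phonemes in a 2-D feature space
# (illustrative parameters only).
phoneme_models = {
    "ae": multivariate_normal(mean=[0.0, 0.0], cov=0.2),
    "iy": multivariate_normal(mean=[2.0, 2.0], cov=0.2),
}

def most_likely_phoneme(observation):
    """Return the phoneme whose model gives the observation the highest
    log-likelihood, i.e. the model the observation is 'closest' to."""
    return max(phoneme_models, key=lambda p: phoneme_models[p].logpdf(observation))

# An observed frame near the [ae] model is labelled [ae].
print(most_likely_phoneme([0.1, -0.1]))  # -> ae
```

Running this over every frame of an audio file yields the sequence of most likely phonemes, which is exactly what the next section discusses.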
Figure 3: Principle of phoneme detection
We have seen how the system is able to recognize a phoneme. However, phoneme detection is not always correct.
Figure 4: Current phoneme error rate on the TIMIT database (read speech corpus)
At this stage, the system can predict phoneme sequences that are not words. For example, [tʃ] [ɡ] [ð] [eɪ] (ch_g_th_ay) is possible.
In a later article we will look at the pronunciation model, which constrains the detected phoneme sequences so that only actual words are recognized.
Authôt. You speak. We write