Automatic speech-to-text conversion typically follows four key steps: speech production, capture by a microphone, analog-to-digital conversion, and finally the application of an automatic speech recognition algorithm. Here, we discuss the process of converting voice waves into a digital format usable by computers.
As presented in a previous post, voice travels through the air in the form of acoustic waves, perceivable as pressure changes. By producing specific vibrations for each sound of a language, the voice can encode and transmit spoken messages to listeners. Computers have no ears, and they manipulate binary data made of zeros and ones. Hence, the challenge of speech digitization is first to capture the pressure variations created by the voice, and then to convert them into digital values with minimal loss of the original spoken information.
A microphone is an energy converter based on the coupling between a mechanical system and an electrical circuit. In practice, the microphone membrane vibrates with the variations of pressure, and this motion is imprinted on an analog electrical signal available at the output of the device.
Not all microphones capture sounds in the same manner and, depending on their design, they may have trouble vibrating at certain frequencies. The electrical circuit may also introduce noise into the recording. In all cases, the frequency response diagram provided by the manufacturer can help you check the performance of your microphone. For speech recording in particular, it should not significantly alter voice frequencies between 40 Hz and 8 kHz.
Some microphones are designed to capture sounds coming from a single direction, while others are sensitive in all directions. The directivity of your microphone must be considered, especially if you are recording in a slightly noisy environment. Indeed, microphones do not distinguish between voice, surrounding noises and their reverberations. So, to improve the quality of your recording, instead of speaking loudly, try to get closer to the microphone. That way, the input level of the voice will remain much higher than the level of any other captured noise. Take care, however: being closer to the microphone also increases the risk of saturating the output signal, causing an irreversible degradation of your recording.
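To illustrate the saturation risk mentioned above, here is a minimal Python sketch. It assumes a hypothetical input stage that can only represent values between -1.0 and 1.0 (the `full_scale` parameter) and uses a pure sine wave as a stand-in for voice. The function names are illustrative and do not come from any particular audio library.

```python
import math

def capture(amplitude, full_scale=1.0, n=1000, freq_hz=200.0, rate_hz=16000):
    """Simulate a sine 'voice' of the given peak amplitude hitting an
    input stage limited to [-full_scale, +full_scale] (assumed model)."""
    samples = []
    for i in range(n):
        x = amplitude * math.sin(2 * math.pi * freq_hz * i / rate_hz)
        # Anything beyond full scale is flattened: this is saturation.
        samples.append(max(-full_scale, min(full_scale, x)))
    return samples

def clipped_fraction(samples, full_scale=1.0):
    """Fraction of samples stuck at the limit of the input range."""
    hits = sum(1 for s in samples if abs(s) >= full_scale)
    return hits / len(samples)

quiet = capture(amplitude=0.5)   # comfortable level, no saturation
loud = capture(amplitude=2.0)    # too close or too loud: peaks are cut off

print(clipped_fraction(quiet))
print(clipped_fraction(loud))
```

In this toy model, the flattened peaks of the loud capture cannot be recovered afterwards, which is why saturation is best avoided at recording time rather than corrected later.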
Directivity diagram of a cardioid microphone
The electrical signal provided by the microphone is not yet usable by the computer and requires an analog-to-digital conversion.
The analog-to-digital converter is an electronic device observing a variable analog electrical signal at its input, and producing a digital representation of the variation of the signal at its outputs. This conversion is usually done by the sound card of a computer and the digital data are stored as a file on a hard drive.
In a first step, a track-and-hold circuit follows the amplitude of the input signal and, at constant time intervals, holds the signal at its current value.
This process, called sampling, amounts to picking values of the electrical input signal provided by the microphone at regular time intervals. For speech signals, frequencies can reach 8 kHz and, according to Shannon's sampling theorem, the sampling rate must be at least twice the highest frequency present in the signal. Sampling must therefore be done at least 16,000 times per second, which means a sampling frequency of at least 16 kHz.
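The sampling rule can be checked numerically. The short Python sketch below is only an illustration, with hypothetical function names: it computes the minimum sampling frequency for a given highest frequency, then shows what happens when the rule is not respected. A 10 kHz tone sampled at 16 kHz produces exactly the same samples as a 6 kHz tone, so the two can no longer be told apart (an effect known as aliasing).

```python
import math

def min_sampling_rate(max_freq_hz):
    """Minimum sampling frequency required for a signal whose highest
    frequency is max_freq_hz, following the sampling theorem."""
    return 2 * max_freq_hz

def sample_tone(freq_hz, rate_hz, n):
    """Pick n values of a cosine tone at regular time intervals."""
    return [math.cos(2 * math.pi * freq_hz * k / rate_hz) for k in range(n)]

rate = min_sampling_rate(8000)             # 16000 samples per second
in_band = sample_tone(6000, rate, 32)      # below 8 kHz: captured correctly
too_high = sample_tone(10000, rate, 32)    # above 8 kHz: not representable

# Largest difference between the two sample sequences (zero up to rounding).
max_diff = max(abs(a - b) for a, b in zip(in_band, too_high))
print(rate, max_diff)
```

This is why, in practice, frequencies above half the sampling rate are usually filtered out before sampling.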
In a second step, the values held at the previous step are converted into binary words. This operation, performed by a digital quantizer, inevitably adds a measurement error called quantization noise. The resolution of the digital conversion is limited by the number of bits available to encode the output signal as sequences of zeros and ones. For speech recordings, a resolution of 16 bits generally yields an acceptable quantization error and does not significantly affect the performance of automatic speech transcription systems.
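As a rough illustration of quantization noise, the following Python sketch quantizes a test sine wave at a given resolution and measures the resulting signal-to-noise ratio. It assumes a simple uniform quantizer over the range [-1.0, 1.0) and a 440 Hz test tone. These choices, like the function names, are assumptions made for the example and do not describe any specific converter.

```python
import math

def quantize(x, bits):
    """Convert a value in [-1.0, 1.0) to a signed integer code on `bits` bits."""
    levels = 2 ** (bits - 1)
    code = round(x * levels)
    return max(-levels, min(levels - 1, code))   # stay within the binary word

def dequantize(code, bits):
    """Convert an integer code back to a value in [-1.0, 1.0)."""
    return code / 2 ** (bits - 1)

def snr_db(bits, n=16000, freq_hz=440.0, rate_hz=16000):
    """Signal-to-quantization-noise ratio, in dB, for a test sine wave."""
    signal_power = 0.0
    noise_power = 0.0
    for k in range(n):
        x = 0.9 * math.sin(2 * math.pi * freq_hz * k / rate_hz)
        err = x - dequantize(quantize(x, bits), bits)   # quantization error
        signal_power += x * x
        noise_power += err * err
    return 10 * math.log10(signal_power / noise_power)

print(snr_db(8))    # coarse resolution: audible quantization noise
print(snr_db(16))   # 16-bit resolution: much lower noise
```

Each additional bit halves the quantization step, which improves the ratio by about 6 dB. This is the reason 16 bits leaves a comfortable margin for speech.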
The MPEG-1 Audio Layer III codec, also known as mp3, is a lossy compression algorithm used to reduce the size of audio files, typically by a factor of about 10. Several studies have already shown that the performance of automatic speech recognition systems is not significantly affected by mp3 compression. Nevertheless, before compression, the files must have been digitized with a sampling frequency of at least 16 kHz and coded with a resolution of at least 16 bits. The bitrate used to adjust the quality of the mp3 conversion must also be higher than 32 kilobits per second.
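To get a feel for these figures, the Python sketch below computes the bitrate of an uncompressed recording from its sampling frequency and resolution, and compares it with an mp3 bitrate. The helper names are hypothetical, and the calculation ignores file headers and metadata, so it is only an order-of-magnitude estimate.

```python
def pcm_bitrate_kbps(rate_hz, bits, channels=1):
    """Bitrate of uncompressed audio, in kilobits per second."""
    return rate_hz * bits * channels / 1000

def compression_ratio(rate_hz, bits, mp3_kbps, channels=1):
    """How many times smaller the mp3 stream is than the uncompressed one."""
    return pcm_bitrate_kbps(rate_hz, bits, channels) / mp3_kbps

# Mono speech at 16 kHz and 16 bits, compared with a 32 kbps mp3.
print(pcm_bitrate_kbps(16000, 16))        # 256.0
print(compression_ratio(16000, 16, 32))   # 8.0

# CD-quality stereo (44.1 kHz, 16 bits) compared with a 128 kbps mp3.
print(compression_ratio(44100, 16, 128, channels=2))
```

The actual ratio therefore depends on both the original recording format and the chosen mp3 bitrate.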
Thanks for your attention!
How does a computer program convert speech to text?
This is the question we will continue to explore on the Authôt blog this summer.
Authôt : You speak. We write.