The automatic conversion of speech into text typically consists of four key steps:
Here we discuss speech production, starting with two fundamental concepts that are often confused: voice and speech.
There are fundamental differences between voice and speech.
Automatic speech transcription aims to recognize the spoken content of a message and convert it into text.
Conversely, voice recognition technologies are the set of techniques used to identify a person by their voice, for example during a police investigation.
In summary, a speech message can be conveyed by many different voices, but a given voice is usually unique, mainly because it is strongly linked to the shape of one body. This brings us to the speech production system.
Speech production is based on complex phenomena, widely studied for their role in human cognition and communication. Here, we focus on physiological aspects.
A healthy human being produces sounds by driving air from the lungs. The coupling between the lungs, the vocal folds, the vocal tract, and the oral and nasal cavities, together with the position of the tongue, the jaws, the lips, and the teeth, modulates the voice and distributes energy into specific vibrational modes for different speech sound units.
By simply placing your hand on your throat, you can distinguish two types of sounds.
Voiced sounds are produced by vibration of the vocal folds and include vowels such as /a/ and /o/. On the left of the red curves in the figure below, these voiced sounds show resonance peaks in the low and medium frequencies.
Unvoiced sounds, such as the hissing /s/ and the plosive /p/, do not require the vocal folds to vibrate. In this case, the positions of the tongue and the lips lead to totally different energy distributions.
These differences are exploited by automatic speech recognition algorithms.
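As a rough illustration of how such a difference can be measured, here is a minimal Python sketch. It is not the method used by any particular recognition system. It computes the zero-crossing rate, a simple classic cue that tends to be low for voiced sounds and high for noisy unvoiced sounds. The two signals are synthetic stand-ins (a few harmonics of a 120 Hz tone for a vowel, uniform random noise for a sound like /s/), and the sample rate and function names are assumptions made for this example.

```python
import math
import random

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def synthetic_voiced(n, f0=120.0):
    """Toy vowel-like signal: three harmonics of a low fundamental."""
    return [
        sum(
            math.sin(2 * math.pi * f0 * k * t / SAMPLE_RATE) / k
            for k in (1, 2, 3)
        )
        for t in range(n)
    ]


def synthetic_unvoiced(n, seed=0):
    """Toy /s/-like signal: uniform random noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


voiced = synthetic_voiced(1600)      # 0.1 s of "vowel"
unvoiced = synthetic_unvoiced(1600)  # 0.1 s of "noise"

print("voiced ZCR:  ", zero_crossing_rate(voiced))
print("unvoiced ZCR:", zero_crossing_rate(unvoiced))
```

On these toy signals the periodic, vowel-like one changes sign only a few times per cycle, while the noise changes sign at roughly every other sample. Real speech is far less clean, so a cue like this is usually only one of several measurements.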
Thank you for your attention!
How does a computer program automatically convert speech into text?
This is the question we will continue to explore on the Authôt blog this summer. Stay tuned!
Authôt: You speak. We write.