As a major player in automatic speech-to-text transcription, we have chosen to look at how language develops. The idea is to understand how a language forms and, from there, how voice recognition technology works.
Linguistics is the scientific study of language, including language form, meaning and language in context. It has several fields of analysis, two of which matter most here: phonetics and phonology.
Phonetics describes speech sounds very precisely, but no two realizations of a sound are exactly alike. The variety of sounds that the human voice can produce is effectively infinite.
For example, the sound [i] differs depending on whether it is pronounced by a man or by a woman, and it can also vary for a single person from one moment to the next (emotion, a cold).
Phonology's answer is that a language does not retain every sound difference. It retains only the differences that are relevant within its linguistic system, that is, the significant variations. Sounds are therefore defined by their function.
The sound [i] in English fulfils the same function regardless of the speaker or the pronunciation: it corresponds to the phoneme /i/.
The Collins dictionary defines a phoneme as one of the set of speech sounds in any given language that serve to distinguish one word from another. A phoneme may consist of several phonetically distinct articulations, which are regarded as identical by native speakers, since one articulation may be substituted for another without any change of meaning.
A phoneme is the smallest unit of speech that can be used to make one word different from another word. In linguistics, we speak of distinctive units of pronunciation. Phonemes are represented by letters between slashes: /a/, /b/, /r/…, under the rule that one phoneme = one symbol.
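To make the idea of a "distinctive unit" concrete, here is a minimal Python sketch that looks for minimal pairs, that is, words whose transcriptions differ by exactly one phoneme. The small lexicon and the function names are hypothetical and chosen only for illustration; a real pronunciation dictionary would be far larger.

```python
from itertools import combinations

# Hypothetical toy lexicon: spelling -> phonemic transcription,
# following the rule "one phoneme = one symbol".
LEXICON = {
    "pat": ("p", "æ", "t"),
    "bat": ("b", "æ", "t"),
    "pit": ("p", "ɪ", "t"),
    "bit": ("b", "ɪ", "t"),
    "cat": ("k", "æ", "t"),
}


def differing_positions(a, b):
    """Return the indices where two equal-length phoneme sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def minimal_pairs(lexicon):
    """Yield (word1, word2, phoneme1, phoneme2) for words that differ
    in exactly one phoneme at the same position."""
    for (w1, p1), (w2, p2) in combinations(lexicon.items(), 2):
        if len(p1) != len(p2):
            continue  # this sketch only compares sequences of equal length
        diff = differing_positions(p1, p2)
        if len(diff) == 1:
            i = diff[0]
            yield w1, w2, p1[i], p2[i]


PAIRS = list(minimal_pairs(LEXICON))
for w1, w2, a, b in PAIRS:
    print(f"{w1} / {w2}: /{a}/ vs /{b}/")
```

Each pair printed shows two phonemes doing distinctive work: swapping /p/ for /b/ turns "pat" into "bat", which is why they count as separate phonemes in English.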
English is usually described as having 44 phonemes: 24 consonants + 20 vowels. The exact count varies with the accent.
Some phonemes are phonetically very close, so a student who has difficulties with the spoken language is more likely to make mistakes in transcription or spelling.
The learner's reference will be what "he hears in his head" when he breaks down the word to write it. In this way, if he pronounces the word incorrectly, he will transcribe what "he has heard".
The production of a sound and the perception of phonemes may therefore cause problems in spelling. They can also complicate automatic speech recognition, or voice recognition.
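The effect described above can be sketched in a few lines of Python. The example assumes, purely for illustration, a learner who perceives the short vowel /ɪ/ as the long vowel /iː/. The spelling table, the confusion table and the function names are all hypothetical.

```python
# Hypothetical toy data: phonemic transcription -> spelling.
SPELLING = {
    ("ʃ", "ɪ", "p"): "ship",
    ("ʃ", "iː", "p"): "sheep",
    ("f", "ɪ", "l"): "fill",
    ("f", "iː", "l"): "feel",
}

# Assumption for illustration: this learner hears /ɪ/ as /iː/.
CONFUSIONS = {"ɪ": "iː"}


def perceive(phonemes, confusions):
    """Replace each phoneme by the one the listener believes they heard."""
    return tuple(confusions.get(p, p) for p in phonemes)


def transcribe(phonemes, confusions=None):
    """Spell the word as perceived; '?' if the perceived form is unknown."""
    heard = perceive(phonemes, confusions or {})
    return SPELLING.get(heard, "?")


spoken = ("ʃ", "ɪ", "p")
print(transcribe(spoken))              # listener without the confusion
print(transcribe(spoken, CONFUSIONS))  # listener who merges the two vowels
```

The spoken word is the same in both calls. Only the perception step changes, and that is enough to produce a different written word.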
In Speech and Human-Machine Dialog, Wolfgang Minker and Samir Bennacef explain that automatic speech recognition is a complex area, because there is a major difference between formal language, which is used and understood by machines, and natural language, which is used by humans. Formal language is defined by strict and unambiguous syntax rules. By contrast, in natural language, words or sentences can have several meanings depending on the speaker's intonation or the context.
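Homophones give a simple picture of this ambiguity: one sequence of phonemes, several possible written words. The Python sketch below is only an illustration with hypothetical tables and function names. Real recognizers typically rely on statistical language models rather than a lookup on the previous word.

```python
# Hypothetical toy data: one phoneme sequence, several candidate spellings.
HOMOPHONES = {
    ("r", "aɪ", "t"): ["right", "write"],
    ("t", "uː"): ["two", "too"],
}

# Hypothetical context hints: previous word -> spelling preferred after it.
PREFERRED_AFTER = {
    "to": "write",    # "to write"
    "turn": "right",  # "turn right"
}


def candidates(phonemes):
    """Return every spelling that matches the phoneme sequence."""
    return HOMOPHONES.get(tuple(phonemes), [])


def choose(phonemes, previous_word):
    """Pick a spelling using the previous word, or None if still ambiguous."""
    options = candidates(phonemes)
    if len(options) == 1:
        return options[0]
    preferred = PREFERRED_AFTER.get(previous_word)
    if preferred in options:
        return preferred
    return None  # the sound alone does not decide; more context is needed


heard = ("r", "aɪ", "t")
print(candidates(heard))
print(choose(heard, "to"))
print(choose(heard, "turn"))
print(choose(heard, "the"))
```

The last call returns None on purpose: without enough context, the same sounds remain open to several readings.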
Some human or machine transcription mistakes may be caused by a poor phonological understanding of the language. We can say that the quality of speech-to-text transcription depends on the speaker's language (pronunciation, intonation).
The voice recognition system behind Authôt's automatic transcription will be presented in an upcoming article. Stay tuned! 🙂
Authôt: You speak. We write.