It is a fact that advances in speech recognition have accelerated, to the degree that some banks have begun to adopt automated speech-to-text transcription technology to support their trading-room compliance obligations. But is the technology available today really effective?

We have seen some exciting announcements and claims. Microsoft recently stated that the machine has now surpassed the human at the task of transcribing speech into text. How should we take such statements? How do we deal with the expectations they create? What can we expect the technology to safely deliver?

Part of the answer lies in the evolution of speech recognition, which is inherently linked to the evolution of artificial intelligence and deep learning. Yet a deeper understanding of the terms of reference, the metrics used and the test conditions is needed in order to grasp what is really at stake in this rapidly changing technology field, at the moment it enters the large banks’ trading rooms.

1. Major Progress and Dramatic Announcements

In 2012, a series of papers drew the attention of the general public to voice technology. Microsoft published papers on deep neural networks [1], then demonstrated recognition and simultaneous translation from English to Chinese [2]. In the same year, the New York Times devoted an article to Google Brain, for its results in image recognition and their application to speech recognition [3]. 2016 saw the start of a race for the best speech recognition rates between Microsoft, Google and IBM, and in the summer of 2017 Microsoft announced that the machine surpasses the human at the task of speech transcription [4].

Is this where the automatic speech-to-text story ends, and the flawless delivery of speech-to-text into business applications begins?
Not exactly! So how do we unravel the truth and separate fact from fantasy? We will start with the science behind the claims.

2. The link with Artificial Intelligence

The current craze for all things AI (Artificial Intelligence) has the same technological base as the recent advances in speech recognition: deep learning, i.e. learning based on neural networks endowed not only with an input layer and an output layer, but also with a large number of so-called “hidden” layers, a model loosely imitating the organisation of the human cerebral cortex (Figure 1).

Figure 1. Neural Network
(a) Diagram of a neural network with an input layer, two hidden layers and an output layer
(b) Principle of the functioning of the neural network, inspired by the biological neuron-synapse model.
Source: [5]

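To make the architecture of Figure 1 concrete, here is a minimal sketch, in Python with NumPy, of a feed-forward network with an input layer, two hidden layers and an output layer. The layer sizes, the phoneme count and the random weights are purely illustrative assumptions, not the configuration of any of the systems mentioned in this article.

```python
# Minimal sketch of a feed-forward neural network as in Figure 1.
# All sizes and parameters are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

W1, b1 = layer(40, 64)   # input (e.g. 40 acoustic features per frame) -> hidden layer 1
W2, b2 = layer(64, 64)   # hidden layer 1 -> hidden layer 2
W3, b3 = layer(64, 48)   # hidden layer 2 -> output (e.g. scores for 48 phoneme classes)

def forward(x):
    h1 = np.tanh(x @ W1 + b1)           # hidden layer 1
    h2 = np.tanh(h1 @ W2 + b2)          # hidden layer 2
    scores = h2 @ W3 + b3               # output layer (unnormalised scores)
    e = np.exp(scores - scores.max())   # softmax -> probabilities over classes
    return e / e.sum()

x = rng.normal(size=40)                 # one (invented) frame of acoustic features
print(forward(x).argmax())              # index of the most probable class
```

In a real recogniser, the weights would of course be learned from large amounts of transcribed speech rather than drawn at random.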

It should be remembered that this family of models was first developed in image recognition, then in speech recognition, before being used for problem solving and games, with the resounding success of the Google DeepMind / AlphaGo project, which beat the world champion in the game of Go. Other more recent fields of application include automatic translation and natural language processing.

Image and speech recognition are, in a way, the eldest daughters of current AI.

3. So Fact or Fantasy?

The acceleration of advances in speech recognition, as in the entire field of artificial intelligence, is a fact. Google announced that it had divided its error rate by two or even four between 2012 and 2017. This type of progress is found among most players at the forefront of the field. To find a previous improvement of this magnitude, however, we have to go back at least three or four decades.

4. A historical perspective on speech recognition

The kind of disruption that we are experiencing today with deep neural networks is reminiscent of the one that the advent of statistical approaches represented, in its time, compared to the expert systems much in vogue in the 60s and 70s. The latter relied on decision trees to classify consonants and vowels (called “phonemes”) according to characteristics such as voicing – whether or not the sound carries the tone of the voice – or the plosive character – whether or not the sound is marked by a closure phase of the vocal tract before being released, as in “p”, “t”, “k” and their voiced variants “b”, “d”, “g”. Most of these properties are visible in the spectrogram (Figure 2), that is to say the frequency analysis of the signal over time [6].

Figure 2. Sound signal and spectrogram
“Eyjafjallajökull*”: at the bottom, the sound signal and its amplitude over time; at the top, the spectrogram of the signal; in the middle, the chain of phonemes. Source: [7]


(*) Name of the Icelandic volcano whose eruption in 2010 severely disrupted air traffic.

Theoretically, a set of expert distinctions, following the decision trees, should make it possible to identify each pronounced sound as a precise phoneme. However, this acoustic matching is not so easy in the real world: the space of vocalizations is vast, variable and full of local ambiguities, depending on the speakers, the context, the ambient noise, etc. Moreover, the transitions between phonemes are often subtle, and the tempo of pronunciation and articulation of words is hugely variable.
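As a purely illustrative toy example of the expert-system approach described above, and not an actual historical system, here is what hand-written decision rules over a few binary phonetic features might look like. The feature set and the phoneme groupings are drastically simplified assumptions.

```python
# Toy illustration of expert decision rules classifying a sound from binary
# acoustic features. Real expert systems used far richer feature hierarchies.

def classify_phoneme(voiced: bool, plosive: bool, nasal: bool) -> str:
    if plosive:
        # Plosives: closure then release; voicing separates b/d/g from p/t/k
        return "one of b, d, g" if voiced else "one of p, t, k"
    if nasal:
        return "one of m, n"
    if voiced:
        return "a vowel or voiced fricative (e.g. a, i, v, z)"
    return "a voiceless fricative (e.g. f, s)"

print(classify_phoneme(voiced=False, plosive=True, nasal=False))  # "one of p, t, k"
```

The brittleness is easy to see: as soon as a feature is ambiguous in real audio, the hard-coded branches have no graceful way to express uncertainty.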

Statistical modeling – hidden Markov models (HMMs) and Gaussian mixture models (GMMs) – began to be used to capture this variability. This probabilistic approach, largely driven by data, seemed to distance the scientific community from its insights into and comprehension of its models, but proved far more powerful than expert systems, especially since it coincided with both the competitions launched by the American defense agency ARPA / DARPA, which gave rise to the first large corpora of training data, and the exponential progress in the computing capabilities of microprocessors, following the famous “Moore’s Law” of doubling capacity every 18 months.
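To give a flavour of the GMM side of this modeling, here is a minimal sketch that scores a single acoustic feature value against two invented phoneme mixtures. Real systems use multi-dimensional features, one mixture per HMM state, and parameters estimated from data, so every number and name below is an assumption for illustration only.

```python
# Minimal sketch of the GMM part of an HMM-GMM recogniser: scoring one
# acoustic feature value against each phoneme's Gaussian mixture.
import math

def gmm_likelihood(x, weights, means, variances):
    """Likelihood of observation x under a 1-D Gaussian mixture."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
    return total

# Two invented phoneme models, each a 2-component mixture (weights, means, variances)
models = {
    "a": ([0.6, 0.4], [1.0, 2.0], [0.5, 0.8]),
    "i": ([0.5, 0.5], [3.0, 4.0], [0.4, 0.6]),
}

x = 1.4  # one (invented) acoustic feature value for a frame
best = max(models, key=lambda p: gmm_likelihood(x, *models[p]))
print(best)  # "a": the mixture that explains this frame best
```

In the full HMM-GMM pipeline, such frame likelihoods are combined over time by the hidden Markov model to find the most probable sequence of phonemes and words.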

From that time, in the mid-80s, dates the mythical and controversial phrase of the late Fred Jelinek, a great figure of supervised and unsupervised learning, then director of the speech research group at IBM, who is said to have remarked:

“At the moment a linguist leaves the group, the recognition rate increases.”

The phrase has acquired legendary status. In any case, the HMM-GMM statistical approach ruled unchallenged – or almost – for the next three decades, until it was recently dethroned by deep neural networks.

We now know that major breakthrough announcements are justified, but what about the performance comparisons between machine and human?

5. Understanding performance metrics for voice recognition

The measure generally used to evaluate speech-to-text systems is the word error rate (WER). It consists of adding up all the word errors – substitutions (S), insertions (I) and deletions (D) – and dividing by the total number of words N in the reference transcription: WER = (S + I + D) / N

A WER of 10% corresponds to a recognition rate of 90%. This metric is not perfect, since it gives the same importance to all the errors, while some have much more impact than others with regard to the intended application, but it has the merit of being rigorous and standard, allowing comparisons over time and between systems.
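As an illustration of the formula above, here is a minimal sketch of a WER computation using word-level edit distance. The function name and the example sentences are ours, not taken from any particular evaluation toolkit.

```python
# Minimal sketch: word error rate (WER) between a reference transcription
# and an automatic transcription, via word-level edit distance.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum number of edits (substitutions, insertions, deletions)
    # turning the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                         # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                         # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)         # (S + I + D) / N

# One substitution ("weather" -> "whether") out of four reference words
print(word_error_rate("the weather is nice", "the whether is nice"))  # 0.25
```

This toy example also shows the metric’s limitation mentioned above: the substitution of “weather” by “whether” counts exactly as much as a far more damaging error would.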

6. What the scores of Microsoft, IBM, Google and Baidu say

When Microsoft, IBM, Google or Baidu – the “Chinese Google” – compare the error rate of the machine to the error rate of human transcription, they evaluate the competence of human transcribers who are themselves outside the conversations, not that of the conversation participants, who would have an error rate much closer to zero. Moreover, errors made by automatic systems are generally of a different nature from those made by humans.
The diagram below (Figure 3) shows that the error rate of automatic transcription has fallen below the error rate of human transcribers. But these figures are measured on 20 or 40 telephone conversations from the “Switchboard” corpus, collected in the United States in 2000 and renowned for its particularly clean audio quality.

Figure 3. Evolution of the word error rate
Decrease of the word error rate (WER) on a subset of the Switchboard corpus conversations, down to below the error level of human transcribers. Source: [8]


Figure 4. Comparison of machine and human error rates.

Except in the case of very low-noise data (high SNR, i.e. signal-to-noise ratio), the error rate of the machine is always higher than that of the human.
Sources: [8] and [9]


Baidu’s teams [9] have shown (Figure 4) that the technology is still challenged, notably by non-American English accents, non-native accents and very noisy speech. Despite the recent progress, the areas to improve remain very evident.

7. Conclusion

We cannot downplay the announcements of great progress in speech transcription: they are the result of deep learning and of the increase in computation capability and in the availability of training data, phenomena that affect all domains of artificial intelligence. On the other hand, the performance comparisons ranking the machine ahead of the human are questionable and, at best, correspond to an isolated and particularly favorable scenario: American English in an environment with very little noise. Non-American English accents, non-native accents and very noisy speech remain areas where speech transcription still has a long way to go to narrow the gap.

8. Epilogue

What do these advances mean for the transcription of traders’ conversations in the trading room?

In the context of the introduction of MiFID II and the rise in regulatory risk [10], automatic transcription is seen as a tremendous asset to help bank compliance in trading rooms. But this is where all of the adverse conditions described above can be found: non-American English accents, non-native accents, very noisy speech. The challenge has nevertheless recently been met: in Europe, several corporate and investment banks have just adopted speech-to-text technology, which proves that the progress can deliver production services, but also demonstrates that the technology still has more to deliver.

Back to the future
The long winter for neural networks

Neural approaches date back to the 1950s and were quite popular in the early 1990s, but they went through a “long winter” before the recent thaw, with advances in computing power a catalyst to their effectiveness. We can liken this phenomenon to the famous “long winter of Artificial Intelligence”, even if the two disciplines had not yet converged at the time. We borrowed the diagrams below, with the permission of Nikko Ström, senior scientist at Amazon, to illustrate this long winter, during which the audacious few who risked working on the subject saw their papers rejected by major academic conferences.

Figure: Deep learning milestones
Figure: Deep learning in speech recognition

Today, the general consensus is that the resurgence of neural networks corresponds to the coincidence of three phenomena:

  • Training data corpora 10 to 100 times larger than in the 90s;
  • Parallel computing capabilities multiplied by a factor close to 100 thanks to GPUs – graphics processing units, cards initially dedicated to 3D rendering calculations for video games and repurposed to serve the massively parallel computations of deep learning;
  • Advances in training algorithms for neural networks with a large number of hidden layers.

Rather uniquely, a 2012 publication [12] co-authored by the research groups of IBM, Google, Microsoft and the University of Toronto – where most of the movement’s initiators were based – formalized the recognition of neural networks, which today account for the majority of publications at numerous conferences.

References
[1] Clayton, S., (2012), “A Breakthrough in Speech Recognition with Deep-Neural-Network Approach,” Microsoft News Center, June 20
[2] BBC News, (2012), “Microsoft demos instant English-Chinese translation,” BBC News website, November 9
[3] Markoff, J., (2012), “How Many Computers to Identify a Cat? 16,000,” New York Times, June 25
[4] Huang, X., (2017), “Microsoft researchers achieve new conversational speech recognition milestone,” August 10
[5] Alves, M., and Lemberger, P., (2015), “Deep Learning Step by Step,” www.technologies-eb
[6] Rabiner, L. R., and Juang, B. H., (1993), “Fundamentals of Speech Recognition,” Prentice Hall
[7] Liberman, M., (2010), “Little Icelandic Phonetics,” LDC-UPenn Blog, April 19
[8] Hannun, A. Y., (2017), “Speech Recognition Is Not Solved,” Stanford University blog, October. NB: the figures come from [9], of which he is a co-author.

About the Author

Ariane Nabeth-Halber

Ariane Nabeth-Halber, Director, Strategic Line “Speech”, Bertin IT; Member of the Board of LT-Innovate, Language Technology Industry Association;
Expert and Reviewer at the European Commission; Doctor in Computer Science and Signal Processing.