I am musing today on the famous phrase, “The pen is mightier than the sword,” which comes from Edward Bulwer-Lytton's historical play Richelieu (1839). (Bulwer-Lytton is also infamous for opening his novel Paul Clifford with the line “It was a dark and stormy night,” inspiring an eponymous fiction contest that seeks the “opening sentence of the worst of all possible novels.”) Words may be mightier than the sword, but something is more powerful still … speech! We may think of words and speech as synonymous, but words are just text, merely capturing and distilling the basic intent of speech. Every playwright and every actor knows what a great gap exists between the letters on the page and the full impact of human expression. Speech, of course, includes words, but it also carries the rich nuance of stress, intonation, timing, volume, timbre, and countless non-verbal utterances. Even the background sounds and resonances are part of the experience. These extra dimensions of speech convey enormous insight into the speaker - their situation, mood, and character - and even the space where they are speaking. There is a gold mine of insight available to us.
We have enjoyed speech-based electronic communications for almost 150 years, but progress is accelerating. We now routinely expect speech-centric interfaces for almost any kind of electronic interaction - both with personal devices like a smartphone and with product services like customer support. Ironically, most of today’s systems rely on text alone: automatic speech recognition (ASR) converts the incoming speech to text, natural language tools ranging from Amazon “skills” to chatbots provide services based on that text stream, and text-to-speech (TTS) renders the response. While robustness and quality vary widely across companies and their services, two significant issues stand out as blockers to pervasive use: identifying the most important speech in high-noise environments, and the loss of speech and speaker intent.
For the first issue, an ASR operates without prejudice toward the words or phrases that are most significant or most likely - it does the best job it can in finding the most likely sequence of words in the target language. Noise and reverberation make this process tricky: the ASR system has to make crude guesses to decipher impaired words, often incorrectly. While available ASR systems vary widely in their noise tolerance, they all suffer significant errors once the noise magnitude reaches or exceeds the speech magnitude. BabbleLabs develops products aimed at overcoming these problems in two ways. First, we build deep-learning-based speech enhancement, trained on extremely noisy speech, that subtracts out background noise, allowing many ASR systems to improve their transcription performance, especially when the ASR and the speech enhancement are coordinated or co-trained. Second, we have developed specialized speech recognizers aimed at a large command vocabulary - typically up to a hundred phrases. These focused-vocabulary recognizers have a fundamentally easier time extracting the useful text from noisy audio because the best-match search considers only this set of target phrases. These methods make speech dramatically better understood in real-world environments - outdoors, in cars and public transportation, in noisy restaurants, and on the factory floor.
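The advantage of a focused vocabulary can be illustrated with a toy sketch - this is not BabbleLabs’ actual recognizer, just an analogy at the text level. Even when a transcription is garbled by noise, restricting the search to a small set of known command phrases makes the best match easy to recover. Here Python’s standard difflib stands in for the real acoustic best-match search, and the command list is hypothetical:

```python
import difflib

# Hypothetical command vocabulary - a real system might support ~100 phrases.
COMMANDS = [
    "turn on the lights",
    "turn off the lights",
    "set a timer",
    "play music",
]

def match_command(noisy_transcript, commands=COMMANDS, cutoff=0.5):
    """Return the best-matching command phrase, or None if nothing is close.

    Because the search considers only the target phrases, a corrupted
    transcript can still snap to the intended command.
    """
    matches = difflib.get_close_matches(
        noisy_transcript.lower(), commands, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

print(match_command("turm of the lights"))  # → "turn off the lights"
print(match_command("zzzzzz"))              # → None (nothing close enough)
```

An open-vocabulary recognizer faced with “turm of the lights” must choose among every word in the language; the focused matcher only has to decide which of four phrases is closest, so heavy corruption still yields the right command.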
The second issue arises because the simple process of converting an audio stream to text leaves behind much of the real information content of the speech. If the rest of that information can also be captured - the identity of the speaker, the noise environment, the speaker’s mood, physical condition, and other emotional context - we can understand and respond to that person far better. We have all had the experience of struggling to express ourselves in an email, realizing afterwards that a short phone call would work much better, because in live communication all of that additional information comes across. Our Clear speech enhancement products, and new developments that capture the content of the speech, address this problem: instead of losing this critical information - leaving all parties to suffer when communicating through speech user interfaces - the speech and its valuable human nuance can be retained.
BabbleLabs is now working to enrich speech technology with better ways to mine all of this non-text information, allowing systems to better accommodate the needs, limitations, and style of the individual. Moreover, this enrichment of speech analytics is not at odds with privacy and fairness. We envision fine-grained control by speakers, or their delegates, over this additional information, permitting them to share only as much as needed to accomplish their ends. While we have much work to do, the impact of better speaker, environment, and sentiment analysis may ultimately transform - and, in a sense, humanize - our interactions through machine agents.