
Tech Enthusiast Magazine, April 2007

SPEAKING IN [BINARY] TONGUES


How speech recognition can let us talk to our computers (or at least try to)
By Matthew Bruchon
I grew up watching Star Trek, and I didn't concern myself too much with the faster-than-light travel, the loud, fiery explosions in the vacuum of space, or many of the show's other scientific impossibilities. One question always did bother me, though: how did the Klingons and Ferengis and Vulcans all come to speak English so well? Eventually, I discovered that the show's writers had created a device called the Universal Translator, a tiny computer that processes speech from all languages, known or unknown, and converts it to the user's native tongue.

Is Worf a native speaker of English?

Back on modern-day Earth, our computers can't master the Universal Translator's simplest aspect: the ability to hear a person's voice in a familiar language, and to figure out what words are being said. That is the fundamental goal of the speech recognition systems we have today. We don't yet know how to make computers understand 100% of our speech, and much of the time, we might as well be speaking in tongues. Until the human race learns how to speak in binary machine code, there will be a need for improved speech recognition systems.

THE PROMISE OF TOMORROW

The reasons for wanting our computers to be able to understand our voices are seemingly endless. Many of those reasons stem from the fact that all of us, except maybe the very most skilled typists, can speak more quickly than we can type. Nuance Technologies, a company specializing in speech recognition, estimates in its marketing materials that most people speak more than 120 words per minute, but type fewer than 40 words per minute.

I recently learned just how realistic that number is. I was trying to transcribe the recording of an interview I had just conducted with Dr. David McAllister, a Computer Science professor at North Carolina State University in Raleigh. Dr. McAllister is, among many other things, part of a research team doing work in computerized speech processing. When transcribing the interview, I found myself needing to pause the recording every ten seconds or so, sometimes rewinding to re-listen to words I'd missed. My fingers simply could not keep up with the pace of his voice.

The problem wasn't the speed or the clarity of his voice. His evenly measured baritone was no more rapidly spoken than the average person's, and his syllables were clearly articulated. And I'd like to think the problem wasn't my typing abilities: in high school I took a typing class, and my keyboard proficiency has been shaped by years of instant messaging and web surfing. The issue was the basic fact that our hands are a clumsy way to convert our thoughts into a readable form. Our voices, on the other hand, are like a wormhole leap straight from Star Trek, a direct portal from our brains to the outside world. If a computer could have automatically converted Dr. McAllister's voice into text for me, the process would have taken much less time on my part. Looking at society as a whole, similar scenarios are plentiful. Transcriptions of medical and legal information, for example, are currently very time-consuming, and could be made much more efficient with the use of speech recognition. And time is money, of course.

Many of us are familiar with at least a few everyday conveniences provided by speech recognition. Most cell phones, even my Kyocera KX1, the cheapest phone available with my wireless plan, include features like voice dialing and the ability to answer calls with a voice command. Many telephone menus for things like customer service can now be navigated by voice command as an alternative to button presses. And most in-car GPS systems can be commanded by voice. But imagine being able to use your voice not only to get directions, but also to actually drive your car. (If nothing else, it would mean people finally would stop talking on their cell phones while they drove!) That's probably out of the question for the near, and not-so-near, future. But there are some pretty neat advances that aren't so far down the pipeline, and this is perhaps most evident in the area of personal computing. Imagine being able to control your computer's every action by voice command, for example. You wouldn't have to use the keyboard and mouse, forever the bane of heavy computer users and carpal tunnel syndrome sufferers, two groups that go hand in hand (pun intended). You'd also be freed from your desk, and could get things done from the other side of the room if you wanted.

THE REALITY OF TODAY


In fact, speech recognition systems are already being used for personal computing. One group that relies on these systems is the population of disabled people who can't use their hands to type or to move the mouse. Dr. McAllister's own neighbor, for example, suffers from hand muscle atrophy and uses speech recognition software regularly.

NaturallySpeaking 9 Standard
"He talks to his computer and has it do things for him," says McAllister. "He uses it to create email and other messages, and stuff like that works very well. It's not always perfect, but it's much better than you would think." His neighbor uses a standalone program called Dragon NaturallySpeaking, produced by Nuance Technologies, the world's best-selling speech recognition program for professional use.

It has existed in various forms since 1990, when a DOS-based version was made available for $9,000. That version required the user to pause between every word, to help it identify word boundaries. The latest version of NaturallySpeaking retails for $99.99, allows the user to speak in a normal, casual fashion, and advertises up to 99% accuracy.

I decided to try the program for myself. Luckily, the N.C. State library's Assistive Technologies Center had a copy of the program available for me to try out. Getting started was a very simple process: I just put on a headset with a microphone attached, opened the program, and started talking. There is an option to set up a new profile and train the program to understand your voice, a process that takes roughly 30 minutes depending on how thorough you choose to be. I chose to skip that step, because one of the latest version's selling points is that NaturallySpeaking requires no training, so you can get started dictating right away. Armed with several pages' worth of test materials ranging from tongue twisters to Shakespeare monologues, I began to recite in a natural, perhaps slightly more carefully articulated voice.

Dragon NaturallySpeaking's Accuracy: A Sampling

  When I said: "Peter Piper picked a peck of pickled peppers"
  It recorded: "Haircut or effect of takeover tactics"

  When I said: "One small step for man, one giant leap for mankind"
  It recorded: "Was offset from them, when I believe in mankind"

  When I said: "To be or not to be: that is the question"
  It recorded: "To be order not the: man is the question"

  When I said: "NaturallySpeaking is the greatest piece of software"
  It recorded: "NaturallySpeaking is the greatest piece of software"

As the figure above shows, the results of my trial were decidedly mixed. I measured my average voice dictation speed to be roughly 200 words per minute (I average about 60 when typing), but I can't say the improved speed fully made up for the errors. To be fair, the examples I chose are some of the worst; realistically, the dictation averaged about one or two errors per sentence. And I could see a moderate amount of improvement as my trial progressed: I was learning how to use the program (using keywords to dictate commas and periods, for example), and as I corrected its errors, it was beginning to train itself to my voice.
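For readers who want to put a number on results like these, the standard yardstick is word error rate: the number of word substitutions, insertions, and deletions needed to turn the program's transcript back into what was actually said, divided by the length of the true sentence. Below is a minimal sketch of the calculation in Python; it is my own illustration, not anything shipped with NaturallySpeaking.

    # Word error rate: edit distance between the reference and the
    # transcript, counted in words, divided by the reference length.
    def wer(reference, hypothesis):
        ref, hyp = reference.lower().split(), hypothesis.lower().split()
        # d[i][j] = edits needed to turn the first i reference words
        # into the first j transcript words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("to be or not to be that is the question",
              "to be order not the man is the question"))  # 0.4

By this measure my Hamlet attempt scored a word error rate of 40%, a long way from the advertised 99% accuracy, though again, I picked some of the worst examples.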

It's probably safe to say the results would have been much more agreeable if I'd trained the program over a period of days or weeks, just as any serious user would. (McAllister's neighbor had done this, of course.)

Another feature of NaturallySpeaking is the ability to control the mouse by voice. This is accomplished by something called the Mousegrid, which divides the screen into increasingly small numbered rectangles and moves the mouse into whichever rectangle you command. The steps below show how I used the Mousegrid to close a browser window:

  Say "Mousegrid" to show a 3x3 grid.
  Say "One" to pick the upper left box.
  Say "Four" to move to the File menu.
  Say "Click" to click at that position.
  Saying "Close" closes the window.

It was easy enough to use, and for someone who can't use a mouse it would be an essential feature. However, it takes the computer a moment to render each grid onto the screen, and it was necessary to pause a bit between words. It took a total of approximately five seconds for me to close the window. That may not sound like long, but closing a window with the mouse itself takes under a second.
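There is tidy arithmetic behind the Mousegrid: each spoken digit replaces the active rectangle with one of its nine cells, so every command shrinks each dimension by a factor of three. The Python sketch below is my own reconstruction of that narrowing, not Nuance's code; it assumes the cells are numbered like a phone keypad, with 1 at the upper left.

    # Mousegrid-style narrowing: each digit (1-9) picks one cell of a
    # 3x3 grid laid over the current rectangle, 1 = upper left.
    def mousegrid(width, height, digits):
        x0, y0, x1, y1 = 0.0, 0.0, float(width), float(height)
        for d in digits:
            row, col = divmod(d - 1, 3)
            w, h = (x1 - x0) / 3, (y1 - y0) / 3
            x0, y0 = x0 + col * w, y0 + row * h
            x1, y1 = x0 + w, y0 + h
        # "Click" happens at the center of the final rectangle
        return ((x0 + x1) / 2, (y0 + y1) / 2)

    # "Mousegrid" ... "One" ... "Four": the upper left cell, then the
    # middle left cell inside it -- about where a File menu sits.
    print(mousegrid(1024, 768, [1, 4]))  # roughly (56.9, 128.0)

Since each command cuts both dimensions by three, seven digits (3^7 = 2187) would be enough to isolate a single pixel on a 1024x768 screen; in practice two or three digits plus "Click" reach any menu or button, which squares with my five-second experience.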

NaturallySpeaking is the most widely used standalone speech recognition program, but many personal computers are sold with a speech recognition system built in.

Microsoft Office XP is bundled with a speech recognition engine (though it isn't installed by default), and speech recognition is a standard feature in Microsoft's Vista and Apple's Tiger OS.

Given that Vista was just released at the end of last year, the jury is out on the quality of its built-in speech recognition. Extremetech.com reviewed it thoroughly, and concluded that while it isn't perfect, it "becomes so accurate that it's a joy to use" given enough training. The technology's rollout at a public demo was, for the most part, successful, but the public's perception of it was largely shaped by one embarrassing moment that spread virally throughout the blogosphere and even network TV news. When the presenter of the demo, presumably trained with the speech recognition software in advance, tried to write a "Dear Mom" letter by voice, the speech engine produced "Dear aunt," and his repeated attempts to delete the error were misunderstood. The final product was a pathetic "Dear aunt, let's set so double the killer delete select all." "I think it's picking up a little bit of echo," the flustered presenter said, to the audience's laughter. Recovering from that PR nightmare may take awhile.

The speech engine in Tiger OS is, for the most part, unchanged from previous releases of Apple's OS X. A blogger at systemsboy.blogspot.com said the speech engine often froze, and that it was overly sensitive to noise: "Heaven help you if you're eating a burrito while you want to use speech control." One at crunchgear.com reported that Apple's voice recognition is "an afterthought at best and cripple-ware at worst." The same blogger pointed out that, as shown below, setting up the OS X speech engine isn't practical without using a mouse, which would be a problem for the disabled.

A mouse is needed to configure the Mac OS speech engine setup screen.

In defense of the speech engines found in Vista and Tiger, the bulk of the complaints seem to deal more with their initial setup and with controlling applications through voice. I came across relatively few frustrated users of the basic dictation feature, which is still the most widely used feature of the speech engines and their bread and butter. For that purpose, at the very least, the speech engines perform well given enough training.

RESEARCHING SPEECH
I spoke with Dr. McAllister to learn more about the science behind speech processing and what's holding it back from working perfectly. McAllister's research career was already well underway when he entered the area of speech processing. Since early in his career, much of his research has dealt with stereo computer graphics and three-dimensional imaging. One of his projects, for instance, was to help the Defense Mapping Agency "process its warehouses full of high altitude photographs," McAllister says, "and provide elevation values for every place on the earth." Over time, he became a highly regarded expert in the field, publishing two books in the area. His involvement in 3-D imaging continues to this day.

Dr. David McAllister in his office at N.C. State University



McAllister became involved in speech processing during a project related to lip synching, the matching of lip movements to speech. The project, he says, used "filtering, sophisticated techniques and signal processing which had not been applied to tell what a person was saying." These complex methods were used to process speech signals and produce a computer animation of them being spoken. Such a method was of interest to video game and movie animation companies, for example. New to the area of signal processing at the time, McAllister played the role of graduate student for a while.

After that, McAllister and his research partners realized their new signal processing techniques could be used for an entirely different type of speech processing, called speaker recognition. Unlike speech recognition, which seeks to identify the words being spoken, speaker recognition is concerned with identifying the speaker. Many of the underlying problems are shared between the two areas, but the majority of McAllister's speech processing experience is in speaker recognition. There are many uses for speaker recognition technology, including criminal justice and security.
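To make the distinction concrete, here is a toy sketch of the speaker-recognition side, my own illustration rather than anything from McAllister's work: boil each recording down to an average spectral fingerprint and compare fingerprints, ignoring the words entirely. Real systems use far more sophisticated features, but the shape of the computation is similar.

    import numpy as np

    def fingerprint(signal, frame=400):
        # Slice the waveform into overlapping frames and average their
        # log power spectra: the words wash out, the voice character stays.
        window = np.hamming(frame)
        spectra = [np.log(np.abs(np.fft.rfft(signal[i:i + frame] * window)) ** 2 + 1e-10)
                   for i in range(0, len(signal) - frame, frame // 2)]
        return np.mean(spectra, axis=0)

    def similarity(sig_a, sig_b):
        # Correlation between mean-centered fingerprints: 1.0 = identical.
        a, b = fingerprint(sig_a), fingerprint(sig_b)
        a, b = a - a.mean(), b - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Two toy "voices": the same vowel-like sound at different pitches.
    t = np.arange(16000) / 16000.0  # one second at 16 kHz
    alice = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)
    bob = np.sin(2 * np.pi * 180 * t) + 0.5 * np.sin(2 * np.pi * 540 * t)
    print(similarity(alice, alice[::-1]))  # same voice, new "utterance": near 1.0
    print(similarity(alice, bob))          # different voices: noticeably lower

Played backwards, the first voice still matches itself almost perfectly, because its spectrum is unchanged, while the second voice scores noticeably lower.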

A 2002 paper written by McAllister and four colleagues at N.C. State uses a complex mathematical technique to model a speaker's voice in two dimensions, displayed as plots. Even without understanding exactly what the plots mean, it's easy to see that two plots from the same speaker are much more similar than plots from different speakers.

Much of the research being done in speaker recognition deals with criminal justice, and is being subsidized by the government. "It is of interest for the FBI, for instance, to be able to identify people who have issued bomb threats over the telephone," says McAllister, "and lawyers would like to be able to establish that either a person did or didn't say certain things on the telephone." In cases in which it's known for a fact that the speaker is a member of a given group of people (a "closed set" problem), the speaker can be chosen at a forensic quality of 95% or more, given enough voice samples. But in many criminal justice situations, where the speaker (or the suspect, as the case may be) could be a member of that group or not (an "open set" problem), there has been much less success in determining the speaker. "There's a lot of trouble in making such conclusions with enough accuracy that it would stand up in court," says McAllister.
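The gap between the two settings comes down to the decision rule, as the schematic below shows (again my own sketch, reusing the fingerprint idea from earlier). A closed-set system only has to return the best match; an open-set system must also decide whether the best match is good enough, and the threshold separating "a match" from "a stranger" is exactly what is hard to defend in court.

    # Closed set: the caller is known to be one of the enrolled speakers,
    # so the best-scoring enrollment wins outright.
    def identify_closed(sample, enrolled, score):
        return max(enrolled, key=lambda name: score(sample, enrolled[name]))

    # Open set: the caller may be a stranger, so the best match must also
    # clear a threshold. Set it high and true speakers get rejected; set
    # it low and impostors slip through. (The 0.8 is an arbitrary stand-in.)
    def identify_open(sample, enrolled, score, threshold=0.8):
        best = identify_closed(sample, enrolled, score)
        return best if score(sample, enrolled[best]) >= threshold else None

With enrolled as a dictionary mapping names to voice recordings and score as the similarity function sketched earlier, identify_open returns None when no enrolled speaker clears the bar.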

Many of the obstacles that plague speaker recognition are shared by speech recognition. For example, McAllister says two common problems in speaker recognition are a shortage of samples and a speech signal with a lot of noise. Similarly, it's common for a voice to be disguised, either intentionally or by accident: a bomb threat caller might speak in falsetto or in a fake accent, for example, or the speaker could have laryngitis. And one perpetual problem in speech processing is the finite amount of computing power available. As McAllister puts it, "The machines are becoming faster and we can crunch numbers faster, and the algorithms can get more complicated. The problem is, you want to be able to operate in real time. You could do lots of things if you aren't in a hurry that you can't do if you want information now."

One unique aspect of the speech processing field is its multidisciplinary nature. McAllister's specialty is mathematics; he's "a flunky numerical analyst," he jokes. Dr. Robert Rodman, one of his closest research partners, is a computational linguist, and Dr. Donald Bitzer, another member of the speech processing team, is a signal processing expert. They're three very different specialties, but McAllister says that all three of them fit together quite nicely.

The future of speech processing will have plenty of room for more research, and more progress. Aside from the issues of identifying words and speakers, for example, there's the problem of dividing sentences properly. Beyond that, there's an even more complex issue, one that McAllister says still needs a lot of research: how a computer can figure out what a sequence of words means, and whether it's gibberish or not.

To address these complex problems, some of the same methods can be used. For instance, one approach computers use is to look at common acoustic features of voices and sounds. A similar approach could be used to analyze common features of words and sentences. "Feature extraction is a problem," McAllister says. "What are the features that you want? Can you reduce the number of features that matter? And how do you use the features to group individuals into categories?" But until these kinds of high-level problems are solved, he says, we might have to rephrase something repeatedly until the computer understands what is being asked of it. These problems are large enough to make some of the current bugs and inconveniences in speech recognition systems seem trivial by comparison.
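McAllister's three questions map onto textbook pattern-recognition steps: choose features, prune the ones that don't matter, then group what remains. The sketch below is a rough schematic of the last two steps, mine rather than anything from his papers: it drops feature dimensions that barely vary, then clusters the reduced vectors with a plain k-means loop.

    import numpy as np

    def prune_features(X, keep):
        # "Can you reduce the number of features that matter?" A crude
        # answer: drop the dimensions that barely vary across individuals,
        # since flat features can't help tell anyone apart.
        best = np.argsort(X.var(axis=0))[-keep:]
        return X[:, best]

    def group(X, k, steps=25, seed=0):
        # "How do you use the features to group individuals into
        # categories?" One classic answer: k-means clustering.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(steps):
            labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
            for c in range(k):
                if np.any(labels == c):  # leave empty clusters in place
                    centers[c] = X[labels == c].mean(axis=0)
        return labels

    # e.g., with X holding one fingerprint per recording, as sketched
    # earlier: labels = group(prune_features(X, keep=20), k=3)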


WHAT TO EXPECT
It's clear that some uses of speech recognition are more realistic in the near future than others. We probably can expect more speech systems that help make our lives more convenient, as in the case of hands-free computer use. It's been demonstrated that, under the right conditions, that sort of thing can be done at a high level of reliability.

But until that reliability goes from high to perfect, we can't expect to see things that rely on speech processing, only things that use it as a supplement. Imagine if your voice were used to log into your computer instead of a password. What if you had a sore throat and couldn't log in at all? It's safe to say we'll all own keyboards for the foreseeable future, even if we might not be typing on them quite as often. If the Universal Translator only worked 90% (or even 99%) of the time, the Star Trek shows would be more dramatic, to say the least. It's probably safe to say at least a few intergalactic wars would've been caused when a word or two got misinterpreted. Fortunately, it should be a while before we start running into Klingons or Ferengis, and there's plenty of time to get our Universal Translators ready for that day.
