
-: SPEECH RECOGNITION :-

Introduction
One doesn’t have to be a scientist to know that the computer of the future will
talk, listen and understand. One such computer is the Apple Macintosh of today.
Apple’s Speech Recognition and Speech Synthesis Technologies now give
speech-savvy applications the power to carry out your voice commands and
even speak back to you in plain English.
Apple Speech Recognition lets the system (Macintosh) understand what you
say, giving you a new dimension for interacting with and controlling your
computer by voice. You don’t even have to train it to understand your voice,
because it already understands you, from your very first word. You can
speak naturally, without pausing or stopping. Apple’s leadership in speech
recognition technology makes it possible by bringing a whole new dimension
to the user interface: speech. Combined with Voice-Over, speech synthesis
will help turn the graphical user interface into a vocal user interface.
Speech recognition (in many contexts also known as 'automatic speech
recognition', computer speech recognition or erroneously as Voice
Recognition) is the process of converting a speech signal to a sequence of
words, by means of an algorithm implemented as a computer program.
Speech recognition applications that have emerged in recent years
include voice dialing (e.g., Call home), call routing (e.g., I would like to
make a collect call), simple data entry (e.g., entering a credit card number),
and preparation of structured documents (e.g., a radiology report).
Voice Verification or speaker recognition is a related process that attempts to
identify the person speaking, as opposed to what is being said.

Speech Technology Development at IBM:
The overall view, with emphasis on Via-Scribe and Accessibility

 Speech technologies – development, deployments


Technology and applications:
• Large Vocabulary Speech Recognition: broadcast news transcription, content
spotting and indexing, Via-Scribe, MALACH, DARPA projects
• Telephony Speech Recognition (+ natural language understanding): mutual
funds transactions, contact center call routing, contact center analytics
• Embedded Speech Recognition (+ multimodal input): embedded speech in
telematics (e.g., vehicles), devices (e.g., cell phones, PDAs) and other
consumer appliances (e.g., set-top boxes, DVD players)
• Audio-Visual Speech Recognition: improved ASR on the trading floor
• Conversational Biometrics: speaker identification, speaker verification
• Text-to-Speech Synthesis: Home Page Reader, ViaVoice
• Machine Translation: MASTOR, DARPA projects, WebSphere

 Speech Analytics: Automated Quality Assurance Application


• Monitor 100% of calls
• Download recorded calls daily from across North America
• Answer questions and assign default ratings
• Provide a ranked list to human monitors to focus on bad calls

Speech recognition is the process of converting an acoustic signal, captured by
a microphone or a telephone, to a set of words. The recognized words can be the
final result, as for applications such as command and control, data entry, and
document preparation. They can also serve as the input to further linguistic
processing in order to achieve speech understanding.

An isolated-word speech recognition system requires that the speaker pause
briefly between words, whereas a continuous speech recognition system does not.
Spontaneous, or extemporaneously generated, speech contains disfluencies and is
much more difficult to recognize than speech read from a script. Some systems
require speaker enrollment: a user must provide samples of his or her speech
before using the system. Other systems are said to be speaker-independent, in
that no enrollment is necessary. Some of the other parameters depend on the
specific task. Recognition is generally more difficult when vocabularies are large
or have many similar-sounding words. When speech is produced in a sequence of
words, language models or artificial grammars are used to restrict the combination
of words.
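As a toy illustration of how a language model restricts word combinations, the sketch below builds a bigram model from a tiny, made-up command corpus. The corpus, the `<s>` start token, and the resulting probabilities are assumptions for illustration only, not any particular product's model.

```python
# A minimal bigram language model: a word sequence is scored by the
# product of P(word | previous word) estimated from a small corpus.
from collections import defaultdict

corpus = [
    ["call", "home"],
    ["call", "work"],
    ["please", "call", "home"],
]

# Count bigram and unigram (history) frequencies, with <s> as start token.
bigram_counts = defaultdict(int)
history_counts = defaultdict(int)
for sentence in corpus:
    words = ["<s>"] + sentence
    for prev, curr in zip(words, words[1:]):
        bigram_counts[(prev, curr)] += 1
        history_counts[prev] += 1

def bigram_prob(prev, curr):
    """P(curr | prev); zero if the pair was never seen."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / history_counts[prev]

def sentence_prob(sentence):
    """Probability of a word sequence under the bigram model."""
    words = ["<s>"] + sentence
    p = 1.0
    for prev, curr in zip(words, words[1:]):
        p *= bigram_prob(prev, curr)
    return p

# "call home" is a plausible command; "home call" is ruled out entirely.
print(sentence_prob(["call", "home"]))  # > 0
print(sentence_prob(["home", "call"]))  # 0.0
```

A recognizer would use such scores to prefer word sequences the grammar allows over acoustically similar but implausible ones.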

Speech recognition is a technology that is constantly evolving. It is a technology
that is experiencing tremendous growth in the commercial market, apart from its
original niche as an assistive technology product. There are presently three major
companies with speech recognition products, Dragon Systems, Lernout & Hauspie
(L&H), and IBM. Stiff competition between these companies, together with
growing demand from consumer and business markets, has led to a tremendous drop
in prices over the last few years. Competition has also fueled the development of a plethora of
new products. Each company has several products available, ranging in price,
features, and the applications that they support. This paper seeks to make sense of
the overwhelming array of products so that persons who are shopping for speech
recognition will have a better understanding of their choices.
What are the Types of Speech Recognition?
*Discrete
• Slower dictation process - better for persons with difficulty in language
processing or in fluid speech
• Word-by-word style, rather than phrases, reflects the way beginning writers
form sentences
*Continuous
• Processes speech by phrase
• Takes context into account
• Is less accurate if phrases are interrupted
• Advantages: Speed and accuracy (for most users)
Who Can Benefit from Speech Recognition?
• Persons with mobility impairments or injuries that prevent keyboard access
• Persons who have or who are seeking to prevent repetitive stress injuries
• Persons with writing difficulties
• Any person who wants hands-free access to the computer

• Any person who wants to increase their typing speed
(reportedly up to 160 wpm)

What is Required to Use Speech Recognition?
• A Powerful Computer
• Consistent Speech (not necessarily intelligible)
• Fluid speech (i.e., not pausing between words) desirable for use of
continuous speech products
• Patience
• Basic knowledge of computers
• Fairly high cognitive ability
Applications of speech recognition
• Command recognition - Voice user interface with the computer
• Dictation
• Interactive Voice Response
• Automotive speech recognition
• Medical Transcription
• Pronunciation Teaching in computer-aided language learning applications
• Automatic Translation
• Hands-free computing

Speech Analysis

Speech analysis/input deals with the following research areas:

[Diagram: Speech analysis addresses three questions. Who is speaking?
(speaker verification and identification); What is being said? (speech
recognition and understanding); How was it said?]

• Human speech has certain characteristics determined by the speaker. Hence,
speech analysis can serve to analyze who is speaking, i.e., to recognize a
speaker for identification and verification. The computer identifies and
verifies the speaker using an acoustic fingerprint, a digitally stored speech
sample of a person. For example, a company may use speech analysis to identify
and verify its employees: an employee says a certain sentence into a
microphone, and the computer system captures the speaker’s voice, identifies
it, and verifies the spoken statement.
• Another main task of speech analysis is to analyze what has been said, i.e.,
to recognize and understand the speech signal itself. Based on the speech
sequence, the corresponding text is generated. This can lead to a
speech-controlled typewriter, a translation system, or part of a workplace
for the physically challenged.
• Another area of speech analysis tries to research speech patterns with
respect to how a certain statement was said. For example, a spoken sentence
sounds different if a person is angry or calm. Another application of this
research could be a lie detector.

The primary goal of speech analysis is to correctly determine individual words,
but a word is recognized only with a certain probability (less than 1).
Environmental noise, room acoustics and the speaker’s physical and
psychological condition all play an important role.

For example, let’s assume an intentionally pessimistic individual word
recognition probability of 0.95. This means that 5% of the words are
incorrectly recognized. If we have a sentence with three words, the probability
of recognizing the whole sentence correctly is 0.95 × 0.95 × 0.95 ≈ 0.857.
This small example emphasizes that a speech analysis system must have a very
high individual word recognition probability.
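The arithmetic above can be checked directly; the helper function name below is ours, introduced only for this sketch, and it assumes the per-word recognition events are independent:

```python
# Probability that an n-word sentence is recognized entirely correctly,
# assuming each word is recognized independently with probability word_prob.
def sentence_recognition_prob(word_prob, n_words):
    return word_prob ** n_words

print(round(sentence_recognition_prob(0.95, 3), 3))   # 0.857
print(round(sentence_recognition_prob(0.95, 10), 3))  # 0.599
```

The exponential decay with sentence length is exactly why the per-word probability must be very high.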
Speech Recognition System

[Block diagram: Speech → Speech Analysis (parameter and property extraction,
on a special chip) → Problem Recognition (comparison with references and
decision, in the main program), which draws on a Reference Storage holding
properties of learned material → Recognized Speech.]

The speech recognition system is divided into system components according
to a basic principle: “data reduction through property extraction”.
• First, speech analysis occurs, where properties must be determined.

• Properties are extracted by comparing the characteristics of individual
speech elements with a set of speech element characteristics given in
advance.
• Second, the speech elements are compared with existing references to
determine the mapping to one of the known speech elements. The
identified speech can be stored, transmitted or processed as a
parameterized sequence of speech elements.
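The two steps above can be sketched in a few lines. The specific features used here (mean amplitude, energy, zero-crossing rate) and the toy signals are illustrative assumptions, not the characteristics a real recognizer would use:

```python
# Toy sketch of "data reduction through property extraction": reduce a
# signal to a small property vector, then map it to the nearest stored
# reference (the comparison-and-decision step).
import math

def extract_properties(signal):
    """Reduce a list of samples to a small characteristic vector."""
    n = len(signal)
    mean_amplitude = sum(abs(x) for x in signal) / n
    energy = sum(x * x for x in signal) / n
    zero_crossings = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)
    return (mean_amplitude, energy, zero_crossings / n)

def recognize(signal, references):
    """Pick the reference whose property vector is nearest (Euclidean)."""
    props = extract_properties(signal)
    return min(references, key=lambda name: math.dist(props, references[name]))

# References would normally be learned in advance, in a training phase.
references = {
    "yes": extract_properties([0.1, 0.5, -0.4, 0.3, -0.2, 0.1]),
    "no": extract_properties([0.9, 0.8, 0.7, 0.9, 0.8, 0.9]),
}
print(recognize([0.1, 0.4, -0.5, 0.2, -0.3, 0.2], references))  # yes
```

The point of the sketch is the data reduction: only a three-number vector, not the full signal, is stored and compared.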
Usually the comparison and decision are executed by the main
system processor. The computer’s secondary storage contains the
letter-to-phone rules, a dictionary of exceptions and the reference
characteristics. The concrete methods differ in their definition of the
characteristics. The principle of “data reduction through property
extraction” can be applied several times to different characteristics. A
system which provides recognition and understanding of a speech signal
applies this principle several times:

[Diagram: Speech passes through Acoustical and Phonetic Analysis (using a
sound pattern / word model), then Syntactical Analysis (using syntax),
yielding recognized speech, and finally Semantic Analysis (using semantics),
yielding understood speech.]

Components of speech recognition and understanding.

• In the first step, the principle is applied to a sound pattern and/or word
model; an acoustical and phonetic analysis is performed.
• In the second step, certain speech units go through syntactical analysis;
thereby, errors of the previous step can be recognized. Very often,
no unambiguous decision can be made during the first step; in this
case, syntactical analysis provides additional help with the decision.
The result is recognized speech.
• The third step deals with the semantics of the previously recognized
language. Here, the decision errors of the previous step can be
recognized and corrected with other analysis methods. Even today,
this step is non-trivial to implement with the current methods known in
artificial intelligence and neural network research. The result of this
step is understood speech.
There are still many problems on which speech recognition research
is being conducted:
 A specific problem is presented by room acoustics with
environmental noise. The frequency-dependent reflections of
a sound wave from walls and objects can overlap with the
primary sound wave.
 Word boundaries must be determined.
 During comparison, time normalization is necessary: the same
word can be spoken quickly or slowly, individual sounds are
extended differently, and there is a minimal time duration
required for their recognition.
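Dynamic time warping (DTW) is one classical way to perform the time normalization just described. The minimal sketch below, with made-up one-dimensional feature sequences, aligns the same pattern spoken at two different speeds:

```python
# Minimal dynamic time warping: aligns two feature sequences that may
# represent the same word spoken at different speeds.
def dtw_distance(a, b):
    """DTW distance between two 1-D feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best alignment cost of a[:i] against b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

slow = [1, 1, 2, 2, 3, 3, 2, 2]   # a word spoken slowly
fast = [1, 2, 3, 2]               # the same word spoken quickly
other = [3, 3, 1, 1]              # a different pattern

print(dtw_distance(slow, fast))   # 0.0: perfect alignment after warping
print(dtw_distance(slow, other))  # > 0
```

A plain element-by-element comparison could not even be applied here, since the sequences have different lengths; warping removes the speed difference before the distance is measured.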

Speech recognition systems are divided into speaker-independent and
speaker-dependent recognition systems. A speaker-independent system can
recognize, with the same reliability, essentially fewer words than a
speaker-dependent system, because the latter is trained in advance.
Training in advance means that there is a training phase for the speech
recognition system, which takes about half an hour. A speaker-dependent
recognition system can recognize around 25,000 words; a speaker-independent
system can recognize around 500 words, but with a worse recognition rate.
These figures should be understood as rough guidelines.

Speech Transmission
The area of speech transmission deals with efficient coding, so that the
speech/sound signal can be transmitted correctly and efficiently over
networks while preserving the same quality. Some principles are:
 Signal form coding
Here, no speech-specific properties or parameters are needed; the
goal is simply the most efficient coding of the audio signal. The
data rate of a PCM-coded stereo audio signal with CD-quality
requirements is 1,411,200 bits/s.
Telephone quality, in comparison to CD quality, needs only 64
kbit/s. Using DPCM, the data rate can be lowered to 56 kbit/s
without loss of quality.
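The data rates quoted above follow directly from sampling rate × bits per sample × number of channels; the short sketch below reproduces them (the 8 kHz / 8-bit mono telephony parameters are the standard PCM telephony values, stated here as an assumption since the text gives only the totals):

```python
# CD quality: 44,100 samples/s, 16 bits per sample, 2 (stereo) channels.
cd_rate = 44_100 * 16 * 2
print(cd_rate)  # 1411200 bits/s

# Telephone quality: 8,000 samples/s, 8 bits per sample, mono.
phone_rate = 8_000 * 8 * 1
print(phone_rate)  # 64000 bits/s = 64 kbit/s

# DPCM codes differences between successive samples; saving roughly one
# bit per sample gives the 56 kbit/s figure mentioned in the text.
dpcm_rate = 8_000 * 7
print(dpcm_rate)  # 56000 bits/s = 56 kbit/s
```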
 Recognition/synthesis methods
There have been attempts to reduce the transmission rate using pure
recognition/synthesis methods: speech analysis (recognition) is
performed on the sender side of a speech transmission system, and
speech synthesis (generation) follows on the receiver side.

[Diagram: Analog speech signal → Speech Recognition → coded speech signal
(transmitted) → Speech Synthesis → analog speech signal.]

Conclusion

 The major players in the speech recognition market are Dragon Systems,
Lernout & Hauspie (L&H), and IBM. Each
company offers several products, ranging in price and features. Because of
the variety of products available, shopping for a speech recognition system
can be an overwhelming experience.
Dragon’s original product, Dragon Dictate, is currently the only product
that uses the discrete speech model. Discrete speech is the best solution for
persons with difficulty in language processing or in fluid speech, or who
form sentences one word at a time, rather than in phrases. The latest version,
3.0 Classic, offers fully functional voice control across all applications. It is
the only current speech recognition product that supports Windows 3.x.
Because it uses discrete speech, it is better than current continuous speech
products at recognizing the speech patterns of persons who naturally pause
between words, and seems to be better at learning to recognize persons with
unique speech patterns. Unfortunately, Dragon Systems has discontinued
development on this product, as the company’s focus is now on continuous
speech products, which are more viable in the larger commercial market.

Dragon’s current continuous speech product line, known as Dragon
NaturallySpeaking, includes a Standard, Preferred, and Professional edition,
listed in order from low end to high end. The Preferred edition includes
dictation playback and text-to-speech, features that distinguish it from the
Standard edition. The Preferred edition also supports input from an external
recording device, although no recording device is provided. A special
version of the Preferred edition, Dragon NaturallySpeaking Mobile, does
include a digital recording device for additional cost. On the high end of
Dragon’s NaturallySpeaking product line, the Professional edition is
distinguished by its expanded macro and scripting capabilities, which allow
users to dictate long sections of text or complex computer operations with
simple commands. The Professional edition also comes in Legal and Medical
versions, which feature custom vocabularies for these disciplines.

 L&H products are based on speech recognition technology developed by
Kurzweil, a major pioneer in speech recognition.
The current L&H product line, called VoiceXpress, includes a Standard,
Advanced, and Professional edition. The differences in these editions are
fairly straightforward. In the Standard edition, VoiceXpress’s natural
language command interface works only in L&H’s own word processing
application, called XpressPad. The Advanced edition extends natural
language support to include Microsoft Word. The Professional edition
further extends natural language support to encompass the entire Microsoft
Office suite, plus Internet Explorer. The Professional edition also provides
support for recorded dictation, and includes a bundled digital recorder.

 IBM has been a major player in speech recognition for many years. Its
discrete speech product, IBM VoiceType, was a major competitor of Dragon
Dictate. However, IBM has discontinued this
product and is now focusing all its efforts on developing continuous speech
products. Its current product line, IBM ViaVoice Millennium, includes a
Standard, Web and Professional edition. The Web edition features natural
language commands for Internet Explorer, Netscape Communicator and
America Online. The Web edition also features a specialized vocabulary for
online chats. The Professional edition provides most of the features of the
Web edition, but also provides natural language commands for the entire
Microsoft Office suite, and specialized business, finance, and computer
vocabularies.

Although speech recognition got its start as an assistive technology product,
the commercial market has fueled its rapid development in recent years, and
the primary target market of each of the companies described above is now
the general public, rather than persons with disabilities.

A person who has a disability, or who works with persons with disabilities,
should come away from this presentation with a more accurate picture of which
speech recognition products will work best for them. There is a lot of
confusion today about speech recognition products, and the main focus of this
presentation is to clarify the speech recognition technology.

References

Tanja Schultz and Katrin Kirchhoff (eds.), Multilingual Speech Processing, April 2006.
Ralf Steinmetz and Klara Nahrstedt, Multimedia: Computing, Communications & Applications.
www.software.ibm.com/speech/
www.dragonsys.com
http://cslu.cse.ogi.edu/HLTsurvey/ch1node5.html
http://www.apple.com/macosx/developertools/
