Artificial Intelligence

Automatic Speech Recognition

Natural Language Processing
Machine Learning: Symbol-Based
Problem-Solving Methods
Sentiment Analysis

Automatic Speech Recognition

Automatic speech recognition means getting a computer to understand spoken language.
By “understand” we might mean reacting appropriately or converting the input speech
into another medium, e.g. text. Several variables impinge on this, as we will see later.
Humans do it through articulation: the speaker produces sound waves, which the ear
conveys to the brain for processing.
How might computers do it? Starting from the acoustic waveform (the acoustic signal),
the process involves:
● Digitization – Converting the analogue signal into a digital representation
● Signal processing – Separating speech from background noise
● Phonetics – Handling variability in human speech
● Phonology – Recognizing individual sound distinctions (similar phonemes)
● Lexicology and syntax – Disambiguating homophones and handling features of
continuous speech
● Syntax and pragmatics – Interpreting prosodic features
● Pragmatics – Filtering out performance errors (disfluencies)

Analogue-to-digital conversion involves sampling and quantizing the signal. Filters
are used to measure energy levels at various points on the frequency spectrum.
Knowing the relative importance of different frequency bands for speech makes this
process more efficient: high-frequency sounds are less informative, so they can be
sampled using a broader bandwidth or a log scale.
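The sampling and quantizing steps can be sketched in a few lines of Python. This is a minimal illustration, not a production front end: the function names, the 440 Hz test tone, the 8 kHz sample rate and the 8-bit depth are all assumptions chosen for the example. The mel formula is one common choice of log-like frequency scale; the notes do not say which scale is intended.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, bits):
    """Sample a continuous signal (a function of time with values in [-1, 1])
    and quantize each sample to a signed integer of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1               # e.g. 127 for 8 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                    # sampling: discrete time points
        value = max(-1.0, min(1.0, signal(t)))    # clip to the valid range
        samples.append(round(value * max_level))  # quantizing: nearest integer level
    return samples

def hz_to_mel(frequency_hz):
    """Map a frequency to the mel scale (assumed here as the log-like scale)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def tone(t):
    """Hypothetical test input: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)

digital = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=8000, bits=8)
```

On the mel scale a 1000 Hz step near the top of the spectrum covers far fewer mel units than the same step near the bottom, which is why high-frequency bands can be made broader.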
There are several ways of separating speech from background noise:
Noise-cancelling microphones use two mics, one facing the speaker and the other
facing away, so that the ambient noise is roughly the same for both mics.
Spectrogram analysis helps in knowing which bits of the signal relate to speech.
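The two-microphone idea can be illustrated with a deliberately idealised sketch. It assumes the ambient noise is exactly the same on both mics and that the mic facing away picks up no speech at all; real systems do not meet these assumptions and typically need adaptive filtering. The function name and the sample values are hypothetical.

```python
def cancel_noise(primary, reference):
    """Subtract the reference mic (noise only) from the primary mic
    (speech plus noise), sample by sample."""
    if len(primary) != len(reference):
        raise ValueError("both microphones must supply the same number of samples")
    return [p - r for p, r in zip(primary, reference)]

# Hypothetical sample values for illustration only.
speech = [0.0, 0.5, -0.5, 0.25]
noise = [0.1, -0.2, 0.3, 0.05]
primary_mic = [s + n for s, n in zip(speech, noise)]   # faces the speaker
reference_mic = noise                                  # faces away from the speaker

cleaned = cancel_noise(primary_mic, reference_mic)
```

Under these idealised assumptions `cleaned` matches `speech` up to floating-point rounding.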
● Speaker-(in)dependent systems
– Speaker-dependent systems require “training” to “teach” the system your individual
idiosyncrasies.
– Speaker-independent systems reduce language coverage to compensate for the need
to be flexible in phoneme identification.
● Identifying phonemes
The differences between some phonemes are sometimes very small. They may be
reflected in the speech signal (e.g. vowels have more or less distinctive f1 and f2
formants), and they often show up in coarticulation effects (the transition to the
next sound).
● Disambiguating homophones
Humans mostly recognise the differences from context and from the need to make
sense. Systems can only recognize words that are in their lexicon, so limiting the
lexicon is an obvious ploy.
Discontinuous speech is much easier to recognize, because single words tend to be
pronounced more clearly.
Continuous speech involves contextual coarticulation effects: weak forms,
assimilation and contractions.
In interpreting prosodic features, the characteristics we need to identify are pitch,
length and loudness, which are used to indicate stress.
All of these are relative, on a speaker-by-speaker basis and in relation to context.
Also, pitch and length are phonemic in some languages.
Pitch - a contour that can be extracted from the speech signal, but pitch differences
are relative: one man’s high is another (wo)man’s low. Pitch range is variable. Pitch
contributes to intonation but has other functions in tone languages.
Length - easy to measure but difficult to interpret. It is phonemic in many languages,
and speech rate is not constant: it slows down at the end of a sentence.
Loudness - easy to measure but difficult to interpret.
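To make "pitch can be extracted but is relative" concrete, here is a rough sketch. It estimates pitch for one frame by autocorrelation, which is one common technique and an assumption here since the notes do not name a method, and then expresses a contour in semitones relative to the speaker's own mean. The function names, the 50-400 Hz search range and the test tone are all hypothetical.

```python
import math

def estimate_pitch_hz(frame, sample_rate_hz, min_hz=50.0, max_hz=400.0):
    """Estimate the fundamental frequency of one frame by finding the lag
    at which the frame best matches a shifted copy of itself."""
    min_lag = int(sample_rate_hz / max_hz)
    max_lag = min(int(sample_rate_hz / min_hz), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag if best_lag else None

def relative_to_speaker(contour_hz):
    """Express each pitch value in semitones relative to the speaker's mean."""
    mean_hz = sum(contour_hz) / len(contour_hz)
    return [12 * math.log2(f / mean_hz) for f in contour_hz]

# Hypothetical test frame: 50 ms of a 200 Hz tone sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
pitch = estimate_pitch_hz(frame, sample_rate_hz=8000)

# Two hypothetical speakers producing the same rise at different absolute pitches.
low_voice = relative_to_speaker([100.0, 110.0, 120.0])
high_voice = relative_to_speaker([200.0, 220.0, 240.0])
```

After normalisation the two contours coincide, which is the sense in which one speaker's high is another's low.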
Performance errors include non-speech sounds, hesitations, false starts and
repetitions. Filtering them implies handling at the syntactic level or above. Some
disfluencies are deliberate and have a pragmatic effect; this is not something we can
handle in the near future.
Machine Learning - Acoustic and lexical models are built by analyzing training data
in terms of relevant features. From large amounts of data they learn the different
possibilities: different phone sequences for a given word, and different combinations
of elements of the speech signal for a given phone/phoneme.
Language Model - Models the likelihood of a word given the previous word(s).
n-gram models: Build the model by calculating bigram or trigram probabilities from a
text training corpus. Word sequences that never occur in the corpus raise smoothing
issues.
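A bigram model of this kind can be sketched as follows. The three-sentence corpus is made up for illustration, the function names are hypothetical, and add-one (Laplace) smoothing is just one simple answer to the smoothing issue; the notes do not say which method is intended.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count single words and adjacent word pairs in a training corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_probability(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing, so unseen pairs are never zero."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus, invented for this example.
corpus = [
    "I want to recognize speech",
    "I want to wreck a nice beach",
    "we recognize speech",
]
unigrams, bigrams = train_bigram_model(corpus)

p_seen = bigram_probability("recognize", "speech", unigrams, bigrams)
p_unseen = bigram_probability("recognize", "beach", unigrams, bigrams)
```

Here `p_seen` is larger than `p_unseen`, but smoothing keeps `p_unseen` above zero so that an unseen pair does not rule out a whole sentence.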
The Noisy Channel Model - Use the acoustic model to give a set of likely phone
sequences, then use the lexical and language models to judge which of these are
likely to result in probable word sequences.
The trick is having sophisticated algorithms to juggle the statistics. This is a bit
like the rule-based approach, except that it is all learned automatically from data.
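The way the acoustic and language models are combined can be shown with a toy decoder. It picks the word sequence W that maximises P(A|W) x P(W), working in log space. All probabilities below are invented for illustration, and the function and variable names are hypothetical; a real recognizer searches a very large space of hypotheses rather than two fixed candidates.

```python
import math

def decode(acoustic_scores, language_model):
    """Return the candidate word sequence with the highest combined score:
    log P(acoustics | words) + log P(words)."""
    def log_score(words):
        return math.log(acoustic_scores[words]) + math.log(language_model[words])
    return max(acoustic_scores, key=log_score)

# Invented numbers: the acoustic model slightly prefers the wrong reading...
acoustic_scores = {
    "wreck a nice beach": 0.6,
    "recognize speech": 0.4,
}
# ...but the language model finds the other word sequence far more probable.
language_model = {
    "wreck a nice beach": 0.001,
    "recognize speech": 0.01,
}

best = decode(acoustic_scores, language_model)
```

With these numbers the language model outweighs the acoustic preference, so `best` is "recognize speech".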
Remaining problems include:
•Robustness – graceful degradation, not catastrophic failure
•Portability – independence of computing platform
•Adaptability – to changing conditions (different mic, background noise, new speaker,
new task domain, even a new language)
•Language Modelling – is there a role for linguistics in improving the language
models?
•Confidence Measures – better methods to evaluate the absolute correctness of
the recognition output
•Out-of-Vocabulary (OOV) Words – systems must have some method of detecting
OOV words and dealing with them in a sensible way
•Spontaneous Speech – disfluencies (filled pauses, false starts, hesitations,
ungrammatical constructions, etc.) remain a problem
•Prosody – stress, intonation and rhythm convey important information for word
recognition and for the user's intentions (e.g., sarcasm, anger)
•Accent, dialect and mixed language – non-native speech is a huge problem,
especially where code-switching is commonplace


General capabilities include speech and language understanding and generation, with
politeness, self-awareness and belief ascription; recognition of emotion from speech;
and a vision capability, including visual recognition of emotions and faces and also
situational


Applications of natural language processing include:
•Question-answering systems, where natural language is used to query a database
•Automated customer service over the telephone
•Spoken language control of a machine
•Finding appropriate documents on certain topics from a database of texts
•Extracting information from messages or articles on certain topics
•Translating documents from one language to another
•Summarizing texts for certain purposes
•Tutoring systems, where the machine interacts with a student

Stages in a Comprehensive NLP System

1. Tokenization - splits paragraphs and sentences into smaller units that can be more
easily assigned meaning.
2. Morphological Analysis - focuses on how the components within a word are arranged
or modified to create different meanings.
3. Syntactic Analysis - analyzes natural language using the rules of a formal grammar.
4. Semantic Analysis - uses the grammatical structure of sentences, including the
arrangement of words, phrases and clauses, to determine the relationships between
independent terms in a specific context.
5. Pragmatics and Discourse Analysis - involve the study of language in its context of
use.
6. Knowledge-Based Reasoning - covers methods for acquiring and representing
knowledge and for applying it to solve well-known problems in NLP, such as ambiguity
resolution.
7. Text Generation - produces natural language text as output; it is itself a subfield
of natural language processing.
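
The first stage, tokenization, is simple enough to sketch directly. The version below is a rough regex-based illustration for English text only; the function name and the splitting rules are assumptions for this example, and practical tokenizers handle many more cases (abbreviations, numbers, other scripts).

```python
import re

def tokenize(text):
    """Split text into sentences, then split each sentence into word and
    punctuation tokens."""
    # A sentence is assumed to end at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # A token is a word (optionally with an apostrophe part) or one punctuation mark.
    return [re.findall(r"\w+(?:'\w+)?|[^\w\s]", s) for s in sentences]

tokens = tokenize("Speech is hard. Isn't it?")
```

Each inner list holds the tokens of one sentence, which is the form the later stages (morphological and syntactic analysis) would take as input.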

