K.R.ROAD, V.V.PURA, BENGALURU-04
Presented by:
Anigani Nikhila 1BI19ET007
Kandanathi Sahitya 1BI19ET017
Mangalanageswari 1BI19ET024
Rakshitha H Raj 1BI20ET403
INTRODUCTION:
• Speech recognition is a technique that enables machines to automatically identify the
human voice through speech signals. In other words, it creates a communication link
between machines and humans.
• Speech recognition uses acoustic and language modelling algorithms to perform its task,
allowing computer interfaces to respond to a natural human voice.
• Speech recognition is one of the prominent topics of the twenty-first century.
Technological gadgets have spread rapidly through modern society thanks to the vigorous
efforts of scientists to develop algorithms that let machines interact with human beings.
• What was previously thought of as fiction has clearly been achieved, and several
applications now build on speech recognition.
• Syllables:
For pronunciation, we split a word into syllable(s). A syllable usually contains one vowel
sound, with or without surrounding consonants. For example, "water" splits into two
syllables: wa-ter.
• ACOUSTIC MODEL
The acoustic model deals with the raw audio waveforms of human speech and predicts which
phoneme each waveform corresponds to (typically at the character or sub-word level).
• LANGUAGE MODEL
The language model guides the acoustic model, discarding predictions that are improbable
(see the rescoring sketch after this list).
• LEXICONS
A lexicon glues these two models together by controlling which sounds we recognise and which words we predict.
• PRONUNCIATION MODEL
The pronunciation model handles differences between accents, dialects, age, gender, and the
many other factors that affect how the words we predict are spoken.
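To make the interplay concrete, here is a minimal sketch of how a language model can
rescore acoustic-model candidates; the phrases, scores, and weight below are made-up
illustrations, not taken from the slides.

# Acoustic model: candidate transcripts with log-probabilities from the audio
acoustic_scores = {"I scream": -1.2, "ice cream": -1.3}

# Language model: how plausible each word sequence is as ordinary text
lm_scores = {"I scream": -6.0, "ice cream": -2.5}

def combined_score(candidate, lm_weight=0.8):
    # Weighted sum of acoustic and language-model log-probabilities
    return acoustic_scores[candidate] + lm_weight * lm_scores[candidate]

best = max(acoustic_scores, key=combined_score)
print(best)  # -> "ice cream": the LM overrules the slightly better acoustic score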
Transforming raw audio waves to spectrogram images for input to a deep learning model
Select input data consisting of audio files of the spoken speech in an audio format
such as ".wav".
Read the audio data from the file and load it into a 2D Numpy array. This array
consists of a sequence of numbers, each representing a measurement of the
intensity or amplitude of the sound at a particular moment in time. The number of
such measurements is determined by the sampling rate. For instance, if the
sampling rate was 44.1kHz, the Numpy array will have a single row of 44,100
numbers for 1 second of audio.
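As a concrete illustration of this loading step, here is a minimal sketch using the
librosa library (one common choice; the slides do not name a library, and the file
name is hypothetical).

import librosa
import numpy as np

# sr=None keeps the file's native sampling rate instead of resampling
samples, sample_rate = librosa.load("speech.wav", sr=None, mono=False)

# librosa returns shape (n,) for mono and (channels, n) for stereo;
# reshape mono to a single-row 2D array for consistency
if samples.ndim == 1:
    samples = samples[np.newaxis, :]

print(samples.shape)  # e.g. (1, 44100) for 1 second of mono audio at 44.1 kHz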
Audio can have one or two channels, known in common parlance as mono or stereo.
Our audio data items might vary a lot. Clips might be sampled at different rates or have
different numbers of channels, and will most likely have different durations. As explained
above, this means that the dimensions of each audio item will be different.
Since our deep learning models expect all our input items to have a similar size, we now
perform some data cleaning steps to standardize the dimensions of our audio data. We
resample the audio so that every item has the same sampling rate. We convert all items to
the same number of channels. All items also have to be converted to the same audio
duration. This involves padding the shorter sequences or truncating the longer sequences.
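Here is a minimal sketch of these standardization steps, again assuming librosa; the
target sampling rate and duration are made-up values, not from the slides.

import librosa
import numpy as np

TARGET_SR = 16000            # target sampling rate (assumed)
TARGET_LEN = 4 * TARGET_SR   # fixed duration: 4 seconds of samples (assumed)

def standardize(samples, sr):
    # Convert stereo to mono by averaging the channels
    if samples.ndim > 1:
        samples = samples.mean(axis=0)
    # Resample so that every item has the same sampling rate
    if sr != TARGET_SR:
        samples = librosa.resample(samples, orig_sr=sr, target_sr=TARGET_SR)
    # Pad shorter clips with silence, truncate longer ones
    if len(samples) < TARGET_LEN:
        samples = np.pad(samples, (0, TARGET_LEN - len(samples)))
    else:
        samples = samples[:TARGET_LEN]
    return samples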
If the quality of the audio is poor, we might enhance it by applying a noise-removal
algorithm to eliminate background noise so that we can focus on the spoken audio.
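The cleaned audio can then be transformed into the spectrogram image mentioned above.
Here is a minimal sketch using librosa's mel spectrogram; the parameter values are
assumptions, not from the slides.

import librosa
import numpy as np

def to_spectrogram(samples, sr=16000):
    # Short-time Fourier transform binned onto the mel scale
    mel = librosa.feature.melspectrogram(
        y=samples, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
    # Convert power to decibels, which usually works better as model input
    return librosa.power_to_db(mel, ref=np.max)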
The model decodes the character probabilities to produce the final output
The audio and the spectrogram images are not pre-segmented to tell us which part of the
audio corresponds to which character in the transcript.
• In the spoken audio, and therefore in the spectrogram, the sound of each character
could be of different durations.
• There could be gaps and pauses between these characters.
• Several characters could be merged together.
• Some characters could be repeated, e.g. in the word 'apple', how do we know whether
that "p" sound in the audio actually corresponds to one or two "p"s in the transcript?
• This is actually a very challenging problem, and it is what makes speech recognition
so tough to get right. It is the distinguishing characteristic that differentiates
speech recognition from other audio applications such as classification.
• We tackle this by using Connectionist Temporal Classification (CTC), sketched below.
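As a minimal sketch of the idea (greedy decoding only, not the full CTC algorithm), the
decoder collapses repeated frame labels and then drops a special blank token; a blank
between two identical labels is what lets the double "p" in "apple" survive.

BLANK = "-"  # stands for the CTC blank token in this illustration

def ctc_greedy_decode(frame_labels):
    decoded = []
    prev = None
    for label in frame_labels:
        # Collapse repeats, then drop blanks
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return "".join(decoded)

# The "aa" and "pp" runs collapse, but the blank between the two "p" runs
# keeps both of them, so the double "p" is preserved
print(ctc_greedy_decode("aap-pplee"))  # -> "apple"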
The continuous audio is sliced into discrete frames and input to the RNN
Count the Insertions, Deletions, and Substitutions between the Transcript and the Prediction
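These counts give the Word Error Rate (WER). Here is a minimal sketch that computes it
as the word-level edit (Levenshtein) distance divided by the length of the reference
transcript; the sample sentences are made up.

def wer(reference, prediction):
    ref, hyp = reference.split(), prediction.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j predicted words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.17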
• People who have little keyboard skill or experience, who are slow typists, or who do
not have the time or resources to develop keyboard skills.
• People who have problems with character or word use and manipulation in textual form.
• People with physical disabilities that affect either their data entry, or their ability
to read (and therefore check) what they have entered.