
INTRODUCTION

Speech is the fundamental analog form of human communication: an acoustic signal designed to be heard. A speech signal can be converted to an electrical waveform by a microphone, manipulated using analog or digital signal processing methods, and converted back to acoustic form by a loudspeaker or headphones. Speech recognition is a technology that translates spoken words into text. A speech recognition system analyzes a person's voice in order to recognize what was said. Such systems are of two types: speaker-dependent systems, which require training on a particular speaker's voice, and speaker-independent systems, which do not.

2. Related works

In 1995, speech recognition using neural networks was proposed by Joe Tebelskis, who examined how artificial neural networks can benefit a large-vocabulary, speaker-independent, continuous speech recognition system. He explored two different ways to use neural networks for acoustic modeling: prediction and classification. He found that predictive networks yield poor results because of a lack of discrimination, but classification networks gave excellent results. He also verified that, in accordance with theory, the output activations of a classification network form highly accurate estimates of the posterior class probabilities, and he showed how these can easily be converted to likelihoods for standard HMM recognition algorithms.

In 2003, Chulhee Lee, Donghoon Hyun, Euisun Choi, Jinwook Go, and Chungyong Lee, in their paper "Optimizing feature extraction for speech recognition", proposed a method to minimize the loss of information during the feature extraction stage of speech recognition by optimizing the parameters of the mel-cepstrum transformation, a transform widely used in speech recognition. The mel-cepstrum is obtained from critical-band filters whose characteristics play an important role in converting a speech signal into a sequence of vectors. First, they analyzed the performance of the mel-cepstrum while changing the parameters of the filters, such as shape, center frequency, and bandwidth. Then they proposed an algorithm to optimize the parameters of the filters using the simplex method.

In 1997, K. Ohtsuki, S. Matsunaga, T. Matsuoka, and S. Furui, in their paper "Topic extraction based on continuous speech recognition in broadcast news speech", studied the extraction of topic words from broadcast news using continuous speech recognition, and found that a combination of multiple topic words represents the content of the news. They trained the topic extraction model with five years of newspapers, using the frequency of topic words taken from headlines and of words in articles. The degree of relevance between topic words and words in articles was calculated on the basis of statistical measures, namely mutual information or the χ² value. In topic extraction experiments on recognized broadcast news speech, they extracted five topic words using the χ²-based model and found that 75% of them agreed with topic words chosen by human subjects.

3. An overview of speech recognition

Speech recognition is a technique that converts pulse code modulation (PCM) digital audio from a sound card into recognized speech. The raw PCM signal is a wavy line, much like the output of an oscilloscope. By transforming the PCM digital audio into the frequency domain, the recognizer identifies the frequency components of a sound. The main objective of a speech recognition system is to recognize the speech the user has spoken, so it must identify the phonemes of the spoken words. Unfortunately this is difficult, for the following reasons. The same word sounds different every time it is spoken; users do not generate exactly the same sound for the same phoneme. Also, background noise from the microphone and the user's room sometimes causes the recognizer to hear a different sound than it would if the user were in a quiet room with a high-quality microphone. Various methods are used for speech recognition, such as the fast Fourier transform, training with neural networks, and various statistical techniques. Here, however, we suggest a different approach to speech recognition, based on image processing.
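The frequency-domain step mentioned above can be sketched briefly. The paper's own program is written in MATLAB, but as an illustration, here is a minimal Python/NumPy sketch of moving one PCM frame into the frequency domain with the FFT; the frame is a made-up 200 Hz tone standing in for real speech:

```python
import numpy as np

def frame_spectrum(frame, sample_rate):
    """Return the frequency axis and magnitude spectrum of one PCM frame."""
    windowed = frame * np.hamming(len(frame))   # taper edges to reduce leakage
    spectrum = np.abs(np.fft.rfft(windowed))    # magnitude of one-sided FFT
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, spectrum

# a synthetic 200 Hz tone sampled at 8 kHz stands in for a speech frame
sr = 8000
t = np.arange(512) / sr
tone = np.sin(2 * np.pi * 200 * t)
freqs, spec = frame_spectrum(tone, sr)
print(freqs[np.argmax(spec)])   # spectral peak lands near 200 Hz
```

The dominant frequency component shows up as the largest spectral magnitude, which is exactly the kind of information a recognizer extracts from each frame.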

4. Basic image processing concepts

A digital image is composed of a grid of pixels and stored as an array. A single pixel represents a value of either light intensity or color. Images are processed to obtain information beyond what is visible in the image's initial pixel values.

4.1 Binary Image

A binary image consists of only two values, 0 and 1. This type of image is commonly used as a multiplier to mask out regions within another image.
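As a small illustration of this masking use, a Python/NumPy sketch (the tiny array below is made up): multiplying an image by a binary mask keeps the pixels where the mask is 1 and zeroes out the rest:

```python
import numpy as np

# a small grey-scale "image" with brightness values 0..255
img = np.array([[ 10, 200],
                [150,  30]], dtype=np.uint8)

mask = img > 100            # binary image: 1 where bright, 0 where dark
masked = img * mask         # multiplying by the mask keeps only the bright region

print(mask.astype(np.uint8))   # [[0 1]
                               #  [1 0]]
print(masked)                  # [[  0 200]
                               #  [150   0]]
```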

4.2 Grey scale Image

A greyscale digital image is an image in which the value of each pixel has a single component: intensity information only. Images of this type, also loosely called black and white, are composed exclusively of shades of grey, varying from black at the lowest intensity to white at the highest. Greyscale images are distinct from one-bit bi-tonal black-and-white images, which have only two colors, black and white; greyscale images have many shades of grey in between. Greyscale images are also called monochromatic, denoting the presence of only one color.

4.3 RGB Image

An RGB image has three dimensions, two of which specify the location of a pixel within the image. The third dimension specifies the color of each pixel and consists of three components: the red, green, and blue color bands. In the RGB color model, a color image can be represented by the intensity function IRGB = (FR, FG, FB), where FR(x, y) is the intensity of pixel (x, y) in the red channel, FG(x, y) is the intensity of pixel (x, y) in the green channel, and FB(x, y) is the intensity of pixel (x, y) in the blue channel. During RGB-to-greyscale conversion, the luminance of the greyscale image is matched to the luminance of the color image. One method is to obtain the values of the red, green, and blue primaries in linear intensity encoding by gamma expansion; then 30% of the red value, 59% of the green value, and 11% of the blue value are added together.
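The weighted sum described above can be sketched in a few lines of Python/NumPy. For simplicity the gamma-expansion step is omitted here and the weights are applied directly to the stored values; the sample pixels are made up:

```python
import numpy as np

def rgb_to_grey(rgb):
    """Weighted sum of the R, G, B bands: 30% red + 59% green + 11% blue."""
    weights = np.array([0.30, 0.59, 0.11])
    return rgb @ weights     # collapses the color dimension to one intensity

# one pure-red, one pure-green, and one white pixel (values in 0..255)
pixels = np.array([[255,   0,   0],
                   [  0, 255,   0],
                   [255, 255, 255]], dtype=float)
print(rgb_to_grey(pixels))   # 76.5, 150.45, 255.0
```

Note that the weights sum to 1.0, so a white pixel maps to full intensity (255) and green contributes the most, matching the eye's greater sensitivity to green.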

4.4 Histogram

A histogram is a graphical display of data using bars of different heights. It groups numbers into ranges decided by the user, giving a visual impression of the distribution of the data. The distribution is shown by adjacent rectangles over discrete intervals, each with an area equal to the frequency density; the total area of the histogram represents the total number of data points. An image histogram represents the brightness distribution of a digital image in graphical form: the horizontal axis represents the brightness value and the vertical axis represents the number of pixels in the image at each value.
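As a small illustration, a Python/NumPy sketch of an image histogram: the pixel brightness values are grouped into ranges (bins) and the count in each range gives the bar height. The 4x4 image below is made up:

```python
import numpy as np

# 4x4 grey-scale image with brightness values 0..255
img = np.array([[  0,   0,  64,  64],
                [ 64, 128, 128, 128],
                [128, 192, 192, 255],
                [255, 255, 255, 255]], dtype=np.uint8)

# group brightness into 4 equal ranges; the counts are the bar heights
counts, edges = np.histogram(img, bins=4, range=(0, 256))
print(counts)   # [2 3 4 7] -- number of pixels per brightness range
```

The counts sum to the total number of pixels, matching the statement that the histogram's total area represents the amount of data.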

4.5 Correlation coefficients

The correlation coefficient computed from sample data measures the strength and direction of the relationship between two variables. Its magnitude lies between 0 and 1: if there is no relationship between the predicted values and the actual values, the correlation coefficient is 0 or very low, and as the strength of the relationship increases, so does the magnitude of the coefficient, with a perfect fit giving a coefficient of 1.0. Thus the higher the correlation coefficient, the better [9,11]. MATLAB's corr2 function computes the correlation coefficient of two equally sized arrays A and B as

r = Σm Σn (Amn − Ā)(Bmn − B̄) / sqrt( (Σm Σn (Amn − Ā)²) (Σm Σn (Bmn − B̄)²) )

where Ā and B̄ are the means of A and B.
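This computation can be sketched in Python/NumPy as follows; it is a minimal re-implementation of the same formula for illustration, not MATLAB's own code:

```python
import numpy as np

def corr2(a, b):
    """2-D correlation coefficient of two equally sized arrays."""
    a = a - a.mean()                       # subtract the mean of each array
    b = b - b.mean()
    return (a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum())

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(corr2(a, a))        # identical arrays give a perfect fit: 1.0
print(corr2(a, 5 - a))    # a perfectly inverted copy gives -1.0
```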

MATLAB PROGRAM

%speech recognition using correlation method


%write the following command on the command window
%speechrecognition('test.wav')
clc;
clear all;
close all;
%key word
voice=audioread('ok_gogle.wav');
x=voice(:,1);                    % keep only the first channel of the keyword
%input-1
y1=audioread('ok_google.wav');
y1=y1(:,1);                      % first channel
z1=xcorr(x,y1);                  % cross-correlate keyword with input-1
m1=max(z1);                      % peak correlation value
l1=length(z1);
t1=(-((l1-1)/2):((l1-1)/2))';    % lag axis centered at zero
%input-2
y2=audioread('whatsup.wav');
y2=y2(:,1);                      % first channel
z2=xcorr(x,y2);                  % cross-correlate keyword with input-2
m2=max(z2);                      % peak correlation value
l2=length(z2);
t2=(-((l2-1)/2):((l2-1)/2))';    % lag axis centered at zero

%input-3
y3=audioread('hey_there.wav');
y3=y3(:,1);                      % first channel
z3=xcorr(x,y3);                  % cross-correlate keyword with input-3
m3=max(z3);                      % peak correlation value
l3=length(z3);
t3=(-((l3-1)/2):((l3-1)/2))';    % lag axis centered at zero
%input-4
y4=audioread('hello.wav');
y4=y4(:,1);                      % first channel
z4=xcorr(x,y4);                  % cross-correlate keyword with input-4
m4=max(z4);                      % peak correlation value
l4=length(z4);
t4=(-((l4-1)/2):((l4-1)/2))';    % lag axis centered at zero
zmax=max([m1,m2,m3,m4]);                     % common y-limits for the plots
zmin=min([min(z1),min(z2),min(z3),min(z4)]);
% the input whose correlation peak is largest matches the keyword
[~,best]=max([m1,m2,m3,m4]);
names={'OK GOOGLE','WHATS UP','HEY THERE','HELLO'};
fprintf('recognized word: %s\n',names{best});
%test 1
subplot(2,2,1);plot(t1,z1);grid;
title('OK GOOGLE');
axis([min(t1) max(t1) zmin zmax]);
%test 2
subplot(2,2,2);plot(t2,z2);grid;
title('WHATs UP');
axis([min(t2) max(t2) zmin zmax]);
%test 3
subplot(2,2,3);plot(t3,z3);grid;
title('HEY THERE');
axis([min(t3) max(t3) zmin zmax]);
%test 4
subplot(2,2,4);plot(t4,z4);grid;
title('HELLO');
axis([min(t4) max(t4) zmin zmax]);

RESULT

The program plots the cross-correlation of the keyword with each of the four test inputs; the input whose correlation peak is largest is taken as the recognized word.

References

https://www.google.co.in/

https://en.wikipedia.org/

https://www.ieee.org/

https://in.mathworks.com/
