
CHAPTER ONE

1.1. INTRODUCTION

Language is the ability to express one’s thoughts by means of a set of signs (text), gestures, and
sounds. It is a distinctive feature of human beings, who are the only creatures to use such a
system. Speech is the oldest means of communication between people and it is also the most
widely used (Albert A., 2003).
Text-to-speech (TTS) systems were first developed to aid the visually impaired by offering a
computer-generated voice that reads text to the user. A text-to-speech system simply converts
written text into spoken output. Many computer operating systems have included speech
synthesizers since the early 1990s (Arthur C., 2005).
Visually impaired students often find it difficult to read text on a screen, but with a
text-to-speech system they can simply listen while the text is read to them.
Text-to-speech synthesis (TTS) is the automatic conversion of a text into speech that resembles,
as closely as possible, a native speaker of the language reading that text. A text-to-speech
(audio) system is technology that lets a computer speak to you. The TTS system takes text as
input; a computer algorithm called a TTS engine then analyses the text, pre-processes it, and
synthesizes speech using mathematical models. The TTS engine usually produces sound data in
an audio format as its output (Alfred A., 2005).

Speech synthesis is the artificial production of human speech. A computer system used for this
purpose is called a speech synthesizer, and can be implemented
in software or hardware products. A text-to-speech (TTS) system converts normal language text
into speech; other systems render symbolic linguistic representations like phonetic
transcriptions into speech (Allen et al., 1987). The reverse process is speech recognition.

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in
a database. Systems differ in the size of the stored speech units; a system that
stores phones or diphones provides the largest output range, but may lack clarity. For specific
usage domains, the storage of entire words or sentences allows for high-quality output.

Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice
characteristics to create a completely "synthetic" voice output (Rubin et al., 1981).

The quality of a speech synthesizer is judged by its similarity to the human voice and by its
ability to be understood clearly. An intelligible text-to-speech program allows people with visual
impairments or reading disabilities to listen to written words on a home computer. Many
computer operating systems have included speech synthesizers since the early 1990s.

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The
front-end has two major tasks. First, it converts raw text containing symbols like numbers and
abbreviations into the equivalent of written-out words. This process is often called text
normalization, pre-processing, or tokenization. The front-end then assigns phonetic
transcriptions to each word, and divides and marks the text into prosodic units,
like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is
called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and
prosody information together make up the symbolic linguistic representation that is output by the
front-end. The back-end—often referred to as the synthesizer—then converts the symbolic
linguistic representation into sound. In certain systems, this part includes the computation of
the target prosody (pitch contour, phoneme durations), which is then imposed on the output
speech (van Santen et al., 1997).
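
To make this two-part structure concrete, the sketch below (in C#, the language used for this project's program listing in Appendix A) separates a toy front-end, which normalizes text and produces a symbolic representation, from a back-end stub that stands in for waveform generation. All class and method names here are illustrative, not part of any real TTS library.

using System;
using System.Collections.Generic;

// A toy front-end/back-end split mirroring the architecture described above.
class ToyTtsEngine
{
    // Front-end: normalize the raw text, then emit one symbolic unit per word
    // (a real front-end would produce phonetic transcriptions and prosody marks).
    static List<string> FrontEnd(string rawText)
    {
        string normalized = rawText.Replace("Dr.", "Doctor"); // text normalization
        var symbolic = new List<string>();
        foreach (string word in normalized.Split(' '))
            symbolic.Add(word.ToLower());
        return symbolic;
    }

    // Back-end: would convert the symbolic representation into audio;
    // here it only prints what it would synthesize.
    static void BackEnd(List<string> symbolic)
    {
        foreach (string unit in symbolic)
            Console.WriteLine("synthesize: " + unit);
    }

    static void Main()
    {
        BackEnd(FrontEnd("Dr. Smith reads aloud"));
    }
}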

1.2. STATEMENT OF THE PROBLEM

The problem area in speech synthesis is very wide. Text preprocessing alone raises several
problems, such as handling numerals, abbreviations, and acronyms. This system will help address
the difficulties of people with learning disabilities; some people with basic literacy levels
mispronounce words and often get frustrated trying to read through text. People with visual
impairment often find it hard, and unpleasant, to read a text file on a screen. Text to speech
can be a very useful tool for the mildly or moderately visually impaired (Alfred A., 2005).
Even for people with the visual capability to read, the process can often cause too much strain to
be of any use or enjoyment.

The implementation of a Text to Speech Audio system provides an interactive user interface that
allows people with visual impairment to take in all manner of content in a document with comfort
instead of strain.

1.3. JUSTIFICATION OF STUDY

Programs written for the computer system have widely proved to be very reliable in giving out
accurate information and performing complex tasks for the user. The following are the benefits
provided by implementing a text to speech audio system:

1. Conversion of text to speech sounds.

2. It allows the visually impaired to know exactly what they are typing.

3. It aids people with reading disabilities in getting information easily and without stress.

4. It serves as a foundation and guide for other research students.

5. It helps students listen to PDF notes and literature.

1.4. AIM AND OBJECTIVES

The main aim of this project is to design and implement a text to speech application for
visually impaired students.

The objectives of this project are as follows:

 To create an application that converts text to speech so that visually impaired
students know exactly what they are typing and inputting into the computer system.
 Visually impaired students will be assured of what they are typing and will know how
to correct any typographical error in their work.
 The application will provide a platform to aid people with disabilities, especially reading
disabilities, and help them get information easily and without stress.
 The study will serve as a foundation and guide for other research students interested in
researching Text-to-Speech systems.

1.5. SCOPE OF THE STUDY

The project deals with the conversion of text to speech by analyzing and processing the text
using Natural Language Processing (NLP) and then using Digital Signal Processing (DSP)
technology to convert the processed text into a synthesized speech representation. The scope of
this research work covers written text: text can be typed or pasted into the application via the
user interface, after which the user can listen to the written or pasted text at the click of a
button. The scope also permits selecting between a feminine and a masculine voice.

Text could also come from a picture file, word file, video file, PDF file and many more, but this
study mainly covers written text typed or pasted into the application developed.

1.6. RESEARCH METHODOLOGY


The data and information needed for this project work were gathered through:
Personal observation
C# on the .NET platform (the language and environment used to develop the program; see
Chapter Three and Appendix A).

1.7. DEFINITION OF TERMS


TEXT: Writing consisting of characters, symbols or sentences.
SPEECH: The ability to speak or to use vocalization to communicate.
VISUAL BASIC: A high-level programming language used to develop programs.
COMPUTER: An electronic machine that accepts data as input, processes the data and gives out
information as output.
PHONETIC: Relating to the sounds of spoken language.
VISION: The ability to see.
VISUAL IMPAIRMENT: A disability affecting a person's vision.
SOFTWARE: A set of instructions written for the computer to perform.
STUDENT: A person who attends school.

CHAPTER TWO
LITERATURE REVIEW
2.1 BACKGROUND THEORY OF STUDY
2.1.1 WHAT IS TEXT TO SPEECH (SPEECH SYNTHESIS)
Computers do their jobs in three distinct stages called input (where you feed information in,
often with a keyboard or mouse), processing (where the computer responds to your input, say, by
adding up some numbers you typed in or enhancing the colors on a photo you scanned), and
output (where you get to see how the computer has processed your input, typically on a screen or
printed out on paper) (Robert O., 2007).
Text to speech is simply a form of output where a computer or other machine reads words to you
out loud in a real or simulated voice played through a loud speaker. It is also called Speech
synthesis.
Talking computers sound like something out of science fiction, and indeed the most famous
example of speech synthesis is exactly that. In Stanley Kubrick's groundbreaking movie 2001: A
Space Odyssey (based on the novel by Arthur C. Clarke), a computer called HAL famously
chatters away in a humanlike voice.
2.1.2 HOW DOES SPEECH SYNTHESIS WORK
Let us assume you have a paragraph of written text that you want your computer to speak aloud.
How does it turn the written words into ones you can actually hear? There are essentially three
stages involved, which can be referred to as text to words, words to phonemes, and phonemes to
sound (Robert O., 2004).
2.1.3 TEXT TO WORDS
Reading words sounds easy, but if you have ever listened to a young child reading a book that
was just too hard for them, you will know it is not as trivial as it seems. The main problem is
that written text is ambiguous: the same written information can often mean more than one thing,
and usually you have to understand the meaning, or make an educated guess, to read it correctly.
So the initial stage in speech synthesis, generally called pre-processing or normalization, is
all about reducing ambiguity: it narrows down the many different ways you could read a piece of
text into the one that is most appropriate.
Preprocessing involves going through the text and cleaning it up so the computer makes fewer
mistakes when it actually reads the words aloud. Things like numbers, dates, times,
abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned
into words, and that is harder than it sounds. The number 1843 might refer to a quantity of items
("one thousand eight hundred and forty three"), a year or a time ("eighteen forty three"), or a
padlock combination ("one eight four three"), each of which is read out slightly differently.
While humans follow the sense of what's written and figure out the pronunciation that way,
computers generally don't have the power to do that, so they have to use statistical probability
techniques (typically Hidden Markov Models) or neural networks (computer programs structured
like arrays of brain cells that learn to recognize patterns) to arrive at the most likely
pronunciation instead. So if the word "year" occurs in the same sentence as "1843," it might be
reasonable to guess this is a date and pronounce it "eighteen forty three." If there were a
decimal point before the numbers (".843"), they would need to be read differently, as "eight four
three" (Wikipedia).
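
A minimal sketch of this kind of context-dependent number expansion, assuming a crude keyword heuristic in place of the Hidden Markov Models mentioned above, might look like the following in C#. The spoken forms are hard-coded for the "1843" example used in this chapter; everything else is illustrative only.

using System;
using System.Text.RegularExpressions;

class NumberNormalizer
{
    // Expand "1843" differently depending on crude sentence context,
    // standing in for the statistical models a real front-end would use.
    static string Expand(string sentence)
    {
        Match m = Regex.Match(sentence, @"\d+");
        if (!m.Success) return sentence;

        // Decimal point before the digits: read them digit by digit.
        if (m.Index > 0 && sentence[m.Index - 1] == '.')
            return sentence.Remove(m.Index - 1, m.Length + 1)
                           .Insert(m.Index - 1, "point eight four three");

        // The word "year" nearby suggests a date reading.
        string spoken = sentence.Contains("year")
            ? "eighteen forty three"
            : "one thousand eight hundred and forty three"; // plain quantity

        return sentence.Remove(m.Index, m.Length).Insert(m.Index, spoken);
    }

    static void Main()
    {
        Console.WriteLine(Expand("In the year 1843 it rained")); // date reading
        Console.WriteLine(Expand("We sold 1843 units"));         // quantity reading
        Console.WriteLine(Expand("The fraction is .843 today")); // digit by digit
    }
}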
2.1.4 WORDS TO PHONEMES
Having figured out the words that need to be said, the speech synthesizer now has to generate the
speech sounds that make up those words. In theory, this is a simple problem: all the computer
needs is a huge alphabetical list of words and details of how to pronounce each one (much as you
would find in a typical dictionary, where the pronunciation is listed before or after the
definition). For each word, we would need a list of the phonemes that make up its sound.
Phonemes are the atoms of spoken sound, the sound components from which you can make
any spoken word you like. The word cat consists of three phonemes making the sounds /k/ (as in
can), /a/ (as in pad), and /t/ (as in tusk). Rearrange the order of the phonemes and you could
make the words "act" or "tack" (Robert O., 2007).
There are only 26 letters in the English alphabet, but over 40 phonemes. That's because some
letters and letter groups can be read in multiple ways (a, for example, can be read differently, as
in 'pad' or 'paid'), so instead of one phoneme per letter, there are phonemes for all the different
letter sounds. Some languages need more or fewer phonemes than others (typically 20-60)
(Robert O., 2007).

In theory, if a computer has a dictionary of words and phonemes, all it needs to do to read a word
is look it up in the list and then read out the corresponding phonemes, right? In practice, it is
harder than it sounds. As any good actor can demonstrate, a single sentence can be read out in
many different ways according to the meaning of the text, the person speaking, and the emotions
they want to convey (in linguistics, this idea is known as prosody and it is one of the hardest
problems for speech synthesizers to address). Within a sentence, even a single word (like "read")
can be read in multiple ways (as "red"/"reed") because it has multiple meanings. And even
within a word, a given phoneme will sound different according to the phonemes that come before
and after it.
An alternative approach involves breaking written words into their graphemes (written
component units, typically the individual letters or syllables that make up a word) and then
generating the phonemes that correspond to them using a set of simple rules. This is a bit like a
child attempting to read words he or she has never previously encountered (the reading method
called phonics is similar). The advantage of doing that is that the computer can make a
reasonable attempt at reading any word, whether or not it is a real word stored in the dictionary,
a foreign word, or an unusual name or technical term. The disadvantage is that languages such as
English have large numbers of irregular words that are pronounced in a very different way from
how they are written (such as "colonel," which we say as "kernel" and not "coll-o-nell," and
"yacht," which is pronounced "yot" and not "yach-t"), exactly the sorts of words that cause
problems for children learning to read and people with what's known as surface dyslexia (also
called orthographic or visual dyslexia) (Robert O., 2007).
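
The two strategies just described, dictionary lookup for irregular words plus rule-based letter-to-sound conversion as a fallback, can be sketched as follows. The phoneme symbols and the one-letter-one-phoneme fallback are simplified stand-ins for a real letter-to-sound rule set.

using System;
using System.Collections.Generic;

class GraphemeToPhoneme
{
    // Exception dictionary for irregular words ("colonel", "yacht", ...).
    static readonly Dictionary<string, string[]> Lexicon = new Dictionary<string, string[]>
    {
        { "cat",     new[] { "k", "a", "t" } },
        { "colonel", new[] { "k", "er", "n", "ah", "l" } },
        { "yacht",   new[] { "y", "o", "t" } },
    };

    // Naive fallback: one phoneme per letter, standing in for the
    // grapheme-to-phoneme rules applied to words missing from the lexicon.
    static string[] LetterToSound(string word)
    {
        var phonemes = new List<string>();
        foreach (char c in word) phonemes.Add(c.ToString());
        return phonemes.ToArray();
    }

    static string[] Transcribe(string word)
    {
        string[] phones;
        return Lexicon.TryGetValue(word, out phones) ? phones : LetterToSound(word);
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Transcribe("colonel"))); // k er n ah l
        Console.WriteLine(string.Join(" ", Transcribe("blick")));   // fallback letters
    }
}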
2.1.5 PHONEMES TO SOUND
Okay, so now we have converted our text (our sequence of written words) into a list of phonemes
(a sequence of sounds that need speaking). But where do we get the basic phonemes that the
computer reads out loud when it is turning text into speech? There are three different approaches.
One is to use recordings of humans saying the phonemes, another is for the computer to generate
the phonemes itself by generating basic sound frequencies (a bit like a music synthesizer), and a
third approach is to mimic the mechanism of the human voice.
2.1.6 CONCATENATIVE
Speech synthesizers that use recorded human voices have to be preloaded with little snippets of
human sound they can rearrange. In other words, a programmer has to record lots of examples of
a person saying different things, break the spoken sentences into words and the words into
phonemes. If there are enough speech samples, the computer can rearrange the bits in any
number of different ways to create entirely new words and sentences. This type of speech
synthesis is called concatenative (from Latin words that simply mean to link bits together in a
series or chain).
Since it is based on human recordings, concatenation is the most natural-sounding type of speech
synthesis and it is widely used by machines that have only limited things to say (for example,
corporate telephone switchboards). Its main drawback is that it is limited to a single voice (a
single speaker of a single sex) and (generally) a single language (Afolabi S.A., 2010).
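
A toy version of concatenation, assuming the prerecorded units are already available as arrays of audio samples, is sketched below. A real system would store diphones recorded from one speaker, choose among many candidate units, and smooth the joins, all of which this sketch omits.

using System;
using System.Collections.Generic;

class ConcatenativeSynth
{
    // Pretend unit database: each phoneme maps to prerecorded samples.
    static readonly Dictionary<string, short[]> Units = new Dictionary<string, short[]>
    {
        { "k", new short[] { 10, 20, 30 } },
        { "a", new short[] { 40, 50, 60 } },
        { "t", new short[] { 70, 80, 90 } },
    };

    // Concatenation: look each unit up and chain the samples together.
    static short[] Synthesize(IEnumerable<string> phonemes)
    {
        var output = new List<short>();
        foreach (string p in phonemes)
            output.AddRange(Units[p]);   // no smoothing at the joins
        return output.ToArray();
    }

    static void Main()
    {
        short[] wave = Synthesize(new[] { "k", "a", "t" }); // "cat"
        Console.WriteLine(wave.Length + " samples produced");
    }
}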
2.1.7 FORMANT
If you consider that speech is just a pattern of sound that varies in pitch (frequency) and
volume (amplitude), like the noise coming out of a musical instrument, it ought to be possible to
make an electronic device that can generate whatever speech sounds it needs from scratch, like a
music synthesizer. This type of speech synthesis is known as formant, because formants are the
3–5 key (resonant) frequencies of sound that the human vocal apparatus generates and combines
to make the sound of speech or singing. Unlike speech synthesizers that use concatenation,
which are limited to rearranging prerecorded sounds, formant speech synthesizers can say
absolutely anything, even words that don't exist or foreign words they have never encountered.
That makes formant synthesizers a good choice for GPS (satellite navigation) computers, which
need to be able to read out many thousands of different (and often unusual) place names that
would be hard to memorize. In theory, formant synthesizers can easily switch from a male to a
female voice (by roughly doubling the frequency) or to a child's voice (by trebling it), and they
can speak in any language. In practice, concatenation synthesizers now use huge libraries of
sounds so they can say pretty much anything too. A more obvious difference is that
concatenation synthesizers sound much more natural than formant ones, which still tend to sound
relatively artificial and robotic (Hullary C., 2009).
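
To make the idea concrete, the sketch below generates one second of a crude vowel-like tone by summing three fixed formant frequencies. The frequencies are rough textbook values for an /a/-like vowel; real formant synthesizers use resonant filters driven by a glottal source rather than bare sine waves, so this is only a loose illustration.

using System;

class FormantSketch
{
    static void Main()
    {
        const int sampleRate = 16000;
        // Approximate formant frequencies (Hz) for an /a/-like vowel.
        // Doubling them would shift the voice toward a female register,
        // as described above.
        double[] formants = { 700, 1220, 2600 };

        double[] samples = new double[sampleRate]; // one second of audio
        for (int n = 0; n < samples.Length; n++)
        {
            double t = (double)n / sampleRate;
            foreach (double f in formants)
                samples[n] += Math.Sin(2 * Math.PI * f * t) / formants.Length;
        }

        double peak = 0;
        foreach (double s in samples) peak = Math.Max(peak, Math.Abs(s));
        Console.WriteLine("Generated " + samples.Length +
                          " samples; peak amplitude " + peak.ToString("F3"));
    }
}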
2.1.8 ARTICULATORY
The most complex approach to generating sounds is called articulatory synthesis, and it means
making computers speak by modeling the amazingly intricate human vocal apparatus. In theory,
that should give the most realistic and humanlike voice of all three methods. Although numerous
researchers have experimented with mimicking the human voice box, articulatory synthesis is
still by far the least explored method, largely because of its complexity. The most elaborate form
of articulatory synthesis would be to engineer a "talking head" robot with a moving mouth that
produces sound in a similar way to a person by combining mechanical, electrical, and electronic
components, as necessary.
2.2. HISTORY OF SPEECH SYNTHESIS
Artificial speech has been a dream of the humankind for centuries. To understand how the
present systems work and how they have developed to their present form, a historical review
may be useful.

Long before electronic signal processing was invented, there were those who tried to build
machines to create human speech. Some early legends of the existence of "Brazen Heads"
involved Pope Silvester II (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon
(1214–1294). In 1779, the Danish scientist Christian Kratzenstein, working at the Russian
Academy of Sciences, built models of the human vocal tract that could produce the five long
vowel sounds (in International Phonetic Alphabet notation, they are [aː], [eː], [iː], [oː] and [uː]).
This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang
von Kempelen of Pressburg, Hungary, described in a 1791 paper. This machine added models of
the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles
Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1846,
Joseph Faber exhibited the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget.

In the early 1930s, Bell Labs developed the vocoder, which automatically analyzed speech into
its fundamental tone and resonances. From his work on the vocoder, Homer Dudley developed a
keyboard-operated voice synthesizer called the Voder (Voice Demonstrator), which he
exhibited at the 1939 New York World's Fair. The Pattern Playback was built by Dr. Franklin S.
Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950.
There were several different versions of this hardware device, but only one currently survives.
The machine converts pictures of the acoustic patterns of speech, in the form of a spectrogram,
back into sound. Using this device, researchers were able to discover acoustic cues for the
perception of phonetic segments (consonants and vowels) (Allen J., 2007).

The following timeline lists the inventors of text-to-speech machines and the years in which they were developed:

 1769: Austro-Hungarian inventor Wolfgang von Kempelen develops one of the world's
first mechanical speaking machines, which uses bellows and bagpipe components to
produce crude noises similar to a human voice. It is an early example of articulatory
speech synthesis.
 1770s: Around the same time, Danish scientist Christian Kratzenstein, working in Russia,
builds a mechanical version of the human vocal system, using modified organ pipes, that
can speak the five vowels. In 1791, a book is written on the subject titled "Mechanismus
der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine" (Mechanism
of Human Speech with a Description of a Speaking Machine).
 1837: English physicist and prolific inventor Charles Wheatstone, long fascinated by
musical instruments and sound, rediscovers and popularizes an improved version of the
von Kempelen speaking machine.
 1928: Working at Bell Laboratories, American scientist Homer W. Dudley develops an
electronic speech analyzer called the Vocoder (not to be confused with the famous voice-
altering vocoder used in many electronic pop records in the 1970s). Dudley develops the
Vocoder into the Voder, an electronic speech synthesizer operated through a keyboard. A
writer from The New York Times sees the device demonstrated at the 1939 World's Fair
and declares "My God, it talks!".
 1940s: Another American scientist, Frank Cooper of Haskins Laboratories, develops a
system called Pattern Playback that can generate speech sounds from their frequency
spectrum.
 1953: American scientist Walter Lawrence makes PAT (Parametric Artificial Talker), the
first formant synthesizer, which makes speech sounds by combining four, six, and later
eight formant frequencies.
 1958: MIT scientist George Rosen develops a pioneering articulatory synthesizer called
DAVO (Dynamic Analog of the Vocal tract).
 1960s/1970s: Back at Bell Laboratories, Cecil Coker works on better methods of
articulatory synthesis, while Joseph P. Olive develops concatenative synthesis.

 1978: Texas Instruments releases its TMC0281 speech synthesizer chip and launches a
handheld electronic toy called Speak & Spell, which uses crude formant speech synthesis
as a teaching aid.
 1984: The Apple Macintosh computer ships with the built-in MacInTalk speech
synthesizer, widely used in popular songs such as Radiohead's Fitter Happier and
Paranoid Android.
 2001: AT&T introduces Natural Voices, a natural-sounding concatenative speech
synthesizer based on a huge database of sound samples recorded from real people. The
system is widely used in online applications, such as websites that can read emails aloud.
 2011: Apple adds Siri, a voice-powered "intelligent agent," to its iPhone smartphone.
 2014: Microsoft announces Skype Translator, which can automatically translate a spoken
conversation from one language into one of 40 others. The same year, Microsoft
demonstrates Cortana, its own version of Siri.
 2015: Amazon Echo, a personal assistant featuring voice software called Alexa, goes on
general release.
 2016: Google joins the club by releasing Google Assistant, its answer to Siri and Cortana,
later incorporating it into Google Home.

2.3. E-LEARNING FOR THE VISION IMPAIRED


The Internet is an integral communication tool of the twenty-first century; however, there are
many people with vision impairments who need to learn specific skills in order to take
advantage of this tool. Such people have previously been disadvantaged by inaccessible
learning materials or instructional media that have not been tailored to their specific needs.
Generally, people who have acute vision disabilities find it difficult to obtain suitable
employment. This results in low income and in turn affects their quality of life. Numerous
research projects report low achievement at secondary and tertiary levels for the vision impaired,
and this is often the result of a lack of accessibility of the learning materials in addition to a
lack of knowledge and understanding of disabilities by the teaching staff (Dept Finance, 2010).
Figures produced by the Australian Bureau of Statistics (ABS, 2009) demonstrate that there is a
significant difference in educational achievement and employment income between those
with a disability and those without (Solomon F., 2009).
Vision Australia (2007) reports that 63% of people who are blind or vision impaired and who are
of working age are unemployed. Education is a vital factor in preparing students to develop into
responsible adults who can take their place in the work force. It is therefore important that those
with vision impairments are able to access as complete an education as possible so that they can
gain useful employment and participate constructively in society.
For all students, vision is the primary sense necessary for successful learning and development
(Kelly et al., 2000). One of the main difficulties caused by visual impairment is the problem of
access to information, and with the developing use of technology this difficulty is increasing
(Armstrong et al., 2010). Without vision, students and teachers use speech to a much greater
extent and a virtual classroom is needed to supplement the physical classroom and laboratory
setting.
2.3.1. PROBLEMS FACED BY VISION IMPAIRED STUDENTS
Some of the most common problems faced by students with acute vision impairment include
inaccessibility of Web sites, inaccessibility of learning materials and different learning needs due
to their disabilities. One of the most prominent problems is that e-learning IT courses are not
specifically designed for vision impaired students. The guidelines for Web accessibility for the
vision impaired are not specific enough for the effective design of learning materials for the
vision impaired. There is also a misalignment of guidelines for the development of accessible
teaching and learning materials and Web accessibility standards and guidelines. Additional
teaching aids created specifically for vision impaired students are necessary to ensure the
students understand the concepts being taught (Olamide S., 2013).
The second problem is that e-learning models are commonly designed for sighted students and
do not incorporate considerations for students with disabilities, particularly vision disabilities.
Learning outcomes commonly assume that all students are sighted and vision impaired students
are expected to attain the same learning outcomes to succeed in the course. More specific and
broader communications are required in an e-learning environment for the vision impaired.
Without vision, students and teachers use speech to a much greater extent, and a virtual classroom
is needed to supplement the physical classroom and laboratory setting. There are major
differences between the needs of vision impaired students and sighted students. Sighted students
are able to access images, diagrams and tables and easily interpret these, whereas vision impaired
students are not able to access these at all. E-learning materials are not frequently designed to
integrate with the range of assistive technologies used, resulting in vision impaired students
receiving incomplete or inaccurate translations or, at worst, no accessibility at all (Olamide S.,
2013).
A further problem is that vision impaired students are often isolated by their disability and e-
learning models seldom include considerations of social elements. Vision impaired students need
confidence building through the sharing of knowledge and skills. Means of communication on
issues including assistive technologies, the technology of the learning environment, learning
matters, accessibility and general matters need to be part of the learning environment. Students
with a vision disability readily share their knowledge so that the group achieves the learning
outcomes, not just the individual. Therefore IT is important, so that students have a ready means
of communicating their knowledge within the group (Olamide S., 2013).
A final problem to be discussed here is that teachers seldom understand the needs of vision
impaired students and the barriers to learning these students face. Teachers need to know how to
solve learning problems that relate to vision disabilities; they need to understand not only
assistive technologies, but also how to work around inaccessible features of the curriculum and
the learning environment.
An understanding of the needs of vision impaired students is an essential component when
designing an effective and accessible e-learning environment. However, there are a number of
overlapping areas which also need to be investigated and incorporated into a model in order to
generate a more holistic approach, and these are reflected in the model described later (Olamide
S., 2013).
2.4 RELATED WORK

1. (2022) Environment Aware Text-to-Speech Synthesis — arXiv:2110.03887v4 [eess.AS].
Objective: to design an environment-aware text-to-speech (TTS) system that can generate speech
to suit specific acoustic environments. Method: acoustic environment modelling. Challenge: the
system is not as good as the baseline system at generating speech for the seen
speaker-environment combinations.

2. (2022) Review Paper on Speech to Text Using Machine Learning — Journal of Emerging
Technologies and Innovative Research (JETIR). Objective: a study of the algorithms and
techniques that help classify both facial expression and music. Method: data modelling.
Challenge: the models require a large amount of hardware resources and are mostly trained in
English.

3. (2022) Text to Speech Conversion using Python — International Journal of Advanced Research
in Science, Communication and Technology (IJARSCT). Objective: to help people with virtual
learning disabilities and introduce people to new technology for their day-to-day life.
Method: Google Text to Speech API. Challenge: the words to be spoken are predefined, so the
system is not performing true text-to-speech conversion.

4. (2019) Android-Based Mobile Text-to-Speech Enabled Malaria Diagnosis System — IEEE-SEM.
Objective: the design and implementation of a text-to-speech enabled medical diagnosis system
capable of reading out the diagnosis process and report to the user in English text/speech or
its translation from English into Hausa, Igbo and Yoruba speech, depending on the language
option selected by the user. Method: Optical Character Recognition (OCR) and a speech
synthesizer. Challenge: the words to be spoken are predefined, so the system is not performing
true text-to-speech conversion.

5. (2019) Development of an Intelligent Text-to-Speech Model for Visually Impaired Students
Using Optical Character Reader — COOU Journal of Physical Sciences. Objective: to develop a
very simple and interactive text-to-speech model using optical character recognition that can
scan typed text, voice it out and save it for repeat play. Method: OOADM models. Challenge:
such systems do not replicate natural human speech.

6. (2018) Image to Speech Conversion Using Digital Image Processing — International Journal of
Advanced Research in Engineering and Technology. Objective: to create an application that
recognizes the text characters in any natural image and converts them into a speech signal.
Method: MATLAB. Challenge: the systems currently in existence either have a limited scope or
require heavy investment.

7. (2017) Implementation of Text and Pictures to Speech Conversion Victimisation OCR —
International Journal for Research Trends and Innovation. Objective: to acknowledge text and
pictures, review character recognition with speech synthesis technology, and develop a
cost-effective, user-friendly image-to-speech conversion system using MATLAB. Method:
MATLAB. Challenge: only one character can be converted into text at once.

8. (2016) Text to Speech Conversion — Indian Journal of Science and Technology. Objective:
describes the design, implementation and experimental results of the device. Method: Optical
Character Recognition (OCR) and a Text to Speech Synthesizer (TTS). Challenge: text
extraction from color images is a challenging task in computer vision.

9. (2015) Voice Recognition System: Speech-to-Text — Journal of Applied and Fundamental
Sciences. Objective: software that lets the user control computer functions and dictate text by
voice. Method: Hidden Markov Model (HMM). Challenge: background noise.

10. (2015) Review on Text-To-Speech Synthesizer — International Journal of Advanced Research
in Computer and Communication Engineering. Objective: to give an overview of speech synthesis
in Indian languages and to summarize and compare the characteristics of the various synthesis
techniques used. Method: concatenative synthesis. Challenge: the intelligibility and
comprehensibility of synthetic speech have not reached an acceptable level.

11. (2015) Text To Speech Conversion Using Different Speech Synthesis — International Journal
of Scientific & Technology Research. Objective: a TTS system mainly intended for visually
impaired and handicapped people. Method: digital signal processing. Challenge: the output
speech has discontinuities between phoneme transitions.

12. (2014) A Comparative Study of Arabic Text-to-Speech Synthesis Systems — I.J. Information
Engineering and Electronic Business. Objective: reports an empirical study that systematically
compares two screen readers, NonVisual Desktop Access (NVDA) and IBSAR. Method: natural
language processing (NLP). Challenge: only ten visually impaired or blind undergraduate
students, all native Arabic speakers, took part, and the experiment lasted approximately 30
minutes.

13. (2014) Implementation of Text to Speech Conversion — International Journal of Engineering
Research & Technology (IJERT). Objective: to study optical character recognition with speech
synthesis technology and to develop a cost-effective, user-friendly image-to-speech conversion
system using MATLAB. Method: MATLAB. Challenge: the development of intonation, accent and
pronunciation, as well as the ability to interpret the context in which a word is used in order
to pronounce it correctly.

14. (2014) Text-To-Speech Synthesis (TTS) — International Journal of Research in Information
Technology (IJRIT). Objective: to develop a text-to-speech synthesizer for physically impaired
and vocally disturbed individuals using the English language. Method: Object Oriented Analysis
and Development Methodology (OOADM). Challenge: the intelligibility and comprehensibility of
synthetic speech have not reached an acceptable level.

15. (2013) Isolated Speech Recognition using MFCC and DTW — Journal of Applied and Fundamental
Sciences. Objective: software that lets the user control computer functions and dictate text by
voice. Method: Dynamic Time Warping (DTW). Challenge: a lot of data is required.

16. (2013) Text to Speech Conversion Using FLITE Algorithm — International Journal of Science
and Research (IJSR). Objective: to convert normal language text into speech. Method: FLITE
algorithm. Challenge: a system that stores phones or diphones provides the largest output range
but gives low clarity.

17. (2013) A Survey on Text-To-Speech Translation of Multi-Language — International Journal of
Recent Advances in Engineering & Technology (IJRAET). Objective: gives an overall idea of the
different approaches taken to convert text into speech, with their respective tools and
techniques. Method: natural language processing. Challenge: the conversion system for a
language depends on the linguistic structure of that language.

18. (2012) Text to Speech: A Simple Tutorial — International Journal of Soft Computing and
Engineering (IJSCE). Objective: gives a clear and simple step-by-step overview of the working of
a text-to-speech (TTS) system. Method: acoustic processing. Challenge: synthesizers must
continue to improve prosodic phrasing, the quality of speech, voice, emotions and expressiveness,
and to simplify the conversion process so as to avoid complexity in the program.

19. (2012) Speech Recognition Systems — Twinkle Sahu, CSE 6th semester. Objective: focused on
the implementation of a speech recognition system using various methods. Method: Markov Model.
Challenge: vocabulary size and confusability.

20. (2011) Speech Recognition — seminar topic. Objective: deals with speech recognition, which
could make a revolution in the years to come. Method: speech analyzer. Challenge: isolated,
discontinuous, or continuous speech.

CHAPTER THREE
METHODOLOGY
This chapter provides an overview of the text-to-speech conversion system developed. The
program was developed on the .NET platform, specifically in the C# language; the methods used to
develop it are described below.

3.1 USING SYSTEM.SPEECH.SYNTHESIS

This namespace provides the library code used to configure and generate speech, as illustrated
below.
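
A minimal example of the namespace in use is shown below; it mirrors the calls made in the program listing in Appendix A (System.Speech is the Windows-only .NET speech library, and the project must reference System.Speech.dll).

using System.Speech.Synthesis;

class SpeakDemo
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice(); // speak through the speakers
            synth.Speak("Hello, this is the text to speech system.");
        }
    }
}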

3.2 USING SYSTEM.IO

This namespace provides the library code that allows the user to import documents from, and
save documents to, the computer's file system; it gives the program access to the file explorer,
as illustrated below.
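
The two calls from this namespace that the program relies on are sketched below; the file name is illustrative.

using System.IO;

class FileDemo
{
    static void Main()
    {
        // Save the contents of the text box to a document...
        File.WriteAllText("note.txt", "Text typed by the user");

        // ...and load it back later so it can be read aloud.
        string text = File.ReadAllText("note.txt");
        System.Console.WriteLine(text);
    }
}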

3.3 VOLUME AND SPEED

Volume and Speed are track bars used to increase and decrease the volume and speed of the
reading voice. They were added to the program by dragging the tool from the toolbox onto the
user interface layout and customizing it (resizing it and changing its properties) for the
user's convenience.

3.4 CHANGE VOICE

Change Voice is a combo box used to select the kind of voice the user wants, male or female. It
was added to the program by dragging the tool from the toolbox onto the user interface layout
and customizing it (resizing it and inserting the selectable values) for the user's
convenience. A sketch of how these controls drive the synthesizer is shown below.
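
The combo box and the two track bars feed three properties of the synthesizer. A condensed sketch of how the Play handler wires them up, mirroring Appendix A but with hard-coded stand-ins for the control values, is:

using System.Speech.Synthesis;

class VoiceSettingsDemo
{
    static void Main()
    {
        var synth = new SpeechSynthesizer();

        // Combo box selection: index 0 = male voice, index 1 = female voice.
        int selectedIndex = 1;
        synth.SelectVoiceByHints(selectedIndex == 0 ? VoiceGender.Male
                                                    : VoiceGender.Female);

        synth.Volume = 80; // volume track bar value, 0-100
        synth.Rate = 2;    // speed track bar value, -10 (slow) to 10 (fast)

        synth.Speak("Reading with the chosen voice, volume and speed.");
    }
}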
3.5 PLAY

Play is a button used to play (read aloud) the text imported or typed into the text box, after
the user has selected the voice, volume and speed. It was added to the program by dragging the
tool from the toolbox onto the user interface layout and customizing it (resizing it and
changing its color, font style and size) for the user's convenience.

3.6 PAUSE

Pause is a button used to suspend the program from reading the text when clicked. It was added
to the program by dragging the tool from the toolbox onto the user interface layout and
customizing it (resizing it and changing its color, font style and size) for the user's
convenience.

3.7 RESUME

Resume is a button used to resume the reading when clicked. It was added to the program by
dragging the tool from the toolbox onto the user interface layout and customizing it (resizing
it and changing its color, font style and size) for the user's convenience. The Pause and Resume
handlers are sketched below.

3.8 SAVE

Save is a button used to save the typed or imported text to the computer's documents so it can
be accessed another time. It was added to the program by dragging the tool from the toolbox onto
the user interface layout and customizing it (resizing it and changing its color, font style and
size) for the user's convenience.

3.9 RECORD

Record is a button used to record the reading voice into an audio file (a WAVE file) in the
computer's library, so the user can access it at any time without loading the program; the file
can be played by any media software such as VLC or Windows Media Player. It was added to the
program by dragging the tool from the toolbox onto the user interface layout and customizing it
(resizing it and changing its color, font style and size) for the user's convenience. The
recording logic is sketched below.
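
Recording works by redirecting the synthesizer's output into a WAV file before speaking and then restoring the audio device, exactly as in the Appendix A listing; a condensed sketch with an illustrative file name is:

using System.Speech.Synthesis;

class RecordDemo
{
    static void Main()
    {
        var synth = new SpeechSynthesizer();

        synth.SetOutputToWaveFile("reading.wav");  // redirect speech into a WAV file
        synth.Speak("This sentence is saved for playback in any media player.");
        synth.SetOutputToDefaultAudioDevice();     // restore normal speaker output
    }
}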

3.10 OPEN

Open is a button used to open a document (text file) into the text box so it can be read to the
user. It was added to the program by dragging the tool from the toolbox onto the user interface
layout and customizing it (resizing it and changing its color, font style and size) for the
user's convenience.

3.11 LOGOUT

Logout is a button used to close or exit the application. It was added to the program by
dragging the tool from the toolbox onto the user interface layout and customizing it (resizing
it and changing its color, font style and size) for the user's convenience.

3.12 TEXT BOX

The text box is a rich text box that allows the user to type text; it collects the typed or
opened text or document in the program. It was added to the program by dragging the tool from
the toolbox onto the user interface layout and customizing it (resizing it and changing its font
style and size) for the user's convenience.

3.13 NOVA SPEECH AI

Nova Speech AI is a label used to display the name or title of the program. It was added to the
program by dragging the tool from the toolbox onto the user interface layout and customizing it
(resizing it and changing its color, font style and size) for the user's convenience.

3.14 ICON

Icon is a picture box used to display the logo and the minimize and close signs in the program.
It was added to the program by dragging the tool from the toolbox onto the user interface layout
and customizing it (importing the picture and resizing it) for the user's convenience.

CHAPTER FOUR

SYSTEM DESIGN

This chapter begins with the functional specification of what the new system is to achieve and
ends with the detailed specification from which the program can be produced. The proposed system
intends to achieve the following automated functions.

4.1 OUTPUT DESIGN


The design of the output is the first task to be performed, as it dictates both the requirements
for input and the files in the system. The output design is sub-sectioned into three parts:
a) Report to be generated
b) Screen form of the report
c) File used to produce the report

A. REPORT TO BE GENERATED
 text report

B. SCREEN FORM OF REPORT

Fig 4.1: A figure showing the screen form of the text report

C. FILE USED TO PRODUCE REPORT

The file used to produce the report includes:

 text report (text.vb)

4.2 INPUT DESIGN

This entails the design of the forms that were used for the generation of the output. Input
design is sub-sectioned into three parts, namely:

a) List of input items required


b) Data capture screen form for input
c) File used to retain input

A) LIST OF INPUT ITEMS REQUIRED

1. Play
2. Pause
3. Record

(B) DATA CAPTURE SCREEN FORM FOR INPUT

Fig 4.2: A figure showing the screen form of the file structure

(C) FILES USED TO RETAIN INPUT

The files used to retain input are as follows:

 Text.vb

4.3 PROCESS DESIGN

These can be categorized as follows:

(a) List of programming activities


(b) Identification of program module to be developed
(c) Visual table of content (VTOC)

(A) LIST OF PROGRAMMING ACTIVITIES

The programming activities involved in this project are the design of the program modules and
forms used in carrying out the activities of the Text to Speech Audio System.

(B) IDENTIFICATION OF PROGRAM MODULES TO BE DEVELOPED


Below is a list of program modules to be developed:

 Text module

(C) VISUAL TABLE OF CONTENT (VTOC)

This is a structural chart showing each of the modules and sub-modules used in the developed
program; the VTOC is shown below.

[VTOC chart: Program Splash (NOVA SPEECH AI) → Text to Speech System → Dashboard → Pause,
Resume, Open, Save, Record]
Figure 4.3: Visual Table of Content Chart

4.4 STORAGE DESIGN

These are the devices used to store the tables created in the database; they are described as
follows:

(A) DESCRIPTION OF DATABASE USED

The database used for the storage of this new system is Microsoft Structured Query Language
(SQL) Server; it was linked with Visual Studio 2012, the software used in the design of the Text
to Speech Audio System.

(B) DESCRIPTION OF FILES USED


The description of the file used is as follows:

 DOCUMENT INFORMATION FILE

The document information file (text.vb) is designed to store document details.

(C) RECORD STRUCTURE OF ALL FILES USED


Name of file: text
Database name: db.text
Input medium: keyboard, mouse
Output medium: printer, monitor
Storage medium: hard disk

FIELD NAME    DATA TYPE    SIZE
Title         Text         100
Text          Text         MAX

Table 4.1: A table showing document information details

4.5 DESIGN SUMMARY


In the course of input and output design, hardware and software components of the system are
involved in the proper execution of the Text to Speech Audio System. The design summary involves
two parts:
(a) System Flowchart
(b) HIPO Chart

(A) SYSTEM FLOWCHART


This is a graphical representation showing the processing procedure and how it is arranged, from
input through to the details of how the processing is to be achieved.

[System flowchart: Start → Program Splash (NOVA SPEECH AI) → Dashboard offering seven options
(1. Play, 2. Pause, 3. Resume, 4. Save, 5. Open, 6. Record, 7. Logout). Each option, when
selected, triggers its corresponding process (play, pause, resume, save, open, record to
storage); selecting Logout ends the program → Stop.]

(B) HIPO (HIERARCHY PLUS INPUT-PROCESS-OUTPUT) CHART

This is a tool for program design and documentation. It includes overview diagrams, which are
interwoven to form a HIPO chart. The diagram below gives a detailed description.

INPUT            PROCESS    OUTPUT
Play             Process    Text to Speech
Pause            Process    Text to Speech
Record           Process    Text to Speech
Save Document    Process    Text to Speech

CHAPTER FIVE

5.1 SUMMARY

Language is the ability to express one’s thoughts by means of a set of signs (text), gestures, and
sounds. It is a distinctive feature of human beings, who are the only creatures to use such a
system. Speech is the oldest means of communication between people and it is also the most
widely used.

Visually impaired students often find it difficult to read text on a screen, but with a
text-to-speech system, they can easily listen while the text is read to them.
Text-to-speech synthesis (TTS) is the automatic conversion of a text into speech that resembles,
as closely as possible, a native speaker of the language reading that text.

Text to speech is simply a form of output where a computer or other machine reads words to you
out loud in a real or simulated voice played through a loudspeaker.

This project also allows the user to switch the reader's voice between the two genders, to
increase the speed of the reader, and to record the reading to an audio file (a WAVE file) that
can be played over and over again at the user's own convenient time.

5.2 CONCLUSION

The challenge that led to this project work is that people with visual impairment find it
difficult to read what is on the screen and often find it difficult to type on the computer
system without errors. There is therefore a need for a text to speech application that supports
the visually impaired in typing and reading from the screen, enabling them to produce standard
documentation and giving them easy access to the information on the screen.

5.3 RECOMMENDATION

It is recommended that any student interested in this project work on a speech to text audio
system, which converts speech to text and would enable people with total blindness or disabled
arms to communicate with their electronic devices through their voice.

REFERENCES

1) Allen, Jonathan; Hunnicutt, M. Sharon; Klatt, Dennis (1987). From Text to Speech: The MITalk
System. Cambridge University Press. ISBN 978-0-521-30641-6.
2) Rubin, P.; Baer, T.; Mermelstein, P. (1981). "An articulatory synthesizer for perceptual
research". Journal of the Acoustical Society of America, 70(2): 321–328.
Bibcode:1981ASAJ...70..321R. doi:10.1121/1.386780.
3) van Santen, Jan P. H.; Sproat, Richard W.; Olive, Joseph P.; Hirschberg, Julia (1997).
Progress in Speech Synthesis. Springer. ISBN 978-0-387-94701-3.
4) Van Santen, J. (April 1994). "Assignment of segmental duration in text-to-speech synthesis".
Computer Speech & Language, 8(2): 95–128. doi:10.1006/csla.1994.1005.
5) Abd Rakaq (2012). Supported eText: Assistive technology through text transformations. Reading
Research Quarterly, 42(1), 153–160. doi:10.1598/RRQ.42.1.8.
6) Adebola A. (2019). Assistive benefits of e-learning in improving capabilities of students
with disabilities. Retrieved September 05, 2019 from https://ICTs-with-cover.pdf
7) Albert A. (2003). Language as a medium for communication. Journal of Language in Our Society,
5(29), 185–192. Retrieved December 05, 2017 from
https://pdfs.semanticscholar.org/314d/9354dc778aa95432a275641abbde4c4y0b23a17.pdf
8) Arthur C. (2005). Text to Speech Machine: a case study of visually impaired people in Texas,
US. Retrieved October 07, 2005 from
https://Understanding-text-to-speech348489375930034x45678656-y655545.pdf
9) Kelly et al. (2000). E-learning Teaching Students with Visual Impairments in Inclusive
Classrooms: A Case Study of One Secondary School in Tanzania. University of Oslo. Retrieved
December 05, 2017.
10) Solomon F. (2009). The Case Research Strategy in Studies of Information on E-learning for
the Visually Impaired, 11(3), pp. 369–38.
11) Olamide S. (2013). Assistive technology for students with visual disabilities. Retrieved
December 05, 2013 from
https://Exploring%20Text-toSpeech%20Readers%20for%20Students%20with%20Disabilities.pdf
12) Olamide S. (2013). Challenges Faced by Learners with Visual Impairments. Journal of
Education and Practice, 5(29), 185–192. Retrieved December 05, 2017 from
https://pdfs.semanticscholar.org/314d/9354dc778aa95432a21abe7bde4c40gfsb23a17.pdf

APPENDICES

APPENDIX A

PROGRAM LISTING

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.IO;
using System.Speech.Synthesis;

namespace final_tts
{
    public partial class Form2 : Form
    {
        SpeechSynthesizer sped;

        public Form2()
        {
            InitializeComponent();
            sped = new SpeechSynthesizer();
            // Highlight each word in the text box as it is spoken.
            sped.SpeakProgress += new EventHandler<SpeakProgressEventArgs>(speek);
        }

        private void speek(object sender, SpeakProgressEventArgs e)
        {
            richTextBox1.HideSelection = false;
            int textposition = e.CharacterPosition;
            richTextBox1.Find(e.Text, textposition, RichTextBoxFinds.WholeWord);
        }

        private void Form2_Load(object sender, EventArgs e)
        {
        }

        // Play button: apply the selected voice, volume and rate, then speak.
        private void button13_Click(object sender, EventArgs e)
        {
            switch (comboBox1.SelectedIndex)
            {
                case 0:
                    sped.SelectVoiceByHints(VoiceGender.Male);
                    break;
                case 1:
                    sped.SelectVoiceByHints(VoiceGender.Female);
                    break;
                default:
                    break;
            }
            sped.Volume = trackBar1.Value;   // volume track bar (0-100)
            sped.Rate = trackBar2.Value;     // speed track bar (-10 to 10)
            sped.SpeakAsync(richTextBox1.Text);
        }

        // Pause button: only valid while the synthesizer is speaking.
        private void button12_Click(object sender, EventArgs e)
        {
            if (sped.State == SynthesizerState.Speaking)
            {
                sped.Pause();
            }
        }

        // Resume button: only valid while the synthesizer is paused.
        private void button11_Click(object sender, EventArgs e)
        {
            if (sped.State == SynthesizerState.Paused)
            {
                sped.Resume();
            }
        }

        // Save button: write the text box contents to a .txt file.
        private void button9_Click(object sender, EventArgs e)
        {
            if (richTextBox1.Text != "")
            {
                saveFileDialog1.Filter = "Text Files (*.txt)|*.txt";
                if (saveFileDialog1.ShowDialog() == DialogResult.OK)
                {
                    File.WriteAllText(saveFileDialog1.FileName, richTextBox1.Text);
                }
            }
            else
            {
                MessageBox.Show("BOX IS EMPTY PLEASE TYPE SOMETHING");
            }
        }

        // Record button: redirect speech output into a WAV file, speak,
        // then restore the default audio device.
        private void button1_Click(object sender, EventArgs e)
        {
            SpeechSynthesizer sa = new SpeechSynthesizer();
            sa.Rate = trackBar2.Value;
            sa.Volume = trackBar1.Value;
            SaveFileDialog sav = new SaveFileDialog();
            sav.Filter = "Wave Files|*.wav";
            sav.ShowDialog();
            sa.SetOutputToWaveFile(sav.FileName);
            sa.Speak(richTextBox1.Text);
            sa.SetOutputToDefaultAudioDevice();
            MessageBox.Show("RECORDING...... COMPLETED", "TEXT 2 SPEECH");
        }

        // Open button: load a text file into the text box.
        private void button4_Click(object sender, EventArgs e)
        {
            OpenFileDialog Openfiledialog = new OpenFileDialog();
            if (Openfiledialog.ShowDialog() == DialogResult.OK)
            {
                string s = File.ReadAllText(Openfiledialog.FileName);
                richTextBox1.Text = s;
            }
        }

        private void comboBox1_SelectedIndexChanged(object sender, EventArgs e)
        {
        }

        private void panel4_Paint(object sender, PaintEventArgs e)
        {
        }

        private void trackBar2_Scroll(object sender, EventArgs e)
        {
        }

        // Close icon.
        private void pictureBox1_Click(object sender, EventArgs e)
        {
            Close();
        }

        // Minimize icon.
        private void pictureBox2_Click(object sender, EventArgs e)
        {
            this.WindowState = FormWindowState.Minimized;
        }

        // Logout button: exit the application.
        private void button2_Click(object sender, EventArgs e)
        {
            Close();
        }
    }
}

APPENDIX B

PROGRAM FLOW CHART

[Program flowchart: Start → Program Splash (NOVA SPEECH AI) → Dashboard offering seven options
(1. Play, 2. Pause, 3. Resume, 4. Save, 5. Open, 6. Record, 7. Logout). Each option, when
selected, triggers its corresponding process (play, pause, resume, save, open, record to
storage); selecting Logout ends the program → Stop.]

APPENDIX C

SAMPLE OUTPUT

[Figure showing the screen form of the text report]
