
TEXT TO SPEECH CONVERTER

1.INTRODUCTION

A Text-To-Speech (TTS) synthesizer is a computer-based system that should be able to read any
text aloud, whether it was directly introduced in the computer by an operator or scanned and
submitted to an Optical Character Recognition (OCR) system. Let us try to be clear. There is a
fundamental difference between the system we are about to discuss here and any other talking
machine (such as a cassette player, for example) in the sense that we are interested in the
automatic production of new sentences. This definition still needs some refinement. Systems that
simply concatenate isolated words or parts of sentences, denoted as Voice Response Systems, are
only applicable when a limited vocabulary is required (typically a few hundred words), and when
the sentences to be pronounced respect a very restricted structure, as is the case for the
announcement of arrivals in train stations, for instance. In the context of TTS synthesis, it is
impossible (and luckily useless) to record and store all the words of the language. It is thus more
suitable to define Text-To-Speech as the automatic production of speech, through a grapheme-to-
phoneme transcription of the sentences to utter.

At first sight, this task does not look too hard to perform. After all, is not the human being
potentially able to correctly pronounce an unknown sentence, even from his childhood? We all
have, mainly unconsciously, a deep knowledge of the reading rules of our mother tongue. They
were transmitted to us, in a simplified form, at primary school, and we improved them year after
year. However, it would be a bold claim indeed to say that it is only a short step before the
computer is likely to equal the human being in that respect. Despite the present state of our
knowledge and techniques and the progress recently accomplished in the fields of Signal
Processing and Artificial Intelligence, we would have to express some reservations. As a matter
of fact, the reading process draws from the furthest depths, often unthought of, of the human
intelligence.
1.2 PROJECT OVERVIEW:

Text To Speech Converter is a web-based application for converting text documents into audio
files that can be played with media players, with effective speech for listening. In our project, we
have focused on a "SEMESTER RECOGNITION SYSTEM", which is beneficial for students
because they can learn about their subjects and search for the best books according to their
semester and branch.

In this project we can convert different types of files, such as documents, PDF files and
PowerPoint presentations, into speech automatically, and the output file is generated in WAV
audio format. This project is tested on the semester recognition system, which is used for
converting books into audio files so that they can be used on iPhones.
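
As an illustration only, the sketch below shows one way such a text-to-WAV conversion could be coded in Java, assuming the freely available FreeTTS engine (which is not prescribed by this report); the voice name, output file name and sample sentence are placeholders.

    import javax.sound.sampled.AudioFileFormat;

    import com.sun.speech.freetts.Voice;
    import com.sun.speech.freetts.VoiceManager;
    import com.sun.speech.freetts.audio.SingleFileAudioPlayer;

    public class TextToWavSketch {
        public static void main(String[] args) {
            // "kevin16" is one of the sample voices shipped with FreeTTS.
            Voice voice = VoiceManager.getInstance().getVoice("kevin16");
            voice.allocate();

            // Route the synthesized audio into "semester.wav" instead of the speakers.
            SingleFileAudioPlayer player =
                    new SingleFileAudioPlayer("semester", AudioFileFormat.Type.WAVE);
            voice.setAudioPlayer(player);

            voice.speak("Data Structures is a core subject in the third semester.");

            player.close();      // flushes the buffered audio and writes the WAV file
            voice.deallocate();
        }
    }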
2. SYSTEM ANALYSIS

2.1 FEASIBILITY STUDY:

Feasibility analysis is a measure of how beneficial or practical the development of a
software system will be to an organization. This analysis recurs throughout the development
life cycle. Three types of feasibility are considered in this analysis:

1. Technical Feasibility
2. Operational Feasibility
3. Economic Feasibility

1. TECHNICAL FEASIBILITY:

No extra devices are needed for running the system. It can run on any platform.

2. OPERATIONAL FEASIBILITY:

This system is also operationally feasible. Operational feasibility is a measure of how well a
proposed system solves the problems, and takes advantage of the opportunities identified during
scope definition and how it satisfies the requirements identified in the requirements analysis
phase of system development.

3. ECONOMIC FEASIBILITY:

This system is economically feasible. The project can be completed at low cost; time is spent
mainly on building the dictionary that the TTS system uses to convert text to speech.

EVALUATION:

An evaluation of a TTS system can be performed from several aspects. The main evaluation
issues relevant to an implementation in the area of embedded systems, as is the case in this
thesis, are:

STORAGE PROPERTIES –
 The size of the database needed, which, apart from the amount of information, also
depends on the coding possibilities of the data and the choice of codec.

COMPUTATIONAL COMPLEXITY –
 Number and type of operations per sample needed for synthesizing a text.

USABILITY –
 Investigation of suitable application areas, such as reading e-mails or items in a menu,
and the possibility of extending the TTS system to languages other than English.

SPEECH QUALITY –

 The perceived characteristics of the synthesized speech in terms of intelligibility,
fluidity and naturalness.

IMPLEMENTATION COSTS –

 The time needed to develop the system, including the collection of data, and the cost of
the required hardware devices.

The TTS synthesizer developed in this thesis is designed to suit the defined implementation area.
Since this is a feasibility study of a TTS system for embedded devices, it cannot be evaluated on
all the issues above. The evaluation presented in this chapter refers to the quality of the
synthesized speech and of the chosen solution.

Because of the restricted time limit for this project, an extensive quality evaluation could not be
performed. For an overall quality judgment of a speech synthesis system, a Mean Opinion Score
(MOS) analysis is usually performed, which requires many listeners for a reliable result. Each
listener grades the quality in the three categories (intelligibility, fluidity, naturalness) on the
five-level MOS scale from 1 to 5, where 1 corresponds to bad and 5 to excellent.

The audible evaluations described in this chapter were performed by only one listener. The
quality conclusions are based on relatively clear differences and are assumed to be valid for any
arbitrary listener. Additionally, the judgment of quality can only be relative, not absolute. This
arises from the lack of an original or maximum-quality reference signal, so only a comparison
between different synthetic speech signals can be performed. The intelligibility of the evaluated
speech signals is very good in all cases and is therefore not generally mentioned in the quality
judgments.
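
Purely as an illustration (not part of the evaluation actually carried out), the MOS for each category is simply the arithmetic mean of the listener grades on the 1-5 scale; a minimal sketch with invented ratings:

    import java.util.Arrays;

    public class MosSketch {
        // Mean Opinion Score: average of listener grades (1 = bad ... 5 = excellent).
        static double mos(int[] grades) {
            return Arrays.stream(grades).average().orElse(Double.NaN);
        }

        public static void main(String[] args) {
            int[] intelligibility = {4, 5, 4, 4, 5};   // made-up example ratings
            int[] fluidity        = {3, 4, 3, 4, 3};
            int[] naturalness     = {3, 3, 4, 3, 3};
            System.out.printf("Intelligibility MOS: %.2f%n", mos(intelligibility));
            System.out.printf("Fluidity MOS:        %.2f%n", mos(fluidity));
            System.out.printf("Naturalness MOS:     %.2f%n", mos(naturalness));
        }
    }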

2.2 EXISTING SYSTEM:

 In the existing system, blind people mostly need the help of another person to read a
book. If a blind student asks neighbours or friends to read a book and is sometimes
refused or shouted at, this can affect them mentally and cause further problems.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty
Russian sentences into English. The authors claimed that within three to five years, machine
translation would be a solved problem. However, real progress was much slower, and after
the ALPAC report in 1966, which found that ten years of research had failed to fulfill the
expectations, funding for machine translation was dramatically reduced. Little further research in
machine translation was conducted until the late 1980s, when the first statistical machine
translation systems were developed.

Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural
language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA,
a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and
1966. Using almost no information about human thought or emotion, ELIZA sometimes
provided a startlingly human-like interaction. When the "patient" exceeded the very small
knowledge base, ELIZA might provide a generic response, for example, responding to "My head
hurts" with "Why do you say your head hurts?".

During the 1970s many programmers began to write 'conceptual ontologies', which structured real-
world information into computer-understandable data. Examples are MARGIE (Schank, 1975),
SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM
(Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time,
many chatterbots were written, including PARRY, Racter, and Jabberwacky.

Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting
in the late 1980s, however, there was a revolution in NLP with the introduction of machine
learning algorithms for language processing. This was due both to the steady increase in
computational power resulting from Moore's Law and the gradual lessening of the dominance
of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical
underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning
approach to language processing.[3] Some of the earliest-used machine learning algorithms, such
as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
Increasingly, however, research has focused on statistical models, which make
soft, probabilistic decisions based on attaching real-valued weights to the features making up the
input data. The cache language models upon which many speech recognition systems now rely
are examples of such statistical models. Such models are generally more robust when given
unfamiliar input, especially input that contains errors (as is very common for real-world data),
and produce more reliable results when integrated into a larger system comprising multiple
subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially
to work at IBM Research, where successively more complicated statistical models were
developed. These systems were able to take advantage of existing multilingual textual
corpora that had been produced by the Parliament of Canada and the European Union as a result
of laws calling for the translation of all governmental proceedings into all official languages of
the corresponding systems of government. However, most other systems depended on corpora
specifically developed for the tasks implemented by these systems, which was (and often
continues to be) a major limitation in the success of these systems. As a result, a great deal of
research has gone into methods of more effectively learning from limited amounts of data.

Recent research has increasingly focused on unsupervised and semi-supervised learning
algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the
desired answers, or using a combination of annotated and non-annotated data. Generally, this
task is much more difficult than supervised learning, and typically produces less accurate results
for a given amount of input data. However, there is an enormous amount of non-annotated data
available (including, among other things, the entire content of the World Wide Web), which can
often make up for the inferior results.

 The Java Speech API is an extension to the Java platform.

 Extensions are packages of classes written in the Java programming language (and any
associated native code) that application developers can use to extend the functionality of
the core Java platform.
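
A minimal sketch of speaking text through the Java Speech API is shown below; it assumes that a JSAPI implementation (for example FreeTTS) is installed and registered via speech.properties, and the sample sentence is a placeholder.

    import java.util.Locale;

    import javax.speech.Central;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerModeDesc;

    public class SpeakSketch {
        public static void main(String[] args) throws Exception {
            // Ask the JSAPI Central registry for any English synthesizer.
            Synthesizer synthesizer =
                    Central.createSynthesizer(new SynthesizerModeDesc(Locale.ENGLISH));
            synthesizer.allocate();
            synthesizer.resume();

            // Queue plain text for synthesis and wait until it has been spoken.
            synthesizer.speakPlainText("Welcome to the text to speech converter.", null);
            synthesizer.waitEngineState(Synthesizer.QUEUE_EMPTY);

            synthesizer.deallocate();
        }
    }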

2.3 PROPOSED SYSTEM:

 In the proposed system, we exploit the fact that a wide variety of learners, in school and
at home, use text-to-speech (TTS) technology to gain access to important information.
The goal of this tutorial is to give those who use TTS, or help others use it, background
information about the technology, ideas for how to use it, and information about
acquiring digital text.

 Text-to-speech (TTS) technology on a computer refers to the combination of text
appearing on the computer display and the computer speaking that text aloud with a
digitized or synthesized voice. Digitized speech is a recorded (digitized) human voice
speaking, while a synthesized voice is a computer-generated voice speaking the text.
This tutorial focuses on TTS software tools that use synthesized speech to read any text.

 Speech recognition: Given a sound clip of a person or people speaking, determine the
textual representation of the speech. This is the opposite of text to speech and is one of the
extremely difficult problems colloquially termed "AI-complete" (see above). In natural
speech there are hardly any pauses between successive words, and thus speech
segmentation is a necessary subtask of speech recognition (see below). Note also that in most
spoken languages, the sounds representing successive letters blend into each other in a
process termed coarticulation, so the conversion of the analog signal to discrete characters
can be a very difficult process.
 Speech segmentation: Given a sound clip of a person or people speaking, separate it into
words. A subtask of speech recognition and typically grouped with it.
 Topic segmentation and recognition: Given a chunk of text, separate it into segments
each of which is devoted to a topic, and identify the topic of the segment.
 Word segmentation: Separate a chunk of continuous text into separate words. For a
language like English, this is fairly trivial, since words are usually separated by spaces.
However, some written languages like Indian, Japanese and Thai do not mark word
boundaries in such a fashion, and in those languages text segmentation is a significant task
requiring knowledge of the vocabulary and morphology of words in the language. (A
minimal Java sketch of word and sentence segmentation follows this list.)
 Word sense disambiguation: Many words have more than one meaning; we have to select
the meaning which makes the most sense in context. For this problem, we are typically given
a list of words and associated word senses, e.g. from a dictionary or from an online resource
such as WordNet.
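
As an illustration of the "easy" English case of sentence breaking and word segmentation described above, the following sketch uses the standard java.text.BreakIterator class (the sample sentence is invented):

    import java.text.BreakIterator;
    import java.util.Locale;

    public class SegmentationSketch {
        public static void main(String[] args) {
            String text = "Text to speech is useful. It reads books aloud for students.";

            // Sentence boundaries
            BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            sentences.setText(text);
            for (int start = sentences.first(), end = sentences.next();
                 end != BreakIterator.DONE;
                 start = end, end = sentences.next()) {
                System.out.println("Sentence: " + text.substring(start, end).trim());
            }

            // Word boundaries (whitespace and punctuation tokens are skipped below)
            BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
            words.setText(text);
            for (int start = words.first(), end = words.next();
                 end != BreakIterator.DONE;
                 start = end, end = words.next()) {
                String token = text.substring(start, end).trim();
                if (!token.isEmpty() && Character.isLetterOrDigit(token.charAt(0))) {
                    System.out.println("Word: " + token);
                }
            }
        }
    }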

In some cases, sets of related tasks are grouped into subfields of NLP that are often considered
separately from NLP as a whole. Examples include:

 Information retrieval (IR): This is concerned with storing, searching and retrieving
information. It is a separate field within computer science (closer to databases), but IR relies
on some NLP methods (for example, stemming). Some current research and applications
seek to bridge the gap between IR and NLP.
 Information extraction (IE): This is concerned in general with the extraction of semantic
information from text. This covers tasks such as named entity recognition, coreference
resolution, relationship extraction, etc.
 Speech processing: This covers speech recognition, text-to-speech and related tasks.
The following is a list of some of the most commonly researched tasks in NLP. Note that some
of these tasks have direct real-world applications, while others more commonly serve as subtasks
that are used to aid in solving larger tasks. What distinguishes these tasks from other potential
and actual NLP tasks is not only the volume of research devoted to them but the fact that for
each one there is typically a well-defined problem setting, a standard metric for evaluating the
task, standard corpora on which the task can be evaluated, and competitions devoted to the
specific task.

 Automatic summarization: Produce a readable summary of a chunk of text. Often used to
provide summaries of text of a known type, such as articles in the financial section of a
newspaper.
 Coreference resolution: Given a sentence or larger chunk of text, determine which words
("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example
of this task, and is specifically concerned with matching up pronouns with the nouns or
names that they refer to. For example, in a sentence such as "He entered John's house
through the front door", "the front door" is a referring expression and the bridging
relationship to be identified is the fact that the door being referred to is the front door of
John's house (rather than of some other structure that might also be referred to).
 Discourse analysis: This rubric includes a number of related tasks. One task is identifying
the discourse structure of connected text, i.e. the nature of the discourse relationships
between sentences (e.g. elaboration, explanation, contrast). Another possible task is
recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content
question, statement, assertion, etc.).
 Machine translation: Automatically translate text from one human language to another.
This is one of the most difficult problems, and is a member of a class of problems
colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that
humans possess (grammar, semantics, facts about the real world, etc.) in order to solve
properly.
 Morphological segmentation: Separate words into individual morphemes and identify the
class of the morphemes. The difficulty of this task depends greatly on the complexity of
the morphology (i.e. the structure of words) of the language being considered. English has
fairly simple morphology, especially inflectional morphology, and thus it is often possible to
ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens,
opened, opening") as separate words. In languages such as Turkish, however, such an
approach is not possible, as each dictionary entry has thousands of possible word forms.
 Named entity recognition (NER): Given a stream of text, determine which items in the
text map to proper names, such as people or places, and what the type of each such name is
(e.g. person, location, organization). Note that, although capitalization can aid in recognizing
named entities in languages such as English, this information cannot aid in determining the
type of named entity, and in any case is often inaccurate or insufficient. For example, the
first word of a sentence is also capitalized, and named entities often span several words, only
some of which are capitalized. Furthermore, many other languages in non-Western scripts
(e.g. Indian or Arabic) do not have any capitalization at all, and even languages with
capitalization may not consistently use it to distinguish names. For
example, German capitalizes all nouns, regardless of whether they refer to names,
and French and Spanish do not capitalize names that serve as adjectives.
 Natural language generation: Convert information from computer databases into readable
human language.
 Natural language understanding: Convert chunks of text into more formal representations
such as first-order logic structures that are easier for computer programs to manipulate.
Natural language understanding involves the identification of the intended semantic from the
multiple possible semantics which can be derived from a natural language expression which
usually takes the form of organized notations of natural languages concepts. Introduction and
creation of language metamodel and ontology are efficient however empirical solutions. An
explicit formalization of natural languages semantics without confusions with implicit
assumptions such as closed world assumption (CWA) vs. open world assumption, or
subjective Yes/No vs. objective True/False is expected for the construction of a basis of
semantics formalization.[4]

 Optical character recognition (OCR): Given an image representing printed text,
determine the corresponding text.
 Part-of-speech tagging: Given a sentence, determine the part of speech for each word.
Many words, especially common ones, can serve as multiple parts of speech. For example,
"book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be
a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Note
that some languages have more such ambiguity than others. Languages with little inflectional
morphology, such as English, are particularly prone to such ambiguity. Indian is prone to
such ambiguity because it is a tonal language during verbalization. Such inflection is not
readily conveyed via the entities employed within the orthography to convey intended
meaning.
 Parsing: Determine the parse tree (grammatical analysis) of a given sentence.
The grammar for natural languages is ambiguous and typical sentences have multiple
possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands
of potential parses (most of which will seem completely nonsensical to a human).
 Question answering: Given a human-language question, determine its answer. Typical
questions have a specific right answer (such as "What is the capital of Canada?"), but
sometimes open-ended questions are also considered (such as "What is the meaning of
life?").
 Relationship extraction: Given a chunk of text, identify the relationships among named
entities (e.g. who is the wife of whom).
 Sentence breaking (also known as sentence boundary disambiguation): Given a chunk of
text, find the sentence boundaries. Sentence boundaries are often marked by periods or
other punctuation marks, but these same characters can serve other purposes (e.g.
marking abbreviations).
 Sentiment analysis: Extract subjective information usually from a set of documents, often
using online reviews to determine "polarity" about specific objects. It is especially useful for
identifying trends of public opinion in the social media, for the purpose of marketing.
Advantages

 Accepts trained voice as input.
 Along with the speech output, the corresponding information will be displayed.
 Speech recognition
 Speech segmentation
 Information retrieval
 Information extraction
 Sentence Identification
 Word segmentation
 Word disambiguation
 Speech Processing
 Query Answering
 Sentence Breaking
 Token Management
 Sentiment analysis
 Discourse Analysis
 Machine Translation
 Parsing
 Relationship Extraction
 Automatic Summarization
 Natural Language Generation
 Natural Language Understanding
3. SYSTEM CONFIGURATION

3.1 HARDWARE SPECIFICATION:

 Pentium 4 1.8 GHz processor (Pentium 4 2.4 GHz or greater recommended)
 512 MB RAM on Windows XP (1 GB RAM recommended)
 1 GB RAM on Windows Vista and above (2 GB recommended)
 Sound Card + Speakers
 25MB Free Disk Space

3.2 SOFTWARE SPECIFICATION:

 Operating Systems : Windows 8, 7, Vista, XP
 Language : Java

3.3 ABOUT THE SOFTWARE:

Java Technology:
Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all
of the following buzzwords:

 Simple
 Architecture neutral
 Object oriented
 Portable
 Distributed
 High performance
 Interpreted
 Multithreaded
 Robust
 Dynamic
 Secure

With most programming languages, you either compile or interpret a program so that you
can run it on your computer. The Java programming language is unusual in that a program is
both compiled and interpreted. With the compiler, you first translate a program into an
intermediate language called Java bytecodes, the platform-independent codes interpreted by
the interpreter on the Java platform. The interpreter parses and runs each Java bytecode
instruction on the computer. Compilation happens just once; interpretation occurs each time the
program is executed. The following figure illustrates how this works.

Fig: Java working model
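
As a small concrete illustration (not part of the original figure), the program below is compiled once into bytecode with javac and can then be run by any Java VM with java:

    // HelloWorld.java
    //
    //   javac HelloWorld.java   (compile once: produces HelloWorld.class bytecode)
    //   java HelloWorld         (the Java VM interprets/executes the bytecode)
    public class HelloWorld {
        public static void main(String[] args) {
            System.out.println("Hello from the Java platform");
        }
    }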


You can think of Java bytecodes as the machine code instructions for the Java Virtual
Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser
that can run applets, is an implementation of the Java VM. Java bytecodes help make “write
once, run anywhere” possible. You can compile your program into bytecodes on any platform
that has a Java compiler. The bytecodes can then be run on any implementation of the Java VM.
That means that as long as a computer has a Java VM, the same program written in the Java
programming language can run on Windows 2000, a Solaris workstation, or on an iMac.

Fig: Java platform independence

The Java Platform

A platform is the hardware or software environment in which a program runs. We’ve already
mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS.
Most platforms can be described as a combination of the operating system and hardware. The
Java platform differs from most other platforms in that it’s a software-only platform that runs on
top of other hardware-based platforms.

The Java platform has two components:


 The Java Virtual Machine (Java VM)
 The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported
onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful
capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into
libraries of related classes and interfaces; these libraries are known as packages. The next
section, What Can Java Technology Do?, highlights what functionality some of the packages in
the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows,
the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, after it is compiled, runs on a specific hardware platform. As a
platform-independent environment, the Java platform can be a bit slower than
native code. However, smart compilers, well-tuned interpreters, and just-in-time bytecode
compilers can bring performance close to that of native code without threatening portability.

What Can Java Technology Do?

The most common types of programs written in the Java programming language are applets and
applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet
is a program that adheres to certain conventions that allow it to run within a Java-enabled
browser.

However, the Java programming language is not just for writing cute, entertaining applets for the
Web. The general-purpose, high-level Java programming language is also a powerful software
platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of
application known as a server serves and supports clients on a network. Examples of servers are
Web servers, proxy servers, mail servers, and print servers. Another specialized program is a
servlet. A servlet can almost be thought of as an applet that runs on the server side. Java Servlets
are a popular choice for building interactive web applications, replacing the use of CGI scripts.
Servlets are similar to applets in that they are runtime extensions of applications. Instead of
working in browsers, though, servlets run within Java Web servers, configuring or tailoring the
server.
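
As an illustration only of the servlet model described above (the class and parameter names are invented; in a web-based TTS converter the handler would pass the submitted text to the synthesizer), a minimal servlet might look like this:

    import java.io.IOException;
    import java.io.PrintWriter;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class EchoTextServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            // e.g. /echo?text=hello ; a real TTS servlet would hand this text
            // to the synthesizer and return the generated audio instead.
            String text = request.getParameter("text");
            response.setContentType("text/plain");
            PrintWriter out = response.getWriter();
            out.println(text == null ? "No text supplied" : text);
        }
    }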

How does the API support all these kinds of programs? It does so with packages of software
components that provide a wide range of functionality. Every full implementation of the Java
platform gives you the following features:
 The essentials: Objects, strings, threads, numbers, input and output, data
structures, system properties, date and time, and so on.
 Applets: The set of conventions used by applets.
 Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram
Protocol) sockets, and IP (Internet Protocol) addresses.
 Internationalization: Help for writing programs that can be localized for users
worldwide. Programs can automatically adapt to specific locales and be displayed
in the appropriate language.
 Security: Both low level and high level, including electronic signatures, public
and private key management, access control, and certificates.
 Software components: Known as JavaBeansTM, can plug into existing component
architectures.
 Object serialization: Allows lightweight persistence and communication via
Remote Method Invocation (RMI).
 Java Database Connectivity (JDBCTM): Provides uniform access to a wide
range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration,
telephony, speech, animation, and more. The following figure depicts what is included in the
Java 2 SDK.

How Will Java Technology Change My Life?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language.
Still, it is likely to make your programs better and requires less effort than other languages. We
believe that Java technology will help you do the following:

 Get started quickly: Although the Java programming language is a powerful
object-oriented language, it’s easy to learn, especially for programmers already
familiar with C or C++.
 Write less code: Comparisons of program metrics (class counts, method counts,
and so on) suggest that a program written in the Java programming language can
be four times smaller than the same program in C++.
 Write better code: The Java programming language encourages good coding
practices, and its garbage collection helps you avoid memory leaks. Its object
orientation, its JavaBeans component architecture, and its wide-ranging, easily
extendible API let you reuse other people’s tested code and introduce fewer bugs.
 Develop programs more quickly: Your development time may be as much as
twice as fast versus writing the same program in C++. Why? You write fewer
lines of code and it is a simpler programming language than C++.
 Avoid platform dependencies with 100% Pure Java: You can keep your
program portable by avoiding the use of libraries written in other languages. The
100% Pure JavaTM Product Certification Program has a repository of historical
process manuals, white papers, brochures, and similar materials online.
 Write once, run anywhere: Because 100% Pure Java programs are compiled into
machine-independent bytecodes, they run consistently on any Java platform.
 Distribute software more easily: You can upgrade applets easily from a central
server. Applets take advantage of the feature of allowing new classes to be loaded
“on the fly,” without recompiling the entire program.
JAVA APPLETS
A Java applet is an applet delivered to users in the form of Java bytecode. Java applets can run in
a Web browser using a Java Virtual Machine (JVM), or in Sun's AppletViewer, a stand-alone
tool for testing applets. Java applets were introduced in the first version of the Java language in
1995, and are written in programming languages that compile to Java bytecode, usually in Java,
but also in other languages such as Jython, JRuby, or Eiffel (via SmartEiffel).

Java applets run at speeds comparable to, but generally slower than, other compiled languages
such as C++, but until approximately 2011 many times faster than JavaScript. In addition they
can use 3D hardware acceleration that is available from Java. This makes applets well suited for
non-trivial, computation-intensive visualizations. As browsers have gained support for native
hardware-accelerated graphics in the form of Canvas and WebGL, as well as just-in-time
compiled JavaScript, the speed difference has become less noticeable.

Since Java's bytecode is cross-platform or platform independent, Java applets can be executed by
browsers for many platforms, including Microsoft Windows, Unix, Mac OS and Linux. It is also
trivial to run a Java applet as an application with very little extra code. This has the advantage of
running a Java applet in offline mode without the need for any Internet browser software and
also directly from the integrated development environment (IDE).
Overview
Applets are used to provide interactive features to web applications that cannot be provided by
HTML alone. They can capture mouse input and also have controls like buttons or check boxes.
In response to the user action an applet can change the provided graphic content. This makes
applets well suited for demonstration, visualization and teaching. There are online applet
collections for studying various subjects, from physics to heart physiology. Applets are also used
to create online game collections that allow players to compete against live opponents in real-
time.

Advantages
A Java applet can have any or all of the following advantages:[27]

* It is simple to make it work on Linux, Microsoft Windows and Mac OS X, i.e. to make it
cross-platform. Applets are supported by most web browsers.

* The same applet can work on "all" installed versions of Java at the same time, rather than
just the latest plug-in version only. However, if an applet requires a later version of the Java
Runtime Environment (JRE) the client will be forced to wait during the large download.

* Most web browsers cache applets, so they will be quick to load when returning to a web page.
Applets also improve with use: after a first applet is run, the JVM is already running and starts
quickly (although the JVM will need to restart each time the browser starts afresh).
* It can move the work from the server to the client, making a web solution more scalable with
the number of users/clients.

* If a standalone program (like Google Earth) talks to a web server, that server normally needs
to support all prior versions for users who have not kept their client software updated. In
contrast, a properly configured browser loads (and caches) the latest applet version, so there is no
need to support legacy versions.

* The applet naturally supports the changing user state, such as figure positions on the
chessboard.

* Developers can develop and debug an applet directly, simply by creating a main routine (either
in the applet's class or in a separate class) and calling init() and start() on the applet, thus
allowing for development in their favorite Java SE development environment. All one has to do
after that is re-test the applet in the AppletViewer program or a web browser to ensure it
conforms to security restrictions. (A minimal sketch of such a test harness follows.)
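
A minimal sketch of this debugging approach is shown below; the class is invented for illustration, and hosting the applet in an AWT Frame is one common way to run it outside a browser:

    import java.applet.Applet;
    import java.awt.Frame;
    import java.awt.Graphics;

    public class HelloApplet extends Applet {
        @Override
        public void paint(Graphics g) {
            g.drawString("Hello from the applet", 20, 20);
        }

        // Test harness: host the applet in a plain AWT Frame and drive its
        // lifecycle methods ourselves, instead of relying on a browser.
        public static void main(String[] args) {
            Frame frame = new Frame("Applet test harness");
            HelloApplet applet = new HelloApplet();
            frame.add(applet);
            frame.setSize(300, 200);
            frame.setVisible(true);
            applet.init();
            applet.start();
        }
    }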
4. SYSTEM DESIGN

4.1 NORMALIZATION:

In a speaker normalization apparatus, a vocal-tract configuration estimator estimates feature
quantities of the vocal-tract configuration, representing the anatomical configuration of the vocal
tract of each normalization-target speaker, from that speaker's speech waveform data, by looking
up a correspondence between vocal-tract configuration parameters and formant frequencies
previously determined from a vocal-tract model of the standard speaker. A frequency warping
function generator then estimates a vocal-tract area function for each normalization-target
speaker by adjusting the feature quantities of the standard speaker's vocal-tract configuration
according to the estimated feature quantities of the target speaker, estimates the formant
frequencies of speech uttered by the target speaker from this estimated vocal-tract area function,
and generates a frequency warping function giving the correspondence between input speech
frequencies and the frequencies after warping.
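
For intuition only, the sketch below shows a generic piecewise-linear frequency warping function of the kind often used for vocal-tract length normalization; it is a simplified stand-in rather than the estimator described above, and the warping factor, cut-off and Nyquist values are arbitrary.

    public class FrequencyWarpSketch {
        // Scale frequencies below a cut-off by alpha, then interpolate so that
        // the Nyquist frequency maps onto itself (keeps the mapping continuous).
        static double warp(double f, double alpha, double cutoff, double nyquist) {
            if (f <= cutoff) {
                return alpha * f;
            }
            double slope = (nyquist - alpha * cutoff) / (nyquist - cutoff);
            return alpha * cutoff + slope * (f - cutoff);
        }

        public static void main(String[] args) {
            double alpha = 0.9, cutoff = 5000.0, nyquist = 8000.0;   // arbitrary example values
            for (double f = 0; f <= nyquist; f += 2000) {
                System.out.printf("%5.0f Hz -> %5.0f Hz%n", f, warp(f, alpha, cutoff, nyquist));
            }
        }
    }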
4.2 TABLE DESIGN:

Table 1: Basic phones, IPA transcriptions and examples in Standard Indian

Phone  IPA    Examples           Phone  IPA    Examples
a1     a      ba, a              l      l      la, lang
a2     ε      an, ai             m      m      ma, mi
a3     ɑ      ang, ao            n      n      na, ni
b      p      ba, bu             ng     ŋ      ong, qing
c      tsʰ    ci, ca             o1     o      wo, po
ch     tʂʰ    cha, chu           o2     ou     ou
d      t      da, di             p      pʰ     po, pang
e1     γ      he, ge             q      tɕʰ    qi, qu
e2     e      ei, ye, yue, ian   r      ʐ      ri, rong
e3     ə      en, eng, uen       s      s      si, san
er     ər     er                 sh     ʂ      shi, shui
f      f      fa, fei            t      tʰ     te, ti
g      k      ge, gei            u      u      wan, bu
h      x      he, ha             x      ɕ      xia, xi
i1     i      bi, xia            yv     y      u, yue
i2     ɿ      zi, ci, si         z      ts     zi, zuo
i3     ʅ      zhi, chi, shi      zh     tʂ     zhu, zhong
j      tɕ     ji, jin            sil           silence
k      kʰ     ke, ka

Table 2: Basic sentence patterns of Standard Indian

1   S+N as predicate          11  "Bei4" construction
2   S+Adj as predicate        12  Existential sentence
3   S+clause as predicate     13  S+V1+V2
4   S+V                       14  S+V1+O+V2
5   S+V+O                     15  Inquiry
6   S+V+O1+O2                 16  One-word sentence
7   S+V+Complement            17  V+clause, Adj+clause
8   "You3" construction       18  Miscellaneous
9   "Shi4" construction
10  "Ba3" construction

Table 3: Coverage counts of the continuous speech database

Speech units              Total   DB863A             DB863B             DB863C
                                  Occurrence  Rate   Occurrence  Rate   Occurrence  Rate
Syllable                  401     396         98.7%  397         99%    400         99.9%
Inter-syllabic diphone    415     415         100%   415         100%   415         100%
Inter-syllabic triphone   3035    2128        70%    2644        87%    3023        99.6%
Final-initial structure   781     668         85.5%  724         92.7%  781         100%
Sentence pattern          17      17          100%   17          100%   17          100%

4.3 INPUT DESIGN:

Input design facilitates the entry of data into the computer system. It involves the selection
of the best strategy for getting data into the computer system at the right time and as
accurately as possible, because the most difficult aspect of input design is accuracy.
The use of well-defined documents can encourage users to record data accurately without
omission.

Access to computers for students with disabilities involves two major issues: access to the
computers themselves and access to electronic resources such as word processors,
spreadsheets, and the World Wide Web.

Adaptive hardware and software can facilitate computer access for people with disabilities.
Adaptive technology solutions may involve simple, readily available adjustments such as
using built-in access devices on standard computers, or they may require unique
combinations of software and hardware such as those needed for voice 
5. System Description

1. Text Analysis

2. Utterance composed of words

3. Phasing

4. Utterance compressed of phonemes

5. Waveform generation

Text Analysis:

Optical Character Recognition (OCR) – Allows you to control a scanner from within the TTS
software using its own proprietary OCR software. You use OCR software when you scan a book
and convert that scanned image into true text for TTS software to read. Scanners come with their
own scanning and OCR software that you also can use to create digital text from print materials.
For further information on OCR and scanning, see Converting Print Materials to Digital Text
Using a Scanner.

Formatting text – Allows you to format digital text you create, download from the Internet, or
scan into your computer similar to a word processing program.

Speaking what you type – Speaks text as you type to give you support in writing. Within this
function there may be the ability to set the level of support, such as speaking words or speaking
each letter and then the word.

Proprietary Format – TTS programs that have their own scanning and OCR software save files in
a variety of file types including their own proprietary format.

Utterance composed of words:


The meaning of an utterance must be composed systematically in a way that incorporates the
meaning of its words and the contribution of its syntax. Utterance meanings must serve as a
formal basis for inference. Children probably have cognitive limitations on the length of
utterances they can produce, independent of their grammatical knowledge. Given such length
limitations, they may sensibly leave out the least important parts. It is also true that the omitted
words tend to be words that are not stressed in adults' utterances, and children may be leaving out
unstressed elements (Demuth, 1994). Some have also suggested that children's underlying
knowledge at this point does not include the grammatical categories that govern the use of the
omitted forms

Phasing:

Phasing is an audio signal processing technique used to filter a signal by creating a series of
peaks and troughs in the frequency spectrum. The position of the peaks and troughs is typically
modulated so that they vary over time, creating a sweeping effect. For this purpose, phasers
usually include a low-frequency oscillator.

The electronic phasing effect is created by splitting an audio signal into two paths. One path
treats the signal with an all-pass filter, which preserves the amplitude of the original signal and
alters the phase. The amount of change in phase depends on the frequency. When signals from
the two paths are mixed, the frequencies that are out of phase cancel each other out, creating
the phaser's characteristic notches. Changing the mix ratio changes the depth of the notches; the
deepest notches occur when the mix ratio is 50%. The definition of phaser typically excludes
devices where the all-pass section is a delay line; such a device is called a flanger. Using a
delay line creates an unlimited series of equally spaced notches and peaks. It is possible to
cascade a delay line with another type of all-pass filter; this combines the unlimited number
of notches from the flanger with the uneven spacing of the phaser.
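
As a simplified illustration (not taken from this report), a single first-order all-pass stage mixed 50/50 with the dry signal already produces the frequency-dependent cancellation described above:

    public class AllPassStage {
        private final double a;   // coefficient controlling where the phase shift occurs
        private double x1, y1;    // previous input and output samples

        AllPassStage(double a) { this.a = a; }

        // First-order all-pass: y[n] = -a*x[n] + x[n-1] + a*y[n-1]
        // (unity gain at all frequencies, frequency-dependent phase).
        double process(double x) {
            double y = -a * x + x1 + a * y1;
            x1 = x;
            y1 = y;
            return y;
        }

        public static void main(String[] args) {
            AllPassStage stage = new AllPassStage(0.7);
            double[] impulse = {1, 0, 0, 0, 0, 0, 0, 0};
            for (double x : impulse) {
                double wet = stage.process(x);
                double out = 0.5 * x + 0.5 * wet;   // 50% mix gives the deepest notches
                System.out.printf("%.4f%n", out);
            }
        }
    }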

Intonation:

Intonation is variation of spoken pitch that is not used to distinguish words. It contrasts with
tone, in which pitch variation in some languages does distinguish words, and in other languages,
including English, performs a grammatical function. Intonation, rhythm, and stress are the three
main elements of linguistic prosody. Intonation patterns in some languages, such as Swedish and
Swiss German, can lead to conspicuous fluctuations in pitch, giving speech a sing-song quality.

Utterance compressed of phonemes:

An arrangement is provided for compressing speech data. Speech data is compressed based on a
phoneme stream, detected from the speech data, and a delta stream, determined based on the
difference between the speech data and a speech signal stream, generated using the phoneme
stream with respect to a voice font. The compressed speech data is decompressed into a
decompressed phoneme stream and a decompressed delta stream from which the speech data is
recovered. The processing system is adapted to convert a caller's voice message to a sequence of
phonemes whereby the caller's voice message is intended for a receiving device.

Waveform generation:

Speech synthesis, automatic generation of speech waveforms, has been under development for
several decades. Text to Wav offers a way to convert written text into sound files, but thanks to a
vague interface and lack of help, we're reluctant to recommend it. The program's user interface is
pretty bland, with most of the controls crammed into the left corner of the toolbar. There are a
few self-explanatory buttons for adjusting your text size and font. We also found slide controls
for adjusting the voice rate, pitch, and volume. A drop-down menu makes it appear as though
you have multiple options for the voice type, but the only one we found was "Microsoft Anna" in
English. To the right were various menu options that were vague to us, such as Convert at Two
Voices, Speak the Input Sentence, and Automatic Change Voices. These options might be
recognizable for people who have worked with this kind of software before, but it wasn't clear to
us. A Help menu offers a link to online help, but it's written in Japanese, so we were out of luck.
Sure, we were able to input text into the appropriate field and play it back using the Speak
command, but beyond that, we were at a loss. Overall, if you're accustomed to these types of
programs, it might be worth giving Text to Wav a shot. But if you're new to text-to-speech
software, we recommend you look for something with more guidance. Text to Wav comes as a
ZIP file and is accessible after extraction.

Beginning with isolated-syllable recognition, speech recognition in India has stepped into the
stage of treating large-vocabulary and continuous speech. Well-developed continuous speech
recognition and synthesis systems demand a high-quality continuous speech database which is
compact and valid, and whose scientific design would benefit from incorporating linguistic and
phonetic knowledge.

Currently three speaking styles are identified in continuous speech databases: read speech, fluent
speech (with planning) and spontaneous speech (without planning). The phonetic phenomena
involve both segments and prosody.

Research on speech databases abroad is ahead of us in both segments and prosody. TIMIT, an
English speech database built in the 1980s for the purpose of training speech recognizers,
considers acoustic-phonetic rules in connected speech [1][2]. The segments in this database were
labelled automatically at first [3], and then in the 1990s were hand-labelled by phoneticians [4].
In terms of prosody, a group of scientists and engineers with diverse backgrounds joined hands
in developing a prosodic transcription system called ToBI (Tone and Break Indices) [5]. With
rapid progress in speech engineering technology, speech databases for different languages are
being constructed by different research groups for their specific objectives. For example, in
Japan, the speech database devised by ATR contains not only speech waveforms but also
linguistic information such as morphological and syntactic tags [6]. At Bell Laboratories, the
speech database for text-to-speech systems takes segmental duration, stress values and segmental
place in a sentence as factors determining speech units [7]. Toward understanding human-machine
dialog and developing a dialog system, AT&T proposes a methodology for creating and managing
an integrated database for the spoken dialog system [8]. To do continuous speech recognition over
telephones, many countries have collected speech data through telephone lines [9].
There exist several versions of Indian speech databases at home and abroad, such as the databases
for speech recognition developed by the Department of Electrical Engineering and the Department
of Computer Science at Tsing Hua University, the Institute of Automation at the Indian Academy
of Sciences, and the Department of Computer Science at the University of Hong Kong
[10][11][12][13], and so on, and databases for phonetic research and speech synthesis built up by
the Institute of Acoustics at the Indian Academy of Sciences and the Institute of Linguistics at the
Indian Academy of Social Sciences. All these databases are deficient in the coverage of the
phonetic characteristics of continuous speech. Take the databases used for speech recognition as
examples: the phonetic phenomena considered in text design can only be controlled in
mono-syllables and syllable groups, while continuous sentences are chosen randomly from a large
corpus and therefore without control. However, the best speech data for training a recognizer is
continuous natural sentences. For this reason, we, with financial support from the "863" Project,
set out to design reading texts for a continuous speech database, for which recording and
management of the database are done by the Department of Electrical Engineering at the
University of Science and Technology of India. It is hoped that through this work the results of
phonetic research will better serve the needs of speech engineering.

The major obstacle to achieving high accuracy in speech recognition is the large amount of
variability present in the speech signal, only a small part of which carries linguistic information.
Given the current state of the art in phonetic research, it is desirable to focus our attention on
the segmental contextual variabilities in read speech.

2 The Acoustic Characteristics of Continuous Speech

2.1  Variability in Continuous Speech

The variability in continuous speech is defined as the deviation of the phonetic characteristics of
speech from the citation forms. On the segmental level, there are two kinds of variability:
context-dependent variability (arising from the influence of the adjacent segments) and
context-independent variability (arising from the influence of speaking rate, style, mood, sentence
patterns and the speaker's individual differences). On the prosodic level, variability refers to
changes in fundamental frequency, duration and energy, and to the interaction between segment
and prosody.

2.2  Continuous Speech and Formant Transitions

Continuous speech is a cascade of syllables, and each syllable in turn consists of smaller units.
To explore the phonetic phenomena in continuous speech, we must define the basic elements of
continuous speech in standard Indian. The phoneme is the distinctive unit in a language, and its
various realizations are called allophones. Shi Feng proposed to define the acoustic
representation of an allophone as a phone [14]. In this paper, we use the phone as the minimal
segment in continuous speech and use phones to describe the variability under different contexts.
In isolated words or connected words where words are separated by distinct pauses, the
beginnings and ends of words are clearly marked. In continuous speech, however, word
boundaries are blurred and words evolve smoothly in time with no acoustic separation.
Therefore segmental variabilities emerge.

Fig. 1 is the spectrogram of an utterance of "la4 yue4 chu1 liu4" (the sixth day of the last month
by the lunar calendar). It demonstrates that a phonetic segment corresponds to a number of
concatenated segments in the speech waveform [15]. In other words, one phonetic segment
corresponds to several acoustic segments due to the existence of transitions. It is insufficient to
use phones alone to describe the variabilities and transitions in continuous speech.

In the past ten years, researchers at the Institute of Linguistics, Indian Academy of Social
Sciences, funded by the "863" Project, have been conducting systematic research into the formant
transitions in all mono-syllables (CV/CVN) and disyllables (C1V1/N1-C2V2/N2) in standard
Indian, with particular attention paid to the inter-syllable transitions. These research works are
most valuable to the present study. Using disyllabic structures [16], Chen Xiaoxia investigated the
coarticulation of C1V1C2V2, where V1 represents 22 finals, C2 represents the labials "b, p, m, f",
alveolars "d, t, n, l", and velars "g, k, h", and V2 is occupied by the three vowels "a, i, u". The
results show that the influence of V2 on V1 passing through C2 is obvious [17]. Research on
disyllabic structures with C2 being the zero initial, carried out by Yan Jingzhu, indicates that the
formant transition is very strong when C2 is the zero initial [18]. Sun Guohua presents rules for
formant transitions when C2 is "z, c, s, zh, ch, sh, j, q, x" and V1 is followed by the nasal endings
"n, ng" [19]. On the basis of these results, it is summarized that (1) when C2 is a zero initial or a
voiced consonant (m, n, l, r), formant transitions between V1 and C2 are observed; (2) there is
little change in V1 when C2 is a stop or stop fricative; (3) the transitions from N1 to C2 are quite
different from the other cases when N1 is a nasal ending "n, ng". Xu Yi proposes four kinds of
junctures [20]: close juncture for intra-syllabic segments, syllable juncture between syllables,
rhythm juncture between words, and pause juncture. Different junctures reflect different levels in
the prosodic structure hierarchy of continuous speech. The structure C1V1/N1-C2V2 will be
influenced by these complications in continuous speech.

Pause and Prosodic Structures

The pause in continuous speech is a silence between the waveforms of two syllables. Only when
the duration of the silence is long enough is there no transition between the two segments. There
is, in addition, a prosodic structure in an utterance which may have some relationship with
syntactic and semantic constraints.

The boundaries of the prosodic structure are breaks whose realization can be a pause,
pre-lengthening/final lengthening, a pitch movement or an F0 reset [21][22]. Some research
reveals that the pause induced by syntax occurs at the boundaries of sentences or more complex
structures, and the average length between two pauses is about ten words [23]. Such a fact shows
that pausing takes place at the major prosodic boundaries. In a study of pauses in news stories by
Li Aijun [24], it is pointed out that of all the breaks that are perceived, only 30% are realized as
real pauses. It can be inferred that perceived breaks are not necessarily signaled by silence, but
probably by other acoustic cues. Consequently there are transitions between two segments even
at the boundary of two utterances. It can also be concluded that transitions exist between nearly
all syllables in an utterance that is not too long.

Basic Phones in Standard Indian

We have proposed 37 phones for standard Indian, which are tabulated in Table 1 with their IPA
transcriptions and example contexts. In Table 1, "sil" is an abbreviation for "silence", and only
the vowels /a, i, o, e/ have allophones, which are constrained by the contexts to their left and
right. It is generally known that the vowel /i/ has three allophones: [i] as in "yi", [ɿ] as in "zi",
"ci", "si", and [ʅ] as in "zhi", "chi", "shi", "ri". They are symbolized as i1, i2 and i3 respectively.
The low vowel "a" is realized differently as the context varies: [a] as in open syllables, [ε] as in
"ai", "an", and [ɑ] as in "ang", "ao", symbolized as a1, a2 and a3 respectively. In a similar vein,
for "e", e1 is the symbol for [γ] as in "ge", "he", e2 for [e] as in "ei", "ie", "yue", and e3 for [ə]
as in "en", "eng"; for "o", o1 stands for [o] as in "uo", and o2 for [ou] as in "ou". Within
syllables, the contexts themselves are able to identify the allophones of the above vowels. But
when their influence on the preceding consonants is taken into account, it is necessary to
distinguish the different allophones. For example, "a" in "ga", "gai" and "gao" is realized
distinctly in each case, and so are the transitions from "g" to "a". With the three allophones a1,
a2 and a3, we are able to document the distinction. Fig. 2 shows the spectrograms of "g" as it is
followed by allophones of "a, o, e, u". It is seen that the differences between the transitions to a1,
a2 and a3 are obvious. Since there is no significant distinction between the allophones of "o", no
allophones are suggested.
6. Testing and Implementation

TESTING:

Software is only one element of a larger computer-based system. Ultimately, software is
incorporated with other system elements (e.g. new hardware), and a series of system integration
and validation tests are conducted. System testing is actually a series of different tests whose
primary purpose is to fully exercise the computer-based system.

Testing presents an interesting anomaly for software development. The testing phase creates a
series of test cases that are intended to ‘demolish’ the software that has been built. A good test
case is one that has a high probability of finding an as yet undiscovered error, and a successful
test is one that uncovers such an error.

System testing is the stage of implementation which is aimed at ensuring that the system works
accurately and efficiently before live operation commences. Testing is vital to the success of the
system. System testing makes the logical assumption that if all the parts of the system are
correct, the goal will be successfully achieved. The reports produced are tested for validity, and
the updated results are verified. The candidate system is subjected to a variety of tests; a series
of tests is performed before the proposed system is ready for acceptance testing.

The Testing Steps:

 Unit Testing
 Integration Testing
 Output Testing
 User Acceptance Testing
UNIT TESTING

Unit testing focuses verification efforts on the smallest unit of software design, the module. This
is also known as “module testing”. Unit testing exercises the modules independently of one
another to locate errors. This enables the tester to detect errors in coding and logic that are
contained within the module alone; errors resulting from the interaction between modules are
initially avoided.

Unit testing comprises the set of tests performed by an individual programmer prior to
integration of the unit into a larger system.

A program unit is usually small enough that the programmer who developed it can test it in
great detail, and certainly in greater detail than will be possible when the unit is integrated into
an evolving software product.

There are four categories of tests that a programmer will typically perform on a
program unit.

 Functional Tests
 Performance Tests
 Stress Tests
 Structure Tests

Functional Tests

Functional tests exercise the code with nominal input values for which the expected results
are known.

In the client module, the request message is verified for all possible inputs, taking into account
the set of possible circumstances. This is essential because it affects the overall output of the
system. In the server module, the request message interpretation and job administration modules
are extensively tested so that they work satisfactorily under all possible circumstances.
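
For illustration only, a functional test with a nominal input and a known expected result might look like the following; the TextNormalizerTest class and its normalize helper are hypothetical, and JUnit 4 is assumed.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class TextNormalizerTest {

        // Hypothetical helper under test: expands digits to words before synthesis.
        static String normalize(String input) {
            return input.replace("1", "one");   // simplified stand-in implementation
        }

        @Test
        public void expandsDigitsToWords() {
            // Nominal input with a known expected result.
            assertEquals("semester one", normalize("semester 1"));
        }
    }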

Performance Tests

Performance testing is concerned with evaluating the speed and memory utilization of the
program. The package is tested with various test cases, and the performance is found to be
satisfactory.

Stress Tests

Stress testing, which is concerned with exercising the internal logic of a program and traversing
particular execution paths, was also carried out. The input is chosen in such a way that all possible
paths, from the initial client request through to job completion, are exercised.

Structure Tests

Structure testing is also referred to as White Box or Glass Box Testing.

Test Data

Data to be entered in this project are distance coverage, energy and link quality, related in such a way
that if the distance coverage is high, the energy required to transmit the message is also high, and
because of the high energy the link quality is lower. Based on this condition, the values are entered at
runtime into the table, through which the rank is calculated and packets are routed.

INTEGRATION TESTING

Data can be lost across an interface, one module can have an adverse effect on another, and
sub-functions, when combined, may not produce the desired major functions. Integration testing is a
systematic technique for constructing the program structure while at the same time conducting tests to
uncover errors associated with the interfaces. The objective is to take unit-tested modules and build a
program structure from them. All the modules are combined and tested as a whole, and the errors
uncovered in the integration testing step are corrected before the next testing stage.
OUTPUT TESTING

After the validation testing, the next step is output testing of the proposed system, since no system can
be useful if it does not produce the required output in the specified format. The outputs generated or
displayed by the system under consideration are tested to check that they are presented in a
user-friendly manner and do not produce unexpected errors. The output format is considered in two
ways: on screen and in printed form. The on-screen output format was found to be correct, as the format
was designed in the system design phase according to the user's needs. The reports generated were
found to be accurate and easy to navigate. For the hard copy as well, the output conforms to the
requirements specified by the user. Hence output testing did not result in any correction to the system.

USER ACCEPTANCE TESTING:

User acceptance of a system is a key factor in the success of any system. The system under
consideration was tested for user acceptance by keeping in touch with prospective system users during
development and by making changes whenever required. Preparation of test data plays a vital role in
system testing. The reports were tested with the test data provided and were found to function properly.
Corrections were also made so that the reports provide the correct data whenever new data are entered.

Implementation is the final and most important phase. It involves user training, system testing and the
successful running of the developed system. The user tests the developed system, and changes are
made according to their needs. The testing phase involves testing the developed system using various
kinds of data. Each of the reports provided is checked and updated based on the prevailing conditions.

An elaborate set of test data is prepared and the system is tested using it. The tests are checked against
the various constraints built into the developed system, such as the security measures, and everything
that needs to be reported is checked. Errors noted during testing are corrected.
Name of Module: Text Recognition

Test Case ID: QSW->CP0001
Test Scenario: Text recognition
Type of Test Case: Functional
Prerequisites, if any: The user enters text for conversion from text to speech.
Test Steps: The entered text is recognized and passed on to the next step.
Result: If text is given in the text area, the text is heard through the speaker.
Pass/Fail: Pass

Test Case ID: QSW->CP0002
Test Scenario: Text recognition
Type of Test Case: Functional
Result: If the text field is left blank, the alert message "Please enter some text in the text area" appears
on the screen.
Pass/Fail: Fail
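
The two test cases above could also be automated as plain unit tests. The sketch below is only
illustrative and assumes a small, hypothetical validation helper (isSpeakable) that mirrors the behaviour
in the table: non-blank text is passed on for conversion, while blank text triggers the alert.

import org.junit.Test;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class TextRecognitionTest {

    // Hypothetical helper mirroring the validation described in the test cases above.
    static boolean isSpeakable(String text) {
        return text != null && !text.trim().isEmpty();
    }

    @Test
    public void textEnteredInTextAreaIsAcceptedForSpeech() { // QSW->CP0001
        assertTrue(isSpeakable("Hello Welcome"));
    }

    @Test
    public void blankTextAreaTriggersAlertInsteadOfSpeech() { // QSW->CP0002
        assertFalse(isSpeakable("   "));
    }
}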
7. Conclusion and Future Scope

Conclusion:

A Text-To-Speech (TTS) synthesizer is a computer-based system that should be able to read any
text aloud, whether it was directly introduced in the computer by an operator or scanned and
submitted to an Optical Character Recognition (OCR) system. Once this kind of technology is available
in the market, anyone can have any text read aloud quickly and simply, without having to read or type
everything themselves.

Future Scope

Text-to-speech is a flourishing technology that is very useful in today's world of software development.
Upcoming technology is increasingly based on Artificial Intelligence, and this project serves as a step in
that direction. The project has many advantages and very few disadvantages. We conclude that it is
useful to everyone who adopts this process to reduce manual effort.
8. Forms and Reports

Home Page

This is the main page of the text-to-speech converter. When the user runs the application, this screen is
displayed. The user interface is very friendly and anyone can use it. The text area is the control that
collects the input from the user; the input is limited to 200 characters. Once the user gives the input, it
can be played back as voice output.
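
If the text area is an Android EditText, the 200-character limit could be enforced with an input filter,
roughly as sketched below; the method name and field are assumptions made for illustration only.

// Minimal sketch: cap the text area at 200 characters.
// Assumes: import android.text.InputFilter; import android.widget.EditText;
private void limitTextArea(EditText textArea) {
    // LengthFilter silently drops any characters typed beyond the limit.
    textArea.setFilters(new InputFilter[] { new InputFilter.LengthFilter(200) });
}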
Input Given Example

In this screen the user enters the text "Hello Welcome"; the voice language is then chosen in the
following screen. This screen shows how the user gives input to the application.
Language Selection
Once the user has given input in the text area, the voice output can be produced in the language of their
choice. A selection control is used to choose the language. In the screen above, the languages offered
are: US-EN, UK-EN, FRENCH, GERMAN, SPANISH and ITALIAN.
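
If the conversion is done through Android's TextToSpeech engine, as in the sample code later in this
report, the selected language could be mapped to a Locale roughly as follows; the helper name and the
exact mapping are assumptions and not part of the original implementation.

// Hypothetical helper: map the on-screen language choice to a Locale for the engine.
// Assumes: import java.util.Locale; import android.speech.tts.TextToSpeech;
private void applyLanguage(TextToSpeech tts, String selection) {
    Locale locale;
    switch (selection) {
    case "US-EN":   locale = Locale.US;              break;
    case "UK-EN":   locale = Locale.UK;              break;
    case "FRENCH":  locale = Locale.FRENCH;          break;
    case "GERMAN":  locale = Locale.GERMAN;          break;
    case "SPANISH": locale = new Locale("es", "ES"); break;
    case "ITALIAN": locale = Locale.ITALIAN;         break;
    default:        locale = Locale.US;              break;
    }
    // Fall back to US English if the engine cannot handle the chosen language.
    int result = tts.setLanguage(locale);
    if (result == TextToSpeech.LANG_MISSING_DATA || result == TextToSpeech.LANG_NOT_SUPPORTED) {
        tts.setLanguage(Locale.US);
    }
}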
Gender Selection Screen

Once the user has given input in the text area, the output can also be produced in the voice of their
choice. After selecting the preferred language, the user can select the gender of the voice, male or
female. A selection control highlights the chosen option. The voices available are Mike (male), Heather
(female), Ryan (male) and Crystal (female). The two male voices differ from each other, as do the two
female voices, in their voice modulation.
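
On Android 5.0 (API 21) and later, a named voice can be requested from the TTS engine. Whether
voices named Mike, Heather, Ryan or Crystal are actually installed depends entirely on the engine, so
the lookup below is only a sketch under that assumption.

// Hypothetical sketch: pick an installed voice whose name contains the label
// chosen on screen (e.g. "Mike" or "Heather"). Requires API 21+.
// Assumes: import android.speech.tts.TextToSpeech; import android.speech.tts.Voice;
private void applyVoice(TextToSpeech tts, String voiceLabel) {
    for (Voice voice : tts.getVoices()) {
        // Voice names are engine-specific, so match loosely on the label.
        if (voice.getName().toLowerCase().contains(voiceLabel.toLowerCase())) {
            tts.setVoice(voice);
            return;
        }
    }
    // If no matching voice is installed, the engine keeps its default voice.
}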
Speed Selection

People speak at various rates: some speak very fast, others in a slow and calm manner, and users want
the response delivered in the way they prefer. For this reason the application provides a speed option,
so that the voice output is delivered at the corresponding rate. The speed is classified on a scale from
-10 to +10.
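
With Android's TextToSpeech engine, the -10 to +10 speed value could be mapped onto the engine's
rate multiplier roughly as below; the exact mapping is an assumption, since the original implementation
is not shown.

// Hypothetical mapping from the on-screen speed (-10 .. +10) to the multiplier
// expected by TextToSpeech.setSpeechRate(), where 1.0f is the normal rate.
private void applySpeed(TextToSpeech tts, int speed) {
    float rate;
    if (speed >= 0) {
        rate = 1.0f + (speed / 10.0f);  // 0 -> 1.0x, +10 -> 2.0x
    } else {
        rate = 1.0f + (speed / 20.0f);  // -10 -> 0.5x
    }
    tts.setSpeechRate(rate);
}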

Voice Output

Once all the selections have been made and the input text has been given, the user can click the play
button in the bottom-left corner of the application, as in the screen above. The output is then produced
and the screen appears as below (a loading symbol is shown instead of the play button while the speech
is being generated).
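
A minimal sketch of how the play button and the loading symbol described above could be tied to the
TTS engine is given below; the view fields, the utterance id and the method name are assumptions, and
the method is assumed to sit inside the activity that owns the TTS object.

// Hypothetical wiring: speak the entered text and show the loading symbol
// while the utterance is in progress. Assumes fields playButton (Button),
// loadingSpinner (ProgressBar) and myTTS (TextToSpeech) bound elsewhere, and
// imports for View, HashMap and UtteranceProgressListener.
private void playText(String text) {
    myTTS.setOnUtteranceProgressListener(new UtteranceProgressListener() {
        @Override public void onStart(String utteranceId) {
            runOnUiThread(new Runnable() { public void run() {
                playButton.setVisibility(View.GONE);        // hide the play button
                loadingSpinner.setVisibility(View.VISIBLE); // show the loading symbol
            }});
        }
        @Override public void onDone(String utteranceId) {
            runOnUiThread(new Runnable() { public void run() {
                loadingSpinner.setVisibility(View.GONE);
                playButton.setVisibility(View.VISIBLE);
            }});
        }
        @Override public void onError(String utteranceId) {
            onDone(utteranceId); // restore the button if speech fails
        }
    });
    // The utterance id lets the callbacks above identify this particular request.
    HashMap<String, String> params = new HashMap<String, String>();
    params.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID, "play-button-utterance");
    myTTS.speak(text, TextToSpeech.QUEUE_FLUSH, params);
}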
Sample Code:

// VoiceRecognition.java: voice recognition and spoken navigation activity
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import android.app.Activity;
import android.content.Intent;
import android.content.pm.PackageManager;
import android.content.pm.ResolveInfo;
import android.os.Bundle;
import android.speech.RecognizerIntent;
import android.speech.tts.TextToSpeech;
import android.speech.tts.TextToSpeech.OnInitListener;
import android.view.View;
import android.view.View.OnClickListener;
import android.widget.ArrayAdapter;
import android.widget.Button;
import android.widget.ListView;
import android.widget.Toast;

public class VoiceRecognition extends Activity implements OnClickListener, OnInitListener {

    public static final int VOICE_RECOGNITION_REQUEST_CODE = 1234;

    public ListView mList;
    public Button speakButton;

    // TTS object
    public TextToSpeech myTTS;

    // status check code
    public int MY_DATA_CHECK_CODE = 0;

    // Navigation helpers: announce the destination, then start it by intent action.
    public void informationmenu() {
        speakWords("information screen");
        startActivity(new Intent("android.intent.action.INFOSCREEN"));
    }

    public void voicemenu() {
        speakWords("voice recognition menu");
        startActivity(new Intent("android.intent.action.RECOGNITIONMENU"));
    }

    public void mainmenu() {
        speakWords("main menu");
        startActivity(new Intent("android.intent.action.MENU"));
    }

    public void voicerecog() {
        speakWords("speak now");
        startActivity(new Intent("android.intent.action.SPEAK"));
    }

    /** Called when the activity is first created. */
    @Override
    public void onCreate(Bundle voiceinput) {
        super.onCreate(voiceinput);

        // Inflate our UI from its XML layout description.
        setContentView(R.layout.voice_recognition);

        // check for TTS data
        Intent checkTTSIntent = new Intent();
        checkTTSIntent.setAction(TextToSpeech.Engine.ACTION_CHECK_TTS_DATA);
        startActivityForResult(checkTTSIntent, MY_DATA_CHECK_CODE);

        // Get display items for later interaction
        voiceinputbuttons();

        // Check to see if a recognition activity is present
        PackageManager pm = getPackageManager();
        List<ResolveInfo> activities = pm.queryIntentActivities(
                new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH), 0);
        if (activities.size() != 0) {
            speakButton.setOnClickListener(this);
        } else {
            speakButton.setEnabled(false);
            speakButton.setText("Recognizer not present");
        }
    }

    // setup TTS
    public void onInit(int initStatus) {
        // check for successful instantiation
        if (initStatus == TextToSpeech.SUCCESS) {
            if (myTTS.isLanguageAvailable(Locale.US) == TextToSpeech.LANG_AVAILABLE) {
                myTTS.setLanguage(Locale.US);
            }
        } else if (initStatus == TextToSpeech.ERROR) {
            Toast.makeText(this, "Sorry! Text To Speech failed...", Toast.LENGTH_LONG).show();
        }
    }

    /** Handle the click on the start recognition button. */
    public void onClick(View v) {
        startVoiceRecognitionActivity();
    }

    public void voiceinputbuttons() {
        speakButton = (Button) findViewById(R.id.btn_speak);
        mList = (ListView) findViewById(R.id.list);
    }

    /** Fire an intent to start the speech recognition activity. */
    public void startVoiceRecognitionActivity() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speech recognition demo");
        startActivityForResult(intent, VOICE_RECOGNITION_REQUEST_CODE);
    }

    /** Handle the results from the recognition activity and the TTS data check. */
    @Override
    public void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == VOICE_RECOGNITION_REQUEST_CODE && resultCode == RESULT_OK) {
            // Fill the list view with the strings the recognizer thought it could have heard.
            ArrayList<String> matches = data
                    .getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            mList.setAdapter(new ArrayAdapter<String>(this,
                    android.R.layout.simple_list_item_1, matches));

            // "matches" is the list of what the user possibly said. Several keywords
            // map to the same activity, so the user does not have to memorize one
            // exact word from a list.
            if (matches.contains("information") || matches.contains("info screen")
                    || matches.contains("info") || matches.contains("about")) {
                informationmenu();
            } else if (matches.contains("home") || matches.contains("menu")
                    || matches.contains("home screen")) {
                mainmenu();
            } else if (matches.contains("speak")) {
                voicerecog();
            } else if (matches.contains("close") || matches.contains("stop")
                    || matches.contains("finish")) {
                finish();
            } else if (matches.contains("voice") || matches.contains("recognition")
                    || matches.contains("voice recognition")) {
                voicemenu();
            } else {
                // No keyword recognized: prompt the user and listen again.
                speakWords("Speak Now");
                startVoiceRecognitionActivity();
            }
        }

        // Result of the TTS data check: create the TTS object once the data is available.
        if (requestCode == MY_DATA_CHECK_CODE) {
            if (resultCode == TextToSpeech.Engine.CHECK_VOICE_DATA_PASS) {
                // the user has the necessary data - create the TTS
                myTTS = new TextToSpeech(this, this);
            } else {
                // no data - install it now
                Intent installTTSIntent = new Intent();
                installTTSIntent.setAction(TextToSpeech.Engine.ACTION_INSTALL_TTS_DATA);
                startActivity(installTTSIntent);
            }
        }

        super.onActivityResult(requestCode, resultCode, data);
    }

    // speak the user text
    public void speakWords(String speech) {
        // speak straight away
        myTTS.speak(speech, TextToSpeech.QUEUE_FLUSH, null);
    }
}

// mainj.java: load screen shown while the TTS engine data check runs
import java.util.Locale;

import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;
import android.speech.tts.TextToSpeech;
import android.speech.tts.TextToSpeech.OnInitListener;
import android.widget.Toast;

public class mainj extends Activity implements OnInitListener {

    private TextToSpeech myTTS;

    // status check code
    private int MY_DATA_CHECK_CODE = 0;

    // setup TTS
    public void onInit(int initStatus) {
        // check for successful instantiation
        if (initStatus == TextToSpeech.SUCCESS) {
            if (myTTS.isLanguageAvailable(Locale.US) == TextToSpeech.LANG_AVAILABLE) {
                myTTS.setLanguage(Locale.US);
            }
        } else if (initStatus == TextToSpeech.ERROR) {
            Toast.makeText(this, "Sorry! Text To Speech failed...", Toast.LENGTH_LONG).show();
        }
    }

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.loadscreen);

        // check for TTS data
        Intent checkTTSIntent = new Intent();
        checkTTSIntent.setAction(TextToSpeech.Engine.ACTION_CHECK_TTS_DATA);
        startActivityForResult(checkTTSIntent, MY_DATA_CHECK_CODE);

        // Show the load screen for three seconds, announce the main menu,
        // start it, and then close this activity.
        Thread logoTimer = new Thread() {
            public void run() {
                try {
                    sleep(3000);
                    speakWords("main menu loaded");
                    startActivity(new Intent("android.intent.action.MENU"));
                } catch (InterruptedException e) {
                    e.printStackTrace();
                } finally {
                    finish();
                }
            }
        };
        logoTimer.start();
    }

    // speak the user text
    private void speakWords(String speech) {
        // speak straight away (the TTS object may still be null if the data
        // check has not come back yet)
        if (myTTS != null) {
            myTTS.speak(speech, TextToSpeech.QUEUE_FLUSH, null);
        }
    }

    // act on result of TTS data check
    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == MY_DATA_CHECK_CODE) {
            if (resultCode == TextToSpeech.Engine.CHECK_VOICE_DATA_PASS) {
                // the user has the necessary data - create the TTS
                myTTS = new TextToSpeech(this, this);
            } else {
                // no data - install it now
                Intent installTTSIntent = new Intent();
                installTTSIntent.setAction(TextToSpeech.Engine.ACTION_INSTALL_TTS_DATA);
                startActivity(installTTSIntent);
            }
        }
        super.onActivityResult(requestCode, resultCode, data);
    }
}

// menu.java: main menu activity of the application
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import android.app.Activity;
import android.content.Intent;
import android.content.pm.PackageManager;
import android.content.pm.ResolveInfo;
import android.os.Bundle;
import android.speech.RecognizerIntent;
import android.speech.tts.TextToSpeech;
import android.view.View;
import android.view.View.OnClickListener;
import android.widget.ArrayAdapter;
import android.widget.Button;
import android.widget.ListView;
import android.widget.Toast;

public class menu extends Activity implements TextToSpeech.OnInitListener, OnClickListener {

    public static final int VOICE_RECOGNITION_REQUEST_CODE = 1234;

    // remember to include a list view in the XML or the voice recognition code will not work
    public ListView mList;

    Button speakButton, infoButton, voiceButton, talkButton;

    // TTS object
    public TextToSpeech myTTS;

    // status check code
    public int MY_DATA_CHECK_CODE = 0;

    @Override
    protected void onCreate(Bundle aboutmenu) {
        super.onCreate(aboutmenu);
        setContentView(R.layout.mainx);

        // get references to the button elements listed in the XML layout
        speakButton = (Button) findViewById(R.id.btn_speak);
        infoButton = (Button) findViewById(R.id.aboutbutton);
        voiceButton = (Button) findViewById(R.id.voicebutton);
        talkButton = (Button) findViewById(R.id.talk);

        // listen for clicks
        infoButton.setOnClickListener(this);
        speakButton.setOnClickListener(this);
        talkButton.setOnClickListener(this);

        // check for TTS data
        Intent checkTTSIntent = new Intent();
        checkTTSIntent.setAction(TextToSpeech.Engine.ACTION_CHECK_TTS_DATA);
        startActivityForResult(checkTTSIntent, MY_DATA_CHECK_CODE);

        // bind the remaining display items
        voiceinputbuttons();

        // Check to see if a recognition activity is present. On an AVD virtual
        // device the recognizer is usually missing; the required microphone
        // input only works on an actual Android device.
        PackageManager pm = getPackageManager();
        List<ResolveInfo> activities = pm.queryIntentActivities(
                new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH), 0);
        if (activities.size() != 0) {
            voiceButton.setOnClickListener(this);
        } else {
            voiceButton.setEnabled(false);
            voiceButton.setText("Recognizer not present");
        }
    }

    // setup TTS
    public void onInit(int initStatus) {
        // check for successful instantiation; show a failure message otherwise
        if (initStatus == TextToSpeech.SUCCESS) {
            if (myTTS.isLanguageAvailable(Locale.US) == TextToSpeech.LANG_AVAILABLE) {
                myTTS.setLanguage(Locale.US);
            }
        } else if (initStatus == TextToSpeech.ERROR) {
            Toast.makeText(this, "Sorry! Text To Speech failed...", Toast.LENGTH_LONG).show();
        }
    }

    public void voiceinputbuttons() {
        speakButton = (Button) findViewById(R.id.btn_speak);
        mList = (ListView) findViewById(R.id.list);
    }

    // respond to button clicks
    public void onClick(View v) {
        // a switch is used so that each button does a different thing
        switch (v.getId()) {
        case R.id.btn_speak:
            // speakWords(...) is the call that actually performs text to speech
            String words1 = speakButton.getText().toString();
            speakWords(words1);
            startActivity(new Intent("android.intent.action.RECOGNITIONMENU"));
            break;
        case R.id.aboutbutton:
            String words2 = infoButton.getText().toString();
            speakWords(words2);
            startActivity(new Intent("android.intent.action.INFOSCREEN"));
            break;
        case R.id.voicebutton:
            speakWords("Speak Now");
            startVoiceRecognitionActivity(); // call the voice recognition activity
            break;
        case R.id.talk:
            speakWords("This is the main menu.");
            break;
        }
    }

    // speak the user text
    public void speakWords(String speech) {
        // speak straight away
        myTTS.speak(speech, TextToSpeech.QUEUE_FLUSH, null);
    }

    /** Fire an intent to start the speech recognition activity. */
    public void startVoiceRecognitionActivity() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speech recognition demo");
        startActivityForResult(intent, VOICE_RECOGNITION_REQUEST_CODE);
    }

    /** Handle the results from the recognition activity and the TTS data check. */
    @Override
    public void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == VOICE_RECOGNITION_REQUEST_CODE && resultCode == RESULT_OK) {
            // Fill the list view with the strings the recognizer thought it could have heard.
            ArrayList<String> matches = data
                    .getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            mList.setAdapter(new ArrayAdapter<String>(this,
                    android.R.layout.simple_list_item_1, matches));

            // Route on keywords; several keywords map to the same screen so the
            // user does not have to memorize one exact word from a list.
            if (matches.contains("information") || matches.contains("info screen")
                    || matches.contains("info") || matches.contains("about")) {
                speakWords("information screen");
                startActivity(new Intent("android.intent.action.INFOSCREEN"));
            } else if (matches.contains("home") || matches.contains("menu")
                    || matches.contains("home screen")) {
                speakWords("main menu");
                startActivity(new Intent("android.intent.action.MENU"));
            } else if (matches.contains("speak")) {
                speakWords("speak now");
                startActivity(new Intent("android.intent.action.SPEAK"));
            } else if (matches.contains("close") || matches.contains("stop")
                    || matches.contains("finish")) {
                finish();
            } else if (matches.contains("voice") || matches.contains("recognition")
                    || matches.contains("voice recognition")) {
                speakWords("voice recognition menu");
                startActivity(new Intent("android.intent.action.RECOGNITIONMENU"));
            }
        }

        // still in onActivityResult: this part handles the text-to-speech data check
        if (requestCode == MY_DATA_CHECK_CODE) {
            if (resultCode == TextToSpeech.Engine.CHECK_VOICE_DATA_PASS) {
                // the user has the necessary data - create the TTS
                myTTS = new TextToSpeech(this, this);
            } else {
                // no data - install it now
                Intent installTTSIntent = new Intent();
                installTTSIntent.setAction(TextToSpeech.Engine.ACTION_INSTALL_TTS_DATA);
                startActivity(installTTSIntent);
            }
        }

        super.onActivityResult(requestCode, resultCode, data);
    }
}


