[The following materials were gathered from the web during May 1996.
This article was updated periodically from 1993 to 1996.]
This is a REPORT. SUPERADAPTOID does not REVIEW products that have not been
personally evaluated by DEMONSTRATION.
The following article outlines the scope of the comp.speech FAQ postings and the
institutional, research, and business resources available on the web. The business and
product listings are not complete. However, they represent the Better-Of-The-Best of lists.
Speech synthesis is the task of transforming written input to spoken output. The input
can either be provided in a graphemic/orthographic or a phonemic script, depending on
its source.
There are several algorithms. The choice depends on the task they're used for. The
easiest way is to just record the voice of a person speaking the desired phrases. This is
useful if only a restricted volume of phrases and sentences is used, e.g. messages in a
train station, or schedule information via phone. The quality depends on the way
recording is done.
More sophisticated but worse in quality are algorithms which split the speech into
smaller pieces. The smaller those units are, the fewer of them are needed, but the
quality also decreases. An often used unit is the phoneme, the smallest linguistic unit.
Depending on the language used there are about 35-50 phonemes in western European
languages, i.e. there are 35-50 single recordings. The problem is combining them, as
fluent speech requires fluent transitions between the elements. The intelligibility is
therefore lower, but the memory required is small.
A solution to this dilemma is using diphones. Instead of splitting at the transitions, the
cut is done at the center of the phonemes, leaving the transitions themselves intact.
For an inventory of 20 phonemes this gives about 400 elements (20*20); in general the
count grows with the square of the number of phonemes. The quality increases.
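The diphone idea can be illustrated with a minimal Python sketch. It is not taken from any system listed here; the function names, the phoneme symbols, and the "_" silence padding are assumptions made for illustration only.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into the diphone units that span it.

    Each diphone runs from the middle of one phoneme to the middle of the
    next, so the transition between them stays inside a single unit.
    Padding both ends with a silence symbol is an assumption of this sketch.
    """
    padded = [silence, *phonemes, silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_diphone_inventory(n_phonemes):
    """Upper bound on distinct diphones: every ordered pair of phonemes."""
    return n_phonemes * n_phonemes


print(to_diphones(["k", "ae", "t"]))  # ['_-k', 'k-ae', 'ae-t', 't-_']
print(max_diphone_inventory(20))      # 400
```

A real inventory is usually smaller than the upper bound, since not every phoneme pair occurs in a given language.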
The longer the units become, the more elements there are, but the quality increases
along with the memory required. Other units which are widely used are half-syllables,
syllables, words, or combinations of them, e.g. word stems and inflectional endings.
* "Talking Machines, Theories, Models and Designs" Eds, G. Bailly & C. Benoit
(Elsevier: North Holland)
* W.B. Kleijn and K.K. Paliwal (Eds.), Speech Coding and Synthesis, Elsevier,
Amsterdam, 1995.
* John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to Speech: The MITalk
System", Cambridge University Press, 1987.
Survey of the State of the Art in Human Language Technology Report edited by Ronald
A. Cole et al. with a section on Text-to-Speech Technologies.
WWW searchable online bibliography for Phonetics and Speech Technology with more
than 8000 entries.
Provided by Institut für Phonetik at Johann Wolfgang Goethe-Universität
Frankfurt.
Most of the following are links to WWW pages with demonstrations of speech
synthesis. Plenty more links are included in the detailed list of speech synthesis
software/hardware in Q5.5.
YorkTalk
Loughborough Sound Images
University of Birmingham - FDFS
Eurovocs
DECtalk
AT&T Bell Labs Synthesiser
Pavarobotti
WWW demo of the Pavarobotti synthesis technology developed at the National
Center for Voice and Speech
Say...
WWW demo of the rsynth speech synthesis software. The WWW capability was
implemented by Axel Belinfante.
ICP-Grenoble
CNET-Lannion (with TD-PSOLA)
KTH-Stockholm
Universite-Mons - several versions
AT&T Bell Laboratories Voices
WWW interface to the Demo of the Laureate speech synthesis system - not yet
commercially available. (this link may be good but it gives odd error messages)
Please email any updates, corrections or additions to the following list. The range of
commercially available synthesis software is growing rapidly so any help in keeping up
to date will be appreciated.
IN THE FAQ...
AsTeR
BeSTspeech from Berkeley Speech Technologies, Inc., (BST)
TheBigMouth
AsTeR
Platform: UNIX
Description:
TTS front-end program which encodes structural information about documents in
speech synthesis. For more information check out:
http://www.research.digital.com/CRL/personal/raman/aster/aster-toplevel.html
Platform: ?
Description: BeSTspeech reads ASCII text with no vocabulary limits. Available for Dutch,
English (male and female), French, German, Italian, Portuguese, Spanish, Arabic,
Cantonese, Japanese, Korean, Malay, Mandarin and Russian.
Price: ?
Contact: Berkeley Speech Technologies, Inc.
2246 Sixth Street, Berkeley, California 94710, USA
Ph: (510) 841-5083, Fax: (510) 841-5093
Email: webmaster@bst.com
WWW
Platform: NeXT
Description: Text to speech program based on concatenation of pre-recorded speech
segments. NeXT equivalent of "Speak" for Suns.
Availability: try NeXT archive sites such as sonata.cc.purdue.edu.
Creative TextAssist
Platform: Windows
Description: Based on DECtalk speech synthesis. A detailed technical description of
TextAssist is provided on the Creative WWW pages.
Availability: Creative TextAssist is bundled with most (all?) Creative Sound Blaster
audio cards.
Platform: Windows
Description: The TextAssist API (TAAPI) is created for Microsoft Windows 3.1x and
Windows 95 developers who intend to develop 16-bit Text-to-Speech software
applications using Creative's TextAssist speech engine. It supports direct control of
speech output characteristics, concurrent playback of text-to-speech and wave files,
foreign language support, speech synchronization, and exception dictionaries. It also
includes a voice editing tool for creating new custom voices, a Visual Basic Custom
Control for high-level text-to-speech support in Visual Basic and other languages and
some sample programs.
Platform: PC
Description: CSRE is a software system which includes an implementation of the
Klatt speech synthesizer. See the CSRE entry in Q1.9 and the AVAAZ WWW pages for
more detail.
Platform: Windows NT, Alpha with Digital UNIX and RS232 ports
Description:
Converts ordinary text into natural-sounding, intelligible speech. Provides
personalized voices, and extensive user controls. DECtalk technology is available
for the following packaging options.
Pricing:DECtalk-Speech-Synthesis
More Information:
Digital Equipment Corporation WWW pages:
Ph: 1-800-DIGITAL
DECtalk Software
More Information:
Digital Equipment Corporation WWW pages:
DECtalk Software page:
Ph: 1-800-DIGITAL
Eloquence
Uses high-level linguistic parsing, which obviates the need for a huge dictionary.
Handles numbers, acronyms, currency, etc. Includes a set of annotation symbols,
for placing stress on particular words, expressing excitement/boredom, etc. Also
allows phonetic input. Support for Windows DDL.
Produces male and female voices for General American English. Dialects under
development include Alabama, Brooklyn, and Boston.
Price:
Flexible license agreements on application.
Availability:
Eloquent Technology, Inc.
2389 North Triphammer Road
Ithaca, NY 14850
Ph: (607) 266-7020 Fax: (607) 266-7030
Email: eti@plab.dmll.cornell.edu
Requirements:
Requires GNU FSF Emacs 19 (version 19.23 or later) and TCLX 7.3B (Extended
TCL) to run Emacspeak.
Availability:
Eurovocs
Contact:
Technologie & Revalidatie
Postbus 128, B-9000 Gent, Belgium
Ph: +32-9-264 33 97, Fax: +32-9-264 35 94
E-mail: noe@elis.rug.ac.be WWW page:
HADIFIX
Platform: Windows
Description:
German speech synthesis system developed at the Institute for Communications
Research and Phonetics, University of Bonn. Provides conversion of input text to
phonemes, automatic prediction of stress, phrasing and pitch, and speech
generation by concatenation of small units of natural speech. Demisyllables and
similar units are used; they comprise all consonants before the vowel and the
beginning of the vowel (initial demisyllable) or the end of the vowel and the
following consonants (final demisyllable). For example, the word 'Strolch' is
formed by concatenating 'Stro' and 'olch'.
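The demisyllable split described above can be sketched in a few lines of Python. This is only an illustration: it works on spelling rather than on phonemes, and the vowel set and function name are assumptions, not part of HADIFIX.

```python
# Orthographic stand-in for vowels; a real system would work on phonemes.
VOWELS = set("aeiouäöüy")


def demisyllables(syllable):
    """Split one syllable into (initial, final) demisyllables.

    The initial unit is the onset consonants plus the vowel; the final unit
    is the vowel plus the following consonants, so the vowel nucleus appears
    in both halves and the cut falls inside the vowel.
    """
    positions = [i for i, ch in enumerate(syllable) if ch.lower() in VOWELS]
    if not positions:
        raise ValueError(f"no vowel found in {syllable!r}")
    first, last = positions[0], positions[-1]
    return syllable[: last + 1], syllable[first:]


print(demisyllables("Strolch"))  # ('Stro', 'olch')
```

Sharing the vowel between both units keeps the consonant-vowel and vowel-consonant transitions inside recorded material, which is the point of cutting within the vowel.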
Demo:
Windows demo software available. Limited to synthesis of one short text (text.txt)
at a time. Speech format limitations too. 1.3MB file.
WWW page
On-line demo
Description:
Multilingual Text-to-speech systems, languages available: American English,
British English, German, French, Spanish, Italian, Swedish, Norwegian, Icelandic,
Danish and Finnish.
Speech manager.
Product name: INFOVOX 700, DESKTOP UNIT
Product description: Desktop unit with built-in Infovox 600 to be connected to any
computer or terminal via an RS 232-C serial interface. Built-in loudspeaker and
rechargeable battery for 4 hours use, and control knobs for continuous control of
speech volume and speed.
Platform: any
Speech manager
Product name: INFOVOX 650, OEM BOARD
Product description: OEM board built with CMOS ICs. Language and control
program are stored in on-board memory.
❍ Platform: any, Interface: 9 pole D-SUB (RS 232-C) 300-9600 Baud
Speech manager
Product name: INFOVOX 750, DESKTOP UNIT
Product description: Desktop unit with built-in Infovox 650 to be connected to any
computer or terminal via an RS 232-C serial interface. Built-in loudspeaker and
rechargeable battery for 5 hours use, and a control knob for continuous control of
speech volume.
Platform: any
Speech manager
Product name: Infovox 210, software for Apple Macintosh
Product description: Software-based text-to-speech conversion. Produces 16 bit
and 8 bit sound. Delivered on 3.5" diskettes with user lexicon and complete
documentation.
Description:
IPOX is an experimental, all-prosodic speech synthesizer, developed by Arthur
Dirksen and John Coleman. IPOX is freely available (after registration) for
evaluation and non-profit research purposes.
Requirements:
PC (preferably a fast 486) running Windows 3.1 or higher. Sound output requires a
16-bit Windows-compatible sound card
Availability: By WWW
JSRU
Contact:
Dr. E.Lewis eric.lewis@bristol.ac.uk
Klatt-style synthesiser
Platform: Unix
Cost: Free
Description:
Software posted to comp.speech in late 1992.
Availability:
By ftp from the comp.speech ftp site
Platform: Unix
Description:
The KPE80 program provides a graphical interface for the implementation of the
Klatt 1980 formant synthesiser written by Jon Iles and Nick Ing-Simmons. It was
inspired by IGE, a piece of code written by Rob Fletcher.
Technical Desc.:
It is comprised of an X-Window interface and version 3.03 of the synthesiser code.
The interface allows users to display and edit Klatt parameters using a graphical
display which includes the time-amplitude waveform of both the original speech
and its synthetic copy, and some signal analysis facilities. Most of the work in
choosing the parameter values to produce the synthetic copy has to be done by
the user. KPE will estimate the fundamental frequency contour from an original
token; this estimate will need to be amended where errors occur. It is possible to
specify the formant trajectories with some precision by overlaying the appropriate
formant frequency parameter tracks on the spectrogram of the target waveform. A
number of facilities exist to help in the refinement of parameter values: original
and synthetic waveforms can be compared aurally, spectrally, and
spectrographically using built-in speech analysis facilities.
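The source does not say how KPE estimates the fundamental frequency contour. As a rough illustration of what such an estimate involves, the following Python sketch uses plain autocorrelation peak picking on a single frame, a common textbook technique swapped in here for illustration. The function name, search range, and frame length are all assumptions.

```python
import math


def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame.

    Compares a fixed-length window with shifted copies of itself and picks
    the lag with the highest correlation. Returns 0.0 if no positive
    correlation is found (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    window = len(frame) - max_lag
    if window <= 0:
        raise ValueError("frame too short for the requested fmin")
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Hypothetical test signal: a 100 Hz sine, 50 ms long, sampled at 8 kHz.
rate = 8000
frame = [math.sin(2 * math.pi * 100 * n / rate) for n in range(400)]
print(estimate_f0(frame, rate))  # a value close to 100 Hz
```

Simple estimators like this one make octave and voicing errors on real speech, which is consistent with the note above that the estimate needs to be amended by hand where errors occur.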
File formats:
KPE will read RIFF (.wav) files and SFS files. (SFS is a suite of speech-signal
processing programs available free from Phonetics and Linguistics, UCL.)
Availability:
❍ KPE for SunOs 4.1.3 (statically compiled libraries)
Platform: UNIX
Description: Experimental software which learns text to phoneme translation from
examples using decision-tree-like data structures. It is based on the assumption that
each letter can correspond to different phoneme strings depending on the context.
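The idea of learning context-dependent letter-to-phoneme mappings from examples can be sketched as follows. This is not the algorithm used by the software above: it replaces the decision-tree-like structure with a simple back-off table over letter contexts, and the training words, phoneme symbols, and letter-by-letter alignments are invented for illustration.

```python
from collections import Counter, defaultdict


def contexts(word, i):
    """Yield context keys for letter i, most specific first."""
    left = word[i - 1] if i > 0 else "#"
    right = word[i + 1] if i + 1 < len(word) else "#"
    yield (left, word[i], right)   # both neighbours
    yield (None, word[i], right)   # right neighbour only
    yield (None, word[i], None)    # the letter alone


def train(aligned_examples):
    """aligned_examples: iterable of (word, [phoneme string per letter])."""
    table = defaultdict(Counter)
    for word, phones in aligned_examples:
        if len(word) != len(phones):
            raise ValueError(f"alignment length mismatch for {word!r}")
        for i, phone in enumerate(phones):
            for key in contexts(word, i):
                table[key][phone] += 1
    return table


def transcribe(table, word):
    """Pick, per letter, the most frequent phoneme in the most specific known context."""
    out = []
    for i in range(len(word)):
        for key in contexts(word, i):
            if key in table:
                out.append(table[key].most_common(1)[0][0])
                break
        else:
            out.append("?")  # letter never seen in training
    return [p for p in out if p]  # empty strings stand for silent letters


# Hypothetical training data: one phoneme string (possibly empty) per letter.
EXAMPLES = [
    ("cat", ["k", "ae", "t"]),
    ("city", ["s", "ih", "t", "iy"]),
    ("cot", ["k", "aa", "t"]),
    ("ace", ["ey", "s", ""]),
]
model = train(EXAMPLES)
print(transcribe(model, "cit"))  # ['s', 'ih', 't']
```

The letter "c" maps to "s" before "i" and to "k" before "a" or "o", which is the kind of context dependence the description refers to.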
Lernout & Hauspie have three TTS products. The functionality of the products is similar;
however, they differ in hardware implementation and other details, as described
below.
L&H tts2000/T: TTS for the Telephony and Telecommunications Market
L&H tts2000/M: TTS for the Computer and Multimedia Market
L&H tts3000/C: TTS for the Business and Consumer Electronics Market
Description:
Text to Speech (TTS) software based on parameterized segment concatenation
(diphones, triphones and tetraphones) algorithms. Available for US English,
German, Dutch, French, Spanish (Castilian), Italian and Korean.
General features include:
❍ The control of volume, speech rate and speech pitch.
❍ The use of control sequences to customize TTS output (adding pauses, using
phonetic input, etc.).
❍ Switching between languages at run time.
❍ Input formats: orthographic input, phonetic input, phonetic input with prosodic
information.
tts2000/T
❍ Output formats: 8 bit mu-law PCM, 8 bit A-law PCM, 16 bit linear PCM.
tts2000/M
❍ Output formats: 8/16 bit wave format, 8 bit mu-law PCM, 8 bit A-law PCM, 16 bit
linear PCM.
❍ Sampling Frequency: 8/10/11.025 kHz
Motorola 68040
❍ Two processor platform examples: {Intel 386/486/Pentium or Motorola 68030} and
Platform: IBM-Compatible
Description: The L&H Text-to-Speech software developers kit lets you integrate
text-to-speech technology into your own or existing PC applications under Microsoft
Windows 3.1. This software will allow conversion of written text into clear,
human-sounding synthetic speech.
Requirements:
❍ IBM-compatible PC 386 DX/33, 8Mb RAM
MacinTalk
Platform: Macintosh
Cost: Free
Description: Formant based speech synthesis. There is also a program called "tex-edit"
which apparently can pronounce English sentences reasonably using Macintalk.
Note: MacinTalk doesn't run reliably on Macintoshes with new sound hardware under the
latest OS (System 7.1 w/HUD 2.0). More recent software is listed above.
Availability:
By anonymous ftp from many archive sites (have a look on archie if you can). tex-
edit is on many of the same sites.
❍ http://www.riken.go.jp/archives/mac/umich/sound/speech/00index.txt
❍ http://jumbo.com/util/mac/speech/
This article by my friend Denise Lance will give you some ideas on the more modern
speech offerings of Apple/Macintosh. When you have finished reading the article (there
are some appropriate notes to read) you can also download English_Text-to-Speech
from there.
Description:
Monologue is a software program that reads text from the clipboard in 16- or
32-bit Windows applications. It can be found as a bundled product with many sound
cards and multimedia general purpose computer systems. Monologue can add
the element of speech to virtually any text oriented application. Any
pronounceable combination of letters and numbers will be spoken clearly. It can
be applied to tasks such as eyes-free proofreading, data verification (e.g.
spreadsheets), reading E-mail and more. User-changeable parameters provide
control over the sound quality by allowing for changes in pitch, and the speed of
speech. An exception dictionary saves preferred pronunciation of words and
abbreviations.
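How an exception dictionary might be applied before synthesis can be sketched in Python. This is not Monologue's implementation or file format; the function name, the dictionary entries, and the matching rules are assumptions for illustration.

```python
import re


def apply_exceptions(text, exceptions):
    """Replace words and abbreviations that have a preferred pronunciation.

    Keys are lower-case. A token is first looked up with its trailing
    period (for abbreviations such as "dr."), then without it.
    """
    def substitute(match):
        token = match.group(0)
        key = token.lower()
        if key in exceptions:
            return exceptions[key]
        bare = key.rstrip(".")
        if bare in exceptions:
            # Keep the sentence-final period that followed the word.
            return exceptions[bare] + token[len(bare):]
        return token

    return re.sub(r"[A-Za-z]+\.?", substitute, text)


# Hypothetical entries: spellings the synthesizer should read instead.
EXCEPTIONS = {"dr.": "doctor", "sql": "sequel"}
print(apply_exceptions("Dr. Lee wrote the SQL.", EXCEPTIONS))
# doctor Lee wrote the sequel.
```

Running the substitution before the text reaches the speech engine keeps user preferences separate from the engine's own pronunciation rules.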
Monologue Win32 now includes support for the Microsoft SAPI. Monologue male
"SpeechFonts" are available for US English, British English, German, French,
Latin American Spanish, Italian. A US English Female SpeechFont is also
available. For more detailed information and examples go to the First Byte WWW
pages.
Availability: Currently bundled with many sound cards and multimedia general purpose
computer systems. For pricing, licensing details, and release information see the First
Byte WWW pages or email info@firstbyte.davd.com.
Platform: Amiga
Description:
A replacement for the Commodore-supplied "translator.library" which is a part of
the Narrator speech synthesis package. It implements multi-lingual text-to-speech
for an Amiga. The library allows the user to specify which language the text to be
spoken should be treated as. This can be done by setting the default language
or by including markup codes in the text in a similar way to LaTeX or HTML, e.g.
"\french{Bonjour}". There is currently support for American English, British
English, Swedish, Maori, Finnish, German, Icelandic, Klingon, Polish, Italian, and
Welsh.
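The markup convention can be illustrated with a short Python sketch that splits text into per-language segments. Only the "\french{Bonjour}" form comes from the description above; the function name, the default language label, and the lack of nesting support are assumptions of this sketch, not the library's behaviour.

```python
import re

# Matches \language{text}; nested braces are not handled in this sketch.
MARKUP = re.compile(r"\\([A-Za-z]+)\{([^{}]*)\}")


def split_by_language(text, default="english"):
    """Return a list of (language, text) segments in reading order."""
    segments = []
    pos = 0
    for match in MARKUP.finditer(text):
        if match.start() > pos:
            segments.append((default, text[pos:match.start()]))
        segments.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        segments.append((default, text[pos:]))
    return segments


print(split_by_language(r"He said \french{Bonjour} to us."))
# [('english', 'He said '), ('french', 'Bonjour'), ('english', ' to us.')]
```

Each segment could then be handed to the text-to-phoneme rules for its language.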
Availability:
The library (but not source) is available by anonymous ftp from Aminet
More Information: is available on the WWW
Narrator
Platform: Amiga
Description:
Formant based speech synthesis. Includes an English-to-phoneme translation
library, and a SPEAK: pseudo-device for speech output.
Hardware: Standard Amiga hardware
Availability: Part of AmigaOS
See Also: The Narrator Translation library
TextToSpeech Kit
Misc:
The TextToSpeech Kit comes in two packages: the Developer Kit and the User Kit.
The Developer Kit enables developers to build and test applications which
incorporate text-to-speech. It includes the TextToSpeech Server, the
Hardware:
Uses standard NeXT Computer hardware.
Cost:
❍ TextToSpeech User Kit: $175 CDN ($145 US)
Platform: SUN SPARC, DECstation 5000. Written in C, and therefore portable to other
UNIX platforms. Some successful ports: HP, RS-6000, PC-Unix [Linux].
Description:
Sophisticated speech synthesis package. Has text preprocessing (for
abbreviations, numbers), acronym rules, and human-like spelling routines.
Natural-sounding synthesis based on demisyllable concatenation. Has high
accuracy for pronunciation of names of people, places and businesses in
America; good accuracy for English text; rules for stress and intonation marking;
various methods of user control and customization at most stages of processing.
A new version of the ORATOR system is under development. Both ORATOR and
this new "ORATOR II" system are capable of general text synthesis. The ORATOR
II system has a more natural-sounding voice.
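Text preprocessing for numbers, as mentioned in the description, can be illustrated with a small Python sketch. It is not ORATOR's method; the function names and the choice to expand only whole numbers from 0 to 999 are assumptions made to keep the example short.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999 in English."""
    if not 0 <= n < 1000:
        raise ValueError("only 0..999 is handled in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_numbers(text):
    """Replace stand-alone 1-3 digit numbers; longer digit strings are left alone."""
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)


print(expand_numbers("Track 9 leaves at 42 past."))
# Track nine leaves at forty-two past.
```

A production front end also has to decide between readings such as years, phone numbers, and currency, which this sketch does not attempt.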
More detailed information plus examples of ORATOR synthesis are available on the
ORATOR WWW pages
Misc 2: Examples of Orator are also available on the University of Birmingham Speech
Platform: Windows
Description:
PAM is a talking personal assistant and text reader application. It uses the
ProVoice TTS package. PAM will verbally advise about appointments and
reminder messages at specified times during the day. It can read text files,
clipboard text, and text sent in DDE messages. Using the full verbal interface,
PAM can be used by visually challenged individuals. Shareware - thirty day free
trial.
Description: The ProVerbe Speech Engine produces natural sounding speech from
written text. Naturalness is achieved by using the TD-PSOLA process from CNET
(France Telecom's research lab), which is based on the concatenation of elementary
speech units (including diphones). Supported languages are British English, German,
French and Spanish. For multi-channel applications Elan Informatique also provides
hardware platforms. Elan Informatique provides an SDK reference document
(sdken.exe: WinWord6 format in a self-extracting compressed format).
Demo versions:
❍ Telephone demonstration: +33-61 17 6701
❍ Anonymous ftp
The directory includes the following demos.
❍ PVBSEDP.zip: French male voice (4.3MB)
The directory also includes synthesis samples for a French male voice, French female
voice, English male voice, and a German male voice. The readme file in the directory
describes the memory requirements for the demos.
A CD-ROM with all these demonstrations is available. To request it, please email Elan
Informatique.
Platform: ProVoice Developer's Toolkits are available for DOS, Windows 3.1, Windows
95, Windows NT, OS/2, and Macintosh.
Description:
ProVoice allows programmers to add synthesized speech to their applications.
Your program passes text strings to the ProVoice speech engine that translates
text into audible speech. Male and/or female "SpeechFonts" are available for
several languages: English, French, German, British English, Italian, and
Spanish.
Necessary tools and examples are provided for programmers to manipulate the
ProVoice speech technology, including installation instructions, extensive
sample programs, and complete documentation. In addition, sample code is
provided on disk to illustrate speech programming techniques.
Note 1: First Byte will perform custom work for embedded systems.
Note 2: ProVoice Windows includes support for the Microsoft SAPI. It will speak
through any Windows-supported wave audio device.
For more detailed information and examples go to the First Byte WWW page.
See also: Monologue for Windows from First Byte
rsynth
Price: Free
Misc: Axel Belinfante has implemented a WWW rsynth demo
Availability: anonymous ftp #1 or anonymous ftp #2
Platform: SGI
Description: The SGI Developer Toolbox 4.0 CDROM contains a basic public domain
text-to-speech program in the publics/speak directory. The directory includes man pages
and source.
SIMTEL
A wide range of speech related software, sound-blaster software and signal processing
software for PCs is available on SimTel and its mirror sites. It can be obtained by ftp
from:
● ftp://www.cdrom.com/pub/simtelnet/msdos/sound/
Note: Voicemaker - The archives include the program Voicemaker which synthesises
speech.