

 Accent (Linguistics)
 Acoustic Phonetics
 Belt (Music)
 Histology Of Vocal Folds
 Intelligibility (Communication)
 Lombard Effect
 Manner Of Articulation
 Paralanguage: Nonverbal Voice Cues In
 Phonation
 Phonetics
 Voice Change In Boys
 Speaker Recognition
 Speech Synthesis
 Vocal Loading
 Vocal Rest
 Vocal Range
 Vocal Warm Up
 Vocology
 Voice Analysis
 Voice Disorders
 Voice Frequency
 Voice Organ
 Voice Pedagogy
 Voice Projection
 Voice Synthesis
 Voice Types (Singing Voices)
Use Of The Web By People With Disabilities
Human Voice 
The human voice consists of sound made by a human being using the vocal
folds for talking, singing, laughing, crying, screaming, etc. Human voice is specifically that
part of human sound production in which the vocal folds (vocal cords) are the primary
sound source. Generally speaking, the mechanism for generating the human voice can be
subdivided into three parts: the lungs, the vocal folds within the larynx, and the articulators.
The lungs (the pump) must produce adequate airflow and air pressure to vibrate the vocal folds
(this air pressure is the fuel of the voice). The vocal folds (vocal cords) are a vibrating valve
that chops up the airflow from the lungs into audible pulses that form the laryngeal sound
source. The muscles of the larynx adjust the length and tension of the vocal folds to ‘fine
tune’ pitch and tone. The articulators (the parts of the vocal tract above the larynx
consisting of tongue, palate, cheek, lips, etc.) articulate and filter the sound emanating from
the larynx and to some degree can interact with the laryngeal airflow to strengthen it or
weaken it as a sound source.
The vocal folds, in combination with the articulators, are capable of producing highly
intricate arrays of sound. The tone of voice may be modulated to suggest emotions such
as anger, surprise, or happiness. Singers use the human voice as an instrument for
creating music.

Voice types and the folds (cords) themselves

Adult men and women have vocal folds of
different sizes, reflecting the male-female
differences in larynx size. Adult males usually
have lower-pitched voices and larger
folds. The male vocal folds (which would be
measured vertically in the opposite diagram)
are between 17 mm and 25 mm in length. The
female vocal folds are between 12.5 mm and
17.5 mm in length.

A labeled anatomical diagram of the vocal folds or cords.

As seen in the illustration, the folds are located just above the trachea (the windpipe,
which travels from the lungs). Food and drink do not pass through the cords but instead pass
through the esophagus, a separate tube. The two tubes are separated by the epiglottis, a
"flap" that covers the opening of the trachea while swallowing.
The folds in both sexes are within the larynx. They are attached at the back (side nearest
the spinal cord) to the arytenoid cartilages, and at the front (side under the chin) to
the thyroid cartilage. They have no outer edge as they blend into the side of the breathing
tube (the illustration is out of date and does not show this well) while their inner edges or
"margins" are free to vibrate (the hole). They have a three-layer construction of
an epithelium, vocal ligament, then muscle (vocalis muscle), which can shorten and bulge
the folds. They are flat triangular bands and are pearly white in color. Above both sides of
the vocal cord is the vestibular fold or false vocal cord, which has a small sac between its
two folds (not illustrated).
The difference in vocal fold size between men and women means that they have differently
pitched voices. Genetics also causes variation within each sex, with
men's and women's singing voices being categorized into types. For example, among men,
there are bass, baritone, tenor and countertenor (ranging from E2 to even F6), and among
women, contralto, mezzo-soprano and soprano (ranging from F3 to C6). There are
additional categories for operatic voices; see voice type. This is not the only source of
difference between male and female voice. Men, generally speaking, have a larger vocal
tract, which essentially gives the resultant voice a lower-sounding timbre. This is mostly
independent of the vocal folds themselves.
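The note names quoted above can be turned into approximate frequencies. The following is a minimal sketch, assuming twelve-tone equal temperament with A4 tuned to 440 Hz; neither assumption comes from the text, and the function name note_to_hz is purely illustrative.

```python
# Convert scientific pitch names (e.g. "E2") to frequencies in hertz.
# Assumption: twelve-tone equal temperament with A4 = 440 Hz.

SEMITONES_FROM_C = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(note: str, a4_hz: float = 440.0) -> float:
    """Return the equal-tempered frequency of a note such as 'E2' or 'F#3'."""
    letter, rest = note[0].upper(), note[1:]
    offset = SEMITONES_FROM_C[letter]
    if rest.startswith("#"):
        offset += 1
        rest = rest[1:]
    elif rest.startswith("b"):
        offset -= 1
        rest = rest[1:]
    octave = int(rest)
    # MIDI-style numbering: C4 = 60, A4 = 69.
    midi = 12 * (octave + 1) + offset
    return a4_hz * 2 ** ((midi - 69) / 12)


if __name__ == "__main__":
    # Range limits quoted in the text for men's and women's voice types.
    for label, low, high in [("men", "E2", "F6"), ("women", "F3", "C6")]:
        print(f"{label}: {low} = {note_to_hz(low):.1f} Hz, "
              f"{high} = {note_to_hz(high):.1f} Hz")
```

Under these assumptions the quoted limits span roughly four octaves for men (including countertenor) and about two and a half for women.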
Voice modulation in spoken language
Human spoken language makes use of the ability of almost all persons in a given society to
dynamically modulate certain parameters of the laryngeal voice source in a consistent
manner. The most important communicative, or phonetic, parameters are the voice pitch
(determined by the vibratory frequency of the vocal folds) and the degree of separation of
the vocal folds, referred to as vocal fold adduction (coming together) or abduction
(separating).
The ability to vary the ab/adduction of the vocal folds quickly has a strong genetic
component, since vocal fold adduction has a life-preserving function in keeping food from
passing into the lungs, in addition to the covering action of the epiglottis. Consequently, the
muscles that control this action are among the fastest in the body. Children can learn to use
this action consistently during speech at an early age, as they learn to speak the difference
between utterances such as "apa" (having an abductory-adductory gesture for the p) and
"aba" (having no abductory-adductory gesture). Surprisingly enough, they can learn to do
this well before the age of two by listening only to the voices of adults around them who
have voices much different from their own, and even though the laryngeal movements
causing these phonetic differentiations are deep in the throat and not visible to them.
If an abductory movement or adductory movement is strong enough, the vibrations of the
vocal folds will stop (or not start). If the gesture is abductory and is part of a speech sound,
the sound will be called voiceless. However, voiceless speech sounds are sometimes better
identified as containing an abductory gesture, even if the gesture was not strong enough to
stop the vocal folds from vibrating. This anomalous feature of voiceless speech sounds is
better understood if it is realized that it is the change in the spectral qualities of the voice as
abduction proceeds that is the primary acoustic attribute that the listener attends to when
identifying a voiceless speech sound, and not simply the presence or absence of voice
(periodic energy).
An adductory gesture is also identified by the change in voice spectral energy it produces.
Thus, a speech sound having an adductory gesture may be referred to as a "glottal stop"
even if the vocal fold vibrations do not entirely stop.
Other aspects of the voice, such as variations in the regularity of vibration, are also used for
communication, and are important for the trained voice user to master, but are more rarely
used in the formal phonetic code of a spoken language.
Physiology and vocal timbre
The sound of each individual's voice is unique, not only because of the actual shape
and size of an individual's vocal cords but also due to the size and shape of the rest of that
person's body, especially the vocal tract, and the manner in which the speech sounds are
habitually formed and articulated. (It is this latter aspect of the sound of the voice that can
be mimicked by skilled performers.) Humans have vocal folds that can loosen, tighten, or
change their thickness, and over which breath can be transferred at varying pressures. The
shape of chest and neck, the position of the tongue, and the tightness of otherwise
unrelated muscles can be altered. Any one of these actions results in a change in pitch,
volume, timbre, or tone of the sound produced. Sound also resonates within different parts
of the body, and an individual's size and bone structure can somewhat affect the sound
produced.
Singers can also learn to project sound in certain ways so that it resonates better within
their vocal tract. This is known as vocal resonation. The primary method for singers to
accomplish this is through the use of the singer's formant, which has been shown to be a
resonance added to the normal resonances of the vocal tract above the frequency range of
most instruments and so enables the singer's voice to carry better over musical
accompaniment. Another major influence on vocal sound and production is the function of the
larynx, which people can manipulate in different ways to produce different sounds. These
different kinds of laryngeal function are described as different kinds of vocal registers.
Vocal registration
Vocal registration refers to the system of vocal registers within the human voice. A
register in the human voice is a particular series of tones, produced in the same vibratory
pattern of the vocal folds, and possessing the same quality. Registers originate
in laryngeal functioning. They occur because the vocal folds are capable of producing
several different vibratory patterns. Each of these vibratory patterns appears within a
particular range of pitches and produces certain characteristic sounds. The term
register can be somewhat confusing as it encompasses several aspects of the human voice.
The term register can be used to refer to any of the following:
 A particular part of the vocal range such as the upper, middle, or lower registers.
 A resonance area such as chest voice or head voice.
 A phonatory process
 A certain vocal timbre
 A region of the voice that is defined or delimited by vocal breaks.
 A subset of a language used for a particular purpose or in a particular social setting.
In linguistics, a register language is a language that combines tone and
vowel phonation into a single phonological system.
Within speech pathology the term vocal register has three constituent elements: a certain
vibratory pattern of the vocal folds, a certain series of pitches, and a certain type of sound.
Speech pathologists identify four vocal registers based on the physiology of laryngeal
function: the vocal fry register, the modal register, the falsetto register, and the whistle
register. This view is also adopted by many vocal pedagogists.
Vocal resonation
Vocal resonation is the process by which the basic product of phonation is enhanced in
timbre and/or intensity by the air-filled cavities through which it passes on its way to the
outside air. Various terms related to the resonation process include amplification,
enrichment, enlargement, improvement, intensification, and prolongation; although in
strictly scientific usage acoustic authorities would question most of them. The main point to
be drawn from these terms by a singer or speaker is that the end result of resonation is, or
should be, to make a better sound. There are seven areas that may be listed as possible
vocal resonators. In sequence from the lowest within the body to the highest, these areas
are the chest, the tracheal tree, the larynx itself, the pharynx, the oral cavity, the nasal
cavity, and the sinuses.
Influences of the human voice
The twelve-tone musical scale, upon which some of the music in the world is based, may
have its roots in the sound of the human voice during the course of evolution, according to
a study published by the New Scientist. Analysis of recorded speech samples found peaks in
acoustic energy that mirrored the distances between notes in the twelve-tone scale.
Voice disorders
There are many disorders that affect the human voice; these include speech impediments,
and growths and lesions on the vocal folds. Talking for excessively long periods of time
causes vocal loading, which is stress inflicted on the speech organs. When vocal injury
occurs, an ENT specialist may often be able to help, but the best treatment is the prevention
of injuries through good vocal production. Voice therapy is generally delivered by a
speech-language pathologist.
Hoarseness or breathiness that lasts for more than two weeks is a common symptom of an
underlying voice disorder and should be investigated medically.

Range Of The Human Voice

Voice, Human (Range Of The). The range of the human voice is quite astounding, there
being about 9 perfect tones, but 17,592,186,044,515 different sounds; thus 14 direct
muscles, alone, or together, produce 16,383; 30 indirect muscles, ditto, 173,741,823, and
all in co-operation produce the number we have named; and these, independently of
different degrees of intensity. A man's voice ranges from bass to tenor, the medium being
what is called a barytone. The female voice ranges from contralto to soprano, the medium
being termed a mezzo-soprano, whereas a boy's voice is alto, or between a tenor and a
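The muscle counts quoted in this passage look like counts of non-empty combinations of muscles used "alone, or together". As a rough check, the sketch below computes 2^n - 1 for each group. This reading is an assumption, since the passage does not state how its numbers were derived, and the function name is illustrative.

```python
# Count the ways n muscles can act "alone, or together", read here as the
# number of non-empty subsets of n items: 2**n - 1.
# Assumption: this counting rule is inferred, not stated in the passage.


def nonempty_subsets(n: int) -> int:
    """Number of non-empty subsets of a set with n elements."""
    return 2 ** n - 1


direct = nonempty_subsets(14)         # 14 direct muscles
indirect = nonempty_subsets(30)       # 30 indirect muscles
combined = nonempty_subsets(14 + 30)  # all 44 muscles in co-operation

print(f"direct:   {direct:,}")
print(f"indirect: {indirect:,}")
print(f"combined: {combined:,}")
```

Under this reading the 14-muscle value agrees with the quoted 16,383, while the other two computed values differ slightly from the printed figures, which may reflect misprints in the source.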


Phonography. Phonography includes every method of writing by signs that represent
the sounds of the language. It differs from stenography in this respect: stenography uses
characters to represent words by their spelling, instead of their sound; hence phonography is
much the shortest and simplest mode of short-hand writing.

Accent (linguistics)
In linguistics, an accent is a manner of pronunciation of a language. An accent may be
associated with the region in which its speakers reside (a geographical or regional accent),
the socio-economic status of its speakers, their ethnicity, their caste or social class,
their first language (when the language in which the accent is heard is not their native
language), and so on.
Accents can be confused with dialects which are varieties of language differing
in vocabulary, syntax, and morphology, as well as pronunciation. Dialects are usually
spoken by a group united by geography or social status.
As human beings spread out into isolated communities, stresses and peculiarities develop.
Over time these can develop into identifiable accents. In North America, the interaction of
people from many ethnic backgrounds contributed to the formation of the different varieties
of North American accents. It is difficult to measure or predict how long it takes an accent
to form. Accents in the USA, Canada and Australia, for example, developed from
combinations of different accents and languages in various societies and from their effect
on the pronunciations of the British settlers. North American accents nevertheless remain
more distant, either as a result of time or of external or "foreign" linguistic interaction, such
as the Italian accent.
In many cases, the accents of non-English settlers from Great Britain and Ireland affected
the accents of the different colonies quite differently. Irish, Scottish and Welsh immigrants
had accents which greatly affected the vowel pronunciation of certain areas of Australia and
Children are able to take on accents relatively quickly. Children of immigrant families, for
example, generally have a more native-like pronunciation than their parents, though both
children and parents may have a noticeable non-native accent. Accents seem to remain
relatively malleable until a person's early twenties, after which a person's accent seems to
become more entrenched.
All the same, accents are not fixed even in adulthood. An acoustic analysis by Jonathan
Harrington of Queen Elizabeth II's Royal Christmas Messages revealed that the speech
patterns of even so conservative a figure as a monarch can continue to change over her
Non-native accents
Pronunciation is the most difficult part of a non-native language to learn. Most individuals
who speak a non-native language fluently speak it with an accent of their native tongue.
The most important factor in predicting the degree to which the accent will be noticeable (or
strong) is the age at which the non-native language was learned. The critical period theory
states that if learning takes place after the critical period (usually considered around
puberty) for acquiring native-like pronunciation, an individual is unlikely to acquire a native-
like accent. This theory, however, is quite controversial among researchers. Although many
subscribe to some form of the critical period, they either place it earlier than puberty or
consider it more of a critical “window,” which may vary from one individual to another and
depend on factors other than age, such as length of residence, similarity of the non-native
language to the native language, and the frequency with which both languages are
used. Nevertheless, children as young as 6 at the time of moving to another country often
speak with a noticeable non-native accent as adults. There are also rare instances of
individuals who are able to pass for native speakers even if they learned their non-native
language in early adulthood. However, neurological constraints associated with brain
development appear to limit most non-native speakers’ ability to sound native-like. Most
researchers agree that for adults, acquiring a native-like accent in a non-native language is
nearly impossible.
Social factors
When a group defines a standard pronunciation, speakers who deviate from it are often said
to "speak with an accent". However, everyone speaks with an accent. People from the United
States would "speak with an accent" from the point of view of an Australian, and vice versa.
Accents such as BBC English or General American or Standard American may sometimes be
erroneously designated in their countries of origin as "accentless" to indicate that they offer
no obvious clue to the speaker's regional or social background.
Certain accents are perceived to carry more prestige in a society than other accents. This is
often due to their association with the elite part of society. For example, in the United
Kingdom, Received Pronunciation of the English language is associated with the traditional
upper class. However, in linguistics, there is no differentiation among accents with regard to
their prestige, aesthetics, or correctness. All languages and accents are linguistically equal.
Accent Stereotyping and Prejudice
Stereotypes refer to specific characteristics, traits, and roles that a group and its members
are believed to possess. Stereotypes can be both positive and negative, although negative
ones are more common.
Stereotypes may result in prejudice, which is defined as having negative attitudes toward a
group and its members. Individuals with non-standard accents often have to deal with both
negative stereotypes and prejudice because of an accent. Researchers consistently show
that people with accents are judged as less intelligent, less competent, less educated,
having poor English/language skills, and unpleasant to listen to. [19][20] Not only do people with
standard accents subscribe to these beliefs and attitudes, but individuals with accents also
often stereotype against their own or others' accents.
Accent Discrimination
Discrimination refers to specific behaviors or actions directed at a group or its individual
members based solely on the group membership. In accent discrimination, one's way of
speaking is used as a basis for arbitrary evaluations and judgments. [21] Unlike other forms of
discrimination, there are no strong norms against accent discrimination in the general
society. Rosina Lippi-Green writes,
Accent serves as the first point of gate keeping because we are forbidden, by law and social
custom, and perhaps by a prevailing sense of what is morally and ethically right, from using
race, ethnicity, homeland or economics more directly. We have no such compunctions about
language, however. Thus, accent becomes a litmus test for exclusion, an excuse to turn
away, to refuse to recognize the other.
Speakers with accents often experience discrimination in housing and employment. [22][23] For
example, landlords are less likely to call back speakers who have foreign or ethnic accents,
and such speakers are more likely to be assigned by employers to lower-status positions than
are those with standard accents. In business settings, individuals with non-standard accents
are more likely to be evaluated negatively. Accent discrimination is also present in educational
institutions. For example, non-native speaking graduate students, lecturers, and professors
across college campuses in the US have been targeted as unintelligible because of their
accents. On average, however, students taught by non-native English speakers do not
underperform when compared to those taught by native speakers of English.
Studies have shown the perception of the accent, not the accent by itself, often results in
negative evaluations of speakers. In a study conducted by Rubin (1992), students listened
to a taped lecture recorded by the same native English speaker with a standard accent.
However, they were shown a picture of the lecturer who was either Caucasian or Asian.
Participants in the study who saw the Asian picture believed that they had heard an
accented lecturer and performed worse on a task measuring lecture comprehension.
Negative evaluations may reflect the prejudices rather than real issues with understanding
Acting and accents
Actors are often called upon to speak varieties of language other than their own. For
example, Missouri-born actor Dick Van Dyke attempted to imitate a cockney accent in the
film Mary Poppins. Similarly, an actor may portray a character of some nationality other
than his or her own by adopting into the native language the phonological profile typical of
the nationality to be portrayed – what is commonly called "speaking with an accent". One
example would be Viggo Mortensen's use of a Russian accent in his portrayal of Nikolai in
the movie Eastern Promises.
The perception or sensitivity of others to accents means that generalizations are passed off
as acceptable, such as Brad Pitt's Jamaican accent in Meet Joe Black. Angelina
Jolie attempted a Greek accent in the film Alexander that was said by critics to be
distracting. Gary Oldman has become known for playing eccentrics and for his mastery of
Accents may have associations and implications for an audience. For example,
in Disney films from the 1990s onward, English accents are generally employed to serve one
of two purposes: slapstick comedy or evil genius. Examples include Aladdin (the Sultan and
Jafar, respectively), The Lion King (Zazu and Scar, respectively), The Hunchback of Notre
Dame (Victor the Gargoyle and Frollo, respectively), and Pocahontas (Wiggins and Ratcliffe,
respectively - both of whom happen to be played by the same actor, American David Ogden
Legal implications
In the United States, Title VII of the Civil Rights Act of 1964 prohibits discrimination based
on national origin, which by implication covers accents. However, employers can avoid
liability by claiming that a person’s accent impairs communication skills that are necessary
to effective business operation. The courts often rely on the employer’s claims or use judges’
subjective opinions when deciding whether the (potential) employee’s accent would
interfere with communication or performance, without any objective proof that accent was
or might be a hindrance.
Kentucky's highest court in the case of Clifford vs. Commonwealth held that a white police
officer, who had not seen the black defendant allegedly involved in a drug transaction,
could, nevertheless, identify him as a participant by saying that a voice on an audiotape
"sounded black." The police officer based this "identification" on the fact that the defendant
was the only African American man in the room at the time of the transaction and that an
audio-tape contained the voice of a man the officer said “sounded black” selling crack
cocaine to a white informant planted by the police.

Acoustic phonetics
Acoustic phonetics is a subfield of phonetics which deals with acoustic aspects
of speech sounds. Acoustic phonetics investigates properties like the mean
squared amplitude of a waveform, its duration, its fundamental frequency, or other
properties of its frequency spectrum, and the relationship of these properties to other
branches of phonetics (e.g. articulatory or auditory phonetics), and to abstract linguistic
concepts like phones, phrases, or utterances.
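The properties named above can be illustrated on a synthetic signal. The following is a minimal sketch, assuming a sampled waveform held in a plain Python list; the autocorrelation pitch estimator, the 75 to 500 Hz search range, and all function names are illustrative choices rather than anything prescribed by the text.

```python
import math


def mean_squared_amplitude(samples):
    """Average of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)


def duration_seconds(samples, sample_rate):
    """Length of the signal in seconds."""
    return len(samples) / sample_rate


def estimate_f0(samples, sample_rate, f_min=75.0, f_max=500.0):
    """Estimate fundamental frequency from the autocorrelation peak.

    Assumption: the search range f_min..f_max is a hypothetical choice
    meant to bracket typical speaking pitch.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic "voiced" signal: a 200 Hz fundamental plus its second harmonic.
sample_rate = 8000
true_f0 = 200.0
samples = [
    0.6 * math.sin(2 * math.pi * true_f0 * n / sample_rate)
    + 0.3 * math.sin(2 * math.pi * 2 * true_f0 * n / sample_rate)
    for n in range(1600)
]

msa = mean_squared_amplitude(samples)
dur = duration_seconds(samples, sample_rate)
f0_est = estimate_f0(samples, sample_rate)
print(f"mean squared amplitude: {msa:.3f}")
print(f"duration: {dur:.2f} s")
print(f"estimated F0: {f0_est:.1f} Hz")
```

Real speech is noisier and less periodic than this test tone, so practical pitch trackers add windowing, normalization, and voicing decisions on top of this basic idea.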
The study of acoustic phonetics was greatly enhanced in the late 19th century by the
invention of the Edison phonograph. The phonograph allowed the speech signal to be
recorded and then later processed and analyzed. By replaying the same speech signal from
the phonograph several times, filtering it each time with a different band-pass filter,
a spectrogram of the speech utterance could be built up. A series of papers by Ludimar
Hermann published in Pflüger's Archiv in the last two decades of the 19th century
investigated the spectral properties of vowels and consonants using the Edison phonograph,
and it was in these papers that the term formant was first introduced. Hermann also played
back vowel recordings made with the Edison phonograph at different speeds to distinguish
between Willis' and Wheatstone's theories of vowel production.
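The idea of building a spectrogram by passing the same recording through one band-pass filter after another can be sketched digitally. The example below is only an analogy under stated assumptions: it substitutes the Goertzel recurrence for an analog band-pass filter, works on a synthetic signal, and uses hypothetical band frequencies and frame length. It is not a reconstruction of the historical procedure.

```python
import math


def band_energy(frame, sample_rate, freq_hz):
    """Energy of `frame` at freq_hz, via the Goertzel recurrence.

    Assumption: Goertzel stands in here for one narrow band-pass filter.
    """
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def spectrogram(samples, sample_rate, band_freqs, frame_len):
    """Return one row per time frame and one column per analysis band."""
    rows = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # "Replay" the same frame once per band, as with successive filters.
        rows.append([band_energy(frame, sample_rate, f) for f in band_freqs])
    return rows


sample_rate = 8000
bands = [250.0, 500.0, 1000.0, 2000.0]  # hypothetical analysis bands

# Synthetic signal: 0.1 s of a 500 Hz tone followed by 0.1 s of 2000 Hz.
signal = (
    [math.sin(2 * math.pi * 500.0 * n / sample_rate) for n in range(800)]
    + [math.sin(2 * math.pi * 2000.0 * n / sample_rate) for n in range(800)]
)

spec = spectrogram(signal, sample_rate, bands, frame_len=400)
strongest = [bands[row.index(max(row))] for row in spec]
print(strongest)
```

Each row of spec corresponds to one time slice and each column to one band, which is the same time-by-frequency layout a spectrogram displays.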
Further advances in acoustic phonetics were made possible by the development of
the telephone industry. (Incidentally, Alexander Graham Bell's father, Alexander Melville
Bell, was a phonetician.) During World War II, work at the Bell Telephone
Laboratories (which invented the spectrograph) greatly facilitated the systematic study of
the spectral properties of periodic and aperiodic speech sounds, vocal tract resonances and
vowel formants, voice quality, prosody, etc.
On a theoretical level, acoustic phonetics advanced rapidly when it became clear that speech
acoustics could be modeled in a way analogous to electrical circuits. Lord Rayleigh was
among the first to recognize that the new electric theory could be used in acoustics, but it
was not until 1941 that the circuit model was effectively used, in a book by Chiba and
Kajiyama called "The Vowel: Its Nature and Structure". (Interestingly, this book by
Japanese authors working in Japan was published in English at the height of World War II.)
In 1952, Roman Jakobson, Gunnar Fant, and Morris Halle wrote "Preliminaries to Speech
Analysis", a seminal work tying acoustic phonetics and phonological theory together. This
little book was followed in 1960 by Fant's "Acoustic Theory of Speech Production", which has
remained the major theoretical foundation for speech acoustic research in both the academy
and industry. (Fant was himself very involved in the telephone industry.) Other important
framers of the field include Kenneth N. Stevens, Osamu Fujimura, and Peter Ladefoged.
Belt (music)
Belting (or vocal belting) refers to a specific technique of singing by which a singer
produces a loud sound in the upper middle of the pitch range. It is often described as
a vocal register, although some dispute this since technically the larynx is not oscillating in a
unique way. Singers can use belting to convey heightened emotional states.
The term "belt" is sometimes mistakenly described as the use of chest voice in the higher
part of the voice. (The chest voice is a very general term for the sound and muscular
functions of the speaking voice, singing in the lower range and the voice used to shout. Still,
all those possibilities require help from the muscles in the vocal folds and a thicker closure
of the vocal folds. The term "chest voice" is therefore often a misunderstanding, as it
describes muscular work in the chest area of the body, but the "sound" described as
"chest voice" is also produced by work of the vocal folds.) However, the proper production of
the belt voice according to some vocal methods involves minimizing tension in the throat
and change of typical placement of the voice sound in the mouth, bringing it forward into
the hard palate.
It is possible to learn classical vocal methods like bel canto and to also be able to belt; in
fact, many musical roles now require it. The belt sound is easier for some than others, but
the sound is possible for classical singers, too. It requires muscle coordinations not readily
used in classically trained singers, which may be why some opera singers find learning to
belt challenging.
In order to increase the number of high notes one can belt, one must practice. This can be
by repeatedly attempting to hit the note in a melody line, or by using vocalise programs
utilizing scales. Many commercial learn-to-sing packages have a set of scales to sing along to
as their main offering, which the purchaser must practice with often to see improvement.
'Belters' are not exempt from developing a strong head voice, as the more resonant their
higher register in head voice, the better the belted notes in this range will be. Some belters
find that after a period of time focusing on the belt, the head voice will have improved and,
likewise, after a period of time focusing on the head voice, the belt may be found to have
There are many explanations as to how the belting voice quality is produced. When
approaching the matter from the bel canto point of view, it is said that the chest voice is
applied to the higher register. However, through studying singers who use a "mixed" sound,
practitioners have defined mixed sound as belting. One researcher, Jo Estill, has conducted
research on the belting voice, and describes the belting voice as an extremely muscular and
physical way of singing. When observing the vocal tract and torso of singers, while belting,
Estill observed:
 Minimal airflow (longer closed phase (70% or greater) than in any other type of
 Maximum muscular engagement of the torso (In Estill terms: Torso anchoring).
 Engagement of muscles in the head and neck in order to stabilize the larynx (in
Estill terms: Head and neck anchoring)
 A downwards tilt of the cricoid cartilage (An alternative option would be the thyroid
tilting backwards. Observations show a larger CT space).
 High positioning of the larynx
 Maximum muscular effort of the extrinsic laryngeal muscles, minimum effort at the
level of the true vocal folds.
 Narrowing of the aryepiglottic sphincter (the "twanger")
Possible dangers of belting
Use of belting without proper coordination can lead to forcing. Forcing can lead
consequently to vocal deterioration. Moderate use of the technique and, most importantly,
retraction of the ventricular folds while singing is vital to safe belting. Without proper
training in retraction, belting can indeed cause trauma to the vocal folds that requires the
immediate attention of a doctor.
Most tutors and some students of the method known as Speech Level Singing, created and
supported by Seth Riggs, regard belting as damaging to long term vocal health. They may
teach an alternative using a "mixed" or middle voice which can sound almost as strong, as
demonstrated by Aretha Franklin, Patti LaBelle, Celine Dion, Whitney Houston, Mariah
Carey, Lara Fabian, Ziana Zain, and Regine Velasquez. The subject of belting is a matter of
heated controversy among singers, singing teachers and methodologies.
Proponents of belting say that it is a "soft yell," and that if produced properly it can be
healthy. They say it does not require straining and is not damaging to the voice. However,
the larynx is higher than in classical technique, and many experts on the singing voice
believe that a high larynx position is both dangerous to vocal health and produces what
many find to be an unpleasant sound. According to master teacher David Jones, "Some of
the dangers are general swelling of the vocal cords, pre-polyp swelling, ballooning of
capillaries on the surface of the vocal cords, or vocal nodules. A high-larynxed approach to
the high voice taught by a speech level singing instructor who does not listen appropriately
can lead to one or ALL of these vocal disorders".
It is thought by some that belting will produce vocal nodules. This may be true if
belting is produced incorrectly. If the sound produced is a mixed head and chest sound
that safely approximates a belt, produced well, there may be no damage to the vocal folds.
As for the physiological and acoustical features of the metallic voices, a master's thesis has
drawn the following conclusions:
 No significant changes in frequency and amplitude of F1 were observed
 Significant increases in amplitudes of F2, F3 and F4 were found
 In frequencies for F2, metallic voice perceived as louder was correlated to increase in
amplitude of F3 and F4
 Vocal tract adjustments like velar lowering, pharyngeal wall narrowing, laryngeal
raising, aryepiglottic and lateral laryngeal constriction were frequently found.

Histology of the vocal folds

Histology is the study of the minute structure, composition, and function of
tissues. The histology of the vocal folds is the reason for vocal fold vibration.
Histoanatomy of the Glottis
The glottis is defined as the true vocal folds and the space between them. It is composed of
an intermembranous portion or anterior glottis, and an intercartilaginous portion or
posterior glottis. The border between the anterior and posterior glottises is defined by an
imaginary line drawn across the vocal fold at the tip of the vocal process of the arytenoid
cartilage. The anterior glottis is the primary structure of vocal fold vibration for phonation
and the posterior glottis is the widest opening between the vocal folds for respiration.
Thus, voice disorders often involve lesions of the anterior glottis. There are gradual changes
in stiffness between the pliable vocal fold and hard, hyaline cartilage of the arytenoid. The
vocal processes of the arytenoid cartilages form a firm framework for the glottis but are
made of elastic cartilage at the tip. Therefore, the vocal process of the arytenoid bends at
the elastic cartilage portion during adduction and abduction of the vocal folds.
Attachments of the Vocal Fold
The vibratory portion of the vocal fold in the anterior glottis is connected to the thyroid
cartilage anteriorly by the macula flava and anterior commissure tendon, or Broyles'
ligament. Posteriorly, this vibratory portion is connected to the vocal process of the
arytenoid cartilage by the posterior macula flava. The macula flava in newborn vocal folds is
important for the growth and development of the vocal ligament and layered structure of
the vocal folds. In the adult, the maculae flavae are probably required for metabolism of the
extracellular matrices of the vocal fold mucosa, replacing damaged fibers in order to
maintain the integrity and elasticity of the vocal fold tissues. Age-related changes in the
macula flava influence the fibrous components of the vocal folds and are partially
responsible for the differences in the acoustics of the adult and aged voice.
Layered Structure of the Adult Vocal Fold
The histological structure of the vocal fold can be separated into 5 or 6 tissues, depending
on the source, which can then be grouped into three sections as the cover, the transition,
and the body.
The cover is composed of the epithelium (mucosa), basal lamina (or basement
membrane zone), and the superficial layer of the lamina propria.
The transition is composed of the intermediate and deep layers of the lamina propria. The
body is composed of the thyroarytenoid muscle. This layered structure of tissues is very
important for vibration of the true vocal folds.
The Cover
The free edge of the vibratory portion of the vocal fold, the anterior glottis, is covered with
stratified squamous epithelium. This epithelium is five to twenty-five cells thick with the
most superficial layer consisting of one to three cells that are lost to abrasion of the vocal
folds during the closed phase of vibration. The posterior glottis is covered with
pseudostratified ciliated epithelium. On the surfaces of the epithelial cells are microridges
and microvilli. Lubrication of the vocal folds through adequate hydration is essential for
normal phonation to avoid excessive abrasion, and the microridges and microvilli help to
spread and retain a mucous coat on the epithelium. Surgery of the vocal folds can disturb
this layer with scar tissue, which can result in the inability of the epithelium to retain an
adequate mucous coat, which will in turn impact lubrication of the vocal folds. The
epithelium has been described as a thin shell, the purpose of which is to maintain the shape
of the vocal fold.
Basal Lamina or Basement Membrane Zone (BMZ)
This is transitional tissue composed of two zones, the lamina lucida and lamina densa. The
lamina lucida appears as a low density clear zone medial to the epithelial basal cells. The
lamina densa has a greater density of filaments and is adjacent to the lamina propria. The
basal lamina or BMZ mainly provides physical support to the epithelium through anchoring
fibers and is essential for repair of the epithelium.
Superficial Layer of the Lamina Propria
This layer consists of loose fibrous components and extracellular matrices that can be
compared to soft gelatin. This layer is also known as Reinke’s space but it is not a space at
all. Like the pleural cavity, it is a potential space. If there really is a space, there is a
problem. The superficial layer of the lamina propria is a structure that vibrates a great deal
during phonation, and the viscoelasticity needed to support this vibratory function depends
mostly on extracellular matrices. The primary extracellular matrices of the vocal fold cover
are reticular, collagenous and elastic fibers, as well as glycoprotein and glycosaminoglycan.
These fibers serve as scaffolds for structural maintenance, providing tensile strength and
resilience so that the vocal folds may vibrate freely but still retain their shape.
The Transition: Intermediate and Deep Layers of the Lamina Propria
The intermediate layer of the lamina propria is primarily made up of elastic fibers while the
deep layer of the lamina propria is primarily made up of collagenous fibers. These fibers run
roughly parallel to the vocal fold edge and these two layers of the lamina propria comprise
the vocal ligament. The transition layer is primarily structural, giving the vocal fold support
as well as providing adhesion between the mucosa, or cover, and the body, the
thyroarytenoid muscle.
The Body: The Thyroarytenoid Muscle
This muscle is variously described as being divided into the thyroarytenoid and vocalis
muscles or the thyrovocalis and the thyromuscularis, depending on the source.
Vocal Fold Lesions
The majority of vocal fold lesions primarily arise in the cover of the folds. Since the basal
lamina secures the epithelium to the superficial layer of the lamina propria with anchoring
fibers, this is a common site for injury. If a person has a phonotrauma or habitual vocal
hyperfunction, also known as pressed phonation, the proteins in the basal lamina can shear,
causing vocal fold injury, usually seen as nodules or polyps, which increase the mass and
thickness of the cover. The squamous cell epithelium of the anterior glottis is also a
frequent site of laryngeal cancer caused by smoking.
Reinke’s Edema
A voice pathology called Reinke’s edema, swelling due to abnormal accumulation of fluid,
occurs in the superficial lamina propria or Reinke’s space. This causes the vocal fold mucosa
to appear floppy with excessive movement of the cover that has been described as looking
like a loose sock. The greater mass of the vocal folds due to increased fluid lowers
the fundamental frequency (F0) during phonation.
Histological Changes From Birth to Old Age
The histologic structure of the vocal fold differs from the pediatric to the adult and old-age
populations.
The infant lamina propria is composed of only one layer, as compared to three in the adult,
and there is no vocal ligament. The vocal ligament begins to be present in children at about
four years of age. Two layers appear in the lamina propria between the ages of six and
twelve, and the mature lamina propria, with the superficial, intermediate and deep layers, is
only present by the conclusion of adolescence. As vocal fold vibration is a foundation for
vocal formants, this presence or absence of tissue layers influences a difference in the
number of formants between the adult and pediatric populations. In females, the voice is
three tones lower than the child’s and has five to twelve formants, as opposed to the
pediatric voice with three to six. The length of the vocal fold at birth is approximately six to
eight millimeters and grows to its adult length of eight to sixteen millimeters by
adolescence. The infant vocal fold is half membranous or anterior glottis, and half
cartilaginous or posterior glottis. The adult fold is approximately three-fifths membranous
and two-fifths cartilaginous.
Puberty usually lasts 2–5 years and typically occurs between the ages of 12 and 17.
During puberty, voice change is controlled by sex hormones. In females during puberty, the
vocal muscle thickens slightly, but remains very supple and narrow. The squamous mucosa
also differentiates into three distinct layers (the lamina propria) on the free edge of the
vocal folds. The sub- and supraglottic glandular mucosa becomes hormonally dependent on
estrogens and progesterone. For women, the actions of estrogens and progesterone
produce changes in the extravascular spaces by increasing capillary permeability which
allows the passage of intracapillary fluids to the interstitial space as well as modification of
glandular secretions. Estrogens have a hypertrophic and proliferative effect on mucosa by
reducing the desquamating effect on the superficial layers. The thyroid hormones also affect
dynamic function of the vocal folds (Hashimoto’s Thyroiditis affects the fluid balance in the
vocal folds). Progesterone has an anti-proliferative effect on mucosa and accelerates
desquamation. It causes a menstrual-like cycle in the vocal fold epithelium and a drying out
of the mucosa with a reduction in secretions of the glandular epithelium. Progesterone has a
diuretic effect and decreases capillary permeability, thus trapping the extracellular fluid out
of the capillaries and causing tissue congestion. Testosterone, an androgen secreted by the
testes, will cause changes in the cartilages and musculature of the larynx for males during
puberty. In women, androgens are secreted principally by the adrenal cortex and the
ovaries and can have irreversible masculinizing effects if present in high enough
concentration. In men, they are essential to male sexuality. In muscles, they cause a
hypertrophy of striated muscles with a reduction in the fat cells in skeletal muscles, and a
reduction in the whole body fatty mass. Androgens are the most important hormones
responsible for the passage from the boy's voice to the man's voice, and the change is
irreversible. The thyroid prominence appears, the vocal folds lengthen and become rounded,
and the epithelium thickens with the formation of three distinct layers in the lamina propria.
There is a steady increase in the elastin content of the lamina propria as we age (elastin is a
yellow scleroprotein, the essential constituent of the elastic connective tissue) resulting in a
decrease in the ability of the lamina propria to expand caused by cross-branching of the
elastin fibers. Among other things, this leads to the mature voice being better suited to the
rigors of opera.
Old Age
There is a thinning in the superficial layer of the lamina propria in old age. In aging, the
vocal fold undergoes considerable sex-specific changes. In the female larynx, the vocal fold
cover thickens with aging. The superficial layer of the lamina propria loses density as it
becomes more edematous. The intermediate layer of the lamina propria tends to atrophy
only in men. The deep layer of the lamina propria of the male vocal fold thickens because of
increased collagen deposits. The vocalis muscle atrophies in both men and women.
However, the majority of elderly patients with voice disorders have disease processes
associated with aging rather than physiologic aging alone.

Intelligibility (communication)
In phonetics, intelligibility is a measure of how comprehensible speech is, or the degree
to which speech can be understood. Intelligibility is affected by spoken clarity, explicitness,
lucidity, comprehensibility, perspicuity, and precision.
Noise levels
For satisfactory communication, the average speech level should exceed that of an
interfering noise by 6 dB; lower speech-to-noise ratios are rarely acceptable (Moore, 1997).
Because it extends over a wide frequency range, speech is quite resistant to many types of
masking frequency cut-off. Moore reports, for example, that a band of frequencies from
1000 Hz to 2000 Hz is sufficient (sentence articulation score of about 90%).
Word articulation remains high even when only 1–2% of the wave is unaffected by
Quantity to be measured    Unit of measurement                          Good values
%ALcons                    Articulation loss (popular in USA)           < 10 %
C50                        Clarity index (widespread in Germany)        > 3 dB
STI (RASTI)                Intelligibility (internationally known)      > 0.6
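The 6 dB speech-over-noise margin and the "good value" ranges listed above can be checked with a few lines of Python. This is only an illustrative sketch: the function names, the dictionary layout, and the choice of RMS amplitudes as inputs are assumptions made for the example, not part of any measurement standard.

```python
import math


def snr_db(speech_rms: float, noise_rms: float) -> float:
    """Speech-to-noise ratio in decibels, computed from two RMS amplitudes."""
    return 20.0 * math.log10(speech_rms / noise_rms)


def meets_speech_margin(speech_rms: float, noise_rms: float,
                        margin_db: float = 6.0) -> bool:
    """True if speech exceeds the noise by at least margin_db (6 dB in the text)."""
    return snr_db(speech_rms, noise_rms) >= margin_db


# "Good" ranges copied from the listing above; the key names are illustrative.
GOOD_VALUES = {
    "%ALcons": lambda v: v < 10.0,  # articulation loss, in percent
    "C50": lambda v: v > 3.0,       # clarity index, in dB
    "STI": lambda v: v > 0.6,       # intelligibility index, unitless
}


def is_good(measure: str, value: float) -> bool:
    """Check one measured value against the range listed for that measure."""
    return GOOD_VALUES[measure](value)


if __name__ == "__main__":
    # Speech at twice the noise amplitude is about 6.02 dB above the noise.
    print(round(snr_db(2.0, 1.0), 2), meets_speech_margin(2.0, 1.0))
    print(is_good("STI", 0.7), is_good("%ALcons", 12.0))
```

Because decibels are logarithmic, doubling the speech amplitude relative to the noise is just enough to reach the 6 dB margin.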
Intelligibility with different types of speech
Lombard speech
The human brain automatically changes speech made in noise through a process called
the Lombard effect. Such speech has increased intelligibility compared to normal speech. It
is not only louder, but its fundamental frequency is raised and the
durations of its vowels are prolonged. People also tend to make more noticeable facial movements.
Shouted speech is less intelligible than Lombard speech because increased vocal energy
produces decreased phonetic information.
Clear speech
Clear speech is used when talking to a person with a hearing impairment. It is characterized
by a slower speaking rate, more and longer pauses, elevated speech intensity, increased
word duration, "targeted" vowel formants, increased consonant intensity compared to
adjacent vowels, and a number of phonological changes (including fewer reduced vowels
and more released stop bursts).
Infant-directed speech
Infant-directed speech, or baby talk, uses a simplified syntax and a smaller, easier-to-understand
vocabulary than speech directed to adults. Compared to adult-directed speech, it
has a higher fundamental frequency, exaggerated pitch range, and slower rate.
Citation speech
Citation speech occurs when people engage self-consciously in spoken language research. It
has a slower tempo and fewer connected speech processes (e.g., shortening of nuclear
vowels, devoicing of word-final consonants) than normal speech. 
Hyperspace speech
Hyperspace speech, also known as the hyperspace effect, occurs when people are misled
about the presence of environmental noise. It involves modifying the F1 and F2 of phonetic
vowel targets to ease perceived difficulties on the part of the listener in recovering
information from the acoustic signal.
Lombard effect
Figure: Due to the Lombard effect, great tits sing at a higher frequency in noise-polluted urban
surroundings than in quieter ones, which helps overcome the auditory masking that would otherwise
impair other birds' hearing of their song. In humans, the Lombard effect results in speakers
adjusting not only frequency but also the intensity and rate of pronouncing syllables.
The Lombard effect or Lombard reflex is the involuntary tendency of speakers to increase the
intensity of their voice when speaking in loud noise to enhance its audibility. This change
includes not only loudness but also other acoustic features such as pitch and the rate and
duration of syllables. This compensation effect results in an increase in the auditory
signal-to-noise ratio of the speaker’s spoken words.
The effect links to the needs of effective communication, as there is a reduced effect when
words are repeated or lists are read, where communication intelligibility is not
important. Since the effect is also involuntary, it is used as a means to detect malingering in
those simulating hearing loss. Research on great tits and beluga whales that live in
environments with noise pollution finds that the effect also occurs in the vocalizations of
nonhuman animals.
The effect was discovered in 1909 by Étienne Lombard, a French otolaryngologist.
Lombard speech
When heard in noise, speech that was recorded in noise is understood better than
speech that was recorded in quiet and then played back with the same level
of masking noise. Changes between normal and Lombard speech include:
• an increase in phonetic fundamental frequencies,
• a shift in energy from low frequency bands to middle or high bands,
• an increase in sound intensity,
• an increase in vowel duration,
• spectral tilting,
• a shift in formant center frequencies for F1 (mainly) and F2,
• the duration of content words is prolonged to a greater degree in noise
than that of function words,
• greater lung volumes are used,
• it is accompanied by larger facial movements, but these do not aid as much as its
sound changes.
These changes cannot be controlled by instructing a person to speak as they would in
silence, though people can learn control with feedback.
The Lombard effect also occurs after laryngectomy, when people who have had speech
therapy talk using esophageal speech.
The intelligibility of an individual’s own vocalization can be adjusted with audio-vocal
reflexes using their own hearing (private loop), or it can be adjusted indirectly in terms of
how well listeners can hear the vocalization (public loop). Both processes are involved in the
Lombard effect.
Private loop
A speaker can regulate their vocalizations, particularly their amplitude relative to background
noise, with reflexive auditory feedback. Such auditory feedback is known to maintain the
production of vocalization, since deafness affects the vocal acoustics of both humans and
songbirds. Changing the auditory feedback also changes vocalization in human speech or bird
song. Neural circuits have been found in the brainstem that enable such reflex adjustment.
Public loop
A speaker can regulate their vocalizations at a higher cognitive level by observing the
consequences for their audience’s ability to hear them. In this loop, auditory self-monitoring
adjusts vocalizations in terms of learnt associations of what features of their vocalization, when
made in noise, create effective and efficient communication. The Lombard effect has been
found to be greatest on those words that are important for the listener to understand a
speaker, suggesting such cognitive effects are important.
Both private and public loop processes exist in children. There is, however, a developmental
shift from the Lombard effect being linked to acoustic self-monitoring in young children
to the adjustment of vocalizations to aid their intelligibility for others in adults.
The Lombard effect depends upon audio-vocal neurons in the periolivary region of
the superior olivary complex and the adjacent pontine reticular formation. It has been
suggested that the Lombard effect might also involve the higher cortical areas that control
these lower brainstem areas.
Choral singing
Choral singers experience reduced feedback due to the sound of other singers upon their
own voice. This results in a tendency for people in choruses to sing at a louder level if it is
not controlled by a conductor. Trained soloists can control this effect but it has been
suggested that after a concert they might speak more loudly in noisy surroundings, such as at
after-concert parties.
The Lombard effect also occurs in those playing instruments such as the guitar.
Animal vocalization
Noise has been found to affect the vocalizations of animals that vocalize against a
background of human noise pollution. Great tits in Leiden sing with a higher frequency than
do those in quieter areas to overcome the masking effect of the low frequency background
noise pollution of cities. Beluga whales in the St. Lawrence River estuary adjust their whale
song so it can be heard against shipping noise.
Experimentally, the Lombard effect has also been found in the vocalization of:
• Budgerigars
• Cats
• Chickens
• Common marmosets
• Cottontop tamarins
• Japanese quail
• Nightingales
• Rhesus macaques
• Squirrel monkeys
• Zebra finches

Manner of articulation

Figure: Human vocal tract

In linguistics (articulatory phonetics), manner of articulation describes how the tongue, lips,
jaw, and other speech organs involved in making a sound make contact. Often the concept is only
used for the production of consonants. For any place of articulation, there may be several
manners, and therefore several homorganic consonants.
One parameter of manner is stricture, that is, how closely the speech organs approach one
another. Parameters other than stricture are those involved in the r-like sounds (taps and
trills), and the sibilancy of fricatives. Often nasality and laterality are included in manner,
but phoneticians such as Peter Ladefoged consider them to be independent.
From greatest to least stricture, speech sounds may be classified along a cline as stop
consonants (with occlusion, or blocked airflow), fricative consonants (with partially blocked
and therefore strongly turbulent airflow), approximants (with only slight turbulence),
and vowels (with full unimpeded airflow). Affricates often behave as if they were
intermediate between stops and fricatives, but phonetically they are sequences of stop plus
fricative.
Historically, sounds may move along this cline toward less stricture in a process
called lenition. The reverse process is fortition.
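The cline described above can be modelled as an ordered scale, with lenition and fortition as single steps in opposite directions. The following is a minimal Python sketch; the class and function names are hypothetical, and treating each historical change as exactly one step is a simplifying assumption.

```python
from enum import IntEnum


class Stricture(IntEnum):
    """Degrees of stricture, ordered from greatest (stop) to least (vowel)."""
    STOP = 0         # occlusion: airflow blocked
    FRICATIVE = 1    # partially blocked, strongly turbulent airflow
    APPROXIMANT = 2  # only slight turbulence
    VOWEL = 3        # unimpeded airflow


def lenite(sound: Stricture) -> Stricture:
    """Move one step toward less stricture; a vowel cannot lenite further."""
    return Stricture(min(sound + 1, Stricture.VOWEL))


def fortify(sound: Stricture) -> Stricture:
    """Move one step toward greater stricture; a stop cannot fortify further."""
    return Stricture(max(sound - 1, Stricture.STOP))


if __name__ == "__main__":
    print(lenite(Stricture.STOP).name)        # FRICATIVE
    print(fortify(Stricture.APPROXIMANT).name)  # FRICATIVE
```

Affricates are left out of the scale on purpose, since the text treats them phonetically as stop-plus-fricative sequences and not as a separate degree of stricture.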
Other parameters
Sibilants are distinguished from other fricatives by the shape of the tongue and how the
airflow is directed over the teeth. Fricatives at coronal places of articulation may be sibilant
or non-sibilant, sibilants being the more common.
Taps and flaps are similar to very brief stops. However, their articulation and behavior are
distinct enough to be considered a separate manner, rather than just length.
Trills involve the vibration of one of the speech organs. Since trilling is a separate parameter
from stricture, the two may be combined. Increasing the stricture of a typical trill results in
a trilled fricative. Trilled affricates are also known.
Nasal airflow may be added as an independent parameter to any speech sound. It is most
commonly found in nasal stops and nasal vowels, but nasal fricatives, taps, and
approximants are also found. When a sound is not nasal, it is called oral. An oral stop is
often called a plosive, while a nasal stop is generally just called a nasal.
Laterality is the release of airflow at the side of the tongue. This can also be combined with
other manners, resulting in lateral approximants (the most common), lateral flaps, and
lateral fricatives and affricates.
Individual manners
• Plosive, or oral stop, where there is complete occlusion (blockage) of both the oral
and nasal cavities of the vocal tract, and therefore no air flow. Examples
include English /p t k/ (voiceless) and /b d g/ (voiced). If the consonant is voiced, the
voicing is the only sound made during occlusion; if it is voiceless, a plosive is completely
silent. What we hear as a /p/ or /k/ is the effect that the onset of the occlusion has on
the preceding vowel, as well as the release burst and its effect on the following vowel.
The shape and position of the tongue (the place of articulation) determine
the resonant cavity that gives different plosives their characteristic sounds. All
languages have plosives.
• Nasal stop, usually shortened to nasal, where there is complete occlusion of the
oral cavity, and the air passes instead through the nose. The shape and position of the
tongue determine the resonant cavity that gives different nasal stops their characteristic
sounds. Examples include English /m, n/. Nearly all languages have nasals, the only
exceptions being in the area of Puget Sound and a single language on Bougainville.
• Fricative, sometimes called spirant, where there is continuous frication (turbulent
and noisy airflow) at the place of articulation. Examples include English /f, s/ (voiceless),
/v, z/ (voiced), etc. Most languages have fricatives, though many have only an /s/.
However, the Indigenous Australian languages are almost completely devoid of fricatives
of any kind.
  • Sibilants are a type of fricative where the airflow is guided by a groove in the
  tongue toward the teeth, creating a high-pitched and very distinctive sound. These are by
  far the most common fricatives. Fricatives at coronal (front of tongue) places of articulation
  are usually, though not always, sibilants. English sibilants include /s/ and /z/.
  • Lateral fricatives are a rare type of fricative, where the frication occurs on one or
  both sides of the edge of the tongue. The "ll" of Welsh and the "hl" of Zulu are lateral
  fricatives.
• Affricate, which begins like a plosive, but this releases into a fricative rather than
having a separate release of its own. The English letters "ch" and "j" represent affricates.
Affricates are quite common around the world, though less common than fricatives.
• Flap, often called a tap, is a momentary closure of the oral cavity. The "tt" of "utter"
and the "dd" of "udder" are pronounced as a flap in North American English. Many linguists
distinguish taps from flaps, but there is no consensus on what the difference might be. No
language relies on such a difference. There are also lateral flaps.
• Trill, in which the articulator (usually the tip of the tongue) is held in place, and the
airstream causes it to vibrate. The double "r" of Spanish "perro" is a trill. Trills and flaps,
where there are one or more brief occlusions, constitute a class of consonant called rhotics.
• Approximant, where there is very little obstruction. Examples include English /w/
and /r/. In some languages, such as Spanish, there are sounds which seem to fall
between fricative and approximant.
  • One use of the word semivowel, sometimes called a glide, is a type of
  approximant, pronounced like a vowel but with the tongue closer to the roof of the mouth,
  so that there is slight turbulence. In English, /w/ is the semivowel equivalent of the vowel
  /u/, and /j/ (spelled "y") is the semivowel equivalent of the vowel /i/ in this usage. Other
  descriptions use semivowel for vowel-like sounds that are not syllabic, but do not have the
  increased stricture of approximants. These are found as elements in diphthongs. The word
  may also be used to cover both concepts.
  • Lateral approximants, usually shortened to lateral, are a type of approximant
  pronounced with the side of the tongue. English /l/ is a lateral. Together with the rhotics,
  which have similar behavior in many languages, these form a class of consonant
  called liquids.
Broader classifications
Manners of articulation with substantial obstruction of the airflow (plosives, fricatives,
affricates) are called obstruents. These are prototypically voiceless, but voiced obstruents
are extremely common as well. Manners without such obstruction (nasals, liquids,
approximants, and also vowels) are called sonorants because they are nearly always
voiced. Voiceless sonorants are uncommon, but are found in Welsh and Classical Greek (the
spelling "rh"), in Tibetan (the "lh" of Lhasa), and the "wh" in those dialects of English which
distinguish "which" from "witch".
Sonorants may also be called resonants, and some linguists prefer that term, restricting
the word 'sonorant' to non-vocoid resonants (that is, nasals and liquids, but not vowels or
semi-vowels). Another common distinction is between stops (plosives and nasals)
and continuants (all else); affricates are considered to be both, because they are
sequences of stop plus fricative.
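The two broad groupings described above can be written down as simple lookups. The sketch below is illustrative only: the manner labels and function names are hypothetical, and it follows the first usage of "sonorant" given in the text (vowels included), not the narrower usage some linguists prefer.

```python
# Manners grouped as in the text; the label strings are illustrative choices.
OBSTRUENT_MANNERS = {"plosive", "fricative", "affricate"}
SONORANT_MANNERS = {"nasal", "liquid", "approximant", "vowel"}


def major_class(manner: str) -> str:
    """Return 'obstruent' or 'sonorant' for a manner of articulation."""
    if manner in OBSTRUENT_MANNERS:
        return "obstruent"
    if manner in SONORANT_MANNERS:
        return "sonorant"
    raise ValueError(f"unknown manner: {manner!r}")


def stop_or_continuant(manner: str) -> set:
    """Stops are plosives and nasals; affricates count as both; all else is continuant."""
    if manner in {"plosive", "nasal"}:
        return {"stop"}
    if manner == "affricate":
        return {"stop", "continuant"}
    return {"continuant"}


if __name__ == "__main__":
    for m in ("plosive", "nasal", "affricate", "vowel"):
        print(m, major_class(m), sorted(stop_or_continuant(m)))
```

The example makes the cross-cutting nature of the two distinctions visible: a nasal is a sonorant but also a stop, while a fricative is an obstruent but a continuant.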
Other airstream initiations
All of these manners of articulation are pronounced with an airstream
mechanism called pulmonic egressive, meaning that the air flows outward, and is powered
by the lungs (actually the ribs and diaphragm). Other airstream mechanisms are possible.
Sounds that rely on some of these include:
• Ejectives, which are glottalic egressive. That is, the airstream is powered by an
upward movement of the glottis rather than by the lungs or diaphragm. Plosives, affricates,
and occasionally fricatives may occur as ejectives. All ejectives are voiceless.
• Implosives, which are glottalic ingressive. Here the glottis moves downward, but
the lungs may be used simultaneously (to provide voicing), and in some languages no air
may actually flow into the mouth. Implosive oral stops are not uncommon, but implosive
affricates and fricatives are rare. Voiceless implosives are also rare.
• Clicks, which are velaric ingressive. Here the back of the tongue is used to create a
vacuum in the mouth, causing air to rush in when the forward occlusion (tongue or lips) is
released. Clicks may be oral or nasal, stop or affricate, central or lateral, voiced or
voiceless. They are extremely rare in normal words outside Southern Africa. However,
English has a click in its "tsk tsk" (or "tut tut") sound, and another is often used to say
"giddy up" to a horse.

Paralanguage
Paralanguage refers to the non-verbal elements of communication used to modify
meaning and convey emotion. Paralanguage may be expressed consciously or unconsciously,
and it includes the pitch, volume, and, in some cases, intonation of speech. Sometimes the
definition is restricted to vocally-produced sounds. The study of paralanguage is known as
paralinguistics.
The term 'paralanguage' is sometimes used as a cover term for body language, which is not
necessarily tied to speech, and paralinguistic phenomena in speech. The latter are
phenomena that can be observed in speech (Saussure's parole) but that do not belong to
the arbitrary conventional code of language (Saussure's langue).
The paralinguistic properties of speech play an important role in human speech
communication. There are no utterances or speech signals that lack paralinguistic
properties, since speech requires the presence of a voice that can be modulated. This voice
must have some properties, and all the properties of a voice as such are paralinguistic.
However, the distinction linguistic vs. paralinguistic applies not only to speech but
to writing and sign language as well, and it is not bound to any sensory modality. Even
vocal language has some paralinguistic as well as linguistic properties that can be seen (lip
reading, McGurk effect), and even felt, e.g. by the Tadoma method.
One can distinguish the following aspects of speech signals and perceived utterances:
Perspectival aspects
Speech signals that arrive at a listener’s ears have acoustic properties that may allow
listeners to localize the speaker (distance, direction). Sound localization functions in a
similar way also for non-speech sounds. The perspectival aspects of lip reading are more
obvious and have more drastic effects when head turning is involved.
Organic aspects
The speech organs of different speakers differ in size. As children grow up, their organs of
speech become larger and there are differences between male and female adults. The
differences concern not only size, but also proportions. They affect the pitch of
the voice and to a substantial extent also the formant frequencies, which characterize the
different speech sounds. The organic quality of speech has a communicative function in a
restricted sense, since it is merely informative about the speaker. It will be expressed
independently of the speaker’s intention.
Expressive aspects
The properties of the voice and the way of speaking are affected by emotions and attitudes.
Typically, attitudes are expressed intentionally and emotions without intention, but attempts
to fake or to hide emotions are not unusual. Expressive variation is central to paralanguage.
It affects loudness, speaking rate, pitch, pitch range and, to some extent, also the formant
frequencies.
Linguistic aspects
These aspects are the main concern of linguists. Ordinary phonetic transcriptions of
utterances reflect only the linguistically informative quality. The problem of how listeners
factor out the linguistically informative quality from speech signals is a topic of current
research.
Some of the linguistic features of speech, in particular of its prosody, are paralinguistic or
pre-linguistic in origin. A most fundamental and widespread phenomenon of this kind is
known as the "frequency code" (Ohala, 1984). This code works even in communication
across species. It has its origin in the fact that the acoustic frequencies in the voice of small
vocalizers are high while they are low in the voice of large vocalizers. This gives rise to
secondary meanings such as 'harmless', 'submissive', 'unassertive', which are naturally
associated with smallness, while meanings such as 'dangerous', 'dominant', and 'assertive'
are associated with largeness. In most languages, the frequency code also serves the
purpose of distinguishing questions from statements. It is universally reflected in expressive
variation, and it is reasonable to assume that it has phylogenetically given rise to the sexual
dimorphism that lies behind the large difference in pitch between average female and male
adults.
In text-only communication such as email, chatrooms and instant messaging, paralinguistic
elements can be displayed by emoticons, font and color choices, capitalization and the use
of non-alphabetic or abstract characters. Nonetheless, paralanguage in written
communication is limited in comparison with face-to-face conversation, sometimes leading
to misunderstandings.
Nonverbal communication
Nonverbal communication (NVC) is usually understood as the process of communication
through sending and receiving wordless messages; that is, language is not the only means
of communication. NVC can be communicated through gestures and touch (haptic
communication), by body language or posture, and by facial expression and eye contact. It
can also be communicated through objects such as clothing, hairstyles or even architecture,
symbols and infographics. Speech contains nonverbal elements known as paralanguage,
including voice quality, emotion and speaking style, as well as prosodic features such
as rhythm, intonation and stress. Dance is also regarded as a form of nonverbal communication.
Likewise, written texts have nonverbal elements such as handwriting style, spatial
arrangement of words, or the use of emoticons.
However, much of the study of nonverbal communication has focused on face-to-face
interaction, where it can be classified into three principal areas: environmental conditions
where communication takes place, the physical characteristics of the communicators, and
behaviors of communicators during interaction.
Verbal vs. oral communication
Scholars in this field usually use a strict sense of the term "verbal", meaning "of or
concerned with words", and do not use "verbal communication" as a synonym for oral or
spoken communication. Thus, vocal sounds that are not considered to be words, such as a
grunt, or singing a wordless note, are nonverbal. Sign languages and writing are generally
understood as forms of verbal communication, as both make use of words — although like
speech, both may contain paralinguistic elements and often occur alongside nonverbal
messages. Nonverbal communication can occur through
any sensory channel — sight, sound, smell, touch or taste. NVC is important as:
"When we speak (or listen), our attention is focused on words rather than body language.
But our judgment includes both. An audience is simultaneously processing both verbal and
nonverbal cues. Body movements are not usually positive or negative in and of themselves;
rather, the situation and the message will determine the appraisal." (Givens, 2000, p. 4)
The first scientific study of nonverbal communication was Charles Darwin's book The
Expression of the Emotions in Man and Animals (1872). He argued that all mammals show
emotion reliably in their faces. Studies now range across a number of fields,
including linguistics, semiotics and social psychology.
While much nonverbal communication is based on arbitrary symbols, which differ from
culture to culture, a large proportion is also to some extent iconic and may be universally
understood. Paul Ekman's influential 1960s studies of facial expression determined that
expressions of anger, disgust, fear, joy, sadness and surprise are universal.
Clothing and bodily characteristics
Uniforms have both a functional and a communicative purpose. This man's clothes identify
him as male and a police officer; his badges and shoulder sleeve insignia give information
about his job and rank.
Elements such as physique, height, weight, hair, skin color, gender, odors, and clothing
send nonverbal messages during interaction. For example, a study, carried out
in Vienna, Austria, of the clothing worn by women attending discothèques showed that in
certain groups of women (especially women who were in town without their partners)
motivation for sex, and levels of sexual hormones, were correlated with aspects of the
clothing, especially the amount of skin displayed, and the presence of sheer clothing, e.g. at
the arms. Thus, to some degree, clothing sent signals about interest in courtship.
Research into height has generally found that taller people are perceived as being more
impressive. Melamed & Bozionelos (1992) studied a sample of managers in the UK and
found that height was a key factor affecting who was promoted. Often people try to make
themselves taller, for example, standing on a platform, when they want to make more of an
impact with their speaking.
Physical environment
Environmental factors such as furniture, architectural style, interior decorating, lighting
conditions, colors, temperature, noise, and music affect the behavior of communicators
during interaction. The furniture itself can be seen as a nonverbal message.
Proxemics: physical space in communication
Proxemics is the study of how people use and perceive the physical space around them. The
space between the sender and the receiver of a message influences the way the message is
interpreted.
The perception and use of space varies significantly across cultures and different settings
within cultures. Space in nonverbal communication may be divided into four main
categories: intimate, social, personal, and public space.
The term territoriality is still used in the study of proxemics to explain human behavior
regarding personal space. Hargie & Dickson (2004, p. 69) identify four such territories:
1. Primary territory: this refers to an area that is associated with someone who has
exclusive use of it. For example, a house that others cannot enter without the
owner’s permission.
2. Secondary territory: unlike the previous type, there is no “right” to occupancy, but
people may still feel some degree of ownership of a particular space. For example,
someone may sit in the same seat on a train every day and feel aggrieved if someone
else sits there.
3. Public territory: this refers to an area that is available to all, but only for a set
period, such as a parking space or a seat in a library. Although people have only a
limited claim over that space, they often exceed that claim. For example, it was
found that people take longer to leave a parking space when someone is waiting to
take that space.
4. Interaction territory: this is space created by others when they are interacting. For
example, when a group is talking to each other on a footpath, others will walk
around the group rather than disturb it.
Chronemics: time in communication
Chronemics is the study of the use of time in nonverbal communication. The way we
perceive time, structure our time and react to time is a powerful communication tool, and
helps set the stage for communication. Time perceptions include punctuality and willingness
to wait, the speed of speech and how long people are willing to listen. The timing and
frequency of an action, as well as the tempo and rhythm of communications within an
interaction, contribute to the interpretation of nonverbal messages. Gudykunst &
Ting-Toomey (1988) identified two dominant time patterns:
Monochronic Time
A monochronic time system means that things are done one at a time and time is
segmented into precise, small units. Under this system time is scheduled, arranged and
managed.
The United States is considered a monochronic society. This perception of time is learned
and rooted in the Industrial Revolution, where "factory life required the labor force to be on
hand and in place at an appointed hour" (Guerrero, DeVito & Hecht, 1999, p. 238). For
Americans, time is a precious resource not to be wasted or taken lightly. "We buy time,
save time, spend time and make time. Our time can be broken down into years, months,
days, hours, minutes, seconds and even milliseconds. We use time to structure both our
daily lives and events that we are planning for the future. We have schedules that we must
follow: appointments that we must go to at a certain time, classes that start and end at
certain times, work schedules that start and end at certain times, and even our favorite TV
shows, that start and end at a certain time.”
As communication scholar Edward T. Hall wrote regarding the American viewpoint of time
in the business world, “the schedule is sacred.” Hall says that for monochronic cultures,
such as the American culture, “time is tangible” and viewed as a commodity where “time is
money” or “time is wasted.” The result of this perspective is that Americans and other
monochronic cultures, such as the German and Swiss, place a paramount value on
schedules, tasks and “getting the job done.” These cultures are committed to regimented
schedules and may view those who do not subscribe to the same perception of time as
Monochronic cultures include Germany, Canada, Switzerland, United States, and
Polychronic Time
A polychronic time system is a system where several things can be done at once, and a
more fluid approach is taken to scheduling time. Unlike Americans and most northern and
western European cultures, Latin American and Arabic cultures use the polychronic system
of time.
These cultures are much less focused on the preciseness of accounting for each and every
moment. As Raymond Cohen notes, polychronic cultures are deeply steeped in tradition
rather than in tasks—a clear difference from their monochronic counterparts. Cohen notes
that "Traditional societies have all the time in the world. The arbitrary divisions of the clock
face have little saliency in cultures grounded in the cycle of the seasons, the invariant
pattern of rural life, and the calendar of religious festivities" (Cohen, 1997, p. 34).
Instead, these cultures are more focused on relationships than on watching the clock. They
have no problem being “late” for an event if they are with family or friends, because the
relationship is what really matters. As a result, polychronic cultures have a much less formal
perception of time. They are not ruled by precise calendars and schedules. Rather, “cultures
that use the polychronic time system often schedule multiple appointments simultaneously
so keeping on schedule is an impossibility.”
Polychronic cultures include Saudi Arabia, Egypt, Mexico, Philippines, India, and many in
Movement and body position
Information about the relationship and affect of these two skaters is communicated by
their body posture, eye gaze and physical contact.
The term "kinesics" was first used (in 1952) by Ray Birdwhistell, an anthropologist who wished to
study how people communicate through posture, gesture, stance, and movement. Part of
Birdwhistell's work involved filming people in social situations and analyzing the footage to
show different levels of communication not clearly seen otherwise. The study was joined by
several other anthropologists, including Margaret Mead and Gregory Bateson.
Posture can be used to determine a participant’s degree of attention or involvement, the
difference in status between communicators, and the level of fondness a person has for the
other communicator. Studies investigating the impact of posture
on interpersonal relationships suggest that mirror-image congruent postures, where one
person’s left side is parallel to the other person’s right side, lead to favorable perception of
communicators and positive speech; a forward lean or a decrease in backward lean also
signifies positive sentiment during communication. Posture is
understood through such indicators as direction of lean, body orientation, arm position, and
body openness.

A wink is a type of gesture.

A gesture is a non-vocal bodily movement intended to express meaning. Gestures may be
articulated with the hands, arms or body, and also include movements of the head, face and
eyes, such as winking, nodding, or rolling one's eyes. The boundary between language and
gesture, or verbal and nonverbal communication, can be hard to identify.
Although the study of gesture is still in its infancy, some broad categories of gestures have
been identified by researchers. The most familiar are the so-called emblems or quotable
gestures. These are conventional, culture-specific gestures that can be used as replacements
for words, such as the hand-wave used in the US for "hello" and "goodbye". A single
emblematic gesture can have a very different significance in different cultural contexts,
ranging from complimentary to highly offensive.
Another broad category of gestures comprises those gestures used spontaneously when we
speak. These gestures are closely coordinated with speech. The so-called beat gestures are
used in conjunction with speech and keep time with the rhythm of speech to emphasize
certain words or phrases. These types of gestures are integrally connected to speech and
thought processes. Other spontaneous gestures used when we speak are more contentful
and may echo or elaborate the meaning of the co-occurring speech. For example, a gesture
that depicts the act of throwing may be synchronous with the utterance, "He threw the ball
right into the window."
Gestural languages such as American Sign Language and its regional siblings operate as
complete natural languages that are gestural in modality. They should not be confused with
finger spelling, in which a set of emblematic gestures is used to represent a written
alphabet.
Gestures can also be categorized as either speech-independent or speech-related. Speech-
independent gestures are dependent upon culturally accepted interpretation and have a
direct verbal translation. A wave hello or a peace sign are examples of speech-independent
gestures. Speech-related gestures are used in parallel with verbal speech; this form of
nonverbal communication is used to emphasize the message that is being communicated.
Speech-related gestures, such as pointing to an object of discussion, are intended to
provide supplemental information to a verbal message.
Gestures such as Mudra (Sanskrit) encode sophisticated information accessible to initiates
who are privy to the subtlety of elements encoded in their tradition.
Haptics: touching in communication

A high five is an example of communicative touch.

Haptics is the study of touching as nonverbal communication. Touches that can be defined
as communication include handshakes, holding hands, kissing (cheek, lips, hand), back
slapping, high fives, a pat on the shoulder, and brushing an arm. Touching of oneself may
include licking, picking, holding, and scratching. These behaviors are referred to as
"adapters" or "tells" and may send messages that reveal the intentions or feelings of a
communicator. The meaning conveyed from touch is highly dependent upon the context of
the situation, the relationship between communicators, and the manner of touch.
Humans communicate interpersonal closeness through a series of non-verbal actions known
as immediacy behaviors. Examples of immediacy behaviors are: smiling, touching, open
body positions, and eye contact. Cultures that display these immediacy behaviors are
known to be high contact cultures.
Haptic communication is the means by which people and other animals communicate via
touching. Touch is an extremely important sense for humans; as well as providing
information about surfaces and textures it is a component of nonverbal communication in
interpersonal relationships, and vital in conveying physical intimacy. It can be both sexual
(such as kissing) and platonic (such as hugging or tickling).
Touch is the earliest sense to develop in the fetus. The development of an infant's haptic
senses and how it relates to the development of the other senses such as vision has been
the target of much research. Human babies have been observed to have enormous difficulty
surviving if they do not possess a sense of touch, even if they retain sight and hearing.
Babies who can perceive through touch, even without sight and hearing, tend to fare much
better. Touch can be thought of as a basic sense in that most life forms have a response to
being touched, while only a subset have sight and hearing.
In chimpanzees the sense of touch is highly developed. As newborns they see and hear
poorly but cling strongly to their mothers. Harry Harlow conducted a controversial study
involving rhesus monkeys and observed that monkeys reared with a "terry cloth mother", a
wire feeding apparatus wrapped in softer terry cloth which provided a level of tactile
stimulation and comfort, were considerably more emotionally stable as adults than those
with a mere wire mother (Harlow, 1958).
Socially acceptable levels of touching vary from one culture to another. In Thai culture,
touching someone's head may be thought rude. Remland and Jones (1995) studied groups of
people communicating and found that in England (8%), France (5%) and the Netherlands (4%)
touching was rare compared to their Italian (14%) and Greek (12.5%) samples.
Striking, pushing, pulling, pinching, kicking, strangling and hand-to-hand fighting are forms
of touch in the context of physical abuse. In a sentence like "I never touched him/her" or
"Don't you dare to touch him/her" the term touch may be meant as a euphemism for either
physical abuse or sexual touching. To 'touch oneself' is a euphemism for masturbation.
The word touch has many other metaphorical uses. One can be emotionally touched,
referring to an action or object that evokes an emotional response. To say "I was touched
by your letter" implies the reader felt a strong emotion when reading it. The expression
usually does not include anger, disgust or other forms of emotional rejection unless used in
a sarcastic manner.
Stoeltje (2003) wrote about how Americans are ‘losing touch’ with this important
communication skill. During a study conducted by the Touch Research Institutes at the
University of Miami School of Medicine, American children were said to be more aggressive
than their French counterparts while playing at a playground. It was noted that French
women touched their children more.
Eye gaze
The study of the role of eyes in nonverbal communication is sometimes referred to as
"oculesics". Eye contact can indicate interest, attention, and involvement. Gaze comprises
the actions of looking while talking and looking while listening, as well as the amount of
gaze, frequency of glances, patterns of fixation, pupil dilation, and blink rate.
Paralanguage: nonverbal cues of the voice
Paralanguage (sometimes called vocalics) is the study of nonverbal cues of the voice.
Various acoustic properties of speech such as tone, pitch and accent, collectively known
as prosody, can all give off nonverbal cues. Paralanguage may change the meaning of
words.
The linguist George L. Trager developed a classification system which consists of the voice
set, voice qualities, and vocalization.
• The voice set is the context in which the speaker is speaking. This can include the
situation, gender, mood, age and a person's culture.
• The voice qualities are volume, pitch, tempo, rhythm, articulation, resonance,
nasality, and accent. They give each individual a unique "voice print".
• Vocalization consists of three subsections: characterizers, qualifiers and segregates.
Characterizers are emotions expressed while speaking, such as laughing, crying, and
yawning. A voice qualifier is the style of delivering a message, for example yelling "Hey
stop that!" as opposed to whispering "Hey stop that". Vocal segregates such as "uh-huh"
notify the speaker that the listener is listening.
Functions of nonverbal communication
Argyle (1970) put forward the hypothesis that whereas spoken language is normally used
for communicating information about events external to the speakers, non-verbal codes are
used to establish and maintain interpersonal relationships. It is considered more polite or
nicer to communicate attitudes towards others non-verbally rather than verbally, for
instance in order to avoid embarrassing situations.
Argyle (1988) concluded there are five primary functions of nonverbal bodily behavior in
human communication:
• Expressing emotions
• Expressing interpersonal attitudes
• Accompanying speech in managing the cues of interaction between speakers and
listeners
• Self-presentation of one’s personality
• Rituals (greetings)
Concealing deception
Nonverbal communication makes it easier to lie without being revealed. This is the
conclusion of a study where people watched made-up interviews of persons accused of
having stolen a wallet. The interviewees lied in about 50% of the cases. People had access
to either written transcripts of the interviews, or audio tape recordings, or video recordings.
The more cues that were available to those watching, the stronger the tendency for
interviewees who actually lied to be judged truthful. That is, people who are clever at
lying can use voice tone and facial expression to give the impression that they are truthful.
The relation between verbal and nonverbal communication
The relative importance of verbal and nonverbal communication
An interesting question is: When two people are communicating face-to-face, how much of
the meaning is communicated verbally, and how much is communicated non-verbally? This
was investigated by Albert Mehrabian and reported in two papers. The latter paper
concluded: "It is suggested that the combined effect of simultaneous verbal, vocal, and
facial attitude communications is a weighted sum of their independent effects - with
coefficients of .07, .38, and .55, respectively." This "rule" that cues from spoken words,
from the voice tone, and from the facial expression contribute 7%, 38%, and 55%
respectively to the total meaning is widely cited. It is presented in all types of popular
courses with statements like "scientists have found out that . . . ". In reality, however, it is
extremely weakly founded. First, it is based on the judgment of the meaning of single
tape-recorded words, i.e. a very artificial context. Second, the figures are obtained by
combining results from two different studies which may not be validly combined. Third, it
relates only to the communication of positive versus negative emotions. Fourth, it relates
only to women, as men did not participate in the study.
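The weighted-sum formulation quoted above can be illustrated with a short sketch. The coefficients 0.07, 0.38 and 0.55 come from the quoted passage; the function name, the rating scale from -1 (negative) to +1 (positive) and the sample ratings are assumptions made purely for illustration, and the caveats listed above apply to any use of the formula.

```python
# Coefficients quoted from Mehrabian's second paper (verbal, vocal, facial).
WEIGHTS = {"verbal": 0.07, "vocal": 0.38, "facial": 0.55}


def combined_attitude(verbal, vocal, facial):
    """Weighted sum of three attitude ratings on an assumed -1..+1 scale."""
    ratings = {"verbal": verbal, "vocal": vocal, "facial": facial}
    return sum(WEIGHTS[channel] * ratings[channel] for channel in WEIGHTS)


# Hypothetical case: positive word, negative tone of voice, neutral face.
score = combined_attitude(verbal=1.0, vocal=-1.0, facial=0.0)
print(round(score, 2))  # -0.31
```

In this hypothetical case the negative tone of voice outweighs the positive word, which is the kind of result the "rule" is usually cited to illustrate.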
Since then, other studies have analysed the relative contribution of verbal and nonverbal
signals under more naturalistic situations. Argyle, using video tapes shown to the subjects,
analysed the communication of submissive/dominant attitude and found that non-verbal
cues had 4.3 times the effect of verbal cues. The most important effect was that body
posture communicated superior status in a very efficient way. On the other hand, a study
by Hsee et al. had subjects judge a person on the dimension happy/sad and found that
words spoken with minimal variation in intonation had an impact about 4 times larger than
facial expressions seen in a film without sound. Thus, the relative importance of spoken
words and facial expressions may be very different in studies using different set-ups.
Interaction of verbal and nonverbal communication
When communicating, nonverbal messages can interact with verbal messages in six ways:
repeating, conflicting, complementing, substituting, regulating and accenting/moderating.
"Repeating" consists of using gestures to strengthen a verbal message, such as pointing to
the object of discussion.
Verbal and nonverbal messages within the same interaction can sometimes send opposing
or conflicting messages. A person verbally expressing a statement of truth while
simultaneously fidgeting or avoiding eye contact may convey a mixed message to the
receiver in the interaction. Conflicting messages may occur for a variety of reasons often
stemming from feelings of uncertainty, ambivalence, or frustration. [19] When mixed
messages occur, nonverbal communication becomes the primary tool people use to attain
additional information to clarify the situation; great attention is placed on bodily movements
and positioning when people perceive mixed messages during interactions.
Accurate interpretation of messages is made easier when nonverbal and verbal
communication complement each other. Nonverbal cues can be used to elaborate on verbal
messages to reinforce the information sent when trying to achieve communicative goals;
messages have been shown to be remembered better when nonverbal signals affirm the
verbal exchange.[20]
Nonverbal behavior is sometimes used as the sole channel for communication of a message.
People learn to identify facial expressions, body movements, and body positioning as
corresponding with specific feelings and intentions. Nonverbal signals can be used
without verbal communication to convey messages; when nonverbal behavior does not
effectively communicate a message, verbal methods are used to enhance understanding. [21]
Nonverbal behavior also regulates our conversations. For example, touching someone's arm
can signal that you want to talk next or interrupt.[21]
Nonverbal signals are used to alter the interpretation of verbal messages. Touch, voice
pitch, and gestures are some of the tools people use to accent or amplify the message that
is sent; nonverbal behavior can also be used to moderate or tone down aspects of verbal
messages as well.[22] For example, a person who is verbally expressing anger may accent
the verbal message by shaking a fist.
Dance and nonverbal communication
Dance is a form of nonverbal communication that requires the same underlying faculty in
the brain for conceptualization, creativity and memory as does verbal language in speaking
and writing. As means of self-expression, both forms have vocabulary (steps and gestures in
dance), grammar (rules for putting the vocabulary together) and meaning. Dance, however,
assembles (choreographs) these elements in a manner that more often resembles poetry,
with its ambiguity and multiple, symbolic and elusive meanings.
Clinical studies of nonverbal communication
From 1977 to 2004, the influence of disease and drugs on receptivity of nonverbal
communication was studied by teams at three separate medical schools using a similar
paradigm.[23] Researchers at the University of Pittsburgh, Yale University and Ohio State
University had subjects observe gamblers at a slot machine awaiting payoffs. The amount of
this payoff was read by nonverbal transmission prior to reinforcement. This technique was
developed, and the studies directed, by psychologist Dr. Robert E. Miller and psychiatrist
Dr. A. James Giannini. These groups reported diminished receptive ability in heroin
addicts[24] and phencyclidine abusers,[25] in contrast with increased receptivity in cocaine
addicts. Men with major depression[26] manifested significantly decreased ability to read
nonverbal cues when compared with euthymic men.
Freitas-Magalhaes studied the effect of smiling in the treatment of depression and concluded
that depressive states decrease when people smile more often.[27]
Obese women[28] and women with premenstrual syndrome[29] were also found to possess
diminished abilities to read these cues. In contrast, men with bipolar disorder
possessed increased abilities.[30] A woman with total paralysis of the nerves of facial
expression was found unable to transmit any nonverbal facial cues whatsoever.[31] Because
the accuracy of nonverbal receptivity varied across these groups, the members of
the research team hypothesized a biochemical site in the brain which was operative for
reception of nonverbal cues. Because certain drugs enhanced ability while others diminished
it, the neurotransmitters dopamine and endorphin were considered to be likely etiological
candidates. Based on the available data, however, the primary cause and primary effect
could not be sorted out on the basis of the paradigm employed.[32]
A byproduct of the work of the Pittsburgh/Yale/Ohio State team was an investigation of the
role of nonverbal facial cues in heterosexual non-date rape. Males who were serial rapists of
adult women were studied for nonverbal receptive abilities. Their scores were the highest of
any subgroup.[33] Rape victims were next tested. It was reported that women who had been
raped on at least two occasions by different perpetrators had a highly significant impairment
in their abilities to read these cues in either male or female senders. [34] These results were
troubling, indicating a predator-prey model. The authors did note that whatever the nature
of these preliminary findings the responsibility of the rapist was in no manner or
The final target of study for this group was the medical students they taught. Medical
students at Ohio State University, Ohio University and Northeast Ohio Medical College were
invited to serve as subjects. Students indicating a preference for the specialties of family
practice, psychiatry, pediatrics and obstetrics-gynecology achieved significantly higher
levels of accuracy than those students who planned to train as surgeons, radiologists, or
pathologists. Internal medicine and plastic surgery candidates scored at levels near the
mean.
Difficulties with nonverbal communication
People vary in their ability to send and receive nonverbal communication. On average,
women are, to a moderate degree, better at nonverbal communication than men.[36][37][38]
Measurements of the ability to communicate nonverbally and the capacity to feel empathy
have shown that the two abilities are independent of each other [40].
For people who have relatively large difficulties with nonverbal communication, this can
pose significant challenges, especially in interpersonal relationships. There are resources
tailored specifically to these people, which attempt to help them understand
information that comes more easily to others. One group that faces these
challenges is people with autism spectrum disorders, including Asperger syndrome.
Glottal states
From open to closed:
  Voiceless (full airstream)
  Breathy voice (murmur)
  Slack voice
  Modal voice (maximum vibration)
  Stiff voice
  Creaky voice (restricted airstream)
  Glottalized (blocked airstream)
Supra-glottal phonation:
  Faucalized voice ("hollow")
  Harsh voice ("pressed")
  Strident (harsh trilled)
Non-phonemic phonation:
  Falsetto
Phonation has slightly different meanings depending on the subfield of phonetics. Among
some phoneticians, phonation is the process by which the vocal folds produce certain sounds
through quasi-periodic vibration. This is the definition used among those who study
laryngeal anatomy and physiology and speech production in general. Other phoneticians,
though, call this process quasi-periodic vibration voicing, and they use the term phonation
to refer to any oscillatory state of any part of the larynx that modifies the airstream, of
which voicing is just one example. As such, voiceless and supra-glottal phonation are
included under this definition, which is common in the field of linguistic phonetics.
Voicing
The phonatory process, or voicing, occurs when air is expelled from the lungs through the
glottis, creating a pressure drop across the larynx. When this drop becomes sufficiently
large, the vocal folds start to oscillate. The minimum pressure drop required to achieve
phonation is called the phonation threshold pressure, and for humans with normal vocal
folds, it is approximately 2–3 cm H2O. The motion of the vocal folds during oscillation is
mostly in the lateral direction, though there is also some superior component as well.
However, there is almost no motion along the length of the vocal folds. The oscillation of
the vocal folds serves to modulate the pressure and flow of the air through the larynx, and
this modulated airflow is the main component of the sound of most voiced phones.
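For readers more used to SI units, the phonation threshold pressure quoted above can be converted with the conventional factor 1 cm H2O = 98.0665 Pa. The sketch below is only a unit conversion; the helper name cm_h2o_to_pa is an illustrative assumption.

```python
PA_PER_CM_H2O = 98.0665  # conventional value of 1 cm of water column, in pascals


def cm_h2o_to_pa(pressure_cm_h2o):
    """Convert a pressure from centimetres of water to pascals."""
    return pressure_cm_h2o * PA_PER_CM_H2O


# The 2-3 cm H2O range given in the text, expressed in pascals.
low, high = cm_h2o_to_pa(2), cm_h2o_to_pa(3)
print(f"{low:.0f}-{high:.0f} Pa")  # 196-294 Pa
```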
The sound that the larynx produces is a harmonic series. In other words, it consists of a
fundamental tone (called the fundamental frequency, the main acoustic cue for the
percept pitch) accompanied by harmonic overtones, which are integer multiples of the
fundamental frequency. According to the Source-Filter Theory, the resulting sound excites
the resonance chamber that is the vocal tract to produce the individual speech sounds.
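As a small illustration of the harmonic series described above, the sketch below lists the first few harmonics of a given fundamental. The function name harmonic_series and the 125 Hz fundamental are illustrative assumptions, not values taken from the text.

```python
def harmonic_series(f0_hz, count):
    """Return the first `count` harmonics of f0_hz.

    The k-th harmonic is k * f0_hz; k = 1 is the fundamental itself.
    """
    if f0_hz <= 0 or count < 0:
        raise ValueError("f0_hz must be positive and count non-negative")
    return [k * f0_hz for k in range(1, count + 1)]


# Hypothetical fundamental of 125 Hz.
print(harmonic_series(125.0, 5))  # [125.0, 250.0, 375.0, 500.0, 625.0]
```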
The vocal folds will not oscillate if they are not sufficiently close to one another, are not
under sufficient tension or under too much tension, or if the pressure drop across the larynx
is not sufficiently large. In linguistics, a phone is called voiceless if there is no phonation
during its occurrence. In speech, voiceless phones are associated with vocal folds that are
elongated, highly tensed, and placed laterally (abducted) when compared to vocal folds
during phonation.
Fundamental frequency, the main acoustic cue for the percept pitch, can be varied through
a variety of means. Large scale changes are accomplished by increasing the tension in the
vocal folds through contraction of the cricothyroid muscle. Smaller changes in tension can
be effected by contraction of the thyroarytenoid muscle or changes in the relative position
of the thyroid and cricoid cartilages, as may occur when the larynx is lowered or raised,
either volitionally or through movement of the tongue to which the larynx is attached via
the hyoid bone. In addition to tension changes, fundamental frequency is also affected by
the pressure drop across the larynx, which is mostly affected by the pressure in the lungs,
and will also vary with the distance between the vocal folds. Variation in fundamental
frequency is used linguistically to produce intonation and tone.
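The link between fold tension and pitch can be roughly illustrated with the ideal-string approximation, in which frequency rises with the square root of tissue stress and falls with vibrating length. This model is a simplification that does not come from the text above, and the function name and all numeric values (length, stress, density) are hypothetical placeholders chosen only to land on a plausible speaking pitch.

```python
import math


def ideal_string_f0(length_m, stress_pa, density_kg_m3):
    """Ideal-string estimate: f0 = sqrt(stress / density) / (2 * length)."""
    return math.sqrt(stress_pa / density_kg_m3) / (2.0 * length_m)


# Hypothetical values: 16 mm vibrating length, tissue density 1040 kg/m^3.
base = ideal_string_f0(0.016, 16640.0, 1040.0)
tensed = ideal_string_f0(0.016, 4 * 16640.0, 1040.0)

# In this model, quadrupling the stress doubles the frequency (one octave up).
print(round(base, 1), round(tensed / base, 2))
```

In this simplified picture a fourfold increase in stress is needed to raise pitch by one octave; real vocal fold tissue is layered and nonlinear, so the numbers should not be read as measurements.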
There are currently two main theories as to how vibration of the vocal folds is initiated:
the myoelastic theory and the aerodynamic theory. These two theories are not in
contention with one another and it is quite possible that both theories are true and
operating simultaneously to initiate and maintain vibration. A third theory,
the neurochronaxic theory, was in considerable vogue in the 1950s, but has since been
largely discredited.
Myoelastic and aerodynamic theory
The myoelastic theory states that when the vocal cords are brought together and breath
pressure is applied to them, the cords remain closed until the pressure beneath them—the
subglottic pressure—is sufficient to push them apart, allowing air to escape and reducing
the pressure enough for the muscle tension recoil to pull the folds back together again.
Pressure builds up once again until the cords are pushed apart, and the whole cycle keeps
repeating itself. The rate at which the cords open and close—the number of cycles per
second—determines the pitch of the phonation.
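The relation between cycle rate and pitch stated above can be sketched as a count of open-close cycles per second. The function name and the list of closure instants below are hypothetical; real closure instants would come from a measurement such as electroglottography.

```python
def f0_from_closures(closure_times_s):
    """Estimate fundamental frequency (Hz) from successive glottal closure instants.

    The number of complete cycles is one less than the number of closures.
    """
    if len(closure_times_s) < 2:
        raise ValueError("need at least two closure instants")
    elapsed = closure_times_s[-1] - closure_times_s[0]
    if elapsed <= 0:
        raise ValueError("closure instants must be increasing")
    return (len(closure_times_s) - 1) / elapsed


# Hypothetical closures every 8 ms: four cycles in 32 ms, about 125 cycles per second.
closures = [0.000, 0.008, 0.016, 0.024, 0.032]
print(round(f0_from_closures(closures), 1))  # 125.0
```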
The aerodynamic theory is based on the Bernoulli energy law in fluids. The theory states
that when a stream of breath is flowing through the glottis while the arytenoid cartilages are
held together by the action of the interarytenoid muscles, a push-pull effect is created on
the vocal fold tissues that maintains self-sustained oscillation. The push occurs during
glottal opening, when the glottis is convergent, whereas the pull occurs during glottal
closing, when the glottis is divergent. During glottal closure, the air flow is cut off until
breath pressure pushes the folds apart and the flow starts up again, causing the cycles to
repeat. The textbook entitled Myoelastic Aerodynamic Theory of Phonation by Ingo Titze
credits Janwillem van den Berg as the originator of the theory and provides detailed
mathematical development of the theory (Titze, I.R., 2006).
Neurochronaxic theory
This theory states that the frequency of the vocal fold vibration is determined by the
chronaxy of the recurrent nerve, and not by breath pressure or muscular tension. Advocates
of this theory thought that every single vibration of the vocal folds was due to an impulse
from the recurrent laryngeal nerves and that the acoustic center in the brain regulated the
speed of vocal fold vibration. Speech and voice scientists have long since abandoned this
theory, because the muscles have been shown to be unable to contract fast enough to
accomplish the vibration. In addition, persons with paralyzed vocal folds can produce phonation, which
would not be possible according to this theory. Phonation occurring in excised larynges
would also not be possible according to this theory.
As the state of the glottis

A continuum from closed glottis to open. The black triangles represent the arytenoid
cartilages, the sail shapes the vocal cords, and the dotted circle the windpipe.
In linguistic phonetic treatments of phonation, such as those of Peter Ladefoged, phonation
was considered to be a matter of points on a continuum of tension and closure of the vocal
cords. More intricate mechanisms were occasionally described, but they were difficult to
investigate, and until recently the state of the glottis and phonation were considered to be
nearly synonymous.
If the vocal cords are completely relaxed, with the arytenoid cartilages apart for maximum
airflow, the cords do not vibrate. This is voiceless phonation, and is extremely common
with obstruents. If the arytenoids are pressed together for glottal closure, the vocal cords
block the airstream, producing stop sounds such as the glottal stop. In between there is
a sweet spot of maximum vibration. This is modal voice, and is the normal state for vowels
and sonorants in all the world's languages. However, the aperture of the arytenoid
cartilages, and therefore the tension in the vocal cords, is one of degree between the end
points of open and closed, and there are several intermediate situations utilized by various
languages to make contrasting sounds.
For example, Gujarati has vowels with a partially lax phonation called breathy
voice or murmured, while Burmese has vowels with a partially tense phonation
called creaky voice or laryngealized. Both of these phonations have dedicated IPA
diacritics, an under-umlaut and under-tilde. The Jalapa dialect of Mazatec is unusual in
contrasting both with modal voice in a three-way distinction. (Note that Mazatec is a tonal
language, so the glottis is making several tonal distinctions simultaneously with the
phonation distinctions.)
breathy voice [ja̤] he wears
modal voice [já] tree
creaky voice [ja̰]
Note: There was an editing error in the source of this information. The latter two
translations may have been mixed up.
Javanese does not have modal voice in its plosives, but contrasts two other points along the
phonation scale, with more moderate departures from modal voice, called slack
voice and stiff voice. The "muddy" consonants in Shanghainese are slack voice; they
contrast with tenuis and aspirated consonants.
Although each language may be somewhat different, it is convenient to classify these
degrees of phonation into discrete categories. A series of seven alveolar plosives, with
phonations ranging from an open/lax to a closed/tense glottis, are:
Open glottis     [t]     voiceless (full airstream)
                 [d̤]     breathy voice
                 [d̥]     slack voice
Sweet spot       [d]     modal voice (maximum vibration)
                 [d̬]     stiff voice
                 [d̰]     creaky voice
Closed glottis   [ʔ͡t]    glottal closure (blocked airstream)
The IPA diacritics under-ring and subscript wedge, commonly called "voiceless" and
"voiced", are sometimes added to the symbol for a voiced sound to indicate more lax/open
(slack) and tense/closed (stiff) states of the glottis, respectively. (Ironically, adding the
'voicing' diacritic to the symbol for a voiced consonant indicates less modal voicing, not
more, because a modally voiced sound is already fully voiced, at its sweet spot, and any
further tension in the vocal cords dampens their vibration.)
Alsatian, like several Germanic languages, has a typologically unusual phonation in its
stops. The consonants transcribed /b̥/, /d̥/, /ɡ̊/ (ambiguously called "lenis") are partially
voiced: The vocal cords are positioned as for voicing, but do not actually vibrate. That is,
they are technically voiceless, but without the open glottis usually associated with voiceless
stops. They contrast with both modally voiced /b, d, ɡ/ and modally voiceless /p, t, k/ in
French borrowings, as well as aspirated /kʰ/ word initially.
Glottal consonants
It has long been noted that in many languages, both phonologically and historically,
the glottal consonants [ʔ, ɦ, h] do not behave like other consonants. Phonetically, they have
no manner or place of articulation other than the state of the glottis: glottal
closure for [ʔ], breathy voice for [ɦ], and open airstream for [h]. Some phoneticians have
described these sounds as neither glottal nor consonantal, but instead as instances of pure
phonation, at least in many European languages. However, in Semitic languages they do
appear to be true glottal consonants.
Supra-glottal phonation
In the last few decades it has become apparent that phonation may involve the entire
larynx, with as many as six valves and muscles working either independently or together.
From the glottis upward, these articulations are:
1. glottal (the vocal cords), producing the distinctions described above
2. ventricular (the 'false vocal cords', partially covering and damping the glottis)
3. arytenoid (sphincteric compression forwards and upwards)
4. epiglotto-pharyngeal (retraction of the tongue and epiglottis, potentially closing
onto the pharyngeal wall)
5. raising or lowering of the entire larynx
6. narrowing of the pharynx
Until the development of fiber-optic laryngoscopy, the full involvement of the larynx during
speech production was not observable, and the interactions among the six laryngeal
articulators are still poorly understood. However, at least two supra-glottal phonations appear
to be widespread in the world's languages. These are harsh voice ('ventricular' or 'pressed'
voice), which involves overall constriction of the larynx, and faucalized voice ('hollow' or
'yawny' voice), which involves overall expansion of the larynx.
The Bor dialect of Dinka has contrastive modal, breathy, faucalized, and harsh voice in its
vowels, as well as three tones. The ad hoc diacritics employed in the literature are a
subscript double quotation mark for faucalized voice, [a͈], and underlining for harsh
voice, [a̱]. Examples are:
Voice       modal      breathy    harsh       faucalized
Bor Dinka   tɕìt       tɕì̤t       tɕì̱t        tɕì͈t
Gloss       diarrhea   go ahead   scorpions   to swallow
Other languages with these contrasts are Bai (modal, breathy, and harsh
voice), Kabiye (faucalized and harsh voice, previously seen as ±ATR), and Somali (breathy and
harsh voice).
Elements of laryngeal articulation or phonation may occur widely in the world's languages as
phonetic detail even when not phonemically contrastive. For example, simultaneous glottal,
ventricular, and arytenoid activity (for something other than epiglottal consonants) has
been observed in Tibetan, Korean, Nuuchahnulth, Nlaka'pamux, Thai, Sui, Amis, Pame, Arabic,
Tigrinya, Cantonese, and Yi.
Familiar language examples
In languages such as French, all obstruents occur in pairs, one modally voiced and one
voiceless.[citation needed]
In English, every voiced fricative corresponds to a voiceless one. For the pairs of
English plosives, however, the distinction is better specified as voice onset time rather than
simply voice: In initial position /b d g/ are only partially voiced (voicing begins during the
hold of the consonant), while /p t k/ are aspirated (voicing doesn't begin until well after its
release).[citation needed] Certain English morphemes have voiced and voiceless allomorphs, such
as the plural, verbal, and possessive endings spelled -s (voiced in kids /kɪdz/ but voiceless
in kits /kɪts/) and the past-tense ending spelled -ed (voiced in buzzed /bʌzd/ but voiceless
in fished /fɪʃt/).[citation needed]
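The voiced and voiceless allomorphs described above can be sketched as a choice driven by the voicing of the stem's final sound. This is a deliberately simplified illustration: the VOICELESS set and the function name are assumptions made here, each IPA symbol is treated as one segment, and stems ending in a sibilant (as in buses) or in /t d/ before -ed (as in wanted), which take an extra vowel, are not handled.

```python
# Voiceless English consonants written as single IPA characters (illustrative set).
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "h"}


def attach_suffix(stem_ipa, voiced_form, voiceless_form):
    """Append the suffix allomorph that agrees in voicing with the stem-final segment."""
    if not stem_ipa:
        raise ValueError("stem must not be empty")
    final_segment = stem_ipa[-1]
    suffix = voiceless_form if final_segment in VOICELESS else voiced_form
    return stem_ipa + suffix


# The examples from the text: the -s ending and the -ed ending.
print(attach_suffix("kɪd", "z", "s"))  # kɪdz
print(attach_suffix("kɪt", "z", "s"))  # kɪts
print(attach_suffix("bʌz", "d", "t"))  # bʌzd
print(attach_suffix("fɪʃ", "d", "t"))  # fɪʃt
```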
A few European languages, such as Finnish, have no phonemically voiced obstruents but
pairs of long and short consonants instead. Outside of Europe, a lack of voicing distinctions
is not uncommon; indeed, in Australian languages it is nearly universal. In languages
without the distinction between voiceless and voiced obstruents, it is often found that they
are realized as voiced in voiced environments such as between vowels, and voiceless
elsewhere.
Vocal registers
In phonology
In phonology, a register is a combination of tone and vowel phonation into a single
phonological parameter. For example, among its vowels, Burmese combines modal voice
with low tone, breathy voice with falling tone, creaky voice with high tone, and glottal
closure with high tone. These four registers contrast with each other, but no other
combination of phonation (modal, breath, creak, closed) and tone (high, low, falling) is
found.
In pedagogy and speech pathology
Among vocal pedagogues and speech pathologists, a vocal register also refers to a
particular phonation limited to a particular range of pitch, which possesses a characteristic
sound quality. The term "register" may be used for several distinct aspects of the human
voice:
 A particular part of the vocal range, such as the upper, middle, or lower registers,
which may be bounded by
vocal breaks
 A particular phonation
 A resonance area such as chest voice or head voice
 A certain vocal timbre
Four combinations of these elements are identified in speech pathology: the vocal fry
register, the modal register, the falsetto register, and the whistle register.
Phonetics (from the Greek: φωνή, phōnē, "sound, voice") is a branch of linguistics that
comprises the study of the sounds of human speech. It is concerned with the physical
properties of speech sounds (phones): their physiological production, acoustic properties,
auditory perception, and neurophysiological status. Phonology, on the other hand, is
concerned with abstract, grammatical characterization of systems of sounds.
Phonetics was studied as early as 2500 years ago in ancient India, with Pāṇini's account of
the place and manner of articulation of consonants in his 5th century BC treatise
on Sanskrit. The major Indic alphabets today order their consonants according
to Pāṇini's classification. The Ancient Greeks are credited as the first to base a writing
system on a phonetic alphabet. Modern phonetics began with Alexander Melville Bell,
whose Visible Speech (1867) introduced a system of precise notation for writing down
speech sounds.
The study of phonetics advanced considerably in the late 19th century, helped by the
invention of the phonograph, which allowed the speech signal to be recorded and then later
processed and analyzed. By replaying the same speech signal from the phonograph several
times, filtering it each time with a different band-pass filter, a spectrogram of the speech
utterance could be built up.
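The replay-and-filter procedure described above can be sketched digitally: one pass over the recording per frequency band, each pass yielding one row of the spectrogram. This is a modern stand-in, not the historical apparatus; it uses DFT bins as an idealized band-pass filter, and the function names, band edges, frame length, and test tone are all illustrative assumptions.

```python
import cmath
import math


def band_energy(frame, sample_rate, lo_hz, hi_hz):
    """Energy of `frame` in [lo_hz, hi_hz), using DFT bins as an idealized band-pass filter."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if lo_hz <= freq < hi_hz:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(frame))
            energy += abs(coeff) ** 2
    return energy


def filterbank_spectrogram(signal, sample_rate, bands, frame_len):
    """Return one row per band; each row holds that band's energy in successive frames."""
    frames = [signal[s:s + frame_len]
              for s in range(0, len(signal) - frame_len + 1, frame_len)]
    # One pass over the whole recording per band, mirroring the repeated playback.
    return [[band_energy(f, sample_rate, lo, hi) for f in frames]
            for (lo, hi) in bands]


# Hypothetical test signal: a 500 Hz tone sampled at 8 kHz, two frames long.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 500 * i / sample_rate) for i in range(320)]
bands = [(0, 250), (250, 750), (750, 1250)]
rows = filterbank_spectrogram(tone, sample_rate, bands, frame_len=160)

# Index of the strongest band in each frame; the tone falls in the middle band.
strongest = [max(range(len(bands)), key=lambda b: rows[b][t]) for t in range(2)]
print(strongest)  # [1, 1]
```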
A series of papers by Ludimar Hermann published in Pflüger's Archiv in the last two decades
of the 19th century investigated the spectral properties of vowels and consonants using the
Edison phonograph, and it was in these papers that the term formant was first introduced.
Hermann also played back vowel recordings made with the Edison phonograph at different
speeds to distinguish between Willis' and Wheatstone's theories of vowel production.
Phonetics as a research discipline has three main branches:
 articulatory phonetics is concerned with the articulation of speech: The position,
shape, and movement of articulators or speech organs, such as the lips, tongue,
and vocal folds.
 acoustic phonetics is concerned with acoustics of speech: The spectro-temporal
properties of the sound waves produced by speech, such as their frequency, amplitude,
and harmonic structure.
 auditory phonetics is concerned with speech perception:
the perception, categorization, and recognition of speech sounds and the role of
the auditory system and the brain in the same.
Phonetic transcription is a system for transcribing sounds that occur in spoken
language or signed language. The most widely known system of phonetic transcription,
the International Phonetic Alphabet (IPA), uses a one-to-one mapping between phones and
written symbols. The standardized nature of the IPA enables its users to transcribe
accurately and consistently the phones of different languages, dialects, and idiolects. The
IPA is a useful tool not only for the study of phonetics, but also for language teaching,
professional acting, and speech pathology.
Applications of phonetics include:
 forensic phonetics: the use of phonetics (the science of speech) for forensic (legal)
purposes
 speech recognition: the analysis and transcription of recorded speech by a computer
Relation to phonology
In contrast to phonetics, phonology is the study of how sounds and gestures pattern in and
across languages, relating such concerns with other levels and aspects of language.
Phonetics deals with the articulatory and acoustic properties of speech sounds, how they are
produced, and how they are perceived. As part of this investigation, phoneticians may
concern themselves with the physical properties of meaningful sound contrasts or the social
meaning encoded in the speech signal (e.g. gender, sexuality, ethnicity, etc.). However, a
substantial portion of research in phonetics is not concerned with the meaningful elements
in the speech signal.
While it is widely agreed that phonology is grounded in phonetics, phonology is a distinct
branch of linguistics, concerned with sounds and gestures as abstract units
(e.g., features, phonemes, morae, syllables, etc.) and their conditioned variation (via,
e.g., allophonic rules, constraints, or derivational rules). Phonology relates to phonetics via
the set of distinctive features, which map the abstract representations of speech units to
articulatory gestures, acoustic signals, and/or perceptual representations.

Puberty is the process of physical changes by which a child's body becomes an adult body
capable of reproduction. Puberty is initiated by hormone signals from the brain to
the gonads (the ovaries and testes). In response, the gonads produce a variety of hormones
that stimulate the growth, function, or transformation
of brain, bones, muscle, blood, skin, hair, breasts, and reproductive
organs. Growth accelerates in the first half of puberty and stops at the completion of
puberty. Before puberty, body differences between boys and girls are almost entirely
restricted to the genitalia. During puberty, major differences of size, shape, composition,
and function develop in many body structures and systems. The most obvious of these are
referred to as secondary sex characteristics.
In a strict sense, the term puberty (derived from the Latin word puberatum (age of
maturity, manhood)) refers to the bodily changes of sexual maturation rather than
the psychosocial and cultural aspects of adolescent development. Adolescence is the period
of psychological and social transition between childhood and adulthood. Adolescence largely
overlaps the period of puberty, but its boundaries are less precisely defined and it refers as
much to the psychosocial and cultural characteristics of development during the teen years
as to the physical changes of puberty.
Differences between male and female puberty
Two of the most significant differences between puberty in girls and puberty in boys are the
age at which it begins, and the major sex steroids involved.

Approximate outline of development periods in child and teenager development. Puberty is
marked in green at right.
Although there is a wide range of normal ages, girls typically begin the process of puberty
at age 10, boys at age 12. Girls usually complete puberty by ages 15–17, while boys usually
complete puberty by ages 16–18. Any increase in height beyond these ages is uncommon.
Girls attain reproductive maturity about 4 years after the first physical changes of puberty
appear. In contrast, boys accelerate more slowly but continue to grow for about 6 years
after the first visible pubertal changes.

Diagram legend:
1 Follicle-stimulating hormone - FSH
2 Luteinizing hormone - LH
3 Progesterone
4 Estrogen
5 Hypothalamus
6 Pituitary gland
7 Ovary
8 Pregnancy - hCG (Human chorionic gonadotropin)
9 Testosterone
10 Testicle
11 Incentives
12 Prolactin - PRL
For boys, an androgen called testosterone is the principal sex hormone. While testosterone
produces all boys' changes characterized as virilization, a substantial product of
testosterone metabolism in males is estradiol, though levels rise later and more slowly than
in girls. The male "growth spurt" also begins later, accelerates more slowly, and lasts longer
before the epiphyses fuse. Although boys are on average 2 cm shorter than girls before
puberty begins, adult men are on average about 13 cm (5.1 inches) taller than women.
Most of this sex difference in adult heights is attributable to a later onset of the growth
spurt and a slower progression to completion, a direct result of the later rise and lower adult
male levels of estradiol.
The hormone that dominates female development is an estrogen called estradiol. While
estradiol promotes growth of breasts and uterus, it is also the principal hormone driving the
pubertal growth spurt and epiphyseal maturation and closure. Estradiol levels rise earlier
and reach higher levels in women than in men.
Puberty onset
The onset of puberty is associated with high GnRH pulsing, which precedes the rise in sex
hormones, LH and FSH. Exogenous GnRH pulses cause the onset of puberty. Brain tumors
which increase GnRH output may also lead to premature puberty.
The cause of the GnRH rise is unknown. Leptin might be the cause of the GnRH rise. Leptin
has receptors in the hypothalamus which synthesizes GnRH. Individuals who are deficient in
leptin fail to initiate puberty. The levels of leptin increase with the onset of puberty, and
then decline to adult levels when puberty is completed. The rise in GnRH might also be
caused by genetics. A study discovered that a mutation in genes encoding both Neurokinin
B as well as the Neurokinin B receptor can alter the timing of puberty. The researchers
hypothesized that Neurokinin B might play a role in regulating the secretion of Kisspeptin, a
compound responsible for triggering direct release of GnRH as well as indirect release
of LH and FSH.
Physical changes in boys
Testicular size, function, and fertility
In boys, testicular enlargement is the first physical manifestation of puberty (and is
termed gonadarche). Testes in prepubertal boys change little in size from about 1 year of
age to the onset of puberty, averaging about 2–3 cm in length and about 1.5–2 cm in width.
Testicular size continues to increase throughout puberty, reaching maximal adult size about
6 years after the onset of puberty. After the boy's testicles have enlarged and developed for
about one year, the length and then the breadth of the shaft of the penis will increase and
the glans penis and corpora cavernosa will also start to enlarge to adult proportions. While
18–20 cc is an average adult size, there is wide variation in testicular size in the normal
population.
The testes have two primary functions: to produce hormones and to produce sperm.
The Leydig cells produce testosterone, which in turn produces most of the male pubertal
changes. Most of the increasing bulk of testicular tissue is spermatogenic tissue
(primarily Sertoli and Leydig cells). Sperm can be detected in the morning urine of most
boys after the first year of pubertal changes, and occasionally earlier [citation needed]. On average,
potential fertility in boys is reached at 13 years old, but full fertility will not be gained until
14–16 years of age[citation needed].
During puberty, the scrotum becomes larger and begins to hang below the body rather than
being held tightly against it, because sperm production requires the testicles to be kept at
a suitable temperature.
Pubic hair
Pubic hair often appears on a boy shortly after the genitalia begin to grow. The pubic hairs
are usually first visible at the dorsal (abdominal) base of the penis. The first few hairs are
described as stage 2. Stage 3 is usually reached within another 6–12 months, when the
hairs are too many to count. By stage 4, the pubic hairs densely fill the "pubic triangle."
Stage 5 refers to the spread of pubic hair to the thighs and upward towards the navel as
part of the developing abdominal hair.
Body and facial hair

Facial hair of a male that has been shaved

In the months and years following the appearance of pubic hair, other areas of skin that
respond to androgens may develop androgenic hair. The usual sequence
is: underarm (axillary) hair, perianal hair, upper lip hair, sideburn (preauricular) hair,
periareolar hair, and the beard area. As with most human biological processes, this specific
order may vary among some individuals. Arm, leg, chest, abdominal, and back hair become
heavier more gradually. There is a large range in amount of body hair among adult men,
and significant differences in timing and quantity of hair growth among different racial
groups. Facial hair is often present in late adolescence, but may not appear until
significantly later.[19] Facial hair will continue to get coarser, darker and thicker for another
2–4 years after puberty. Some men do not develop full facial hair for up to 10 years after
the completion of puberty. Chest hair may appear during puberty or years after. Not all men
have chest hair.
Voice change
Under the influence of androgens, the voice box, or larynx, grows in both sexes. This
growth is far more prominent in boys, causing the male voice to drop and deepen by about
one octave, sometimes abruptly but rarely overnight, because the longer and
thicker vocal folds have a lower fundamental frequency. Before puberty, the larynx of boys
and girls is about equally small. [20] Occasionally, voice change is accompanied by
unsteadiness of vocalization in the early stages of untrained voices. Most of the voice
change happens during stages 3–4 of male puberty, around the time of peak growth. Full
adult pitch is attained at an average age of 15 years. It usually precedes the development
of significant facial hair by several months to years.
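A drop of one octave corresponds to a halving of fundamental frequency. The sketch below makes that arithmetic explicit; the function names and the 250 Hz starting pitch are hypothetical and are not measurements from the text.

```python
import math


def shift_octaves(f0_hz, octaves):
    """Return f0_hz shifted by a number of octaves (negative values lower the pitch)."""
    return f0_hz * 2.0 ** octaves


def interval_semitones(from_hz, to_hz):
    """Signed size of the interval from one frequency to another, in equal-tempered semitones."""
    return 12.0 * math.log2(to_hz / from_hz)


# Hypothetical pre-change pitch of 250 Hz dropping by one octave.
before = 250.0
after = shift_octaves(before, -1)
print(after, interval_semitones(before, after))  # 125.0 -12.0
```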
Male musculature and body shape
By the end of puberty, adult men have heavier bones and nearly twice as much
skeletal muscle. Some of the bone growth (e.g. shoulder width and jaw) is
disproportionately greater, resulting in noticeably different male and female skeletal shapes.
The average adult male has about 150% of the lean body mass of an average female, and
about 50% of the body fat.
This muscle develops mainly during the later stages of puberty, and muscle growth can
continue even after boys are biologically adult. The peak of the so-called "strength spurt",
the rate of muscle growth, is attained about one year after a male experiences his peak
growth rate.
Often, the fat pads of the male breast tissue and the male nipples will develop during
puberty; sometimes, especially in one breast, this becomes more apparent and is
termed gynecomastia. It is usually not a permanent phenomenon.
Body odor and acne
Rising levels of androgens can change the fatty acid composition of perspiration, resulting in
a more "adult" body odor. As in girls, another androgen effect is increased secretion of oil
(sebum) from the skin and the resultant variable amounts of acne. Acne cannot be
prevented or reduced easily, but it typically clears by the end of puberty.
However, it is not unusual for a fully grown adult to suffer the occasional bout of acne,
though it is normally less severe than in adolescents. Some people use prescription
topical creams or ointments, or even oral medication, to keep acne from getting worse,
because acne can be emotionally difficult and can cause scarring.
Physical changes in girls
Breast development
The first physical sign of puberty in girls is usually a firm, tender lump under the center of
the areola of one or both breasts, occurring on average at about 10.5 years of age. [21] This
is referred to as thelarche. By the widely used Tanner staging of puberty, this is stage 2 of
breast development (stage 1 is a flat, prepubertal breast). Within six to 12 months, the
swelling has clearly begun in both sides, softened, and can be felt and seen extending
beyond the edges of the areolae. This is stage 3 of breast development. By another 12
months (stage 4), the breasts are approaching mature size and shape, with areolae
and papillae forming a secondary mound. In most young women, this mound disappears
into the contour of the mature breast (stage 5), although there is so much variation in sizes
and shapes of adult breasts that stages 4 and 5 are not always separately identifiable.
Pubic hair
Pubic hair is often the second noticeable change in puberty, usually within a few months of
thelarche.[23] It is referred to as pubarche. The pubic hairs are usually visible first along
the labia. The first few hairs are described as Tanner stage 2. [22] Stage 3 is usually reached
within another 6–12 months, when the hairs are too numerous to count and appear on
the pubic mound as well. By stage 4, the pubic hairs densely fill the "pubic triangle." Stage
5 refers to spread of pubic hair to the thighs and sometimes as abdominal hair upward
towards the navel. In about 15% of girls, the earliest pubic hair appears before breast
development begins.[23]
Vagina, uterus, ovaries
The mucosal surface of the vagina also changes in response to increasing levels of estrogen,
becoming thicker and duller pink in color (in contrast to the brighter red of the prepubertal
vaginal mucosa).[24] Whitish secretions (physiologic leukorrhea) are a normal effect of
estrogen as well.[21] In the two years following thelarche, the uterus, ovaries, and
the follicles in the ovaries increase in size.[25] The ovaries usually contain small
follicular cysts visible by ultrasound.
Menstruation and fertility
The first menstrual bleeding is referred to as menarche, and typically occurs about two
years after thelarche.[23] The average age of menarche in girls is 11.75 years. [23] The time
between menstrual periods (menses) is not always regular in the first two years after
menarche.[28] Ovulation is necessary for fertility, but may or may not accompany the earliest
menses.[29] In postmenarchal girls, about 80% of the cycles were anovulatory in the first
year after menarche, 50% in the third year and 10% in the sixth year. [28] Initiation of
ovulation after menarche is not inevitable. A high proportion of girls with continued
irregularity in the menstrual cycle several years from menarche will continue to have
prolonged irregularity and anovulation, and are at higher risk for reduced fertility. [30] Nubility
is used to designate achievement of fertility.
Body shape, fat distribution, and body composition
During this period, also in response to rising levels of estrogen, the lower half of
the pelvis and thus hips widen (providing a larger birth canal).[22][31] Fat tissue increases to a
greater percentage of the body composition than in males, especially in the typical female
distribution of breasts, hips, buttocks, thighs, upper arms, and pubis. Progressive
differences in fat distribution as well as sex differences in local skeletal growth contribute to
the typical female body shape by the end of puberty. On average, at 10 years, girls have
6% more body fat than boys.[32]
Body odor and acne
Rising levels of androgens can change the fatty acid composition of perspiration, resulting in
a more "adult" body odor. This often precedes thelarche and pubarche by one or more
years. Another androgen effect is increased secretion of oil (sebum) from the skin. This
change increases the susceptibility to acne, a skin condition that is characteristic of puberty.
 Acne varies greatly in its severity.[33]
Timing of the onset of puberty
The definition of the onset of puberty depends on perspective (e.g., hormonal versus
physical) and purpose (establishing population normal standards, clinical care of early or
late pubescent individuals, etc.). The most commonly used definition of the onset of puberty
is physical changes to a person's body[citation needed]. These physical changes are the first visible
signs of neural, hormonal, and gonadal function changes.
The age at which puberty begins varies between individuals; usually, puberty begins
between the ages of 10 and 13. The age at which puberty begins is affected by both genetic factors and by
environmental factors such as nutritional state and social circumstances. [34] An example of
social circumstances is the Vandenbergh effect; a juvenile female who has significant
interaction with adult males will enter puberty earlier than juvenile females who are not
socially overexposed to adult males.[35]
The average age at which puberty begins may be affected by race as well. For example, the
average age of menarche in various populations surveyed has ranged from 12 to 18 years.
The earliest average onset of puberty is for African-American girls and the latest average
onset for high altitude subsistence populations in Asia. However, much of the higher age
averages reflect nutritional limitations more than genetic differences and can change within
a few generations with a substantial change in diet. The median age of menarche for a
population may be an index of the proportion of undernourished girls in the population, and
the width of the spread may reflect unevenness of wealth and food distribution in a population.
Researchers have identified an earlier age of the onset of puberty. However, they have
based their conclusions on a comparison of data from 1999 with data from 1969. In the
earlier example, the sample population was based on a small sample of white girls (200,
from Britain). The later study identified puberty as occurring in 48% of African-American
girls by age nine, and 12% of white girls by that age.[36]
Historical shift
The average age at which the onset of puberty occurs has dropped significantly since
the 1840s.[37][38][39] Researchers[who?] refer to this drop as the 'secular trend'. In every decade
from 1840 to 1950 there was a drop of four months in the average age of menarche among
Western European females. In Norway, girls born in 1840 had their menarche at an average
age of 17 years. In France the average in 1840 was 15.3 years. In England the average in
1840 was 16.5 years. In Japan the decline happened later and was then more rapid: from
1945 to 1975 in Japan there was a drop of 11 months per decade.
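The per-decade rates quoted above can be turned into a cumulative decline with simple arithmetic. The following is a minimal sketch that assumes the decline was uniform across each period; the function name is illustrative and not taken from any cited study.

```python
# Illustrative arithmetic only: assumes a uniform decline at the quoted rate.

def cumulative_drop_years(start_year, end_year, months_per_decade):
    """Total decline in the average age of menarche, in years."""
    decades = (end_year - start_year) / 10
    return decades * months_per_decade / 12

# Western Europe: four months per decade from 1840 to 1950.
western_europe = cumulative_drop_years(1840, 1950, 4)

# Japan: eleven months per decade from 1945 to 1975.
japan = cumulative_drop_years(1945, 1975, 11)

print(f"Western Europe, 1840-1950: {western_europe:.2f} years")  # 3.67 years
print(f"Japan, 1945-1975: {japan:.2f} years")                    # 2.75 years
```

Under this assumption, the Western European decline amounts to roughly three and two-thirds years over eleven decades, while the Japanese decline reaches two and three-quarter years in only three decades.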
A 2006 study in Denmark found that puberty, as evidenced by breast development, started
at an average age of 9 years and 10 months, a year earlier than when a similar study was
done in 1991. Scientists believe the phenomenon could be linked to obesity or exposure to
chemicals in the food chain, and is putting girls at greater long-term risk of breast cancer.[40]
Genetic influence and environmental factors
Various studies have found direct genetic effects to account for at least 46% of the variation
of timing of puberty in well-nourished populations. [41][42][43][44] The genetic association of
timing is strongest between mothers and daughters. The specific genes affecting timing are
not yet known.[41] Among the candidates is an androgen receptor gene.[45]
Researchers[46] have hypothesized that early puberty onset may be caused by certain hair
care products containing estrogen or placenta, and by certain chemicals, namely phthalates,
which are used in many cosmetics, toys, and plastic food containers.
If genetic factors account for half of the variation of pubertal timing, environment factors
are clearly important as well. One of the first observed environmental effects is that puberty
occurs later in children raised at higher altitudes. The most important of the environmental
influences is clearly nutrition, but a number of others have been identified, all of which affect
timing of female puberty and menarche more clearly than male puberty.
Hormones and steroids
There is theoretical concern, and animal evidence, that environmental hormones
and chemicals may affect aspects of prenatal or postnatal sexual development in humans.
 Large amounts of incompletely metabolized estrogens and progestagens from
pharmaceutical products are excreted into the sewage systems of large cities, and are
sometimes detectable in the environment. Sex steroids are sometimes used in cattle
farming but have been banned in chicken meat production for 40 years. Although
agricultural laws regulate use to minimize accidental human consumption, the rules are
largely self-enforced in the United States. Significant exposure of a child to hormones or
other substances that activate estrogen or androgen receptors could produce some or all of
the changes of puberty.
Harder to detect as an influence on puberty are the more diffusely distributed environmental
chemicals like PCBs (polychlorinated biphenyls), which can bind and trigger estrogen receptors.
More obvious degrees of partial puberty from direct exposure of young children to small but
significant amounts of pharmaceutical sex steroids from exposure at home may be detected
during medical evaluation for precocious puberty, but mild effects and the other potential
exposures outlined above would not.
Bisphenol A (BPA) is a chemical used to make plastics, and is frequently used to make baby
bottles, water bottles, sports equipment, medical devices, and as a coating in food and
beverage cans. Scientists are concerned about BPA's behavioral effects on fetuses, infants,
and children at current exposure levels because it can affect the prostate gland, mammary
gland, and lead to early puberty in girls. BPA mimics and interferes with the action of
estrogen, an important reproduction and development regulator. It leaches out of plastic
into liquids and foods, and the Centers for Disease Control and Prevention (CDC) found
measurable amounts of BPA in the bodies of more than 90 percent of the U.S. population
studied. The highest estimated daily intakes of BPA occur in infants and children. Many
plastic baby bottles contain BPA, and BPA is more likely to leach out of plastic when its
temperature is increased, as when one warms a baby bottle or warms up food in the
Nutritional influence
Nutritional factors are the strongest and most obvious environmental factors affecting
timing of puberty.[41] Girls are especially sensitive to nutritional regulation because they
must contribute all of the nutritional support to a growing fetus. Surplus calories (beyond
growth and activity requirements) are reflected in the amount of body fat, which signals to
the brain the availability of resources for initiation of puberty and fertility.
Much evidence suggests that for most of the last few centuries, nutritional differences
accounted for the majority of variation of pubertal timing in different populations, and even
among social classes in the same population. Recent worldwide increased consumption of
animal protein, other changes in nutrition, and increases in childhood obesity have resulted
in falling ages of puberty, mainly in those populations with the higher previous ages. In
many populations the amount of variation attributable to nutrition is shrinking.
Although available dietary energy (simple calories) is the most important dietary influence
on timing of puberty, quality of the diet plays a role as well. Lower protein intakes and
higher dietary fiber intakes, as occur with typical vegetarian diets, are associated with later
onset and slower progression of female puberty.
Obesity influence and exercise
Scientific researchers have linked early obesity with an earlier onset of puberty in girls. They
have cited obesity as a cause of breast development before nine years and menarche before
twelve years.[49] Early puberty in girls can be a harbinger of later health problems.[50]
The average level of daily physical activity has also been shown to affect timing of puberty,
especially in females. A high level of exercise, whether for athletic or body image purposes,
or for daily subsistence, reduces energy calories available for reproduction and slows
puberty. The exercise effect is often amplified by a lower body fat mass and cholesterol.
Physical and mental illness
Chronic diseases can delay puberty in both boys and girls. Those that involve chronic
inflammation or interfere with nutrition have the strongest effect. In the western
world, inflammatory bowel disease and tuberculosis have been notorious for such an effect
in the last century, while in areas of the underdeveloped world,
chronic parasite infections are widespread.
Mental illnesses occur in puberty. The brain undergoes significant development driven
by hormones, which can contribute to mood disorders such as major depressive
disorder, bipolar disorder, dysthymia and schizophrenia. Girls aged between 15 and 19 make
up 40% of anorexia nervosa cases.[51]
Stress and social factors
Some of the least understood environmental influences on timing of puberty are social and
psychological. In comparison with the effects of genetics, nutrition, and general health,
social influences are small, shifting timing by a few months rather than years. Mechanisms
of these social effects are unknown, though a variety of physiological processes,
including pheromones, have been suggested based on animal research.
The most important part of a child's psychosocial environment is the family, and most of the
social influence research has investigated features of family structure and function in
relation to earlier or later female puberty. Most of the studies have reported that menarche
may occur a few months earlier in girls in high-stress households, whose fathers are absent
during their early childhood, who have a stepfather in the home, who are subjected to
prolonged sexual abuse in childhood, or who are adopted from a developing country at a
young age. Conversely, menarche may be slightly later when a girl grows up in a large
family with a biological father present.
More extreme degrees of environmental stress, such as wartime refugee status with threat
to physical survival, have been found to be associated with delay of maturation, an effect
that may be compounded by dietary inadequacy.
Most of these reported social effects are small and our understanding is incomplete. Most of
these "effects" are statistical associations revealed by epidemiologic surveys. Statistical
associations are not necessarily causal, and a variety of covariables and alternative
explanations can be imagined. Effects of such small size can never be confirmed or refuted
for any individual child. Furthermore, interpretations of the data are politically controversial
because of the ease with which this type of research can be used for political advocacy.
Accusations of bias based on political agenda sometimes accompany scientific criticism.
Another limitation of the social research is that nearly all of it has concerned girls, partly
because female puberty requires greater physiologic resources and partly because it
involves a unique event (menarche) that makes survey research into female puberty much
simpler than male. More detail is provided in the menarche article.
Variations of sequence
The sequence of events of pubertal development can occasionally vary. For example, in
about 15% of boys and girls, pubarche (the first pubic hairs) can precede,
respectively, gonadarche and thelarche by a few months. Rarely, menarche can occur before
other signs of puberty in a few girls. These variations deserve medical evaluation because
they can occasionally signal a disease.
In a general sense, the conclusion of puberty is reproductive maturity. Criteria for defining
the conclusion may differ for different purposes: attainment of the ability to reproduce,
achievement of maximal adult height, maximal gonadal size, or adult sex hormone levels.
Maximal adult height is achieved at an average age of 15 years for an average girl and 18
years for an average boy. Potential fertility (sometimes termed nubility) usually precedes
completion of growth by 1–2 years in girls and 3–4 years in boys. Stage 5 typically
represents maximal gonadal growth and adult hormone levels.
Neurohormonal process
The endocrine reproductive system consists of the hypothalamus, the pituitary, the gonads,
and the adrenal glands, with input and regulation from many other body systems. True
puberty is often termed "central puberty" because it begins as a process of the central
nervous system. A simple description of hormonal puberty is as follows:
1. The brain's hypothalamus begins to release pulses of GnRH.
2. Cells in the anterior pituitary respond by secreting LH and FSH into the circulation.
3. The ovaries or testes respond to the rising amounts of LH and FSH by growing and
beginning to produce estradiol and testosterone.
4. Rising levels of estradiol and testosterone produce the body changes of female and
male puberty.
The onset of this neurohormonal process may precede the first visible body changes by 1–2 years.
Components of the endocrine reproductive system
The arcuate nucleus of the hypothalamus is the driver of the reproductive system. It
has neurons which generate and release pulses of GnRH into the portal venous system of
the pituitary gland. The arcuate nucleus is affected and controlled by neuronal input from
other areas of the brain and hormonal input from the gonads, adipose tissue and a variety
of other systems.
The pituitary gland responds to the pulsed GnRH signals by releasing LH and FSH into the
blood of the general circulation, also in a pulsatile pattern.
The gonads (testes and ovaries) respond to rising levels of LH and FSH by producing
the steroid sex hormones, testosterone and estrogen.
The adrenal glands are a second source for steroid hormones. Adrenal maturation,
termed adrenarche, typically precedes gonadarche in mid-childhood.
Major hormones
• Neurokinin B (a tachykinin peptide) and kisspeptin (a neuropeptide), both present
in the same hypothalamic neurons, are critical parts of the control system that switches
on the release of GnRH at the start of puberty.[52]
• GnRH (gonadotropin-releasing hormone) is a peptide hormone released from
the hypothalamus which stimulates gonadotrope cells of the anterior pituitary.
• LH (luteinizing hormone) is a larger protein hormone secreted into the general
circulation by gonadotrope cells of the anterior pituitary gland. The main target cells of
LH are the Leydig cells of testes and the theca cells of the ovaries. LH secretion changes
more dramatically with the initiation of puberty than FSH, as LH levels increase about
25-fold with the onset of puberty, compared with the 2.5-fold increase of FSH.
• FSH (follicle stimulating hormone) is another protein hormone secreted into the
general circulation by the gonadotrope cells of the anterior pituitary. The main target
cells of FSH are the ovarian follicles and the Sertoli cells and spermatogenic tissue of
the testes.
• Testosterone is a steroid hormone produced primarily by the Leydig cells of
the testes, and in lesser amounts by the theca cells of the ovaries and the adrenal
cortex. Testosterone is the primary mammalian androgen and the "original" anabolic
steroid. It acts on androgen receptors in responsive tissue throughout the body.
• Estradiol is a steroid hormone produced by aromatization of testosterone. Estradiol
is the principal human estrogen and acts on estrogen receptors throughout the body.
The largest amounts of estradiol are produced by the granulosa cells of the ovaries, but
lesser amounts are derived from testicular and adrenal testosterone.
• Adrenal androgens are steroids produced by the zona reticularis of the adrenal
cortex in both sexes. The major adrenal androgens
are dehydroepiandrosterone, androstenedione (which are precursors of testosterone),
and dehydroepiandrosterone sulfate, which is present in large amounts in the blood.
Adrenal androgens contribute to the androgenic events of early puberty in girls.
• IGF1 (insulin-like growth factor 1) rises substantially during puberty in response to
rising levels of growth hormone and may be the principal mediator of the pubertal
growth spurt.
• Leptin is a protein hormone produced by adipose tissue. Its primary target organ is
the hypothalamus. The leptin level seems to provide the brain a rough indicator of
adipose mass for purposes of regulation of appetite and energy metabolism. It also plays
a permissive role in female puberty, which usually will not proceed until an adequate
body mass has been achieved.
Endocrine perspective
The endocrine reproductive system becomes functional by the end of the first trimester of
fetal life. The testes and ovaries become briefly inactive around the time of birth but resume
hormonal activity until several months after birth, when incompletely understood
mechanisms in the brain begin to suppress the activity of the arcuate nucleus. This has
been referred to as maturation of the prepubertal "gonadostat," which becomes sensitive to
negative feedback by sex steroids. The period of hormonal activity until several months
after birth, followed by suppression of activity, may correspond to the period of infant
sexuality, followed by a latency stage, which Sigmund Freud described.[53]
Gonadotropin and sex steroid levels fall to low levels (nearly undetectable by current clinical
assays) for approximately another 8 to 10 years of childhood. Evidence is accumulating that
the reproductive system is not totally inactive during the childhood years. Subtle increases
in gonadotropin pulses occur, and ovarian follicles surrounding germ cells (future eggs)
double in number.
Normal puberty is initiated in the hypothalamus, with de-inhibition of the pulse generator in
the arcuate nucleus. This inhibition of the arcuate nucleus is an ongoing active suppression
by other areas of the brain. The signal and mechanism releasing the arcuate nucleus from
inhibition have been the subject of investigation for decades and remain incompletely
understood. Leptin levels rise throughout childhood and play a part in allowing the arcuate
nucleus to resume operation. If the childhood inhibition of the arcuate nucleus is interrupted
prematurely by injury to the brain, it may resume pulsatile gonadotropin release and
puberty will begin at an early age.
Neurons of the arcuate nucleus secrete gonadotropin releasing hormone (GnRH) into the
blood of the pituitary portal system. An American physiologist, Ernst Knobil, found that the
GnRH signals from the hypothalamus induce pulsed secretion of LH (and to a lesser degree,
FSH) at roughly 1-2 hour intervals. The LH pulses are the consequence of pulsatile GnRH
secretion by the arcuate nucleus that, in turn, is the result of an oscillator or signal
generator in the central nervous system ("GnRH pulse generator"). [54] In the years
preceding physical puberty, Robert M. Boyar discovered that the gonadotropin pulses occur
only during sleep, but as puberty progresses they can be detected during the day. [55] By the
end of puberty, there is little day-night difference in the amplitude and frequency of
gonadotropin pulses.
Some investigators have attributed the onset of puberty to a resonance of oscillators in the
brain.[56][57][58][59] By this mechanism, the gonadotropin pulses that occur primarily at night
just before puberty represent beats.[60][61][62]
An array of "autoamplification processes" increases the production of all of the pubertal
hormones of the hypothalamus, pituitary, and gonads[citation needed].
Regulation of adrenarche and its relationship to maturation of the hypothalamic-gonadal
axis is not fully understood, and some evidence suggests it is a parallel but largely
independent process coincident with or even preceding central puberty. Rising levels of
adrenal androgens (termed adrenarche) can usually be detected between 6 and 11 years of
age, even before the increasing gonadotropin pulses of hypothalamic puberty. Adrenal
androgens contribute to the development of pubic hair (pubarche), adult body odor, and
other androgenic changes in both sexes. The primary clinical significance of the distinction
between adrenarche and gonadarche is that pubic hair and body odor changes by
themselves do not prove that central puberty is underway for an individual child.
Hormonal changes in boys
Early stages of male hypothalamic maturation seem to be very similar to the early stages of
female puberty, though occurring about 1–2 years later.
LH stimulates the Leydig cells of the testes to make testosterone and blood levels begin to
rise. For much of puberty, nighttime levels of testosterone are higher than daytime.
Regularity of frequency and amplitude of gonadotropin pulses seems to be less necessary
for progression of male than female puberty.
However, a significant portion of testosterone in adolescent boys is converted to estradiol.
Estradiol mediates the growth spurt, bone maturation, and epiphyseal closure in boys just
as in girls. Estradiol also induces at least modest development of breast tissue
(gynecomastia) in a large proportion of boys. Boys who develop mild gynecomastia, or even
swellings under the nipples, during puberty are told that the effects, which occur in some
male teenagers because of high levels of estradiol, are temporary.
Another hormonal change in males takes place during the teenage years for most young
men. At this point in a male's life, testosterone levels slowly rise, and most of the effects
are mediated through the androgen receptors by way of conversion to dihydrotestosterone in
target organs (especially that of the bowels).
Hormonal changes in girls
As the amplitude of LH pulses increases, the theca cells of the ovaries begin to produce
testosterone and smaller amounts of progesterone. Much of the testosterone moves into
nearby cells called granulosa cells. Smaller increases of FSH induce an increase in
the aromatase activity of these granulosa cells, which converts most of the testosterone to
estradiol for secretion into the circulation.
Rising levels of estradiol produce the characteristic estrogenic body changes of female
puberty: growth spurt, acceleration of bone maturation and closure, breast growth,
increased fat composition, growth of the uterus, increased thickness of
the endometrium and the vaginal mucosa, and widening of the lower pelvis.
As the estradiol levels gradually rise and the other autoamplification processes occur, a
point of maturation is reached when the feedback sensitivity of the hypothalamic
"gonadostat" becomes positive. This attainment of positive feedback is the hallmark of
female sexual maturity, as it allows the mid cycle LH surge necessary for ovulation.
Levels of adrenal androgens and testosterone also increase during puberty, producing the
typical androgenic changes of female puberty: pubic hair, other androgenic hair as outlined
above, body odor, acne.
Growth hormone levels rise steadily throughout puberty. IGF1 levels rise and then decline
as puberty ends. Growth finishes and adult height is attained as the estradiol levels
complete closure of the epiphyses.
• adrenarche (approximately age 7)
• gonadarche (approximately age 8)
• thelarche (approximately age 11 in females)
• pubarche (approximately age 12)
• menarche (approximately age 12.5 in females)
• spermarche (in males)

Speaker recognition
Speaker recognition is the computing task of validating a user's claimed identity
using characteristics extracted from their voice.
There is a difference between speaker recognition (recognizing who is speaking)
and speech recognition (recognizing what is being said). These two terms are frequently
confused, as is voice recognition. Voice recognition is a combination of the two: it uses
learned aspects of a speaker's voice to determine what is being said. Such a system cannot
recognise speech from random speakers very accurately, but it can reach high accuracy for
individual voices it has been trained with. In addition, there is a difference between the act
of authentication (commonly referred to as speaker verification or speaker
authentication) and identification.
Speaker recognition has a history dating back some four decades and uses the acoustic
features of speech that have been found to differ between individuals. These acoustic
patterns reflect both anatomy (e.g., size and shape of the throat and mouth) and learned
behavioral patterns (e.g., voice pitch, speaking style). Speaker verification has earned
speaker recognition its classification as a "behavioral biometric."
Verification versus identification
There are two major applications of speaker recognition technologies and methodologies. If
the speaker claims to be of a certain identity and the voice is used to verify this claim, this
is called verification or authentication. On the other hand, identification is the task of
determining an unknown speaker's identity. In a sense speaker verification is a 1:1 match
where one speaker's voice is matched to one template (also called a "voice print" or "voice
model") whereas speaker identification is a 1:N match where the voice is compared against
N templates.
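The difference between a 1:1 and a 1:N match can be sketched in a few lines. The example below is a minimal illustration, not a description of any particular product: it assumes each voice print is already a fixed-length feature vector, uses cosine similarity as the comparison score, and picks an arbitrary threshold of 0.8. The speaker names and vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Similarity score between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(utterance, claimed_template, threshold=0.8):
    """1:1 match: accept or reject a claimed identity."""
    return cosine_similarity(utterance, claimed_template) >= threshold

def identify(utterance, templates):
    """1:N match: return the enrolled name whose template scores highest."""
    return max(templates, key=lambda name: cosine_similarity(utterance, templates[name]))

# Hypothetical enrolled templates and one incoming utterance.
templates = {"alice": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}
sample = [0.85, 0.15, 0.35]

print(verify(sample, templates["alice"]))  # True
print(verify(sample, templates["bob"]))    # False
print(identify(sample, templates))         # alice
```

Verification makes a single comparison and returns a yes/no decision, whereas identification performs one comparison per enrolled template and returns the best candidate.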
From a security perspective, identification is different from verification. For example,
presenting your passport at border control is a verification process - the agent compares
your face to the picture in the document. Conversely, a police officer comparing a sketch of
an assailant against a database of previously documented criminals to find the closest
match(es) is an identification process.
Speaker verification is usually employed as a "gatekeeper" in order to provide access to a
secure system (e.g.: telephone banking). These systems operate with the user's knowledge
and typically require their cooperation. Speaker identification systems can also be
implemented covertly without the user's knowledge to identify talkers in a discussion, alert
automated systems of speaker changes, check if a user is already enrolled in a system, etc.
In forensic applications, it is common to first perform a speaker identification process to
create a list of "best matches" and then perform a series of verification processes to
determine a conclusive match.[citation needed]
Variants of speaker recognition
Each speaker recognition system has two phases: Enrollment and verification. During
enrollment, the speaker's voice is recorded and typically a number of features are extracted
to form a voice print, template, or model. In the verification phase, a speech sample or
"utterance" is compared against a previously created voice print. For identification systems,
the utterance is compared against multiple voice prints in order to determine the best
match(es) while verification systems compare an utterance against a single voice print.
Because of the process involved, verification is faster than identification.
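The two phases can be illustrated with a toy example. This sketch assumes that feature extraction has already reduced each recording to a short numeric vector, that enrollment simply averages those vectors into a voice print, and that verification accepts an utterance when its Euclidean distance to the voice print is below an arbitrary limit. Real systems use far richer models; all names and numbers here are hypothetical.

```python
def enroll(feature_vectors):
    """Enrollment: average several feature vectors into one voice print."""
    count = len(feature_vectors)
    dims = len(feature_vectors[0])
    return [sum(vec[i] for vec in feature_vectors) / count for i in range(dims)]

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def verify_utterance(utterance, voice_print, max_distance=1.0):
    """Verification: accept the utterance if it lies close to the voice print."""
    return euclidean_distance(utterance, voice_print) <= max_distance

# Three hypothetical enrollment recordings from the same speaker.
recordings = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]]
voice_print = enroll(recordings)

print(voice_print)                                 # [2.0, 3.0]
print(verify_utterance([2.0, 3.5], voice_print))   # True
print(verify_utterance([6.0, 7.0], voice_print))   # False
```

An identification system would repeat the distance computation for every stored voice print, which is why it generally takes longer than a single verification.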
Speaker recognition systems fall into two categories: text-dependent and text-independent.
If the text must be the same for enrollment and verification this is called text-dependent
recognition. In a text-dependent system, prompts can either be common across all speakers
(e.g.: a common pass phrase) or unique. In addition, the use of shared-secrets (e.g.:
passwords and PINs) or knowledge-based information can be employed in order to create a
multi-factor authentication scenario.
Text-independent systems are most often used for speaker identification as they require
very little if any cooperation by the speaker. In this case the text during enrollment and test
is different. In fact, the enrollment may happen without the user's knowledge, as in the
case for many forensic applications. As text-independent technologies do not compare what
was said at enrollment and verification, verification applications tend to also employ speech
recognition to determine what the user is saying at the point of authentication.
The various technologies used to process and store voice prints include frequency
estimation, hidden Markov models, Gaussian mixture models, pattern
matching algorithms, neural networks, matrix representation, vector quantization
and decision trees. Some systems also use "anti-speaker" techniques, such as cohort
models, and world models.
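As one illustration of the techniques listed above, a vector quantization approach can represent each speaker by a small codebook of representative feature vectors and score an utterance by its average distortion against that codebook. The sketch below assumes the codebooks have already been trained (for example by clustering enrollment frames); the codebooks, frames, and speaker labels are invented for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_distortion(frames, codebook):
    """Mean squared distance from each frame to its nearest codeword."""
    total = sum(min(squared_distance(frame, code) for code in codebook)
                for frame in frames)
    return total / len(frames)

# Hypothetical pre-trained codebooks, one per enrolled speaker.
codebooks = {
    "speaker_a": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_b": [[5.0, 5.0], [6.0, 6.0]],
}

# Feature frames extracted from an unknown utterance (made-up values).
frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

# The speaker whose codebook gives the lowest distortion is the best match.
best = min(codebooks, key=lambda name: average_distortion(frames, codebooks[name]))
print(best)  # speaker_a
```

A lower distortion means the utterance's frames fall closer to that speaker's codewords. A cohort or world model could be added as a further codebook whose distortion serves as a reference score.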
Ambient noise levels can impede both collection of the initial and subsequent voice samples.
Noise reduction algorithms can be employed to improve accuracy, but incorrect application
can have the opposite effect. Performance degradation can result from changes in
behavioural attributes of the voice and from enrolment using one telephone and verification
on another telephone ("cross channel"). Integration with two-factor authentication products
is expected to increase. Voice changes due to ageing may impact system performance over
time. Some systems adapt the speaker models after each successful verification to capture
such long-term changes in the voice, though there is debate regarding the overall security
impact imposed by automated adaptation.
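One simple way to picture such adaptation is an exponential moving average that nudges the stored template toward each newly verified utterance. This is an assumed technique chosen for illustration, not one named by the sources above; the function name and the adaptation rate are hypothetical.

```python
def adapt_template(template, new_features, rate=0.1):
    """Blend a newly verified utterance into the stored template.

    rate controls how quickly the template follows the speaker's voice:
    0.0 never changes the template, 1.0 replaces it outright.
    """
    return [(1 - rate) * old + rate * new
            for old, new in zip(template, new_features)]

# Hypothetical stored template and a newly accepted utterance.
template = [1.0, 2.0]
accepted_utterance = [2.0, 4.0]

template = adapt_template(template, accepted_utterance, rate=0.5)
print(template)  # [1.5, 3.0]
```

The security concern mentioned above is visible in the sketch: if an impostor's utterance is ever wrongly accepted, the update moves the template toward the impostor, so a small rate limits how much any single verification can change the model.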
Capture of the biometric is seen as non-invasive. The technology traditionally uses existing
microphones and voice transmission technology allowing recognition over long distances via
ordinary telephones (wired or wireless).
Digitally recorded audio voice identification and analogue recorded voice identification use
electronic measurements as well as critical listening skills that must be applied by a forensic
expert in order for the identification to be accurate.
Speaker recognition
Speaker recognition is the computing task of validating a user's claimed identity using
characteristics extracted from their voices.
There is a difference between speaker recognition (recognizing who is speaking)
and speech recognition (recognizing what is being said). These two terms are frequently
confused, as isvoice recognition. Voice recognition is combination of the two where it uses
learned aspects of a speakers voice to determine what is being said - such a system cannot
recognise speech from random speakers very accurately, but it can reach high accuracy for
individual voices it has been trained with. In addition, there is a difference between the act
of authentication (commonly referred to as speaker verification or speaker
authentication) and identification.
Speaker recognition has a history dating back some four decades and uses the acoustic
features of speech that have been found to differ between individuals. These acoustic
patterns reflect both anatomy (e.g., size and shape of the throat and mouth) and learned
behavioral patterns (e.g., voice pitch, speaking style). Speaker verification has earned
speaker recognition its classification as a "behavioral biometric."
Verification versus identification
There are two major applications of speaker recognition technologies and methodologies. If
the speaker claims to be of a certain identity and the voice is used to verify this claim, this
is called verification or authentication. On the other hand, identification is the task of
determining an unknown speaker's identity. In a sense speaker verification is a 1:1 match
where one speaker's voice is matched to one template (also called a "voice print" or "voice
model") whereas speaker identification is a 1:N match where the voice is compared against
N templates.
From a security perspective, identification is different from verification. For example,
presenting your passport at border control is a verification process - the agent compares
your face to the picture in the document. Conversely, a police officer comparing a sketch of
an assailant against a database of previously documented criminals to find the closest
match(es) is an identification process.
Speaker verification is usually employed as a "gatekeeper" in order to provide access to a
secure system (e.g.: telephone banking). These systems operate with the user's knowledge
and typically require their cooperation. Speaker identification systems can also be
implemented covertly without the user's knowledge to identify talkers in a discussion, alert
automated systems of speaker changes, check if a user is already enrolled in a system, etc.
In forensic applications, it is common to first perform a speaker identification process to
create a list of "best matches" and then perform a series of verification processes to
determine a conclusive match.[citation needed]
Variants of speaker recognition
Each speaker recognition system has two phases: enrollment and verification. During
enrollment, the speaker's voice is recorded and typically a number of features are extracted
to form a voice print, template, or model. In the verification phase, a speech sample or
"utterance" is compared against a previously created voice print. For identification systems,
the utterance is compared against multiple voice prints in order to determine the best
match(es) while verification systems compare an utterance against a single voice print.
Because of the process involved, verification is faster than identification.
Speaker recognition systems fall into two categories: text-dependent and text-independent.
If the text must be the same for enrollment and verification this is called text-dependent
recognition. In a text-dependent system, prompts can either be common across all speakers
(e.g.: a common pass phrase) or unique. In addition, the use of shared-secrets (e.g.:
passwords and PINs) or knowledge-based information can be employed in order to create a
multi-factor authentication scenario.
Text-independent systems are most often used for speaker identification as they require
very little if any cooperation by the speaker. In this case the text during enrollment and test
is different. In fact, the enrollment may happen without the user's knowledge, as in the
case for many forensic applications. As text-independent technologies do not compare what
was said at enrollment and verification, verification applications tend to also employ speech
recognition to determine what the user is saying at the point of authentication.
The various technologies used to process and store voice prints include frequency
estimation, hidden Markov models, Gaussian mixture models, pattern
matching algorithms, neural networks, matrix representation, vector quantization
and decision trees. Some systems also use "anti-speaker" techniques, such as cohort
models, and world models.
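One common way to use a world model is to score an utterance by how much better the claimed speaker's model explains it than a general model of all speakers. The sketch below assumes single diagonal Gaussians in place of full mixture models, and the model parameters and the name llr_score are invented for illustration.

```python
import math

def log_likelihood(frames, mean, var):
    """Average log-likelihood of feature frames under a diagonal Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def llr_score(frames, speaker_model, world_model):
    """Log-likelihood ratio: positive values favour the claimed speaker."""
    return log_likelihood(frames, *speaker_model) - log_likelihood(frames, *world_model)

# Hypothetical models, each given as (mean vector, variance vector).
speaker_model = ([1.0, 2.0], [0.5, 0.5])
world_model = ([0.0, 0.0], [2.0, 2.0])

genuine = [[1.1, 1.9], [0.9, 2.2]]     # frames close to the speaker model
impostor = [[0.1, -0.2], [-0.3, 0.1]]  # frames close to the world model

print(llr_score(genuine, speaker_model, world_model) > 0)
print(llr_score(impostor, speaker_model, world_model) > 0)
```

Subtracting the world-model score makes the decision threshold less sensitive to factors that raise or lower all scores together, such as channel noise.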
Ambient noise levels can impede collection of both the initial and subsequent voice samples.
Noise reduction algorithms can be employed to improve accuracy, but incorrect application
can have the opposite effect. Performance degradation can result from changes in
behavioural attributes of the voice and from enrolment using one telephone and verification
on another telephone ("cross channel"). Integration with two-factor authentication products
is expected to increase. Voice changes due to ageing may impact system performance over
time. Some systems adapt the speaker models after each successful verification to capture
such long-term changes in the voice, though there is debate regarding the overall security
impact imposed by automated adaptation.
Capture of the biometric is seen as non-invasive. The technology traditionally uses existing
microphones and voice transmission technology allowing recognition over long distances via
ordinary telephones (wired or wireless).
Digitally recorded audio voice identification and analogue recorded voice identification use
electronic measurements as well as critical listening skills that must be applied by a forensic
expert in order for the identification to be accurate.
Speech synthesis
Speech synthesis is the artificial production of human speech. A computer system used for
this purpose is called a speech synthesizer, and can be implemented
in software or hardware. A text-to-speech (TTS) system converts normal language text
into speech; other systems render symbolic linguistic representations like phonetic
transcriptions into speech.
Synthesized speech can be created by concatenating pieces of recorded speech that are
stored in a database. Systems differ in the size of the stored speech units; a system that
stores phones or diphones provides the largest output range, but may lack clarity. For
specific usage domains, the storage of entire words or sentences allows for high-quality
output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other
human voice characteristics to create a completely "synthetic" voice output.
The quality of a speech synthesizer is judged by its similarity to the human voice and by its
ability to be understood. An intelligible text-to-speech program allows people with visual
impairments or reading disabilities to listen to written works on a home computer. Many
computer operating systems have included speech synthesizers since the early 1980s.
Overview of text processing

Overview of a typical TTS system

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-
end. The front-end has two major tasks. First, it converts raw text containing symbols like
numbers and abbreviations into the equivalent of written-out words. This process is often
called text normalization, pre-processing, or tokenization. The front-end then
assigns phonetic transcriptions to each word, and divides and marks the text into prosodic
units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions
to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic
transcriptions and prosody information together make up the symbolic linguistic
representation that is output by the front-end. The back-end—often referred to as
the synthesizer—then converts the symbolic linguistic representation into sound. In certain
systems, this part includes the computation of the target prosody (pitch contour, phoneme
durations), which is then imposed on the output speech.
Long before electronic signal processing was invented, there were those who tried to build
machines to create human speech. Some early legends of the existence of "speaking
heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger
Bacon (1214–1294).
In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of
Sciences, built models of the human vocal tract that could produce the five
long vowel sounds (in International Phonetic Alphabet notation, they
are [aː], [eː], [iː], [oː] and [uː]). This was followed by the bellows-operated "acoustic-
mechanical speech machine" by Wolfgang von Kempelen of Vienna, Austria, described in a
1791 paper. This machine added models of the tongue and lips, enabling it to
produce consonants as well as vowels. In 1837, Charles Wheatstone produced a "speaking
machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia".
Wheatstone's design was resurrected in 1923 by Paget.
In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech
analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this
device into the VODER, which he exhibited at the 1939 New York World's Fair.
The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins
Laboratories in the late 1940s and completed in 1950. There were several different versions
of this hardware device but only one currently survives. The machine converts pictures of
the acoustic patterns of speech in the form of a spectrogram back into sound. Using this
device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception
of phonetic segments (consonants and vowels).
Dominant systems in the 1980s and 1990s were the MITalk system, based largely on the
work of Dennis Klatt at MIT, and the Bell Labs system; the latter was one of the first
multilingual language-independent systems, making extensive use of Natural Language
Processing methods.
Early electronic speech synthesizers sounded robotic and were often barely intelligible. The
quality of synthesized speech has steadily improved, but output from contemporary speech
synthesis systems is still clearly distinguishable from actual human speech.
As improving cost-performance ratios make speech synthesizers cheaper and more
accessible, more people will benefit from the use of text-to-speech programs.
Electronic devices
The first computer-based speech synthesis systems were created in the late 1950s, and the
first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry
Kelly, Jr and colleague Louis Gerstman used an IBM 704 computer to synthesize speech, an
event among the most prominent in the history of Bell Labs. Kelly's voice recorder
synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment
from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague
John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the
demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A
Space Odyssey, where the HAL 9000 computer sings the same song as it is being put to
sleep by astronaut Dave Bowman. Despite the success of purely electronic speech
synthesis, research is still being conducted into mechanical speech synthesizers.
Synthesizer technologies
The most important qualities of a speech synthesis system are naturalness and intelligibility.
Naturalness describes how closely the output sounds like human speech, while intelligibility
is the ease with which the output is understood. The ideal speech synthesizer is both natural
and intelligible. Speech synthesis systems usually try to maximize both characteristics.
The two primary technologies for generating synthetic speech waveforms are concatenative
synthesis and formant synthesis. Each technology has strengths and weaknesses, and the
intended uses of a synthesis system will typically determine which approach is used.
Concatenative synthesis
Concatenative synthesis is based on the concatenation (or stringing together) of segments
of recorded speech. Generally, concatenative synthesis produces the most natural-sounding
synthesized speech. However, differences between natural variations in speech and the
nature of the automated techniques for segmenting the waveforms sometimes result in
audible glitches in the output. There are three main sub-types of concatenative synthesis.
Unit selection synthesis
Unit selection synthesis uses large databases of recorded speech. During database creation,
each recorded utterance is segmented into some or all of the following:
individual phones, diphones, half-phones, syllables, morphemes, words, phrases,
and sentences. Typically, the division into segments is done using a specially
modified speech recognizer set to a "forced alignment" mode with some manual correction
afterward, using visual representations such as the waveform and spectrogram. An index of
the units in the speech database is then created based on the segmentation and acoustic
parameters like the fundamental frequency (pitch), duration, position in the syllable, and
neighboring phones. At runtime, the desired target utterance is created by determining the
best chain of candidate units from the database (unit selection). This process is typically
achieved using a specially weighted decision tree.
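The search for the "best chain of candidate units" can be illustrated with a small dynamic-programming sketch. Note that this is a different formulation from the weighted decision tree mentioned above: it assumes each unit has a target cost (mismatch with the desired specification) and a join cost (mismatch with its neighbour), and minimizes their sum. The pitch-only costs and the unit names are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return the lowest-cost chain with one candidate per target position."""
    # best[i][j] = (cost of the best chain ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cur in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, cur), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], cur), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    return chain[::-1]

# Toy data: each unit is (name, pitch in Hz); targets are desired pitches.
def pitch_target_cost(target, unit):
    return abs(target - unit[1])

def pitch_join_cost(prev, cur):
    return abs(prev[1] - cur[1])

targets = [100, 110]
candidates = [
    [("a1", 100), ("a2", 140)],
    [("b1", 150), ("b2", 112)],
]
chain = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
print([name for name, _ in chain])
```

Real systems combine many features in both costs; pitch alone is used here only to keep the example readable.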
Unit selection provides the greatest naturalness, because it applies only a small amount
of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech
sound less natural, although some systems use a small amount of signal processing at the
point of concatenation to smooth the waveform. The output from the best unit-selection
systems is often indistinguishable from real human voices, especially in contexts for which
the TTS system has been tuned. However, maximum naturalness typically requires
unit-selection speech databases to be very large, in some systems ranging into the gigabytes of
recorded data, representing dozens of hours of speech. Also, unit selection algorithms have
been known to select segments from a place that results in less than ideal synthesis (e.g.
minor words become unclear) even when a better choice exists in the database.
Diphone synthesis
Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-
sound transitions) occurring in a language. The number of diphones depends on
the phonotactics of the language: for example, Spanish has about 800 diphones, and
German about 2500. In diphone synthesis, only one example of each diphone is contained
in the speech database. At runtime, the target prosody of a sentence is superimposed on
these minimal units by means of digital signal processing techniques such as linear
predictive coding, PSOLA or MBROLA. The quality of the resulting speech is generally worse
than that of unit-selection systems, but more natural-sounding than the output of formant
synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis
and the robotic-sounding nature of formant synthesis, and has few of the advantages of
either approach other than small size. As such, its use in commercial applications is
declining, although it continues to be used in research because there are a number of freely
available software implementations.
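As a rough illustration of what "sound-to-sound transitions" means, the following sketch splits a phone sequence into the diphone names a synthesizer would look up. The "_" silence symbol and the "a-b" naming scheme are assumptions made for the example.

```python
def to_diphones(phones, silence="_"):
    """Split a phone sequence into overlapping phone-to-phone transitions."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["h", "e", "l", "o"]))
```

An utterance of n phones needs n + 1 diphones, including the transitions from and to silence.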
Domain-specific synthesis
Domain-specific synthesis concatenates prerecorded words and phrases to create complete
utterances. It is used in applications where the variety of texts the system will output is
limited to a particular domain, like transit schedule announcements or weather reports.
The technology is very simple to implement, and has been in commercial use for a long
time, in devices like talking clocks and calculators. The level of naturalness of these systems
can be very high because the variety of sentence types is limited, and they closely match
the prosody and intonation of the original recordings. [citation needed]
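A talking clock shows how little logic such a system needs. The sketch below only chooses which prerecorded clips to play, in order. The clip names ("the_time_is", "num_5", "oclock" and so on) are hypothetical, and it assumes a recording exists for every number from 1 to 59.

```python
def clock_phrases(hour, minute):
    """Return the keys of the recordings to play for a 24-hour time."""
    period = "am" if hour < 12 else "pm"
    hour12 = hour % 12 or 12
    keys = ["the_time_is", f"num_{hour12}"]
    if minute == 0:
        keys.append("oclock")
    elif minute < 10:
        keys += ["oh", f"num_{minute}"]
    else:
        keys.append(f"num_{minute}")
    keys.append(period)
    return keys

print(clock_phrases(14, 5))
```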
Because these systems are limited by the words and phrases in their databases, they are
not general-purpose and can only synthesize the combinations of words and phrases with
which they have been preprogrammed. The blending of words within naturally spoken
language however can still cause problems unless the many variations are taken into
account. For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈkliːə/ is
usually only pronounced when the following word has a vowel as its first letter (e.g. "clear
out" is realized as /ˌkliːəɹˈɑʊt/). Likewise in French, many final consonants are no longer
silent if followed by a word that begins with a vowel, an effect called liaison.
This alternation cannot be reproduced by a simple word-concatenation system, which would
require additional complexity to be context-sensitive.
Formant synthesis
Formant synthesis does not use human speech samples at runtime. Instead, the
synthesized speech output is created using additive synthesis and an acoustic model
(physical modelling synthesis)[20]. Parameters such as fundamental frequency, voicing,
and noise levels are varied over time to create a waveform of artificial speech. This method
is sometimes called rules-based synthesis; however, many concatenative systems also have
rules-based components. Many systems based on formant synthesis technology generate
artificial, robotic-sounding speech that would never be mistaken for human speech.
However, maximum naturalness is not always the goal of a speech synthesis system, and
formant synthesis systems have advantages over concatenative systems. Formant-
synthesized speech can be reliably intelligible, even at very high speeds, avoiding the
acoustic glitches that commonly plague concatenative systems. High-speed synthesized
speech is used by the visually impaired to quickly navigate computers using a screen
reader. Formant synthesizers are usually smaller programs than concatenative systems
because they do not have a database of speech samples. They can therefore be used
in embedded systems, where memory and microprocessor power are especially limited.
Because formant-based systems have complete control of all aspects of the output speech,
a wide variety of prosodies and intonations can be output, conveying not just questions and
statements, but a variety of emotions and tones of voice.
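A very reduced sketch of the additive idea: harmonics of the fundamental frequency are summed, each weighted by how close it lies to a formant. The resonance shape, the 100 Hz bandwidth and the formant values used for the vowel are rough illustrative assumptions, not parameters of any real synthesizer.

```python
import math

def formant_gain(freq, formants, bandwidth=100.0):
    """Sum of simple resonance peaks centred on each formant frequency."""
    return sum(1.0 / (1.0 + ((freq - f) / bandwidth) ** 2) for f in formants)

def synthesize_vowel(f0, formants, duration=0.1, sample_rate=16000):
    """Additive synthesis: harmonics of f0 weighted by the formant envelope."""
    n_harmonics = int((sample_rate / 2) // f0)  # stay below the Nyquist frequency
    gains = [formant_gain(k * f0, formants) for k in range(1, n_harmonics + 1)]
    scale = sum(gains)  # normalizes the output to the range [-1, 1]
    samples = []
    for n in range(round(duration * sample_rate)):
        t = n / sample_rate
        value = sum(g * math.sin(2 * math.pi * k * f0 * t)
                    for k, g in enumerate(gains, start=1))
        samples.append(value / scale)
    return samples

# Roughly "ah"-like formants (illustrative values only).
samples = synthesize_vowel(120.0, [700.0, 1200.0, 2600.0])
print(len(samples))
```

Changing f0 alters the pitch while the formants stay fixed, which is why this family of methods gives direct control over intonation.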
Examples of non-real-time but highly accurate intonation control in formant synthesis
include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in
the early 1980s for Sega arcade machines[21] and many Atari, Inc. arcade games[22] using
the TMS5220 LPC chips. Creating proper intonation for these projects was painstaking, and
the results have yet to be matched by real-time text-to-speech interfaces.[23]
Articulatory synthesis
Articulatory synthesis refers to computational techniques for synthesizing speech based on
models of the human vocal tract and the articulation processes occurring there. The first
articulatory synthesizer regularly used for laboratory experiments was developed at Haskins
Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This
synthesizer, known as ASY, was based on vocal tract models developed at Bell
Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.
Until recently, articulatory synthesis models have not been incorporated into commercial
speech synthesis systems. A notable exception is the NeXT-based system originally
developed and marketed by Trillium Sound Research, a spin-off company of the University
of Calgary, where much of the original research was conducted. Following the demise of the
various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with
Apple Computer in 1997), the Trillium software was published under the GNU General Public
License, with work continuing as gnuspeech. The system, first marketed in 1994, provides
full articulatory-based text-to-speech conversion using a waveguide or transmission-line
analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".
HMM-based synthesis
HMM-based synthesis is a synthesis method based on hidden Markov models, also called
Statistical Parametric Synthesis. In this system, the frequency spectrum (vocal
tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled
simultaneously by HMMs. Speech waveforms are generated from HMMs themselves based
on the maximum likelihood criterion.[24]
Sinewave synthesis
Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main
bands of energy) with pure tone whistles.[25]
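The idea can be sketched as a sum of a few sinusoids whose frequencies follow the formant tracks frame by frame. The track format (one list of frequency and amplitude pairs per 10 ms frame) and the example values are assumptions made for illustration.

```python
import math

def sinewave_synthesis(formant_tracks, frame_duration=0.01, sample_rate=16000):
    """formant_tracks: one list of (frequency_hz, amplitude) pairs per frame."""
    samples_per_frame = round(frame_duration * sample_rate)
    phases = [0.0] * len(formant_tracks[0])
    samples = []
    for frame in formant_tracks:
        for _ in range(samples_per_frame):
            value = 0.0
            for i, (freq, amp) in enumerate(frame):
                # Accumulating phase keeps each tone continuous across frames.
                phases[i] += 2 * math.pi * freq / sample_rate
                value += amp * math.sin(phases[i])
            samples.append(value)
    return samples

# Ten identical frames of three tones standing in for three formants.
tracks = [[(700.0, 0.5), (1200.0, 0.3), (2600.0, 0.2)]] * 10
tones = sinewave_synthesis(tracks)
print(len(tones))
```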
Text normalization challenges
The process of normalizing text is rarely straightforward. Texts are full
of heteronyms, numbers, and abbreviations that all require expansion into a phonetic
representation. There are many spellings in English which are pronounced differently based
on context. For example, "My latest project is to learn how to better project my voice"
contains two pronunciations of "project".
Most text-to-speech (TTS) systems do not generate semantic representations of their input
texts, as processes for doing so are not reliable, well understood, or computationally
effective. As a result, various heuristic techniques are used to guess the proper way to
disambiguate homographs, like examining neighboring words and using statistics about
frequency of occurrence.
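A neighbouring-word heuristic of this kind might look like the sketch below. The rule (treat the word as a verb after "to" or before a determiner), the word list and the informal respellings are all invented for the example, and a rule this simple will misfire on many real sentences.

```python
DETERMINERS = {"a", "an", "the", "my", "your", "his", "her", "its", "our", "their"}

# Hypothetical pronunciations written as informal respellings.
PRONUNCIATIONS = {"project": {"noun": "PROJ-ekt", "verb": "pruh-JEKT"}}

def pronounce_homographs(sentence):
    """Replace known homographs with a pronunciation chosen from context."""
    words = sentence.lower().replace(".", "").split()
    result = []
    for i, word in enumerate(words):
        if word in PRONUNCIATIONS:
            prev_word = words[i - 1] if i > 0 else ""
            next_word = words[i + 1] if i + 1 < len(words) else ""
            is_verb = prev_word == "to" or next_word in DETERMINERS
            result.append(PRONUNCIATIONS[word]["verb" if is_verb else "noun"])
        else:
            result.append(word)
    return result

spoken = pronounce_homographs(
    "My latest project is to learn how to better project my voice"
)
print(spoken)
```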
Recently TTS systems have begun to use HMMs (discussed above) to generate "parts of
speech" to aid in disambiguating homographs. This technique is quite successful for many
cases such as whether "read" should be pronounced as "red" implying past tense, or as
"reed" implying present tense. Typical error rates when using HMMs in this fashion are
usually below five percent. These techniques also work well for most European languages,
although access to required training corpora is frequently difficult in these languages.
Deciding how to convert numbers is another problem that TTS systems have to address. It
is a simple programming challenge to convert a number into words (at least in English), like
"1325" becoming "one thousand three hundred twenty-five." However, numbers occur in
many different contexts; "1325" may also be read as "one three two five", "thirteen twenty-
five" or "thirteen hundred and twenty five". A TTS system can often infer how to expand a
number based on surrounding words, numbers, and punctuation, and sometimes the
system provides a way to specify the context if it is ambiguous.[26] Roman numerals can also
be read differently depending on context. For example "Henry VIII" reads as "Henry the
Eighth", while "Chapter VIII" reads as "Chapter Eight".
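The readings of "1325" discussed above can be produced by three small functions. This is a minimal sketch for English that handles only 0 to 9999; the function names are illustrative, and choosing between the readings from context is left out.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n):
    """Words for 0..9999 read as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)

def digit_by_digit(text):
    """Read each digit separately, as for a PIN or phone number."""
    return " ".join(ONES[int(d)] for d in text)

def in_pairs(n):
    """Read a four-digit number as two pairs (sensible when the last pair is 10 or more)."""
    high, low = divmod(n, 100)
    return two_digits(high) + " " + two_digits(low)

print(cardinal(1325))
print(digit_by_digit("1325"))
print(in_pairs(1325))
```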
Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches"
must be differentiated from the word "in", and the address "12 St John St." uses the same
abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make
educated guesses about ambiguous abbreviations, while others provide the same result in
all cases, resulting in nonsensical (and sometimes comical) outputs.
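An "educated guess" for the "St" example could be as simple as looking at the following word. The rule below (expand to "Saint" before a capitalized name, otherwise "Street") is an assumption chosen for illustration and is easy to fool.

```python
def expand_st(text):
    """Expand 'St' or 'St.' as 'Saint' before a capitalized name, else 'Street'."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok.rstrip(".") == "St":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            is_name = nxt[:1].isupper() and nxt.rstrip(".") != "St"
            out.append("Saint" if is_name else "Street")
        else:
            out.append(tok)
    return " ".join(out)

print(expand_st("12 St John St."))
```

An address such as "Main St Apartments" would be expanded wrongly by this rule, which shows why real front ends need richer context.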
Text-to-phoneme challenges
Speech synthesis systems use two basic approaches to determine the pronunciation of a
word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-
phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds
in a language). The simplest approach to text-to-phoneme conversion is the dictionary-
based approach, where a large dictionary containing all the words of a language and their
correct pronunciations is stored by the program. Determining the correct pronunciation of
each word is a matter of looking up each word in the dictionary and replacing the spelling
with the pronunciation specified in the dictionary. The other approach is rule-based, in
which pronunciation rules are applied to words to determine their pronunciations based on
their spellings. This is similar to the "sounding out", or synthetic phonics, approach to
learning reading.
Each approach has advantages and drawbacks. The dictionary-based approach is quick and
accurate, but completely fails if it is given a word which is not in its dictionary.[citation needed] As
dictionary size grows, so too do the memory space requirements of the synthesis system.
On the other hand, the rule-based approach works on any input, but the complexity of the
rules grows substantially as the system takes into account irregular spellings or
pronunciations. (Consider that the word "of" is very common in English, yet is the only word
in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems
use a combination of these approaches.
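A combined approach can be sketched as a dictionary lookup with a rule-based fallback. The two-word lexicon, the toy letter-to-sound tables and the informal phoneme symbols below are all hypothetical. Real systems use far larger dictionaries and rule sets.

```python
# Hypothetical mini-lexicon of exceptions, using informal phoneme symbols.
LEXICON = {"of": ["ah", "v"], "the": ["dh", "ah"]}

# Toy letter-to-sound rules: two-letter spellings first, then single letters.
DIGRAPHS = {"sh": "sh", "ch": "ch", "th": "th", "ph": "f"}
LETTERS = {"a": "ae", "e": "eh", "i": "ih", "o": "aa", "u": "ah", "c": "k", "x": "ks"}

def letter_to_sound(word):
    """Rule-based fallback: works on any input, but only approximately."""
    phones, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            phones.append(LETTERS.get(word[i], word[i]))
            i += 1
    return phones

def to_phonemes(word):
    """Dictionary lookup first, rules only for words the dictionary lacks."""
    word = word.lower()
    return LEXICON.get(word) or letter_to_sound(word)

print(to_phonemes("of"))       # found in the lexicon
print(letter_to_sound("of"))   # what the rules alone would produce
print(to_phonemes("fish"))     # not in the lexicon, handled by rules
```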
Languages with a phonemic orthography have a very regular writing system, and the
prediction of the pronunciation of words based on their spellings is quite successful. Speech
synthesis systems for such languages often use the rule-based method extensively,
resorting to dictionaries only for those few words, like foreign names and borrowings, whose
pronunciations are not obvious from their spellings. On the other hand, speech synthesis
systems for languages like English, which have extremely irregular spelling systems, are
more likely to rely on dictionaries, and to use rule-based methods only for unusual words,
or words that aren't in their dictionaries.
Evaluation challenges
The consistent evaluation of speech synthesis systems may be difficult because of a lack of
universally agreed objective evaluation criteria. Different organizations often use different
speech data. The quality of speech synthesis systems also depends to a large degree on the
quality of the production technique (which may involve analogue or digital recording) and on
the facilities used to replay the speech. Evaluating speech synthesis systems has therefore
often been compromised by differences between production techniques and replay facilities.
Recently, however, some researchers have started to evaluate speech synthesis systems
using a common speech dataset.[27]
Prosodics and emotional content
A recent study by Amy Drahota and colleagues at the University of Portsmouth, UK, published in
the journal Speech Communication, reported that listeners to voice recordings
could determine, at better than chance levels, whether or not the speaker was smiling.[28] It
was suggested that identification of the vocal features which signal emotional content may
be used to help make synthesized speech sound more natural.
Dedicated hardware
 Votrax
   SC-01A (analog formant)
   SC-02 / SSI-263 / "Arctic 263"
 General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000)
 Magnevation SpeakJet (TTS256)
 Savage Innovations SoundGin
 National Semiconductor DT1050 Digitalker (Mozer)
 Silicon Systems SSI 263 (analog formant)
 Texas Instruments LPC Speech Chips
   TMS5110A
   TMS5200
 Oki Semiconductor
   ML22825 (ADPCM)
   ML22573 (HQADPCM)
 Toshiba T6721A
 Philips PCF8200
 TextSpeak Embedded TTS Modules
Computer operating systems or outlets with speech synthesis
Arguably, the first speech system integrated into an operating system was the
1400XL/1450XL personal computers designed by Atari, Inc. using the Votrax SC01 chip in
1983. The 1400XL/1450XL computers used a Finite State Machine to enable World English
Spelling text-to-speech synthesis[29]. Unfortunately, the 1400XL/1450XL personal computers
never shipped in quantity.
The Atari ST computers were sold with "stspeech.tos" on floppy disk.
The first speech system integrated into an operating system that shipped in quantity
was Apple Computer's MacinTalk in 1984. Since the 1980s Macintosh computers have offered
text-to-speech capabilities through the MacinTalk software. In the early 1990s Apple
expanded its capabilities, offering system-wide text-to-speech support. With the introduction
of faster PowerPC-based computers they included higher quality voice sampling. Apple also
introduced speech recognition into its systems which provided a fluid command set. More
recently, Apple has added sample-based voices. Starting as a curiosity, the speech system
of Apple Macintosh has evolved into a fully-supported program, PlainTalk, for people with
vision problems. VoiceOver was first featured in Mac OS X Tiger (10.4). During
10.4 (Tiger) and the first releases of 10.5 (Leopard) there was only one standard voice shipping
with Mac OS X. Starting with 10.6 (Snow Leopard), the user can choose from a wide range
of voices. VoiceOver voices feature the taking of realistic-sounding breaths
between sentences, as well as improved clarity at high read rates over PlainTalk. Mac OS X
also includes say, a command-line application that converts text to audible speech.
The AppleScript Standard Additions include a say verb that allows a script to use any of the
installed voices and to control the pitch, speaking rate and modulation of the spoken text.
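The say command can also be driven from a script. The sketch below builds the command line using the -v (voice) and -r (rate in words per minute) options and runs it only when say is actually available. The helper names are hypothetical, and the voice name "Alex" is just an example that may not be installed on a given machine.

```python
import shutil
import subprocess

def build_say_command(text, voice=None, rate_wpm=None):
    """Build the argument list for the macOS `say` command."""
    command = ["say"]
    if voice is not None:
        command += ["-v", voice]
    if rate_wpm is not None:
        command += ["-r", str(rate_wpm)]
    command.append(text)
    return command

def speak(text, **options):
    """Speak the text if `say` is on the PATH; return whether it was spoken."""
    if shutil.which("say") is None:
        return False
    subprocess.run(build_say_command(text, **options), check=True)
    return True

print(build_say_command("Hello", voice="Alex", rate_wpm=180))
```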
The second operating system with advanced speech synthesis capabilities was AmigaOS,
introduced in 1985. The voice synthesis was licensed by Commodore International from a
third-party software house (Don't Ask Software, now Softvoice, Inc.) and it featured a
complete system of voice emulation, with both male and female voices and "stress"
indicator markers, made possible by advanced features of the Amiga hardware
audio chipset.[30] It was divided into a narrator device and a translator library. Amiga Speak
Handler featured a text-to-speech translator. AmigaOS considered speech synthesis a
virtual hardware device, so the user could even redirect console output to it. Some Amiga
programs, such as word processors, made extensive use of the speech system.
Microsoft Windows
Modern Windows systems use SAPI4- and SAPI5-based speech systems that include
a speech recognition engine (SRE). SAPI 4.0 was available on Microsoft-based operating
systems as a third-party add-on for systems like Windows 95 and Windows 98. Windows
2000 added a speech synthesis program called Narrator, directly available to users. All
Windows-compatible programs could make use of speech synthesis features, available
through menus once installed on the system. Microsoft Speech Server is a complete
package for voice synthesis and recognition, for commercial applications such as call
centers.
Text-to-speech (TTS) capabilities for a computer refer to the ability of the operating
system to play back printed text as spoken words.
An internal driver (installed with the operating system), called a TTS engine, recognizes
the text and speaks it using a synthesized voice chosen from several pre-generated voices.
Additional engines, often with a specialized jargon or vocabulary, are also
available through third-party manufacturers.[31]
Version 1.6 of Android added support for speech synthesis (TTS).[32]
The most recent TTS development in the web browser is the JavaScript Text to
Speech work of Yury Delendik, which ports the Flite C engine to pure JavaScript. This allows
web pages to convert text to audio using HTML5 technology. The ability to use Yury's TTS
port currently requires a custom browser build that uses Mozilla's Audio-Data-API. However,
much work is being done in the context of the W3C to move this technology into the
mainstream browser market through the W3C Audio Incubator Group with the involvement
of The BBC and Google Inc.
Currently, there are a number of applications, plugins and gadgets that can read messages
directly from an e-mail client and web pages from a web browser or Google Toolbar, such
as Text-to-voice, which is an add-on to Firefox. Some specialized software can narrate
RSS-feeds. On one hand, online RSS-narrators simplify information delivery by allowing users to
listen to their favourite news sources and to convert them to podcasts. On the other hand,
on-line RSS-readers are available on almost any PC connected to the Internet. Users can
download generated audio files to portable devices, e.g. with the help of a podcast receiver, and
listen to them while walking, jogging or commuting to work.
A growing field in Internet-based TTS is web-based assistive technology, e.g. 'Browsealoud'
from a UK company and Readspeaker. It can deliver TTS functionality to anyone (for
reasons of accessibility, convenience, entertainment or information) with access to a web
browser. Additionally SPEAK.TO.ME from Oxford Information Laboratories is capable of
delivering text to speech through any browser without the need to download any special
applications, and includes smart delivery technology to ensure only what is seen is spoken
and the content is logically pathed.
 Some models of Texas Instruments home computers produced in 1979 and 1981
(Texas Instruments TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis
or reciting complete words and phrases (text-to-dictionary), using a very popular
Speech Synthesizer peripheral. TI used a proprietary codec to embed complete spoken
phrases into applications, primarily video games.[33]
 IBM's OS/2 Warp 4 included VoiceType, a precursor to IBM ViaVoice.
 Various systems operate on free and open source platforms including Linux; they
include open-source programs such as the Festival Speech Synthesis
System, which uses diphone-based synthesis (and can use a limited number
of MBROLA voices), and gnuspeech, from the Free Software Foundation, which uses
articulatory synthesis.[34]
 Companies which developed speech synthesis systems but which are no longer in
this business include BeST Speech (bought by L&H), Eloquent Technology (bought by
SpeechWorks), Lernout & Hauspie (bought by Nuance), SpeechWorks (bought by
Nuance), Rhetorical Systems (bought by Nuance).
Speech synthesis markup languages
A number of markup languages have been established for the rendition of text as speech in
an XML-compliant format. The most recent is Speech Synthesis Markup Language (SSML),
which became a W3C recommendation in 2004. Older speech synthesis markup languages
include Java Speech Markup Language (JSML) and SABLE. Although each of these was
proposed as a standard, none of them has been widely adopted.
Speech synthesis markup languages are distinguished from dialogue markup
languages. VoiceXML, for example, includes tags related to speech recognition, dialogue
management and touchtone dialing, in addition to text-to-speech markup.
Speech synthesis has long been a vital assistive technology tool and its application in this
area is significant and widespread. It allows environmental barriers to be removed for
people with a wide range of disabilities. The longest-standing application has been the use
of screen readers for people with visual impairment, but text-to-speech systems are now
commonly used by people with dyslexia and other reading difficulties as well as by
pre-literate children. They are also frequently employed to aid those with severe speech
impairment, usually through a dedicated voice output communication aid.
Sites such as Ananova and YAKiToMe! have used speech synthesis to convert written news
to audio content, which can be used for mobile applications.
Speech synthesis techniques are also used in entertainment productions such as
games and anime. In 2007, Animo Limited announced the development of a
software application package based on its speech synthesis software FineSpeech, explicitly
geared towards customers in the entertainment industries, able to generate narration and
lines of dialogue according to user specifications. [35] The application reached maturity in
2008, when NEC Biglobe announced a web service that allows users to create phrases from
the voices of Code Geass: Lelouch of the Rebellion R2 characters.[36]
TTS applications such as YAKiToMe! and Speakonia are often used to add synthetic voices to
YouTube videos for comedic effect, as in Barney Bunch videos. YAKiToMe! is also used to
convert entire books for personal podcasting purposes, RSS feeds and web pages for news
stories, and educational texts for enhanced learning.
Software such as Vocaloid can generate singing voices via lyrics and melody. This is also the
aim of the Singing Computer project (which uses GNU LilyPond and Festival) to help blind
people check their lyric input.[37]

Vocal loading
Vocal loading is the stress inflicted on the speech organs when speaking for long periods.
Of the working population, about 15% have professions where their voice is their primary
tool. That includes professions such as teachers, sales personnel, actors and singers, and TV
and radio reporters. Many of them, especially teachers, suffer from voice-related medical
problems. On a larger scale, this accounts for millions of sick-leave days every year, for
example in both the US and the European Union. Still, research in vocal loading has often
been treated as a minor subject.
Voice organ
Voiced speech is produced by air streaming from the lungs through the vocal cords, setting
them into an oscillating movement. In every oscillation, the vocal folds are closed for a
short period of time. When the folds reopen the pressure under the folds is released. These
changes in pressure form the waves called (voiced) speech.
Loading on tissue in vocal folds
The fundamental frequency of speech for an average male is around 110 Hz and for an
average female around 220 Hz. That means that for voiced sounds the vocal folds will hit
together 110 or 220 times a second, respectively. Suppose then that a female is speaking
continuously for an hour. Of this time perhaps five minutes is voiced speech. The folds will
then hit together about 66,000 times in that hour (220 times a second for 300 seconds). It
is intuitively clear that the vocal fold tissue will experience some tiring due to this
large number of hits.
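The arithmetic behind this estimate can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes exactly one fold collision per oscillation cycle and a constant fundamental frequency, which real speech does not have.

```python
def vocal_fold_collisions(f0_hz: float, voiced_seconds: float) -> int:
    """Approximate collision count, assuming one closure per oscillation cycle."""
    return round(f0_hz * voiced_seconds)


# Five minutes of voiced speech within an hour of talking, as in the text.
voiced_seconds = 5 * 60

female_collisions = vocal_fold_collisions(220, voiced_seconds)
male_collisions = vocal_fold_collisions(110, voiced_seconds)

print(f"female (220 Hz): {female_collisions} collisions")
print(f"male   (110 Hz): {male_collisions} collisions")
```

Under these assumptions the count scales linearly with both pitch and voiced time, so a higher speaking pitch or a longer speaking session increases the number of collisions proportionally.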
Vocal loading also includes other kinds of strain on the speech organs. These include all
kinds of muscular strain in the speech organs, just as any other muscles experience
strain when used for an extended period of time. However, researchers' greatest
interest lies in the stress exerted on the vocal folds.
Effect of speaking environment
Several studies in vocal loading show that the speaking environment does have a significant
impact on vocal loading. Still, the exact details are debated. Most scientists agree on the
effect of the following environmental properties:
 air humidity - dry air increases stress experienced in the vocal folds
 hydration - dehydration increases effects of stress inflicted on the vocal folds
 background noise - people tend to speak louder when background noise is present,
even when it isn't necessary. Increasing speaking volume increases stress inflicted on
the vocal folds
 pitch - the "normal" speaking style has close to optimal pitch. Using a higher or
lower pitch than normal will also increase stress in the speech organs.
In addition, smoking and other types of air pollution might have a negative effect on voice
production organs.
Objective evaluation or measurement of vocal loading is very difficult due to the tight
coupling of the experienced psychological and physiological stress. However, there are some
typical symptoms that can be objectively measured. Firstly, the pitch range of the voice will
decrease. Pitch range indicates the possible pitches that can be spoken. When a voice is
loaded, the upper pitch limit will decrease and the lower pitch limit will rise. Similarly, the
volume range will decrease.
Secondly, an increase in the hoarseness and strain of a voice can often be heard.
Unfortunately, both properties are difficult to measure objectively, and only perceptual
evaluations can be performed.
Voice care
Regularly, the question arises of how one should use one's voice to minimise tiring in the
vocal organs. This is encompassed in the study of vocology, the science and practice of
voice habilitation. Basically, a normal, relaxed way of speech is the optimal method for voice
production, in both speech and singing. Any excess force used when speaking will increase
tiring. The speaker should drink enough water and the air humidity level should be normal
or higher. No background noise should be present or, if that is not possible, the voice should be
amplified. Smoking is discouraged.
Vocal rest
Vocal rest is the process of resting the vocal folds by not speaking or singing. It
typically follows vocal disorders or viral infections which cause hoarseness in the voice, such
as the common cold or influenza. The purpose of vocal rest is to hasten recovery. It is
believed that vocal rest, along with rehydration, will significantly decrease recovery time
after a cold. It is generally believed, however, that if one needs to communicate one should
speak and not whisper. The reasons given for this differ: some believe that whispering merely
does not allow the voice to rest and may have a dehydrating effect, while others hold that
whispering can cause additional stress to the larynx.
Vocal range
Vocal range is the measure of the breadth of pitches that a human voice can phonate.
Although the study of vocal range has little practical application in terms of speech, it is a
topic of study within linguistics, phonetics, and speech and language pathology, particularly
in relation to the study of tonal languages and certain types of vocal disorders. However,
the most common application of the term "vocal range" is within the context of singing,
where it is used as one of the major defining characteristics for classifying singing voices
into groups known as voice types.
Singing and the definition of vocal range
While the broadest definition of vocal range is simply the span from the lowest to the
highest note a particular voice can produce, this broad definition is often not what is meant
when "vocal range" is discussed in the context of singing. Vocal pedagogists tend to define
the vocal range as the total span of "musically useful" pitches that a singer can produce.
This is because some of the notes a voice can produce may not be considered usable by the
singer within performance for various reasons. For example, within opera all singers must
project over an orchestra without the aid of a microphone. An opera singer would therefore
only be able to include the notes that they are able to adequately project over an orchestra
within their vocal range. In contrast, a pop artist could include notes that could be heard
with the aid of a microphone.
Another factor to consider is the use of different forms of vocal production. The human voice
is capable of producing sounds using different physiological processes within the larynx.
These different forms of voice production are known as vocal registers. While the exact
number and definition of vocal registers is a controversial topic within the field of singing,
the sciences identify only four registers: the whistle register, the falsetto register, the modal
register, and the vocal fry register. Typically, only the usable range of the modal register,
the register used in normal speech and most singing, is used when determining vocal range.
However, there are some instances where other vocal registers are included. For example,
within opera, countertenors utilize falsetto often and coloratura sopranos utilize the whistle
register frequently. These voice types would therefore include the notes from these other
registers within their vocal range. Another example would be a male doo-wop singer who
might quite regularly deploy his falsetto pitches in performance and thus include them in
determining his range. However, in most cases only the usable pitches within the modal
register are included when determining a singer's vocal range.
Vocal range and voice classification

Vocal range plays such an important role in classifying singing voices into voice types that
sometimes the two terms are confused with one another. A voice type is a particular kind of
human singing voice perceived as having certain identifying qualities or characteristics;
vocal range being only one of those characteristics. Other factors are vocal weight,
vocal tessitura, vocal timbre, vocal transition points, physical characteristics, speech level,
scientific testing, and vocal registration. All of these factors combined are used to categorize
a singer's voice into a particular kind of singing voice or voice type.
There are a plethora of different voice types used by vocal pedagogists today in a variety of
voice classification systems. Most of these types, however, are sub-types that fall under
seven different major voice categories that are for the most part acknowledged across all of
the major voice classification systems. Women are typically divided into three
groups: soprano, mezzo-soprano, and contralto. Men are usually divided into four
groups: countertenor, tenor, baritone, and bass. When considering the pre-pubescent
voices of children an eighth term, treble, can be applied. Within each of these major
categories there are several sub-categories that identify specific vocal qualities
like coloratura facility and vocal weight to differentiate between voices.
Vocal range itself cannot determine a singer's voice type. While each voice type does have
a general vocal range associated with it, human singing voices may possess vocal ranges
that encompass more than one voice type or are in between the typical ranges of two voice
types. Therefore, voice teachers only use vocal range as one factor in classifying a singer's
voice. More important than range in voice classification are tessitura, or where the voice is
most comfortable singing, and vocal timbre, or the characteristic sound of the singing
voice. For example, a female singer may have a vocal range that encompasses the high
notes of a mezzo-soprano and the low notes of a soprano. A voice teacher would therefore
look to see whether the singer was more comfortable singing higher or singing
lower. If the singer were more comfortable singing higher, then the teacher would probably
classify her as a soprano, and if the singer were more comfortable singing lower, then they
would probably classify her as a mezzo-soprano. The teacher would also listen to the sound
of the voice. Sopranos tend to have a lighter and less rich vocal sound than
mezzo-sopranos. A voice teacher, however, would never classify a singer in more than one voice
type, regardless of the size of their vocal range.
The following are the general vocal ranges associated with each voice type using scientific
pitch notation where middle C=C4. Some singers within these voice types may be able to
sing somewhat higher or lower:
 Soprano: C4 – C6
 Mezzo-soprano: A3 – A5
 Contralto: F3 – F5
 Tenor: C3 – C5
 Baritone: F2 – F4
 Bass: E2 – E4
In terms of frequency, human voices are roughly in the range of 80 Hz to 1100 Hz (that is,
E2 to C6) for normal male and female voices together.
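To relate the note names above to frequencies, the following Python sketch converts scientific pitch notation to hertz. It assumes twelve-tone equal temperament with A4 tuned to 440 Hz, a common convention that the text itself does not specify; the function and table names are illustrative only.

```python
import re

# Semitone offset of each natural note above C within one octave.
SEMITONES_ABOVE_C = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(note: str, a4_hz: float = 440.0) -> float:
    """Convert a name such as 'C4' or 'F#3' to Hz (equal temperament, A4 = a4_hz)."""
    match = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", note)
    if match is None:
        raise ValueError(f"unrecognised note name: {note!r}")
    letter, accidental, octave = match.groups()
    semitone = SEMITONES_ABOVE_C[letter] + {"#": 1, "b": -1, "": 0}[accidental]
    midi_number = 12 * (int(octave) + 1) + semitone  # C4 (middle C) is 60, A4 is 69
    return a4_hz * 2 ** ((midi_number - 69) / 12)


# General ranges as listed in the text.
VOICE_RANGES = {
    "Soprano": ("C4", "C6"),
    "Mezzo-soprano": ("A3", "A5"),
    "Contralto": ("F3", "F5"),
    "Tenor": ("C3", "C5"),
    "Baritone": ("F2", "F4"),
    "Bass": ("E2", "E4"),
}

for voice, (low, high) in VOICE_RANGES.items():
    print(f"{voice:14s} {low}-{high}: {note_to_hz(low):7.1f} Hz to {note_to_hz(high):7.1f} Hz")
```

With this tuning, E2 comes out a little above 82 Hz and C6 a little below 1050 Hz, which is consistent with the rough 80 Hz to 1100 Hz span quoted above.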
World records and extremes of vocal range
The following facts about female and male ranges are known:
 Guinness lists the highest demanded note in the classical repertoire as G6 in 'Popoli
di Tessaglia,' a concert aria by W. A. Mozart, composed for Aloysia Weber. Though pitch
standards were not fixed in the eighteenth century, this rare note is also heard in the
opera Esclarmonde by Jules Massenet. The highest note commonly called for is
F6, famously heard in the Queen of the Night's two arias "Der Hölle Rache kocht in
meinem Herzen" and "O zittre nicht, mein lieber Sohn" in Mozart's opera Die Zauberflöte.
Several little-known works call for pitches higher than G6. For example, the soprano Mado
Robin, who was known for her exceptionally high voice, sang a number of compositions
created especially to exploit her highest notes, reaching C7.
 Lowest note in a solo: Guinness lists the lowest demanded note in the classical
repertoire as D2 (almost two octaves below Middle C) in Osmin's second aria in
Mozart's Die Entführung aus dem Serail. Although Osmin's note is the lowest 'demanded'
in the operatic repertoire, lower notes are frequently heard, both written and unwritten,
and it is traditional for basses to interpolate a low C in the duet "Ich gehe doch rathe ich
dir" in the same opera. Leonard Bernstein composed an optional B1 (a minor third below
D2) in a bass aria in the opera house version of Candide. In a Russian
piece combining solo and choral singing, Pavel Chesnokov directs the bass soloist in "Do
not deny me in my old age" to descend even lower, to G1, depending on the
 Lowest note for a choir: Mahler's Eighth Symphony (bar 1457 in the "Chorus
mysticus") and Rachmaninoff's Vespers require B♭1. In Russian choirs
the oktavists traditionally sing an octave below the bass part, down to G1.
Vocal warm up
A vocal warm-up is a series of exercises which prepare the voice for singing, acting, or
other use.
Why Warm Up
A study by Elliott, Sundberg, & Gramming emphasized that changing pitch undoubtedly
stretches the muscles, and any singer will tell you that vocal warm-ups make them feel
more prepared.
Physical whole-body warm-ups also help prepare a singer. Muscles all over the body are
used when singing (the diaphragm being one of the most obvious). Stretches of
the abdomen, back, neck, and shoulders are important to avoid stress, which influences the
sound of the voice.
Some warm ups also train your voice. Sometimes called vocalises, these activities
teach breath control, diction, blending, and balance.
How To Warm Up
Before you start to actually sing, it is important to start breathing properly and from the
diaphragm. Start with simple exercises such as hissing. Take a deep breath in, then make a
hissing sound, breathing outwards until you've expelled as much air as possible from your
lungs. Repeat several times and be sure, when you're breathing in, to breathe using your
diaphragm, not moving your shoulders up and down (moving the shoulders is a common sign of
an untrained breather).
Afterwards, use lip trills and tongue trills to help control your breathing as well. Start
on a steady note, then make a "fire engine sound" that goes up and down. Eventually move to
real notes, starting in the middle of your range, such as Middle C.
Range and Tone
Start easy, with light humming. Pick a note in the middle of your range (Middle C is
reasonable) and begin humming. Move between notes, but stay in the middle range.
To start warming up your range, sigh from the top of your range to the bottom, letting the
voice fall in a glissando without much control. Do several of these, working on getting really
to the highest and lowest parts of your range.
Next, sing an arpeggio of three thirds to an octave (1 3 5 1 5 3 1), again starting
from middle C. Use open vowels, like o, ih, ay, and ah, starting with a consonant like B, D,
or P. Repeat the exercise a half-step higher, and continue up to the top of your range, but
don't push too high.
Next, sing down a five-note scale, with an open vowel and a sibilant like Z. "Za a a a a" is
reasonable. This time, repeat the exercise a half-step lower, to the bottom of your range.
Finally, sing a slightly more difficult phrase, again starting an octave lower than middle
C. Jump first an octave, then down a fourth, then down a third, then another third. (1 8 5 3
1). The phrase "I lo-ove to sing" fits with this exercise. Others choose to sing a few words
over and over to warm up, such as "Me, my, mo, mull."
Vocology
Vocology is the science of enabling or endowing the human voice with greater ability or
fitness. Its concerns include the nature of speech and language pathology, the defects of
the vocal tract (laryngology), the remediation of speech therapy and the voice
training and voice pedagogy of song and speech for actors and public speakers.
The study of vocology is recognized academically in taught courses and institutes such as
the National Center for Voice and Speech, Westminster Choir College at Rider University,
The Grabscheid Voice Center at Mount Sinai Medical Center, the Vox Humana Laboratory
at St. Luke's-Roosevelt Hospital Center and the Regional Center for Voice and Swallowing,
at Milan's Azienda Ospedaliera Fatebenefratelli e Oftalmico.
Also reflecting this increased recognition, when the Scandinavian Journal of
Logopedics & Phoniatrics and Voice merged in 1996, the new name selected was Logopedics,
Phoniatrics, Vocology.
Meaning and Origin of term
The term vocology was coined (simultaneously, but independently) by Ingo R. Titze and
an otolaryngologist at Washington University, Prof. George Gates. Titze defines vocology as
"the science and practice of voice habilitation, with a strong emphasis on habilitation". To
habilitate means to "enable", to "equip for", to "capacitate"; in other words, to assist in
performing whatever function needs to be performed. He goes on to say that this "is more
than repairing a voice or bringing it back to a former state ... rather, it is the process of
strengthening and equipping the voice to meet very specific and special demands".
Voice analysis
Voice analysis is the study of speech sounds for purposes other than linguistic content,
which is the concern of speech recognition. Such studies include mostly medical analysis of
the voice, i.e. phoniatrics, but also speaker identification. More controversially, some believe
that the truthfulness or emotional state of speakers can be determined using Voice Stress
Analysis or Layered Voice Analysis.
Typical voice problems
A medical study of the voice can be, for instance, analysis of the voice of patients who have
had a polyp removed from their vocal cords through an operation. In order to
objectively evaluate the improvement in voice quality there has to be some measure of
voice quality. An experienced voice therapist can quite reliably evaluate the voice, but this
requires extensive training and is still always subjective.
Another active research topic in medical voice analysis is vocal loading evaluation. The vocal
cords of a person speaking for an extended period of time will suffer from tiring; that is, the
process of speaking exerts a load on the vocal cords, and the tissue tires as a result.
Among professional voice users (e.g. teachers, sales people) this tiring can cause voice
failures and sick leave. To evaluate these problems, vocal loading needs to be objectively
measured.
Analysis methods
Voice problems that require voice analysis most commonly originate from the vocal folds or
the laryngeal musculature that controls them. The folds are subject to collision forces
with each vibratory cycle and to drying from the air being forced through the small gap
between them, and the laryngeal musculature is intensely active during speech or singing
and is subject to tiring. However, dynamic analysis of the vocal folds and their movement is
physically difficult. The location of the vocal folds effectively prohibits direct, invasive
measurement of movement. Less invasive imaging methods such as X-rays or ultrasound do not
work because the vocal cords are surrounded by cartilage, which distorts image quality.
Movements in the vocal cords are rapid (fundamental frequencies are usually between 80 and
300 Hz), which prevents the use of ordinary video. Stroboscopic and high-speed video provide
an option, but in order to see the vocal folds, a fiberoptic probe leading to the camera has
to be positioned in the throat, which makes speaking difficult. In addition, placing objects
in the pharynx usually triggers a gag reflex that stops voicing and closes the larynx.
Furthermore, stroboscopic imaging is only useful when the vocal fold vibratory pattern is
closely periodic.
The most important indirect methods are currently inverse filtering of either microphone or
oral airflow recordings and electroglottography (EGG). In inverse filtering, the speech sound
(the radiated acoustic pressure waveform, as obtained from a microphone) or the oral
airflow waveform from a circumferentially vented (CV) mask is recorded outside the mouth
and then filtered by a mathematical method to remove the effects of the vocal tract. This
method produces an estimate of the waveform of the glottal airflow pulses, which in turn
reflect the movements of the vocal folds. The other kind of noninvasive indirect indication of
vocal fold motion is electroglottography, in which electrodes placed on either side of the
subject's throat at the level of the vocal folds record the changes in the conductivity of the
throat according to how large a portion of the vocal folds is touching. It thus
yields one-dimensional information about the contact area. Neither inverse filtering nor EGG is
sufficient to completely describe the complex three-dimensional pattern of vocal fold movement,
but both can provide useful indirect evidence of that movement.
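The idea behind inverse filtering can be illustrated with a toy Python sketch. It makes strong simplifying assumptions that are not part of the text: the glottal source is an idealised impulse train, the vocal tract is modelled as a single all-pole resonance (500 Hz centre, 100 Hz bandwidth, both hypothetical values), and the filter coefficients are known exactly. In real analysis the vocal tract filter is unknown and has to be estimated from the recording, which is the difficult part.

```python
import math


def resonance_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Denominator coefficients [1, a1, a2] of a two-pole resonator (toy vocal tract)."""
    radius = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    angle = 2 * math.pi * freq_hz / sample_rate_hz
    return [1.0, -2 * radius * math.cos(angle), radius * radius]


def vocal_tract_filter(source, a):
    """All-pole filter: y[n] = x[n] - a[1]*y[n-1] - ... - a[p]*y[n-p]."""
    output = []
    for n, sample in enumerate(source):
        acc = sample
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * output[n - k]
        output.append(acc)
    return output


def inverse_filter(speech, a):
    """Undo the all-pole filter: x[n] = y[n] + a[1]*y[n-1] + ... + a[p]*y[n-p]."""
    estimate = []
    for n, sample in enumerate(speech):
        acc = sample
        for k in range(1, len(a)):
            if n - k >= 0:
                acc += a[k] * speech[n - k]
        estimate.append(acc)
    return estimate


sample_rate_hz = 8000
period = sample_rate_hz // 100  # one idealised glottal pulse every 80 samples (100 Hz)
glottal_pulses = [1.0 if n % period == 0 else 0.0 for n in range(400)]

a = resonance_coefficients(500, 100, sample_rate_hz)
speech = vocal_tract_filter(glottal_pulses, a)   # what a microphone would pick up
estimate = inverse_filter(speech, a)             # recovered glottal source

max_error = max(abs(e - g) for e, g in zip(estimate, glottal_pulses))
print(f"largest reconstruction error: {max_error:.2e}")
```

Because the same coefficients are used in both directions, the recovered source matches the original up to floating-point rounding. With an estimated filter the result would only approximate the glottal waveform.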
List of voice disorders
From Wikipedia, the free encyclopedia
Voice disorders are medical conditions affecting the production of speech. These include:
 Chorditis
 Vocal fold nodules
 Vocal fold cysts
 Vocal cord paresis
 Reinke's Edema
 Spasmodic dysphonia
 Foreign accent syndrome
 Bogart-Bacall Syndrome
 Laryngeal papillomatosis
 Puberphonia
Voice frequency
A voice frequency (VF) or voice band is one of the frequencies, within part of
the audio range, that is used for the transmission of speech.
In telephony, the usable voice frequency band ranges from approximately 300 Hz to
3400 Hz. It is for this reason that the ultra low frequency band of the electromagnetic
spectrum between 300 and 3000 Hz is also referred to as voice frequency (despite the fact
that this is electromagnetic energy, not acoustic energy). The bandwidth allocated for a
single voice-frequency transmission channel is usually 4 kHz, including guard bands,
allowing a sampling rate of 8 kHz to be used as the basis of the pulse code
modulation system used for the digital PSTN.
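The relationship between the 4 kHz channel and the 8 kHz sampling rate follows from the sampling theorem: a signal must be sampled at a rate of at least twice its bandwidth. A minimal Python sketch is given below. The function names are illustrative, and the 8 bits per sample used for the bit-rate figure is an assumption about the PCM coding, since the text does not state a sample size.

```python
def minimum_sampling_rate_hz(channel_bandwidth_hz: float) -> float:
    """Sampling theorem: sample at no less than twice the channel bandwidth."""
    return 2 * channel_bandwidth_hz


def pcm_bit_rate(sample_rate_hz: float, bits_per_sample: int) -> float:
    """Raw bit rate of an uncompressed PCM stream, in bits per second."""
    return sample_rate_hz * bits_per_sample


channel_bandwidth_hz = 4000  # one voice channel including guard bands
sample_rate_hz = minimum_sampling_rate_hz(channel_bandwidth_hz)

# Assumption: 8 bits per sample (not stated in the text).
bit_rate = pcm_bit_rate(sample_rate_hz, bits_per_sample=8)

print(f"sampling rate: {sample_rate_hz:.0f} Hz")
print(f"bit rate at 8 bits per sample: {bit_rate:.0f} bit/s")
```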
Fundamental frequency
The voiced speech of a typical adult male will have a fundamental frequency from 85 to
180 Hz, and that of a typical adult female from 165 to 255 Hz. Thus, the fundamental
frequency of most speech falls below the bottom of the "voice frequency" band as defined
above. However, enough of the harmonic series will be present for the missing
fundamental to create the impression of hearing the fundamental tone.
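The following Python sketch illustrates why the fundamental can be inferred even though the telephone band removes it. It lists the harmonics of a given fundamental that survive in the 300 Hz to 3400 Hz band and recovers their common spacing with a greatest common divisor. This is a deliberately simple stand-in for pitch perception, not a model of hearing, and it assumes integer frequencies in hertz; the function names are illustrative.

```python
from functools import reduce
from math import gcd


def harmonics_in_band(f0_hz: int, low_hz: int = 300, high_hz: int = 3400) -> list:
    """Harmonics k * f0 that fall inside the band, inclusive of the band edges."""
    first = -(-low_hz // f0_hz)  # ceiling division
    last = high_hz // f0_hz
    return [k * f0_hz for k in range(first, last + 1)]


def implied_fundamental_hz(harmonics: list) -> int:
    """Common spacing of the harmonics; a crude proxy for the perceived pitch."""
    return reduce(gcd, harmonics)


for f0_hz in (100, 110, 220):
    passed = harmonics_in_band(f0_hz)
    print(
        f"f0 = {f0_hz} Hz: fundamental transmitted = {f0_hz in passed}, "
        f"{len(passed)} harmonics pass, implied fundamental = "
        f"{implied_fundamental_hz(passed)} Hz"
    )
```

In each case the fundamental itself lies below 300 Hz and is filtered out, yet the spacing of the surviving harmonics still equals the original fundamental.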
Vocal apparatus

The human head and neck (internal).

Vocal apparatus or vocal organs is a term used in phonetics to designate all parts
of human anatomy that can be used to produce speech. This includes
the lips, tongue, teeth, hard and soft palates, uvula, larynx, lungs, etc.
The voice organ is the part of the human body responsible for the generation of sound,
usually in the form of speech or singing. It comprises the larynx and the vocal tract.
The human voice produces sounds in the following manner:
1. Air pressure from the lungs creates a steady flow of air through
the trachea (windpipe), larynx (voice box) and pharynx (back of the throat).
2. The vocal folds in the larynx vibrate, creating fluctuations in air pressure that are
known as sound waves.
3. Resonances in the vocal tract modify these waves according to the position and
shape of the lips, jaw, tongue, soft palate, and other speech organs,
creating formant regions and thus different qualities of sonorant (voiced) sound.
4. Mouth and nose openings radiate the sound waves into the environment.
The larynx
The larynx or voice box is a cylindrical framework of cartilage that serves to anchor
the vocal folds. When the muscles of the vocal folds contract, the airflow from the lungs is
impeded until the vocal folds are forced apart again by the increasing air pressure from the
lungs. This process continues in a periodic cycle that is felt as a vibration (buzzing). In
singing, the vibration frequency of the vocal folds determines the pitch of the sound
produced. Voiced phonemes such as the pure vowels are, by definition, distinguished by the
buzzing sound of this periodic oscillation of the vocal cords.
The lips of the mouth can be used in a similar way to create a similar sound, as
any toddler or trumpeter can demonstrate. A rubber balloon, inflated but not tied off and
stretched tightly across the neck produces a squeak or buzz, depending on
the tension across the neck and the level of pressure inside the balloon. Similar actions,
with similar results, occur when the vocal cords are contracted or relaxed across the larynx.
The vocal tract
The sound source from the larynx is not sufficiently loud to be heard as speech, nor can the
various timbres of different vowel sounds be produced: without the vocal tract, only a
buzzing sound would be heard.
Production of vowels
A vowel is any phoneme in which airflow is impeded only or mostly by the voicing action of
the vocal cords.
The well-defined fundamental frequency provided by the vocal cords in voiced phonemes is
only a convenience, however, not a necessity, since a strictly unvoiced whisper is still quite
intelligible. Our interest is therefore most focused on further modulations of and additions to
the fundamental tone by other parts of the vocal apparatus, determined by the variable
dimensions of oral, pharyngeal, and even nasal cavities.
Formants are the resonant frequencies of the vocal tract that emphasize particular voice
harmonics near in frequency to the resonance, or turbulent non-periodic energy (i.e. noise)
near the formant frequency in the case of whispered speech. The formants tell a listener
what vowel is being spoken.
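How a formant emphasizes nearby harmonics can be sketched in Python by modelling a single formant as a two-pole digital resonator and asking which harmonic of the voice receives the largest gain. The formant frequency (700 Hz), bandwidth (80 Hz) and sampling rate (16 kHz) below are hypothetical values chosen for illustration; a real vocal tract has several formants whose frequencies depend on the vowel and the speaker.

```python
import cmath
import math


def resonance_gain(freq_hz, formant_hz, bandwidth_hz=80.0, sample_rate_hz=16000.0):
    """Magnitude response of a two-pole resonator with its peak near formant_hz."""
    radius = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)  # pole radius
    angle = 2 * math.pi * formant_hz / sample_rate_hz            # pole angle
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    denominator = 1 - 2 * radius * math.cos(angle) * z_inv + radius * radius * z_inv ** 2
    return 1.0 / abs(denominator)


def most_emphasized_harmonic(f0_hz: int, formant_hz: float, max_hz: int = 4000) -> int:
    """Harmonic of f0 (in Hz) that the single-formant resonator boosts the most."""
    harmonics = range(f0_hz, max_hz + 1, f0_hz)
    return max(harmonics, key=lambda f: resonance_gain(f, formant_hz))


for f0_hz in (110, 220):
    strongest = most_emphasized_harmonic(f0_hz, formant_hz=700)
    print(f"f0 = {f0_hz} Hz: strongest harmonic near the 700 Hz formant is {strongest} Hz")
```

The harmonic that is boosted changes with the voice's fundamental, but it always lies close to the formant frequency, which is why the same vowel remains recognisable at different pitches.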
Vocal pedagogy
Vocal pedagogy is the study of the art and science of voice instruction. It is utilized in the
teaching of singing and assists in defining what singing is, how singing works, and how
proper singing technique is accomplished.
The anatomy of the vocal folds is an important topic in the field of vocal pedagogy.

Laryngoscopic view of the vocal folds (Latin: plica vocalis).

Abduction and adduction.
Vocal pedagogy covers a broad range of aspects of singing, ranging from the physiological
process of vocal production to the artistic aspects of interpretation of songs from different
genres or historical eras. Typical areas of study include:
 Human anatomy and physiology as it relates to the physical process of singing.
 Breathing and air support for singing
 Posture for singing
 Phonation
 Vocal resonation or voice projection
 Diction, vowels and articulation
 Vocal registration
 Sostenuto and legato for singing
 Other singing elements, such as range extension, tone quality, vibrato, coloratura
 Vocal health and voice disorders related to singing
 Vocal styles, such as learning to sing opera, belt, or Art song
 Phonetics
 Voice classification
All of these different concepts are a part of developing proper vocal technique. Not all
vocal teachers have the same opinions within every topic of study which causes variations in
pedagogical approaches and vocal technique.

Pythagoras, the man in the center with the book, teaching music, in The School of
Athens by Raphael
Within Western culture, the study of vocal pedagogy began in Ancient Greece. Scholars such
as Alypius and Pythagoras studied and made observations on the art of singing. It is
unclear, however, whether the Greeks ever developed a systematic approach to teaching
singing as little writing on the subject survives today.
The first surviving record of a systematized approach to teaching singing was developed in
the medieval monasteries of the Roman Catholic Church sometime near the beginning of the
13th century. As with other fields of study, the monasteries were the center of musical
intellectual life during the medieval period and many men within the monasteries devoted
their time to the study of music and the art of singing. Highly influential in the development
of a vocal pedagogical system were monks Johannes de Garlandia and Jerome of
Moravia who were the first to develop a concept of vocal registers. These men identified
three registers: chest voice, throat voice, and head voice (pectoris, guttoris, and capitis).
Their concept of head voice, however, is much more similar to the modern pedagogists'
understanding of the falsetto register. Other concepts discussed in the monastic system
included vocal resonance, voice classification, breath support, diction, and tone quality to
name a few. The ideas developed within the monastic system highly influenced the
development of vocal pedagogy over the next several centuries including the Bel Canto style
of singing. 
With the onset of the Renaissance in the 15th century, the study of singing began to move
outside of the church. The courts of rich patrons, such as the Dukes of Burgundy who
supported the Burgundian School and the Franco-Flemish School, became secular centers of
study for singing and all other areas of musical study. The vocal pedagogical methods
taught in these schools, however, were based on the concepts developed within the
monastic system. Many of the teachers within these schools had their initial musical training
from singing in church choirs as children. The church also remained at the forefront of
musical composition at this time and remained highly influential in shaping musical tastes
and practices both in and outside the church. It was the Catholic Church that first
popularized the use of castrato singers in the 16th century, which ultimately led to the
popularity of castrato voices in Baroque and Classical operas.
It was not until the development of opera in the 17th century that vocal pedagogy began to
break away from some of the established thinking of the monastic writers and develop
deeper understandings of the physical process of singing and its relation to key concepts
like vocal registration and vocal resonation. It was also during this time that noted voice
teachers began to emerge. Giulio Caccini is an example of an important early Italian voice
teacher. In the late 17th century, the bel canto method of singing began to develop in Italy.
This style of singing had a huge impact on the development of opera and the development
of vocal pedagogy during the Classical and Romantic periods. It was during this time that
teachers and composers first began to identify singers by and write roles for more
specific voice types. However, it wasn't until the 19th century that more clearly defined
voice classification systems like the German Fach system emerged. Within these systems,
more descriptive terms were used in classifying voices such as coloratura soprano and lyric soprano.

Examining the vocal mechanism with a laryngoscope, late 19th century

Voice teachers in the 19th century continued to train singers for careers in opera. Manuel
Patricio Rodríguez García is often considered one of the most important voice teachers of
the 19th century, and is credited with the development of the laryngoscope and the
beginning of modern voice pedagogy.
Mathilde Marchesi was both an important singer and teacher of singing at the turn of the
20th century.
The field of voice pedagogy became more fully developed in the middle of the 20th century.
A few American voice teachers began to study the science, anatomy, and physiology of
singing, especially Ralph Appelman at Indiana University, Oren Brown at the Washington
University School of Medicine and later the Juilliard School, and William Vennard at
the University of Southern California. This shift in approach to the study of singing led to the
rejection of many of the assertions of the bel canto singing method, most particularly in the
areas of vocal registration and vocal resonation. As a result, there are two
predominating schools of thought among voice teachers today: those who maintain the
historical positions of the bel canto method and those who choose to embrace more
contemporary understandings based in current knowledge of human anatomy and
physiology. There are also those teachers who borrow ideas from both perspectives,
creating a hybrid of the two.
Appelman and Vennard were also part of a group of voice instructors who developed
courses of study for beginning voice teachers, adding these scientific ideas to the standard
exercises and empirical ways to improve vocal technique, and by 1980 the subject of voice
pedagogy was beginning to be included in many college music degree programs for singers
and vocal music educators.
More recent works by authors such as Richard Miller and Johan Sundberg have increased
the general knowledge of voice teachers, and scientific and practical aspects of voice
pedagogy continue to be studied and discussed by professionals. In addition, the creation of
organizations such as the National Association of Teachers of Singing (now an international
organization of vocal instructors) has enabled voice teachers to establish more of a
consensus about their work, and has expanded the understanding of what singing teachers do.
Topics of study
Pedagogical philosophy
There are basically three major approaches to vocal pedagogy, all related to how the
mechanistic and psychological controls are employed within the act of singing. Some voice
instructors advocate an extreme mechanistic approach that believes that singing is largely a
matter of getting the right physical parts in the right places at the right time, and that
correcting vocal faults is accomplished by calling direct attention to the parts which are not
working well. At the other extreme is the school of thought that believes that attention
should never be directed to any part of the vocal mechanism--that singing is a matter of
producing the right mental images of the desired tone, and that correcting vocal faults is
achieved by learning to think the right thoughts and by releasing the emotions through
interpretation of the music. Most voice teachers, however, believe that the truth lies
somewhere in between the two extremes and adopt a composite of those two approaches.
The nature of vocal sounds
Physiology of vocal sound production

There are four physical processes involved in producing vocal sound: respiration,
phonation, resonation, and articulation. These processes occur in the
following sequence:
 1. Breath is taken
 2. Sound is initiated in the larynx
 3. The vocal resonators receive the sound and influence it
 4. The articulators shape the sound into recognizable units
Although these four processes are to be considered separately, in actual practice they
merge into one coordinated function. With an effective singer or speaker, one should rarely
be reminded of the process involved as their mind and body are so coordinated that one
only perceives the resulting unified function. Many vocal problems result from a lack of
coordination within this process.

A labeled anatomical diagram of the vocal folds or cords.

In its most basic sense, respiration is the process of moving air in and out of the body--
inhalation and exhalation. Breathing for singing and speaking is a more controlled process
than is the ordinary breathing used for sustaining life. The controls applied to exhalation are
particularly important in good vocal technique.
Phonation is the process of producing vocal sound by the vibration of the vocal folds that is
in turn modified by the resonance of the vocal tract. It takes place in the larynx when
the vocal folds are brought together and breath pressure is applied to them in such a way
that vibration ensues causing an audible source of acoustic energy, i.e., sound, which can
then be modified by the articulatory actions of the rest of the vocal apparatus. The vocal
folds are brought together primarily by the action of the interarytenoid muscles, which pull
the arytenoid cartilages together.

Vocal resonation is the process by which the basic product of phonation is enhanced in
timbre and/or intensity by the air-filled cavities through which it passes on its way to the
outside air. Various terms related to the resonation process include amplification,
enrichment, enlargement, improvement, intensification, and prolongation, although in
strictly scientific usage acoustic authorities would question most of them. The main point to
be drawn from these terms by a singer or speaker is that the end result of resonation is, or
should be, to make a better sound.
There are seven areas that may be listed as possible vocal resonators. In sequence from the
lowest within the body to the highest, these areas are the chest, the tracheal tree,
the larynx itself, the pharynx, the oral cavity, the nasal cavity, and the sinuses.
Places of articulation (passive & active):
1. Exo-labial, 2. Endo-labial, 3. Dental, 4. Alveolar, 5. Post-alveolar, 6. Pre-palatal, 7.
Palatal, 8. Velar, 9. Uvular, 10. Pharyngeal, 11. Glottal, 12. Epiglottal, 13. Radical, 14.
Postero-dorsal, 15. Antero-dorsal, 16. Laminal, 17. Apical, 18. Sub-apical
Articulation is the process by which the joint product of the vibrator and the resonators is
shaped into recognizable speech sounds through the muscular adjustments and movements
of the speech organs. These adjustments and movements of the articulators result in verbal
communication and thus form the essential difference between the human voice and other
musical instruments. Singing without understandable words limits the voice to nonverbal
communication. In relation to the physical process of singing, vocal instructors tend to focus
more on active articulation as opposed to passive articulation. There are five basic active
articulators: the lip ("labial consonants"), the flexible front of the tongue ("coronal
consonants"), the middle/back of the tongue ("dorsal consonants"), the root of the tongue
together with the epiglottis ("radical consonants"), and the larynx ("laryngeal consonants").
These articulators can act independently of each other, and two or more may work together
in what is called coarticulation.
Unlike active articulation, passive articulation is a continuum without many clear-cut
boundaries. The places linguolabial and interdental, interdental and dental, dental and
alveolar, alveolar and palatal, palatal and velar, velar and uvular merge into one another,
and a consonant may be pronounced somewhere between the named places.
In addition, when the front of the tongue is used, it may be the upper surface or blade of
the tongue that makes contact ("laminal consonants"), the tip of the tongue ("apical
consonants"), or the under surface ("sub-apical consonants"). These articulations also
merge into one another without clear boundaries.
Interpretation is sometimes listed by voice teachers as a fifth physical process even though
strictly speaking it is not a physical process. The reason for this is that interpretation does
influence the kind of sound a singer makes which is ultimately achieved through a physical
action the singer is doing. Although teachers may acquaint their students with musical
styles and performance practices and suggest certain interpretive effects, most voice
teachers agree that interpretation cannot be taught. Students who lack a natural creative
imagination and aesthetic sensibility cannot learn it from someone else. Failure to interpret
well is not a vocal fault even though it may affect vocal sound significantly.
Classification of vocal sounds
Vocal sounds are divided into two basic categories, vowels and consonants, with a wide
variety of sub-classifications. Voice teachers and serious voice students spend a great deal
of time studying how the voice forms vowels and consonants, and studying the problems
that certain consonants or vowels may cause while singing. The International Phonetic
Alphabet is used frequently by voice teachers and their students.
Problems in describing vocal sounds
Describing vocal sound is an inexact science largely because the human voice is a self-
contained instrument. Since the vocal instrument is internal, the singer's ability to monitor
the sound produced is complicated by the vibrations carried to the ear through the
Eustachian (auditory) tube and the bony structures of the head and neck. In other words,
most singers hear something different in their ears/head than what a person listening to
them hears. As a result, voice teachers often focus less on how it "sounds" and more on
how it "feels". Vibratory sensations resulting from the closely-related processes of phonation
and resonation, and kinesthetic ones arising from muscle tension, movement, body position,
and weight serve as a guide to the singer on correct vocal production.
Another problem in describing vocal sound lies in the vocal vocabulary itself. There are
many schools of thought within vocal pedagogy and different schools have adopted different
terms, sometimes from other artistic disciplines. This has led to the use of a plethora of
descriptive terms applied to the voice which are not always understood to mean the same
thing. Some terms sometimes used to describe a quality of a voice's sound are: warm,
white, dark, light, round, reedy, spread, focused, covered, swallowed, forward, ringing,
hooty, bleaty, plummy, mellow, pear-shaped, and so forth.
Posture
The singing process functions best when certain physical conditions of the body exist. The
ability to move air in and out of the body freely and to obtain the needed quantity of air can
be seriously affected by the posture of the various parts of the breathing mechanism. A
sunken chest position will limit the capacity of the lungs, and a tense abdominal wall will
inhibit the downward travel of the diaphragm. Good posture allows the breathing
mechanism to fulfill its basic function efficiently without any undue expenditure of energy.
Good posture also makes it easier to initiate phonation and to tune the resonators as proper
alignment prevents unnecessary tension in the body. Voice instructors have also noted that
when singers assume good posture it often provides them with a greater sense of self
assurance and poise while performing. Audiences also tend to respond better to singers with
good posture. Habitual good posture also ultimately improves the overall health of the body
by enabling better blood circulation and preventing fatigue and stress on the body.
Breathing and breath support
In the words of Robert C. White, who paraphrased a "Credo" for singing (no blasphemy intended):
In the Beginning there was Breath, and Singing was with Breath, and Singing was Breath,
and Singing was Breath. And all singing was made by the Breath, and without Breath was
not any Singing made that was made. (White 1988, p. 26)
All singing begins with breath. All vocal sounds are created by vibrations in
the larynx caused by air from the lungs. Breathing in everyday life is a subconscious bodily
function which occurs naturally; however, the singer must have control of the intake and
exhalation of breath to achieve maximum results from their voice.
Natural breathing has three stages: a breathing-in period, a breathing out period, and a
resting or recovery period; these stages are not usually consciously controlled. Within
singing there are four stages of breathing:
 1. a breathing-in period (inhalation)
 2. a setting up controls period (suspension)
 3. a controlled exhalation period (phonation)
 4. a recovery period
These stages must be under conscious control by the singer until they become conditioned
reflexes. Many singers abandon conscious controls before their reflexes are fully conditioned
which ultimately leads to chronic vocal problems.
Voice classification
Voice type
 Female voices: Soprano, Mezzo-soprano, Contralto
 Male voices: Countertenor, Tenor, Baritone, Bass
In European classical music and opera, voices are treated like musical instruments.
Composers who write vocal music must have an understanding of the skills, talents, and vocal
properties of singers. Voice classification is the process by which human singing voices are
evaluated and are thereby designated into voice types. These qualities include but are not
limited to: vocal range, vocal weight, vocal tessitura, vocal timbre, and vocal transition
points such as breaks and lifts within the voice. Other considerations are physical
characteristics, speech level, scientific testing, and vocal registration. The science behind
voice classification developed within European classical music and has been slow in adapting
to more modern forms of singing. Voice classification is often used within opera to associate
possible roles with potential voices. There are currently several different systems in use
within classical music including: the German Fach system and the choral music system
among many others. No system is universally applied or accepted.
However, most classical music systems acknowledge seven different major voice categories.
Women are typically divided into three groups: soprano, mezzo-soprano, and contralto. Men
are usually divided into four groups: countertenor, tenor, baritone, and bass. When
considering children's voices, an eighth term, treble, can be applied. Within each of these
major categories there are several sub-categories that identify specific vocal qualities
like coloratura facility and vocal weight to differentiate between voices.
Within choral music, singers' voices are divided solely on the basis
of vocal range. Choral music most commonly divides vocal parts into high and low voices
within each sex (SATB). As a result, the typical choral situation affords many opportunities
for misclassification to occur. Since most people have medium voices, they must be
assigned to a part that is either too high or too low for them; the mezzo-soprano must sing
soprano or alto and the baritone must sing tenor or bass. Either option can present
problems for the singer, but for most singers there are fewer dangers in singing too low
than in singing too high.
Within contemporary forms of music (sometimes referred to as Contemporary Commercial
Music), singers are classified by the style of music they sing, such as jazz, pop, blues, soul,
country, folk, and rock styles. There is currently no authoritative voice classification system
within non-classical music. Attempts have been made to adopt classical voice type terms to
other forms of singing but such attempts have been met with controversy. Voice
categorizations were developed with the understanding that the singer would be using
classical vocal technique within a specified range using unamplified (no microphones) vocal
production. Since contemporary musicians use different vocal techniques, sing with
microphones, and are not forced to fit into a specific vocal role, applying such terms as
soprano, tenor, baritone, etc. can be misleading or even inaccurate.
Dangers of quick identification
Many voice teachers warn of the dangers of quick identification. Premature concern with
classification can result in misclassification, with all its attendant dangers. Vennard says:
"I never feel any urgency about classifying a beginning student. So many premature
diagnoses have been proved wrong, and it can be harmful to the student and embarrassing
to the teacher to keep striving for an ill-chosen goal. It is best to begin in the middle part of
the voice and work upward and downward until the voice classifies itself."
Most voice teachers believe that it is essential to establish good vocal habits within a limited
and comfortable range before attempting to classify the voice. When techniques of posture,
breathing, phonation, resonation, and articulation have become established in this
comfortable area, the true quality of the voice will emerge and the upper and lower limits of
the range can be explored safely. Only then can a tentative classification be arrived at, and
it may be adjusted as the voice continues to develop. Many acclaimed voice instructors
suggest that teachers begin by assuming that a voice is of a medium classification until it
proves otherwise. The reason for this is that the majority of individuals possess medium
voices and therefore this approach is less likely to misclassify or damage the voice.
Vocal registration
Vocal registration refers to the system of vocal registers within the human voice. A
register in the human voice is a particular series of tones, produced in the same vibratory
pattern of the vocal folds, and possessing the same quality. Registers originate
in laryngeal function. They occur because the vocal folds are capable of producing several
different vibratory patterns. Each of these vibratory patterns appears within a
particular range of pitches and produces certain characteristic sounds. The term register can
be somewhat confusing as it encompasses several aspects of the human voice. The term
register can be used to refer to any of the following:
 A particular part of the vocal range such as the upper, middle, or lower registers.
 A resonance area such as chest voice or head voice.
 A phonatory process
 A certain vocal timbre
 A region of the voice which is defined or delimited by vocal breaks.
 A subset of a language used for a particular purpose or in a particular social setting.
In linguistics, a register language is a language which combines tone and
vowel phonation into a single phonological system.
Within speech pathology the term vocal register has three constituent elements: a certain
vibratory pattern of the vocal folds, a certain series of pitches, and a certain type of sound.
Speech pathologists identify four vocal registers based on the physiology of laryngeal
function: the vocal fry register, the modal register, the falsetto register, and the whistle
register. This view is also adopted by many teachers of singing.
Some voice teachers, however, organize registers differently. There are over a dozen
different constructs of vocal registers in use within the field. The confusion which exists
concerning what a register is, and how many registers there are, is due in part to what
takes place in the modal register when a person sings from the lowest pitches of that
register to the highest pitches. The frequency of vibration of the vocal folds is determined
by their length, tension, and mass. As pitch rises, the vocal folds are lengthened, tension
increases, and their thickness decreases. In other words, all three of these factors are in a
state of flux in the transition from the lowest to the highest tones.
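As a rough illustration of how length, tension, and mass interact, the sketch below uses the ideal-string formula, f = (1 / 2L) * sqrt(stress / density), which is sometimes used as a first approximation for vocal fold vibration. The formula is not taken from the text above, and the function name and all numerical values are hypothetical, chosen only to show the direction of each effect.

```python
import math

def ideal_string_f0(length_m, stress_pa, density_kg_m3):
    """Fundamental frequency (Hz) of an ideal string fixed at both ends.

    Simplified stand-in for a vocal fold: real folds are layered tissue,
    so this is only a first approximation.
    """
    return math.sqrt(stress_pa / density_kg_m3) / (2.0 * length_m)

# Hypothetical values, for illustration only.
low = ideal_string_f0(length_m=0.016, stress_pa=15_000, density_kg_m3=1_040)
# Lengthening alone would lower the pitch ...
longer_only = ideal_string_f0(length_m=0.020, stress_pa=15_000, density_kg_m3=1_040)
# ... so the rise in tension (stress) must outweigh the added length.
high = ideal_string_f0(length_m=0.020, stress_pa=60_000, density_kg_m3=1_040)

print(f"{low:.0f} Hz, {longer_only:.0f} Hz, {high:.0f} Hz")
```

In this simplified model, stress is tension divided by cross-sectional area, so a thinner fold under the same tension carries a higher stress and vibrates faster.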
If a singer holds any of these factors constant and interferes with their progressive state of
change, the laryngeal function tends to become static and eventually breaks occur with
obvious changes of tone quality. These breaks are often identified as register boundaries or
as transition areas between registers. The distinct change or break between registers is
called a passaggio or a ponticello. Vocal instructors teach that with study a singer can move
from one register to the other with ease and consistent tone. Registers can even
overlap while singing. Teachers who like to use this theory of "blending registers" usually
help students through the "passage" from one register to another by hiding their "lift"
(where the voice changes).
However, many voice instructors disagree with this distinction of boundaries blaming such
breaks on vocal problems which have been created by a static laryngeal adjustment that
does not permit the necessary changes to take place. This difference of opinion has affected
the different views on vocal registration.
Singing is an integrated and coordinated act and it is difficult to discuss any of the individual
technical areas and processes without relating them to the others. For example, phonation
only comes into perspective when it is connected with respiration; the articulators affect
resonance; the resonators affect the vocal folds; the vocal folds affect breath control; and
so forth. Vocal problems are often a result of a breakdown in one part of this coordinated
process which causes voice teachers to frequently focus intensively on one area of the
process with their student until that issue is resolved. However, some areas of the art of
singing are so much the result of coordinated functions that it is hard to discuss them under
a traditional heading like phonation, resonation, articulation, or respiration.
Once the voice student has become aware of the physical processes that make up the act of
singing and of how those processes function, the student begins the task of trying to
coordinate them. Inevitably, students and teachers will become more concerned with one
area of the technique than another. The various processes may progress at different rates,
with a resulting imbalance or lack of coordination. The areas of vocal technique which seem
to depend most strongly on the student's ability to coordinate various functions are:
 1. Extending the vocal range to its maximum potential
 2. Developing consistent vocal production with a consistent tone quality
 3. Developing flexibility and agility
 4. Achieving a balanced vibrato
Developing the singing voice
Singing is not a natural process but is a skill that requires highly developed muscle reflexes.
Singing does not require much muscle strength but it does require a high degree of muscle
coordination. Individuals can develop their voices further through the careful and systematic
practice of both songs and vocal exercises. Voice teachers instruct their students to
exercise their voices in an intelligent manner. Singers should be thinking constantly about
the kind of sound they are making and the kind of sensations they are feeling while they are singing.
Exercising the singing voice
There are several purposes for vocal exercises, including:
 1. Warming up the voice
 2. Extending the vocal range
 3. "Lining up" the voice horizontally and vertically
 4. Acquiring vocal techniques such as legato, staccato, control of dynamics, rapid
figurations, learning to comfortably sing wide intervals, and correcting vocal faults.
Extending the vocal range
An important goal of vocal development is to learn to sing to the natural limits of one's
vocal range without any obvious or distracting changes of quality or technique. Voice
instructors teach that a singer can only achieve this goal when all of the physical processes
involved in singing (such as laryngeal action, breath support, resonance adjustment, and
articulatory movement) are effectively working together. Most voice teachers believe that
the first step in coordinating these processes is to establish good vocal habits in the
most comfortable tessitura of the voice before slowly expanding the range beyond that.
There are three factors which significantly affect the ability to sing higher or lower:
1. The Energy Factor: In this usage the word energy has several connotations. It refers to
the total response of the body to the making of sound. It refers to a dynamic relationship
between the breathing-in muscles and the breathing-out muscles known as the breath
support mechanism. It also refers to the amount of breath pressure delivered to the vocal
folds and their resistance to that pressure, and it refers to the dynamic level of the sound.
2. The Space Factor: Space refers to the amount of space created by the moving of the
mouth and the position of the palate and larynx. Generally speaking, a singer's mouth
should be opened wider the higher they sing. The internal space or position of the soft
palate and larynx can be widened by the relaxing of the throat. Voice teachers often
describe this as feeling like the "beginning of a yawn".
3. The Depth Factor: In this usage the word depth has two connotations. It refers to the
actual physical sensations of depth in the body and vocal mechanism and it refers to mental
concepts of depth as related to tone quality.
McKinney says, "These three factors can be expressed in three basic rules: (1) As you sing
higher, you must use more energy; as you sing lower, you must use less. (2) As you sing
higher, you must use more space; as you sing lower, you must use less. (3) As you sing
higher, you must use more depth; as you sing lower, you must use less."
General music studies
Some voice teachers will spend time working with their students on general music
knowledge and skills, particularly music theory, music history, and musical styles and
practices as they relate to the vocal literature being studied. If required, they may also spend
time helping their students become better sight readers, often adopting solfege, which
assigns certain syllables to the notes of the scale.
Performance skills and practices
Since singing is a performing art, voice teachers spend some of their time preparing their
students for performance. This includes teaching their students etiquette of behavior on
stage such as bowing, addressing problems like stage fright or nervous tics, and the use of
equipment such as microphones. Some students may also be preparing for careers in the
fields of opera or musical theater where acting skills are required. Many voice instructors will
spend time on acting techniques and audience communication with students in these fields
of interest. Students of opera also spend a great deal of time with their voice teachers
learning foreign language pronunciations.
Voice projection
Voice projection is the strength of speaking or singing whereby the voice is used loudly
and clearly. It is a technique which can be employed to demand respect and attention, such
as when a teacher is talking to the class, or simply to be heard clearly, as an actor in
a theatre.
Breath technique is essential for proper voice projection. Whereas in normal talking one
may use air from the top of the lungs, a properly projected voice uses air flowing
from the expansion of the diaphragm. In good vocal technique, well-balanced respiration is
especially important to maintaining vocal projection. The goal is to isolate and relax the
muscles controlling the vocal folds, so that they are unimpaired by tension. The
external intercostal muscles are used only to enlarge the chest cavity, whilst the
counterplay between the diaphragm and abdominal muscles is trained to control airflow.
Stance is also important, and it is recommended to stand up straight with your feet shoulder
width apart and your upstage foot (right foot if right-handed, etc.) slightly forward. This
improves your balance and your breathing.
In singing, voice projection is often equated with resonance, the concentrated pressure
through which one produces a focused sound. True resonance will produce the greatest
amount of projection available to a voice by utilizing all the key resonators found in the
vocal cavity. As the sound being produced and these resonators find the same overtones,
the sound will begin to spin as it reaches the ideal singer's formant at about 2800 Hz. The
size, shape, and hardness of the resonators all factor into the production of these overtones
and ultimately determine the projective capacities of the voice. 
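The overtones mentioned above are whole-number multiples of the sung pitch. As a small sketch, assuming a perfectly harmonic voice source and taking the 2800 Hz figure from the text as the target, the hypothetical helper below finds which harmonic of a given fundamental lies closest to that region.

```python
def nearest_harmonic(f0_hz, target_hz=2800.0):
    """Return (harmonic number, frequency in Hz) of the harmonic closest to target_hz.

    Assumes an idealized harmonic series: harmonic n sits at n * f0_hz.
    """
    n = max(1, round(target_hz / f0_hz))
    return n, n * f0_hz

# A3 (220 Hz when A4 is tuned to 440 Hz) and A4 itself, as example pitches.
for f0 in (220.0, 440.0):
    n, freq = nearest_harmonic(f0)
    print(f"f0 = {f0:.0f} Hz: harmonic {n} at {freq:.0f} Hz")
```

The higher the sung pitch, the more widely spaced the harmonics become, so fewer of them fall near any one resonance region.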
Voice type
 Female voices: Soprano, Mezzo-soprano, Contralto
 Male voices: Countertenor, Tenor, Baritone, Bass
A voice type is a particular kind of human singing voice perceived as having certain
identifying qualities or characteristics. Voice classification is the process by which human
voices are evaluated and are thereby designated into voice types. These qualities include but
are not limited to: vocal range, vocal weight, vocal tessitura, vocal timbre, and vocal
transition points such as breaks and lifts within the voice. Other considerations are physical
characteristics, speech level, scientific testing, and vocal registration. The science behind
voice classification developed within European classical music and is not generally applicable
to other forms of singing. Voice classification is often used within opera to associate
possible roles with potential voices. There are currently several different systems in use
including: the German Fach system and the choral music system among many others. No
system is universally applied or accepted. This article focuses on
voice classification within classical music. For other contemporary styles of singing
see: Voice classification in non-classical music.
Voice classification is a tool for singers, composers, venues, and listeners to categorize vocal
properties, and to associate possible roles with potential voices. There have been times
when voice classification systems have been used too rigidly, i.e. a house assigning a singer
to a specific type, and only casting him or her in roles they consider belonging to this category.
A singer will ultimately choose a repertoire that suits their instrument. Some singers such
as Enrico Caruso, Rosa Ponselle, Joan Sutherland, Maria Callas, Ewa Podles, or Plácido
Domingo have voices which allow them to sing roles from a wide variety of types; some
singers such as Shirley Verrett or Grace Bumbry change type, and even voice part, over their
careers; and some singers such as Leonie Rysanek have voices which lower with age,
causing them to cycle through types over their careers. Some roles as well are hard to
classify, having very unusual vocal requirements; Mozart wrote many of his roles for specific
singers who often had remarkable voices, and some of Verdi’s early works make extreme
demands on his singers.
A note on vocal range vs. tessitura: Choral singers are classified into voice parts based
on range; solo singers are classified into voice types based in part on tessitura – where the
voice feels most comfortable for the majority of the time.
(For more information on roles and singers, see the individual voice type pages.)
Number of voice types
There is a plethora of different voice types used by vocal pedagogists today in a variety of
voice classification systems. Most of these types, however, are sub-types that fall under
seven different major voice categories that are for the most part acknowledged across all of
the major voice classification systems. Women are typically divided into three
groups: soprano, mezzo-soprano, and contralto. Men are usually divided into four
groups: countertenor, tenor, baritone, and bass. When considering the pre-pubescent male
voice an eighth term, treble, can be applied. Within each of these major categories there
are several sub-categories that identify specific vocal qualities like coloratura facility
and vocal weight to differentiate between voices.
Female voices
The range specifications given below are based on the American scientific pitch notation.
Soprano range: The soprano is the highest female voice. The typical soprano voice lies
between middle C (C4) and "high C" (C6). The low extreme for sopranos is roughly B3 or A3
(just below middle C). Most soprano roles do not extend above "high C" although there are
several standard soprano roles that call for D6 or D-flat6. At the highest extreme,
some coloratura soprano roles may reach from F6 to A6 (the F to A above "high C").
Soprano tessitura: The tessitura of the soprano voice lies higher than all the other female
voices. In particular, the coloratura soprano has the highest tessitura of all the soprano sub-types.
Soprano sub-types: As with all voice categories, sopranos are often divided into different
sub-categories based on range, vocal color or timbre, the weight of voice, and dexterity of
the voice. These sub-categories include: Coloratura soprano, Soubrette, Lyric
soprano, Spinto, and Dramatic soprano.
Intermediate voice types
Two types of soprano especially dear to the French are the Dugazon and the Falcon, which
are intermediate voice types between the soprano and the mezzo soprano: a Dugazon is a
darker-colored soubrette, a Falcon a darker-colored soprano drammatico.
Mezzo-soprano
The mezzo-soprano is the middle-range voice type for females and is the most common
female voice.
Mezzo-soprano range: The mezzo-soprano voice lies between the soprano voice and
contralto voice, overlapping both of them. The typical range of this voice is from A3
(the A below middle C) to A5 (the A two octaves above A3). In the lower and upper
extremes, some mezzo-sopranos may extend down to the G below middle C (G3) and as
high as "high C" (C6).
Mezzo-soprano tessitura: Although this voice overlaps both
the contralto and soprano voices, the tessitura of the mezzo-soprano is lower than that of
the soprano and higher than that of the contralto.
Mezzo-soprano sub-types: Mezzo-sopranos are often broken down into three
categories: Lyric mezzo-soprano, Coloratura mezzo-soprano and Dramatic mezzo-soprano.
Contralto and alto are not the same term. Technically, "alto" is not a voice type but a
designated vocal line in choral music based on vocal range. The range of the alto part in
choral music is usually more similar to that of a mezzo-soprano than a contralto. However,
in many compositions the alto line is split into two parts. The lower part, Alto 2, is usually
more suitable to a contralto voice than a mezzo-soprano voice.
Contralto range: The contralto voice is the lowest female voice. A true operatic contralto is
extremely rare, so much so that often roles intended for contraltos are performed by
mezzo-sopranos as this voice type is difficult to find. The typical contralto range lies
between the F below middle C (F3) and the second F (F5) above middle C. In the lower and
upper extremes, some contralto voices can sing from the E below middle C (E3) to the
second B-flat above (B-flat5), which is only one whole step short of the "Soprano C".
Contralto tessitura: The contralto voice has the lowest tessitura of the female voices. In
current operatic practice, female singers with very low vocal tessituras are often included
Contralto sub-types: Contraltos are often broken down into two categories: Lyric
contralto and Dramatic contralto.
Male voices
The term countertenor refers to the highest male voice. Many countertenor singers perform
roles originally written for castrati in baroque operas. Except for a few very rare voices
(such as the American male soprano Michael Maniaci, or singers with a disorder such
as Kallmann syndrome), singers called countertenors generally sing in the falsetto register,
sometimes using their modal register for the lowest notes. Historically, there is much
evidence that "countertenor", in England at least, also designated a very high tenor voice,
the equivalent of the French haute-contre, and something similar to the "leggiero tenor"
or tenor altino. It should be remembered that, until about 1830, all male voices used some
falsetto-type voice production in their upper range.
Countertenor ranges (approximate)
Countertenor: from about G3 to E5 or F5
Sopranist: extends the upper range, usually only to C6, but in some cases as high as E6 or F6
Haute-contre: from about D3 or E3 to about D5
Countertenor sub-types: There are several sub-types of countertenors
including Sopranist or male soprano, Haute-contre, and modern castrato.
Tenor range: The tenor is the highest male voice within the modal register. The typical
tenor voice lies between the C one octave below middle C (C3) and the C one octave above
"Middle C" (C5). The low extreme for tenors is roughly B-flat 2 (the second B-flat below
middle C). At the highest extreme, some tenors can sing up to the second F above "Middle
C" (F5).
Tenor tessitura: The tessitura of the tenor voice lies above the baritone voice and
below the countertenor voice. The Leggiero tenor has the highest tessitura of all the tenor
sub-types.
Tenor sub-types: Tenors are often divided into different sub-categories based on range,
vocal color or timbre, the weight of the voice, and dexterity of the voice. These sub-
categories include: Leggiero tenor, Lyric tenor, Spinto tenor, Dramatic tenor,
and Heldentenor.
The baritone is the most common type of male voice.
Baritone range: The vocal range of the baritone lies between the bass and tenor ranges,
overlapping both of them. The typical baritone range is from the second F below middle C
(F2) to the F above middle C (F4), which is exactly two octaves. In the lower and upper
extremes, a baritone's range can be extended at either end.
Baritone tessitura: Although this voice overlaps both the tenor and bass voices,
the tessitura of the baritone is lower than that of the tenor and higher than that of the bass.
Baritone sub-types: Baritones are often divided into different sub-categories based on
range, vocal color or timbre, the weight of the voice, and dexterity of the voice. These sub-
categories include: Lyric baritone, Bel Canto (coloratura) baritone, kavalierbariton, Dramatic
baritone, Verdi baritone, baryton-noble, and Bariton/Baryton-Martin.
Bass range: The bass is the lowest male voice. The typical bass range lies between the
second E below "middle C" (E2) and the E above middle C (E4). In the lower and upper
extremes of the bass voice, some basses can sing from the C two octaves below middle C
(C2) to the G above middle C (G4).
Bass tessitura: The bass voice has the lowest tessitura of all the voices.
Bass sub-types: Basses are often divided into different sub-categories based on range, vocal
color or timbre, the weight of the voice, and dexterity of the voice. These sub-categories
include: Basso Profondo, Basso Buffo / Bel Canto Bass, Basso Cantante, Dramatic
Bass, and Bass-baritone.
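The typical ranges quoted above can be compared mechanically. The following Python sketch is an illustration only and is not part of any classification system: the function names, the `Bb` spelling for flats, and the use of MIDI note numbers (with middle C, C4, taken as 60) are assumptions introduced here. The countertenor is left out because the text gives only approximate figures for it, and the extremes are omitted.

```python
import re

# Semitone offsets of the natural notes within an octave (C = 0).
_SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_midi(name):
    """Convert a scientific-pitch name such as 'C4' or 'Bb2' to a MIDI number."""
    match = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", name)
    if match is None:
        raise ValueError(f"not a recognised note name: {name!r}")
    letter, accidental, octave = match.groups()
    offset = {"": 0, "#": 1, "b": -1}[accidental]
    # Assumed convention: middle C (C4) is MIDI note 60.
    return 12 * (int(octave) + 1) + _SEMITONES[letter] + offset


# Typical ranges as quoted in the text above.
TYPICAL_RANGES = {
    "soprano": ("C4", "C6"),
    "mezzo-soprano": ("A3", "A5"),
    "contralto": ("F3", "F5"),
    "tenor": ("C3", "C5"),
    "baritone": ("F2", "F4"),
    "bass": ("E2", "E4"),
}


def span_in_semitones(low, high):
    """Width of a range in semitones (12 semitones = one octave)."""
    return note_to_midi(high) - note_to_midi(low)


def types_covering(note):
    """Voice types whose typical range includes the given note."""
    pitch = note_to_midi(note)
    return [
        voice
        for voice, (low, high) in TYPICAL_RANGES.items()
        if note_to_midi(low) <= pitch <= note_to_midi(high)
    ]
```

For instance, `span_in_semitones("F2", "F4")` gives 24, which agrees with the remark that the typical baritone range is exactly two octaves. `types_covering("C4")` lists all six types, which illustrates why range alone is not enough to classify a solo voice and why tessitura and timbre are also considered.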
Children's voices
The voice from childhood to adulthood
The human voice is in a constant state of change and development just as the whole body is
in a state of constant change. A human voice will alter as a person gets older moving from
immaturity to maturity to a peak period of prime singing and then ultimately into a declining
period. The vocal range and timbre of children's voices do not have the variety that
adults' voices have. Both boys and girls prior to puberty have an equivalent vocal range and
timbre. The reason for this is that both groups have a similar laryngeal size and height and
a similar vocal cord structure. With the onset of puberty, both men's and women's voices alter
as the vocal ligaments become more defined and the laryngeal cartilages harden.
The laryngeal structure of both voices changes, but more so in men. The height of the
male larynx becomes much greater than in women. The size and development of adult lungs
also changes what the voice is physically capable of doing. From the onset of puberty to
approximately age 22, the human voice is in an in-between phase where it is not quite a
child's voice nor an adult one yet. This is not to suggest that the voice stops changing at
that age. Different singers will reach adult development earlier or later than others, and as
stated above there are continual changes throughout adulthood as well.
The term treble can refer to either a young female or young male singer with an
unchanged voice in the soprano range. Initially, the term was associated with boy
sopranos but as the inclusion of girls into children's choirs became acceptable in the
twentieth century the term has expanded to refer to all pre-pubescent voices. The lumping
of children's voices into one category is also practical as both boys and girls share a similar
range and timbre.
Treble range: Most trebles have an approximate range from the A below "middle C" (A3) to
the F one and a half octaves above "middle C" (F5). Some trebles, however, can extend
their voices higher in the modal register to "high C" (C6). This ability may be comparatively
rare, but the Anglican church repertory, which many trained trebles sing, frequently
demands G5 and even A5. Many trebles are also able to reach higher notes by use of
the whistle register but this practice is rarely called for in performance.
Classifying singers
Voice classification is important for vocal pedagogists and singers as a guiding tool for the
development of the voice. Misclassification can damage the vocal cords, shorten a singing
career and lead to the loss of both vocal beauty and free vocal production. Some of these
dangers are not immediate ones; the human voice is quite resilient, especially in early
adulthood, and the damage may not make its appearance for months or even years.
Unfortunately, this lack of apparent immediate harm can cause singers to develop bad
habits that will over time cause irreparable damage to the voice. Singing outside the natural
vocal range imposes a serious strain upon the voice. Clinical evidence indicates that singing
at a pitch level that is either too high or too low creates vocal pathology. Noted vocal
pedagogist Margaret Greene says,
"The need for choosing the correct natural range of the voice is of great importance in
singing since the outer ends of the singing range need very careful production and should
not be overworked, even in trained voices."
Singing at either extreme of the range may be damaging, but the possibility of damage
seems to be much more prevalent in too high a classification. A number of medical
authorities have indicated that singing at too high a pitch level may contribute to certain
vocal disorders. Medical evidence indicates that singing at too high a pitch level may lead
to the development of vocal nodules. Increasing tension on the vocal cords is one of the
means of raising pitch. Singing above an individual's best tessitura keeps the vocal cords
under a great deal of unnecessary tension for long periods of time, and the possibility of
vocal abuse is greatly increased. Singing at too low a pitch level is not as likely to be
damaging unless a singer tries to force the voice down.
In general, vocal pedagogists consider four main qualities of a human voice when attempting
to classify it: vocal range, tessitura, timbre, and vocal transition points. However, teachers
may also consider physical characteristics, speech level, scientific testing and other factors.
Dangers of quick identification
Many vocal pedagogists warn of the dangers of quick identification. Premature concern with
classification can result in misclassification, with all its attendant dangers. William
"I never feel any urgency about classifying a beginning student. So many premature
diagnoses have been proved wrong, and it can be harmful to the student and embarrassing
to the teacher to keep striving for an ill-chosen goal. It is best to begin in the middle part of
the voice and work upward and downward until the voice classifies itself."
Most vocal pedagogists believe that it is essential to establish good vocal habits within a
limited and comfortable range before attempting to classify the voice. When techniques of
posture, breathing, phonation, resonation, and articulation have become established in this
comfortable area, the true quality of the voice will emerge and the upper and lower limits of
the range can be explored safely. Only then can a tentative classification be arrived at, and
it may be adjusted as the voice continues to develop. Many vocal pedagogists suggest that
teachers begin by assuming that a voice is of a medium classification until it proves
otherwise. The reason for this is that the majority of individuals possess medium voices and
therefore this approach is less likely to misclassify or damage the voice.
Choral music classification
Unlike other classification systems, choral music divides voices solely on the basis of vocal
range. Choral music most commonly divides vocal parts into high and low voices within each
sex (SATB). As a result, the typical choral situation affords many opportunities for
misclassification to occur. Since most people have medium voices, they must be assigned to
a part that is either too high or too low for them; the mezzo-soprano must sing soprano or
alto and the baritone must sing tenor or bass. Either option can present problems for the
singer, but for most singers there are fewer dangers in singing too low than in singing too high.
Speech synthesis - Synthesizer technologies

There are two main technologies used for generating synthetic speech waveforms: concatenative synthesis and
formant synthesis.

Speech synthesis - Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech.
Generally, concatenative synthesis gives the most natural sounding synthesized speech. However, natural variation
in speech and automated techniques for segmenting the waveforms sometimes result in audible glitches in the
output, detracting from the naturalness
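As a rough illustration of "stringing together" recorded segments, the Python sketch below joins lists of samples end to end. The function name `concatenate`, the `overlap` parameter, and the linear cross-fade at each join are assumptions made for this example; the cross-fade is one simple way to soften the kind of glitch mentioned above and is not a technique described in the text.

```python
def concatenate(segments, overlap=0):
    """Join waveform segments (lists of float samples) end to end.

    With overlap > 0, the last `overlap` samples of the output so far are
    linearly cross-faded with the first `overlap` samples of the next
    segment. The cross-fade is an illustrative choice, not a prescribed one.
    """
    output = []
    for segment in segments:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(output), len(segment))
        if n > 0:
            tail = output[-n:]
            head = segment[:n]
            for i in range(n):
                weight = (i + 1) / (n + 1)  # rises from near 0 towards 1
                tail[i] = (1 - weight) * tail[i] + weight * head[i]
            output[-n:] = tail
        output.extend(segment[n:])
    return output
```

With `overlap=0` the segments are simply butted together, so any mismatch in level at the boundary appears as an abrupt step. A small overlap replaces that step with a short ramp, at the cost of shortening the output by the overlapped samples.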

Use Of The Web By People With Disabilities


This document provides an introduction to use of the Web by people with disabilities. It
illustrates some of their requirements when using Web sites and Web-based applications,
and provides supporting information for the guidelines and technical work of the World Wide
Web Consortium's (W3C) Web Accessibility Initiative (WAI).

Table of Contents

1. Introduction
2. Scenarios of People with Disabilities Using the Web
3. Different Disabilities That Can Affect Web Accessibility
4. Assistive Technologies and Adaptive Strategies
5. Further Reading
6. Scenario References
7. General References
8. Acknowledgements

1. Introduction

The Web Accessibility Initiative (WAI) develops guidelines for accessibility of Web sites,
browsers, and authoring tools, in order to make it easier for people with disabilities to use
the Web. Given the Web's increasingly important role in society, access to the Web is vital
for people with disabilities. Many of the accessibility solutions described in WAI materials
also benefit Web users who do not have disabilities.

This document provides a general introduction to how people with different kinds of
disabilities use the Web. It provides background to help understand how people with
disabilities benefit from provisions described in the Web Content Accessibility Guidelines
1.0, Authoring Tool Accessibility Guidelines 1.0, and User Agent Accessibility Guidelines 1.0.
It is not a comprehensive or in-depth discussion of disabilities, nor of the assistive
technologies used by people with disabilities. Specifically, this document describes:
• scenarios of people with disabilities using accessibility features of Web sites and
Web-based applications;
• general requirements for Web access by people with physical, visual, hearing, and
cognitive or neurological disabilities;
• some types of assistive technologies and adaptive strategies used by some people
with disabilities when accessing the Web.

This document contains many internal hypertext links between the sections on scenarios,
disability requirements, assistive technologies, and scenario references. The scenario
references and general references sections also include links to external documents.

2. Scenarios of People with Disabilities Using the Web

The following scenarios show people with different kinds of disabilities using assistive
technologies and adaptive strategies to access the Web. In some cases the scenarios show
how the Web can make some tasks easier for people with disabilities.

Please note that the scenarios do not represent actual individuals, but rather individuals
engaging in activities that are possible using today's Web technologies and assistive
technologies. The reader should not assume that everyone with a similar disability to those
portrayed will use the same assistive technologies or have the same level of expertise in
using those technologies. In some cases, browsers, media players, or assistive technologies
with specific features supporting accessibility may not yet be available in an individual's
primary language. Disability terminology varies from one country to another, as do
educational and employment opportunities.

Each scenario contains links to additional information on the specific disability or disabilities
described in more detail in Section 3; to the assistive technology or adaptive strategy
described in Section 4; and to detailed curriculum examples or guideline checkpoints in the
Scenarios References in Section 6.

Following is a list of scenarios and accessibility solutions: 

• online shopper with color blindness (user control of style sheets)
• reporter with repetitive stress injury (keyboard equivalents for mouse-driven
commands; access-key)
• online student who is deaf (captioned audio portions of multimedia files)
• accountant with blindness (appropriate markup of tables, alternative text,
abbreviations, and acronyms; synchronization of visual, speech, and braille display)
• classroom student with dyslexia (use of supplemental graphics; freezing animated
graphics; multiple search options)
• retiree with aging-related conditions, managing personal finances (magnification;
stopping scrolling text; avoiding pop-up windows)
• supermarket assistant with cognitive disability (clear and simple language;
consistent design; consistent navigation options; multiple search options)
• teenager with deaf-blindness, seeking entertainment (user control of style sheets;
accessible multimedia; device-independent access; labelled frames; appropriate
table markup)

Online shopper with color blindness

Mr. Lee wants to buy some new clothes, appliances, and music. As he frequently does, he is
spending an evening shopping online. He has one of the most common visual disabilities for
men: color blindness, which in his case means an inability to distinguish between green and red.

He has difficulty reading the text on many Web sites. When he first started using the Web,
it seemed to him the text and images on a lot of sites used poor color contrast, since they
appeared to use similar shades of brown. He realized that many sites were using colors that
were indistinguishable to him because of his red/green color blindness. In some cases the
site instructions explained that discounted prices were indicated by red text, but all of the
text looked brown to him. In other cases, the required fields on forms were indicated by red
text, but again he could not tell which fields had red text.

Mr. Lee found that he preferred sites that used sufficient color contrast, and redundant
information for color. The sites did this by including names of the colors of clothing as well
as showing a sample of the color; and by placing an asterisk (*) in front of the required fields
in addition to indicating them by color.

After additional experimentation, Mr. Lee discovered that on most newer sites the colors
were controlled by style sheets and that he could turn these style sheets off with his
browser or override them with his own style sheets. But on sites that did not use style
sheets he couldn't override the colors. 

Eventually Mr. Lee bookmarked a series of online shopping sites where he could get reliable
information on product colors, and not have to guess at which items were discounted. 

Reporter with repetitive stress injury

Mr. Jones is a reporter who must submit his articles in HTML for publishing in an on-line
journal. Over his twenty-year career, he has developed repetitive stress injury (RSI) in his
hands and arms, and it has become painful for him to type. He uses a combination
of speech recognition and an alternative keyboard to prepare his articles, but he doesn't use
a mouse. It took him several months to become sufficiently accustomed to using speech
recognition to be comfortable working for many hours at a time. There are some things he
has not worked out yet, such as a sound card conflict that arises whenever he tries to use
speech recognition on Web sites that have streaming audio.

He has not been able to use the same Web authoring software as his colleagues, because
the application that his office chose for a standard is missing many of the keyboard
equivalents that he needs in place of mouse-driven commands. To activate commands that
do not have keyboard equivalents, he would have to use a mouse instead of speech
recognition or typing, and this would re-damage his hands at this time. He researched some
of the newer versions of authoring tools and selected one with full keyboard support. Within
a month, he discovered that several of his colleagues had switched to the new product as
well, after they found that the full keyboard support was easier on their own hands.

When browsing other Web sites to research some of his articles, Mr. Jones likes the access
key feature that is implemented on some Web pages. It enables him to shortcut a long list
of links that he would ordinarily have to tab through by voice, and instead go straight to the
link he wants.

Online student who is deaf

Ms. Martinez is taking several distance learning courses in physics. She is deaf. She had
little trouble with the curriculum until the university upgraded their on-line courseware to a
multimedia approach, using an extensive collection of audio lectures. For classroom-based
lectures the university provided interpreters; however, for Web-based instruction they
initially did not realize that accessibility was an issue, then said they had no idea how to
provide the material in accessible format. She was able to point out that the university was
clearly covered by a policy requiring accessibility of online instructional material, and then to
point to the Web Content Accessibility Guidelines 1.0 as a resource providing guidance on
how to make Web sites, including those with multimedia, accessible.

The university had the lectures transcribed and made this information available through
their Web site along with audio versions of the lectures. For an introductory multimedia
piece, the university used a SMIL-based multimedia format enabling synchronized
captioning of audio and description of video. The school's information managers quickly
found that it was much easier to comprehensively index the audio resources on the
accessible area of the Web site, once these resources were captioned with text.

The professor for the course also set up a chat area on the Web site where students could
exchange ideas about their coursework. Although she was the only deaf student in the class
and only one other student knew any sign language, she quickly found that the Web-based
chat format, and the opportunity to provide Web-based text comments on classmates' work,
ensured that she could keep up with class progress.

Accountant with blindness

Ms. Laitinen is an accountant at an insurance company that uses Web-based formats over a
corporate intranet. She is blind. She uses a screen reader to interpret what is displayed on
the screen and generate a combination of speech output and refreshable braille output. She
uses the speech output, combined with tabbing through the navigation links on a page, for
rapid scanning of a document, and has become accustomed to listening to speech output at
a speed that her co-workers cannot understand at all. She uses refreshable braille output to
check the exact wording of text, since braille enables her to read the language on a page
more precisely.

Much of the information on the Web documents used at her company is in tables, which can
sometimes be difficult for non-visual users to read. However, since the tables on this
company's documents are marked up clearly with column and row headers which her screen
reader can access, she easily orients herself to the information in the tables. Her screen
reader reads her the alternative text for any images on the site. Since the insurance codes
she must frequently reference include a number of abbreviations and acronyms, she finds
that expanding abbreviations and acronyms the first time they appear on a page allows
her to better catch the meaning of the short versions of these terms.
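Two of the features this scenario relies on, alternative text for images and header cells in tables, can be checked automatically. The sketch below uses Python's standard `html.parser` module; the class name `AccessibilityAudit`, the `audit` helper, and the wording of the messages are hypothetical, and the two checks are a simplification rather than a full test of any guideline checkpoint.

```python
from html.parser import HTMLParser


class AccessibilityAudit(HTMLParser):
    """Report <img> elements without alt and <table> elements without <th>."""

    def __init__(self):
        super().__init__()
        self.problems = []
        # One flag per currently open table: has a <th> been seen inside it?
        self._table_stack = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            source = attributes.get("src") or "(no src)"
            self.problems.append("img without alt text: " + source)
        elif tag == "table":
            self._table_stack.append(False)
        elif tag == "th" and self._table_stack:
            self._table_stack[-1] = True

    def handle_endtag(self, tag):
        if tag == "table" and self._table_stack:
            has_header = self._table_stack.pop()
            if not has_header:
                self.problems.append("table without header cells")


def audit(html):
    """Return a list of problem descriptions for the given HTML text."""
    parser = AccessibilityAudit()
    parser.feed(html)
    parser.close()
    return parser.problems
```

An image with an empty `alt=""` attribute is accepted here, and a table used purely for layout would be flagged, so the output is a starting point for manual review rather than a verdict.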

As one of the more senior members of the accounting staff, Ms. Laitinen must frequently
help newer employees with their questions. She has recently upgraded to a browser
that allows better synchronization of the screen display with audio and braille rendering of
that information. This enables her to better help her colleagues, since the screen shows her
colleagues the same part of the document that she is reading with speech or braille output.

Classroom student with dyslexia

Ms. Olsen attends middle school, and particularly likes her literature class. She has attention
deficit disorder with dyslexia, and the combination leads to substantial difficulty reading.
However, with recent accommodations to the curriculum she has become enthusiastic about
this class.

Her school recently started to use more online curricula to supplement class textbooks. She
was initially worried about reading load, since she reads slowly. But recently she tried
text-to-speech software, and found that she was able to read along visually with the text much
more easily when she could hear certain sections of it read to her with the speech synthesis,
instead of struggling over every word.

Her class's recent area of focus is Hans Christian Andersen's writings, and she has to do
some research about the author. When she goes onto the Web, she finds that some sites
are much easier for her to use than others. Some of the pages have a lot of graphics, and
those help her focus in quickly on sections she wants to read. In some cases, though, where
the graphics are animated, it is very hard for her to focus, and so it helps to be able
to freeze the animated graphics.

One of the most important things for her has been the level of accessibility of the Web-
based online library catalogues and the general search functions on the Web. Sometimes
the search options are confusing for her. Her teacher has taught a number of different
search strategies, and she finds that some sites provide options for a variety of searching
strategies and she can more easily select searching options that work well for her.

Retiree with several aging-related conditions, managing personal finances

Mr. Yunus uses the Web to manage some of his household services and finances. He has
some central-field vision loss, hand tremor, and a little short-term memory loss.

He uses a screen magnifier to help with his vision and his hand tremor; when the icons and
links on Web pages are bigger, it's easier for him to select them, and so he finds it easier to
use pages with style sheets. When he first started using some of the financial pages, he
found the scrolling stock tickers distracting, and they moved too fast for him to read. In
addition, sometimes the pages would update before he had finished reading them.
Therefore he tends to use Web sites that do not have a lot of movement in the text, and
that do not auto-refresh. He also tended to "get stuck" on some pages, finding that he could
not back up, on some sites where new browser windows would pop open without notifying him.

Mr. Yunus has gradually found some sites that work well for him, and developed a
customized profile at some banking, grocery, and clothing sites.

Supermarket assistant with cognitive disability

Mr. Sands has put groceries in bags for customers for the past year at a supermarket. He
has Down syndrome, and has difficulty with abstract concepts, reading, and doing
mathematical calculations. He usually buys his own groceries at this supermarket, but
sometimes finds that there are so many product choices that he becomes confused, and he
finds it difficult to keep track of how much he is spending. He has difficulty re-learning
where his favorite products are each time the supermarket changes the layout of its
Recently, he visited an online grocery service from his computer at home. He explored the
site the first few times with a friend. He found that he could use the Web site without much
difficulty -- it had a lot of pictures, which were helpful in navigating around the site, and in
recognizing his favorite brands.

His friend showed him different search options that were available on the site, making it
easier for him to find items. He can search by brand name or by pictures, but he mostly
uses the option that lets him select from a list of products that he has ordered in the past.
Once he decides what he wants to buy, he selects the item and puts it into his virtual
shopping basket. The Web site gives him an updated total each time he adds an item,
helping him make sure that he does not overspend his budget. 

The marketing department of the online grocery wanted their Web site to have a high
degree of usability in order to be competitive with other online stores. They used consistent
design and consistent navigation options so that their customers could learn and remember
their way around the Web site. They also used the clearest and simplest language
appropriate for the site's content so that their customers could quickly understand the

While these features made the site more usable for all of the online-grocery's customers,
they made it possible for Mr. Sands to use the site. Mr. Sands now shops on the online
grocery site a few times a month, and just buys a few fresh items each day at the
supermarket where he works.

Teenager with deaf-blindness, seeking entertainment

Ms. Kaseem uses the Web to find new restaurants to go to with friends and classmates. She
has low vision and is deaf. She uses a screen magnifier to enlarge the text on Web sites to a
font size that she can read. When screen magnification is not sufficient, she also uses
a screen reader to drive a refreshable braille display, which she reads slowly.

At home, Ms. Kaseem browses local Web sites for new and different restaurants. She uses a
personal style sheet with her browser, which makes all Web pages display according to her
preferences. Her preferences include having background patterns turned off so that there is
enough contrast for her when she uses screen magnification. This is especially helpful when
she reads on-line sample menus of appealing restaurants.

A multimedia virtual tour of local entertainment options was recently added to the Web site
of the city in which Ms. Kaseem lives. The tour is captioned and described -- including text
subtitles for the audio, and descriptions of the video -- which allows her to access it using a
combination of screen magnification and braille. The interface used for the virtual tour
is accessible no matter what kind of assistive technology she is using -- screen
magnification, her screen reader with refreshable braille, or her portable braille device. Ms.
Kaseem forwards the Web site address to friends and asks if they are interested in going
with her to some of the restaurants featured on the tour.

She also checks the public transportation sites to find local train or bus stops near the
restaurants. The Web site for the bus schedule has frames without meaningful titles, and
tables without clear column or row headers, so she often gets lost on the site when trying to
find the information she needs. The Web site for the local train schedule, however, is easy
to use because the frames on that Web site have meaningful titles, and the schedules,
which are laid out as long tables, have clear row and column headers that she uses to orient
herself even when she has magnified the screen display.

Occasionally she also uses her portable braille device, with an infrared connection, to get
additional information and directions at a publicly-available information kiosk in a shopping
mall downtown; and a few times she has downloaded sample menus into her braille device
so that she has them in an accessible format once she is in the restaurant.

3. Different Disabilities that Can Affect Web Accessibility

This section describes general kinds of disabilities that can affect access to the Web. There
are as yet no universally accepted categorizations of disability, despite efforts towards that
goal. Commonly used disability terminology varies from country to country and between
different disability communities in the same country. There is a trend in many disability
communities to use functional terminology instead of medical classifications. This document
does not attempt to comprehensively address issues of terminology.

Abilities can vary from person to person, and over time, for different people with the same
type of disability. People can have combinations of different disabilities, and combinations of
varying levels of severity.

The term "disability" is used very generally in this document. Some people with conditions
described below would not consider themselves to have disabilities. They may, however,
have limitations of sensory, physical or cognitive functioning which can affect access to the
Web. These may include injury-related and aging-related conditions, and can be temporary
or chronic.

The number and severity of limitations tend to increase as people age, and may include
changes in vision, hearing, memory, or motor function. Aging-related conditions can be
accommodated on the Web by the same accessibility solutions used to accommodate people
with disabilities.

Sometimes different disabilities require similar accommodations. For instance, someone who
is blind and someone who cannot use his or her hands both require full keyboard
equivalents for mouse commands in browsers and authoring tools, since they both have
difficulty using a mouse but can use assistive technologies to activate commands supported
by a standard keyboard interface.

Many accessibility solutions described in this document contribute to "universal design" (also
called "design for all") by benefiting non-disabled users as well as people with disabilities.
For example, support for speech output not only benefits blind users, but also Web users
whose eyes are busy with other tasks; while captions for audio not only benefit deaf users,
but also increase the efficiency of indexing and searching for audio content on Web sites.

Each description of a general type of disability includes several brief examples of the kinds
of barriers someone with that disability might encounter on the Web. These lists of barriers
are illustrative and not intended to be comprehensive. Barrier examples listed here are
representative of accessibility issues that are relatively easy to address with existing
accessibility solutions, except where otherwise noted.

Following is a list of some disabilities and their relation to accessibility issues on the Web.
 visual disabilities
o blindness
o low vision
o color blindness
 hearing impairments
o deafness
o hard of hearing
 physical disabilities
o motor disabilities
 speech disabilities
o speech disabilities
 cognitive and neurological disabilities
o dyslexia and dyscalculia
o attention deficit disorder
o intellectual disabilities
o memory impairments
o mental health disabilities
o seizure disorders
 multiple disabilities
 aging-related conditions

Visual disabilities

Blindness (scenario -- "accountant")

Blindness involves a substantial, uncorrectable loss of vision in both eyes.

To access the Web, many individuals who are blind rely on screen readers -- software that
reads text on the screen (monitor) and outputs this information to a speech
synthesizer and/or refreshable braille display. Some people who are blind use text-based
browsers such as Lynx, or voice browsers, instead of a graphical user interface browser plus
screen reader. They may use rapid navigation strategies such as tabbing through the
headings or links on Web pages rather than reading every word on the page in sequence.

Examples of barriers that people with blindness may encounter on the Web can include:

 images that do not have alternative text
 complex images (e.g., graphs or charts) that are not adequately described
 video that is not described in text or audio
 tables that do not make sense when read serially (in a cell-by-cell or "linearized" mode)
 frames that do not have "NOFRAMES" alternatives, or that do not have meaningful titles
 forms that cannot be tabbed through in a logical sequence or that are poorly labelled
 browsers and authoring tools that lack keyboard support for all commands
 browsers and authoring tools that do not use standard applications programmer
interfaces for the operating system they are based in
 non-standard document formats that may be difficult for their screen reader to interpret
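
The first barrier in this list can be checked mechanically. The following is a minimal Python sketch, not a complete accessibility checker: it uses the standard library's html.parser module to list images that have no alt attribute at all. The class and function names are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the src of every <img> tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            # alt="" is a deliberate marker for decorative images, so only
            # a completely absent alt attribute is reported.
            if "alt" not in attr_map:
                self.missing.append(attr_map.get("src", "(no src)"))


def images_missing_alt(html):
    """Return the src values of images in `html` that lack an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing


sample = '<p><img src="logo.png" alt="Company logo"><img src="chart.png"></p>'
print(images_missing_alt(sample))
```

A check like this only finds missing attributes. Whether existing alternative text adequately describes a complex image still needs human review.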

Low vision (scenarios -- "teenager" and "retiree")

There are many types of low vision (also known as "partially sighted" in parts of Europe),
for instance poor acuity (vision that is not sharp), tunnel vision (seeing only the middle of
the visual field), central field loss (seeing only the edges of the visual field), and clouded
vision.

To use the Web, some people with low vision use extra-large monitors, and increase the
size of system fonts and images. Others use screen magnifiers or screen enhancement
software. Some individuals use specific combinations of text and background colors, such as
a 24-point bright yellow font on a black background, or choose certain typefaces that are
especially legible for their particular vision requirements. 

Barriers that people with low vision may encounter on the Web can include:

 Web pages with absolute font sizes that do not change (enlarge or reduce) easily
 Web pages that, because of inconsistent layout, are difficult to navigate when
enlarged, due to loss of surrounding context
 Web pages, or images on Web pages, that have poor contrast, and whose contrast
cannot be easily changed through user override of author style sheets
 text presented as images, which prevents wrapping to the next line when enlarged
 also many of the barriers listed for blindness, above, depending on the type and
extent of visual limitation

Color blindness (scenario -- "shopper")

Color blindness is a lack of sensitivity to certain colors. Common forms of color blindness
include difficulty distinguishing between red and green, or between yellow and blue.
Sometimes color blindness results in the inability to perceive any color.

To use the Web, some people with color blindness use their own style sheets to override the
font and background color choices of the author.

Barriers that people with color blindness may encounter on the Web can include:

 color that is used as a unique marker to emphasize text on a Web site
 text that inadequately contrasts with background color or patterns
 browsers that do not support user override of authors' style sheets
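
The text above does not say how to measure whether text "inadequately contrasts" with its background. One widely used measure, assumed here rather than taken from this document, is the contrast ratio defined in the W3C Web Content Accessibility Guidelines (WCAG) 2, which compares the relative luminance of two sRGB colors. Because it depends on luminance rather than hue, it remains meaningful for users who cannot distinguish certain colors. A minimal Python sketch, with hypothetical function names:

```python
def _linearize(channel):
    """Convert one 8-bit sRGB channel (0-255) to linear light."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color, from 0 (black) to 1 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Return a ratio between 1 (identical colors) and 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


# Bright yellow text on a black background.
print(round(contrast_ratio((255, 255, 0), (0, 0, 0)), 1))
```

Bright yellow on black comes out at roughly 19.6:1, well above the 4.5:1 minimum that WCAG 2 sets for normal-size text at level AA.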

Hearing Impairments

Deafness (scenario -- "online student")

Deafness involves a substantial uncorrectable impairment of hearing in both ears. Some
deaf individuals' first language is a sign language, and they may or may not read a written
language fluently, or speak clearly.

To use the Web, many people who are deaf rely on captions for audio content. They may
need to turn on the captions on an audio file as they browse a page; concentrate harder to
read what is on a page; or rely on supplemental images to highlight context.

Barriers that people who are deaf may encounter on the Web can include:

 lack of captions or transcripts of audio on the Web, including webcasts
 lack of content-related images in pages full of text, which can slow comprehension
for people whose first language may be a sign language instead of a written/spoken
language
 lack of clear and simple language
 requirements for voice input on Web sites

Hard of hearing

A person with a mild to moderate hearing impairment may be considered hard of hearing. 

To use the Web, people who are hard of hearing may rely on captions for audio content
and/or amplification of audio. They may need to toggle the captions on an audio file on or
off, or adjust the volume of an audio file.

Barriers encountered on the Web can include:

 lack of captions or transcripts for audio on the Web, including webcasts

Physical disabilities

Motor disabilities (scenario -- "reporter")

Motor disabilities can include weakness, limitations of muscular control (such as involuntary
movements, lack of coordination, or paralysis), limitations of sensation, joint problems, or
missing limbs. Some physical disabilities can include pain that impedes movement. These
conditions can affect the hands and arms as well as other parts of the body.

To use the Web, people with motor disabilities affecting the hands or arms may use a
specialized mouse; a keyboard with a layout of keys that matches their range of hand
motion; a pointing device such as a head-mouse, head-pointer or mouth-stick; voice-
recognition software; an eye-gaze system; or other assistive technologies to access and
interact with the information on Web sites. They may activate commands by typing single
keystrokes in sequence with a head pointer rather than typing simultaneous keystrokes
("chording") to activate commands. They may need more time when filling out interactive
forms on Web sites if they have to concentrate or maneuver carefully to select each

Barriers that people with motor disabilities affecting the hands or arms may encounter
on the Web can include:

 time-limited response options on Web pages
 browsers and authoring tools that do not support keyboard alternatives for mouse
commands
 forms that cannot be tabbed through in a logical order 
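
Whether a form can be "tabbed through in a logical order" depends on how the browser sequences focusable elements. The sketch below models the ordering rule used by HTML's tabindex attribute in simplified form: elements with a positive tabindex come first in ascending order, then elements with tabindex 0 or no tabindex in document order, and elements with a negative tabindex are skipped. The function name and field names are hypothetical, and every field is assumed to be a natively focusable form control.

```python
def tab_order(fields):
    """Return field names in the order the Tab key would visit them.

    fields: list of (name, tabindex) pairs in document order, where
    tabindex is an int or None when the attribute is absent.
    """
    positive = [
        (tabindex, position, name)
        for position, (name, tabindex) in enumerate(fields)
        if tabindex is not None and tabindex > 0
    ]
    natural = [name for name, tabindex in fields if tabindex is None or tabindex == 0]
    # Fields with a negative tabindex cannot be reached with the Tab key.
    return [name for _, _, name in sorted(positive)] + natural


form = [("first_name", None), ("last_name", None), ("email", 1), ("submit", None)]
print(tab_order(form))
```

In this example a single positive tabindex pulls the email field ahead of the name fields, so the tab order no longer matches the visual order of the form. Leaving tabindex unset and arranging the fields logically in the source avoids this.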

Speech disabilities

Speech disabilities

Speech disabilities can include difficulty producing speech that is recognizable by some voice
recognition software, either in terms of loudness or clarity.

To use parts of the Web that rely on voice recognition, someone with a speech disability
needs to be able to use an alternate input mode such as text entered via a keyboard.

Barriers that people with speech disabilities encounter on the Web can include:

 Web sites that require voice-based interaction and have no alternative input mode

Cognitive and neurological disabilities

Visual and Auditory Perception (scenario -- "classroom student")

Individuals with visual and auditory perceptual disabilities, including dyslexia (sometimes
called "learning disabilities" in Australia, Canada, the U.S., and some other countries) and
dyscalculia may have difficulty processing language or numbers. They may have difficulty
processing spoken language when heard ("auditory perceptual disabilities"). They may also
have difficulty with spatial orientation.

To use the Web, people with visual and auditory perceptual disabilities may rely on getting
information through several modalities at the same time. For instance, someone who has
difficulty reading may use a screen reader plus synthesized speech to facilitate
comprehension, while someone with an auditory processing disability may use captions to
help understand an audio track. 

Barriers that people with visual and auditory perceptual disabilities may encounter on the
Web can include:

 lack of alternative modalities for information on Web sites, for instance lack of
alternative text that can be converted to audio to supplement visuals, or the lack of
captions for audio

Attention deficit disorder (scenario -- "classroom student")

Individuals with attention deficit disorder may have difficulty focusing on information.

To use the Web, an individual with an attention deficit disorder may need to turn off
animations on a site in order to be able to focus on the site's content.

Barriers that people with attention deficit disorder may encounter on the Web can include:

 distracting visual or audio elements that cannot easily be turned off
 lack of clear and consistent organization of Web sites

Intellectual disabilities (scenario -- "supermarket assistant")

Individuals with impairments of intelligence (sometimes called "learning disabilities" in
Europe; or "developmental disabilities" or previously "mental retardation" in the United
States) may learn more slowly, or have difficulty understanding complex concepts. Down
Syndrome is one among many different causes of intellectual disabilities.

To use the Web, people with intellectual disabilities may take more time on a Web site, may
rely more on graphics to enhance understanding of a site, and may benefit from the level of
language on a site not being unnecessarily complex for the site's intended purpose.

Barriers can include:

 use of unnecessarily complex language on Web sites
 lack of graphics on Web sites
 lack of clear or consistent organization of Web sites

Memory impairments (scenario -- "retiree")

Individuals with memory impairments may have problems with short-term memory, missing
long-term memory, or may have some loss of ability to recall language.

To use the Web, people with memory impairments may rely on a consistent navigational
structure throughout the site.

Barriers can include:

 lack of clear or consistent organization of Web sites

Mental health disabilities

Individuals with mental health disabilities may have difficulty focusing on information on a
Web site, or difficulty with blurred vision or hand tremors due to side effects from
medications.

To use the Web, people with mental health disabilities may need to turn off distracting
visual or audio elements, or to use screen magnifiers.

Barriers can include:

 distracting visual or audio elements that cannot easily be turned off
 Web pages with absolute font sizes that do not enlarge easily

Seizure disorders

Some individuals with seizure disorders, including people with some types of epilepsy
(including photo-sensitive epilepsy), can have seizures triggered by visual flickering or
audio signals at a certain frequency.

To use the Web, people with seizure disorders may need to turn off animations, blinking
text, or certain frequencies of audio. Avoidance of these visual or audio frequencies in Web
sites helps prevent triggering of seizures.

Barriers can include:

 use of visual or audio frequencies that can trigger seizures
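
The text does not state which flicker frequencies are hazardous. As an illustration only, the sketch below assumes the limit used by the W3C Web Content Accessibility Guidelines (WCAG) 2, which ask that content not flash more than three times in any one-second period. It takes a list of flash timestamps in seconds and reports whether any one-second window exceeds that limit. The function name and parameters are hypothetical.

```python
def exceeds_flash_limit(flash_times, max_flashes=3, window=1.0):
    """Return True if more than max_flashes fall within any `window` seconds."""
    times = sorted(flash_times)
    start = 0
    for end, current in enumerate(times):
        # Drop flashes that are a full window or more older than the current one.
        while current - times[start] >= window:
            start += 1
        if end - start + 1 > max_flashes:
            return True
    return False


print(exceeds_flash_limit([0.0, 0.25, 0.5, 0.75]))  # four flashes within one second
print(exceeds_flash_limit([0.0, 0.5, 1.0, 1.5]))    # two flashes per second
```

This counts flashes only. It does not consider the size or brightness of the flashing area, which real photosensitivity analysis also takes into account.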

Multiple Disabilities (scenario -- "teenager")

Combinations of disabilities may reduce a user's flexibility in using accessibility information.

For instance, while someone who is blind can benefit from hearing an audio description of a
Web-based video, and someone who is deaf can benefit from seeing the captions
accompanying audio, someone who is both deaf and blind needs access to a text transcript
of the description of the audio and video, which they could access on a refreshable braille
display.

Similarly, someone who is deaf and has low vision might benefit from the captions on audio
files, but only if the captions could be enlarged and the color contrast adjusted.

Someone who cannot move his or her hands, and also cannot see the screen well, might
use a combination of speech input and speech output, and might therefore need to rely on
precise indicators of location and navigation options in a document.

Aging-Related Conditions (scenario -- "retiree")

Changes in people's functional ability due to aging can include changes in abilities or a
combination of abilities including vision, hearing, dexterity and memory. Barriers can
include any of the issues already mentioned above. Any one of these limitations can affect
an individual's ability to access Web content. Together, these changes can become more
complex to accommodate.

For example, someone with low vision may need screen magnification; however, when using
screen magnification the user loses surrounding contextual information, which adds to the
difficulty which a user with short-term memory loss might experience on a Web site.

4. Assistive Technologies and Adaptive Strategies

Assistive technologies are products used by people with disabilities to help accomplish tasks
that they cannot accomplish otherwise or could not do easily otherwise. When used with
computers, assistive technologies are sometimes referred to as adaptive software or
hardware.

Some assistive technologies are used together with graphical desktop browsers, text
browsers, voice browsers, multimedia players, or plug-ins. Some accessibility solutions are
built into the operating system, for instance the ability to change the system font size, or
configure the operating system so that multiple-keystroke commands can be entered with a
sequence of single keystrokes. 
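
The operating-system feature described above, in which a multiple-keystroke command is entered as a sequence of single keystrokes, can be modelled as a small latch. The Python sketch below is a simplified illustration, not the behavior of any particular operating system: a modifier pressed on its own is held until the next ordinary key arrives. The function name and the set of modifiers are assumptions.

```python
MODIFIERS = {"Ctrl", "Alt", "Shift"}


def sticky_keys(presses):
    """Turn a sequence of single key presses into chorded commands.

    A modifier pressed on its own is latched and applied to the next
    non-modifier key, after which all latched modifiers are released.
    """
    latched = []
    commands = []
    for key in presses:
        if key in MODIFIERS:
            if key not in latched:
                latched.append(key)
        else:
            commands.append("+".join(latched + [key]))
            latched = []
    return commands


print(sticky_keys(["Ctrl", "Alt", "Delete", "a", "Shift", "b"]))
```

A user typing with a head pointer or mouth-stick can press Ctrl, then Alt, then Delete one at a time, and the system treats the sequence as the single chord Ctrl+Alt+Delete.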

Adaptive strategies are techniques that people with disabilities use to assist in using
computers or other devices. For example, someone who cannot see a Web page may tab
through the links on a page as one strategy for helping skim the content.

Following is a list of the assistive technologies and adaptive strategies described below. This
is by no means a comprehensive list of all such technologies or strategies, but rather
explanations of examples highlighted in the scenarios above.

 alternative keyboards or switches
 braille and refreshable braille
 scanning software
 screen magnifiers
 screen readers
 speech recognition
 speech synthesis
 tabbing through structural elements
 text browsers
 visual notification
 voice browsers

Alternative keyboards or switches (scenario -- "reporter")

Alternative keyboards or switches are hardware or software devices, used by people with
physical disabilities, that provide an alternate way of creating keystrokes that appear to
come from the standard keyboard. Examples include keyboards with extra-small or extra-
large key spacing, keyguards that only allow pressing one key at a time, on-screen
keyboards, eyegaze keyboards, and sip-and-puff switches. Web-based applications that can
be operated entirely from the keyboard, with no mouse required, support a wide range of
alternative modes of input.

Braille and refreshable braille (scenarios -- "accountant" and "teenager")

Braille is a system using six to eight raised dots in various patterns to represent letters and
numbers that can be read by the fingertips. Braille systems vary greatly around the world.
Some "grades" of braille include additional codes beyond standard alpha-numeric characters
to represent common letter groupings (e.g., "th," "ble" in Grade II American English braille)
in order to make braille more compact. An 8-dot version of braille has been developed to
allow all ASCII characters to be represented. Refreshable or dynamic braille involves the use
of a mechanical display where dots (pins) can be raised and lowered dynamically to allow
any braille characters to be displayed. Refreshable braille displays can be incorporated into
portable braille devices with the capabilities of small computers, which can also be used as
interfaces to devices such as information kiosks.
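
Six dots allow 64 distinct patterns and eight dots allow 256, which is why an 8-dot cell can cover the whole ASCII range. One concrete way to see the dot numbering, not mentioned in the text above, is the Unicode Braille Patterns block, which starts at U+2800 and stores dot n in bit n-1 of the code point. The Python sketch below builds a cell from a list of raised dots. The function name is hypothetical, and mapping letters or ASCII characters to dots depends on the braille code in use, so it is not attempted here.

```python
def braille_cell(dots):
    """Return the Unicode braille character for a set of raised dots (1-8)."""
    code = 0x2800  # U+2800 is the blank braille pattern
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        code |= 1 << (dot - 1)  # dot n is stored in bit n-1
    return chr(code)


# Dot 1 alone is the letter "a" and dots 1 and 2 are "b" in standard braille.
print(braille_cell([1]), braille_cell([1, 2]), braille_cell([1, 4]))
```

A refreshable braille display does the physical equivalent: each bit set in the pattern corresponds to one pin raised in the cell.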

Scanning software

Scanning software is adaptive software used by individuals with some physical or cognitive
disabilities that highlights or announces selection choices (e.g., menu items, links, phrases)
one at a time. A user selects a desired item by hitting a switch when the desired item is
highlighted or announced.
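
The scanning cycle described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each choice is highlighted for a fixed dwell time, the cycle wraps around after the last choice, and the user hits the switch once. The function and parameter names are hypothetical.

```python
def scan_select(choices, dwell_seconds, switch_time):
    """Return the choice that is highlighted when the switch is hit.

    Each choice is highlighted for dwell_seconds in turn, and the cycle
    starts again from the first choice after the last one.
    """
    if not choices:
        raise ValueError("no choices to scan")
    if dwell_seconds <= 0:
        raise ValueError("dwell_seconds must be positive")
    step = int(switch_time // dwell_seconds)
    return choices[step % len(choices)]


menu = ["Home", "News", "Contact"]
print(scan_select(menu, dwell_seconds=2.0, switch_time=5.0))
```

The dwell time is the main design choice: a longer dwell gives the user more time to react but makes long menus slow to traverse, which is one reason short, well-organized lists of links help scanning users.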

Screen magnifiers (scenarios -- "teenager" and "retiree")

A screen magnifier is software used primarily by individuals with low vision that magnifies
a portion of the screen for easier viewing. At the same time that screen magnifiers make
presentations larger, they also reduce the area of the document that may be viewed,
removing surrounding context. Some screen magnifiers offer two views of the screen: one
magnified and one default size for navigation.
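
The loss of context can be quantified. Assuming a full-screen magnifier that zooms uniformly around a focus point (an assumption; products differ), the visible region shrinks by the zoom factor in each dimension, so the visible area shrinks by its square. A minimal Python sketch with hypothetical function names:

```python
def visible_region(screen_w, screen_h, zoom, focus_x, focus_y):
    """Return (left, top, width, height) of the area shown at a zoom level."""
    if zoom < 1:
        raise ValueError("zoom must be at least 1")
    width, height = screen_w / zoom, screen_h / zoom
    # Centre on the focus point, but keep the region inside the screen.
    left = min(max(focus_x - width / 2, 0), screen_w - width)
    top = min(max(focus_y - height / 2, 0), screen_h - height)
    return left, top, width, height


def visible_fraction(zoom):
    """Share of the full screen area that remains visible at a zoom level."""
    return 1 / (zoom * zoom)


print(visible_region(1920, 1080, 4, 960, 540))
print(visible_fraction(4))
```

At 4x magnification only one sixteenth of the screen is visible at a time, which is why consistent layout and clear headings matter so much to magnifier users.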

Screen readers (scenarios -- "accountant" and "teenager")

A screen reader is software, used by individuals who are blind or who have dyslexia, that interprets what is
displayed on a screen and directs it either to speech synthesis for audio output, or to
refreshable braille for tactile output. Some screen readers use the document tree (i.e., the
parsed document code) as their input. Older screen readers make use of the rendered
version of a document, so that document order or structure may be lost (e.g., when tables
are used for layout) and their output may be confusing.
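
To see why layout tables can produce confusing output, the Python sketch below reads a table cell by cell in source order. It is a simplified stand-in for linearized reading, not the algorithm of any actual screen reader, and it uses only the standard library's html.parser module. The class and function names are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableLinearizer(HTMLParser):
    """Collects table cell text in source order, one cell at a time."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def linearize(html):
    """Return the text of each table cell in the order it appears in the source."""
    parser = TableLinearizer()
    parser.feed(html)
    return [" ".join(cell.split()) for cell in parser.cells]


# Two side-by-side columns of text, laid out with a table.
layout = (
    "<table>"
    "<tr><td>Trains leave</td><td>Buses leave</td></tr>"
    "<tr><td>every hour.</td><td>every ten minutes.</td></tr>"
    "</table>"
)
print(" ".join(linearize(layout)))
```

Read in this order, the two columns are interleaved row by row, so sentences that look coherent on screen become scrambled when read serially.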

Speech recognition

Speech (or voice) recognition is used by people with some physical disabilities or temporary
injuries to hands and forearms as an input method in some voice browsers. Applications
that have full keyboard support can be used with speech recognition.

Speech synthesis (speech output) (scenario -- "accountant")

Speech synthesis or speech output can be generated by screen readers or voice browsers,
and involves production of digitized speech from text. People who are used to using speech
output sometimes listen to it at very rapid speeds.

Tabbing through structural elements (scenario -- "accountant")

Some accessibility solutions are adaptive strategies rather than specific assistive
technologies such as software or hardware. For instance, for people who cannot use a
mouse, one strategy for rapidly scanning through links, headers, list items, or other
structural items on a Web page is to use the tab key to go through the items in sequence.
People who are using screen readers -- whether because they are blind or dyslexic -- may
tab through items on a page, as well as people using voice recognition.
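
The effect of this strategy is to reduce a page to its structural skeleton. The Python sketch below approximates that view by listing headings and links in document order using the standard library's html.parser module. It is an illustration only: how a given browser or screen reader exposes these items is not covered here, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Lists headings and links in document order, as a skimmable outline."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # tag of the heading or link being read

    def handle_starttag(self, tag, attrs):
        is_link = tag == "a" and "href" in dict(attrs)
        if self._current is None and (tag in self.HEADINGS or is_link):
            self._current = tag
            self.items.append([tag, ""])

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.items[-1][1] += data


def outline(html):
    """Return (tag, text) pairs for each heading and link in `html`."""
    builder = OutlineBuilder()
    builder.feed(html)
    return [(tag, " ".join(text.split())) for tag, text in builder.items]


page = (
    "<h1>Train schedule</h1>"
    '<p>Choose a direction: <a href="/north">Northbound</a></p>'
    "<h2>Fares</h2>"
)
print(outline(page))
```

Pages whose headings and link texts are meaningful on their own produce a useful outline. Pages full of links labelled only "click here" do not.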

Text browsers

Text browsers such as Lynx are an alternative to graphical user interface browsers. They
can be used with screen readers for people who are blind. They are also used by many
people who have low bandwidth connections and do not want to wait for images to
download.

Visual notification

Visual notification is an alternative feature of some operating systems that allows deaf or
hard of hearing users to receive a visual alert of a warning or error message that might
otherwise be issued by sound.
