
one state to another. But the states of the machine cannot be directly observed. Instead, a finite number of observations can be made about the current state of the state machine. The observations are stochastically related to the actual states. There is an algorithm for deriving the probability that a given sequence of observations was generated by a given sequence of states. In an HMM-based system, each element (e.g., phoneme) has one model representing it, that is, one state machine with associated initial-state probabilities, transition probabilities, and observation probabilities. There is also an algorithm for deciding which of a set of models produced the speech being analyzed. Recently, neural networks have come to be used for speaker identification and speaker verification. Chapter 10 of [88] discusses other applications of neural nets in speech.
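The model-selection algorithm mentioned above is, in standard HMM practice, the forward algorithm: it computes the likelihood of an observation sequence under each model, and the recognizer picks the model with the highest score. A minimal sketch, with two toy "phoneme" models whose probabilities are invented purely for illustration:

```python
# Forward algorithm: likelihood P(observations | model) for a discrete HMM.
# pi[i]   = probability of starting in state i
# A[i][j] = probability of a transition from state i to state j
# B[i][k] = probability that state i emits observation symbol k
def forward_likelihood(pi, A, B, obs):
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]   # first symbol
    for o in obs[1:]:
        # propagate one step through the transition matrix, then
        # weight by the probability of emitting the observed symbol
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)                                  # marginalize final state

# Two toy models sharing a left-to-right topology but differing in emissions;
# the recognizer chooses whichever assigns the observations higher likelihood.
pi = [1.0, 0.0]
A  = [[0.7, 0.3], [0.0, 1.0]]
B1 = [[0.9, 0.1], [0.2, 0.8]]
B2 = [[0.1, 0.9], [0.8, 0.2]]
obs = [0, 0, 1]
scores = {"model 1": forward_likelihood(pi, A, B1, obs),
          "model 2": forward_likelihood(pi, A, B2, obs)}
```

The companion problem, finding the single most probable state sequence, is solved by the closely related Viterbi algorithm, which replaces the sum over predecessor states with a maximum.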
Obviously, speech recognition systems have an easier job if all speakers speak the same text. Isolated words are easier; connected speech is harder. Handling any arbitrary speaker from the general population is harder. Allowing an arbitrary vocabulary makes the task harder still. Currently, typical systems achieve an accuracy in the mid 90 percent range for dictionaries of several hundred words spoken by different speakers. For further information, see [55, 89].

4.8 DIGITAL AUDIO AND THE COMPUTER


It is now common to find audio capabilities in many computers, large and small. In this last section, we review the capabilities for audio and music at various levels of quality.

One fundamental capability is audio storage: 8-bit audio at an 8-kHz sample rate consumes 8 KB for 1 second of sound. CD-quality audio (stereo, 16-bit linear PCM, 44.1-kHz sample rate) consumes 176 KB per second, and stereo sound at the professional rate of 48 kHz eats almost 200 kbytes per second. Sound storage on disk thus requires large disks.
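These figures follow directly from the data rate of uncompressed linear PCM: channels times bytes per sample times sample rate. A quick check of the numbers above:

```python
# Uncompressed linear PCM data rate in bytes per second:
# channels x (bits per sample / 8) x sample rate.
def pcm_bytes_per_second(channels: int, bits: int, rate_hz: int) -> int:
    return channels * (bits // 8) * rate_hz

print(pcm_bytes_per_second(1, 8, 8000))    # telephone-quality mono: 8 000 B/s
print(pcm_bytes_per_second(2, 16, 44100))  # CD quality: 176 400 B/s
print(pcm_bytes_per_second(2, 16, 48000))  # professional rate: 192 000 B/s
```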
Sound playback and recording require a DAC and ADC, respectively. The 8-bit hardware on many computers is adequate for speech and basic algorithm testing, but not for professional music or recording. Plug-in boards with 16-bit DACs and ADCs are the solution if such hardware is not provided as part of the basic system. Even better audio quality can be had with external conversion units, using, for example, a SCSI connector.
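The gap between 8-bit and 16-bit conversion can be made concrete with the standard rule of thumb for the best-case signal-to-noise ratio of an n-bit linear PCM converter, roughly 6.02n + 1.76 dB (full-scale sine wave, uniform quantization noise):

```python
# Theoretical best-case SNR of an n-bit linear PCM converter,
# using the standard approximation SNR = 6.02*n + 1.76 dB.
def pcm_snr_db(bits: int) -> float:
    return 6.02 * bits + 1.76

print(pcm_snr_db(8))   # about 50 dB: adequate for speech
print(pcm_snr_db(16))  # about 98 dB: professional audio
```

Real converters fall somewhat short of these figures, but the roughly 48-dB difference explains why 8-bit hardware suffices for speech while music demands 16 bits.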
For digital input and output, a connector for AES/EBU transmission and/or an SPDIF connector is required. Often, the DACs and ADCs on an external DAT player with an appropriate interface can be used instead of dedicated hardware.
For real-time processing of sound, plug-in cards and external units are available with commercially available DSP chips. With a single chip, typically a stereo stream can be input, processed, and output. In such a scheme, the sound does not necessarily go to disk. Such systems can also generate three-dimensional audio, for example. There is a tendency nowadays to use RISC or CISC chips instead of DSP chips for dedicated processing. In some systems, a single RISC chip is fast enough to provide the generalized compute power as well as extra cycles for sound processing. Using two RISC chips in parallel, one for general computing and one for real-time applications, has the advantage that the software development is the same for both. (When a DSP chip is used, the software environment is usually radically different from that for the computer's CPU.) For music synthesis, there are also plug-in cards currently available implementing FM synthesis, sampling, and other synthesis techniques.

Editing sound requires a good graphics system; for editing transform data, graphics accelerators are often recommended, as the amount of data can be enormous. Sound can be edited on the screens of portable computers, but a large crisp color monitor is to be preferred.

Given all the power in and fanfare surrounding UNIX workstations, such as the Sun SPARC, one would expect them to support music easily. But UNIX is not a real-time operating system, and musicians complain when their music stops dead while the operating system services something else.

Finally, a remark about sound in personal computers and multimedia is in order. For many years, the basic capabilities implied in this chapter have been included in plug-in boards for the PC, such as the SoundBlaster. With the release of documents such as Multimedia PC Specification Version 1.0 (Microsoft), we can expect the computer industry to follow a path well known to digital audio and computer music specialists. The software protocols will become standardized, as will music and sound exchange formats. The quality of the sound coming into and out of the system will improve. The capabilities of the system will be expanded to include more and more sophisticated techniques.
4.9 CLOSING REMARKS

First, a comment is in order about the relative difficulty of implementing audio versus video/graphics in a multimedia system. Audio is often considered as easier than video or graphics, perhaps because the bandwidth is smaller. But audio imposes significant design constraints which must be handled by themselves. At the same time, Loeb [11] makes the interesting argument that audio provides a good platform for prototyping generalized applications with continuous data streams, precisely because the data rates in audio are lower.

Second, I would like to provide a gentle warning for those coming to audio from other domains. It is tempting to take some of the implications of the theory given above and to implement them in hardware, expecting a system to fall in place. In the music world, we have seen this happen time and time again, with almost predictable results. Consider the developers of the ill-fated Synthia synthesizer (ca. 1980) who implemented Fourier analysis and synthesis. After months of work in their garage, they came up with
a synthesizer that had a beautiful graphics user interface, one that has rarely been equaled since in an integrated graphics user interface on a synthesizer. But this was a synthesizer that sounded awful (I heard it myself). The developers had assumed that the theory would give them good orchestral sounds, and it didn't. Or consider the hardware card discussed in [90]. It became the hardware basis for the commercial synthesizers called GDS and Synergy [91] still used by Wendy Carlos [92]. But the software and musical sound development by far exceeded the cost and time needed for the hardware. Ultimately, the project failed. In working with musical systems, it is necessary to plan for more than just the hardware; it is necessary to take the physical characteristics of sound into account.
I close with a remark about the relative importance of audio in digital systems (see also [93]). Historically, starting perhaps with the invention of cuneiform, the act of committing data to a permanent medium required, for the most part, the use of the hands and eyes. The invention of the typewriter led to the QWERTY keyboard being identified with the permanent recording of data. As data processing machinery and computers were developed, it was natural that the QWERTY keyboard would be a standard feature of the hardware, just as a musical keyboard was a standard feature of early music synthesizers. The recent pen operating system loosens the connection between the QWERTY keyboard and the computer, but still requires use of the hand and the eye. With music input and output as well as speech and sound synthesis, text-to-speech, audio window systems, and speech recognition, one can conceive of a computer with an entirely audio interface. For the first time in several millennia, the act of recording data can be freed from the necessity of using the hand and the eye. Indeed, perhaps for the first time since cuneiform, the medium of communication reverts to (recorded) human speech, not just icons representing human communication. I do not advocate the development of a purely audio-based computer (would the audio user interface be an AUI?). Rather, I wish to point out that audio as a computer I/O channel has reached a level of development where it is strong enough to serve as a full-fledged part of generalized I/O. We know that human communication works at its fullest when the various perceptual modes are working together. (After all, the telephone does not match the tête-à-tête.) I feel that the most effective systems of tomorrow will include audio (music and speech) at least on equal footing with other modes of interaction.

4.10 ACKNOWLEDGMENTS

I appreciate the assistance of Thom Blum, Jeff Barish, Marina Bosi, Perry Cook, Robert Currie, Robert Duisberg, Robert Gross, John Buford, Shoshana Loeb, Mike Minnick, Ken Pohlmann, Tony Richards, Curtis Roads, Bill Schottstaedt, John Snell, and Julius O. Smith, III as I was preparing this chapter.
4.3 DIGITAL REPRESENTATIONS OF SOUND

coding bits to the bands with the lowest MNR. As long as bits are available, more subbands are encoded. This means that some subbands remain uncoded, in which case a zero is transmitted as the bit allocation. A scale factor is also transmitted for each band when the number of bits is not zero, since the bands are scaled before quantization. Finally, for nonzero subbands, 2 to 15 data bits per subband are transmitted. An advantage of this scheme, by the way, is that noise in one band is independent of noise in another band.

User-defined ancillary data can also be transmitted. Since the ancillary data consume some bits that would otherwise be available for audio itself, using ancillary data could possibly degrade the audio quality.

Decoding, shown in Figure 4.11, is the opposite of encoding. The decoder determines from the bitstream which subbands have nonzero data.
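The per-subband logic just described can be sketched in outline. The sketch below is an illustration of the idea only, not the ISO bitstream syntax: the uniform requantization step and the argument layout are simplifying assumptions made for this example.

```python
# Schematic per-subband decoding in the spirit of MPEG layers 1 and 2.
# A subband with a zero bit allocation carries no coded data; otherwise
# its quantized samples are requantized and weighted by the scale factor.
def decode_subbands(allocations, scale_factors, quantized):
    out = []
    for sb, nbits in enumerate(allocations):
        if nbits == 0:
            # uncoded subband: the decoder reconstructs silence
            out.append([0.0] * len(quantized[sb]))
            continue
        # illustrative uniform requantization step for an nbits quantizer
        step = 1.0 / (1 << (nbits - 1))
        out.append([q * step * scale_factors[sb] for q in quantized[sb]])
    return out
```

The reconstructed subband samples are then fed to the synthesis subband filter to produce output PCM samples.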

[Figure 4.11 The structure of the MPEG decoder for layers 1 and 2: input encoded bitstream → decoding of bit allocation → decoding of scale factors → synthesis subband filter → output PCM samples. (Reproduced with permission from ISO/IEC DIS 11172, Figure 3-A.1)]
