Beatboxing Phonology
by
August 2022
Acknowledgments
There are not enough pages to give everyone who made this dissertation a reality the thanks
they deserve.
The first thanks goes to Louis Goldstein for support and endless patience. His aptly
timed pearls of wisdom and nuggets of clarity have triggered more major shifts in my
thinking than I can count. And when I’ve finished reeling from a mental sea change (or even
when the waters are calm), little has been more comforting than his calm demeanor and
readiness to help me accept the new situations I find myself in. Louis, thank you for taking
the time to drag me toward some deeper understanding of language and life.
showing me how to reconceptualize complicated topics into simple problems and for
reinforcing the understanding that the superficial differences we see in the world are only
illusions. And I am grateful to Jason Zevin for letting me contribute to his lab meetings
despite not knowing what I was talking about and for offering me summer research funding
even though I’m pretty sure the project moved backward instead of forward because of me.
And thanks to my committee as a whole who together, though perhaps without realizing it,
did something I would never in a million years have predicted: they sparked my interest in
history—a subject which only a few years ago I cared nothing at all for but which I now find
indispensable. Thanks to all three of you for helping me make it this far.
I have been lucky to have the guidance of USC Linguistics department faculty (some
now moved on to other institutions) like Dani Byrd, Elsi Kaiser, Rachel Walker, Karen
Jesney, and Mary Byram Washburn. Substantial credit for any of my accomplishments at
USC goes to Guillermo Ruiz: he has always worked hard to help me, even when I seemed
determined to shoot myself in the foot; he can never be compensated enough. Many of the
insights in this dissertation can be traced back to conversations with my fellow beatboxing
scientists: Nimisha Patil and Timothy Greer at USC and Seunghun Lee, Masaki Fukuda, and
Kosei Kimura at International Christian University in Tokyo. I have also greatly benefited
from the camaraderie and insights of many of my fellow USC graduate students in
Linguistics including Caitlin Smith, Brian Hsu, Jessica Campbell, Jessica Johnson, Yijing Lu,
Ian Rigby, Jesse Storbeck, Luis Miguel Toquero Perez, Adam Woodnutt, Yifan Yang, Hayeun
Jang, Miran Oh, Tanner Sorensen, Maury Courtland, Binh Ngo, Samantha Gordon Danner,
Alfredo Garcia Pardo, Yoonjeong Lee, Ulrike Steindl, Mythili Menon, Christina Hagedorn,
research—that it’s possible to actually do things and not just talk about them. Outside of
academia, I’m indebted to Charlie for coaxing me out of my reclusive comfort zone in our
first few grad school years with invitations to his holiday parties and improv performances;
Los Angeles had been intimidating, but Charlie made it less so and opened the door to
almost a decade of confident exploration (though I’ve barely scratched the surface of LA).
group—especially Shri Narayanan and Asterios Toutios—for the research opportunities, for
giving me some chances to flex my coding skills a little, and for collecting the beatboxing
data used in this dissertation (not to mention for securing the NIH and NSF grants that
made this work possible in the first place). Special thanks to Adam Lammert for teaching me
just enough about image analysis to be dangerous during my first year of grad school; our
more recent conversations about music, language, and teaching have been a treat and I hope
The members of the LSA’s Faculty Learning Community (FLC) have shown me time
and time again the importance of community for thoughtful, justice-anchored teaching and
for staying sane during a global pandemic. Our meetings have been the only stable part of
my schedule for years now, and if they ever end I doubt I’ll know what to do with myself.
Thanks to Kazuko Hiramatsu, Christina Bjorndahl, Evan Bradley, Ann Bunger, Kristin
Denham, Jessi Grieser, Wesley Leonard, Michael Rushforth, Rosa Vallejos, and Lynsey Wolter
for helping me understand more than I thought I could. I am particularly indebted to Michal
Temkin Martinez, my academic big sister, who generously created a me-sized opening in the
FLC and who has been an unending font of encouragement since the moment we met.
My colleagues from the USC Ballroom Dance Team are responsible for years of
happiness and growth. I am grateful to Jeff Tanedo, Sanika Bhargaw, Katy Maina, Kim
Luong, Alison Eling, Dana Dinh, Andrew Devore, Alex Hazen, Max Pflueger, Eric
Gauderman, Mark Gauderman, Ashley Grist, Rachel Adams, Sayeed Ahmed, Zoe Schack,
Queenique Dinh, Michael Perez, and so many others for their leadership, support, and
camaraderie during our time together. Tasia Dedenbach was a superb dance partner and
friend; she gave me the confidence to trust my eye in my academic visualizations and also
gave me my first academic poster template which I abuse to this day. Alexey Tregubov is an
absolute gem of a human being who, simply by demonstrating his own work ethic, is the
reason I was able to actually finish any of my dissertation chapters. Sara Kwan has been the
most thoughtful friend a person could ask for. She is a brilliant and patient sounding board
for the ideas that develop as we indulge our mutual fondness for board games—and her well-timed snack
deliveries (like the mint chocolate Milano cookies I’m eating at this very moment) are always
appreciated. Lorena Bravo and Jonathan Atkinson have gone from being just my dance
teachers to being cherished friends. Thank you for the encouragement, the life lessons, and
Sarah Harper deserves special recognition, as all who know her can attest. There have
circles at two different universities. Along with all the good times we’ve had with the friends
we now share, she has also had perhaps the most punishing job of any of my friends—having
to deal with me in my crankiest moods. Sarah, thank you for your unwavering support
through it all.
Thanks to Joyce Soares, Ed Soares, Kethry Soares, and Gunnar Jaffarian for treating
me like family even when I wasn’t officially family yet. Lane Stowell, thank you for being a
true friend to Erin for these last several years, and for taking care of her when I have been
unable. And thanks to Angela Boyer, who holds the record for being friends with me the
longest despite three time zones and a bunch of miles coming between us when I moved to
California. It was Angela who suggested I take my first introductory Linguistics class, which
in turn triggered all the events that led me to writing these words.
Because this is a Linguistics dissertation, this is the paragraph where I am supposed to
thank my parents for instilling in me an early passion for language and learning and thank
my grandmother for teaching me to love reading—all of which is perfectly true and for which
I am indeed grateful. But in the context of this particular dissertation, perhaps even more
Mom and Dad, I know you were embarrassed, when I was small and you took me to a
concert where Raffi started asking kids what kind of music they listened to at home, because
you thought that we didn’t listen to very much music at all. But my life has been filled with
music thanks to you: listening to Dad’s tuba on his bands’ cassette tapes or at Tuba
Christmas; hearing Mom ring out the descant of my favorite hymns; listening to the two of
you harmonizing on songs from the ancient past; singing in Mom-Mom’s choir at Christmas;
learning musical mnemonics in children’s choir that haunt me to this day; and watching in
awe (and listening in some agony) as Dad started learning the violin during a mid-life stroke
of inspiration. All of this, plus the fourteen years of piano lessons you paid for—for what
little use I made of them—and now a dissertation about vocal music. Altogether, I’d say you
That leaves two very important women to thank. Erin Soares, thank you for being
patient with me every time I moved the goalposts on you; I am happy to report with some
confidence that my dissertation is finally, truly finished. I owe that to you: you have given me
a lifestyle that makes me feel safe and comfortable enough to write a dissertation—no small
feat. With you I am confident, capable, and loved, and I can’t wait to spend the rest of my life
making you feel the same. And Mairym Llorens Monteserín, thank you for… everything.
Table of contents
Dedication ii
Acknowledgements iii
List of tables x
Abstract xvii
Chapter 1: Introduction 1
Chapter 2: Method 21
Chapter 3: Sounds 44
References 293
Appendix 308
List of tables
Table 22. Unforced Kick Drum environments. 174
Table 23. Kick Drum environment type observations. 174
Table 24. Kick Drum token observations. 175
Table 25. Summary of the five beat patterns analyzed. 188
Table 26. The beatboxing sounds used in this chapter. 191
Table 27. Sounds of beatboxing used in beat pattern 5. 192
Table 28. Sounds of beatboxing used in beat pattern 9. 199
Table 29. Sounds of beatboxing used in beat pattern 4. 202
Table 30. Sounds of beatboxing used in beat pattern 10. 207
Table 31. Sounds of beatboxing used in beat pattern 1. 213
Table 32. Non-exhaustive lists of state-, parameter-, and graph-level properties for
dynamical systems used in speech. 231
Table 33. Sounds of beatboxing used in this chapter. 248
Table 34. Contingency table of beatboxing sound constrictors (top) and the speech
sounds they replace (left). 266
List of figures
Figure 63. The Inward Clickroll with Liproll. 98
Figure 64. The Lip Bass. 98
Figure 65. tch. 99
Figure 66. The Liproll with Sweep Technique. 99
Figure 67. The Sega SFX. 100
Figure 68. The Trumpet. 100
Figure 69. The Vocalized Tongue Bass. 101
Figure 70. The High Tongue Bass. 101
Figure 71. The Kick Drum exhale. 102
Figure 72. Histogram of 10,000 random sound pair trials in a 6 x 7 x 2 matrix. 118
Figure 73. Histogram of 10,000 random sound pair trials in a 4 x 7 x 2 matrix. 119
Figure 74. A lip closure time function for a spoken voiceless bilabial stop [p], taken
from real-time MRI data. 133
Figure 75. Schematic example of a spring restoring force point attractor. 134
Figure 76. Schematic example of a critically damped mass-spring system. 135
Figure 77. Schematic example of a critically damped mass-spring system with a
soft spring. 136
Figure 78. Position and velocity time series for labial closures for a beatboxing Kick
Drum {B} (left) and a speech voiceless bilabial stop [p] (right). 144
Figure 79. Parameter values tuned for a specific speech unit are applied to a point
attractor graph, resulting in a gesture. 150
Figure 80. Speech-specific and beatboxing-specific parameters can be applied
separately to the same point attractor graph, resulting in either a speech
action (a gesture) or a beatboxing action. 150
Figure 81. Forced/Classic Kick Drum. Larynx raising, no tongue body closure. 156
Figure 82. Unforced Kick Drum. Tongue body closure, no larynx raising. 158
Figure 83. Spit Snare vs Unforced Kick Drum. 159
Figure 84. Forced Kick Drum beat patterns. 165
Figure 85. Unforced Kick Drum beat patterns. 166
Figure 86. Beat patterns with both forced and unforced Kick Drums. 168
Figure 87. An excerpt from a PointTier with humming. 171
Figure 88. A sequence of a lateral alveolar closure {tll}, unforced Kick Drum {b},
and Spit Snare {SS}. 176
Figure 89. A beat pattern that demonstrates the beatboxing technique of humming
with simultaneous oral sound production. 180
Figure 90. This beat pattern contains five sounds: a labial stop produced with a
tongue body closure labeled {b}, a dental closure {dc}, a lateral closure
{tll}, and a lingual egressive labial affricate called a Spit Snare {SS}. All of
the sounds are made with a tongue body closure. 181
Figure 91. Drum tab of beat pattern 5. 193
Figure 92. Regions for beat pattern 5. 194
Figure 93. Time series of vocal tract articulators used in beat pattern 5, captured
using a region of interest technique. 195
Figure 94. Time series and rtMRI snapshots of forced and unforced Kick Drums. 196
Figure 95. Drum tab of beat pattern 9. 200
Figure 96. Time series and gestures of beat pattern 9. 200
Figure 97. Drum tab notation for beat pattern 4. 203
Figure 98. Regions used to make time series for the Liproll beat pattern. 204
Figure 99. Time series of the beat pattern 4 (Liproll showcase). 206
Figure 100. Drum tab for beat pattern 10. 208
Figure 101. The regions used to make the time series for beat pattern 10. 210
Figure 102. Time series of beat pattern 10. 211
Figure 103. Drum tab notation for beat pattern 1. 214
Figure 104. Regions for beat pattern 1 (Clickroll showcase). 216
Figure 105. Time series of beat pattern 1. 217
Figure 106. The DOR region for the Clickroll showcase (beat pattern 1) in the first
{CR dc B ^K}. 218
Figure 107. Each forced Kick Drum in the beat pattern in order of occurrence. 218
Figure 108. Time series and real-time MRI snapshots of forced and unforced Kick
Drums. 219
Figure 109. A schematic coupling graph and gestural score of a Kick Drum and Spit
Snare. 234
Figure 110. A schematic coupling graph and gestural score of a Kick Drum,
humming, and a Spit Snare. 235
Figure 111. A schematic coupling graph and gestural score of a {b CR B ^K}
sequence. 237
Figure 112. Waveform, spectrogram, and text grid of the beatrhymed word
“dopamine”. 248
Figure 113. Bar plot of the expected counts of constrictor matching with no task
interaction. 251
Figure 114. Bar plot of the expected counts of constrictor matching with task
interaction. 251
Figure 115. Bar plots of the expected counts of K Snare constrictor matching with
no task interaction. 253
Figure 116. Bar plots of the expected counts of K Snare constrictor matching with
task interaction. 253
Figure 117. Serial and hierarchical representations of a 16-bar phrase (8 lines with 2
measures each). 256
Figure 118. Example of a two-line beat pattern. 263
Figure 119. Bar plot showing measured totals of constrictor matches and
mismatches. 265
Figure 120. Bar plots with counts of the actual matching and mismatching
constrictor replacements everywhere except the back beat. 268
Figure 121. Bar plot with counts of the actual matching and mismatching
constrictor replacements on just the back beat. 269
Figure 122. Four lines of beatrhyming featuring two replacement mismatches
(underlined). 270
Figure 123. Counts of replacements by beatboxing sounds (bottom) against the
manner of articulation of the speech sound they replace (left). 272
Figure 124. Counts of replacements by beatboxing sounds (bottom) against the
speech sound they replace (left). 272
Figure 125. Four 16-bar beatboxing (sections B and D) and beatrhyming (sections
C and E) phrases with letter labels for each unique sound sequence. 275
Figure 126. Beat pattern display and repetition ratio calculations for sections B, C,
D, and E. 276
Figure 127. Tableau in which a speech labial stop is replaced by a K Snare on the
back beat. 283
Figure 128. Tableau in which a speech labial stop is replaced by a Kick Drum off
the back beat. 283
Figure 129. Waveform, spectrogram, and text grid of the beatrhymed word “move”
with a Kick Drum splitting the vowel into two parts. 287
Figure 130. Waveform, spectrogram, and text grid of the beatrhymed word “sky”
with a K Snare splitting the vowel into two parts. 288
Figure 131. The anthropophonic perspective. 296
Abstract
This dissertation develops a phonological analysis of beatboxing that can account for
phonological phenomena that speech and beatboxing share. The findings bear on the
longstanding question of the domain-specificity of language: because hallmarks of linguistic
phonology like contrastive units (Chapter 3), alternations (Chapter 5), and harmony
(Chapter 6) also exist in beatboxing, beatboxing phonology provides evidence that
beatboxing and speech share not only the vocal tract but also a common cognitive
foundation.
Beatboxing has phonological behavior based in its own phonological units and
organization. One could choose to model beatboxing with adaptations of either features or
gestures as its fundamental units. But as Chapter 4: Theory discusses, a gestural approach
captures both domain-specific aspects of phonology (learned targets and parameter settings
for a given constriction) and domain-general aspects (the ability of gestural representations
to apply across domains).
Gestures have domain-specific meaning within their own system (speech or beatboxing)
while sharing a domain-general conformation with other behaviors. Gestures can do this by
explicitly connecting the tasks specific to speech or to beatboxing with the sound-making
potential of the vocal substrate they share; this in turn creates a direct link between speech
gestures and beatboxing gestures. This link is formalized at the graph level of the dynamical
system.
The direct formal link between beatboxing and speech units makes predictions about
what types of phonological phenomena beatboxing and speech units are able to participate
in. In particular, it predicts that the phonological units of the two domains will be able to
co-occur, with beatboxing and speech sounds interwoven together by a single individual.
This type of behavior is known as beatrhyming (Chapter 7).
These advantages of the gestural approach for describing speech, beatboxing, and
beatrhyming suggest that the phonological system is not encapsulated away from other
cognitive domains, nor impermeable to connections with other domains. On the contrary, as
the chapters of this dissertation illustrate, the properties that the phonological system
shares with other domains are also the
foundation of the phonological system’s ability to flexibly integrate with other (e.g., musical)
domains.
CHAPTER 1: INTRODUCTION
Beatboxing is a vocal art of both imitation and musical
performance—the latter being primarily the focus here. Beatboxers are increasingly
recognized in both scientific and popular literature as artists who push the limits of the vocal
tract with unspeechlike vocal articulations that have only recently been captured with
modern imaging technology. Scientific study of beatboxing is valuable on its own merits,
especially for beatboxers hoping to teach and learn beatboxing more effectively. But much of
beatboxing science also serves as a type of speech and linguistic science, aimed at better
understanding the nature of speech.
As a piece of beatboxing science, the contribution is the first major effort (that I know of) to
develop a theoretical framework relating representations in speech and beatboxing that can
account for these findings. As a type of linguistic science, the dissertation contributes to the
longstanding debate over the domain-specificity of language: because hallmarks of linguistic
phonology like alternations and harmony also exist in beatboxing, beatboxing phonology
provides further evidence that phonology is rooted in domain-general cognition (rather than
existing within an encapsulated, domain-specific module).
Section 1 introduces the art of beatboxing and briefly summarizes the current state of
beatboxing science. Section 2 provides the context for how research on a distinctly
non-linguistic behavior like beatboxing can be a lens for linguistic inquiry.
The foundation of beatboxing lies in hip hop. The “old school” of beatboxing began as
human mimicry of the sounds of a beat box, a machine that synthesizes percussion sounds
and other sound effects. The beat box created music that an MC could rap over; when a beat
box wasn’t available, a human could perform the role of a beat box by emulating it vocally.
The two videos below demonstrate how beatboxing was used by early artists like Doug E.
Fresh and Buffy.
[Link]
(The beat pattern starts in earnest around 0:48. Before that, you can hear Doug E. Fresh.)
[Link]
(Buffy was well-known for his “bass-heavy breathing technique” (source) that you can hear
from 0:10-0:15.)
The last four decades have given beatboxers plenty of time to innovate in both artistic
composition and beatbox battles that demonstrate mechanical skill. Modern beatboxing
performances often stand alone: if there are any words, they are only occasional and woven
by the beatboxer into the beat pattern rather than said by a second person. (There are art
forms like beatrhyming where singing/rapping and beatboxing are fully integrated, but this is
a different vocal behavior; see Chapter 7: Beatrhyming. Combining words or other vocal
behaviors into beatboxing is sometimes called multi-vocalism.) The next two videos show
that beat patterns in the “new school” of beatboxing may be faster, reflecting contemporary
musical styles.
[Link]
[Link]
Beatboxing evolves through innovation of new sounds or sound variations, patterns (e.g.,
combinations of sounds or styles of breathing), and integration with other behaviors (e.g.,
beatboxing flute, beatboxing cello, beatrhyming, beatboxing with other beatboxers). For
novice beatboxers, the goal is to learn how to sound as good as experts; for expert
beatboxers, the goal is to create art through innovation while keeping up with trends. This
innovation is constrained by both physical and cultural forces. The major physical constraint
is the vocal tract itself which limits the speed and quality (i.e., constriction degree and
location) of possible movements; new beatboxing sounds and patterns are thought to arise
from testing these physical limitations. As for cultural forces, both the musical genres that
inspire beatboxing and the preferences of beatboxers themselves have a role. Three examples
follow.
First, beatboxing started without words, and today most beatboxers still rarely speak while
beatboxing. Although a beatboxer may occasionally produce a word or
phrase during a beat pattern, usually with non-modal phonation, the fact that beatrhyming
has its own name to distinguish it from beatboxing implies that it is not the same art form.
Second, since the initial role of beatboxing was to provide a clear beat by emulating drum
sounds, non-continuant stops and affricates became very common while continuants like
vowels are almost never used. When drawing on inspiration from other musical sources,
related genres like electronic dance music would have been appealing for their percussive
similarities. Contemporary beat patterns keep the percussive backbone, though some
sustained sounds (i.e., modal or non-modal phonation for pitch) can be used concurrently as
well. And third: more broadly, beatboxing shares musical properties with a broad range of
(Western) genres, resulting in common patterns. One common property is 4/4 time, which
signifies that the smallest musical phrases each contain four main events (which can be
thought of as being grouped into pairs of two). Another common property is the placement
of emphasis on the “back beat” (beat 3 in 4/4 time) via snare sounds (Greenwald, 2002).
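The 4/4 and back-beat conventions just described can be made concrete with a short sketch. This is my own illustration rather than anything from the beatboxing literature: the eight-slot grid, the `render_tab` helper, and the particular beat are all hypothetical, while the curly-bracket labels {B} (Kick Drum), {t} (Closed Hi-Hat), and {PF} (PF Snare) follow the notation introduced below.

```python
# Illustrative sketch (not from the dissertation): lay one 4/4 measure
# out on an eight-slot grid (two slots per beat) and render it as a
# simple drum-tab line. Labels: {B} Kick Drum, {t} Closed Hi-Hat,
# {PF} PF Snare. The grid and helper are hypothetical.

def render_tab(events, slots=8):
    """Render one measure: map slot index (0-7) -> sound label.

    Slots 0-1 are beat 1, 2-3 beat 2, 4-5 beat 3, 6-7 beat 4,
    so slot 4 is the back beat (beat 3, as counted in the text).
    Empty slots are shown as '-'.
    """
    return " ".join(events.get(i, "-") for i in range(slots))

# A basic beat: Kick Drums carry the pulse, Closed Hi-Hats sit on the
# off-slots, and the PF Snare marks the back beat (slot 4).
basic_beat = {0: "{B}", 1: "{t}", 2: "{B}", 3: "{t}",
              4: "{PF}", 5: "{t}", 6: "{B}", 7: "{t}"}

print(render_tab(basic_beat))
# -> {B} {t} {B} {t} {PF} {t} {B} {t}
```

Grids of this sort are in the spirit of the drum-tab figures referenced throughout the dissertation, where each column is a metrical slot and the snare column marks the back beat.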
These types of properties, together with the vocal modality, shape the musical style and
sound of beatboxing.
Novice beatboxers are encouraged to start by drilling the fundamentals: basic sounds like Kick
Drums {B} [p’], Closed Hi-Hats {t} [t’], and PF Snares {PF} [pf’] should become familiar
first in isolation, then in combos and beat patterns to practice them in a rhythmic context.
(Curly bracket notation indicates a beatboxing sound, while square bracket notation
indicates International Phonetic Alphabet notation.) Once the relatively small set of sounds
is secure, it is time to learn new sounds that facilitate breath management—this is important
for performing progressively more complex and intensive beat patterns that demand more
air regulation. At the same time, new beatboxers also need to focus on “technicality”, a jargon
word in the beatboxing community that refers to how accurately and precisely a sound is
performed. Reference to and imitation of other beatboxers is common for establishing ideals
and task targets. All of these basics are the foundations from which a beatboxer can start to
innovate by making novel sounds and beat patterns, and beatboxers continue to revisit these
different facets of their art to make improvements at multiple time scales (i.e., improving one
facet at a time). As a consequence of all this, beatboxers are often aware of or focusing on some facet of their
beatboxing as they perform in a way that fluent speakers of a language may not be aware of
their own performance; moreover, beatboxers at different stages in the learning process (or
even at the same stage) may beatbox very differently depending on the sounds they know
and the styles they cultivate.
All of these details are important for later chapters. The fact that beatboxers are
aiming for particular sound qualities and flow patterns means that we should expect to find
beatboxing patterns that balance aesthetics and motor efficiency (Chapter 6: Harmony). The
lack of words in beatboxing, the interest in imitating instruments/sound effects, and the
drive to innovate through the use of new sounds are all hints that beatboxing phonology is
not a variation of speech phonology but a sound organization system in its own right. The
metrical patterns of sounds (e.g., Snares on 3) frame observations about beatboxing sound
alternations (Chapter 5: Alternations) and the relationship between speech and beatboxing
sounds in beatrhyming (Chapter 7: Beatrhyming). And the fact that beatboxers are actively
focusing on different things and cultivating different styles goes a long way to explaining
qualitative variation among beatboxers, including differences in their sound inventories and
beat patterns.
A guiding theme in beatboxing science is the study of vocal agility and capability: data from
beatboxing inform our understanding of what kinds of vocal sound-producing movements
and patterns are possible, including some we previously didn’t think were possible. This in
turn offers a better general phonetic framework for studying the relationship between
linguistic tasks, cognitive limitations, and physical limitations.
Likewise, knowing more about the physical abilities of the vocal tract also informs clinical
applications.
Some researchers advocate for using beatboxing for speech therapy (Pillot-Loiseau et al.,
2021). The BeaTalk strategy has been used to improve speech in adults (Icht, 2018, 2021; Icht
& Carl, 2022); and beatboxers Martin & Mullady (n.d.) use beatboxing in their work with
children. (See also Himonides et al., 2018; Moors et al., 2020.) Although beatboxing
interventions for therapeutic purposes are still quite new, the tantalizingly obvious
connection between beatboxing and speech as vocal behaviors has been generating interest.
Crucial to both these branches of inquiry but almost completely undeveloped within
the field is a theory of beatboxing cognition. The literature offers just three claims about
beatboxing cognition so far, none of which are firmly established: one about the intentions of
beatboxers, and two about the fundamental units of beatboxing. There is a general consensus
that, based on the origins of beatboxing as a tool for supporting hip hop emcees, a
beatboxer’s primary intention is to imitate the sounds of a drum kit, electronic beat box, and
a variety of other sound effects (Lederer, 2005; Stowell & Plumbley, 2008; Pillot-Loiseau et
al., 2020). But treating beatboxing as simple imitation is reductive and does a disservice to
the artistry of the form (Woods, 2012). Even in the earliest days, old school beatboxers
established distinctive vocal identities that were surely not just attempts to mimic different
electronic beat boxes. The new school of beatboxing has come a long way since then and
shows rapidly evolving preferences in artistic expression that a drive for pure imitation seems
unlikely to motivate.
As for the cognitive representations of the sounds themselves, Evain et al. (2019) and
Paroni et al. (2021) posit the notion of a “boxeme” by analogy to the phoneme—an
acoustically and articulatorily distinct building block of a beatboxing sequence. While they
imply that boxemes are meant to be a hypothesis of cognitive units, they do not address
other questions raised by the phoneme analogy (Dehais-Underdown, 2021). Are boxemes
the smallest compositional units, or are they composed of even smaller elements? Are there
patterns that require a theory with some degree of abstraction? And are boxemes symbolic
units, action units, or something else? Separately, Guinn & Nazarov (2018) argue for the
active role of phonological features in beatboxing based on evidence from variations in beat
patterns and phonotactic place restrictions (an absence of beatboxing coronals in prominent
metrical positions). They do not link features back to larger (i.e., segment-sized) units; while
they offer the possibility that speech and beatboxing features are linked (perhaps in the same
way that the features of a language learned later in life are linked to the features of a
language spoken from birth), it remains unclear whether or how speech representations and
beatboxing representations are connected. This gap is understandable: beatboxing
science is still in its infancy, with fewer than 20 years of research, and the few scientists
involved in the field have had their hands full with other more tractable questions. But it will
be difficult to use beatboxing to inform an account of the physical and cognitive factors that
shape speech without both physical and cognitive accounts of beatboxing. And while
therapeutic applications of beatboxing are still quite new, a theory of beatboxing cognition
that is explicit about whether and how speech and beatboxing sounds are cognitively related
should help decide which interventions are more or less likely to be effective.
Chapter 3: Sounds describes the beatboxing sound inventory and the articulatory properties
along which it is organized. Chapter 4: Theory lays out
the hypothesis that those articulatory properties can be formalized as the fundamental units
of beatboxing, modeled on the gestures of Articulatory
Phonology (Browman & Goldstein, 1986, 1989). Rooting beatboxing cognition in gesture-like
units offers two benefits: the same types of empirically-testable predictions as Articulatory
Phonology, and a theoretical link between the cognitive units of speech and beatboxing. Both
benefits are advantageous for developing theories of speech informed by beatboxing and for
guiding beatboxing-based interventions. Chapter 5: Alternations and Chapter 6: Harmony
document beatboxing alternations and a pattern of beatboxing
harmony complete with triggers, undergoers, and blockers—and offer an account based on
gestures. Finally, Chapter 7: Beatrhyming goes a step further to provide evidence for a direct
link between the cognitive units of speech and beatboxing via the art of simultaneous
singing/rapping and beatboxing.
2. Beatboxing as a lens for linguistic inquiry
With respect to linguistic inquiry, the longstanding debate addressed here is one of
domain-specificity: Does the human capacity for language consist only of a specialized
composite of other cognitive systems, or is there some component that is unique to language
and cannot be attributed to specialization of other cognitive systems (Anderson, 1981)? The
question has been central in the development of major linguistic paradigms over the last
several decades, including the Minimalist program that views the human language faculty as
only minimally domain-specific (the language faculty in the narrow sense) and otherwise
composed of a unique assembly of other cognitive functions (e.g., Hauser et al., 2002;
Collins). A classic argument for domain-specificity comes from Fodor
(1983), who offers a modular approach in which a cognitive domain constitutes its own
system. In the original conception, modules are low-level (mostly sensory input) systems
which are likely to be encapsulated, automatic, innate, and which perform computations
distinct from the non-specific handling of general cognitive processing. Liberman &
Mattingly’s (1985) Motor Theory couched speech perception as a linguistic module built
around the relationship between intended phonetic gestures and their acoustic output. The
Motor Theory proposes that speech perception is a parallel system to general auditory
processing, a claim supported by duplex perception tasks (Liberman et al., 1981; Mann &
Liberman, 1983). Modularity has been conceived of many different ways by now, and
whether or not a system like language shows all of the typical traits (e.g., encapsulation,
innateness) is open to empirical testing, but domain-specificity remains key to the modular
theory (Coltheart, 1999). Even when phonology is not considered a module in the strictest
sense, it is still common to make reference to the modular “interface” between phonetics and
phonology, which implies that the linguistic system of sounds is distinct from the physical
system that implements it.
Relatedly, there are substantial barriers for the infant attempting to learn language, including lack of
segmentability and lack of invariance in the acoustic signal of the ambient language(s); given
how quickly and effectively newborns learn speech production and perception, it stands to
reason that humans may be born with a language faculty that provides a universal starting
point for the acquisition process. This language faculty is domain-specific insofar as the
innate cognitive scaffolding is tailored to address linguistic issues. Werker & Tees (1984) and
related work demonstrated that infants are born with the ability to distinguish speech
species-specific language capacity (Universal Grammar) (e.g., Lindblom, 1983; Archangeli &
Pulleyblank, 2015, 2022). This approach has foregrounded major questions in phonology
over the last few decades, all shaped around developing an understanding of how phonetics
shapes phonology. Quantal Theory (Stevens, 1989; Stevens & Keyser, 2010) derives common
phonological categories from quantal regions in the vocal tract where coarticulation is less
likely to interfere with perception. The Theory of Vowel Dispersion (Liljencrants &
Lindblom, 1972; Lindblom et al., 1979) generates typologically common vowel patterns using
the principle of maximal contrast but without presupposing any particular phonological
Kluender, 1989; Diehl et al., 1991) argue that the common covariation of certain phonological
frame/content theory (MacNeilage, 1998) posits that the origins of speech come not from a
spontaneous mutation but rather evolved from homeostatic motor functions; in this case,
phonological syllable structure (the frame) descended from the chewing action.
science and evolutionary psychology and often involves comparing speech and language to
other types of human or non-human cognition (Hauser et al., 2002). Categorical perception
has been found in chinchillas (Kuhl & Miller, 1978) and crickets (Wyttenbach et al., 1996), as
well as for human perception of non-speech sounds (Fowler & Rosenblum, 1990) and faces
(e.g., Beale & Keil, 1995). Language and music share certain rhythmic (see Ravignani et al.,
2017 for a recent discussion), syntactic (Lerdahl & Jackendoff, 1983), and neurological
qualities (Maess et al., 2001), with other apparently cross-domain ties (Feld & Fox, 1994;
Bidelman et al., 2011). Comparison of neurotypical speech and disordered speech contributes
to a neurological aspect of the discussion such as whether the motor planning in speech uses
Despite the evidence suggesting that language and phonology may not have a
is to argue that domain-general models of phonology have more predictive power than
domain-specific models for modeling phonological behavior that exists both in and outside
of speech.
Models of a theory help scientists describe and explain natural phenomena and, in
doing so, predict what related phenomena we should expect to find. Domain-specific
models are meant to describe and predict only phenomena within their own domain: in a
domain-specific computational phonological model, for example, the inputs and outputs are
exclusively linguistic and the grammar operates only over those linguistic elements. If the
same model were used to try to account for the inputs and outputs of a different cognitive
domain, then by definition the model would either fail or be subject to alterations that make
it no longer domain-specific.1 And when the model predicts phenomena that are not
observed within its domain, the model is said to be imperfect because it overgenerates. As a
least as early as the divorcing of phonetics from phonology (de Saussure, 1916; Baudouin de
Courtenay, 1972) which led to interest in only those aspects of phonology which are
essentially linguistic (Sapir, 1925; Hockett, 1955; Ladefoged, 1989). In programs descended
from this tradition, the features and grammar of phonological theory are domain-specific
because they deal exclusively with phonological inputs, outputs, and processes. The inputs
1. If a domain-specific model needs to be used to account for the phenomena in a different domain,
domain-specificity can be preserved by copying the model’s form and adapting its units/computations to the
new context. This would result in two non-overlapping domain-specific models. This might happen in a case of
cognitive parasitism; see below for more discussion on this point.
and outputs are typically expressed as phonological features—atomic representations of
linguistic information defined by their relationship with each other, whose purpose it is to
encode meaningful contrast, and which are the basis of phonological change (Dresher, 2011;
and organization—they are crucially not meant to be representations of any other domain.
sometimes explanation in phonology may come from outside language. Widespread interest
in the relationship between phonetics and phonology was renewed with the advent of
acoustically-grounded distinctive features (Jakobson et al., 1951) and the mapping of gradient
phonetic features to scalar phonetic (phonological) features in SPE (Chomsky & Halle, 1968;
see Keating, 1996 for the dual role of phonetics in SPE). Phonological grammars commonly
use phonetic grounding to constrain their outputs (Prince & Smolensky, 1993/2004; Hayes,
Kirchner, & Steriade, 2004). On the other hand, other programs based on strict
domain-specific modularity argue that phonetics should have no role in the makeup of the
grammar (e.g., Hale & Reiss, 2000). But in neither case is phonology expected to explain
outputs from the phonological system are transduced into the inputs of the phonetic system
(Keating, 1996; Cohn, 2007). Even then, the interface is not intended to account for any
phonetic phenomenon that is not clearly the result of a linguistic intent, nor is it capable of
domain-specific by design.
A domain-specific model can of course be of great practical benefit in the interest of
language is a hypothesis (not a fact) about the relationship between language and the rest of
with a domain-specific approach are also present in another nonlinguistic behavior, then a
single model that encompasses both domains may be preferable to two domain-specific
models that provide separate accounts of their shared phenomena. For this dissertation, the
search for nonlinguistic phonological behavior takes place in the domain of beatboxing.
phonology because beatboxing and speech have many qualitative articulatory properties in
common. For both beatboxing and speech, sound is produced when the vocal tract
Sounds, many of these articulations have similar constriction locations and degrees to
system, in this case based on their musical function (e.g., “snare”, “bass”, “kick”) and their
articulation (see Chapter 3: Sounds). The sounds of beatboxing can be combined and
restrictions as discussed earlier (e.g., “beat 3 must have a snare sound”). And, some common
beatboxing sounds resemble speech sounds enough that they can replace speech sounds in
between beatboxing and speech, beatboxing is an ideal nonlinguistic behavior against which
to compare speech in the search for phenomena that are unique to phonology (if any).
Assuming for the moment that beatboxing does exhibit phonology-like patterns (a
explanations for how beatboxing ended up looking phonological. One way starts with
same cognitive capacities, so whatever their shared capacities provide as a publicly available
resource (e.g., phonological harmony) will automatically be available to both phonology and
Evidence from this dissertation shows that the strongest sense of parasitism, where
beatboxing copies the actual phonological representations and grammar from phonology,
cannot be true: though there are similarities in the composition of sounds and phonological
behavior, the beatboxing sound system uses cognitive representations that are not used as
phonological units (neither in the beatboxer’s language nor in any universal feature system).
The beatboxing system must be more innovative than strict parasitism allows for.
The weaker hypothesis of parasitism is that beatboxing might take certain qualities of
phonological units and grammar—like the combinatorial nature of the representations and
constrained to be essentially identical to speech as in the strong parasitic hypothesis, but its
representations and grammar whose form it borrowed. Those aspects which beatboxing
borrowed would then technically be domain-general, at least for those two domains, even if
they did not start that way. The weaker parasitic hypothesis is more plausible than the strong
one. Neophyte beatboxers commonly learn beatboxing sound patterns from adaptations of
speech phrases (e.g., “boots and cats” → {B t ^K t}; see Chapter 3: Sounds for a description of
the symbols). Using the physical vocal apparatus to perform similar maneuvers (Chapter 3:
Sounds, Chapter 4: Theory) could in some sense “unlock” access to phonological potential.
(Hauser, Chomsky, & Fitch [2002] suggest that recursion may have similarly been adopted
into speech from domain-specific use in another cognitive domain like navigation.)
the domain-general hypothesis and the weaker parasitic hypothesis. The difference doesn’t
matter because both approaches arrive at the same (almost paradoxical) conclusion: that
beatboxing and speech share many properties and yet are qualitatively completely different
focuses on developing a single-model approach that encompasses both domains and predicts
their shared behavior (as opposed to creating two purely domain-specific models). The
Articulatory Phonology (Browman & Goldstein, 1986, 1989) is the hypothesis that the fundamental cognitive
units of phonology are not symbolic features, but actions called “gestures”. Gestures have
been argued to be advantageous for phonological theory because they unite the discrete,
context-invariant properties usually attributed to phonological units with the dynamic,
are encoded together in the language of dynamical systems: the system parameters are
invariant during the execution of a speech action, but the state of the system changes
continuously (Fowler, 1980). Chapter 4: Theory argues that dynamical systems also
actions, gestures are not unique to speech but they are specialized for speech: by design, the
dynamical equations in the task dynamic framework of motor control can characterize any
goal-oriented action from any domain (Saltzman & Munhall, 1989). This means that gestures
are on the one hand domain-general because the dynamical system that defines them can
serve as the basis for any goal-oriented action, but on the other hand domain-specific
“Second, we should note that the use of dynamical equations is not restricted
to the description of motor behavior in speech but has been used to describe
the coordination and control of skilled motor actions in general (Cooke, 1980;
Kelso, Holt, Rubin, & Kugler, 1981; Kelso & Tuller, 1984a, 1984b; Kugler, Kelso,
& Turvey, 1980). Indeed, in its preliminary version the task dynamic model we
are using for speech was exactly the model used for controlling arm
movements, with the articulators of the vocal tract simply substituted for those
of the arm. Thus, in this respect the model is not consistent with Liberman
and Mattingly’s (1985) concept of language or speech as a separate module,
with principles unrelated to other domains. However, in another respect, the
central role of the task in task dynamics captures the same insight as the
“domain-specificity” aspect of the Modularity hypothesis—the way in which
vocal tract articulators is yoked is crucially affected by the task to be achieved
(Abbs, Gracco, & Cole, 1984; Kelso, Tuller, Vatikiotis-Bateson, & Fowler,
1984).”
For an approach to phonological theory that can also describe non-linguistic behaviors,
dynamical action units should be preferred over features (or other purely domain-specific
phonological units) because they have domain-general roots but can be specialized for any
domain. When specialized for speech, these action units are gestures; when specialized for
another domain, they are the gesture-like building blocks of that domain instead.
Beyond their descriptive power, however, gestures can also make predictions about
the organization of sounds in other domains whereas features cannot. Assuming that
beatboxing harmony: beatboxing harmony has signature traits of speech harmony including
trigger, undergoer, and blocker sounds, the behavior of all of which is predicted by gestural
approaches to harmony. The gestural model also predicts the possibility of multi-tasking by
using speech and beatboxing gestures simultaneously. Chapter 7: Beatrhyming shows not
only that beatboxing and speech can be produced simultaneously, but also that their
fundamental cognitive units are related to each other through their tasks of
beatboxing harmony could exist or what traits it might have because the features and
grammar are designed only to target linguistic information. Generative linguistic grammars
also cannot generate beatrhyming because they cannot deal with non-linguistic sounds. Of
course, there are ways around these limitations—new models can be constructed that use
beatboxing features and beatboxing grammars to generate beatboxing harmony, and
speech-beatboxing cognitive interfaces can be postulated that do computations over the joint
domain of speech and beatboxing sounds. But ultimately all these strategies require making
multiple separate models to account for phenomena that speech and beatboxing share;
compared to a gestural approach that accounts for both speech and beatboxing without any
CHAPTER 2: METHOD
Two novice beatboxers, one intermediate beatboxer, and two expert beatboxers were asked
to produce beatboxing sounds in isolation and in musical rhythms (“beat patterns”), and to
speak several passages while lying supine in the bore of a 1.5 T MRI magnet. Skill level
designations were given by the intermediate beatboxer who had also contacted the
beatboxers, was present for the collection of their data, and provided a beatboxer’s insight at
several points in the earlier stages of analysis. Of those five beatboxers, the productions of
just one expert are reported in the present study. The two novices and the intermediate
beatboxer are not discussed because the aim of this dissertation is to characterize expert
beatboxing, not beatboxing acquisition. (See Patil et al., 2017 for a brief study of the basic
sounds of all five beatboxers.) Data from the second expert beatboxer are not reported
because the beatboxer exhibited large head movements during image acquisition, making
kinematic analysis using the methods described below impossible. The beatboxer studied
Each beatboxer was asked in advance to provide a list of sounds they know written
with orthographic notation they would recognize. During the scanning session, each sound
label they had written was presented back to them as a visual stimulus. For each sound,
beatboxers were asked to produce the sound three times slowly and three times quickly, and
then to produce the sound in a beat pattern (sometimes referred to hereafter as a “showcase”
beat pattern). The beatboxers were also invited to perform beat patterns of their choosing
that were not meant to showcase any particular sound. For the analyzed expert beatboxer,
there were over 50 different showcase or freestyle beat patterns. The beatboxers were paid
Data were collected using an rtMRI protocol developed for the dynamic study of
vocal tract movements, especially during speech production (Narayanan et al., 2004; Lingala
et al., 2017). The subjects’ upper airways were imaged in the midsagittal plane using a
gradient echo pulse sequence (TR = 6.004 ms) on a conventional GE Signa 1.5 T scanner
(Gmax = 40 mT/m; Smax = 150 mT/m/ms), using an 8-channel custom upper-airway coil.
The slice thickness for the scan was 6 mm, located midsagittally over a 200 mm × 200 mm
field-of-view; image size in the sagittal plane was 84 × 84 pixels, resulting in a spatial
resolution of 2.4 × 2.4 mm. The scan plane was manually aligned with the midsagittal plane
of the subject’s head. The frames were retrospectively reconstructed to a temporal resolution
of 12 ms (2 spirals per frame, 83 frames per second) using a temporal finite difference
(BART). Audio was recorded at a sampling frequency of 20 kHz inside the MRI scanner
while the subjects were imaged, using a custom fiber-optic microphone system. The audio
recordings were noise-canceled, then reintegrated with the reconstructed MR-imaged video
(Bresch et al., 2008). The result allows for dynamic visualization and synchronous audio of
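The reported spatial and temporal resolutions follow directly from the scan parameters above; a quick sketch of the arithmetic (variable names are illustrative, not from the acquisition software):

```python
# Sanity check of the resolutions stated above, from the scan parameters.
FOV_MM = 200            # midsagittal field of view (mm)
IMAGE_PX = 84           # image size in the sagittal plane (pixels)
TR_MS = 6.004           # repetition time per spiral (ms)
SPIRALS_PER_FRAME = 2   # spirals combined into one reconstructed frame

spatial_res_mm = FOV_MM / IMAGE_PX           # 200/84 ≈ 2.38, reported as 2.4 mm
frame_period_ms = TR_MS * SPIRALS_PER_FRAME  # ≈ 12 ms per frame
frame_rate_fps = 1000 / frame_period_ms      # ≈ 83 frames per second
```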
2. Annotation methods
Beat patterns from the real-time MR videos were annotated using a concise plaintext
percussion notation called “drum tabs” and point tier TextGrids in Praat (Boersma &
Weenink, 1992-2022). Beat patterns are performed with a rhythmic structure related to a
musical meter, so each annotation included labels for the beat pattern sounds and the
metrical position of that sound. This section explains how each annotation style was created,
hierarchically into prosodic feet, words, and phrases, so too is musical meter composed of
strong-weak alternations hierarchically grouped into measures and phrases. But music and
beatboxing are performed isochronously, meaning that there is roughly consistent temporal
Jackendoff, 1983; Palmer & Kelly, 1992; Figure 1). Each branch has two end nodes: a Strong
node (S) on the left, and a Weak node (W) on the right. And, each node can be the parent of
Figure 1. A simple hierarchical tree structure with alternating strong-weak nodes.
S
/ \
/ \
/ \
S W
/ \ / \
S W S W
Strong and Weak events at a certain level are sometimes called “beats” and are often marked
with the numbers 1, 2, 3, and 4; the process of finding these beats, say in order to move to
them in dance, is sometimes called beat-induction (Large, 2000). Musical phrases often last
for more than four beats, but it is customary to reset the count back to 1 instead of
continuing on to 5 (Figure 2). When counting music at this level, a musician is likely to say
“one, two, three, four, one, two, three, four, one…”. Each beat 1 is the beginning of a musical
chunk called a “measure.” Since counting the beat resets to 1 after every 4, musicians reading
musical notation might refer to a specific beat in the meter by both measure number and
Each beat can be further divided into sub-beats in which the Strong node retains the
numerical label of its parent and the Weak node is called “and” (here abbreviated to “+”)
(Figure 3). When speaking the meter aloud at this level, a musician would say “one and two
and three and four and one and two and three and four and one and…”.
These sub-beats can be divided even further. In these sub-sub-beats, the Strong nodes once
again retain the label of the parent node, while the Weak nodes are given different names
(Figure 5). The Weak sub-sub-beat between the beat node (a number) and the “and” node is
called “y” (pronounced [i]), and the Weak sub-sub-beat between the “and” node and the next
beat is called “a” (pronounced [ə]). When a musician speaks the meter at this level of
granularity, they say “one y and a two y and a three y and a four y and a one y and a two y
Figure 4. Two levels below the beat level have further subdivisions.
/ \ / \
/ \ / \
S W S W
/ \ / \ / \ / \
/ \ / \ / \ / \
/ \ / \ / \ / \
S W S W S W S W
1 2 3 4 1 2 3 4
/ \ / \ / \ / \ / \ / \ / \ / \
S W S W S W S W S W S W S W S W
1 + 2 + 3 + 4 + 1 + 2 + 3 + 4 +
|\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\
S W S W S W S W S W S W S W S W S W S W S W S W S W S W S W S W
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a
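The counting scheme illustrated above can be generated mechanically from the strong-weak tree; a minimal sketch (the function name and interface are hypothetical, for illustration only):

```python
def count_labels(measures=2, beats_per_measure=4, level=2):
    """Spoken count labels at a given subdivision depth.
    level 0: beats only ("1 2 3 4"); level 1: adds "+" ("1 + 2 + ...");
    level 2: adds "y" and "a" ("1 y + a 2 y + a ...")."""
    subdivisions = {0: [""], 1: ["", "+"], 2: ["", "y", "+", "a"]}[level]
    labels = []
    for _ in range(measures):
        for beat in range(1, beats_per_measure + 1):
            # the Strong node keeps its parent's number; Weak nodes get "y", "+", "a"
            labels.extend(str(beat) if s == "" else s for s in subdivisions)
    return labels
```

`count_labels(2, 4, 2)` reproduces the bottom row of the tree above, and the count resets to 1 at each new measure rather than continuing to 5.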
Metrical Phonology uses a more compact representation for hierarchical metrical structure, a
notation with stacks of Xs called a metrical grid (Liberman & Prince, 1977; Hayes, 1984;
Figure 5). In each column, the number of Xs represents the strength of a metrical position
relative to the other metrical positions in the same phrase. In the example below, the lowest
row of Xs corresponds to the syllables, the Xs above those to the head of each trisyllabic foot,
Figure 5. A metrical grid of the rhythmic structure of the first two lines of an English
limerick.
x x x x
x x x (x) x x x
x x x x x x x x x (x) (x) (x) x x x x x x x x x (x)
There once was a man from Nantucket who kept all his cash in a bucket
The example in Figure 6 below is the metrical grid notation of the metrical tree example in
Figure 4.
Figure 6. A metrical grid representation of the metrical structure of Figure 4.
x
x x
x x x x
x x x x x x x x
x x x x x x x x x x x x x x x x
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a
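Because the alternation is strictly binary, the height of each column in such a grid can be computed directly from its position index: every factor of two in the index adds one row of Xs. A small sketch under that assumption (the helper name is hypothetical):

```python
def grid_strength(position, levels=6):
    """Number of x's in one column of a binary metrical grid with the given
    number of rows; position 0 is the downbeat of the whole phrase."""
    if position == 0:
        return levels
    strength = 1
    while position % 2 == 0 and strength < levels:
        position //= 2
        strength += 1
    return strength
```

For the 32-position grid in Figure 6 this yields six Xs at position 0, five at position 16 (the second measure's downbeat), and one at every odd-numbered position.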
Just as speech can have trisyllabic feet, in some cases a sub-division of a beat has three
terminal nodes instead of two (or instead of sub-dividing further to four nodes). These
sequences of three are called “triplets” and are counted by musicians as “one and a two and a
three and a four and a one and a…”. For the purposes of this research, it is not important
with two levels; but for simplicity in the metrical grid the two weaker sub-beats in a triplet
Figure 7. A metrical grid representation in which each beat has three subdivisions.
x
x x
x x x x
x x x x x x x x x x x x
1 + a 2 + a 3 + a 4 + a
If triplets occur in a beat pattern in this research, they are often mixed in among binary
divisions. In the example in Figure 8, beats 1 and 3 have full binary branching while beats 2
Figure 8. A metrical grid in which beats 1 and 3 have four sub-divisions while beats 2 and 4
have three sub-divisions.
x
x x
x x x x
x x x x x x
x x x x x x x x x x x x x x
1 y + a 2 + a 3 y + a 4 + a
The preceding description of musical structure has focused on metrical positions—slots
of abstract time. But not all metrical positions are necessarily used in a beatboxing
performance. For example, in the beat pattern in Figure 9 below each beat (1, 2, 3, or 4) holds
a musical event, but the available metrical positions after each beat (“y + a”) are silent—with
the exception of the “a” of the first 4 on which musical event {B} is produced just before
another {B} on the next beat 1. ({B}, {t}, and {PF} are the beatboxing sounds Kick Drum,
Figure 9. A metrical grid of the beatboxing sequence {B t PF t B B B PF t}. All sounds except
the second {B} are produced on a major beat; the second {B} is produced on the fourth
sub-division of beat 4 of the first measure.
x
x x
x x x x
x x x x x x x x
x x x x x x x x x x x x x x x x
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a
B t PF t B B B PF t
Metrical grids are useful for representing the relative strength of each metrical position
compared to the others in its phrase. But since the relative strengths of positions in the
metrical structure of beatboxing are highly regular (1 > 3 > {2, 4} > “+” > {“y”, “a”}), a more
consolidated type of metrical notation can be used. For beatboxing and some other
percussive music that does not require pitch to be encoded, a drum tab may be used (e.g.,
Figure 10).
Figure 10. A drum tab representation of the beat pattern in Figure 9, including a label
definition for each sound.
B |x--------------x|x---x-----------
t |----x-------x---|------------x---
PF|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Drum tablature (or drum tabs) is an unstandardized form of drum beat/pattern notation
(Drum tablature, 2022; DrumTabs, n.d.). Each drum tab (the whole figure) represents a
musical utterance. Except for the last row, which marks out the meter, each drum tab row
indicates the timing of a particular musical event in the meter. Drum tab notation has two
major advantages over metrical grid notation. First, the metrical pattern of each sound is
easier to see because it sits alone on its tier. Second, multiple events can be marked as
performances including beatboxing. (The metrical grid notation, on the other hand, only
The first symbol of each row (except the last row) is the abbreviation for a beatboxing
exists for that sound (Stowell, 2003; Tyte & SPLINTER, 2014). The names of the sounds
corresponding to each symbol are listed beneath the drum tab in a key. The symbol x on a
drum tab row marks the occurrence of a sound, and the symbol - (hyphen) indicates that the
sound represented in that row is not performed at that metrical position. When a sound is
sustained, the initiation of the sound is marked with an x and the duration of its sustainment
is marked with ~ (tilde). For example, the Liproll {LR} in the drum tab in Figure 11
(simplified from a longer and more complicated sequence for illustrative purposes) is
sustained for a full beat or slightly longer each time it is produced. (The sounds {b} and {pf}
Figure 11. A simplification of a drum tab from Chapter 5: Alternations. Sounds sustained
across multiple beat sub-divisions are marked by tildes “~”.
b |x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
pf|--------x-------|--------x-------|--------x-------|------x---------
LR|x~~~~~------x~~~|~~----------x~~~|x~~~~~------x~~~|~~--------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
The bottom row of the drum tab shows the metrical positions available in the musical
utterance. The first beat of the meter is marked with the number 1. The rest of the beats of
the tactus are marked 2, 3, and 4, with the “+” of each beat evenly spaced between them. As
described in the previous section, each beat can be divided as much as required. Generally in
this research, the labels for the “y” and “a” of each beat are omitted in an attempt to improve
overall legibility of the meter, but their positions exist in the space between the numbered
beats and their “+”s. Pipes (|) visually separate each group of four beats from the next
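A drum tab in this format is simple enough to parse mechanically into a list of timed events. A minimal sketch (the function and its event format are illustrative, not part of any beatboxing tool):

```python
def parse_drum_tab(rows):
    """Convert drum-tab rows like 'B |x---x---|...' into (sound, slot) onsets.
    'x' marks an onset, '-' silence, '~' the sustain of a preceding onset;
    slots count metrical positions from the start of the utterance."""
    events = []
    for row in rows:
        label, _, cells = row.partition("|")
        pattern = cells.replace("|", "")  # pipes only separate measures visually
        for slot, ch in enumerate(pattern):
            if ch == "x":
                events.append((label.strip(), slot))
    return events
```

Applied to the Kick Drum row of Figure 10, this yields onsets at slots 0 and 15 of the first measure and at the downbeat and beat 2 of the second (absolute slots 16 and 20).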
A drum tab transcription was created for each beat pattern in the data set by repeated
audio-visual inspection of each beat pattern’s real-time MRI video. Portions of beat patterns
with rapid or unclear articulations were examined frame by frame using the Vocal Tract ROI
Toolbox (Blaylock, 2021). Articulations in the beat pattern were matched to the articulations
of sounds the beatboxer had named and performed in isolation (see Chapter 3: Sounds) in
order to establish which sound labels to use in the drum tab. In many cases, it was easiest to
start by identifying the sounds at the beginning of a phrase (which were often Kick Drums)
and the snare sounds (which fall on the back beat, notated in this dissertation as beat 3),
then look at the sounds in between. Sounds in the beat pattern that did not clearly match a
beatboxing tutorial videos and insight from other beatboxers; in cases where the sound could
not be identified, a new symbol and descriptive name was created for it. Initial drum tab
visualizations of the audio while making text grids in Praat and from time series created from
After creating transcriptions of the beat patterns in drum tabs, MIR Toolbox (v1.7.2)
(specifically the mirevents(..., ‘Attack’) function) was used to automatically find acoustic
events in the audio channel of each video in the data set (Lartillot et al., 2008; n.d.). These
events were converted into points on a Praat PointTier using mPraat (Bořil & Skarnitzl,
2016). MIR Toolbox sometimes identified events that did not correspond to beatboxing
sounds, mostly because the MRI audio (or its reconstruction) led to many sounds having an
“echo” in the signal. For example, Figure 12 shows that the acoustic release of a Kick Drum
was followed by several similar but lower amplitude pulses which were not related to any
articulatory movements and which create the illusion that there are several quieter Kick
Drums. Events determined not to be associated with the articulation of a beatboxing sound
(including these duplicate/extra events) were manually removed, keeping only the event with
When a sound was made but no event was found by MIR Toolbox, a point was manually
placed on the Praat PointTier by selecting a portion of the spectrogram that corresponded to
the sound in question (confirmed by audio inspection). The intensity of that selection was
then extracted, the time point of maximum intensity queried (in Praat, using Praat defaults),
and a point placed on the PointTier at that time. (In a small sample of comparisons between
the result of this method and points that MIR toolbox had already found, this Praat method
placed points 1-3 ms after the points placed by MIR Toolbox.) If this method failed (either
because the selection window was too small or because Praat's intensity signal lacked a clear
maximum), a point was manually placed by visual inspection of the waveform at the
highest-amplitude point.
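The fallback just described (select a window, compute a short-time intensity contour, and place the point at its maximum) can be sketched as follows; this is a simplified stand-in for the Praat workflow, not the script actually used:

```python
import numpy as np

def max_intensity_time(signal, fs, t_start, t_end, win_ms=10):
    """Time (s) of peak short-time energy inside a selection of the waveform."""
    i0, i1 = int(t_start * fs), int(t_end * fs)
    seg = np.asarray(signal[i0:i1], dtype=float)
    win = max(1, int(fs * win_ms / 1000))
    # short-time energy: moving average of the squared signal
    energy = np.convolve(seg ** 2, np.ones(win) / win, mode="same")
    return t_start + int(np.argmax(energy)) / fs
```

For a clean burst the returned time sits at the center of the burst, which is consistent with such a method landing a few milliseconds after an onset detector that marks the start of the attack.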
A label was added to each event in the PointTier corresponding to the appropriate
drum tab and PointTier labels. A second point tier with meter labels for each musical event
in a beat pattern was created automatically using mPraat: for each beatboxing event, the time
of the event in the label point tier was duplicated onto a meter PointTier and assigned the
In some cases, one beat was judged to correspond to multiple events. For example,
a Kick Drum and the beginning of a Liproll might both occur on beat 1. In all such cases it
was possible to annotate distinct acoustic events for each sound on that beat. On the meter
tier, the beat (1 here) would be used for both events—in this example, both the Kick Drum
Figure 12. Waveform, spectrogram, and text grid of three Kick Drums produced at relatively
long temporal intervals. The text grid label of each sound is associated with the true acoustic
release of the sound; the subsequent smaller bursts are artefacts from audio reconstruction.
3. Kinematic visualizations
Time series were created from rtMR video pixel intensities using a region of interest method
(Lammert et al., 2010; Blaylock, 2021). Regions of interest reduce the complexity of image
processing by isolating relatively small sets of pixels for analysis. The regions distill the
intensities (brightnesses) of all their pixels into a single value (or in the case of a centroid
method, two values). In a video, the region of interest is static but its pixel intensities change
frame by frame; assembling the frame-by-frame intensity aggregates into a list creates a time
series. Regions are generally devised so that pixel intensity changes reflect changes in the
state of a single constriction type relevant to the articulation of a sound. For example, a Kick
Drum {B} is a labial ejective stop (see Chapter 3: Sounds) and so requires a region for lip
aperture and another for larynx height. As the tissue of the relevant articulator(s) moves into
the space encoded by the pixels in a region, the region’s overall pixel intensity increases.
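For a rectangular region, the frame-by-frame reduction just described amounts to averaging a fixed block of pixels in each frame; a sketch under an assumed array layout (not code from the ROI Toolbox itself):

```python
import numpy as np

def roi_time_series(video, row_slice, col_slice):
    """Average pixel intensity inside a static rectangular region, per frame.
    video: array of shape (n_frames, height, width); brighter pixels mean
    more tissue inside the region."""
    return video[:, row_slice, col_slice].mean(axis=(1, 2))
```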
The region of interest analysis technique is versatile, with different region shape types
and time series calculation methods that can be highly effective when used appropriately.
The VocalTract ROI Toolbox (Blaylock, 2021) offers three region shapes: rectangular regions,
pseudocircular regions (Lammert et al., 2013), and regions formed by automatically finding
groups of pixels which covary in intensity (Lammert et al., 2010). (Pseudocircular regions are
pixels.) In this dissertation, rectangular regions were used for articulator movements
designated as horizontal or vertical with respect to their absolute orientation in the video,
pseudocircular regions were used for oblique articulator movements, and statistically
correlated regions were used for especially large tongue body movements (i.e., in the Liproll).
Time series calculation methods include averaging the intensities of all the pixels in the
region, transforming the intensities into a binary mode, and tracking the centroid of tissue
within the region (Oh & Lee, 2018). This dissertation uses only average pixel intensity time
series.
When regions of interest tracking average pixel intensity are used for capturing the
kinematics of movement along a given vocal dimension (see below), each region needs to be
placed so that it covers the widest aperture of its intended tract variable. At the lowest
average pixel intensity in the region, the relevant articulator(s) should be just outside the
region; pixel intensity will then increase as the relevant articulator(s) move into the region,
reaching its maximum at the point of fullest constriction. For example, for a region tracking the larynx raising
used in glottalic egressive sounds, the region should have maximum intensity when the
arytenoids are at their maximum height; in their lowest position, the arytenoids should be
just below the lower edge of the region. Defining the regions in this way ensures that the
time series capture the relevant kinematic landmarks:
the start of an articulator’s movement into a constriction, the maximum velocity of the
articulator as it moves into its constriction, and the moment of maximum constriction.
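The frame-by-frame aggregation just described can be sketched in a few lines. This is an illustrative NumPy sketch, not the VocalTract ROI Toolbox's MATLAB implementation; the function name and array shapes are assumptions for exposition.

```python
import numpy as np

def roi_time_series(video, mask):
    """Average pixel intensity inside a region of interest, frame by frame.

    video: array of shape (n_frames, height, width), one grayscale image per frame
    mask:  boolean array of shape (height, width); True marks pixels in the region
    Returns a 1D time series with one value per frame.
    """
    # Boolean-index each frame down to the region's pixels, then average them;
    # intensity rises as bright tissue moves into the region.
    return video[:, mask].mean(axis=1)
```

As tissue enters the region, the average rises toward the maxima that anchor the landmark-finding procedure described later in this chapter.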
LAB. (Figure 13.) A rectangular region to measure lip aperture. Vertically, the region
was arranged so that the upper and lower lip were just outside the region at their widest
aperture. Horizontally, the region was wide enough to include the full width of the lips
during bilabial closures as well as the protrusion of the lips during labiodental closures.
LAB2. (Figure 14.) A rectangular region for measuring labial constrictions in which
the lips are pulled inward between the upper and lower teeth. The region is placed adjacent
to, posterior of, and non-overlapping with LAB. The width and height of the region
encompassed the pixels of the retracted portions of the upper and lower lip.
COR. (Figure 15.) A rectangular region for measuring alveolar, dental, and
linguolabial tongue tip constrictions. The region is placed so that the anterior edge is
adjacent to the lips and the posterior edge is far enough right that the tongue tip stays
outside the region while the tongue is pulled back or down. The upper edge of the region is level with the
alveolar ridge.
DOR. (Figure 16.) A pseudocircular region for measuring tongue body constrictions
near the velum. The region is placed adjacent to the lowered velum such that the region is
filled when the tongue body connects with the lowered velum or for narrow tongue body constrictions near the velum.
FRONT. (Figure 17.) A region for the most anterior tongue body position of the
Liproll. The region was designed so that the anterior edge of the region traced the anterior
edge of the tongue body during its most anterior Liproll constriction, the upper edge of the
region traced the air-tissue boundary along the palate, and the lower/posterior edge traced
the anterior edge of the tongue body at its most posterior Liproll constriction. This shape
was most successfully generated from the aggregate of two adjacent regions of statistically
correlated pixels, one of which contained the front of the tongue body in only its most
anterior Liproll constriction and the other of which contained the front of the tongue body
only while the tongue was in the velar closure posture it adopted during that beat pattern.
VEL. (Figure 18.) A region for tracking velum height. This was a pseudocircular region
of radius 2 pixels placed over the pixels that contained the velum in its most raised position
and adjacent to the pixels containing the velum in its most lowered state.
LAR. (Figure 19.) A rectangular region placed on the pixels containing the arytenoid cartilages at their maximum height.
A default subset of regions was created from inspection of the first few beat patterns
JR performed, including beat patterns that highlighted the Kick Drum and Closed Hi-Hat.
These regions were modified for other videos as needed—usually to account for head movement.
Figure 13. LAB region, unfilled during a Vocalized Tongue Bass (left) and filled during the
Kick Drum that followed (right).
Figure 14. LAB2 region filled during a Liproll (left) and empty after the Liproll is complete
(right).
Figure 15. COR region, filled by an alveolar tongue tip closure for a Closed Hi-Hat {t} (left),
filled by a linguolabial closure {tbc} (center), and empty (right).
Figure 16. DOR region, filled by a tongue body closure during a Clickroll (left) and empty
when the tongue body is shifted forward for the release of an Inward K Snare (right).
Figure 17. FRONT region for Liproll outlined in red, completely filled at the beginning of the
Liproll (left) and empty at the end of the Liproll (right).
Figure 18. VEL region demonstrated by a Kick Drum, completely empty while the velum is
lowered for the preceding sound (left) and filled while the Kick Drum is produced (right).
Figure 19. LAR region demonstrated by a Kick Drum (an ejective sound), completely empty
before laryngeal raising (left) and filled at the peak of laryngeal raising (right).
In Articulatory Phonology, utterances are composed of fundamental phonological elements called “gestures” (Browman & Goldstein, 1986, 1989).
Gestures are defined with respect to a dynamical system (Chapter 4: Theory). At the level
at which they can be observed, gestures typically involve the motion of a single constriction system
called a vocal tract variable (like the lips or tongue tip) toward a task-relevant goal—often a
spatial target in the vocal tract in terms of some constriction location (where a constriction is
being made in the vocal tract) and degree (how constricted the vocal tract is in that
location). A gesture has a finite life span; while a gesture is active, the dynamical system
parameters that determine a gesture’s behavior (like its intended spatial goal) remain
invariant, but its influence over a tract variable causes continuous articulatory changes
(Fowler, 1980).
Gestural scores are visual representations of the gestures active in a given utterance. A
gestural score often includes two things: a kinematic time series for each tract variable that
estimates the continuous change of that tract variable; and, inferences about when a gesture
is thought to be active and exerting control over a tract variable—its finite duration,
represented by a box or shading accompanying the time series. Gestural scores are here used
to visualize beatboxing movements, though in this case the “gestures” found are intended to
represent only the interval of time during which a constriction is formed and released within
a tract variable (without commitment to any particular theoretical interpretation).
Gestures were found semi-automatically from time series generated by the region of
interest method (Blaylock, 2021). Each beatboxing sound was associated with one or more
regions of interest in a lookup table; for example, the Kick Drum is a glottalic egressive
bilabial stop and so was associated to the LAB (labial) and LAR (laryngeal) regions. Each
beatboxing sound in a beat pattern was marked by a point on a Praat point tier as described
earlier. For each sound, the point served as the anchor for automatic application of the DelimitGest
function (Tiede, 2010) to each of the time series associated with that sound. The algorithm
defines seven temporal landmarks for each gesture based on the velocity of the time series
(calculated via the central difference) within a specified search range—in this case, the entire
time series was the search range. The time of maximum constriction (MAXC) is the time of
the velocity minimum nearest that sound’s time point from the point tier. The times of peak
velocity into (PVEL) and out of (PVEL2) the constriction are the times of the nearest
velocity maxima greater than 10% of the maximum velocity of the search range before and
after MAXC, respectively. The time of the onset of movement (GONS) is the time at which
movement velocity is 20% of the range of velocities between the peak velocity into the
constriction and the nearest preceding local velocity minimum; the time of movement end
(GOFFS) was calculated the same way but for the range of velocity between the peak
velocity out of the constriction and the nearest following velocity minimum. Finally, the time
of constriction attainment (NONS) was calculated as the time at which the velocity was 20%
of the range between the peak velocity into a constriction and the minimum velocity
associated with the time of MAXC; the time at which a constriction began to be released was
likewise calculated as the time of the same velocity threshold but between the velocity
associated with the time of MAXC and the peak velocity out of the constriction.
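The landmark-finding procedure can be sketched as follows. This is a simplified Python re-implementation of the logic described above, not Tiede's (2010) MATLAB DelimitGest; it computes five of the seven landmarks, and details such as peak detection and tie-breaking are assumptions.

```python
import numpy as np

def gesture_landmarks(x, anchor):
    """Find gesture landmarks in a pixel-intensity time series x.

    anchor: frame index near maximum constriction (from the Praat point tier).
    Returns indices for MAXC, PVEL, PVEL2, GONS, and GOFFS.
    """
    s = np.abs(np.gradient(x))                     # speed via central difference
    mins = [i for i in range(1, len(s) - 1) if s[i] <= s[i - 1] and s[i] <= s[i + 1]]
    # MAXC: the speed minimum nearest the annotated time point
    maxc = min(mins, key=lambda i: abs(i - anchor))
    # PVEL/PVEL2: nearest speed peaks before/after MAXC above 10% of the maximum
    thresh = 0.10 * s.max()
    peaks = [i for i in range(1, len(s) - 1)
             if s[i] >= s[i - 1] and s[i] >= s[i + 1] and s[i] > thresh]
    pvel = max((i for i in peaks if i < maxc), default=None)
    pvel2 = min((i for i in peaks if i > maxc), default=None)
    gons = goffs = None
    # GONS: speed first reaches 20% of the range between the preceding
    # local minimum and the peak velocity into the constriction
    if pvel is not None:
        lo = max((i for i in mins if i < pvel), default=0)
        cut = s[lo] + 0.20 * (s[pvel] - s[lo])
        gons = next(i for i in range(lo, pvel + 1) if s[i] >= cut)
    # GOFFS: the mirror-image threshold after the peak velocity out
    if pvel2 is not None:
        hi = min((i for i in mins if i > pvel2), default=len(s) - 1)
        cut = s[hi] + 0.20 * (s[pvel2] - s[hi])
        goffs = next(i for i in range(hi, pvel2 - 1, -1) if s[i] >= cut)
    return {"MAXC": maxc, "PVEL": pvel, "PVEL2": pvel2, "GONS": gons, "GOFFS": goffs}
```

On a smooth rise-and-fall intensity trace, the landmarks come out ordered GONS < PVEL < MAXC < PVEL2 < GOFFS, as the definitions above require.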
Some automatically delimited gestures had temporal landmarks that were grossly unaligned with the actual articulator movement. Often this was because the
MAXC time point taken from the point tier was placed on a local minimum in pixel intensity
rather than a local maximum. Those gestures were manually corrected via the MviewRT GUI
(Tiede, 2010) using the same DelimitGest function and parameters by selecting different
starting frames than the ones generated from the Praat point tier. Some of those manually
placed gestures had temporal landmarks that were still grossly unaligned with their expected
relative pixel intensity values, as when a gestural offset landmark was placed halfway into the
constriction of a later gesture; these landmarks were corrected in MviewRT by eye. Gestural
scores and their time series were plotted for dissertation figures in MATLAB using a branch
of the VocalTract ROI Toolbox. In these plots, manually-adjusted gestures are marked by a
CHAPTER 3: SOUNDS
This chapter introduces some of the most frequent sounds of beatboxing and identifies
critical phonetic dimensions along which the inventory of beatboxing sounds appears to be
distributed. There are three major conclusions. First, the sounds of beatboxing have a
roughly Zipf’s Law (power law) token frequency distribution, a pattern that has been
identified for word frequency in texts and corpora but not for sound frequency; this is
consistent with beatboxing sounds behaving like
vocabulary items in the beatboxer’s sound inventory. Second, beatboxing sounds are
composed from a
relatively small set of articulatory dimensions, though the organization of these dimensions
is not identical to that of speech. Third, beatboxing
sounds are contrastive with one another because changing one of the articulatory
components of a sound generally leads to a change in the sound’s meaning. Speech and
beatboxing therefore appear to share not just the vocal apparatus but also distributional and
combinatorial principles of organization.
1. Introduction
At least at some level of representation, beatboxing sounds have intrinsic meaning. The
meaning of a sound often refers to the musical referent it imitates, whether that be part of a
drum kit like a Kick Drum {B} or a synthetic sound effect like a laser (e.g., Sonic Laser
{SonL}). One could therefore compile a list of beatboxing sounds and structure it so that
sounds with similar musical roles are listed near each other: kicks, hi-hats, snares, basses,
rolls, sound effects, and more. Catalogs of sounds like this have been assembled by
beatboxers. The boxeme of Paroni et al. (2021) seems to refer to sounds at this level of
granularity which experienced beatboxers are likely able to distinguish and use appropriately
in the context of a beat pattern—a beatboxing utterance. Beatboxing sounds are cognitively
real at least at this level of description.
But perhaps there is more to the organization of beatboxing sounds than just their
musical function. Other cognitive domains have been described as using a sort of “mental
chemistry” (Schyns et al., 1998:2) in which a few domain-relevant dimensions are variously
combined to create a myriad of representations. Speech is one such system: the sounds of a
language are composed of discrete choices along a relatively small set of phonetic
dimensions like voicing, place, and duration; these dimensions are thought to encode
linguistic meaning through contrast, and are often considered to be the compositional,
combinatorial primitives of speech.
Abler (1989; see also Studdert-Kennedy & Goldstein, 2003) describes three properties
shared by self-diversifying systems like the systems of speech sounds, genes, and chemical
compounds, among them the reuse of a small set of discrete units in
different combinations (referred to by other scholars as feature economy: Ohala, 1980, 2008;
Clements, 2003; Dunbar & Dupoux, 2016). Beatboxing does have at least two levels of
organization—the meanings of the sounds themselves in terms of musical roles (e.g., kick,
snare) and their organization into hierarchically structured beat patterns. Less clear is
whether the sounds themselves are also composed of smaller units.
If meaningful beatboxing sounds are also composed of smaller units, they should be
classifiable over repeated use of a small set of dimensions—some of which may happen to
overlap with the dimensions along which speech sounds are classified because they share the
same phonetic potential via the vocal tract. Alternatively, if beatboxing sounds are not
combinatorial, they may instead be dispersed throughout the
vocal tract so that each sound is maximally distinct from the others. This would be
analogous to dispersion accounts of phonology. If consonants
are maximally dispersed within a language’s phonological inventory as vowels often seem to
be (Liljencrants & Lindblom, 1972; Lindblom, 1986), then consonant systems like [ɗ k’ ts ɬ m
r ǀ] should be typologically common (which they are not; see Lindblom & Maddieson, 1988).
If beatboxing sounds are organized to be distinctive but not combinatorial, then beatboxing
sounds should be similarly scattered across the available phonetic space.
Note that being able to classify beatboxing sounds along articulatory dimensions is
not enough to claim that those dimensions constitute cognitive beatboxing units. Such a claim requires evidence that the
dimensions play a role in the cognitive representation and functioning of the system. In
linguistics, evidence for the cognitive reality of organizing atomic features comes from the
different behavioral patterns speech sounds exhibit depending on which features they are
composed of. This chapter only goes so far as to describe and analyze the articulatory
dimensions along which beatboxing sounds appear to be dispersed; later chapters revisit the
question of their cognitive status.
This chapter presents two novel analyses of beatboxing sound organization. The first
(Analysis 1) measures the token and beat pattern frequency of beatboxing sounds, providing
the first quantitative account of beatboxing sound frequency. The second (Analysis 2) builds
on the first by evaluating whether higher frequency beatboxing sounds can be analyzed as
composites of choices from a relatively small set of phonetic dimensions. In the process, the
chapter develops articulatory descriptions of the most frequent beatboxing sounds.
2. Method
Describing the organization of beatboxing sounds and assessing whether they are composed
combinatorially requires first making a list of beatboxing sounds to analyze. New beatboxing
sounds are continually invented, and there is no definitive record of
beatboxing in which to find a list of all the sounds that have been invented so far (though
resources like [Link] offer an attempt). The list of sounds for this analysis was
instead drawn from the present corpus, including recordings of the beatboxer
explaining how to produce various sounds. Two particular methodological concerns about
this process merit discussion in advance: how to decide which of the articulations a beatboxer uses
are and are not beatboxing sounds, and how to determine which of those sounds to include in the analysis.
The decision of what counts as a beatboxing sound is rooted in the observations and
opinions of the beatboxer and the analyst (who may be the same but in this case are not). In
the process of data collection for this study, each beatboxer was asked to make a list of
sounds they can produce and then showcase each one in a beat pattern. But more sounds
might be used in those beat patterns than were listed and showcased by the beatboxer, either
because the beatboxer forgot to list them or does not overtly recognize them as a distinct
sound. Likewise, a beatboxer might distinguish between two or more sounds that the analyst
judges to be identical, nonexistent, or not detected. And, some sounds may be different only in ornamentation,
not in kind. The analyst may defer entirely to the
beatboxer’s overt knowledge of their sound inventory or add and remove sounds in the list
based on the analysis of their usage. Thus a catalog of a beatboxer’s sounds is biased by the
beatboxer’s knowledge and the analyst’s assumptions, and therefore not likely to be a
complete record. A second concern is which sounds to
include in the analysis, as not all sounds of a beatboxer’s inventory have the same
status—just as not all the sounds of a language are equally contrastive (Hockett, 1955). Some
beatboxing sounds may just be entering or leaving the inventory, and some may be less
common than others. If the whole sound inventory is analyzed equally, less stable beatboxing
sounds may throw off the analysis by muddying the dimensions that compose more stable
sounds.
If beatboxing sounds are combinatorial, sound inventories may fill open “holes” in the inventory over time: given the current state of
sounds in a beatboxer’s inventory, they may be more likely to next learn a sound that is
composed of phonetic dimensions already under cognitive control than to learn a sound
requiring the acquisition of one or more new phonetic dimensions. There is not sufficient
diachronic data in the corpus to measure this directly; however, if we assume that a
beatboxing sound’s corpus frequency is proportional to how early it was learned (higher
frequency indicating earlier acquisition) then cataloging beatboxing sounds from high
frequency to low frequency should yield a growing phonetic feature space. The highest
frequency sounds would be expected to differ along relatively few phonetic dimensions; as
sounds of lesser frequency are added to the inventory, we would expect to find that they tend
to fill gaps in the existing phonetic dimension space when possible before opening new
phonetic dimensions. But if the sounds are dispersed non-combinatorially, we may instead
expect to find that even the earliest or most frequent sounds make use of as many phonetic
dimensions as possible to maximize their distinctiveness, with the rest of the sounds fitting in around them.
The initial list of sounds was designed to be as encompassing as possible in this study.
The 39 sounds which the beatboxer overtly identified and 16 more which were determined
by the analyst to have qualitatively distinct articulation were combined into a list of 55
sounds. The frequency of each sound was calculated by counting how many times it
appeared in the data set overall (token frequency) and how many separate beat patterns it
appeared in (beat pattern frequency). The token frequency distribution analysis of the full
set of sounds is presented in section 3.1. To minimize the impact of infrequent sounds on the
compositional analysis, section 3.2 orders sounds by beat pattern frequency.
Details about the acquisition and annotation of beatboxing real-time MR videos can
be found in Chapter 2: Method. Counts of each beatboxing sound were collected from 46
beat patterns based on their drum tab annotations. Each ‘x’ in a drum tab corresponded to
one token of a sound. Each sound was counted according to both its token frequency (how
many times it shows up in the whole data set) and its beat pattern frequency (how many
separate beat patterns it appears in).
Certain sounds and articulations were labeled in drum tabs but excluded from the
analyses: audible exhalations and inhalations, lip licks (linguolabial tongue tip closures)
which presumably help to regulate moisture on the lips but not to make sound,
non-sounding touches of the tongue to the teeth or alveolar ridge, and lip
spreading/constricting, akin to lip rounding in speech and useful for raising and lowering the
frequency of higher amplitude spectral energy. None of these were identified by the
beatboxer as distinct sounds, nor were they clearly associated with the articulation of any
nearby sounds.
3. Results
Section 3.1 examines the overall frequency distribution of the beatboxing sounds in the data
set. Section 3.2 digs further into the production of the most frequent sounds in order to
identify the phonetic dimensions along which they are organized.
Figure 21 shows the token frequency of each beatboxing sound in decreasing order of
frequency. Lighter shaded bars show the token frequency for sounds that only occurred in
one beat pattern in the data set, and the darker bars are sounds that occurred in two or more
beat patterns. Beat pattern frequency does not factor into the power law fitting procedure,
but will be used in section 3.2. The most frequent sound appears much more often than any
of the others; the next few most frequent sounds rapidly decrease in frequency from there.
The bulk of the sounds have relatively low and gradually decreasing frequency.
There are many different types of frequency distributions, but one commonly
associated with language that results in a similar distribution is Zipf's Law—a discrete power
law (zeta distribution) with a particular relationship among the relative frequencies of the
items observed (Zipf, 1949). A distribution is Zipfian when the second most frequent item is
half as frequent as the most frequent item, the third most frequent item is one third as
frequent as the most frequent item, and so on. To put numbers to it, if there were 100
instances of the most common item in a corpus, the second most common item should occur
50 times, the third most common item 33 times, the fourth 25, and so on. With respect to
language, Zipf’s Law is known for describing the frequency distribution of words in a corpus:
function words tend to be very frequent, accounting for large portions of the token
frequency, while other words have relatively low frequency. On the other hand, the
frequency distribution of speech sounds is generally not considered Zipfian: a Zipfian model
overestimates the frequencies of both the highest and lowest frequency phones while
underestimating the frequencies of phones in between. Zipf’s Law can be written as the
equation below, where 𝑛 represents an item’s frequency
rank (i.e., the third most frequent item is 𝑛 = 3) and 𝑥𝑛 represents the frequency of the 𝑛th
word.
𝑥𝑛 = 𝑥1 · 1/𝑛
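The arithmetic can be checked directly (an illustrative sketch, not code from the dissertation):

```python
def zipf_expected(x1, n_items):
    """Expected token counts under Zipf's Law, given the most frequent item's count x1."""
    # The nth-ranked item is expected to occur x1/n times; rounded to whole tokens.
    return [round(x1 / n) for n in range(1, n_items + 1)]

zipf_expected(100, 4)  # → [100, 50, 33, 25], matching the worked example above
```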
With respect to this data set, a Zipfian rank-frequency distribution of beatboxing sounds is
predicted to fit the equation above with 𝑥1 = 330 because there were 330 instances of the
most frequent sound:

𝑥𝑛 = 330 · 1/𝑛
Power laws take the more general form in the equation below; Zipf’s Law is the special case
where 𝑏 = 1 (and 𝑎 = 𝑥1). In this form, the parameters 𝑎 and 𝑏 can be estimated by
non-linear least squares regression using MATLAB’s fit function set to “power1”.

𝑓(𝑥) = 𝑎𝑥^(−𝑏)
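An equivalent estimation can be sketched outside MATLAB. The following Python version uses SciPy's curve_fit as an illustrative stand-in for fit(…, 'power1'); it is not the dissertation's analysis code.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_power_law(freqs):
    """Fit f(n) = a * n**(-b) to frequencies listed in rank order 1..N."""
    freqs = np.asarray(freqs, dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    # Non-linear least squares, seeded at the Zipfian special case b = 1
    (a, b), _ = curve_fit(lambda n, a, b: a * n ** (-b), ranks, freqs,
                          p0=(freqs[0], 1.0))
    return a, b
```

On an exactly Zipfian rank-frequency list, the fit recovers a = 𝑥1 and b = 1.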
This fit alone cannot establish that the frequency distribution of beatboxing
sounds actually follows Zipf’s Law, or even that it follows a power law versus, say, a sum of
distributions with somewhat different mathematical properties. Even so, estimating the parameters 𝑎 and 𝑏
from the data as described above yields 𝑎 = 325.6 (316.9, 334.3) and
𝑏 = 1.025 (1.054, 0.996), putting the hypothesized model parameters 𝑎 = 330 and
𝑏 = 1 within the 95% confidence intervals of both parameter estimates of Zipf’s Law. The fit
has a sum-squared error of 1152.1 and root-mean-square error of 4.66, with 𝑅² = 0.9914
(adjusted 𝑅² = 0.9912, dfe = 53). A visualization of the Zipf’s Law parameters is overlaid on
the rank-frequency plot in Figure 21.
The goodness of fit to the power law can be evaluated from other graphs. Figures 22
and 23 show the residuals of the fit: the fit model slightly underestimates tokens of frequency
rank 11-21, then slightly over-estimates the rest of the sounds in the long tail of the
distribution. The systematicity of the residuals suggests that the model may not be an ideal
fit, though overestimating the frequency of items in the tail is a relatively common finding in
other domains where Zipf's Law is said to apply. Figure 24 shows the log-log plot of the
frequency distribution and the Zipf's Law fit. Power laws plotted this way resemble a straight
line with a slope equal to the exponent in power law notation; distributions with Zipf’s Law
should therefore resemble a line with a slope of -1. Figure 25 shows the cumulative
probability of the sounds, representing for each sound type (x axis) what proportion of all
the tokens in the data set is that sound or a more frequent sound. The benefit of the
cumulative probability graph is to quickly estimate how much of the data can be accounted
for with groups of sounds of a certain frequency or higher; for example, the five most
frequent sounds account for over 50% of tokens. Again, the first few most frequent sounds
are disproportionately represented in the data while the majority of sound types appear only
rarely. The figure also shows the cumulative probability of the power law fit to the data,
which tracks the observed values closely. Overall, the frequency distribution of
beatboxing sounds seems to resemble the Zipfian frequency distribution of words, but not
the frequency distribution of speech sounds.
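The cumulative curve in Figure 25 can be computed as follows (an illustrative sketch, not the Toolbox code):

```python
import numpy as np

def cumulative_probability(freqs):
    """Discrete cumulative probability over sound types, most frequent first.

    Element k gives the proportion of all tokens accounted for by the
    k+1 most frequent sound types.
    """
    f = np.sort(np.asarray(freqs, dtype=float))[::-1]   # sort descending
    return np.cumsum(f) / f.sum()
```

For the present data set, the fifth value of this curve exceeds 0.5, matching the observation that the five most frequent sounds account for over half the tokens.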
Figure 21. Rank-frequency plot of beatboxing sounds. Beatboxing sound frequencies roughly
follow a power law: the few most frequent sounds are very frequent and most of the sounds
are much less frequent.
Figure 22. Histogram of the residuals of the power law fit. Most sounds have a token
frequency within 5 tokens of their expected frequency.
Figure 23. Scatter plot of the residuals of the power law fit (gray) against the expected values
(black). The middle-frequency sounds are a little under-estimated and the lower-frequency
sounds are a little over-estimated.
Figure 24. Log-log plot of the token frequencies (gray) against the power law fit (black).
Figure 25. The discrete cumulative density function for the token frequencies of the sounds
in this data set (gray) compared to the expected function for sounds following a power law
distribution (black).
In this section, beatboxing sounds are presented in decreasing order of beat pattern
frequency instead of token frequency under the premise that the most stable and flexible
beatboxing sounds will occur in multiple beat patterns. Sounds with low beat pattern
frequency often have low token frequency, but certain high token frequency sounds were
only performed in one beat pattern. Those sounds are omitted (like a velar closure {k}, which is the 7th most
frequent token) or deferred until a later section in accordance with their beat pattern
frequency (like the Clop {C}, which is the 12th most frequent token). Whenever reference is
made to a sound's relative frequency or to the cumulative frequency of a set of sounds,
however, those high token frequency sounds are still part of the calculation. Figure 26 shows
a revision of the cumulative probability distribution in which sounds are ordered by beat pattern frequency.
Figure 26. The discrete cumulative density function of the token frequency of sounds in this
data set (gray, same as Figure 25) against the density function of the same sounds
re-ordered by beat pattern frequency order (black).
The analysis of the compositionality of beatboxing sounds is presented in five parts. Sections
3.2.1-3.2.4 introduce beatboxing sounds with articulatory descriptions, then summarize the
phonetic dimensions involved in making those sounds. The sounds are presented according
to their beat pattern frequency: section 3.2.1 presents the five sounds that appear in more
than 10 beat patterns each, covering more than 50% of the cumulative token frequency of the
data set; section 3.2.2 adds seven sounds that appear in four or more beat patterns; and,
section 3.2.3 introduces ten sounds that each appear in 2 or more beat patterns. Section 3.2.4
adds another 20 lowest-frequency sounds for a total of 43 sounds. Section 3.2.5 summarizes
with an account of the overall compositional makeup of all the presented beatboxing sounds.
Each sound is illustrated with images from real-time
MRI videos representing stages in the articulation of the sound (see Chapter 2: Method for
details of video acquisition and sound elicitation). Usually the images come from one
instance of the sound performed in isolation; some sounds were only performed in beat
patterns, so for those sounds the images come from one instance of a sound in a beat pattern.
Some of the videos from which these images were taken are available online at
[Link]
With respect to terminology, the phonetic dimension of constriction degree will involve three terms that are
not usually used or may be unfamiliar: compressed, contacted, and narrow. A compressed
constriction degree involves a vocal closure in which an articulator pushes itself into another
surface (or in the case of labial sounds, the lips may push each other). Compressed
constriction degree is used for many speech stops and affricates, and will be a key property of
many beatboxing sounds as well. Contacted constriction degree refers to a lighter closure in
the vocal tract which results in a trill when air is passed through it. Narrow constriction
degree refers to a constriction that is sufficiently tight to cause airflow to become turbulent;
it is used the same way in Articulatory Phonology (Browman & Goldstein, 1989).
Abbreviations for the sounds are provided in two notation formats: IPA and BBX.
Transcription in IPA notation incorporates symbols from the extensions to the International
Phonetic Alphabet for disordered speech (Duckworth et al., 1990, Ball et al., 2018b) and the
VoQS System for the Transcription of Voice Quality (Ball et al., 1995; Ball et al., 2018a). The
BBX notation (an initialism deriving from the word “beatbox”) is the author’s variant of
Standard Beatbox Notation (SBN; Stowell, 2003; Tyte & SPLINTER, 2014). At the time of
writing, Standard Beatbox Notation does not include annotations for many newer or less
common sounds. BBX is not meant to contribute to standardization, but simply to provide
functional labels for the sounds under discussion. In a few cases, BBX uses alternative labels
for sounds that SBN already has a symbol for (for example, the Inward Liproll in SBN is
{BB^BB} and in BBX is {LR}). BBX and SBN notations are indicated with curly brackets {}.
Unlike IPA, which typically uses one symbol per
sound, BBX and SBN annotations frequently use multiple symbols to denote a single sound
form. Tables 2-4 show the organization of the sounds based on their place of articulation,
constriction degree, airstream mechanism, and musical role. Unless otherwise indicated, the
MRI images presented in the figures below represent a sequence of snapshots at successive
stages in the production of each sound.
Kick Drum
The Kick Drum {B} mimics the kick drum sound of a standard drum set. It is one of the most
frequent beatboxing sounds and is typically produced as a voiceless
glottalic egressive bilabial plosive (Proctor et al., 2013; de Torcy et al., 2014; Blaylock et al.,
2017; Patil et al., 2017; Dehais-Underdown, 2019). First a complete closure is made at the lips
and glottis, then larynx raising increases intraoral pressure so that a distinct “popping” sound
is produced when lip compression is released. The high-frequency rank of the Kick Drum is
likely due to a variety of factors: it is common in the musical genres on which beatboxing is
based; it replaces the [b] in the “boots and cats” phrase commonly used to introduce new
English beatboxers to their first beat pattern; and, it is frequently co-produced with other sounds.
PF Snare
The PF Snare {PF} is a labial affricate; it begins with a full labial closure, then transitions to a
brief labio-dental fricative. That the PF Snare is a glottalic egressive sound is evidenced by
the larynx raising visible during its production.
Inward K Snare
The Inward K {^K} (sometimes referred to simply as a K Snare due to its high frequency) is a
voiceless pulmonic ingressive lateral velar affricate. In producing the Inward K, the tongue
body initially makes a closure against the palate. It then shifts forward, with at least one side
lowering to produce a moment of pulmonic ingressive frication. The lateral quality is not
directly visible in these midsagittal images; however, laterality can be deduced by observing
that the tongue body does not lose contact with the palate in the midsagittal plane: if the
tongue is blocking the center of the mouth, then air can only enter the mouth past the sides
of the tongue.
Unforced Kick Drum
The Kick Drum is sometimes referred to as a “forced” sound. An “unforced” version of the
Kick Drum has also been observed in some beatboxing productions. This unforced Kick
Drum {b} has no observable larynx closure and raising like that of the forced Kick Drum;
instead, it is produced with a dorsal closure along with the closure and release of the lips.
Note however that the tongue body does not generally shift forward or backward during the
production of this unforced Kick Drum; the airstream is therefore neither lingual egressive
nor lingual ingressive, but neutral—a “percussive”, a term for a sound lacking airflow
initiation due to pressure or suction buildup. The source of the sound in a percussive is the
noise produced by the elastic compression then release of the contacting surfaces (Catford,
1977). Section 3.2.2 expands the scope of percussive sounds slightly in the context of
beatboxing to include sounds with a relatively small amount of tongue body retraction which
signals the presence of lingual ingressive airflow (“relatively” here compared to other lingual
ingressive sounds).
The extensions to the IPA (Ball et al., 2018) offer the symbol [ʬ] for bilabial
percussives. The unforced Kick Drum is likely a context-dependent alternative form of the
more common forced Kick Drum, as discussed at greater length in Chapter 5: Alternations.
(The same chapter also includes an articulatory comparison between three compressed
bilabial sounds—the forced Kick Drum, the unforced Kick Drum, and the Spit Snare.)
Closed Hi-Hat
The Closed Hi-Hat {t} is a voiceless glottalic egressive apical alveolar affricate. The tongue tip
rises to the alveolar ridge to make a complete closure while the vocal folds close and the
which appear in at least 10 beat patterns and which collectively make up more than 50% of
the cumulative token frequency. These frequently used sounds are spread across three
primary constrictors: labial (bilabial, labio-dental), coronal (alveolar), and dorsal. Three of
the sounds {B, PF, t} are glottalic egressive, one {^K} is pulmonic ingressive, and one {b} is
percussive (Table 2). Some beatboxers also use glottalic egressive dorsal sounds (e.g.,
Rimshot), but the Inward K Snare is commonly used as a way to inhale while vocalizing. The
unforced Kick Drum appears to be a context-dependent variety of Kick Drum (see Chapter
5: Alternations), indicating that the glottalic egressive Kick Drum is the default form. With
respect to airstreams, this effectively places the most common beatboxing sounds along two
airstreams: glottalic egressive for the majority, with pulmonic ingressive for the important
Of the same sounds, three {PF, t, ^K} are phonetically affricates, and two {B, b} are
stops (Table 3). Proctor et al. (2013) describe the Kick Drum {B} as another affricate; its
production may vary among beatboxers. But the phonological distinction between affricate
and stop that exists in some languages does not have as clear a role in beatboxing; with only
five sounds under consideration so far that mostly vary by constrictor, a simpler description
is that all of these sounds are produced with a compressed closure similar to what both stops
and affricates in speech require. The nature of the release—briefly sustained or not—likely
enhances the similarity of each sound to its musical referent on the drum kit, but may not be
primary and pressure-change-initiator actions (airstream mechanism), as well perhaps as
duration, nasality, voicing, or other phonetic dimensions. Instead, the sounds vary by
constrictor but share the same qualitative constriction degree, lack of nasality, lack of
voicing, and all but one share the same airstream mechanism.
Table 1. Notation and descriptions of the most frequent beatboxing sounds.
Sound name | BBX | IPA | Description | Token frequency | Cumulative probability | Beat pattern frequency
Forced Kick Drum | {B} | [p’] | Voiceless glottalic egressive bilabial stop | 330 | 23.44% | 34
PF Snare | {PF} | [p͡f’] | Voiceless glottalic egressive labiodental affricate | 136 | 33.10% | 23
Inward K Snare | {^K} | [k͡ʟ̝̊↓] | Voiceless pulmonic ingressive lateral velar affricate | 91 | 39.56% | 16
Unforced Kick Drum | {b} | [ʬ] | Voiceless percussive bilabial stop | 117 | 47.87% | 14
Closed Hi-Hat | {t} | [ts’] | Voiceless glottalic egressive alveolar affricate | 70 | 52.94% | 12
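The cumulative probability column can be recomputed from the token frequencies. A minimal sketch, assuming a corpus total of 1408 tokens (a value inferred here from 330 tokens corresponding to roughly 23.44%; it is not stated in the text, and the final row comes out slightly below the table's 52.94%, so the assumed total is approximate):

```python
# Token frequencies from Table 1, in the table's row order.
freqs = [330, 136, 91, 117, 70]

# Hypothetical corpus size, inferred from 330 tokens ≈ 23.44%.
TOTAL_TOKENS = 1408

cumulative, running = [], 0
for f in freqs:
    running += f  # running token count through this row
    cumulative.append(100 * running / TOTAL_TOKENS)

print([f"{c:.2f}%" for c in cumulative])
```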
Table 2. The most frequent beatboxing sounds displayed according to constrictor (top) and
airstream (left).
Airstream | Bilabial | Labiodental | Coronal (alveolar) | Dorsal
Glottalic egressive | B | PF | t |
Pulmonic ingressive | | | | ^K
Percussive | b | | |
Table 3. The most frequent sounds displayed according to constrictor (top) and constriction
degree (left).
Constriction degree | Bilabial | Labiodental | Coronal (alveolar) | Dorsal
Compressed | B, b | PF | t | ^K
Table 4. The most frequent sounds displayed according to constrictor (top) and musical role
(left).
Musical role | Bilabial | Labiodental | Coronal (alveolar) | Dorsal
Kick | B, b | | |
Hi-Hat | | | t |
Snare | | PF | | ^K
3.2.2 Medium-frequency sounds
The dental closure, linguolabial closure, and alveolar closure were not identified as distinct
sounds by this beatboxer, and therefore were not given names referring to any particular
musical effect. They are each categorized as a percussive coronal stop, made with the tongue
tip just behind the teeth (dental), touching the alveolar ridge (alveolar), or placed between
the lips (linguolabial).
“Percussive” may be somewhat misleading for these sounds. Each of these sounds is
produced with a posterior dorsal constriction, just like the percussive unforced Kick Drum.
But unlike the unforced Kick Drum, in each of these sounds there is a relatively small
amount of tongue body retraction. This makes them phonetically lingual ingressive sounds
rather than true percussives which are described as sounds produced without inward or
outward airflow. (The linguolabial closure is also found without a dorsal closure, and in
Earlier, the choice was made to not distinguish between constriction release types
stop and affricate because there is no evidence here that beatboxing requires such a
distinction. For the dental, linguolabial, and alveolar clicks, however, there is evidence to
suggest that they should not be grouped with other lingual ingressive sounds that will enter
the sound inventory in section 3.2.3. Articulatorily, there is a great difference between these
“percussives” and other lingual ingressive sounds with respect to the magnitude of their
tongue body retraction. The image sequence in Figure 36 shows the production of an alveolar
closure followed immediately by a Water Drop (Air). Both sounds have tongue body
retraction that indicates a lingual ingressive airstream, but the movement of the tongue body
in the alveolar closure (frames 1-2) is practically negligible compared to the movement of the
tongue body in the Water Drop (Air) (frames 3-4). The same holds for the other sounds
coded as lingual ingressive in this chapter. In later chapters, we will also see evidence that the
dental closure and perhaps some other of these “percussive” sounds are context-dependent
variants of other more common sounds (the Closed Hi-Hat and PF Snare).
Figure 34. The linguolabial closure (non-dorsal).
Figure 36. The alveolar closure (frames 1-2) vs the Water Drop (Air). The jaw lowering and
tongue body retraction for the alveolar closure is of lesser magnitude.
Spit Snare
The Spit Snare corresponds to the Humming Snare of Paroni et al. (2021), which seems to
have two variants in the beatboxing community: the first, which Paroni et al. (2021)
reasonably describe as a lingual egressive bilabial stop with a brief high frequency trill
release; and the second, sometimes also called a Trap Snare, BMG Snare, or Döme Snare
(due to its popularization by beatboxing artists BMG and Döme (Park, 2017)), which appears
This Spit Snare is a lingual egressive bilabial affricate, produced by squeezing air
hand clap. To create the high oral air pressure that pushes the air through the lip closure, the
volume of the oral cavity is quickly reduced by tongue body fronting and jaw raising. The
lips appear to bulge slightly during this sound, either due to the high air pressure or to the
The IPA annotation for the Spit Snare is composed of the symbol for a bilabial click
(lingual ingressive) tied to the symbol for a voiceless bilabial fricative (pulmonic egressive)
followed by an upward arrow. The upward arrow was part of the extensions to the IPA until
the 2008 version, meant to be used as a diacritic in combination with pre-existing click
symbols to represent “reverse clicks” (Ball et al., 2018:159), but was removed in later versions
because such articulations are rarely encountered even in disordered speech (Ball et al.,
2018). The same notation of a bilabial click with an upward arrow was used by Hale & Nash
(1997) to represent the lingual egressive bilabial “spurt” attested in the ceremonial language
Damin. Note that the downward arrow is not used complementarily for lingual ingressive
sounds; instead, both in the extensions to the IPA and here, it marks pulmonic
ingressive sounds (designated “Inward” sounds by beatboxers) like the Inward K Snare.
Throat Kick
Another member of the Kick family of sounds is the Throat Kick (also called a Techno Kick,
implosives: while there is always an oral closure coproduced with glottal adduction, lowering,
and voicing, it does not seem to matter where the oral constriction is made. In isolation, this
beatboxer produces the Throat Kick with full oral cavity closure from lips to velum; in the
beat pattern showcasing the Throat Kick, the oral closure is an apical alveolar one. (This
latter articulation is the origin of the chosen IPA notation for this sound, an unreleased
alveolar implosive [ɗ̚]). Supralaryngeal cavity expansion (presumably to aid the brief voicing
and also to create a larger resonance chamber) is achieved through tongue root fronting,
Inward Liproll
The Inward Lip Roll is a voiceless pulmonic ingressive bilabial trill. It is usually performed
with lateral labial contact. Note that in this example, as in others, the Inward Liproll is
initiated by a forced Kick Drum. Frames 1-3 show the initial position of the vocal tract, the
initiation of the Kick Drum, and the release of the Kick Drum. In frame 4, the
lips—particularly the lower lip—have been pulled inward over the teeth. Frame 5 shows the
Tongue Bass
The Tongue Bass is a pulmonic egressive alveolar trill. The tongue tip makes loose contact
with the alveolar ridge, then air is expelled from the lungs through the alveolar closure,
causing the tongue tip to vibrate. The arytenoid cartilages appear to be in frame in the later
images, but the thyroarytenoid muscles (which would appear as a bright spot separating the
trachea from the supralaryngeal airway) are not; this means that the sound is voiceless. This
beatboxer distinguishes between the Tongue Bass here and a Vocalized Tongue Bass which
does have voicing (as well as a High Tongue Bass in which the thyroarytenoid muscles are
even clearer).
appear in four or more beat patterns in the data set and comprise about 70% of the
cumulative token frequency. Three dimensional expansions are made by the introduction of
these seven sounds to the earlier most frequent five. First, a new constriction degree: in
addition to the earlier compressed closures, now light contact that results in trills is used as
well. Second, while the tongue tip was earlier only responsible for one sound which was an
alveolar closure, it now performs five sounds—three of which are alveolar, and two of which
target other constriction locations. Third is the addition of glottalic ingressive,
pulmonic egressive, and lingual egressive airstreams for the Throat Kick, Tongue Bass, and
Five of the seven sounds use the same compressed constriction degree type as the
most frequent sounds while filling out different constriction location options—though
bilabial and coronal sounds are more popular than the others. The Tongue Bass and Inward
Liproll open a new constriction degree value of light contact but capitalize on the bilabial
and alveolar constrictor locations that already host the most compressed sounds, doubling
Airstream mechanism is expanded by these sounds. Whereas the five most common
sounds used three airstreams (and only two if you don’t count the percussive unforced Kick
Drum because it almost always occurs in restricted environments), adding the new sounds
increases airstream mechanism types to six (or five, if again you count the percussives as
alternants of other sounds). The airstream expansions do not follow any particular trend: the
glottalic ingressive sound is a laryngeal kick, the pulmonic egressive sound is a coronal bass,
the highest frequency sounds continue to be used by the medium frequency sounds, but the
Table 5. Notation and descriptions of the medium-frequency beatboxing sounds.
Sound name | BBX | IPA | Description | Token frequency | Cumulative probability | Beat pattern frequency
Linguolabial closure | {tbc} | [ʘ̺, t̼] | Voiceless percussive linguolabial stop | 23 | 57.10% | 9
Spit Snare | {SS} | [ʘ͡ɸ↑] | Voiceless lingual egressive bilabial affricate | 29 | 59.16% | 6
Inward Liproll | {^LR} | [ʙ̥↓] | Voiceless pulmonic ingressive bilabial trill | 31 | 64.91% | 5
Tongue Bass | {TB} | [r] | Voiced pulmonic egressive alveolar trill | 27 | 66.83% | 5
Table 6. High and medium frequency beatboxing sounds displayed by constrictor (top) and
airstream mechanism (left). Medium frequency sounds are bolded.
Airstream | Bilabial | Labiodental | Coronal | Dorsal | Laryngeal
Glottalic egressive | B | PF | t | |
Glottalic ingressive | | | | | u
Pulmonic egressive | | | TB | |
Pulmonic ingressive | ^LR | | | ^K |
Lingual egressive | SS | | | |
Percussive | b | | tbc, dc, ac | |
Table 7. High and medium frequency sounds displayed by constrictor (top) and constriction
degree (left). Medium frequency sounds are bolded.
Constriction degree | Bilabial | Labiodental | Coronal | Dorsal | Laryngeal
Compressed | B, b, SS | PF | tbc, dc, t, ac | ^K | u
Contacted | ^LR | | TB | |
Table 8. High and medium frequency beatboxing sounds displayed by constrictor (top) and
musical role (left). Medium frequency sounds are bolded.
Musical role | Bilabial | Labiodental | Coronal | Dorsal | Laryngeal
Kick | B, b | | | | u
Snare | SS | PF | | ^K |
Roll | ^LR | | | |
Bass | | | TB | |
3.2.3 Low-frequency sounds
Humming
Humming is phonation that occurs when there is a closure in the oral cavity but air can be
vented past a lowered velum through the nose. This beatboxer did not identify humming as a
distinct sound per se, but did identify a beat pattern that featured “Humming while
This sound is a voiced pulmonic ingressive labial trill. Like some other trills in this data set, it
Closed Tongue Bass
The Closed Tongue Bass is a glottalic egressive alveolar trill performed behind a labial
closure. As with phonation (or any other vibration of this nature), air pressure behind the
closure must be greater than air pressure in front of the closure. Egressive trills usually have
higher air pressure behind the trilling constriction because atmospheric pressure is relatively
low; for the Closed Tongue Bass, the area between the lips and the tongue tip is where
relatively low pressure must be maintained. This appears to be accomplished by allowing the
lips (and possibly cheeks) to expand to increase the volume of the chamber while it fills with
air. In the beat pattern that features the Closed Tongue Bass, the beatboxer also uses glottalic
egressive alveolar trills with spread lips, presumably as a non-closed variant of the Closed
Tongue Bass.
Liproll
The Liproll is a lingual ingressive bilabial fricative. It begins with the lips closed together and
the tongue body pressed into the palate. The tongue body then shifts backward, creating a
vacuum into which air flows across the lips, initiating a labial trill.
Water Drop (Tongue)
The Water Drop (Tongue) is one of two strategies in this data set for producing a water drop
sound effect, the other being the Water Drop (Air). The Water Drop (Tongue) is a lingual
ingressive palatoalveolar stop with substantial lip rounding. With rounded lips, the tongue
body makes a closure by the velum, and the tongue tip makes a closure at the alveolar ridge;
the tongue tip constriction is then released, mimicking the sound of the first strike of a water
droplet. The narrow rounding of the lips may create a turbulent sound, similar to whistling.
(Inward) PH Snare
The (Inward) PH Snare or Inward Classic Snare is a pulmonic ingressive bilabial affricate. In
these beat patterns, it was always followed by an Inward K Snare. A PH Snare closely
this study only explicitly identified the PK Snare as a sound they knew, not the PH Snare.
The choice was made to identify the PH Snare as a distinct sound because the few other
combination sounds in this data set—like the D Kick Roll and Inward Clickroll with
Whistle—also have their component pieces identified separately. (Note: the alternative
choice to treat the combo of PH Snare and Inward K Snare as a single PK Snare would
reduce the number of Inward K Snares in the data set from 91 to 78; re-assessing the power
law fit yields a slightly stronger correlation [R-squared = 0.9957, adjusted R-squared =
0.9956] but an exponent of b=1.032 [confidence interval (1.053, 1.011)] which is slightly larger
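The power-law re-fit described in this note can be sketched as an ordinary least-squares regression in log-log space. A minimal sketch, assuming illustrative rank-frequency counts rather than the study's exact token table:

```python
import numpy as np

# Hypothetical rank-frequency token counts (illustrative values only,
# not the dissertation's actual sound counts).
freqs = np.array([330, 136, 117, 91, 70, 31, 29, 27, 23, 23,
                  19, 8, 8, 7, 6, 6, 4, 4, 4, 3], dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# A power law freq = a * rank**(-b) is linear in log-log space:
# log(freq) = log(a) - b * log(rank). Fit by ordinary least squares.
log_r, log_f = np.log(ranks), np.log(freqs)
slope, intercept = np.polyfit(log_r, log_f, 1)
b = -slope

# Goodness of fit (R-squared) of the log-log regression.
residuals = log_f - (intercept + slope * log_r)
r_squared = 1 - np.sum(residuals**2) / np.sum((log_f - log_f.mean())**2)
print(f"b = {b:.3f}, R^2 = {r_squared:.4f}")
```

Removing tokens from one sound's count (as in the PK Snare re-analysis) and re-running the fit shows how sensitive both b and R-squared are to such coding decisions.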
Inward Clickroll
The Inward Clickroll (also called Inward Tongue Roll) is a voiceless pulmonic ingressive
central sub-laminal retroflex trill. The tongue tip curls backward so that the underside is
against the palate, and the sides of the tongue press against the side teeth so that the only air
passage is across the center of the tongue. The lungs expand, pulling air from outside the
body between the underside of the tongue blade and the palate, initiating a trill.
Open Hi-Hat
The Open Hi-Hat is a voiceless central alveolar affricate with a sustained release. The initial
closure release is ejective, but the part of the release that is sustained to produce frication is
pulmonic egressive.
Lateral alveolar closure
Sonic Laser
The Sonic Laser is a pulmonic egressive bilabial fricative with an initial apical alveolar
tongue tip closure followed by a narrow palatal constriction of the tongue body during the
fricative.
Labiodental closure
accompanied by the tongue moving forward toward an alveolar closure, though it is not clear
if this tongue movement is related to the labiodental closure or the alveolar closure that
typically follows the labiodental closure. Later chapters suggest that the labiodental closure is
high and medium frequency sounds—compressed (stops/affricates) and contacted (for trills).
The remaining sound is the Sonic Laser {SonL}; it, as well perhaps as the Water Drop
(Tongue) {WDT}, uses a narrow constriction degree akin to speech fricatives. The majority
(7/12) of these sounds are bilabial or alveolar constrictions, following the trend from the
previous section that those two constriction locations hold more sounds than the others.
Labiodental and laryngeal constrictions were also augmented, but only one new place was
added (retroflex). This set of sounds also added the final airstream type, lingual ingressive.
Less obvious in Tables 10-12 is that these sounds introduce new phonetic dimensions
that apply to certain sound pairs. The lateral alveolar closure {tll} and alveolar closure differ
by laterality, not by place, constriction degree, or airstream. Likewise, the Inward Liproll
{^LR} and Vocalized Inward Liproll {^VLR} differ by voicing, while the Closed Hi-Hat {t}
and Open Hi-Hat {ts} differ by duration (with the latter adopting a secondary pulmonic
The difficulty of capturing all the phonetic dimensions a sound uses when placing it
in an IPA-style table (or in this case, tables) is more than an issue of convenience. Using a
tabular structure for sounds is sometimes a useful proxy for assessing their periodicity
(Abler, 1989)—the degree to which sounds can be organized into groups that share similar
force the sounds into a predetermined pattern at the expense of nuanced descriptions, and a
strategy that only becomes less adequate as the beatboxing sound inventory expands. Some
consonants on the IPA table suffer from the same issue: double-articulated sounds like [w]
and non-pulmonic sounds (clicks, ejectives, implosives) do not fit into the reductive
Of the sounds in this section, the Water Drop (Tongue), Sonic Laser, Open Hi-Hat,
and Closed Tongue Bass all use two values on some phonetic dimension which makes them
impossible to place on these tables. The Water Drop (Tongue), Sonic Laser, and Closed
Tongue Bass all use multiple constriction locations, and the Open Hi-Hat uses both glottalic
egressive and pulmonic egressive airstream. Sounds of this nature can be left out of the
tables, like [w] in the IPA. Otherwise, there are three ways to include these sounds on the
tables. The first way is to add a sound to multiple locations on the table to show its
multiple-articulation; this helps somewhat in small doses, but quickly gets confusing when
many sounds must be placed on the table two or more times. The second way is to add new
rows or columns or slots for double-valued dimensions; this might be a new “glottalic
egressive + pulmonic egressive” row in the airstream mechanism dimension, or a new “labial
dimensions miss the point of having tables in the first place: the aim of the game is to look
for repetition of phonetic features in sounds, but adding new rows and columns only creates
more sparseness and hides repetition. The third way of adding double-valued sounds to the
tables is to assume that one of the dimension values is more important than the other(s) and
place the sound accordingly. This is the epitome of procrusteanism, and for simplicity it is
The point here, and even more importantly going forward into the lowest frequency
sounds, is that hard-to-place sounds often flesh out combinatorial possibilities by using
articulations that are already in the system to produce entirely novel sounds. But this will
sometimes not show up in analyses of the IPA-style tables because the sounds cannot be
Table 9. Notation and description of the low-frequency beatboxing sounds.
Sound name | BBX | IPA | Description | Token frequency | Cumulative probability | Beat pattern frequency
Vocalized Inward Liproll | {^VLR} | [ʙ↓] | Voiced pulmonic ingressive bilabial trill | 23 | 72.66% | 2
Closed Tongue Bass | {CTB} | [r'̚] | Voiceless glottalic egressive alveolar trill with optional labial closure | 19 | 74.01% | 2
Inward Clickroll | {^CR} | [ɽ↓] | Voiceless pulmonic ingressive retroflex trill | 8 | 78.84% | 2
Open Hi-Hat | {ts} | [t’s:] | Voiceless glottalic egressive alveolar affricate with sustained pulmonic egressive release | 8 | 79.40% | 2
Lateral alveolar closure | {tll} | [ǁ] | Voiceless percussive lateral alveolar stop | 7 | 79.90% | 2
Table 10. High, medium, and low (bolded) frequency sounds displayed by constrictor (top)
and airstream mechanism (left).
Airstream Bilabial Labiodental Coronal Dorsal Laryngeal
Glottalic ingressive u
Lingual egressive SS
Table 11. High, medium, and low (bolded) frequency sounds displayed by constrictor (top)
and constriction degree (left).
Constriction Bilabial Labiodental Coronal Dorsal Laryngeal
degree
Linguolabial Dental Alveolar Retroflex
Narrow SonL
Table 12. High, medium, and low (bolded) frequency sounds displayed by constrictor (top)
and musical role (left).
Musical role Bilabial Labiodental Coronal Dorsal Laryngeal
Kick B, b u
3.2.4 Lowest-frequency sounds
The previous three sections assigned categorical phonetic descriptions to the set of
beatboxing sounds that appear in more than one beat pattern in this data set. Part of the aim
of doing so was to show what types of sounds are used most frequently in beatboxing, to
avoid making generalizations that weigh a Kick Drum equally with, say, a trumpet sound
effect. This section tests the generalizations of the previous three sections by looking at
another 20 sounds, bringing the total number of sounds described from 23 to 43 (out of a
total 55 sounds, the remainder of which could not be satisfactorily articulatorily described).
If beatboxing sounds are using a somewhat limited set of the many phonetic dimensions
available to a beatboxer, then the same most common phonetic dimensions should be
Clop
D Kick
The D Kick is a voiceless glottalic egressive retroflex stop. The underside of the tongue tip
presses against the alveolar ridge, flipping back to an upright position upon release.
Inward Bass
The Inward Bass is pulmonic ingressive voicing. The base of the tongue root participates in
the constriction which may indicate that some other structure than (or in addition to) the
vocal folds is vibrating, such as the ventricular folds. The sound is akin to a growl. In this
case, the pulmonic airflow is directed through the nose rather than the mouth.
Low Liproll
The Low Liproll is a voiced glottalic ingressive bilabial trill. The vocal airway is quite wide,
lowering the overall resonance behind the trill to create a deeper sound. Frames 1-2 show the
forced Kick Drum that occurs at the beginning of this sound; frames 3-4 show the lips
Hollow Clop
The Hollow Clop is a glottalic ingressive alveolar stop. It appears to function similarly to a
click (e.g., the Water Drop Tongue) with the tongue tip making an alveolar closure as the
front part of a seal. In this case, however, the back of the seal is glottalic, not lingual.
Retraction of the tongue and lowering of the larynx expand the cavity directly behind the
seal, resulting in the distinctive position of the tongue tip sealed to the alveolar ridge (frame
Tooth Whistle
The Tooth Whistle is a labiodental whistle, which in this analysis is treated along with
Voiced Liproll
The Voiced Liproll is a voiced glottalic ingressive bilabial trill, similar to the Low Liproll and
High Liproll. The tongue body retracts during the Voiced Liproll and creates a large cavity
Water Drop (Air)
The Water Drop (Air) is a voiceless lingual ingressive palatal stop with subsequent tongue
body fronting. The tongue front and tongue body make a closure, then the tongue body
moves backward to eventually pull the tongue front away from its closure as expected for a
click. Following the release of the tongue front closure, however, the tongue body shifts
forward again. This, combined with lip rounding throughout, creates the sound of a water
drop from a pop that starts with a low resonant frequency and quickly shifts to a higher
resonant frequency.
Clickroll
The Clickroll is a voiceless lingual egressive alveolar trill. The tongue tip and tongue body
make a closure as they would for a click. Instead of the tongue body shifting backward or
down to widen the seal, the tongue gradually fills the seal to push air past the alveolar
D Kick Roll
The D Kick Roll is a combination of the D Kick and a Closed (but in this case not actually
closed) Tongue Bass. It begins with a voiceless glottalic egressive retroflex stop (the D Kick).
When the tongue tip flips upright again, it makes light contact against the alveolar ridge; the
larynx continues to rise during this closure, pushing air through to make a trill.
High Liproll
The High Liproll is a voiced glottalic ingressive bilabial trill. The vocal tract airway is narrow
for the duration of the trill, raising the resonant frequencies behind the trill for a higher
sound.
Inward Clickroll with Liproll
The Inward Clickroll with Liproll is a combination of the Inward Clickroll and an Inward
Liproll. The Inward Clickroll begins the sound as a pulmonic ingressive retroflex trill; the lips
subsequently curl inward to make another trill vibrating over the same pulmonic ingressive
airflow.
Lip Bass
tch
The tch is a voiceless glottalic egressive laminal alveolar stop. The connection between the
tongue and the alveolar ridge begins with just an apical constriction but quickly transitions
to a laminal closure. The larynx rises at that point, pushing air past the closure into the tch
snare.
Sweep Technique
The Sweep Technique is a Liproll variant in which the tongue tip connects with the
underside of the lower lip to change the frequency of the bilabial vibration.
Sega SFX
The Sega SFX (abbreviation for sound effect) is composed of an Inward Clickroll and a
labiodental fricative. The lower lip is pulled farther back across the lower teeth during the
Trumpet
The Trumpet is a voiced pulmonic egressive bilabial (or possibly labiodental with the
connection between the upper teeth and the back of the lower lip) fricative. The tongue tip
makes intermittent alveolar closures to separate the Trumpet into notes with distinct onsets
Vocalized Tongue Bass
High Tongue Bass
The High Tongue Bass is a voiced pulmonic egressive alveolar trill, made with a higher
laryngeal position and narrower airway to raise the resonant frequency behind the trill.
Kick Drum exhale
The Kick Drum exhale is a forced Kick Drum produced with pulmonic egressive airflow in
addition to the usual glottalic egressive airflow. There are only two tokens of it in the data
set, and they might both be more appropriately analyzed as a true forced Kick Drum (frames
Liproll {VLR} and Lip Bass {LB} fill out the bilabial place column, while the additions of the
Hollow Clop {HC} and Clickroll {CR} put a sound in every airstream of the alveolar place
Inward Clickroll {^CR} might be better treated typologically as an alveolar that manifests as
Just as in the previous section, several of the sounds introduced in this section do not
fit into distinctive slots in the IPA-style tables we have established so far. The tch {tch} is a
glottalic egressive alveolar sound like the Closed Hi-Hat {t} except that it uses a laminal
closure instead of an apical closure. (It may also have a release qualitatively similar to a [tʃ].)
The Low Liproll {LLR}, High Liproll {HLR}, and Vocalized Liproll {VLR} differ with respect
to the area of the vocal airway behind the labial constriction, as do the Tongue Bass {TB} and
High Tongue Bass {HTB}. The Clop {C} and Water Drop (Air) {WDA} differ by the absence
or presence of a tongue fronting movement. These were placed in the tables procrusteanly by
ignoring the apical/laminal distinction and constrictions that one might judge as secondary
and not to be taken as an assumption about the actual nature of beatboxing sounds.
Six of the lowest frequency sounds were not placed on Tables 14-16 because they were
clearly composed of two major tongue and lip constrictions and were judged not to be able
to fit into a single cell: D Kick Roll {DR}, Inward Clickroll and Whistle {^CRW}, Sega SFX
{SFX}, Trumpet {T}, Loud Whistle {LW}, and Sweep Technique {st}. Each involves
Table 13. Notation and descriptions for the lowest frequency beatboxing sounds.
Sound name | BBX | Description | Token frequency | Beat pattern frequency
D Kick Roll | DR | Voiceless glottalic egressive retroflex stop with alveolar trill | 6 | 1
Inward Clickroll with Liproll | ^CRL | Voiceless pulmonic ingressive retroflex trill and bilabial trill | 6 | 1
Sweep technique | st | | 4 | 1
Sega SFX | SFX | Voiceless pulmonic ingressive retroflex trill with labial fricative | 4 | 1
Trumpet | T | | 4 | 1
High Tongue Bass | HTB | Voiced pulmonic egressive alveolar trill with narrowed airway behind the constriction | 3 | 1
Table 14. All the described beatboxing sounds that could be placed on a table, arranged by
constrictor (top) and airstream mechanism (left). The lowest-frequency sounds are bolded.
Airstream | Bilabial | Labiodental | Coronal | Front Dorsal | Laryngeal
Glottalic egressive | B | PF | t, CTB, ts, tch, D | |
Glottalic ingressive | LLR, VLR, HLR | | HC | | u
Pulmonic egressive | LB, Bx | SonL, TW | TB, VTB, HTB | | hm
Pulmonic ingressive | ^LR, ^VLR, ^Ph | | ^CR | ^K | IB
Lingual egressive | SS | | CR | |
Table 15. All the described beatboxing sounds that could be placed on a table, arranged by
constrictor (top) and constriction degree (left). The lowest-frequency sounds are bolded.
Constriction degree Bilabial Labiodental Coronal Front Dorsal Laryngeal
Narrow SonL, TW
Table 16. All the described beatboxing sounds that could be placed on a table, arranged by
constrictor (top) and musical role (left). The lowest-frequency sounds are bolded.
Musical Bilabial Labiodental Coronal Front Dorsal Laryngeal
role
Linguolabial Dental Alveolar Retroflex Palatal
Kick B, b, Bx D u
3.2.5 Quantitative periodicity analysis
Section 1 highlighted the difference between a system that is organized periodically with
combinatorial units (like speech) and a system that is organized to maximize distinctiveness
without repeated use of a small set of elements. So far we have seen that beatboxing sounds
do make repeated use of some phonetic properties. This means that beatboxing sounds are
combinatorial, and it also suggests that the sounds are not organized to maximize
the sounds are arranged periodically—that is, whether they appear to maximize the use of a
against the periodicity of Standard American English consonants. The English consonant
system was chosen for convenience and because it has a similar number of sounds (22
beatboxing sounds will be used in this analysis; see below) and major phonetic dimensions:
articulation, and two voicing types (Table 19). The sound [l] is usually the 24th sound and
assumed to contrast with [r] in laterality, but since it is the only sound contrasting in
If beatboxing sounds are arranged periodically, then at least some sounds should be
expected to differ along only a single phonetic dimension. Two sounds that differ along only
a single dimension are a minimal sound pair. English minimal sound pairs include [p/b],
[b/m], and [t/s]. In beatboxing, the Kick Drum {B} is a minimal sound pair with the PF
Snare {PF}, Closed Hi-Hat {t}, and D Kick {D} in constrictor/place of articulation: all are
glottalic egressive and formed with a compressed constriction degree, but each is made with
different points of contact in the vocal tract. The Kick Drum is also in a minimal sound pair
with the Spit Snare {SS} and the Inward PH Snare {^Ph} along the dimension of airstream
mechanism. The first analysis (section [Link]) compares the minimal sound pair counts of
along some phonetic dimensions and relatively few sounds in others. In a maximally
distributed system, on the other hand, no phonetic dimension should be used more than the
others. The second analysis (section [Link]) uses Shannon entropy as a metric of how
evenly the sounds of each system are distributed along the place dimension.
These analyses set aside some of the beatboxing sounds that arguably constitute
varieties of a single sound. The Open Hi-Hat {ts} could be considered a variety of Closed
Hi-Hat {t} that differs only in duration of the release. The unforced Kick Drum {b}, as well as
the percussives {pf} and {dc, ac}, are argued in Chapter 5: Alternations and Chapter 6:
Harmony to be context-dependent alternants of the glottalic egressive forced Kick Drum {B},
PF Snare {PF}, and Closed Hi-Hat {t}, respectively. Vocalized Liprolls (Inward or Outward),
as well as high/low Liprolls, are voiced variations on the theme of the Liproll and Inward Liproll
(though Vocalized Liproll, High Liproll, and Low Liproll all require the Liproll to be
performed as glottalic ingressive rather than as lingual ingressive). The same goes for the
Vocalized Tongue Bass and High Tongue Bass as variants of the Tongue Bass. All sound sets
like these were consolidated into a single sound for these analyses. In the interest of more
closely matching the speech sound dimensions, the two narrow sounds Sonic Laser {SonL}
and Tooth Whistle {TW} were removed. Thus, the two-way voicing contrast of English is
paralleled by a two-way (compressed/contacted) constriction degree contrast in beatboxing.
The Water Drop (Air) {WDA} was also removed as it was not distinguishable from the Clop {C}
in this reduced feature system, as was {tch} for its similarity to {t}. From the set of sounds in
section 3.2.4, this analysis excludes {SonL, TW, b, pf, tbc, dc, ac, tll, ts, LLR, HLR, VTB, HTB,
^VLR, WDA, tch}. The 22 beatboxing sounds used in this analysis are shown in Table 17.
These final sound systems sacrifice some nuance. Many of the excluded beatboxing
sounds could be analyzed as genuine minimal sound pairs with each other and the remaining
sounds; their exclusion is meant to make the analysis as conservative as possible while
simplifying the minimal sound pair search method by trimming rarely used phonetic
dimensions. Likewise, there are simplifications to both the speech and beatboxing feature
spaces. Phonetically in speech, [f, v] are labiodental while [p, b, m] are bilabial, and [tʃ, dʒ]
are affricates not stops; consolidating them into labial and stop categories reduces the
number of dimensions available in the analysis. Similar choices were made throughout this
chapter for the beatboxing sounds—for example, the Spit Snare {SS} and PF Snare {PF} have
qualitatively different releases compared to the Kick Drum {B}, but all are grouped under the
compressed constriction degree.
A hypothetical system with maximal dispersion can be created by linearizing the
three-dimensional space into an 84-element one-dimensional vector, then assigning the 21
elements to the vector at every fourth location. That is, starting with the first position,
[ X _ _ _ X _ _ _ X _ …]. The vector is then
folded back into the three-dimensional matrix (Table 18). Minimal sound pairs are found by
taking the Hamming distance of each element's three properties: airstream, place, and
constriction degree. The Hamming distance counts
how many properties of two elements are different. For example, the first two elements
assigned into the maximally distributed matrix are a compressed glottalic egressive bilabial
sound and a compressed glottalic egressive palatal sound; since they differ only by the place
dimension, their Hamming distance would be 1 and they would be listed as a minimal sound
pair. (In the matrix these are encoded as [1 1 1] and [1 5 1], respectively; the only difference is
the middle number.) The third element assigned is a compressed glottalic ingressive
labiodental sound ([2 2 1] in the matrix) which has a Hamming distance of 2 with each of
the first two sounds—no minimal sound pairs there. The maximally distributed system yields
20 minimal sound pairs (Table 18). Placing the 22 beatboxing sounds in the same 6 x 7 x 2
space yields 37 minimal sound pairs (Table 17). The English consonants occupy a 4 x 7 x 2
(manner, place, voicing) space with a total of 57 minimal sound pairs (Table 19). The speech
system has fewer dimensions and more sounds, both of which increase the likely number of
minimal sound pairs. Even so, just these three minimal sound pair counts on their own do
not give a sense of whether the beatboxing and English consonant sound systems are more
periodic
than if they were arranged by chance. To gain a better sense of the periodicity, random sound
distributions were created to find the likelihood of the beatboxing and speech systems
having 37 and 57 minimal sound pairs, respectively, given the number of sounds and
phonetic dimensions in each system.
Ten thousand (10,000) random sound systems were created for each domain using
the same method as the maximally distributed system except that the elements were placed
at random locations rather than at every fourth position. Figures 72 and 73 show how many
minimal sound pairs were found across all trials. The purple bar in each figure marks the
actual number of minimal sound pairs calculated from Tables 17 (beatboxing) and 19
(speech). The probability of the beatboxing sound system having 37 or more minimal sound
pairs is 17.69% (about 1 standard deviation from the mean); the probability of the English
consonant system having 57 or more minimal sound pairs is 0.16% (about 3 standard
deviations from the mean). Though not marked, the hypothetical maximally dispersed
system (~20 minimal sound pairs in Figure 72) is roughly as unlikely as the number of
minimal sound pairs in the English consonant system.
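The minimal-pair counting and random-baseline procedure described above can be sketched in a few lines of code. The sketch below is illustrative rather than the actual analysis script: it encodes each sound as an (airstream, place, constriction degree) triple, counts pairs at Hamming distance 1, and draws random baseline systems from the 6 x 7 x 2 feature space; the seed and trial count are arbitrary choices.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Number of differing properties (airstream, place, constriction degree)."""
    return sum(x != y for x, y in zip(a, b))

def count_minimal_pairs(sounds):
    """Count pairs of sounds that differ along exactly one phonetic dimension."""
    return sum(1 for a, b in combinations(sounds, 2) if hamming(a, b) == 1)

# Worked example from the text: [1 1 1] and [1 5 1] differ only in place
# (a minimal pair), while [2 2 1] differs from both by two properties.
example = [(1, 1, 1), (1, 5, 1), (2, 2, 1)]
print(count_minimal_pairs(example))  # -> 1

def random_system(n_sounds, dims=(6, 7, 2)):
    """Place n_sounds in random distinct cells of the feature space."""
    cells = [(a, p, c)
             for a in range(dims[0])
             for p in range(dims[1])
             for c in range(dims[2])]
    return random.sample(cells, n_sounds)

# Random baseline for 22 sounds in the 6 x 7 x 2 space (cf. Figure 72);
# the seed and number of trials are arbitrary.
random.seed(1)
counts = [count_minimal_pairs(random_system(22)) for _ in range(2000)]
mean = sum(counts) / len(counts)
p_at_least_37 = sum(c >= 37 for c in counts) / len(counts)
```

With enough trials, the mean settles near the expected value reported above (about 33 minimal pairs), and the proportion of trials at or above 37 estimates the quoted probability.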
The number of minimal sound pairs found in beatboxing sounds (37) is somewhat
higher than the expected value of minimal sound pairs (mean=33). Compared to the
hypothetical maximally distributed system, this beatboxer’s sound system errs on the side of
more periodic. However, the distribution of beatboxing sounds has far fewer minimal sound
pairs than expected compared to the well-ordered system of English consonants. (For the
beatboxing system to be as periodic as the English consonant system in this analysis, there
would have needed to be 45 minimal beatboxing sound pairs.) Assuming that other
languages’ consonant systems share a similar well-orderedness (as has often been claimed),
this beatboxer’s system appears less periodic than consonant systems in general.
The second analysis measures the Shannon entropy of each system’s distribution of
sounds over places of articulation, with higher entropy representing greater dispersion (less
predictability) (Shannon, 1948). As Table 17 shows, the 22 beatboxing sounds are mostly
concentrated into labial (8 sounds) and alveolar (6 sounds) constrictions, with the remaining
8 sounds spread across labiodental (1 sound), retroflex (2 sounds), and the other constriction
locations. Compared to the other systems’ place distributions, beatboxing has the lowest
entropy (2.36
bits) which means it re-uses place features the most. The English consonants are slightly less
predictable (2.56 bits), and the maximally dispersed system has the greatest entropy (2.81
bits).
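These entropy figures can be reproduced directly from the Shannon (1948) formula. The sketch below is illustrative; the English place counts are read off Table 19 (with [l] conflated with [r]), and the maximally dispersed system spreads its 21 sounds evenly over the seven places.

```python
from math import log2

def entropy_bits(counts):
    """Shannon entropy (in bits) of a distribution of sounds over categories."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# English consonants by place (Table 19): labial, dental, alveolar,
# postalveolar, palatal, velar, glottal.
english = [5, 2, 6, 4, 1, 4, 1]
print(round(entropy_bits(english), 2))  # -> 2.56

# Maximal dispersion: 21 sounds spread evenly over 7 places (3 per place).
print(round(entropy_bits([3] * 7), 2))  # -> 2.81
```

The even distribution reaches the ceiling log2(7) ≈ 2.81 bits, which is why the maximally dispersed system bounds the other two from above.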
It is not clear whether entropy is a useful metric of comparison for the other phonetic
dimensions. Constriction degree can be given a three-way classification for beatboxing
sounds (compressed, contacted, and narrow; 1.33 bits) and a similar three-way system for
English consonants—compressed
(stops, affricates, and nasals), narrow (fricatives), and approximants (1.42 bits). (This brings
the {SonL} and {TW} sounds back into the mix for a total of 24 beatboxing sounds.) This
comparison suggests that beatboxing sounds are slightly more predictable/less evenly
distributed along the constriction degree dimension. But the set of English consonants is
arguably more informative along the dimension of manner of articulation, not constriction
degree, and it makes less sense to compare the distribution of two different parameter spaces.
The same goes for voicing (which English consonants often use contrastively but beatboxing
sounds do not) and airstream mechanism (where beatboxing sounds are distributed along
several airstream types but English consonants are not).
The safest conclusion to draw is that this beatboxer’s beatboxing sounds are more
unevenly distributed along the place dimension than the set of English consonants are,
suggesting that beatboxing has some periodicity but that it manifests more strongly along
some phonetic dimensions than along others.
Table 17. 22 beatboxing sounds/sound families, 37 minimal differences. Compressed on the
left, contacted on the right.
Airstream Bilabial Labiodental Alveolar Retroflex Palatal Dorsal Laryngeal
CT
Glottalic egressive B PF t B D
Pulmonic egressive Bx LB TB hm
Lingual egressive SS CR
Table 18. 21 sounds with maximal dispersion, 20 minimal differences. Compressed on the
left, contacted on the right.
Airstream Bilabial Labiodental Alveolar Retroflex Palatal Dorsal Laryngeal
Glottalic egressive X X X X
Glottalic ingressive X X X
Pulmonic egressive X X X X
Pulmonic ingressive X X X
Lingual egressive X X X X
Lingual ingressive X X X
Table 19. 23 English consonants, 57 minimal differences ([l] conflated with [r]). Voiceless on
the left, voiced on the right.
Manner Labial Dental Alveolar Postalveolar Palatal Velar Glottal
Stop p b t d tʃ dʒ k g
Nasal m n ŋ
Fricative f v θ ð s z ʃ ʒ h
Approximant r j w
Table 20. Summary of the minimal sound pair and entropy (place) analyses for beatboxing, a
hypothetical maximally distributed system, and English consonants.
System # Sounds # Min. sound pairs Phonetic dimensions Place entropy (bits)
Figure 72. Histogram of 10,000 random minimal sound pair trials in a 6 x 7 x 2 matrix. The
probability of a random distribution of 22 sounds having 37 (purple) or more (darker gray)
minimal sound pairs is 17.69% (95% confidence interval: 17.08–18.30%).
Range: 20-53. Mean: 33.34. Median: 33. Standard deviation: 3.95. Skewness: 0.31. Kurtosis: 3.20.
Figure 73. Histogram of 10,000 random minimal sound pair trials in a 4 x 7 x 2 matrix. The
probability of a random distribution of 23 sounds having 57 (purple) or more (darker gray)
minimal sound pairs is 0.16% (95% confidence interval: 0.14–0.19%). (The colors are not
visible because the bars counting random distributions with 57 minimal sound pairs are
vanishingly small.)
Range: 36-69. Mean: 46. Median: 46. Standard deviation: 3.73. Skewness: 0.38. Kurtosis: 3.26.
4. Discussion
This chapter examined the sound inventory of one beatboxer with two approaches: a
frequency distribution analysis and a phonetic feature analysis. The sounds of this
beatboxer’s beat patterns form a Zipfian frequency distribution, similar to the Zipfian
distribution of words in language corpora. Both systems rely on a few high-frequency items
that support the rest of the utterance. In English, these are function words (e.g., “the” or “and”)
that can be deployed in a wide variety of utterances and are likely to be used multiple times
in a single utterance. Words with lower frequency, on the other hand, are more informative
because they are less predictable—words like “temperature” are typically used in a relatively
restricted set of conversational contexts. In beatboxing, the most frequent sounds are the
Kick Drum, Closed Hi-Hat, PF Snare, and Inward K Snare. These sounds form the backbone
of musical performances and can be used flexibly in many different beat patterns. Infrequent
sounds like the Inward Clickroll add variety to beat patterns but may not be suitable in as
wide a range of musical contexts.
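A Zipfian rank-frequency relationship can be checked by regressing log frequency on log rank; a slope near -1 is the Zipfian signature. The sketch below is illustrative and uses idealized counts, not the beatboxer's actual sound frequencies.

```python
from math import log

def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).
    A slope near -1 indicates a Zipf-like distribution."""
    freqs = sorted(freqs, reverse=True)
    xs = [log(r) for r in range(1, len(freqs) + 1)]
    ys = [log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Idealized Zipfian counts (frequency proportional to 1/rank); these are
# stand-in numbers for illustration only.
ideal = [1200 / r for r in range(1, 11)]
print(round(zipf_slope(ideal), 2))  # -> -1.0
```

Applied to real counts of Kick Drum, Closed Hi-Hat, PF Snare, and so on down the inventory, a slope near -1 would quantify the Zipfian shape described above.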
As for the phonetic frequency analysis, the primary aim was to determine whether or
not beatboxing sounds are composed combinatorially—and the answer seems to be that they
are. As described by Abler (1989), hallmarks of self-diversifying systems like speech and
genetics are sustained variation (discrete elements that combine without
blending) and periodicity of those elements. This study does not provide evidence about
whether or how beatboxing sounds sustain variation, but it does provide evidence that
beatboxing sounds are composed of combinations of phonetic features. Beatboxing has
existed for at least two broadly defined generations (the old school and the new school) to
say nothing of the rapid rate at which beatboxing developed as an art form with cycles of
teaching and learning; since the beatboxer studied here is from the new school of
beatboxing, we can conclude that either the system has recently developed into a
combinatorial one or that the old school of beatboxing was also combinatorial and has
remained so over time. At the very least, no sounds in the inventory are a blend (i.e., an
average) of other sounds; on the contrary, sounds like the D Kick Roll and Inward Clickroll
with Liproll demonstrate that new sounds can be created by non-destructively combining
two existing sounds. That is, the components involved in the sounds separately are still
identifiable when the sounds are combined.
Section 3.2.5 showed that while beatboxing sounds are not organized with maximal
dispersion, they are also not nearly as periodic as the set of English consonants. In some
sense, the periodicity of the system diminishes as lower frequency sounds are added: the
most frequent sounds are all compressed sounds arranged neatly along major places of
articulation, and all but one (or two, if you count the unforced Kick Drum) are glottalic
egressive; the pulmonic ingressive outlier, the Inward K Snare, only deviates from the others
because it has a crucial homeostatic role to play. Although there is a tendency for later
sounds to pattern into either bilabial or alveolar constrictor and compressed or contacted
constriction degree, the initial phonetic dimensions are nonetheless broadened and more
dimensions are added without filling all the available phonetic space.
One reason for this may be that beatboxers do not learn beatboxing sounds like they
learn speech sounds. Speech is ubiquitous in hearing culture; when a child learns one or
more languages, they have an abundance of examples to learn from. Beatboxing is not
ubiquitous, so someone trying to learn beatboxing must usually actively seek out new
vocabulary items to add to their beatboxing inventory; and since it seems many beatboxers
do not start learning to beatbox until at least adolescence, the process of learning even a
single sound may be rather slow. For a beatboxer who learns this way, their sound
inventory is more likely to be less periodic because there is no overt incentive to learn
minimal sound pairs. On the contrary, in the interest of broadening their beatboxing sound
inventory a beatboxer may be more motivated to learn sounds less like the others they
currently know.
As previewed at the ends of sections 3.2.3 and 3.2.4, a major shortcoming of this
periodicity analysis is the reliance on a fixed table structure. Sounds like the Water Drop
(Tongue), Water Drop (Air), Sonic Laser, D Kick Roll, Inward Clickroll with Liproll, Sweep
Technique, Sega SFX, and Trumpet use multiple constrictions that are relatively common
among the sounds but do not manifest in a tabular periodicity measurement. To take the
Water Drop (Tongue) as an example: it uses both labial and coronal constrictors with a
lingual ingressive (tongue body closure and retraction) airstream. Placing it in only the
coronal constrictor column causes the analysis to under-count the labial articulation; but
placing the sound in the labial column too would inflate the number of sounds that use
lingual ingressive airstream. Rather than looking for periodicity in whole sounds, it would be
better in the future to look for periodicity among individual vocal constrictions. Chapter 4:
Theory discusses this issue more and the possibility of treating these combinatorial
constrictions as units in their own right.
4.2 Implications
The notion that speech sounds have a relationship with each other—and are in fact defined
by this relationship—is a major insight of pre-generative phonology. Sapir (1925) for example
emphasized that speech sounds (unlike non-speech sounds) form a well-defined set within
which each speech sound has a “psychological aloofness” (1925:39) from the others, creating
relational gaps that encode linguistic information through contrast. Many phonological
theories assume that the fundamental informational units of speech are aligned to specific
phonetic dimensions and combine to make a larger unit (a segment). We have seen that
beatboxing sounds have meaning and that there is even a Zipfian organization to the use of
beatboxing sounds which implies that they have word-like meanings—that is, their meanings
are directly accessible by the speaker or beatboxer, as opposed to the featural or segmental
information of speech sounds which speakers generally do not have awareness of. Since
beatboxing sounds are combinatorial, does that make the individual phonetic dimensions
contrastive? Cognitive?
Beatboxing sounds clearly do not encode the same literal information as speech
sounds because beatboxing cannot be interpreted as speech. But 37 minimal sound pairs
were found even in the reduced system of 22 sounds analyzed here, and a
less reductive system of over 40 sounds includes minimal differences in parameters like
voicing, double articulations, and double airstreams. The analysis in section [Link] may not
have found evidence for robust periodicity, but it did find that there are far more minimal
sound pairs in this beatboxer’s inventory than if the sounds were carefully arranged to
avoid minimal pairs. Changing one phonetic property of a beatboxing sound may
change the meaning of that sound just as changing one phonetic property of a word may
change the meaning of the word (e.g., changing the nasality of the final sound in “ban”
[bæn] results in “bad” [bæd]). In this sense, yes: the sounds of beatboxing are in a
contrastive relationship with one another.
Because the sounds of this beatboxer are not arranged very periodically, the contrasts
are not as neatly arranged as they are in speech. But even in speech contrast is a gradient
rather than a categorical phenomenon (Hockett, 1955): sounds may encode contrasts to
different degrees depending on the phonetic dimensions involved (e.g., the laterality of [l] in
English applies to only that one sound) or their role in larger constructions (e.g., [ŋ] is
contrastive only word-medially and word-finally). Beatboxing sounds can contrast with each
other to different degrees as well, consistent with a gradient notion of contrast in a
sound system.
Less clear is whether the differences between beatboxing sounds are also cognitive
differences. The answer depends in part on whether beatboxing sounds have phonological
patterning that is predictable based on certain phonetic dimensions. For example, velum
lowering is generally considered a cognitive gesture for nasality because nasality is active in
phonological behavior (e.g., spreading in phonological harmony); the velum raising that
makes oral sounds possible, on the other hand, is often considered inert because it does not
appear to play a role in phonological behavior. Whether any of the combinatorial dimensions
of beatboxing are cognitive in this sense remains an open question. There is, however, reason
to doubt that beatboxing sounds are built from the phonological units of language
(cf. the vocal art form scatting, which does draw on phonological well-formedness conditions
for the production of non-linguistic music; Shaw, 2008). For one thing, the lack of vowels
precludes the possibility that the near-universal CV syllable could exist in beatboxing. For
another, if beatboxing sounds were composed of linguistic phonological units then there
should be no beatboxing sounds built on airstream mechanisms that do
not exist in language either (Eklund, 2008; cf. Hale & Nash, 1997 for lingual egressive sounds
in Damin).
Even so, we have seen conspicuous overlap between the combinatorial phonetic
dimensions of beatboxing and speech: shared constrictors and
constriction degrees, some use of voicing and laterality, and overlapping airstream
mechanisms. This overlap may follow from the shared physical instrument rather than from
shared linguistic (or beatboxing) cognition. For example, the Quantal Theory (Stevens, 1989; Stevens & Keyser,
2010) deduces common phonological features by searching for regions in the vocal tract that
afford stable relationships between articulation and acoustics; the apparent universality of
features in speech is thus explained as arising from humans sharing the same vocal tract
physiology. But the relationship between articulation and acoustics in the vocal tract is not
special to speech—it is simply a property of the human vocal instrument, and so could just as
easily apply to beatboxing. The prediction would be that beatboxing and speech would share
many of the same phonetic features, which is indeed what we found here. Auditory theories
of features appeal to general properties of the auditory system and could extend to
beatboxing along similar lines. Chapter 4: Theory develops an account of
beatboxing which capitalizes on the domain-general properties the systems share. That
chapter also includes a brief discussion of how a gestural description might encode
beatboxing contrast more effectively than the procrustean tables of sounds used here. Since
speech and beatboxing units are informationally unrelated to each other, purely
domain-specific theories of phonology cannot offer any explanation for why beatboxing and
speech share so much phonetic structure.
CHAPTER 4: THEORY
This chapter introduces a theoretical framework under which speech and beatboxing
phonological units are formally linked. Specifically, in the context of the task-dynamics
framework of skilled motor control, speech and beatboxing are argued to have atomic units
that share the same graph (that is, the same fundamental architecture) but may differ
parametrically in task-driven ways. Under the hypothesis from Articulatory Phonology that
dynamically defined gestures are the fundamental cognitive units of
language, the graph-level link between speech and beatboxing actions becomes a cognitive
relationship. This cognitive link permits the formation of hypotheses about similarities and
differences between the two behaviors.
1. Introduction
So far, no theory has formalized a relationship
between the atoms of speech and beatboxing or their organization. This chapter aims to
sketch such a theory of beatboxing fundamental units and their organization that can
accommodate speech and beatboxing alike.
Dynamical systems are here used as the basis for understanding beatboxing units and
organization. The framework of task dynamics (Saltzman & Munhall, 1989) is commonly
used in Articulatory Phonology (Browman & Goldstein, 1986, 1989) to model the
coordination of a set of articulators in achieving the motor tasks (gestures) into which
speech can be decomposed; these gestures are identified
with the fundamental cognitive units of speech. The coordination of the multiple units
composing speech is in turn modeled by coupling the activation dynamics of these units
(Nam & Saltzman, 2003; Goldstein et al., 2009; Nam et al., 2009). But task dynamics and the
coupling model are not speech-specific; they are inspired by nonlinguistic behaviors and can
be used to model any skilled motor task. Section 2 introduces concepts from dynamical
systems that will be the foundation of the link between speech and beatboxing. Section 3
argues that beatboxing sounds may be composed of gestures, and section 4 illustrates the
specific hypothesis that the fundamental units of beatboxing and speech share the same
domain-general part of the equations of task dynamics (the graph level). This establishes a
formal link between the cognitive units of speech and beatboxing that can serve as the basis
for further hypotheses.
Articulatory Phonology hypothesizes that the fundamental units of phonology are action
units called “gestures” (Browman & Goldstein, 1986, 1989). Unlike symbolic features which
make no reference to time and only reference the physical vocal tract abstractly (if at all),
gestures as phonological action units vary in space and over time according to an invariant
differential equation (Saltzman & Munhall, 1989) that predicts directly observable
consequences in the vocal tract. While a gesture is active, it exerts control over a vocal tract
task variable (e.g., lip aperture) through coordinated activity in a set of articulators, in order
to accomplish some phonological task (e.g., a complete labial closure for the production of a
bilabial stop). Many phenomena that are stipulated through computational processes in other
models emerge naturally in this dynamical framework.
Section 2.1 describes dynamical systems in terms of state, parameter, and graph levels.
Section 2.2 explains different point attractor dynamical systems and the usefulness of point
attractor dynamics for modeling speech.
The dynamical systems used to model phonological units and their organization can be
characterized with three levels: the state level, the parameter level, and the graph level
(Farmer, 1990; Saltzman & Munhall, 1992; see Saltzman et al., 2006 for a more thorough
treatment). Equation 1 is the equation of a damped mass-spring and is commonly used as
the basic equation for gestures in Articulatory Phonology.

Equation 1. 𝑥̈ = −𝑏𝑥̇ − 𝑘(𝑥 − 𝑥0)

State level. In Equation 1, the variables 𝑥, 𝑥̇, and 𝑥̈ all encode the instantaneous value of the
state variable(s) of the system: the first represents its position, the second represents its
velocity, and the third represents its acceleration. The state variables generally are the vocal
tract task variables referred to above, such as the distance between the tongue body and the
palate or pharynx (tongue body constriction degree) or the distance between the upper and
lower lip (lip aperture). The values of those state variables change continuously as vocal tract
Parameter level. The task goal of the system (𝑥0) is defined at the parameter level: it
does not change while this gesture is active. Other parameters in this equation are 𝑏 (a
damping coefficient) and 𝑘 (which determines the stiffness of the system—that is, how fast
the system moves toward its goal). Each phonological gesture is associated with its own
distinct parameters. For example, the lip aperture gestures for a voiceless bilabial stop [p] and a
voiceless bilabial fricative [ɸ] are different primarily in their aperture goal 𝑥0: the goal of the
stop is lip compression (parameterized as a negative value for 𝑥0), while the goal of the
fricative is a light closure or slight space between the lips (parameterized as a value for 𝑥0
near 0). (Parameter values change more slowly over time, such as when a person moves to a
new community and adapts to a new variety of their language). Thus, Equation 1 states that a
fixed relation defined by the phonological parameters holds among the physical state
variables at every moment in time that the gesture is active. This fixed relationship defines a
phonological unit.
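The behavior of Equation 1 can be illustrated with a minimal forward simulation. This is an illustrative sketch, not the dissertation's model: it integrates the critically damped mass-spring with simple Euler steps and shows equifinality, with the same goal reached from different initial states; the parameter values are arbitrary illustration units.

```python
def gesture_endpoint(x, x0, k=100.0, dt=0.001, steps=1500):
    """Euler integration of Equation 1, x'' = -b*x' - k*(x - x0),
    with critical damping b = 2*sqrt(k) (unit mass assumed)."""
    b = 2.0 * k ** 0.5
    v = 0.0  # movements start from rest
    for _ in range(steps):
        a = -b * v - k * (x - x0)  # damping force plus spring restoring force
        v += a * dt
        x += v * dt
    return x

# Equifinality: different starting positions, same goal state x0 = -1.0
# (a negative aperture goal, i.e., compression).
print(round(gesture_endpoint(x=10.0, x0=-1.0), 3))  # -> -1.0
print(round(gesture_endpoint(x=4.0, x0=-1.0), 3))   # -> -1.0
```

Because the goal 𝑥0 lives at the parameter level, changing only that one number turns the stop-like compression goal into a fricative-like near-zero aperture goal while the rest of the system stays fixed.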
Graph level. The graph level is the architecture of the system. Part of the architecture
is the relationship between states and parameters in an equation (Saltzman et al., 2006). For
example, notice that the term for the spring restoring force 𝑘(𝑥 − 𝑥0) in the mass-spring
system above is subtracted from the damping force 𝑏𝑥̇; if it were added instead, that would
be a change in the graph level of this dynamical system. The system’s graph architecture also
includes the number and composition of the equations in a system (Saltzman & Munhall,
1992). With respect to speech, composition crucially includes the specification of which tract
variables and articulators participate in a given gesture.
Different graphs can result in qualitatively different behaviors. Changing the number
of equations in a system can create entirely different sounds. For example, the graph for an
oral labial stop [b] uses a lip aperture tract variable, but the graph for a nasal labial stop [m]
uses tract variables for both lip aperture and velum position. Alternatively, changing the
relationship between terms in an equation can affect how the same effector moves. Equation
2 shows the graph of a periodic attractor (Saltzman & Kelso, 1987); this type of dynamical
system describes the behavior of a repetitive action like rhythmic finger tapping or turning a
crank, which is qualitatively different from a point attractor system with a goal of a single
point in space. The graph for the periodic attractor in Equation 2 is modified from the
damped mass-spring system in Equation 1 by the addition of the term 𝑓(𝑥, 𝑥̇) which adds or
removes energy from the system to sustain the intended movement amplitude.

Equation 2. 𝑥̈ = −𝑏𝑥̇ − 𝑘(𝑥 − 𝑥0) + 𝑓(𝑥, 𝑥̇)
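The qualitative effect of a graph change can be demonstrated numerically. In the sketch below, the same simulation either runs Equation 1 unchanged (a point attractor) or swaps the damping term for a van der Pol-style escapement, one possible realization of the 𝑓(𝑥, 𝑥̇) term; the specific form and constants are illustrative assumptions, not values from the text.

```python
def residual_motion(periodic, steps=40000, dt=0.001):
    """Integrate x'' = -b*x' - k*(x - x0), optionally replacing the damping
    term with a van der Pol-style escapement f(x, x') = mu*(1 - x^2)*x'.
    Returns |x| + |v| after a long run as a crude 'still moving?' measure."""
    k, x0, mu = 100.0, 0.0, 5.0
    b = 2.0 * k ** 0.5
    x, v = 2.0, 0.0
    for _ in range(steps):
        if periodic:
            # escapement: pumps energy in at small |x|, removes it at large |x|
            a = mu * (1.0 - x * x) * v - k * (x - x0)
        else:
            a = -b * v - k * (x - x0)
        v += a * dt
        x += v * dt
    return abs(x) + abs(v)

print(residual_motion(periodic=False))  # point attractor: settles at the goal
print(residual_motion(periodic=True))   # periodic attractor: keeps cycling
```

The two runs share every parameter; only the graph differs, yet one system comes to rest at its goal while the other sustains a repetitive movement, mirroring the point-attractor/periodic-attractor contrast described above.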
Taken together, state, parameter, and graph levels characterize a dynamical system. In
Articulatory Phonology and task-dynamics, the mental units of speech and their organization
are dynamical systems, and so the state, parameter, and graph levels characterize a speaker’s
phonology. Table 21 summarizes the roles of state, parameter, and graph levels in gestures.
Table 21. Non-exhaustive lists of state-, parameter-, and graph-level properties for dynamical
systems used in speech.
State level Parameter level Graph level
Vocal tract movements in speech have certain characteristics that suggest what an
appropriate dynamical model would be. Speech is produced as an
overlapping series of constrictions and releases in the vocal tract; each constriction affects
the acoustic properties of the vocal instrument in a specific, unique way such that
constrictions of different magnitudes and at different locations in the vocal tract create
distinctive acoustic signals. Each speech action can therefore be characterized as having a
relatively fixed spatial target for the location and degree of constriction. Moreover, speech
movements exhibit equifinality: they reach their targets regardless of the initial states of the
articulators creating the constriction or perturbations from external forces—as long as there
is enough time for the constriction to be completed and the same articulators are not being
used by a competing gesture. Figure 74 shows
how position and velocity change as a function of time during a spoken labial closure.
Figure 74. A lip closure time function for a spoken voiceless bilabial stop [p], taken from
real-time MRI data.
Point attractor dynamical systems generate precisely these qualities. Several different
differential equations can be used to model point attractor dynamics, and their goodness of
fit to the data can be assessed by comparing the model kinematics against the real-world
kinematics. For example, consider the first-order point attractor in Equation 3 in which 𝑥 is
the current spatial state of the system, 𝑥̇ is the system velocity, 𝑥0 is the system’s spatial
target, and 0 < 𝑘 < 1 is a constant that determines how quickly the system state changes.

Equation 3. 𝑥̇ = −𝑘(𝑥 − 𝑥0)

Regardless of the starting value of 𝑥, the state always moves (asymptotically) toward 𝑥0—that
is, the state is attracted to the single point 𝑥0.
Figure 75. Schematic example of a spring restoring force point attractor.
But comparing the spring restoring force (Figure 75) to an actual speech movement (Figure
74) reveals that the details of the velocity state variable for this first-order point attractor are
not a good fit for speech kinematics. Speech movements generally start with 0 velocity and
have increasing velocity until they reach peak velocity sometime in the middle of the
movement trajectory, but this first-order spring system in Equation 3 begins at maximum
velocity and has only decreasing velocity over time. A kinematic profile that starts at peak
velocity is not an accurate portrayal of speech kinematics which tend to start at 0 velocity.
A better choice for modeling the dynamics of speech atoms is the damped
mass-spring system from Equation 1. When critically damped (𝑏 = 2√𝑘), the damped
mass-spring system acts as a point attractor: regardless of the initial starting state 𝑥, the state
of the system will converge toward its goal 𝑥0 and stay at that goal for as long as the system is
active. The position time series for a critically damped mass-spring system (Figure 76) results
in a velocity profile broadly similar to the observed speech movement
(Figure 74): velocity starts at 0, increases until it peaks, then gradually decreases again as 𝑥
approaches 𝑥0. However, the observed speech movement exhibits a more
symmetric velocity profile with the peak velocity about halfway through the gesture; the time
of peak velocity for the mass-spring equation in Equation 1 (Figure 76) is much earlier in the
movement.
A third point attractor with a different graph is the damped mass-spring system with a “soft
spring” that has been suggested to more accurately model the kinematics of vocal movement
(Sorensen & Gafos, 2016). This equation (Equation 4) has the same pieces as Equation 1,
plus a cubic term 𝑑(𝑥 − 𝑥0)³ that weakens the spring restoring force when the current state
is relatively far from the target state. In other words, the system won’t move as quickly while
it is still far from its goal.

Equation 4. 𝑥̈ = −𝑏𝑥̇ − 𝑘(𝑥 − 𝑥0) + 𝑑(𝑥 − 𝑥0)³
Figure 77. Schematic example of a critically damped mass-spring system with a soft spring.
One of the most noticeable differences between the damped mass-spring systems with
(Figure 77) and without (Figure 76) the soft spring is the difference in the relative timing of
peak velocity. Both systems start out with 0 velocity and gradually increase velocity until
velocity reaches its peak; however, the system with the soft spring reaches its peak velocity
later than the system without the soft spring, which Sorensen and Gafos (2016) show is a
better fit to speech data (compare for example against the speech labial closure in Figure 74).
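The timing difference between the two graphs can be checked with the same kind of simulation: integrate each system from rest and record when speed peaks. This is an illustrative sketch; the soft-spring coefficient below is an arbitrary value chosen for demonstration, not one fitted to speech data.

```python
def time_of_peak_speed(d=0.0, k=100.0, dt=0.0005, steps=4000):
    """Euler integration of x'' = -b*x' - k*(x - x0) + d*(x - x0)**3,
    i.e., Equation 1 when d = 0 and the soft-spring system when d > 0.
    Returns the time (s) at which |velocity| is greatest."""
    b = 2.0 * k ** 0.5
    x, v, x0 = 1.0, 0.0, 0.0
    peak_t, peak_speed = 0.0, 0.0
    for i in range(steps):
        a = -b * v - k * (x - x0) + d * (x - x0) ** 3
        v += a * dt
        x += v * dt
        if abs(v) > peak_speed:
            peak_speed, peak_t = abs(v), (i + 1) * dt
    return peak_t

plain = time_of_peak_speed(d=0.0)   # critically damped mass-spring
soft = time_of_peak_speed(d=60.0)   # with an illustrative soft-spring term
print(soft > plain)  # the soft spring delays the velocity peak
```

Because the cubic term weakens the restoring force far from the target, velocity builds more slowly at first and peaks later in the movement, which is the kinematic signature at issue here.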
The critically-damped mass-spring system without the soft spring can result in this
kinematic profile if gestures have ramped activation—that is, rather than treating gestures as
if they turn on and off like a light switch, increasing a gesture’s control over the vocal tract
gradually like a dimmer switch also delays the time to peak velocity (Kröger et al., 1995; Byrd
& Saltzman, 1998, 2003). Sorensen & Gafos (2016) argue that the dynamical system with the
soft spring term should be preferred to the simpler damped mass-spring system with ramped activation.
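The difference in time to peak velocity can be checked numerically. The following sketch integrates Equation 1 (d = 0) and Equation 4 (d > 0) with forward Euler; the parameter values are arbitrary illustrations, not values fitted to articulatory data.

```python
import numpy as np

def simulate(k=100.0, d=0.0, x0=1.0, dt=1e-4, t_max=1.0):
    """Integrate x'' = -b x' - k (x - x0) + d (x - x0)^3 with forward Euler.
    b is set for critical damping of the linear spring: b = 2 sqrt(k)."""
    b = 2.0 * np.sqrt(k)
    x, v = 0.0, 0.0
    vs = []
    for _ in range(int(t_max / dt)):
        a = -b * v - k * (x - x0) + d * (x - x0) ** 3
        v += a * dt
        x += v * dt
        vs.append(v)
    return x, np.array(vs)

x_lin, v_lin = simulate(d=0.0)     # Equation 1: linear spring
x_soft, v_soft = simulate(d=60.0)  # Equation 4: soft spring (d < k keeps it stable here)

# index of peak velocity comes later for the soft spring
print(np.argmax(v_lin), np.argmax(v_soft))
```

Both runs converge on the goal, but the soft spring weakens the early restoring force, so velocity builds more slowly and peaks later, as in the Sorensen & Gafos result.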
Details of equation architecture (graph) aside, point attractor dynamical systems are
useful as speech units for many reasons. A variety of speech phenomena can be accounted
for by specifying the temporal interval during which a point attractor exerts control in the
vocal tract. For example, if a gesture’s dynamical system does not last long enough for the
gesture to reach its goal, gestural undershoot might lead to phonological alternation, e.g.,
between a stop and a flap or fricative (Parrell & Narayanan, 2018). Alternatively, if a gesture
remains active after reaching its target state, the gesture is prolonged; this is one account of
the articulation of geminates (Gafos & Goldstein, 2011). The temporal coordination of two or
more gestures can result in spatio-temporal overlap that may account for certain types of
phonological contrasts and alternations (Browman & Goldstein, 1992). Some types of
phonological assimilation, harmony, and epenthesis can be described as resulting from the
temporal overlap of gestures (Browman & Goldstein, 1992). And when two gestures are active
at the same time over the same vocal tract variable(s), those gestures blend together,
resulting in coarticulation.
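The undershoot and prolongation effects of activation timing can be illustrated with the same point attractor. In this sketch, a deactivated gesture simply stops moving the tract variable, which is a simplification (in the full model a neutral attractor or competing gesture would take over); parameter values are arbitrary.

```python
import numpy as np

def final_position(activation_dur, x0=1.0, k=200.0, dt=1e-3, t_max=1.0):
    """Critically damped point attractor that controls the tract variable
    only while active; after deactivation the state is held (a simplification)."""
    b = 2.0 * np.sqrt(k)
    x, v = 0.0, 0.0
    for step in range(int(t_max / dt)):
        if step * dt < activation_dur:
            a = -b * v - k * (x - x0)
            v += a * dt
            x += v * dt
    return x

short = final_position(0.05)  # deactivated early: undershoots the goal
full = final_position(0.80)   # active long enough to attain the goal
print(short, full)
```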
All in all, point attractor topologies are advantageous as models of gestures for a
variety of reasons. Section 4 argues that the point attractor system underlying speech gestures is shared by beatboxing actions as well.
Beatboxers have mental representations of beatboxing sounds. Beatboxers are highly aware
of the beatboxing sounds in their inventory and the differences between many of those
sounds. They give names to most of the sounds—though the names may differ from language
to language or beatboxer to beatboxer as they do for the “power kick” (Paroni et al., 2021)
and “kick drum”, which are both names for a bilabial ejective that fills a particular musical
role; beatboxers who want to learn a sound they heard someone else perform first need to
know the name of the sound so they can ask for instruction. The naming and identification
of beatboxing sounds suggest that skilled beatboxers can distinguish a wide variety of vocal
articulations within the context of beatboxing. (Unsupervised classifier models can also
reliably group beatboxing sounds into clusters based on the acoustic signal via MFCCs;
Paroni et al., 2021). As distinct objects, each associated with some meaning and available to
explicit reference and reflection, beatboxing sounds behave like mental representations.
The next step is to describe beatboxing in terms of gestures rather than more traditional phonetic dimensions. Gestures
in speech are cognitive representations that fill the role of the most elementary abstract,
compositional units of phonological information. Information implies phonological contrast:
changing, removing, adding, or re-timing a gesture can often change the meaning of a word.
Chapter 3: Sounds argued that beatboxing sounds are composed of smaller units in the same
way that speech sounds are—though without the same degree of periodicity/feature
economy—and that there are a not insubstantial number of minimal sound pairs for which
changing one articulator task, one gesture, can change the meaning of the sound.
The sounds of beatboxing can be described without special phonetic notation (though foregoing symbols and charts sacrifices some brevity). But the
actual number of tract variable gestures used to create the different beatboxing sounds is
relatively small. Sounds like the D Kick Roll and Open Hi-Hat do not fit into a table just
because they use too many tract variables to conveniently fit in any one place in a table.
Others like the Water Drop (Air) re-use the same tract variable multiple times (in this case,
the tongue body constriction location). A gestural approach to describing beatboxing can
account for these types of cases in a way that looking at the sounds in a table cannot. If the
number of tract variables used really is small, then a gestural perspective might even show
that the periodicity of beatboxing sounds is comparable to the periodicity of speech sounds.
Establishing that gestures are units of contrastive information in the cognitive system underlying beatboxing would require a more complete
inventory of contrastive beatboxing gestures as well as evidence that these gestures play a
role in defining natural classes for characterizing beatboxing alternations (on analogy to
the role of features and gestures in speech phonology). Compiling a
complete inventory is beyond the scope of this work, and the task might in principle be
impossible—beatboxing sounds appear to be an open set, and the gestures themselves might
vary from beatboxer to beatboxer.
That said, it is possible to hazard some educated guesses about what a set of
beatboxing gestures might include. Frequently-used constrictors like the lips and the tongue
tip are likely to be associated with beatboxing gestures at compressed and contacted
constriction degrees (that can be encoded in task dynamics as constriction degree goals of a
negative value and 0, respectively). Since beatboxing involves a wider range of mechanisms
for controlling pressures in the vocal tract than does speech, contrasting gestures to initiate
pressure changes in the vocal tract would seem to be required: pulmonic, laryngeal, and
lingual tasks, along with contrasts in the goal value of such gestures (increased or decreased
pressure). Voicing seems to make a difference in some sounds too, and at the very least is the source of audible contrasts between otherwise similar sounds.
Suspiciously, almost all of these hypothetical gestures use the same vocal tract
variables as speech and with ultimately similar pressure control aims, though not necessarily
with speechlike targets or configurations. A pulmonic task (for increased vs. decreased
pressure) is the only one of these not attested to be used contrastively in speech, although
pulmonic ingressive airflow is used somewhat commonly around the world as a sort of
pragmatic contrast to non-ingressive speech (Eklund, 2008). But the point is not that speech
and beatboxing are built on the same set of gestures—the contrastive use of pulmonic
airstreams as well as the use of lateral labial constrictions (not reported in this dissertation
because there was only a midsagittal view) rules out the possibility that beatboxing is a
reconfiguration of speech units. Rather, the point is that a gesture-based account may work
well for both speech and beatboxing. In this sense, just as Articulatory Phonology is based
around the hypothesis that gestures are the fundamental units of speech production and
perception, this dissertation proposes that actions controlled by task dynamic differential equations are the fundamental units of beatboxing
production and perception. The next sections are dedicated to developing an understanding
of how gestures are recruited differently for speech and beatboxing while simultaneously
linking the two domains through the potential of the vocal instrument.
The state, parameter, and graph levels of the differential equations in task dynamics provide
an explicit way to formally compare and link speech and beatboxing sounds. Beatboxing and
speech actions use the same vocal tract articulators to create sound, which means they are
constrained by the same physical limits of the vocal apparatus. In task dynamics, these
limitations constrain the graph dynamics and parameter space of actions available to a vocal
performer. The distinct goals of speech and beatboxing further shape
and refine each domain’s graph dynamics and parameter space; even so, as this section
argues, the actions in both domains appear to use the same point attractor topologies, tract
variables, and coordination, all of which indicate that speech and beatboxing share the same
graph.
These graph properties appear to be shared by speech and beatboxing: the individual actions
are point attractors (section 4.1.1) operating mostly over the same tract variables as speech
gestures (section 4.1.2) with similar timing relationships (section 4.1.3). In addition, coupled
oscillator models of prosodic structure have been used to account for both speech and
musical timing, making them a good fit for beatboxing as well (section 4.1.4).
Point attractors have been used as models of action units for behaviors other than speech,
even behaviors without the kinds of phonological patterns that speech has (Shadmehr, 1998;
Flash & Sejnowski, 2001). Goldstein et al. (2006:218) “view the control of these units of
Beatboxing and speech sounds leverage the same vocal tract physics: wider
constrictions (as in sonorants) alter the acoustic resonances of the vocal tract, and narrower
constrictions or closures obstruct the flow of air to create changes in mean flowrate and
intraoral pressure and generate acoustic sources. Moreover, beatboxing and speech both have
discrete sound categories, like a labial stop [p] in speech and a labial stop Kick Drum {B} in
beatboxing. Creating discrete sounds requires vocal constrictions with specific targeted
constriction locations and degrees (Browman & Goldstein, 1989). As discussed in section 2.2,
beatboxing movements exhibit the same qualitative kinematics created by speech point attractor gestures: they start slow, increase velocity until a
peak somewhere near the middle of the movement, and slow down again as the target
constriction is attained (Figure 78). This suggests that beatboxing actions share both the graph topology and the qualitative kinematics of speech gestures.
Part of the graph level is the specification of which task variables are active at any time.
Beatboxing and speech operate over the same vocal tract organs and therefore have access to
(and are limited to) the same vocal tract variables. In Chapter 3: Sounds it was established
that many beatboxing sounds resemble speech sounds in constriction degree and location.
The specific tract variables used by each behavior may not completely overlap. Beatboxers,
for example, sometimes use lateral bilabial constrictions, but there are no speech task
variables for controlling laterality—due partly to the difficulty of acquiring relevant lateral
data to know what a lateral task variable might be, but also to the fact that laterals in speech
are always coronal and can be modeled by adding an appropriate dorsal gesture to a coronal
gesture. Such a strategy would not work for modeling lateral labials. Overall, though, speech
and beatboxing movements are more similar than they are different.
Figure 78. Position and velocity time series for labial closures for a beatboxing Kick Drum {B}
(left) and a speech voiceless bilabial stop [p] (right). Movements were produced by the same
individual, tracked using the same rectangular region of interest that encompassed both the
upper and lower lips. Average pixel intensity time series in the region of interest were
smoothed using locally weighted linear regression (kernel = 0.9; Proctor et al., 2011; Blaylock,
2021), and velocity was calculated using the central difference theorem as implemented in
the DelimitGest function (Tiede, 2010). Both movements were extracted from a longer,
connected utterance (a beatbox pattern with the Kick Drum and the phrase “good pants”
from a sentence produced by the same beatboxer). See Chapter 2: Method for details of data
acquisition.
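For concreteness, the central-difference velocity estimate mentioned in the caption can be sketched as follows. The edge handling here is an assumption; Tiede's DelimitGest implementation is not reproduced.

```python
import numpy as np

def central_difference(y, dt):
    """Estimate velocity from a position time series:
    v[i] = (y[i+1] - y[i-1]) / (2 * dt), with one-sided differences at the edges."""
    y = np.asarray(y, dtype=float)
    v = np.empty_like(y)
    v[1:-1] = (y[2:] - y[:-2]) / (2.0 * dt)
    v[0] = (y[1] - y[0]) / dt
    v[-1] = (y[-1] - y[-2]) / dt
    return v

# sanity check: a linear position ramp has constant velocity
t = np.linspace(0.0, 1.0, 101)
v = central_difference(2.0 * t, dt=t[1] - t[0])
```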
The physics of sound manipulation in the vocal tract are the same for speech and
beatboxing: different constriction magnitudes and locations along the vocal tract result in
different acoustics. Some regions of the vocal tract are more stable than others, meaning that
variation of constriction location within some regions results in little acoustic change; these
stable regions are argued to shape the set of distinctive contrasts in a language so that
coarticulation does not dramatically alter the acoustic signal and lead to unwanted percepts
(Stevens, 1989; Stevens & Keyser, 2010). Though beatboxing does not have linguistically
contrastive features to convey, parity must still be achieved between an expert beatboxer and
a novice beatboxer in order for learning to occur. Beatboxers exploit the same vocal physics
to maximize transmission of the beatboxing signal, resulting in beatboxers leveraging the same acoustically stable regions of the vocal tract that speakers do.
The relative timing of two speech gestures can make a meaningful difference within a word.
For example, the timing of a velum lowering gesture makes all the difference between “mad”
[mæd] (velum timed to lower at the beginning of the word) and “ban” [bæn] (velum timed to
lower closer to the end of the word). Timing between gestures can be contrastive even
within a single segment, like the relative timing of the oral closure gesture and laryngeal
lowering that distinguishes voiced plosives and voiced implosives (Oh, 2021).
In Articulatory Phonology, the relative timing of gestures is controlled by systems of
coupled periodic timing oscillators or “clocks” (Nam & Saltzman, 2003; Goldstein et al.,
2009; Nam et al., 2009). While a clock is running, its state (the phase of the clock)
continually changes just like the hands on a clock move around a circle. These clocks are
responsible for triggering the activation of their associated gesture(s) in time; the triggering
occurs when a clock’s state is equal to a particular activation phase. Thinking back to the
graph level, coupling two oscillators means that the dynamical equation for each oscillator
includes a term corresponding to the state (phase) of the oscillator(s) to which it is coupled;
thus, the phase of each oscillator at any time depends on the phases of the other oscillators
to which it is coupled. This coupling underlies the
model of intergestural timing: the phases of coupled clocks settle into different modes like
in-phase (0 degree difference in phase) or anti-phase (180 degree difference in phase) that
result in gestures being triggered synchronously or sequentially, respectively. The state,
parameter, and graph components of the coupled oscillator model are given in Table 22.
Table 22. Non-exhaustive lists of state-, parameter-, and graph-level properties for coupled
timing oscillators (periodic attractors).
State level | Parameter level | Graph level
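The settling into in-phase and anti-phase modes can be illustrated with two coupled phase oscillators. The sine coupling below is a common textbook form chosen for illustration; it is not necessarily the exact coupling function of the cited models.

```python
import numpy as np

def settled_relative_phase(target, omega=2 * np.pi, a=8.0, dt=1e-3, steps=5000):
    """Two coupled phase oscillators; the coupling drives th2 - th1
    toward the target relative phase (0 = in-phase, pi = anti-phase)."""
    th1, th2 = 0.0, 2.0  # arbitrary initial phases
    for _ in range(steps):
        err = np.sin(th2 - th1 - target)
        th1 += (omega + a * err) * dt
        th2 += (omega - a * err) * dt
    return (th2 - th1) % (2 * np.pi)

in_phase = settled_relative_phase(0.0)      # settles near 0 (mod 2*pi)
anti_phase = settled_relative_phase(np.pi)  # settles near pi
```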
In-phase coupling between a consonant gesture and a vowel gesture results in a CV syllable
or mora; it is also used intrasegmentally for consonants with more than two gestures, for
example a voiceless stop with both an oral constriction gesture and a glottal opening gesture.
Anti-phase coupling between a vowel gesture and a following consonant gesture yields
nucleus-coda syllable structure. Anti-phase coupling may also exist in some languages
between consonants in an onset cluster, with all the consonants coupled in-phase to the
vowel but anti-phase to each other, resulting in what has been described as the C-Center effect.
The specific timing relations needed to model beatboxing are unclear at the moment,
and it is not clear if beatboxing needs a coupled oscillator model of timing per se. On the one
hand, beatboxing does not usually feature wide vowel-like constrictions, so there does not
appear to be anything quite like a CV syllable in beatboxing, much less something like a
syllable coda; in general, beatboxing sounds are coordinated with the alternating rhythmic
beats (section 4.1.4), so intergestural coupling relations might usually be relevant only among
the component gestures of a given beatboxing sound. On the other hand, there is clear
evidence for intra-segmental timing relationships that may benefit from a coupled oscillator
approach. Some of the most common beatboxing sounds are ejectives, and these require
careful coordination between the release of an oral constriction and the laryngeal
closing/raising action that increases intraoral pressure (Oh, 2021); the same is likely true for
lingual and pulmonic beatboxing sounds. In addition, some beat patterns feature two
beatboxing sounds coordinated to the same metrical beat, resulting in sound clusters like a
Kick Drum followed closely by some kind of trill. This kind of relationship between sounds
and the meter suggests that the beatboxing sounds in these clusters may be coupled with each other.
Hierarchical prosodic structure in speech has also been modeled using coupled oscillators,
including syllable- and foot-level oscillators (Cummins & Port, 1998; Tilsen, 2009; Saltzman
et al., 2008; O’Dell & Nieminen, 2009). The cyclical nature of oscillators matches the ebb and
flow of prominence in some languages, including stress languages that alternate (more or
less regularly) between strong and weak syllables. In Chapter 2: Method, it was shown that the musical meter in styles related to
beatboxing is organized into hierarchical levels of alternating strong and weak beats. Coupled
oscillators are well-suited for modeling these types of rhythmic alternations in music (e.g.,
Large & Kolen, 1994): each oscillator contributes to alternations at one level of the hierarchy,
and the oscillators to which it is coupled have either half its frequency (hierarchically
“above”, with slower alternations) or double its frequency (hierarchically “below”, with
faster alternations), yielding a stable 1:2 frequency coupling relationship between each level.
Other rhythmic structures like triplets can be modeled by temporarily changing oscillator frequency relationships.
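The stable 1:2 frequency lock between adjacent metrical levels can also be sketched numerically: a "beat"-level oscillator detuned away from twice the "foot"-level frequency is pulled into an exact 1:2 relationship by the coupling. Constants are arbitrary and the coupling form is illustrative.

```python
import numpy as np

def relative_phase_trace(detune=0.5, a=4.0, omega=2 * np.pi, dt=1e-3, steps=8000):
    """Slow oscillator near omega, fast oscillator near 2*omega + detune.
    Coupling through sin(2*th_slow - th_fast) pulls them into 1:2 lock."""
    th_s, th_f = 0.3, 1.0
    phi = []
    for _ in range(steps):
        th_f += (2 * omega + detune + a * np.sin(2 * th_s - th_f)) * dt
        th_s += (omega + 0.5 * a * np.sin(th_f - 2 * th_s)) * dt
        phi.append(th_f - 2 * th_s)
    return np.array(phi)

phi = relative_phase_trace()
# once locked, th_fast - 2*th_slow stays essentially constant
```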
Speech and beatboxing share the same set of vocal organs, each of which has its own
mechanical potential and limitations for any movement. Tasks are constrained by the
physical abilities of the effectors that implement them; in the task dynamics model, this is
represented as a constraint on the range of values of each dynamical parameter that fits into
a given graph. Therefore, speech and beatboxing share both their graph structures and a
common parameter space. In speech, the difference between
gestures that use the same tract variable is encoded by different parameter values.
Constriction degree targets, for example, distinguish manners
of articulation, with a narrow constriction target for a fricative, a lightly closed constriction
target for a trill, or a compression target for a stop. For a given sound, the selection of a tract
variable (or tract variables) and the associated learned parameter values are part of a
person’s knowledge about their language (and may differ slightly from person to person for a
given language).
On this view, the action units that become gestures are not inherently linguistic, but are harnessed by the
language-user to be used as phonological units. This is accomplished by tuning the
parameters of each unit’s dynamical system to implement the
contrasts relevant to the language being spoken. The same pre-linguistic actions can be
harnessed for non-linguistic purposes, including beatboxing; they may simply require
different parameter values. Parameter tuning is represented in
the task-dynamics framework (Saltzman & Munhall, 1989) as the specification of values at
the parameter level of a dynamical equation described in section 2.1. When a gesture is
implemented, the task-specific parameter values for that gesture are applied to the system
graph. This application is depicted in Figure 79. The point attractor graph space on the left
represents the untuned dynamical system that is (by hypothesis) the foundational structure
of every gesture (Saltzman & Munhall, 1989). Learned parameters associated with a
particular speech task are summoned from the phonological lexicon to tune the dynamical
system, like the intention to form a labial constriction for a /b/, represented in the figure as
a dark circle. The result of this tuning is a speech action—a phonological gesture.
Figure 79. Parameter values tuned for a specific speech unit are applied to a point attractor
graph, resulting in a gesture.
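The relationship in Figure 79 can be sketched as data plus an equation form: the class below is the shared graph, and its field values are the learned parameter level. All numeric values and the tract variable label are hypothetical placeholders, not fitted parameters.

```python
from dataclasses import dataclass

@dataclass
class PointAttractorGesture:
    """Shared graph: x'' = -b x' - k (x - x0), with b chosen for critical damping.
    The parameter level (k, x0, tract_variable) is what tuning supplies."""
    k: float              # stiffness
    x0: float             # target value for the tract variable
    tract_variable: str   # e.g. lip aperture, labeled "LA" here (hypothetical)

    def acceleration(self, x: float, v: float) -> float:
        b = 2.0 * self.k ** 0.5
        return -b * v - self.k * (x - self.x0)

# the same graph, tuned with domain-specific parameters (hypothetical values):
speech_b = PointAttractorGesture(k=300.0, x0=-1.0, tract_variable="LA")   # /b/ closure
kick_drum = PointAttractorGesture(k=500.0, x0=-2.0, tract_variable="LA")  # Kick Drum closure
```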
As argued above, speech and beatboxing actions can both be described as point attractors
operating over a shared set of tract variables, though the use of those tract variables
sometimes differs between the two domains. With respect to parameter tuning in
task-dynamics, this simply means that beatboxing actions use the same point attractor graph
as speech but with beatboxing-specific parameter values (Figure 80). This is one way of
establishing a formal link between beatboxing and speech in task dynamics: the atomic
actions of each behavior share the same graph, but differ by domain-specific parameter
values.2
What determines the parameter values for speech sounds and beatboxing sounds?
The answer lies in the intention behind each behavior: beatboxing actions create musical
sound while speech actions convey linguistic messages, and parameters are tuned accordingly. For example, beatboxing and some languages both feature
bilabial ejectives in their system of sounds. A beatboxing bilabial ejective is a Kick Drum, and
has a particular aesthetic quality to convey, so its labial and laryngeal gestures may have
parameter values different from those of a linguistic bilabial ejective. The set of fundamental units of beatboxing—and
the interplay between those units—arises from the interaction between the physiological
constraints of vocal sound production and the broader tasks of beatboxing, just as the
fundamental contrastive and cognitive units of speech and the interplay between those units
arise from the interaction between the same constraints and the tasks of speech. Gestures are
2 As noted earlier, an alternative hypothesis is that beatboxing is “parasitic” on speech, recombining whole
speech gestures—including existing phonological parameterizations—into the set of beatboxing sounds. This
seems unlikely because the tract variables and target values used by speech and beatboxing do not fully overlap.
Beatboxing does not adopt the speech gestures used for making approximants and vowels. More to the point,
English-speaking beatboxers use lateral labial gestures, constrictions that make trills, and a variety of
non-pulmonic-egressive airstreams, none of which are attested in the phonology of English. Even if one were to
assume an innate, universal set of phonological elements for beatboxing to pilfer from, the lack of attestation of
phonologically contrastive pulmonic ingressive and lingual egressive units rules them out from the set of
universal features—since beatboxing has them, it must have gotten them from somewhere else besides speech.
For illumination by comparison: there are vocal music genres like scatting (Shaw, 2008) that do seem to be
parasitic on speech gestures and phonological patterns; these behaviors sound speechlike, and beatboxing does
not.
a useful way of modeling this interaction in both domains because they encode both
task goals and physical dynamics. The parameter
values for a given gesture are constrained both by the physical limitations of the system and
by the demands of the task. This view echoes the
anthropophonic perspective of speech sound. The term anthropophonics originated with Jan
Baudouin de Courtenay, who distinguished between the physical, sound-producing (anthropophonic)
and the psychological (psychophonic) properties of speech sounds. Catford (1977) defines
a comparable hierarchy: the full range of human vocal
sound possibilities that can be described (general phonetics), of which the whole set of
speech possibilities is only a subset (linguistic phonetics). Lindblom (1990) adopted the
anthropophonic perspective in an effort to derive
speech from non-speech phonetic principles, specifically with respect to the question of how
to define a possible sound of speech (cf. Ladefoged, 1989). Particularly as used in the vein of
Lindblom’s work, the anthropophonic perspective treats the vocal instrument as a space of raw
sound-making potential; domain-specific tasks filter all that potential into a coherent system. The dynamical
systems perspective developed here formalizes that filtering.
5. Predictions of the shared-graph hypothesis
The argument so far is that speech and beatboxing are domain-specific tunings of a shared
graph. Moreover, by the hypothesis of Articulatory Phonology that the actions composing
speech are also the fundamental cognitive units of speech, the graph-level link between
speech and beatboxing is a domain-general cognitive link between speech and beatboxing
sounds. This is how similarities and differences between speech and beatboxing phonology
can be predicted: any phenomenon that could emerge due to the nature of the graph in one
domain is fair game for the other (but task-specific phenomena, including which units are
selected for production and the task-specific parameters of those units, are not). Likewise,
any hypotheses made about speech graphs may therefore manifest in the beatboxing graph
as well, and vice-versa. For example, the Gestural Harmony Model (Smith, 2018)
hypothesizes two new graph elements: a persistence parameter that allows a gesture to remain
active beyond its typical deactivation, and an inhibition
relationship by which one gesture inhibits the activation of another. In doing so, the model
simultaneously makes predictions about the parameter space and coupling graph options
that beatboxing has access to. It turns out that beatboxing fulfills these predictions as well.
The proposed graph-level link also introduces a new behavioral possibility: that
speech and beatboxing sounds may co-mingle and be coordinated as part of the same motor
plan. After all, no part of the framework outlined above precludes the simultaneous use of a
point attractor with speech parameters and a point attractor with beatboxing parameters.
People do not spontaneously or accidentally beatbox in the middle of a typical sentence, but
during vocal play speakers may for fun mix sounds that are otherwise unattested in their
language variety into their utterances; and beatboxers sometimes use words or phrases
as part of their music. But the clearest evidence for the existence of speech-and-beatboxing
behavior (and support for the graph-level link) is the art form known as beatrhyming, the
integration of beatboxing with sung or rapped lyrics in a single performance.
Beatrhyming shows that humans can take full advantage of the flexibility of the motor
system to blend two otherwise distinct tasks into a brand new task. Beatrhyming is discussed in more detail later in this dissertation.
There are alternatives to gestures as the fundamental beatboxing units. Paroni et al.
(2021) suggest the term boxeme be used to mean a distinct unit of beatboxing sound, analogous to the phoneme, out of which beatboxers compose their
performances; since beatboxers explicitly refer to these individual sounds in the composition
of a beat pattern, the notion seems to be that every sound that can be differentiated from
another sound (by name, acoustics, or articulation) is a boxeme candidate. Given the
evidence that beatboxing sounds are composites of smaller units, a phoneme-like boxeme
could be said to be composed of symbolic beatboxing features. (Paroni et al., 2021 do not
commit to either a symbolic or dynamical approach, and “boxeme” may simply be a useful,
theory-agnostic way to refer to a meaningful segment-sized beatboxing sound; for the
sake of argument, we assume that the clear connection to “phoneme” is meant to imply a
symbolic perspective.)
As mental representations for speech, gestures and phonemes are two very different
hypotheses for the encoding of abstract phonological information: phonemes are purely
domain-specific, abstract, symbolic representations composed of atomic phonological
features that are not deterministic with respect to the physical manifestation of a sound.
Gestures on the other hand are simultaneously abstract and concrete (domain-specific and
domain-general): each gesture specifies a physical task
that is predicted to be observably satisfied at every point in time during which a gesture is
being produced. Gestures are particularly advantageous for treating timing relationships (at
multiple time scales) as part of a person’s phonological knowledge. In this sense, the
difference between units that are both domain-specific and domain-general and units that
are purely domain-specific is critical: gestures support direct comparison between speech
and beatboxing because their partly domain-general nature creates explicit, testable links
between the domains. Symbolic boxemes and phonemes, on the other hand, have no basis
for comparison with each other, no intrinsic links to each other, and no basis for one making
predictions about the other because they are defined purely with respect to their own
domain.
CHAPTER 5: ALTERNATIONS
This section addresses whether “forced” {B} and “unforced” {b} varieties of Kick Drum are
a single sound category. It is shown that forced and unforced Kick Drums fulfill the same
rhythmic role in a beat pattern, with unforced Kick Drums generally occurring between
sounds with dorsal constrictions and forced Kick Drums generally occurring elsewhere. The
two varieties thus appear to be contextually conditioned variants of a single sound category.
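The distributional generalization can be stated as a toy predicate; the sound labels below are hypothetical stand-ins, not the corpus annotations.

```python
# hypothetical labels for sounds with dorsal constrictions
DORSAL = {"K Snare", "Clickroll", "G"}

def predicted_kick_variant(prev_sound: str, next_sound: str) -> str:
    """Predict the Kick Drum variant from its neighbors, per the generalization:
    unforced between dorsal-constriction sounds, forced elsewhere."""
    if prev_sound in DORSAL and next_sound in DORSAL:
        return "unforced"
    return "forced"
```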
The Kick Drum mimics the kick drum sound of a standard drum set. It is typically
performed as a voiceless glottalic egressive bilabial plosive, also known as a bilabial ejective
(de Torcy et al. 2013, Proctor et al. 2013, Blaylock et al. 2017, Patil et al. 2017, Underdown
2018). Figure 81 illustrates how one expert beatboxer from the rtMRI beatboxing corpus
produces a classic, ejective Kick Drum. First a complete closure is made at the lips and glottis
(Figure 81a), then larynx raising increases intraoral pressure so that a distinct “popping” sound is produced when the lips are released.
Figure 81. Forced/Classic Kick Drum. Larynx raising, no tongue body closure.
Many of the Kick Drums produced in longer utterances (“beat patterns”) were clearly identifiable as classic ejective Kick Drums during
the transcription process based on observations of temporally proximal labial closures and
larynx raisings. These Kick Drums in beat patterns qualitatively matched the production of
the Kick Drum in isolation (albeit with some quantitative differences, e.g., in movement magnitude and duration).
However, some sounds produced with labial closures in the beat patterns of this data
set did not match the expected Kick Drum articulation—nor were they the same as other
labial articulations like the PF Snare (a labio-dental ejective affricate) or Spit Snare (a
buccal-lingual egressive bilabial affricate). These “mystery” sounds had labial closures and
release bursts most similar to those of the Kick Drum, but were generally produced with a
tongue body closure and without any larynx raising. These differences are visible in a
comparison of Figure 81 (the Kick Drum) with Figure 82 (the mystery labial): in Figure 81,
the tongue body never makes a constriction against the palate or velum, and a bright spot at
the top of the trachea indicates that the vocal folds are closed; but in Figure 82, the tongue
body is pressed against a lowered velum, and the lack of a bright spot indicates that the vocal
folds are open.
Figure 82. Unforced Kick Drum. Tongue body closure, no larynx raising.
Based both on consultation with beatboxers and on the analysis that follows below, this
mystery labial sound has been identified as what is known in the beatboxing community as
an “unforced Kick Drum”—a “weaker” alternative to the more classic ejective “forced” Kick
Drum, and one which does not have a common articulatory definition (unlike the forced
Kick Drum, which beatbox researchers have established is commonly an ejective) (Tyte &
SPLINTER, 2014; Human Beatbox, 2018). Given the clear dorsal closure, one might expect
that the unforced Kick Drum would be performed as a lingual (velaric) ingressive (clicklike)
or egressive sound. However, preliminary analysis suggests that the unforced Kick Drum is a
“percussive” (Pike, 1943), a term referring to the lack of an ingressive or egressive airstream during the
production of such sounds (not to be confused with their role in musical percussion). Figure
83 illustrates this via comparison to the Spit Snare, a lingual egressive bilabial sound: the Spit
Snare reduces the volume of the chamber in front of the tongue through tongue fronting and
jaw raising (Figure 83, left), whereas the unforced Kick Drum does neither (Figure 83, right).
Figure 83. Spit Snare vs Unforced Kick Drum. The Spit Snare (left) and unforced Kick Drum
(right) are both bilabial obstruents made with lingual closures. The top two images of each
sound are frames representing time of peak velocity into the labial closure and initiation of
movement out of the labial closure (found with the DelimitGest function of Tiede [2010]).
The difference between frames (bottom) was generated using the imshowpair function in
MATLAB’s Image Processing Toolbox. In both images, purple pixels near the lips indicate
that the lips are closer together in the later frame than in the first. For the Spit Snare, the
purple pixels near the tongue indicate that the tongue moved forward between the two
frames, and the green pixels near the jaw indicate that the jaw rose. For the unforced Kick
Drum, the relative lack of color around the tongue and jaw indicates that the tongue and jaw
did not move much between these two frames.
Not all beatboxers appear to be aware of the distinction between forced and unforced Kick
Drums—or if they are aware, they do not necessarily feel the need to specify which type of
Kick Drum they are using. Hence, while the expert beatboxer in this study did not identify
the difference between forced and unforced Kick Drums and chose to produce only forced
Kick Drums in isolation, they made liberal use of both Kick Drum types in beat patterns.
For another example of beatboxers not distinguishing between forced and unforced
Kick Drums: during an annotation session in the early days of this research, a beatboxer
demonstrated a beat pattern featuring only sounds with dorsal articulations (a common
strategy used for the practice of phonating while beatboxing, as discussed in Chapter 6:
Harmony). In the beat pattern, she produced several of what we now recognize as unforced
Kick Drums—sounds that act as Kick Drums but have a dorsal articulation instead of an
ejective one. But when asked to name the sound, she simply called it “a Kick Drum,” not
specifying whether it was forced or unforced and apparently not noticing (or caring about) the difference.
The parallels to similar observations about speech are striking. English speakers who
have a sense that words are composed of sounds can often recognize the existence of a
category of sounds like /t/, but may not be aware that it manifests differently (sometimes aspirated, sometimes flapped, sometimes unreleased) depending on its phonetic
environment. In the same way, beatboxers are aware of the Kick Drum sound category but
may not always be aware of the different ways it manifests in production. In symbolic
approaches to phonology, this type of observation has been used to argue for the existence of abstract phonemes that surface as context-dependent allophones. In Articulatory Phonology, the same observation is instead attributed to gestural
overlap: instead of categorical changes from one allophone to another depending on context,
the gestures for a given sound are invariant and only appear to change when co-produced
with gestures from another sound (Browman & Goldstein, 1992; see Gafos & Goldstein, 2011
for a review). In either approach, there is a single sound category (a phoneme or gestural
constellation) the manifestation of which varies predictably and unconsciously based on the surrounding context.
Do beatboxers treat forced and unforced Kick Drums as alternate forms of the same
sound category? If so, forced and unforced Kick Drums would be expected to be members of
the same class of sounds and to occur in complementary distributions conditioned by their
phonetic environments. An Articulatory Phonology account furthermore predicts that the constriction that makes the difference between the sounds will
come from a nearby sound’s gesture. Assuming that the forced Kick Drum is the default
sound because it was the one produced in isolation by the beatboxer, the tongue body
closure characterizing the unforced Kick Drum is predicted to be a gesture associated with
another sound nearby. Establishing the first criterion, that the forced and unforced Kick
Drums are members of the same class of sounds, is done with a musical analysis. A
subsequent phonetic analysis looks for evidence that the two Kick Drums are in complementary distribution.
The musical analysis takes into account that beatboxing sounds are organized into classes, with metrical
constraints that can be satisfied by any sound in the class; for example, although snare
sounds as a class are generally required on beat 3 (the back beat) of any beatboxing
performance, the requirement can be accomplished with any sound from the class of snares
including a PF Snare, a Spit Snare, or an Inward K Snare. The members of a musical class of
sounds are not necessarily alternations of the same sound—PF Snares and Inward K Snares
are not argued here to be context-dependent variants of an abstract snare category. But for
forced and unforced Kick Drums to be alternants of the same category, they minimally must
belong to the same musical class. Because sounds in a musical class have metrical occurrence
restrictions, a test of musical class membership is to observe whether forced and unforced
Kick Drums are performed with the same rhythmic patterns and metrical distributions. If
they are not, then they are not members of the same musical class and therefore cannot be
alternants of a single abstract category.3 (The names of the sounds clearly imply that
beatboxers treat the forced Kick Drum and unforced Kick Drum as two members of the Kick
Drum musical class; the musical analysis below illustrates this relationship in detail.)
The phonetic analysis is to note the phonetic environment of each Kick Drum type
and look for patterns in the gestures of those environments. Complementary distribution is
found if the phonetic environments of the two types of Kick Drum are mutually exclusive, such that the type of any given Kick Drum is
predictable based on its phonetic environment. This type of analysis is performed in many phonological descriptions of speech.
Sections 2 and 3 below establish that in this data set, forced and unforced Kick Drums
share the same rhythmic patterning (Section 2.1), but unforced Kick Drums are mostly found
3 It may be useful in future analyses to consider the possibility that some sounds vary by metrical position or otherwise exhibit positional allophony. Guinn & Nazarov (2018) suggest that there are phonotactic restrictions on place that prevent coronals from occurring in metrically strong positions; perhaps those restrictions are part of a broader pattern of allophony.
between two dorsal sounds whereas forced Kick Drums have a wider distribution (Section
2.2). The unforced Kick Drum therefore appears to be a Kick Drum that has assimilated to
an inter-dorsal environment (and lost its laryngeal gesture in the process). This account of
the data will be reinforced in Chapter 6: Harmony when it is shown that unforced Kick Drums participate in a broader pattern of tongue body closure agreement.
2. Analyses
Beat patterns were transcribed into drum tab notation from real-time MRI videos as
described in Chapter 2: Method. Based on those transcriptions, section 2.1 shows that
unforced Kick Drums have a similar rhythmic distribution to forced Kick Drums,
particularly on beat 1 of a beat pattern. Section 2.2 shows that unforced Kick Drums appear to
have a fairly restricted environment, occurring mostly between two dorsal sounds. The two
findings combined suggest that forced and unforced Kick Drums are alternative realizations of the same sound category.
From this point forward, the ejective (classic/forced Kick Drum) version will be
written in Standard Beatbox Notation {B}, whereas the unforced Kick Drum will be written
in Standard Beatbox Notation {b} (Tyte & SPLINTER, 2014). (Note that uppercase vs lowercase in this notation does not systematically encode the forced/unforced
distinction. For example, the Closed Hi-Hat is considered a forced sound, but is written with
a lowercase {t}.)
2.1. Rhythmic patterns of Kick Drums
Forty beat patterns were identified as containing a forced Kick Drum, unforced Kick Drum,
or both. One beat pattern with forced Kick Drums was omitted because it also included
unusually breathy (possibly Aspirated) Kick Drums which are not the subject of the analysis.
Of the remaining thirty-nine beat patterns, all but six were exactly four measures long; for
this analysis, the six longer beat patterns were truncated to just the first four measures. An
exception was made for beat pattern 38 (Figure 86) which comes from the same
performance as beat pattern 28 (Figure 84). The originating beat pattern was 32 measures
long; the first section (measures 1-4, beat pattern 28) used forced Kick Drums whereas the
last section (measures 29-32, beat pattern 38) used both forced and unforced Kick Drums,
and the two sections were judged to have sufficiently distinctive beat patterns that they could be counted as two separate beat patterns.
A total of 40 four-measure Kick Drum patterns were sorted into three groups: 28 beat
patterns that only contain forced Kick Drums (Figure 84), 7 beat patterns that only contain
unforced Kick Drums (Figure 85), and 5 beat patterns that contain both forced and unforced Kick Drums (Figure 86).
There are many possible forced Kick Drum patterns (Figure 84), but three particular
details will facilitate comparison to unforced Kick Drums. First, in all beat patterns but one
the forced Kick Drum occurs on the very first beat of the very first measure (27/28 cases,
96.4%, beat patterns 2-28). Second, in several cases the Kick Drum occurs on beats 1, 2+, and
4 of the first and third measures (9/28 cases, 32.1%, beat patterns 18-26). And third, 7 of those
same 9 beat patterns feature Kick Drums on 1+ and 2+ of measure 2, with similar patterns in
measure 4 (beat patterns 19-25). There are fewer beat patterns that use unforced Kick Drums
to the exclusion of forced Kick Drums (Figure 85), but the unforced Kick Drums in these
beat patterns have similar patterns to the ones just described for forced Kick Drums above.
First, in all but one beat pattern the unforced Kick Drum occurs on beat 1 of measure 1 (6/7
cases, 85.7%, beat patterns 30-35). Second, the Kick Drum tends to also occur on beats 1, 2+,
and 4 of the first and third measures (5/7 cases, 71.4%, beat patterns 31-35). And third, 4 of
those same 5 beat patterns feature Kick Drums on 1+ and 2+ of measure 2, with similar patterns in measure 4.
Figure 85. Unforced Kick Drum beat patterns.
29) b|------x---------|--x---x---------|x-----x-----x---|--x---x---x-x---
30) b|x---------------|x---------------|x---------------|x---------------
31) b|x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
32) b|x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---x---x-
33) b|x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
34) b|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x---x-
35) b|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
| Measure 1 | Measure 2 | Measure 3 | Measure 4
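The metrical positions discussed here (e.g., Kick Drums on beats 1, 2+, and 4) can be read directly off a drum tab line. The following Python sketch is illustrative only and not part of the original analysis pipeline; the `hit_beats` helper and its sixteenth-note label scheme are assumptions based on how the drum tabs in this chapter are laid out (each measure is a 16-column sixteenth-note grid in which the printed labels fall on eighth notes):

```python
def hit_beats(measure: str) -> list[str]:
    """Return beat labels for each 'x' in a 16-column drum tab measure.

    Columns form a sixteenth-note grid; '+' marks the offbeat eighth
    (e.g., '2+'), while 'e'/'a' mark the intervening sixteenths.
    """
    labels = ['1', '1e', '1+', '1a', '2', '2e', '2+', '2a',
              '3', '3e', '3+', '3a', '4', '4e', '4+', '4a']
    return [labels[i] for i, ch in enumerate(measure) if ch == 'x']

# Measure 1 of beat pattern 31: Kick Drums on beats 1, 2+, and 4.
print(hit_beats('x-----x-----x---'))  # ['1', '2+', '4']
```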
Even the two beat patterns in which a Kick Drum does not occur on beat 1 of measure 1
(beat pattern 1 of Figure 84 and beat pattern 29 of Figure 85) are similar: both have a single
Kick Drum on beat 2+ of measure 1, followed by two Kick Drums on beats 1+ and 2+ of
measure 2. (These beat patterns without Kick Drums on the first beat seem exceptional
compared to the rest of the beat patterns that do have Kick Drums on beat 1. Examining the
real-time MRI reveals that there are, in fact, labial closures on beats 1 and 4 of measure 1 in
both of these beat patterns, mimicking the common pattern of Kick Drums on beats 1, 2+,
and 4 of the first measure. The labial closures on beats 1 and 4 are co-produced with other
sounds on the same beat—a Lip Bass in the case of the forced Kick Drum (Figure 84, beat
pattern 1), and a Duck/Meow sound effect in the case of the unforced Kick Drum (Figure 85,
beat pattern 29). While many of the other beat patterns also feature Kick Drums
co-produced with other sounds on the same beat, the labial closures on beats 1 and 4 in these
two exceptional beat patterns have no acoustic release corresponding to the sound of a Kick Drum.)
Figure 86 shows five cases of beat patterns with both forced and unforced Kick
Drums. Each beat pattern is presented with both forced {B} and unforced {b} Kick Drum
drum tab lines as well as a “both” drum tab line that is the superposition of the two types of
Kick Drum. Notice that the two types of Kick Drum never interfere with each other (i.e., by
occurring on the same beat); on the contrary, they are spaced apart from each other in ways
that create viable Kick Drum patterns. This is especially noticeable in beat patterns 36, 37,
and 40: the Kick Drums collectively create a pattern of Kick Drums on beats 1, 2+, and 4 of the
first measure, one of the common patterns described above (Figure 84, patterns 18-26); but
neither the forced nor the unforced Kick Drums accomplish this pattern alone—the pattern
is only apparent when the two Kick Drum types are combined on the same drum tab line.
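The "both" lines in Figure 86 are simple superpositions of the forced and unforced rows. As a minimal sketch (the `superpose` helper is hypothetical and not part of the original transcription workflow):

```python
def superpose(line_a: str, line_b: str) -> str:
    """Overlay two drum tab lines of equal length: a hit ('x') on either
    line appears in the combined line; rests and bar lines pass through."""
    return ''.join('x' if 'x' in (a, b) else a
                   for a, b in zip(line_a, line_b))

# Measure 1 of beat pattern 36: the forced {B} and unforced {b} Kick Drums
# jointly form the common 1, 2+, 4 pattern.
print(superpose('x-----------x---', '------x---------'))  # x-----x-----x---
```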
Beat patterns 38 and 39 demonstrate that even inconsistent selection of forced and
unforced Kick Drums can still yield an appropriate Kick Drum beat pattern. In beat pattern
38, the first two measures feature mostly forced Kick Drums while the second two measures
feature mostly unforced Kick Drums; despite this, the resulting Kick Drum beat pattern is
clearly repeated with Kick Drums on beats 1, 2+, and 4 of the first and third measures as well
as beats 1+ and 2+ of the second and fourth measures. Likewise in beat pattern 39: even
though the penultimate Kick Drum is the only unforced Kick Drum, it contributes to the completion of the overall Kick Drum pattern.
Figure 86. Beat patterns with both forced and unforced Kick Drums.
36) B|x-----------x---|----x-------x---|x-----------x---|----x-------x---
b|------x---------|----------------|------x---------|----------------
both|x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
37) B|x-----------x---|--x-------------|x-----------x---|--x-------------
b|------x---------|----------------|------x---------|----------------
both|x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
38) B|x-----x-----x---|--x-------------|x---------------|----------------
b|----------------|------x---------|------x-----x---|--x---x---------
both|x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
39) B|x-----x-----x---|--x---x---x-----|x-----x-----x---|------x---------
b|----------------|----------------|----------------|--x-------------
both|x-----x-----x---|--x---x---x-----|x-----x-----x---|--x---x---------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
40) B|------x---------|------x---x-----|------x---------|------x---x-----
b|x-----------x---|--x-----------x-|x-----------x---|--x-------------
both|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x-----
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
In summary: forced and unforced Kick Drums fill the same metrical positions. When they
occur together in the same beat pattern, their joint patterning resembles typical Kick Drum
patterns—that is, they fill in each other’s gaps. For a beatboxer, this finding is probably
unsurprising. After all, the sounds are both just varieties of “Kick Drum”, so it makes sense that they should pattern alike.
But notice now that out of 40 beat patterns, only 5 used both forced and unforced
Kick Drums to build Kick Drum patterns; the remaining 35 beat patterns used either forced
or unforced Kick Drums, but not both. In fact, even in 3 of the 5 beat patterns with both
types of Kick Drums, the metrical distribution of Kick Drums is highly regular. For example,
in beat pattern 36 of Figure 86, unforced Kick Drums only occur on beat 2+ of measures 1
and 3. If forced and unforced Kick Drums are both fulfilling the role of Kick Drum in these
beat patterns, why do they not appear together in the same beat pattern more often? Why do
they not occur in free variation? The next section demonstrates that although forced Kick
Drums and unforced Kick Drums are members of the same musical class, their distribution is conditioned by their phonetic environments in a manner that parallels
phonological alternations.
2.2.1 Method
Beat patterns were encoded as PointTiers as described in Chapter 2: Method. The PointTier
linearizes beat pattern events into sequences, even when two events are metrically on the
same beat. Most of the time this is desirable; even though a Kick Drum and Liproll may
occur on the same beat, the Kick Drum is in fact produced first in time followed quickly by
the Liproll. However, this is undesirable linearization for laryngeal articulations like
humming which may in fact be simultaneous with co-produced oral sounds, not sequential.
Figure 87 shows a sample waveform and spectrogram in which acoustic noise and the release
of oral closures may hide the true onset of voicing. Humming articulations that were
annotated in drum tabs as co-occurring on the same beat as an oral sound were removed,
leaving only oral articulations. Each beat pattern’s PointTier representation was converted to a sequence of sound labels, and trigram
environments were created from these beat patterns (i.e., {C X D}, where {C} and {D} are two
beat pattern events and {X} is a forced or unforced Kick Drum). Each unique trigram in the
corpus of beat patterns is called an environment type. To ensure that each Kick Drum was in
the middle of an environment type, each beat pattern was prefixed with an octothorpe (“#”)
to represent the beginning of a beat pattern and suffixed with a dollar sign (“$”) to represent
the end of a beat pattern. An utterance-initial unforced Kick Drum before a Clickroll {CR}
might therefore appear as the trigram {# b CR}, and an utterance-final forced Kick Drum
after a Closed Hi-Hat would be {t B $}. The set of unique environment types was generated
using MATLAB’s Text Analytics Toolbox. Forced Kick Drums were found in 141 unique trigram environment types.
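The trigram extraction just described can be sketched as follows. This is a Python re-implementation for illustration (the dissertation used MATLAB's Text Analytics Toolbox), and the event sequence in the usage example is invented:

```python
from collections import Counter

def kick_drum_environments(events: list[str]) -> Counter:
    """Count trigram environment types for each Kick Drum ({B} or {b}) in a
    linearized beat pattern, padded with '#' (start) and '$' (end)."""
    padded = ['#'] + events + ['$']
    trigrams = zip(padded, padded[1:], padded[2:])
    return Counter(t for t in trigrams if t[1] in ('B', 'b'))

# A short invented event sequence: an utterance-initial unforced Kick Drum
# before a Clickroll, and an utterance-final forced Kick Drum after {t}.
counts = kick_drum_environments(['b', 'CR', 't', 'B'])
print(counts)  # Counter({('#', 'b', 'CR'): 1, ('t', 'B', '$'): 1})
```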
Environment classes. Since a major articulatory difference between the forced and
unforced Kick Drums appears to be the presence (for unforced Kick Drums) or absence (for
forced Kick Drums) of a dorsal articulation, the unique trigram environment types were
grouped into environment classes4 based on the dorsal-ness of the sounds adjacent to the
Kick Drum. These environment classes are generalizations that highlight the patterns of Kick
4 Linguists would traditionally be looking for “natural” classes here. The term “environment class” skates around
issues of “naturalness” in speech and beatboxing, but the methodological approach to classifying a sound’s
phonological environment is essentially the same.
Figure 87. An excerpt from a PointTier with humming. In this beat pattern, the oral
articulators produce the sequence {b dc tbc b SS}, where {b} is an unforced Kick Drum, {dc}
and {tbc} are dental and interlabial clicks, and {SS} is a Spit Snare. The initial unforced Kick
Drum {b} and the interlabial click {tbc} are both co-produced with an upward pitch sweep
marked as {hm} and called “humming”. These hums were removed for this analysis, leaving
only the oral articulations. (Note that this audio signal was significantly denoised from its
original recording associated with the real-time MRI data acquisition, but a few artefacts
remain as echoes that follow most sounds in the recording.)
For example, consider two hypothetical trigram environment types: {SS b dc}, which is an
unforced Kick Drum between a Spit Snare {SS} and dental closure {dc}, and {^K b LR},
which is an unforced Kick Drum between an Inward K Snare {^K} and a Liproll {LR}. The
Spit Snare, dental closure, Inward K Snare, and Liproll all involve dorsal articulations, so the
environment types {SS b dc} and {^K b LR} would both be members of the environment
class {[+ dorsal] __ [+ dorsal]}. (The +/- binary feature notation style used here is for
convenience to represent the existence or absence of a dorsal closure and should not be
taken as an implication that this is a symbolic featural analysis). The options [+ dorsal], [-
dorsal], and utterance-boundary (“#” or “$”) can occur in both the before and after positions
for a Kick Drum environment, resulting in nine (3 * 3 = 9) logically possible Kick Drum
environment classes; two of these nine did not have any Kick Drum tokens in them, leaving
seven Kick Drum environment classes listed in Tables 23 and 24. Not all environment classes are attested for both types of Kick Drum.
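The grouping of trigram environment types into the nine logically possible environment classes can be sketched like this. The `DORSAL` set below is a stand-in for illustration; the real classification of each sound's articulation comes from Chapter 3: Sounds:

```python
# Hypothetical set of sound labels produced with a dorsal closure
# (illustrative only; see Chapter 3: Sounds for the actual inventory).
DORSAL = {'SS', 'dc', '^K', 'LR', 'b', 'CR'}

def environment_class(before: str, after: str) -> tuple[str, str]:
    """Map a trigram's flanking sounds onto one of the 3 x 3 = 9
    logically possible environment classes."""
    def side(s: str) -> str:
        if s in ('#', '$'):
            return s  # utterance boundary
        return '[+ dorsal]' if s in DORSAL else '[- dorsal]'
    return side(before), side(after)

print(environment_class('SS', 'dc'))  # ('[+ dorsal]', '[+ dorsal]')
print(environment_class('#', 'CR'))   # ('#', '[+ dorsal]')
```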
2.2.2 Results
Tables 21 and 22 present the forced and unforced Kick Drum frequency distributions across
environment classes by token frequency (how many Kick Drums of a given kind were in each
environment class) and type frequency (how many unique trigram environment types of a
given Kick Drum kind were in each environment class). Table 21 shows the results of the
analysis for the forced Kick Drum environments, and Table 22 shows the results for the
Table 21 summarizes the distribution of 330 forced Kick Drum tokens across 141
unique trigram environment types, which generalize to six environment classes. The
majority of forced Kick Drum tokens and environment types did not include proximity to a
dorsal sound ("Not near a dorsal" in Table 21). The forced Kick Drums that did occur near
dorsals tended to have a non-dorsal sound on their opposite side (i.e., {[- dorsal] B [+
dorsal]} or {[+ dorsal] B [- dorsal]}). As shown in Table 22, the vast majority (93.9%) of
unforced Kick Drum tokens occurred in environment classes that included one or more
dorsal sounds near the unforced Kick Drum (the “Near a dorsal” classes), with most of those
(83.3%) featuring dorsal sounds on both sides of the unforced Kick Drum. This is essentially
the reverse of the distribution of forced Kick Drums, which were highly unlikely to occur between two dorsal sounds.
These observations are summarized in contingency tables of Kick Drum environment types (Table 23) and tokens (Table 24). Fisher’s exact tests on these
tables were significant (p < 0.001 in both cases), meaning that the frequency distribution of
Kick Drums in these environments deviated from the expected frequencies—that is, Kick
Drum types appeared often in some environments and sparsely in others. Tables 23 and 24
highlight in green the cells with the highest frequencies and which correspond to the
observations in Tables 21 and 22: forced Kick Drums tend to occur between non-dorsal
sounds while unforced Kick Drums tend to occur between dorsal sounds.
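The dissertation reports Fisher's exact tests over the full tables. As a simplified sketch under an assumption not in the original (collapsing Table 23 to a 2x2 table of inter-dorsal vs all other environment types), a two-sided Fisher's exact test can be computed directly from the hypergeometric distribution:

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    def prob(k: int) -> float:
        # Hypergeometric probability of a table with top-left cell k.
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)
    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(c1, r1)
    # Sum all tables at least as extreme (probability <= observed).
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Table 23 collapsed: environment types between two dorsals vs elsewhere.
# Forced: 8 inter-dorsal, 133 other; unforced: 41 inter-dorsal, 13 other.
p = fisher_exact_two_sided(8, 133, 41, 13)
print(p < 0.001)  # True
```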
Table 22. Unforced Kick Drum environments.
Environment class (Before __ After) | Number of environment types | Tokens in environment class
Table 23. Kick Drum environment type observations. Forced Kick Drum trigram
environment types were most likely to be of the {[- dorsal] B [- dorsal]} environment class,
while unforced Kick Drum environment types were most likely to be of the {[+ dorsal] b [+
dorsal]} environment class.
Environment class          Forced Kick Drum     Unforced Kick Drum    Total
                           environment types    environment types
[+ dorsal] X [+ dorsal]            8                    41              49
[+ dorsal] X [- dorsal]           28                     2              30
[- dorsal] X [+ dorsal]           20                     1              21
[- dorsal] X [- dorsal]           63                     3              66
# X [+ dorsal]                     1                     5               6
# X [- dorsal]                    21                     0              21
[+ dorsal] X $                     0                     2               2
Table 24. Kick Drum token observations. Forced Kick Drum tokens were most likely to occur
in the {[- dorsal] B [- dorsal]} environment class, while unforced Kick Drum tokens were
most likely to occur in the {[+ dorsal] b [+ dorsal]} environment class.
Environment class          Forced Kick Drum     Unforced Kick Drum    Total
                           token frequency      token frequency
[+ dorsal] X [- dorsal]           60                     2              62
[- dorsal] X [+ dorsal]           42                     1              43
# X [+ dorsal]                     1                     7               8
# X [- dorsal]                    26                     0              26
[+ dorsal] X $                     0                     2               2
Figure 88 shows the time series for a sequence of a lateral alveolar closure, unforced Kick
Drum, and Spit Snare {tll b SS}. The sounds surrounding the unforced Kick Drum both have
tongue body closure: the lateral alveolar closure is a percussive like the unforced Kick Drum,
which in this case means it has tongue body closure but no substantial movement of the
tongue body forward or backward to cause a change in air pressure; the Spit Snare on the
other hand is a lingual egressive sound, requiring a tongue body closure and subsequent
squeezing of air past the lips. The tongue body maintains a high closure throughout the
sequence, as represented by consistently high values for pixel intensity in the DOR region,
indicating that the Kick Drum may be unforced because of gestural overlap with one or more
tongue body closures intended for a nearby sound like the Spit Snare. The LAR time series
for larynx height is also included to confirm that there is no ejective-like action here that would indicate a forced Kick Drum.
Figure 88. A sequence of a lateral alveolar closure {tll}, unforced Kick Drum {b}, and Spit
Snare {SS}. The DOR region of the tongue body has relatively high pixel intensity
throughout the sequences, and the LAR region of the larynx has low pixel intensity.
3. Conclusion
Forced and unforced Kick Drums are in complementary distribution: unforced Kick Drums,
which were described earlier as having a dorsal articulation in addition to a labial closure,
tend to occur near dorsal sounds; forced Kick Drums do not share this dorsal articulation,
and tend to occur near non-dorsal sounds. Based on this context-dependent complementary
distribution and their similar rhythmic patterning, the forced and unforced Kick Drums are best analyzed as alternants of a single Kick Drum category.
Given the matching dorsal or non-dorsal quality of a Kick Drum and its neighbors, the most straightforward articulatory interpretation is that the tongue body does not
release its closure between the unforced Kick Drum and the sound before or after it. In a
traditional phonological analysis, one could posit a phonological rule to characterize this
distribution such as: “Kick Drums are unforced (dorsal) between dorsal sounds and forced
(ejective) elsewhere.” (Forced Kick Drums are the elsewhere case because their occurrence is
The Articulatory Phonology analysis is roughly the same, if not so featural: Kick Drums are
unforced if they overlap with a tongue body closure. These interpretations assume a causal
relationship in which the Kick Drum is altered by its environment, but an alternative story
reverses the causation: forced and unforced Kick Drums are distinct sound categories that
trigger dorsal assimilation in the sounds nearby. The analysis of beatboxing phonological
harmony in Chapter 6: Harmony provides further evidence that the Kick Drum is subject to the influence of tongue body closures in its environment.
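The traditional rule stated above ("unforced between dorsal sounds and forced elsewhere") is small enough to express directly. This sketch is illustrative only and abstracts away from the Articulatory Phonology interpretation in terms of gestural overlap:

```python
def kick_drum_realization(prev_dorsal: bool, next_dorsal: bool) -> str:
    """Return the predicted Kick Drum allophone: unforced {b} between
    two dorsal sounds, forced (ejective) {B} elsewhere."""
    return 'b' if prev_dorsal and next_dorsal else 'B'

print(kick_drum_realization(True, True))   # b
print(kick_drum_realization(False, True))  # B
```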
Kick Drums are not the only sound in the data set to show this type of pattern,
though their relatively high token frequency makes them the only sounds to show it so
robustly. As Chapter 3: Sounds listed, there are two labio-dental compression sounds: a
glottalic egressive PF Snare and a percussive labio-dental sound. As its name implies, the PF
Snare fulfills the musical role of a snare by occurring predominantly on the back beat of a
beat pattern. Suspiciously, the labio-dental percussive also appears on the back beat in the
two beat patterns it occurs in, and just like the unforced Kick Drum it occurs surrounded by
sounds with tongue body closures. The same goes for the Closed Hi-Hat and some of the
coronal percussives, though the pattern is confounded somewhat by the percussives being
distributed over several places of articulation while the Closed Hi-Hat is a distinctly alveolar
sound. Taking the Kick Drum, PF Snare, and Closed Hi-Hat together suggests that the
phenomenon discussed in this chapter is actually part of a general pattern that causes some
ejectives to become percussives when other sounds with tongue body closures are nearby.
CHAPTER 6: HARMONY
Some beatboxing patterns include sequences of sounds that share a tongue body closure, a
type of agreement that in speech might be called phonological harmony. This chapter
demonstrates that beatboxing harmony has many of the signature attributes that
characterize harmony in phonological systems in speech: sounds that are harmony triggers, sounds that are undergoers, and sounds that are blockers, with the pattern organized
based on the phonetic dimension of airstream initiator. This analysis of beatboxing harmony
provides the first evidence for the existence of sub-segmental cognitive units of beatboxing
(vs whole segment-sized beatboxing sounds). These patterns also show that the harmony is systematic rather than an incidental byproduct of individual sound choices.
1. Introduction
Some beat patterns, like the one in Figure 89, combine obstruent beatboxing sounds and phonation (which may not always be modal). This type of
"humming while beatboxing" beat pattern is well-known by beatboxers and treated as a skill
to be developed in the pursuit of beatboxing expertise (Stowell & Plumbley, 2008; Park, 2016;
WIRED, 2020).
Figure 89. A beat pattern that demonstrates the beatboxing technique of humming with
simultaneous oral sound production. This beat pattern contains five sounds: an unforced
Kick Drum {b}, a dental closure {dc}, a linguolabial closure {tbc}, a Spit Snare {SS}, and brief
moment of phonation/humming {hm}. In this beat pattern, humming co-occurs with other
beatboxing sounds on most major beats (i.e., 1, 2, 3, and 4, but not their subdivisions).
b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Without knowing the articulation in advance, a humming while beatboxing beat pattern is
a pneumatic paradox: humming requires a lowered velum to keep air pressure low above
the vocal folds while they vibrate and to allow air to escape through the nose, but glottalic
egressive sounds like the forced Kick Drum (see Chapter 3: Sounds) require a raised velum so air pressure can build up behind an oral closure. The
production of voiced stops in speech comes with similar challenges; languages with voiced
stops use a variety of strategies such as larynx lowering to decrease supraglottal pressure
(Catford, 1977; Ohala, 1983; Westbury, 1983). Real-time MRI examples later in this chapter
show that beatboxers use a different strategy to deal with the humming vs obstruent
antagonism: separating the vocal tract into two uncoupled chambers with a tongue body
closure (see also Dehais-Underdown et al., 2020; Paroni, 2021b). Behind the tongue body
closure, the velum is lowered and phonation can occur freely with consistently low
supraglottal pressure. In front of the tongue body closure, air pressure is manipulated by the
coordination of the tongue body and the lips or tongue tip. In speech, a similar articulatory strategy of sealing off an anterior chamber with a lingual closure is used in the production of clicks.
The examples above of speech remedies for voiced obstruents operate over a
relatively short time span near when voicing is desired. Notice, however, that phonation {hm}
in the beat pattern from Figure 89 is neither sustained nor co-produced with every oral
beatboxing sound, yet every sound in the pattern is produced with a tongue body closure. It
turns out that other beat patterns like the one in Figure 90 also feature many sounds with
tongue body closures even when the beat pattern has no phonation at all; the humming
while beatboxing example is just one of several beat pattern types in which multiple sounds
share the property of being produced with a tongue body constriction. When multiple
sounds share the same attribute in speech, the result is phonological “harmony”.
Analyzing this harmony offers deep insights about the makeup of the fundamental units of beatboxing cognition. The
rest of this introduction reviews harmony in speech (section 1.1) and previews some of the major theoretical issues at stake in the comparison of speech and beatboxing harmony (section 1.2).
Figure 90. This beat pattern contains five sounds: a labial stop produced with a tongue body
closure labeled {b}, a dental closure {dc}, a lateral closure {tll}, and a lingual egressive labial
affricate called a Spit Snare {SS}. All of the sounds are made with a tongue body closure.
b |x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
dc |----x-----------|----------------|----x-----------|----------------
tll|----------------|x---------------|----------------|x---------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
1.1 Speech harmony
Harmony in speech occurs when multiple distinct phonological segments can be said to
“agree” with each other by expressing the same particular phonological property. There are a
few different types of harmony patterns in speech, but the most relevant to this study is
“local harmony” in which the sounds that agree with each other occur in an uninterrupted sequence (potentially including
both vowels and consonants). Rose & Walker (2011) describe a few types of local harmony and their possible motivations. One proposed motivation appeals to the speaker’s communicative task:
this task is to create messages that have a high likelihood of being accurately recovered by
someone perceiving the message. Harmony is one of several mechanisms that have been proposed to serve this task:
perceptually weak phonological units are more likely to be heard if they last longer and
overlap with multiple segments (Kaun, 2004; Walker 2005; Kimper 2011). Less teleologically, harmony may emerge from the reinterpretation of coarticulation
(Ohala, 1994) or stochastic motor control variation (Tilsen, 2019), which may have
perceptual benefits. In either view, local harmony is initiated by a “trigger” segment which
has some phonological property to spread (i.e., a feature or gesture). Through harmony, that
property is shared with other nearby segments (“targets” or “undergoers”) so that they end up expressing the same property as the trigger.
The same overarching task of producing a perceptually recoverable message which
may motivate harmony also constrains which phonological properties of a sound will spread
and how. Harmony must be unobtrusive enough that it does not destroy other crucial
phonological contrasts; tongue body closure harmony, for example, is unattested in speech
because it would destroy too much information by turning all vowels and consonants into
velar stops (Gafos, 1996; Smith, 2018). Likewise, sounds that would be disrupted by harmony
should be able to resist harmonizing and prevent its spread; these types of sounds are called
“blockers”. In other languages, some sounds might be “transparent” instead, meaning that they do not themselves harmonize but also do not block the spread of the
phonological property to other sounds (Rose & Walker 2011).
often done by formally linking a feature to adjacent segments according to some rule,
constraint, or other grammatical or dynamical force. In gestural accounts, local harmony has
been modeled as maintaining a particular vocal tract constriction over the course of multiple segments.
To summarize: harmony occurs when multiple segments in a row share the same feature or gesture. Harmony is analyzed as a feature or gesture
spreading from a trigger unit onto or through adjacent segments called undergoers, though
some segments may also block harmony or be transparent to it. To the extent that harmony in another behavior exhibits these same signature properties, it can be compared directly with harmony in speech.
1.2 Beatboxing harmony
Figures 89 and 90 provided examples of beatboxing sequences in which each sound has a
tongue body closure. While these beat patterns may be harmonious in the sense that the
sounds agree on some property, it does not mean that beatboxing harmony has the same
traits as speech harmony. The overarching goals of beatboxing are more aesthetic than communicative, so beatboxing harmony presumably serves perceptually salient, aesthetic goals. For example, the humming while beatboxing pattern
described earlier allows the beatboxer to add melody to a beat pattern. Even without
phonation, it may sometimes be desirable to make many sounds with a tongue body closure
to create a consistent sound quality from the shorter resonating chamber in front of the
tongue body. Given the completely different tasks that drive speech and beatboxing
harmonies, they could in principle arise from completely distinct motivations using
completely distinct mechanisms, such that any resemblance between them is purely
superficial.
One way to determine whether beatboxing harmony bears only superficial similarity
to harmony in speech or a deeper one based on a partly shared cognitive system
underlying sequence production is to see whether or not beatboxing harmony exhibits the
signature properties of speech harmony beyond the existence of sequences that share some
properties, namely: triggers, undergoers, and blockers. For example, consider a beatboxing
sequence like *{CR WDT SS WDT}. (The asterisk on that beat pattern indicates that it is not
a sequence found in this data set, which is not quite the same thing as saying that it is an
ill-formed beatboxing sequence.) In that sequence, each sound requires a tongue body
closure, so there may be a separate tongue body closure for each sound rather than a
prolonged tongue body closure that would be expected in speech harmony. Either way, none
of the sounds would have to trigger or undergo a tongue body closure assimilation to create
harmony because they all have tongue body closures in any context in which they appear;
and if there is no evidence for triggers, there could be no evidence for blockers either.
Alternatively, evidence could suggest that harmony in speech and beatboxing share
some deeper principles. Local harmony in speech involves prolonged constrictions; since
plenty of other nonspeech behaviors involve holding a body part in one place for an
extended period of time, beatboxing could do that too in order to create a prolonged
aesthetic effect. And if a beatboxer holds a tongue body closure for an extended period of
time during a beat pattern, the closure would temporally overlap with other sounds and
ensure that they are made with a tongue body closure too—even if they weren’t necessarily
selected to have one and wouldn’t have the tongue body closure in other contexts. Thus, sounds overlapped by the sustained closure would effectively undergo harmony. Furthermore, if some beatboxing sounds in the same pattern cannot be produced with a
tongue body closure without radically compromising their character, those sounds might
block the tongue body closure harmony. Beatboxing harmony might present all the same signature properties as speech harmony.
Finding evidence in beatboxing for sustained constrictions and sounds with signature
harmony properties is not enough to claim that beatboxing harmony is like speech harmony.
Phonological harmony is a sound pattern. It’s predictable. Triggers, undergoers, and blockers
are classes of sounds organized by sub-segmental properties they share. If beatboxing has the
same type of harmony, then the sounds of beatboxing harmony must be organized along
similarly sub-segmental lines. Chapter 3: Sounds used analytic dimensions to describe the
phonetic organization of beatboxing sounds. The aim of the current chapter is to test
whether any of these dimensions play a role in the active cognitive patterning of beatboxing.
If beatboxing can be shown to exhibit harmony, then the roles of the sounds in a harmony pattern may reveal the sub-segmental phonetic dimensions along which they are distributed. In turn, those same phonetic dimensions must be part of the cognitive organization of beatboxing. More broadly, the analyses of this chapter aim at answering whether or not harmony is unique to language.
Theories of phonological harmony are designed only to account for language data; but if
beatboxing also has harmony, then a theory is needed that accounts for the shared or divergent properties of both systems. Beatboxing might exhibit harmony in two different ways: in only the superficial sense that sequences of sounds share similar properties,
or in the more profound sense that harmony is governed by phonological principles similar
to those found for speech. In the latter case, beatboxing sounds that participate in harmony
patterns should be reliably classifiable into roles like trigger, undergoer, and blocker.
Furthermore, if these roles can be predicted by one or more phonetic attributes, then those attributes are plausibly part of the cognitive representation of beatboxing units. Like speech harmony, beatboxing harmony should then be able to be characterized in terms of sub-segmental properties.
Section 2 introduces the method by which the beatboxing corpus was probed to
discover and analyze beatboxing harmony examples. Section 3 describes a subset of the
harmony examples in terms of the evidence for triggers, undergoers, and blockers. Section 4
argues for the existence of cognitive sub-segmental beatboxing elements relating to airflow initiators and provides an account of beatboxing harmony patterns using gestures associated with those initiators.
2. Method
See Chapter 2: Method for details of how the rtMR videos were acquired and annotated, then
converted to time series and gestural scores for the analysis below.
The videos and drum tabs of each beat pattern were visually inspected in order to
identify those which had sequences of sounds produced with tongue body closures. Eleven
such beat patterns were identified. For this analysis, each of those 11 beat patterns was
examined more closely to evaluate the constriction state of the tongue body during and
between the articulation of sounds in the beat pattern. These observations were then analyzed for evidence of harmony triggers, undergoers, and blockers.
Most of the beat patterns in the database were performed to showcase a particular
beatboxing sound. Seven of the eleven beat patterns exhibiting persistent tongue body
closure were from these showcase beat patterns, each of which features a sound that is
produced with a tongue body closure: Clickroll {CR}, Clop {C}, Duck Meow SFX, Liproll
{LR}, Spit Snare {SS}, Water Drop Air {WDA}, and Water Drop Tongue {WDT}. Two other
beat patterns showcasing the Inward Bass and the Humming while Beatboxing pattern were
also performed with a persistent tongue body closure; both of these beat patterns included
the Spit Snare {SS}. The final two beat patterns did not showcase any beatboxing sound in
particular: one was a long beat pattern featuring the Spit Snare, in which the last few
measures were made with a persistent tongue body closure; the other includes both the Spit Snare and other sounds made with tongue body closures.
Five of the eleven beat patterns with harmony are discussed in this section to illustrate how
beatboxing harmony manifests and to test the hypothesis that beatboxing harmony exhibits
some of the signature properties of speech harmony discussed above.
These five are the Spit Snare {SS} showcase (beat pattern 5), the humming while beatboxing
pattern (beat pattern 9), the Clickroll {CR} showcase (beat pattern 1), the Liproll {LR}
showcase (beat pattern 4), and a freestyle beat pattern that was not produced with the
intention of showcasing any particular beatboxing sound (beat pattern 10). As summarized
in Table 25, these beat patterns depict a beatboxing harmony complete with sounds that trigger the bidirectional spreading of a lingual closure, sounds that undergo alternations due to that spreading, and sounds that block it.
Observation: The tongue body rises into a velar closure at the beginning of the utterance and stays there until the end of the utterance. Kick Drums in the scope of this velar closure lose their larynx raising movement.
Analysis: The Spit Snare triggers bidirectional tongue body closure harmony. Kick Drums in the environment of the harmony lose their larynx raising movement when they gain their tongue body closure, and therefore exhibit an alternation from a glottalic egressive to percussive airstream.

Observation: A velar tongue body closure splits the vocal tract into two chambers so that percussion and voicing can be produced independently.
Analysis: Tongue body closure harmony is triggered again by the Spit Snare. It does not restrict all laryngeal activity—it allows vocal fold adduction for voicing (humming), but eliminates the larynx raising movements associated with Kick Drums.

Observation: Tongue body harmony is again achieved by maintaining a closure against the upper airway. However, the location of that closure moves back and forth between the palate and the uvula as required by the Liproll. When the Liproll is not active, the tongue body adopts a velar position.
Analysis: Tongue body closure harmony does not require a static tongue posture; it allows variability in constriction location so long as the constriction degree remains a closure. The Liproll is the harmony trigger this time, and PF Snares undergo harmony.

Observation: Some sequences of sounds agree in tongue body closure, but these groups are separated from each other by sounds without tongue body closure including the Inward Liproll and High Tongue Bass. Kick Drums near these two sounds retain their larynx raising movements.
Analysis: The Spit Snare is once again a harmony trigger, but the Inward Liproll {^LR} and High Tongue Bass {HTB} block the spread of harmony. Both blocking sounds are pulmonic, indicating that harmony is blocked by pulmonic airflow. Temporal proximity to the harmony blockers prevents the Kick Drums from harmonizing.

Observation: Brief sequences agreeing in tongue body closure are broken up by forced Kick Drums and Inward K Snares. The tongue body is elevated during the forced Kick Drums but an air channel over the tongue is created by raising the velum.
Analysis: The Clickroll triggers tongue body closure harmony and the pulmonic Inward K Snare blocks harmony. As with beat pattern 10, Kick Drums close to the harmony blocker are not susceptible to harmonizing. The elevated tongue body position during forced Kick Drums is argued to be anticipatory coarticulation from the Inward K Snare.
As for the other six beat patterns not discussed: the Clop {C} showcase (beat pattern 2) was
not analyzed because it only contains one oral sound—the Clop {C}; the Duck Meow SFX {DM} showcase was not analyzed because a complete phonetic description of the Duck Meow SFX could not be given in Chapter 3: Sounds, making an articulatory
analysis unfeasible. The remaining beat patterns for the Water Drop Air {WDA} showcase
(beat pattern 6), Water Drop Tongue {WDT} showcase (beat pattern 7), Inward Bass {IB}
showcase (beat pattern 8), and second freestyle pattern (beat pattern 11) all exhibit
bidirectional spreading like beat pattern 5. Beat pattern 7 is additionally confounded by the
presence of two sounds that use tongue body closures when performed in isolation.
Table 26 lists the beatboxing sounds used in the remainder of this chapter, along with
their transcription in BBX notation (see Chapter 3: Sounds). Transcription in notation from
the International Phonetic Alphabet is also provided which incorporates symbols from the
extensions to the International Phonetic Alphabet for disordered speech (Duckworth et al.,
1990, Ball et al., 2018) and the VoQS System for the Transcription of Voice Quality (Ball et
al., 1995; Ball et al., 2018). An articulatory description of each sound is also given in prose.
The table groups the sounds by their role in beatboxing harmony (which the subsequent
analysis provides evidence for). Note that “percussives” are sounds made with a posterior tongue body closure but without the tongue body fronting or retraction associated with clicks.
Triggers
Blockers
High Tongue Bass {HTB} [r]: Voiced pulmonic egressive alveolar trill (high pitch)
Other
Lateral alveolar closure {tll} [ǁ]: Voiceless percussive lateral alveolar stop
3.1 Beat pattern 5—Spit Snare showcase
Beat pattern 5 showcases the Spit Snare {SS}. Section 3.1.1 demonstrates how the tongue
body makes a velar closure throughout the entire performance, making this a relatively
simple case of tongue body closure harmony. The tongue body closure results in alternations in which Kick Drums lose the larynx raising movement associated with ejectives. Section 3.1.2 analyzes the pattern in terms of a tongue
body harmony trigger and undergoers. Table 27 re-lists the beatboxing sounds used in beat pattern 5.
Lateral alveolar closure {tll} [ǁ]: Voiceless percussive lateral alveolar stop
3.1.1 Description of beat pattern 5
Beat pattern 5 is a relatively simple example of tongue body closure harmony in beatboxing.
As the drum tab (Figure 91) and time series (Figure 93) below show, the tongue body makes
a closure against the velum for the entire duration of the beat pattern. The Spit Snare serves as the snare (beat 3 of each measure), and the unforced Kick Drum occurs in a relatively common pattern on beats 1, 2+,
and 4 of the first measure and beats 2 and 4 of the second measure, repeating the two
measure pattern for measures 3 and 4. The dental closure occurs on beat 2 of the first and
third measures, and the lateral alveolar closure occurs on beat 1 of the second and fourth
measures. All the sounds in this beat pattern share the trait of being made with a tongue
body closure. Agreement like this in speech would likely be considered a type of local
harmony.
Four regions of interest (Figure 92) were used to generate articulatory time series: one for labial closures (LAB), one for alveolar closures (COR), one for tongue body
closures (DOR), and one for larynx height (LAR). The labial (LAB) time series includes the
gestures for the unforced Kick Drum {b} and the Spit Snare {SS}, while the coronal (COR)
time series feature the gestures for the dental closure {dc} and the lateral alveolar closure
{tll}. The tongue body (DOR) time series shows that the tongue body stays raised
throughout the beat pattern: the tongue body starts from a lower position at the very
beginning of the beat pattern, represented by low pixel intensity (close to the bottom of the
y-axis), but it quickly moves upward at the beginning of the beat pattern to make a closure
(high pixel intensity, closer to the top of the y-axis) in time for the first unforced Kick Drum
{b}.
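The raised-versus-lowered distinction described above can be operationalized by thresholding a scaled DOR time series and extracting the runs of frames where intensity stays high. The sketch below is illustrative only; the threshold value and the function name `closure_intervals` are assumptions, not the dissertation's actual pipeline.

```python
import numpy as np

def closure_intervals(ts: np.ndarray, threshold: float = 0.5):
    """Frames where a scaled intensity series suggests a closure.

    Returns (start, end) index pairs (end exclusive) for each run of
    frames whose intensity exceeds the threshold.
    """
    # Pad with False so runs touching the edges still produce two edges.
    above = np.concatenate(([False], ts > threshold, [False]))
    edges = np.flatnonzero(np.diff(above.astype(int)))
    return [(int(a), int(b)) for a, b in zip(edges[::2], edges[1::2])]

# A DOR-like series: low at the start, then a sustained closure.
dor = np.array([0.1, 0.2, 0.9, 0.95, 0.9, 0.85, 0.9, 0.3])
print(closure_intervals(dor))  # → [(2, 7)]
```

A single long interval spanning most of the beat pattern would correspond to the persistent closure described here; several short intervals would suggest separate closures per sound.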
Figure 92. Regions for beat pattern 5. From top to bottom: the labial (LAB) region for the
unforced Kick Drum {b} and Spit Snare {SS}; the coronal (COR) region for the dental
closure {dc} and lateral alveolar closure {tll}; the dorsal (DOR) region to show tongue body
closure and the laryngeal (LAR) region to show lack of laryngeal activity.
Dorsal closure during Spit Snare and empty larynx region during unforced Kick Drum
Figure 93. Time series of vocal tract articulators used in beat pattern 5, captured using a
region of interest technique. From top to bottom, the time series show average pixel intensity
for labial (LAB), coronal (COR), dorsal (DOR), and laryngeal (LAR) regions.
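The region-of-interest technique mentioned in the caption reduces each video frame to one number per region: the mean pixel intensity inside the region, which rises as bright soft tissue (e.g., the tongue body) enters it. A minimal sketch under that assumption follows; the array shapes and the name `roi_time_series` are hypothetical, not the actual analysis code.

```python
import numpy as np

def roi_time_series(video: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average pixel intensity inside a region of interest, per frame.

    video: grayscale array of shape (frames, height, width).
    mask:  boolean array of shape (height, width); True marks ROI pixels.
    Higher values mean more bright tissue inside the region.
    """
    return video[:, mask].mean(axis=1)

# Tiny synthetic example: tissue "enters" the ROI in the second frame.
video = np.zeros((2, 4, 4))
video[1, 1:3, 1:3] = 1.0             # bright tissue in frame 1
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                # a DOR-like region
print(roi_time_series(video, mask))  # → [0. 1.]
```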
The time series in Figure 93 capture the results of the alternation of forced Kick Drums to
unforced Kick Drums. As discussed in Chapter 5: Alternations, the default forced Kick
Drums are ejectives which means the laryngeal time series of Kick Drums would show an
increase from low intensity to high intensity as a rising larynx enters the region of interest.
The alternative Kick Drum form, the unforced Kick Drum, is made in front of a tongue body
closure, so it is expected to exhibit activity in the dorsal time series. Tongue body closures are
not antithetical to laryngeal movement: they may occur at the same time, and often do for
dorsal ejectives in speech. Yet beat pattern 5 shows that the Kick Drums do have a tongue body closure without an accompanying larynx raising movement.
Figure 94. Upper left: Labial and laryngeal gestures for an ejective/forced Kick Drum at the
beginning of a beat pattern. Upper right: Labial gesture for a non-ejective/unforced Kick
Drum at the beginning of beat pattern 5. A larynx raising gesture occurs with the forced Kick
Drum, but not the unforced Kick Drum. (Pixel intensities for each time series were scaled
[0-1] relative to the other average intensity values in that region; the labial closure of the
forced Kick Drum looks smaller than the labial closure of the unforced Kick Drum because it
was scaled relative to other sounds in its beat pattern with even brighter pixel intensity
during labial closures. Both labial gestures in this figure are full closures.) Lower left: At the
time of maximum labial constriction for the ejective Kick Drum, the vocal folds are closed
(visible as tissue near the top of the trachea) and the airway above the larynx is open; the
velum is raised. Lower right: At the time of maximum labial constriction for the non-ejective
unforced Kick Drum, the vocal folds are open and the tongue body connects with a lowered
velum to make a velar closure.
Forced (ejective) Kick Drum Unforced (lingual) Kick Drum
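The scaling caveat in the caption follows from per-pattern min-max normalization: each region's time series is rescaled to [0, 1] relative to its own extremes, so the same physical closure can plot at different heights in different beat patterns. A hedged sketch of that normalization, where `scale_unit` is an assumed helper name:

```python
import numpy as np

def scale_unit(ts: np.ndarray) -> np.ndarray:
    """Min-max scale a time series to [0, 1] relative to its own extremes."""
    lo, hi = ts.min(), ts.max()
    return (ts - lo) / (hi - lo)

# The same closure intensity (8) scales differently depending on what else
# occurred in that region's beat pattern, mirroring the caveat above.
a = scale_unit(np.array([0., 8.]))        # the closure is the brightest event
b = scale_unit(np.array([0., 8., 16.]))   # a brighter closure exists elsewhere
print(a[1], b[1])  # → 1.0 0.5
```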
From the perspective of aerodynamic mechanics, this is sensible: laryngeal movement behind
the tongue body closure has no effect on the size of the chamber between the lips and the
tongue body, so it makes no difference whether the larynx moves or not; better to save
energy and not move the larynx. From the perspective of beatboxing phonology, this
example is illuminating: if one assumes based on Chapter 5: Alternations that the forced Kick
Drum was selected for this beat pattern and undergoes an alternation into an unforced Kick
Drum, then the phonological model must provide not only a way to spread the tongue body
closure but also a way to get rid of the larynx raising. (Section 4 addresses this in more
detail.)
3.1.2 Analysis of beat pattern 5
Harmony patterns in speech are defined by articulations that spread from a single trigger
sound to other sounds nearby, causing them to undergo assimilation to that articulation. In
beat pattern 5, the Spit Snare is the origin of a lengthy tongue body closure gesture and other
sounds like the Kick Drum assimilate to that dorsal posture as well. The sounds
agree by sharing a tongue body closure, and in this sense they are harmonious.
Chapter 5: Alternations described the unforced Kick Drum as a variant of the forced Kick Drum that mostly appears in environments with surrounding dorsal closures. This was
implicitly characterized as local agreement: the unforced Kick Drum adopts a tongue body
closure when adjacent sounds also have a tongue body closure. Looking beyond the unforced
Kick Drum’s immediate environment however, and considering the pervasive tongue body
closure in this beat pattern, the Kick Drum alternation in this beat pattern seems more aptly
described as the result of tongue body harmony: the Kick Drum is not just accidentally
sandwiched between two dorsal sounds—all the sounds, nearby and not, have tongue body
closures. The unforced Kick Drum is a forced Kick Drum that undergoes tongue body
closure harmony.
In principle, only sounds that require a tongue body closure, even in isolation, could be triggers of harmony. Of the sounds in this particular
beat pattern, only the Spit Snare was ever performed in isolation or identified as a distinct
beatboxing sound by the beatboxer; as the only sound in this beat pattern known to require a
tongue body closure, the Spit Snare is therefore the most likely candidate for a harmony
trigger. In fact, the Spit Snare is associated with long tongue body closures in all the beat
patterns it appears in, and in most cases is the only sound in that pattern known to be produced with a tongue body closure in isolation.
Assuming the Spit Snare is a harmony trigger, then the tongue body closure harmony
in this beat pattern extends bidirectionally: it is regressive from beat 2 of the first measure to
begin with the first unforced Kick Drum {b}, but also progressive from beat 4 of the last measure through the end of the beat pattern.
3.2 Beat pattern 9—humming while beatboxing
Beat pattern 9 is an example of the “humming while beatboxing” described at the beginning
of this chapter. Section 3.2.1 describes this humming while beatboxing pattern with drum tab
notation and articulatory time series. The humming is intermittent in this particular beat
pattern, and there is no need to keep a tongue body closure when humming is not
active—yet as the time series shows, the tongue body closure persists for the entire beat
pattern, suggesting a sustained posture like the ones exhibited in speech harmony. This is
discussed in section 3.2.2 in terms of triggers (the Spit Snare) and undergoers (the
non-humming sounds). For reference, the sounds of this beat pattern are listed in Table 28.
3.2.1 Description of beat pattern 9
As in beat pattern 5, the four supralaryngeal sounds in this beat pattern are the unforced Kick
Drum {b}, a Spit Snare {SS}, and two additional percussive closures—one dental {dc} and one
linguolabial {tbc}. The additional humming {hm} sound is a brief upward pitch sweep that
occurs on most beats. (If humming occurs with the first three Spit Snares, it is acoustically
occluded in the audio data of this beat pattern and therefore was not marked.)
Figure 95. Drum tab of beat pattern 9.
b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
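Drum tabs in this notation are straightforward to process computationally: each row pairs a sound label with measures of 16 sixteenth-note slots, where x marks an onset, ~ sustains it, and - is silence. The parser below is a simplified illustration (`parse_tab_row` is an assumed name; real tabs may span multiple four-measure lines):

```python
def parse_tab_row(row: str):
    """Parse one drum-tab row into (label, [(measure, slot), ...]).

    'x' marks an onset; '-' is silence; '~' sustains the previous onset;
    '|' separates measures of 16 sixteenth-note slots.
    """
    label, _, grid = row.partition("|")
    onsets = []
    for m, measure in enumerate(grid.split("|"), start=1):
        for i, ch in enumerate(measure):
            if ch == "x":
                onsets.append((m, i))  # slot 0 = beat 1, slot 4 = beat 2, ...
    return label.strip(), onsets

# Spit Snare row from beat pattern 9: one hit on beat 3 of every measure.
row = "SS |--------x-------|--------x-------|--------x-------|--------x-------"
print(parse_tab_row(row))  # → ('SS', [(1, 8), (2, 8), (3, 8), (4, 8)])
```

Slot indices divide evenly into beats (slot // 4 + 1), which makes it easy to check claims like "the Spit Snare occurs on beat 3 of each measure" directly against the tabs.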
The regions and time series are organized as in beat pattern 5 (the Spit Snare showcase). The DOR time series shows that the tongue body is raised consistently
throughout the beat pattern. Laryngeal activity on most major beats (LAR time series)
corresponds to voicing {hm}. There is also activity during the three Spit Snares that are not
marked for voicing in the drum tab; if this is voicing, it may not be apparent in the acoustic
signal due to some combination of the noise reduction method used in audio processing and acoustic masking by the co-occurring percussive sounds.
3.2.2 Analysis of beat pattern 9
The main point of note in this beat pattern is that the larynx is not necessarily inactive
during tongue body closure harmony. The description of beat pattern 5 in section 3.1 noted
that when forced Kick Drums undergo tongue body closure harmony, their unforced
alternants do not have a larynx raising gesture. A phonological model needs to be able to
“turn off” the larynx movement of the forced Kick Drums to generate the observed unforced
Kick Drums. But as beat pattern 9 shows, a blanket ban on laryngeal activity during tongue
body closure harmony would not be an appropriate choice for the phonological model.
The musical structures of beat patterns 5 and 9 are different in sounds and rhythms,
but the rest of the analysis is essentially the same. Once again, the tongue body closure that
persists throughout the beat pattern is most likely to be associated with the Spit Snares: none
of the other sounds in this beat pattern were produced in isolation by the beatboxer, which
suggests that they are tongue-body alternations of sounds without tongue body gestures (like
the Unforced Kick Drum is an alternation of the Kick Drum) or sounds that are
phonotactically constrained to only occur in the context of a sound with a tongue body
closure—in either case, not independent instigators of a sustained tongue body closure.
Again, the harmony would be bidirectional, spreading leftward to the first sounds of the beat pattern and rightward through its final sounds.
3.3 Beat pattern 4—Liproll showcase
Beat pattern 4 showcases the Liproll {LR}. The Liproll triggers tongue body harmony just
like the Spit Snare did in the previous examples; but unlike the Spit Snare, the tongue body
constriction location during the Liproll changes dramatically during the Liproll’s
production—from the front of the palate all the way to the uvula in one smooth glide.
Tongue body closure harmony is maintained during the Liproll because the constriction
degree of the tongue body stays at a constant closure. When the Liproll is not being
produced, the tongue body adopts a static velar closure. Section 3.3.1 presents the beat
pattern in drum tab and time series forms, and section 3.3.2 analyzes the pattern in terms of a
tongue body harmony trigger (the Liproll) and undergoers (everything else).
3.3.1 Description of beat pattern 4
The sounds of this beat pattern are the unforced Kick Drum {b}, the Liproll {LR}, and percussive alveolar {ac}, dental {dc}, labiodental {pf}, and
linguolabial {tbc} closures. The onset of Liprolls are metrically synchronous with unforced
Kick Drums as represented by the “x” symbols, though the time series shows that they are
not simultaneous—a Kick Drum is made first and a Liproll follows quickly thereafter. The “~”
symbol signifies that the labial trill of the Liproll is extended across multiple beats. The
labiodental closure {pf} serves the role of the snare by occurring consistently and exclusively
on beat 3 of each measure; since it was never produced in isolation by the beatboxer, the {pf} is analyzed as an alternant of the PF Snare.
b |x---x-------x---|x---x-------x---|x---x-------x---|x---x-----------
ac |----------x-----|----------x-----|----------x-----|----------------
dc |----------------|--------------x-|----------------|----------------
tbc|----------------|----------------|----------------|----------------
pf |--------x-------|--------x-------|--------x-------|--------x-------
LR |x~~~x~~~----x~~~|x~~~x~~~--------|x~~~x~~~----x~~~|x~~~x~~~--------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Of the regions used for this beat pattern, three (LAB, LAB2, and FRONT) have movements relevant to the production of sounds in
this pattern. Labial closures of the unforced Kick Drum {b} and labiodental closure {pf} are
in the LAB time series; labial closures during which the lips are pulled inward over the teeth
for the Liproll {LR} are in LAB2; and the anterior region of the vocal tract into which the
tongue shifts forward at the beginning of a Liproll is represented by FRONT. (A coronal time
series for the alveolar, dental, and linguolabial closures is not included.) The dorsal DOR and laryngeal LAR time series are included to show the consistently high tongue body posture and the absence of laryngeal activity.
Figure 98. Regions used to make time series for the Liproll beat pattern.
Unforced Kick Drum (left) and labiodental closure (right) in LAB region.
Liproll retraction of lower lip over the teeth into LAB2 region.
Liproll tongue body in (left) and out of (right) the FRONT region
The tongue body makes a closure with the velum in the DOR region during the labiodental
closure (left) and there is no laryngeal activity in the LAR region (right).
Figure 99. Time series of beat pattern 4 (Liproll showcase).
3.3.2 Analysis of beat pattern 4
The Liproll triggers tongue body closure harmony in beat pattern 4, causing both Kick
Drums and PF Snares to be produced with tongue body closures instead of glottalic egressive
airflow. Figure 98 shows snapshots of the different positions of the tongue body during this
beat pattern: the tongue body adopts a resting position closed against the velum during most
sounds but shifts forward and backward (right image) to create the Liproll.
3.4 Beat pattern 10—freestyle
Beat pattern 10 is a freestyle beat pattern not intended to showcase any particular sound. The
Spit Snare is once again a harmony trigger as it was in beat patterns 5 and 9, but here the
harmony does not spread throughout the whole beat pattern as it did in those earlier ones. In
the first six measures of the beat pattern, tongue body closures triggered by a Spit Snare do
not extend through the Inward Liproll or High Tongue Bass. These two pulmonic sounds are analyzed below as harmony blockers.
High Tongue Bass {HTB} [r]: Voiced pulmonic egressive alveolar trill (high pitch)
In measures 2, 4, 6, and 7-8 the Spit Snare follows a linguolabial closure {tbc} and unforced Kick Drum {b},
indicating that some harmony is occurring. In the same measures, however, there are also
forced Kick Drums and High Tongue Basses that did not undergo harmony. And in measures
1, 3, and 5 the Spit Snare is the only tongue body closure sound around. Only in the final two
measures does the pattern return to one of a sequence of tongue body closure sounds.
Figure 100. Drum tab for beat pattern 10.
B |x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
^LR|x~~~~~------x~~~|~~--------------|x~~~~~------x~~~|~~--------------
^K |----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|------------x~~~
b |----------------|------x---------|----------------|------x---------
dc |----------------|----------------|----------------|----------------
dac|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
B |x-----x-----x---|--x-------------|x---------------|----------------
^LR|x~~~~~------x~~~|~~--------------|----------------|----------------
^K |----------------|----------------|----------------|------------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|----------------
b |----------------|------x---------|------x-----x---|--x---x---------
dc |----------------|----------------|--x-----------x-|----------------
dac|----------------|----------------|----x-----------|x---------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
The forced Kick Drum {B}, unforced Kick Drum {b}, and Spit Snare {SS} all go on the labial closure LAB time series. The Inward
Liproll {^LR} goes on the LAB2 time series which responds to pixel intensity when the lower
lip retracts over the bottom teeth. (It also responds to tongue tip movement in the same
pixels, but there are no meaningful movements highlighted in that case.) The High Tongue
Bass {HTB}, linguolabial closure {tbc}, and dental closure {dc}, are in the COR tongue tip
time series. The Inward K Snare {^K} goes on the DOR region, and the LAR region has the
laryngeal movements for the forced Kick Drum. The dental-alveolar closure {dac} was
captured in a separate region that is not pictured. Black boxes surround movements that are relevant to the analysis.
Most of the Kick Drums near the Inward Liproll and High Tongue Bass are marked as
forced because laryngeal closure was apparent when visually inspecting the image frames of
those sounds. A forced Kick Drum was also observed in the production of the Inward Liproll
in isolation. But in this beat pattern, the laryngeal activity during most forced Kick Drums is
minimal. In some instances the laryngeal region brightens for a moment and then darkens
again with no apparent vertical movement. Unusually high pixel brightness near the lips and
tongue tip may drown out the details of whatever laryngeal closure/raising there may be. At
other times, there is clear vertical laryngeal movement during a subsequent Spit Snare; Spit
Snares after forced Kick Drums co-occur with larynx raising, while Spit Snares after unforced
The relationship between sounds in beatboxing clusters—like the Kick Drums and
Inward Liprolls organized to the same beat—is unknown territory for beatboxing science, so
it is not clear how those Kick Drums should be expected to manifest. For this analysis, the
presence of any laryngeal closure at all during these Kick Drums is taken as indication that
they are forced, and the lack of noticeable vertical movement attributed to undershoot (not
enough time for noticeable movement). Laryngeal movements marked on the time series
correspond to visual observations of laryngeal activity. At the very least, the Kick Drums just described show some laryngeal closure and are analyzed as forced.
As shown in the DOR time series, the tongue body is sometimes raised into an
extended closure and sometimes not. The tongue body is elevated overall because the DOR
region has at least some brightness at all times except during the Inward K Snare {^K} when
the tongue body completely leaves the region. The aperture of tongue body constriction
widens during most Inward Liprolls and High Tongue Basses, then decreases again as the
tongue body moves back into its closure before and after Spit Snares.
Figure 101. The regions used to make the time series for beat pattern 10.
Forced Kick Drum (left), unforced Kick Drum (center), and Spit Snare (right) in LAB region.
High Tongue Bass (left) and linguolabial closure (right) in COR region.
Inward K Snare (left) outside of the DOR region and (local) maximum larynx height during
a forced Kick Drum in the LAR region (right).
The domain of the Spit Snare’s harmony extends bidirectionally up to an Inward Liproll
{^LR} or High Tongue Bass {HTB}, then halts. As non-nasal pulmonic sounds, the Inward
Liproll and High Tongue Bass cannot be made with a tongue body closure because a tongue
body closure would prevent the pulmonic airflow from passing over the relevant oral
constriction. In speech harmony, sounds with this kind of physical antagonism to harmony
that also seem to stop the spread of harmony are generally analyzed as harmony blockers.
Alternatively, some sounds are analyzed as transparent to harmony, meaning they do not
prevent harmony from spreading but they also do not undergo a qualitative harmonious shift
either. It could be that the Inward Liproll and High Tongue Bass are transparent—tongue
body closure harmony continues through them, but the need for pulmonic airflow
The blocking analysis works slightly better here because of the presence of forced
Kick Drums. As we have seen in every other beat pattern so far, tongue body closure
harmony seems to trigger a qualitative shift in which forced Kick Drums become unforced,
losing their laryngeal closure/raising gestures and gaining a tongue body closure. Here,
however, there are some forced Kick Drums near pulmonic sounds. If harmony were not
blocked, then these Kick Drums should undergo harmony; since they do not, either they
are exceptional Kick Drums that are intrinsically resistant to harmony or they are defended
from harmony by other sounds that block harmony.5 There is no other reason to think that
any Kick Drums should be exceptional compared to others. A phonological analysis with
5 This would be a problem in a traditional phonological analysis that treats sounds as sequential symbol strings.
Consider the sequence {... ^LR B tbc b SS HTB …} in which tongue body harmony has spread regressively from
the Spit Snare {SS} as indicated by underlining beneath the Spit Snare and undergoers. In this format, blocking
from the Inward Liproll must “jump” over the forced Kick Drum to stop the harmony from affecting the forced
Kick Drum and making *{... ^LR b tbc b SS HTB …}. In theories that recognize that sounds exist in time and can
overlap, however, this is not as big an issue. If those Kick Drums are sufficiently temporally proximal to the
blockers—and indeed many of the Kick Drums in this beat pattern partially overlap with the pulmonic
sounds—then the harmonizing tongue body closure may simply already be blocked by the time those Kick Drums are produced.
blocking is the preferred analysis over transparency here. The beat pattern in section 3.5
Beat pattern 1 is a Clickroll {CR} showcase beat pattern. Section 3.5.1 presents the beat
pattern in drum tab and time series forms, illustrating an example of tongue body harmony
that is periodically interrupted by Inward K Snares. Section 3.5.2 analyzes the pattern in
terms of a tongue body harmony trigger (the Clickroll), undergoers (the unforced Kick
{b} and {B}, Closed Hi-Hat {t}, dental closure {dc}, Inward K Snare {^K}, and Clickroll {CR}.
The Kick Drums follow a two-measure pattern of occurrence—beats 1, 2+, and 4 of the first
measure, then the “and”s of each beat in the second measure. The pattern repeats in the latter
half of the beat pattern except that the final Kick Drum is replaced by an Inward K Snare.
Inward K Snares additionally appear on beat 3 of each measure. Clickrolls in this beat
pattern are always co-produced on the same beat as an unforced Kick Drum, though the
reverse is not true (i.e., an unforced Kick Drum at the end of the second measure is not
co-produced with a Clickroll). The dental closure also follows a two-measure pattern with
occurrences on the 2 and 3+ of the first measure and beats 1, 2, and 4 of the second measure;
this pattern repeats in the latter half of the beat pattern, but a Closed Hi-Hat occurs where
labial closures (LAB), alveolar closures (COR), dorsal closures (DOR), velum position
(VEL), and larynx height (LAR). Note that in this beat pattern, the dental closure is usually
the release of a coronal closure caused by a Clickroll or Inward K Snare and does not have its
own closing action. The DOR time series illustrates that the tongue body is raised near the
velum throughout beat pattern 1 except during the Inward K Snare and after the penultimate
Inward K Snare. Figure 104 shows that even what appears to be tongue body lowering during
the Inward K Snares is actually tongue fronting; whether for the Inward K Snare or for
Surprisingly, there are several forced Kick Drums in the beat pattern despite the
consistently raised tongue body posture. Tongue body closure and larynx raising are not
physically impossible to produce together, but every example thus far has shown that tongue
body closures cancel larynx closure/raising gestures during harmony. Here, forced Kick
Drums before Inward K Snares are produced with both laryngeal raising and a raised tongue
body. The velum (VEL) time series shows how a Kick Drum can be forced even when the
tongue body is high. In this beat pattern, persistent tongue body closures are made by the
tongue body and velum coming together; among other things, this allows the beatboxer to
breathe through the nose while simultaneously using the mouth to create sound. During the
forced Kick Drums, harmony ends not by lowering the tongue body but rather by raising the
velum in preparation for the pulmonic ingressive Inward K Snare (Figure 105). This directs
laryngeal air pressure manipulations through the air channel now open over the tongue.
The resulting Kick Drums therefore have larynx raising without tongue body closure, giving
them the form typically expected for Kick Drums. The last forced Kick Drum differs from
the rest: it is the only one for which the tongue body does not appear to be raised toward the
velum—nor does it appear to be making any particular constriction at all (Figure 107).
Figure 104. Regions for beat pattern 1 (Clickroll showcase).
Labial (LAB) closures for forced Kick Drum (left) and unforced Kick Drum (right).
Tongue tip (COR) closures for the Clickroll (left), dental closure (center), and Closed
Hi-Hat (right).
The tongue body is out of the DOR region during the Inward K Snare.
Larynx (LAR) region filled at the MAXC of the larynx raising associated with a forced Kick
Drum (left). The right image was taken from the PVEL2 of the tongue tip release (COR time
series).
Figure 106. The DOR region for the Clickroll showcase (beat pattern 1) in the first {CR dc B
^K}. Left: The tongue body is raised and the velum is lowered during the Clickroll {CR},
leaving no air channel over the tongue body; pixel intensity in the region is high. Center: The
tongue body is raised during a forced Kick Drum, but the velum is also raised so there is a
gap between the tongue body and the velum through which air can pass; pixel intensity in
the region is high. Right: The tongue body is shifted forward during the lateral release of an
Inward K Snare; pixel intensity in the region is low.
Figure 107. Each forced Kick Drum in the beat pattern in order of occurrence. The image was
taken from the frame of the LAB region’s peak velocity (change in pixel intensity)
corresponding to PVEL2 as described in Chapter 2: Method. The final Kick Drum (far right)
signals that harmony has ended because the tongue body is not making a narrow velar
constriction.
Figure 108. Upper left: Labial and laryngeal gestures for an ejective/forced Kick Drum before
an Inward K Snare in beat pattern 1. Upper right: Labial gesture for a non-ejective/unforced
Kick Drum in beat pattern 1. Lower left: Near the time of maximum labial constriction for
the ejective Kick Drum, the vocal folds are closed (visible as tissue near the top of the
trachea) and the airway above the larynx is open, including a narrow passage over the tongue
body which is raised but not making a closure with the velum; the velum is raised. Lower
right: At the time of maximum labial constriction for the non-ejective unforced Kick Drum,
the vocal folds are open and the tongue body connects with a lowered velum to make a velar
closure.
Left column: forced (ejective) Kick Drum. Right column: unforced (lingual) Kick Drum.
3.5.2 Analysis of beat pattern 1
of a spreading tongue body closure triggered by a Spit Snare, and in beat pattern 4 (section
3.3) from a Liproll trigger. In those beat patterns, the tongue body constriction degree is
consistent throughout—every sound is made with a tongue body closure. Beat pattern 1 is
more challenging to analyze as harmony because the tongue body closure is frequently
interrupted, making it relatively more difficult to spot prolonged tongue body closures or to
know what sounds might have triggered them. For this beat pattern the Clickroll is the only
sound made with a tongue body closure when performed in isolation, making it the most
likely trigger.
Kick Drum is a Kick Drum alternation that occurs in dorsal environments. This beat pattern
provides evidence that the dental closure {dc} may be an alternation of the Closed Hi-Hat.
The drum tab in Figure 103 shows that the Closed Hi-Hat appears at the end of the
performance in precisely the metrical position where a dental closure is expected. This would
make three clear harmony undergoers from the beat patterns analyzed so far: the forced Kick
Drum to the unforced Kick Drum, the PF Snare to a labiodental closure, and the Closed Hi-Hat to the dental closure.
But given the frequent tongue body closure interruptions and use of several forced
Kick Drums, it is not clear whether the unforced Kick Drums and dental closure alternants
should be considered the result of harmony or more simply a consequence of local
assimilation. All the unforced Kick Drums in this beat pattern save one are produced in the
same metrical position as a Clickroll; the tongue body must be raised in anticipation of the
Clickroll, so the Kick Drum on the same beat as a Clickroll must be co-produced with a
tongue body closure—there is not enough time between the release of the labial closure and
the onset of trilling to raise the tongue body and create the necessary pocket of air between
the tongue body and the tongue tip. Likewise, the starting conditions for most dental
closures are the result of tongue closures for a Clickroll or Inward K Snare. Dental closures
may occur as alternants of Closed Hi-Hats in this environment simply because they are
mechanically advantageous given the current state of the vocal tract, not because of a
Two aspects of the data suggest that this is a harmony pattern. First, the absence of
laryngeal closure/raising gestures: if the Kick Drums simply became percussives because of a
concurrent but non-inhibiting tongue body closure gesture, there should still be larynx
raising—which there is not. Second, there is the sequence {... ^K B dc b | b-CR ...}
which begins from beat 3 of the second measure (the pipe character | indicates the divide
between measures 2 and 3, and the hyphen in {b-CR} indicates that the sounds are made on
the same beat). This sequence features a dental closure {dc} and unforced Kick Drum {b}
(both underlined) that are made without an adjacent Clickroll or even an adjacent Inward K
Snare—that is, without any sounds nearby that require a tongue body closure. If the
alternations from forced Kick Drum to unforced Kick Drum and from Closed Hi-Hat to
dental closure in this beat pattern were due only to coproduction, then these particular
dental closure and unforced Kick Drum should have been an ejective Closed Hi-Hat and a forced
Kick Drum instead. The presence of a tongue body closure here, despite there being no
immediate coarticulatory origin for it, indicates harmony. Extrapolating this to the rest of the
beat pattern, the unforced Kick Drums and dental closures in this pattern can be described
as the result of the same bidirectional tongue body closure harmony that appeared in beat
patterns 5, 9, and 4.
a tongue body closure; there are also Inward K Snares which move the tongue body closure
to a different location, lateralize it, and bring air flowing pulmonic ingressively through the
mouth and over the tongue. Neither sound is participating in harmony, either as a trigger or
as an undergoer. Section 3.4 suggested that pulmonic sounds like the Inward Liproll and
High Tongue Bass are harmony blockers that defend the Kick Drums from harmonizing too;
if so, then the pulmonic ingressive Inward K Snare can also be analyzed the same way. Just as
in section 3.4, these forced Kick Drums are close enough temporally to the Inward K Snare
that they can also benefit from the blocking of the tongue body harmony.
The last measure of this beat pattern provides perhaps the clearest demonstration
that harmony is blocked by the Inward K Snare. Figure 107 illustrates that all but the last
forced Kick Drum are co-produced with a tongue body constriction. Suspiciously, all but the
last forced Kick Drum also fall somewhere between a Clickroll and an Inward K Snare,
precisely where harmony is predicted to be trying to spread, whereas the last forced Kick
Drum has no Clickroll in its vicinity. The penultimate Inward K Snare blocks harmony for
the last time and so all following sounds are made without influence from a tongue body
closure. This notably includes the Closed Hi-Hat, which has never appeared so far in a
harmony pattern but occurs frequently outside of harmony (see Chapter 3: Sounds); its
appearance in this beat pattern is another indicator that harmony has ended.
The introduction of this chapter posed two main questions. First, descriptively, does
and blockers? And second, what can be concluded about beatboxing cognition from the
description of beatboxing harmony? With respect to the first question, section 3 found that
there are indeed beat patterns with sustained tongue body closure that can be described as
bidirectional harmony. Those patterns include sounds associated with those tongue body
closures that act as triggers, sounds that undergo qualitative change because of the harmony,
and sounds that block the spread of harmony. The remainder of this chapter addresses the
second question about the implications for beatboxing cognition in two parts: the evidence
for cognitive sub-segmental units (section 4.1), and a discussion of how beatboxing harmony
representations. Beatboxers learn categories of sounds and overtly or covertly organize them
by their musical role; they can also name many of the sounds they can produce, and likewise
produce a sound they know when prompted with its name. All of this knowledge is necessary
and inevitable for skilled beatboxers. Less clear is the nature and composition of those
representations. The question at hand is whether there is evidence for cognitive units
phonetic dimensions, but cautioned that finding observable dimensions does not imply the
cognitive reality of those dimensions. The atoms of speech—units the size of features or
gestures—are argued to be cognitive because of many years of observational data and more
recent (40-50) years of experimental data showing that sounds pattern along these phonetic
dimensions. In almost all cases the patterns occur for sounds of a particular “natural” class,
which is to say that the sounds involved share one or more phonetic properties.
If there is any cognitive reality to the phonetic dimensions of beatboxing sounds, then
beatboxing sounds belonging to a given class defined by one or more phonetic dimensions
should share a certain pattern of behavior. Beatboxing harmony provides a window through
undergoers, and blockers have complementary behavior in harmony; if they also have
complementary phonetic dimensions relevant to the harmony, then those dimensions will
Table 26 (reprinted). The beatboxing sounds involved in harmony organized by their
harmony role.
Name BBX IPA Description
Triggers
Blockers
High Tongue Bass {HTB} [r] Voiced pulmonic egressive alveolar trill (high pitch)
Other
Lateral alveolar closure {tll} [ǁ] Voiceless percussive lateral alveolar stop
Table 26 (reprinted) lists the sounds that participate in the five analyzed beat patterns with
harmony according to their function in the harmony pattern. The sounds in the “other”
group are sounds which were either prevented from undergoing harmony by nearby blocking
sounds (the forced Kick Drum and Closed Hi-Hat) or for which there is not sufficient
evidence to say what their role is (humming, and some percussives). Within each group, the
sounds do not belong to the same musical category (i.e., snare, kick, roll) and do not have the
same primary constrictors. Though the undergoers all happen to be made with compressed
oral closures (i.e., as stops), neither the triggers nor the blockers pattern by constriction
degree within their groups. The only phonetic dimension along which all three groups
pattern complementarily is their general airstream type: triggers have a lingual airstream,
were never identified by this beatboxer as distinctive sounds, are restricted to occurring near
other sounds with tongue body closures, and pattern metrically like their glottalic egressive
counterparts {B}, {PF} (glottalic egressive labiodental affricate), and {t}. (The four coronal
percussives in the “Other” group in Table 26 may also be alternants of a coronal sound like
the Closed Hi-Hat {t}, but there is not enough metrical data to be sure.) Based on this, the
sounds that undergo harmony are likely intended to be glottalic sounds but because of the
harmony are produced with a tongue body closure and without a laryngeal gesture.
Re-phrasing the airstream conclusion from the previous paragraph: triggers have lingual
airstream, undergoers shift from glottalic airstream to percussive, and blockers have
pulmonic airstream.
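The airstream generalization above amounts to a simple classification: a sound's harmony role is predictable from its airstream type alone. The sketch below illustrates this with sound names and airstream labels taken from the text; the mapping and function are hypothetical illustrations, not a claim about the beatboxer's full inventory.

```python
# Airstream types for a subset of the sounds discussed in the text.
AIRSTREAM = {
    "Spit Snare": "lingual",
    "Clickroll": "lingual",
    "Inward Liproll": "pulmonic",
    "High Tongue Bass": "pulmonic",
    "Inward K Snare": "pulmonic",
    "Kick Drum (forced)": "glottalic",
    "PF Snare": "glottalic",
    "Closed Hi-Hat": "glottalic",
}

def harmony_role(sound):
    """Predict a sound's harmony role from its airstream type alone."""
    airstream = AIRSTREAM[sound]
    if airstream == "lingual":
        return "trigger"
    if airstream == "pulmonic":
        return "blocker"
    return "undergoer"  # glottalic sounds harmonize to percussives

print(harmony_role("Clickroll"))           # trigger
print(harmony_role("Inward K Snare"))      # blocker
print(harmony_role("Kick Drum (forced)"))  # undergoer
```

That a single dimension suffices to separate the three roles is exactly what makes the airstream classes natural classes in the sense developed below.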
An equivalent way to characterize the pattern is that the triggers are all composed of
a tongue body closure gesture and another more anterior constriction whereas the rest of the
sounds do not have tongue body closures—and in the case of the Inward K Snare, do not
have an additional simultaneous anterior constriction. Pulmonic sounds, the blockers, may
override tongue body closure harmony because they fulfill both musical and homeostatic
roles (keeping the beatboxer alive long enough to finish their performance). The remaining
sounds, which happen to be glottalic, would not benefit homeostatically from blocking the
spread of the tongue body closure (since they do not afford breathing in any case due to
their usual glottal closure) and in undergoing the harmony they lose their laryngeal raising
since it is rendered inert with respect to pressure regulation by the tongue body closure.
that there must 1) be a class of sounds sharing that dimension which 2) collectively
participate in some behavior. Not only do the trigger sounds analyzed in these five beat
patterns all share the lingual airstream dimension, but so also do the showcased sounds in
the beat patterns not analyzed above—the Clop, Duck Meow SFX, Water Drop Air, and
Water Drop Tongue are all either lingual egressive or lingual ingressive and are the most
likely candidates for triggering harmony in their beat patterns. These seven lingual sounds
are also the complete set of lingual airstream sounds for this beatboxer: every lingual
airstream sound performed by this beatboxer is a likely trigger for tongue body closure
harmony. (See the appendix for drum tabs of every harmony-containing beat pattern.) The
triggers therefore constitute a natural class within the set of beatboxing sounds this
beatboxer knows. With respect to the two criteria, harmony triggers 1) share the dimension
of lingual airstream and 2) collectively trigger harmony. There is not enough data to say for
sure whether every pulmonic sound is a harmony blocker, but the evidence so far predicts as
representation because it places the trigger sounds in a cognitive relation with each other; in
doing so, it also places the triggers (lingual), blockers (pulmonic), and undergoers (other) in
a complementary cognitive relationship with each other. Section 4.2 offers a theoretical
pulmonic gestures for beatboxing which act as blockers of tongue body closure gestures.
Having established that there are sub-segmental cognitive units of beatboxing, the next step
beatboxing harmony needs to account for the behavior of triggers and their prolonged
tongue body closures, undergoers which lose a laryngeal raising gesture when the extended
tongue body closure spreads through them, and pulmonic blockers that disrupt the
spreading of the tongue body closure. This section compares two gestural
accounts—the Gestural Harmony Model (Smith, 2018) and Tilsen’s (2019) extension to the
Chapter 4: Theory provides the basis for an action-based account of beatboxing phonology.
Speech and beatboxing movements share certain constriction tasks and kinematic properties,
suggesting that the fundamental cognitive units of beatboxing are the same types of actions
as speech units—albeit with different purposes. In the language of dynamical systems, this
equivalence is expressed through the graph level which speech gestures and beatboxing
The Gestural Harmony Model (Smith, 2018) provides the means for generating these
beatboxing harmony phenomena. The Gestural Harmony Model extends the gestures of task
dynamics with a new parameter for persistent or non-persistent activation, and extends the
another gesture or until the end of the word containing the gesture. These additions to the
model are new parameters; because the graph level deals with selection of dynamical system
parameters and the relationship of those parameters to each other and to dynamical state
variables, the addition of new parameters to the model is a graph-level change (Table 32).
Under the shared-graph hypothesis, the Gestural Harmony Model’s revisions to speech
graphs must also be reflected in the graphs of beatboxing actions and their coordination. All
can have persistent actions which last until they are inhibited by another beatboxing action
6 The coupled oscillator model does not have a mechanism for starting a tongue body closure early, stretching it
regressively. Typically a gesture’s activation is associated with a particular phase of its oscillator; the oscillators
settle into a stable relative phase relationship based on their couplings before an utterance is produced, giving
later activation times to gestures associated with later sounds. The Gestural Harmony Model uses constraints in
an OT grammar to shift the onset of activation of a persistent gesture earlier in an utterance. A similar strategy
could be used for beatboxing harmony, or else a more dynamical method of selecting coupling relationships. In
either case, the force that causes harmony to happen in a theoretical model must be related to the aesthetic
principles that shape beatboxing—here perhaps the drive to create a cohesive aesthetic through a consistently
sized resonance chamber (the oral chamber in front of the tongue body). The formalization of that force is left
for future work.
or until the end of a musical phrase. The next few paragraphs schematize how the Gestural
beat pattern 5 for an example). The Kick Drum is an ejective composed of a labial
compression action and a laryngeal closing and raising action, and the Spit Snare is a lingual
egressive sound composed of a labial compression action and a tongue body compression
action. These compositions are laid out in a coupling graph at the top of Figure 109 with
coupling connections between the paired actions for each sound—the specific nature of these
connections in a coupled oscillator model determines the relative timing of these actions and
contributes to the perception of multiple gestures as part of the same segment; for present
purposes, however, the important coupling relationship to watch for is the inhibitory
coupling.
Section 3 characterized the Spit Snare as a harmony trigger, so the tongue body
closure of the Spit Snare needs to turn the Kick Drum into an unforced Kick Drum via
temporal overlap. This is accomplished by flagging the Spit Snare’s tongue body closure
gesture as persistent—marked with arrow heads on the top and bottom of the oscillator in
the coupling graph—causing it to extend temporally as far as possible both forward and
backward. By extending backward, the tongue body closure is activated before or around the
same time as the labial closure of the Kick Drum, resulting in the production of a Kick Drum
that has adopted a tongue body closure (an unforced Kick Drum). The gestural score below
Table 32. Non-exhaustive lists of state-, parameter-, and graph-level properties for dynamical
systems used in speech from Chapter 4: Theory. Parameter additions to the system from the
Gestural Harmony Model are underlined. Because the graph level is responsible for the
selection of and relationship between parameter and state variables, the addition of
persistence and inhibition to the parameter space is a graph-level change.
State level Parameter level Graph level
Section 3 also showed that the laryngeal raising gesture of the Kick Drum disappears when it
harmonizes to the tongue body closure of the Spit Snare. This can be accomplished with an
inhibitee sound to which it is coupled, then the inhibitee is prevented from activating at all.
The coupling graph in Figure 109 shows an inhibitory coupling relationship between the
tongue body closure of the Spit Snare and the larynx raising gesture of the Kick Drum; since
the tongue body closure starts before the laryngeal gesture, the laryngeal gesture never
activates. The gestural score in Figure 109 shows the “ghost” of the laryngeal gesture as a
Why does this inhibitory relationship exist in the first place? Laryngeal activity isn’t
clearly laryngeal closure/raising and tongue body closure action can even be collaborative.
And, we have seen that this canceling of the laryngeal closure/raising gesture is not a blanket
inhibition on all laryngeal activity. Figure 110 depicts the same relationship between Kick
Drum and Spit Snare as Figure 109, but with the addition of a humming phonation gesture
from the humming-while-beatboxing pattern (section 3.2). The persistent tongue body
closure from the Spit Snare inhibits the laryngeal raising gesture of the Kick Drum, just as it
did in Figure 109; however, the humming gesture has no inhibitory coupling relations, so it is
free to manifest at the appropriate time. The result is an unforced Kick Drum coarticulated
with humming and followed by a Spit Snare. (The humming gesture is depicted with
in-phase coupling to the labial closure of the Kick Drum as a way of showing that the
humming and the Kick Drum occur together on the same beat. In a more expansive account,
they might not be coupled to each other directly but instead share the activation phase of
One answer is that closing the vocal folds reduces opportunities to manage the
volume of air in the lungs. Expert beatboxing requires constant breath management because
the ability to produce a given sound in an aesthetically pleasing manner requires specific air
pressure conditions. We have seen that beat patterns can include sound sequences with many
different types of airflow; in planning the whole beat pattern (or chunks of the beat pattern),
beatboxers must be prepared to produce a variety of airstream types and so are likely to try
to maintain breath flexibility. Laryngeal closures prevent the flow of air into and out of the
lungs for breath management purposes, and therefore are antagonistic not to the tongue
body closure but to breath control. By this explanation, the inhibition of the laryngeal
of skill (Pouplier, 2012). The coordination of the body's end effectors changes as the
patterns; this has been notably recognized in quadrupeds like horses which switch into
distinct but roughly equally efficient gaits for different rates of movement (Hoyt & Taylor,
1981). In this case of laryngeal closure and raising in beatboxing, expert beatboxers are likely
to recognize that the laryngeal gesture they usually associate with a forced Kick Drum (or
other glottalic sounds that undergo harmony) has no audible consequence during tongue
body closure harmony. From this feedback, a beatboxer would learn a qualitative shift in
behavior—to not move the larynx while the tongue is making a closure. A similar thing
happens in speech in the context of assimilation due to overlap: Browman & Goldstein
(1995) provide measurements showing that when a speaker produces the phrase “tot puddles” there is
wide variation in the magnitude of the final [t] tongue tip constriction gesture, including
effective deletion. In this example, the speaker reduces or deletes their gesture when it would
have no audible consequence anyway. The same could be said of the laryngeal closure and
raising gesture in beatboxing when overlapped with tongue body closure harmony.
Figure 109. A schematic coupling graph and gestural score of a Kick Drum and Spit Snare.
The tongue body closure (TBCD) gesture of the Spit Snare overlaps with and inhibits the
closure and raising gesture of the larynx (LAR).
Figure 110. A schematic coupling graph and gestural score of a Kick Drum, humming, and a
Spit Snare. The tongue body closure (TBCD) gesture overlaps with and inhibits the closure
and raising gesture of the larynx (LAR) as in Figure 109, but the humming LAR gesture is
undisturbed.
inhibitory coupling. Figure 111 shows an example from beat pattern 1 with a {b CR B ^K}
sequence. The Inward K Snare requires a tongue body closure and lung expansion to draw
air inward over the sides of the tongue body, which is incompatible with a full tongue body
closure triggered by the Clickroll. This lung expansion action ends the persistent tongue
body gesture associated with a harmony trigger—if it didn’t, then the tongue body closure
would block the inward airflow and the Inward K Snare couldn’t be produced. Inhibiting the
persistent tongue body closure also prevents the persistent tongue body closure gesture from
inhibiting the laryngeal gesture of the Kick Drum between the Clickroll and the Inward K
Snare. As a result, the first Kick Drum that does overlap with the persistent tongue body
closure gesture has its laryngeal closure/raising gesture inhibited, but the second Kick Drum
does not.
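The inhibition chain in the {b CR B ^K} sequence can be schematized as a toy resolution procedure over gestures and inhibitory couplings. This is a deliberately simplified sketch, not the Gestural Harmony Model's actual machinery: persistent gestures are assumed to always activate and extend backward to the phrase start, non-persistent gestures are treated as point events, and the beat-index timings (including the early onset of the lung expansion gesture) are illustrative assumptions.

```python
import math

def resolve(gestures, inhibitions):
    """Resolve which gestures actually activate.

    gestures:    list of (name, onset, persistent_flag), any order.
    inhibitions: list of (inhibitor, inhibitee) pairs.
    A persistent gesture is active from t=0 until an inhibitor ends it;
    a non-persistent gesture is suppressed if a persistent inhibitor is
    active at its onset.
    """
    persistent = {n: (0.0, math.inf) for n, t, p in gestures if p}
    surfaced = set(persistent)
    for name, onset, p in sorted(gestures, key=lambda g: g[1]):
        if p:
            continue
        blocked = any(
            inhibitee == name
            and inhibitor in persistent
            and persistent[inhibitor][0] <= onset < persistent[inhibitor][1]
            for inhibitor, inhibitee in inhibitions
        )
        if blocked:
            continue
        surfaced.add(name)
        # an activating gesture ends any persistent gesture it inhibits
        for inhibitor, inhibitee in inhibitions:
            if inhibitor == name and inhibitee in persistent:
                start, end = persistent[inhibitee]
                persistent[inhibitee] = (start, min(end, onset))
    return surfaced

# {b CR B ^K}: the first larynx gesture (LAR1) is caught by the backward-
# extended Clickroll closure (TBCD); the Inward K Snare's lung expansion
# (PULM) begins early and ends TBCD before the second larynx gesture
# (LAR2) starts, so LAR2 activates and that Kick Drum stays forced.
gestures = [("LAR1", 0.0, False), ("TBCD", 1.0, True),
            ("PULM", 1.5, False), ("LAR2", 2.0, False)]
inhibitions = [("TBCD", "LAR1"), ("TBCD", "LAR2"), ("PULM", "TBCD")]
print(sorted(resolve(gestures, inhibitions)))  # ['LAR2', 'PULM', 'TBCD']
```

The sketch reproduces the asymmetry described above: LAR1 is suppressed while LAR2 surfaces, purely because PULM truncates the persistent TBCD gesture before LAR2's onset.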
Positing a breathing task is a major departure from the typical tract variables of Articulatory
Phonology, and reasonably so—no language uses pulmonic ingressive airflow to make a
phonological contrast (Eklund, 2008). Pulmonic egressive airflow, on the other hand, is
practically ubiquitous in speech, which means that it does not really operate contrastively
either. Either way, there has been no need to posit any kind of pulmonic gesture for speech.
and appears to contribute to productive sound patterns, indicating that it is cognitive too.
Figure 111. A schematic coupling graph and gestural score of a {b CR B ^K} sequence. The
tongue body closure (TBCD) gesture of the Clickroll overlaps with and inhibits the laryngeal
closing and raising gesture (LAR) of the first Kick Drum. The lung expansion (PULM)
gesture coordinated with the Inward K Snare inhibits the TBCD gesture of the Clickroll
before the TBCD gesture can inhibit the second LAR gesture.
The shared-graph hypothesis of Chapter 4: Theory predicts that beatboxing and speech will
exhibit similar patterns of behavior permitted by the dynamical graph structures they use.
The Gestural Harmony Model augments the graphs of the task dynamics framework and the
coupling graph system to include gestural persistence and inhibition options; any predictions
of possible action patterns made by the Gestural Harmony Model should therefore also be
predictions about possible beatboxing patterns. The finding that beatboxing harmony exists
in such a speechlike form provides evidence in favor of both the shared-graph hypothesis
and the Gestural Harmony Model.
The support is all the stronger because the gestural analysis of beatboxing harmony
includes patterns that are predicted by the Gestural Harmony Model but unattested in
speech. As Smith (2018:204-206) discusses, intergestural inhibition may not be constrained
enough for speech: inhibition is introduced specifically so that an inhibitor gesture can block
the spread of a persistent inhibitee gesture, but it is equally possible in the model that a
persistent gesture could inhibit non-persistent gestures—even though such a thing never
appears to occur in speech. Within the narrow domain of speech, the Gestural Harmony
Model over-generates inhibition patterns. But beatboxing uses those patterns when
persistent tongue body closure gestures inhibit laryngeal raising gestures; under the
shared-graph hypothesis, the predictions of the Gestural Harmony Model are met with
evidence. (It is of course possible, maybe even likely, that the lack of attestation of this
pattern in speech reflects a lack of
investigations of the types of speech harmony that could exhibit this. If this pattern were
found in speech, it would mean that the Gestural Harmony Model does not over-generate
patterns and that speech and beatboxing harmony have one more thing in common.)
Tilsen (2019) offers two different gesture-based accounts for the origins of non-local
harmony. In this
system, harmony is thought to start off more or less accidentally because of how
domain-general motor control works and to later become phonologized into a regular part of
the phonology. When a set of gestures is
selected, their excitation level ramps up high enough to trigger the dynamic motor processes
associated with that set of gestures; other gestures that have not been selected yet or which
were already selected and subsequently “demoted” are still present but are not excited
strongly enough to be selected or to influence motor planning fields in any way. The
continuous process of selection is discretized into static “epochs” that describe a snapshot
view of the whole state of the system and the gestures therein. One cause of demotion is
gestural antagonism
in the dynamical field. Gestural antagonism is formalized as one gesture’s excitatory side
conflicting with another’s inhibitory side; when two antagonistic gestures would be selected
into the same epoch, the inhibitory gesture demotes the excited gesture from the selection
pool.
Tilsen’s account of local spreading harmony (which we have argued is the nature of
beatboxing harmony) relies on selectional dissociation,
by which a gesture may be selected early or de-selected late relative to the epoch it would
normally be selected into. Blocking in this model occurs when a gesture which has been
selectionally dissociated conflicts with an inhibitory gesture in another epoch. In the case of
nasal harmony, for example, a velum lowering gesture might fail to be suppressed, causing it
to persist into later epochs and overlap each subsequently selected
gesture. The velum lowering would be blocked if it were ever extended into an epoch in
which an antagonistic, inhibiting velum closing gesture was also selected: the inhibitory
velum raising gesture would demote the velum lowering gesture, causing the lowering
gesture to slip below the threshold of influence over the vocal tract.
The tongue body closure spreading and pulmonic airflow blocking in beatboxing can
be accounted for by similar means, with the tongue body closure gesture being anticipatorily
de-gated (selected early) and remaining un-suppressed unless it conflicts with the selection
of an antagonistic pulmonic gesture (e.g., from an Inward K Snare) in a later epoch. This has
the advantage of providing an explicit explanation for why some Kick Drums in sequences
like {CR dc B ^K} do not undergo harmony: if the Kick Drum and Inward K Snare are
selected during the same epoch, then the Inward K Snare’s pulmonic gesture blocks the
spread of harmony during that whole epoch, effectively defending the Kick Drum from
harmony.
Tilsen’s second account offers a different mechanism for dealing with non-local phonological agreement patterns: “leaky” gating. A
gesture that is gated is not selected and therefore exerts no influence on the tract variable
planning fields—and therefore, has no influence on the vocal tract. But if a gesture is
imperfectly gated, its influence can leak into the tract variable planning field even though it
hasn’t been selected. Leaky gating cannot be blocked because blocking is formalized as a
co-selection restriction; since the leaking gesture has not actually been selected, it cannot be
demoted by an antagonist. Beatboxing harmony
features blocking behavior, which makes leaky gating inappropriate for the crux of a
beatboxing harmony account (though it may better suit non-local agreement,
which is generally not blocked). But there is nothing to say that leaky gating can’t be used
with selectional dissociation; on the contrary, if a spreading gesture has an intrinsically high
excitation level, it would be all the more likely to lurk beneath the selection threshold,
leaking its influence into the vocal tract without antagonizing the currently selected gestures.
This could explain why the tongue body remains elevated during most of the forced Kick
Drums in the complex example in section 3.5.1: the pulmonic gesture of the Inward K Snare
blocks the spreading tongue body closure gesture by demoting it to sub-selection levels, but
the tongue body closure gesture leakily lingers and keeps the tongue body relatively high.
Only near the end of the beat pattern does the tongue body closure gesture stop leaking,
allowing the tongue body to lower.
So far as we can tell, however, the loss of the laryngeal closing/raising gestures during
Kick Drums and other harmony-undergoer sounds cannot be accounted for in this model.
Perhaps the persistent tongue body closure gesture could be treated as antagonistic to those laryngeal
gestures—but it is not clear in the model what the nature of the antagonism is or how this
antagonism would produce the observed pattern.
To conclude this theoretical accounting of beatboxing harmony, recall from section 1 that
models of phonological harmony that only account for linguistic harmony should be
dispreferred to models that can accommodate beatboxing harmony as well. What about a
more traditional, purely domain-specific phonological framework based around symbolic
features? Such a framework could perhaps describe
beatrhyming-free beatboxing harmony, though great care would need to be taken in order to define sensible
features for beatboxing. One might posit a set of complementary airstream features {+
pulmonic} and {+ lingual} for sounds with either pulmonic or lingual airstream. An Inward K
Snare would be defined as {+ pulmonic} for airstream and, because it is made with the
tongue body, {+ dorsal} for its place feature (the primary constrictor, when mapped to
phonetics). Because pulmonic and lingual airstreams are complementary, the Inward K Snare
would also be {- lingual}. Though not a deal-breaker per se, it would be a little strange in a
phonetically grounded model for a sound to be both {+ dorsal} and {- lingual}: there is no
qualitative distinction between a tongue body closure used for a pulmonic dorsal sound on
the one hand and a tongue body closure for a lingual egressive, lingual ingressive, or
percussive airstream sound on the other—in either case, the tongue body’s responsibility is to
stop airflow between the oral cavity and the pharynx. There would also need to be a
mechanism for preventing boxeme inputs that are simultaneously {+ lingual, + dorsal}
because the tongue body can’t manipulate air pressure behind itself. The gestural approach
has none of these issues: both an Inward K Snare and a lingual airstream sound simply
include a tongue body closure gesture.
To the main point, there is the question of whether most featural accounts of
linguistic harmony have any justification for extending to beatboxing harmony. We have
seen already that gestures are defined by both their domain-specific task and the
domain-general system for producing constriction actions in the vocal tract; by the
hypothesis laid out in Chapter 4: Theory, the domain-general capacity of the graph level to
implement linguistic harmony predicts that gesture-like beatboxing units should come with
the same ability. Beatboxing harmony is thus predicted from linguistic harmony in a gestural
framework. Phonological
features, by contrast, are concerned exclusively with their encoding of linguistic contrast and linguistic
patterns, and are historically removed from phonetics and the physical world by design
(though they have become more and more phonetically-grounded over time). The grammars
that operate over those features are intended to operate exclusively over linguistic inputs and
outputs. Though phonological features and grammar could be adapted to beatboxing, every part of
that adaptation would have to be stipulated rather than predicted.
5. Conclusion
Phonological harmony is not unique to speech: common beat patterns in beatboxing like the
humming while beatboxing pattern have the signature properties of phonological harmony
including triggers, undergoers, and blockers. This suggests that phonology (or at least
harmony) is not exclusive to speech, supporting the
notion that speech and beatboxing phonology are each specializations of a domain-general
capacity. In a gestural framework, the shared
harmony ability is expressed this way because gestures are essentially domain-general action
units.
CHAPTER 7: BEATRHYMING
Beatrhyming is the simultaneous performance of beatboxing and speech (i.e., singing or rapping) by a single individual. This case study of a
beatrhyming performance demonstrates how the tasks of beatboxing and speech interact to
create a piece of art. Aside from being marvelous in its own right, beatrhyming offers new
insights that challenge the fundamentals of phonological theories built to describe talking
alone.
1. Introduction
1.1 Beatrhyming
One of many questions in contemporary research in phonology is how the task of speech
interacts with other concurrent motor tasks. Co-speech manual gestures (Krivokapić, 2014;
Danner et al., 2019), co-speech ticcing from speakers with vocal Tourette’s disorder (Llorens,
in progress), and musical performance (Hayes & Kaun, 1996; Rialland, 2005; Schellenberg,
2013; Schellenberg & Gick, 2020; McPherson & Ryan, 2018) are just a few examples of
behaviors which may not be under the purview of speech in the strictest traditional sense
but which all collaborate with speech to yield differently organized speech performance
modalities. Studying these and other multi-task behaviors illuminates the flexibility of
speech units and their organization in a way that studying talking alone cannot.
This chapter introduces beatrhyming, a type of speech that has not previously been
investigated from a linguistic perspective (see Blaylock & Phoolsombat, 2019 for the first
presentation of this work, and also Fukuda, Kimura, Blaylock, and Lee, 2021). Beatrhyming is
the simultaneous performance of beatboxing and speech
(i.e., singing or rapping) by a single individual. Notable beatrhyming performers include Kid
Lucky, Rahzel, and Kaila Mullady, though more and more beatboxers are taking up
the art form. Beatrhyming is a form of
communication that is composed of a beatboxing task and a speech task. The question at
hand is: how do the speech and beatboxing tasks interact in beatrhyming?
The two tasks need not relate to each other in only one way: artists use beatboxing sounds differently in their beatrhyming. For
example, Rahzel’s beatrhyme performance “If Your Mother Only Knew” (an adaptation of
Aaliyah’s “If Your Girl Only Knew”) uses mostly Kick Drums, while Kaila Mullady (whose
beatrhyming is analyzed in this chapter) more often uses a variety of beatboxing sounds in
her beatrhyming.
In some cases, words and beatboxing sounds are produced sequentially. Taking the word “got”
[gat] as an example, a sequential combination might be
{B}[gat] (a Kick Drum, followed by the word “got”). In other cases, words and beatboxing
sounds may overlap, as in {K}[at] (with a Rimshot completely replacing the intended [g]).
Consider the word “dopamine” /dopəmin/ in Figure 112: the Closed Hi-Hat {t} replaces the intended
speech sound /d/ and Kick Drum {B} replaces the intended speech sound /p/. In both cases,
the /d/ and /p/ were segmented on the phoneme (“phones”) tier with the same temporal
interval as the replacing beatboxing sound (on the “beatphones” tier). The screenshot also
features one example of partial overlap, a K Snare {^K} that begins in the middle of an [i]
(annotated "iy").
For reference, Table 33 below lists the five main beatboxing sounds that will be
referred to in this chapter. Each beatboxing sound is presented with both Standard Beatbox
Notation (SBN) (TyTe & Splinter, 2019) in curly brackets and IPA in square brackets. (The
IPA notation for the Inward K Snare uses the downward arrow [↓] from the extIPA symbols
for disordered speech to indicate pulmonic ingressive airflow, and should not be confused
with the similar-looking arrow used to mark downstep.)
Sections 1.2-1.3 present hypotheses and predictions about how beatboxing and
speech may (or may not) cooperate to support the achievement of their respective tasks.
Section 2 presents the method used for analysis and section 3 describes the results. Finally,
section 4 suggests that more studies of musical speech and other understudied linguistic
behaviors can offer new insights that challenge phonological theories based solely on talking.
Figure 112. Waveform, spectrogram, and text grid of the beatrhymed word “dopamine”.
1.2 Hypotheses and predictions
1.2.1 Constrictor-matching
Depending on the nature of the replacements, cases like the complete replacement of /d/ and
/p/ in the word “dopamine” from Figure 112 could be detrimental to the tasks of speech
production. In the production of the word "got" [gat], the [g] is intended to be performed as
a dorsal stop. If the [g] were replaced by a beatboxing dorsal stop, perhaps a velar ejective
Rimshot {K’}, at least part of the speech task could be achieved while simultaneously
beatboxing. On the other hand, replacing the intended [g] with a labial Kick Drum {B}
would deviate farther from the intended speech tasks for [g]. If the difference were great
enough, making replacements that do not support the intended speech goals might lead to
listeners misperceiving beatrhyming lyrics—in this case, perhaps hearing “bot” [bat] instead of
“got”.
So then, if the speech task and the beatboxing task can influence each other during
beatrhyming, the speech task may prefer that beatrhyming replacements match the intended
speech signal as often as possible and along as many phonetic dimensions as possible. This
chapter investigates whether replacements support the speech task by making replacements
that match intended speech sounds in constrictor type (i.e., the lips, the tongue tip, or the
tongue body) and in constriction degree. An earlier proposal (2005) offers the similar hypothesis that perception of
simultaneous beatboxing and singing might be maximized if beatboxing
sounds have the same place of articulation as the speech sounds they replace.
To summarize: the main hypothesis is that speech and beatboxing interact with each
other in beatrhyming in a way that supports the accomplishment of intended speech tasks.
This predicts that beatboxing sounds and the intended speech sounds they replace are likely
to match in constrictor and constriction degree. Conversely, the null hypothesis is that the
two systems do not interact in a way that supports the intended speech tasks, predicting that
beatboxing sounds replace speech sounds with no regard for the intended constrictor or
constriction degree.
The predictions of these hypotheses for constrictor matching are depicted in Figures 113
and 114. Suppose 90 intended speech sounds—30
intended speech labials, 30 intended speech coronals, and 30 intended speech dorsals—are
replaced by beatboxing sounds. The replacing beatboxing sounds come from a similar
distribution of constrictors. If
replacements are made with no regard to the constrictor of intended speech sounds
(following from the null hypothesis), constrictor matches should occur at chance. Each
group of 30 sounds would then yield about 10 constrictor matches and 20 mismatches, as in
Figure 113. But if replacements are sensitive to the intended constrictor (following from the
main hypothesis), then most beatboxing sounds should match the constrictor of the
speech sounds they replace, as in Figure 114.
Figure 113. Bar plot of the expected counts of constrictor matching with no task interaction.
Figure 114. Bar plot of the expected counts of constrictor matching with task interaction.
Consider also the predicted distributions for any single beatboxing constrictor (Figure 115).
For example, if 30 dorsal beatboxing replacements (i.e., K Snares) are made with no regard to
intended speech constrictor (following from the null hypothesis), then 10 of those
replacements should mismatch to intended speech labials, 10 should mismatch to intended
speech coronals, and 10 should match to intended speech dorsals. But if replacements are
sensitive to intended constrictor (following from the main hypothesis), then all 30
beatboxing dorsals are expected to replace intended speech dorsals (Figure 116).
Constriction degree matching follows similar logic: if beatrhyming replacements are made with an aim of satisfying speech tasks, then
replacements are more likely to occur between speech sounds and beatboxing sounds that have
similar constriction degrees. Since beatboxing sounds are stops and trills (see
earlier chapters) and the lyrics are in
English, which has no phonological trills, the prediction of the main hypothesis is that speech
stops will be replaced more frequently than speech sounds of other manners of articulation.
On the other hand, the null hypothesis would be supported by finding that beatboxing
sounds replace speech sounds of all manners of articulation at comparable rates.
Figure 115. Bar plots of the expected counts of K Snare constrictor matching with no task
interaction.
Figure 116. Bar plots of the expected counts of K Snare constrictor matching with task
interaction.
1.2.2 Beat pattern repetition
As established in earlier chapters, beatboxing beat patterns have their own predictable sound
organization within a beat pattern. The presence of a snare drum sound on the back beat
(beat 3 of each measure) of a beat pattern in particular is highly consistent, but beat patterns
are also often composed of regular repetition at larger time scales. Speech utterances are
highly structured as well, but the sequence of words (and therefore sounds composing those
words) is determined less by sound patterns and more by syntax (cf. Shih & Zuraw, 2017).
However, artistic speech (i.e., poetry, singing) is sometimes composed alliteratively or with
other specific sound patterns in mind, leveraging the flexibility of language to express similar
meanings through different sounds.
There are (at least) two ways beatboxing and speech could interact while maximizing
constrictor matching as hypothesized in section 1.2.1. First, the words of the song could be
planned without any regard for the resulting beat pattern. Any co-speech beatboxing sounds
would be planned based on the words of the song, prioritizing faithfulness to the intended
spoken utterance. Alternatively, the lyrics could be planned around a beatboxing beat
pattern, prioritizing the performance of an aesthetically appealing beat pattern. The counts
of constrictor matches described in section 1.2.1 could look the same either way, but the two
hypotheses predict that the resulting beat patterns will be structured differently. Specifically,
prioritizing the beatboxing beat pattern predicts that beatrhyming will feature highly
repetitive co-speech beatboxing
sequences. The rest of this section discusses these predictions in more detail.
A sequence of beatboxing sounds often repeats itself after just two measures of
music—that is, a two-measure or “two-bar” phrase (and also in this study, a “line” of music)
might be performed several times. For example, Figure 117 shows a sixteen-bar beatboxed
phrase composed of eight
two-bar phrases. Each two-bar phrase could be distinct from the others, but in fact there are
only two types of two-bar phrases: AB and AC, where A, B, and C each refer to a sequence of
sounds in a single measure of music. The two-bar phrase AB occurs six times in the beat
pattern on lines 1, 2, 3, 5, 6, and 7. Lines 4 and 8 of the beat pattern feature the two-bar
phrase AC.
The depiction of the sixteen-bar phrase in Figure 117 appears sequential, but is in fact
hierarchical: pairs of two-bar phrases compose four-bar phrases, pairs of four-bar phrases
compose eight-bar phrases, and a pair of eight-bar phrases composes the entire sixteen-bar
phrase. In fact, one way to model the creation of this structure is to merge progressively
larger repeating units. That is, given an initial two-bar phrase, a four-bar phrase can be
created by assembling two instances of that two-bar phrase into a larger unit. Likewise, an
eight-bar phrase can be created from two four-bar phrases, and a sixteen-bar phrase from
two eight-bar phrases.
There is room for variation here, and lines may change based on the artist’s musical
choices. In Figure 117, the end of the first eight-bar phrase deviates from the rest of the
pattern, possibly to musically signal the end of the phrase. In this case, the whole eight-bar
phrase is then copied to create a sixteen-bar phrase, resulting in repetition of that deviation
at the end of the sixteen-bar phrase.
This hierarchical composition can be used to predict where repeating two-bar phrases
are most likely to be found in a sixteen-bar beat pattern. The initial repetition of a two-bar
phrase to make a four-bar phrase predicts that lines 1 & 2 should be similar (where each line
is a two-bar phrase). Likewise, repetition of that four-bar phrase to make an eight-bar phrase
would predict repetition between lines 3 & 4; at a larger time scale, this would also predict
that lines 1 & 3 should be similar to each other, as should lines 2 & 4. In the sixteen-bar
phrase composed of two repeating eight-bar phrases, the repetition relationships from the
previous eight-bar phrase would be copied over (lines pairs 5 & 6, 7 & 8, 5 & 7, and 6 & 8);
repetition would also be expected between corresponding lines of these two eight-bar
phrases, predicting similarity between lines 1 & 5, 2 & 6, 3 & 7, and 4 & 8.
Figure 117. Serial and hierarchical representations of a 16-bar phrase (8 lines with 2 measures
each).
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t
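The merge-of-repeats construction described above can be sketched in a few lines of Python. This is an illustrative model only: the one-measure labels "A", "B", "C" and the `double` helper are assumptions made for the sketch, not part of any analysis tool.

```python
# Illustrative sketch of hierarchical phrase construction by doubling.
# Assumption: a phrase is a flat list of one-measure labels ("A", "B", "C").

def double(phrase):
    """Repeat a phrase to form a unit twice as long."""
    return phrase + phrase

two_bar = ["A", "B"]
four_bar = double(two_bar)        # A B A B
eight_bar = double(four_bar)      # A B A B A B A B
eight_bar[-1] = "C"               # artistic deviation at the phrase edge
sixteen_bar = double(eight_bar)   # the deviation is copied into line 8

# Group measures into two-bar lines and list the line pairs that repeat.
lines = [tuple(sixteen_bar[i:i + 2]) for i in range(0, 16, 2)]
matching_pairs = [(i + 1, j + 1)
                  for i in range(len(lines))
                  for j in range(i + 1, len(lines))
                  if lines[i] == lines[j]]

print(lines[3], lines[7])        # lines 4 and 8 both end up as the deviant AC phrase
print((4, 8) in matching_pairs)  # → True
```

The resulting `matching_pairs` contains exactly the line pairs predicted to be similar (1 & 2, 1 & 3, 4 & 8, and so on), mirroring the AB/AC structure of Figure 117.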
Because deviations from the initial two-bar pattern are expected to occur in the interest of
musical expression, some pairs of two-bar phrases are more likely to exhibit clear repetition
than others. Consider a four-bar phrase composed of two two-bar phrases AB and AC—their
first measures (A) are identical, but their second measures (B and C) are different. If this
four-bar phrase is repeated to make an eight-bar phrase, the result would be AB-AC-AB-AC.
In this example, lines 1 & 3 match as do lines 2 & 4, but lines 2 & 3 and 1 & 4 do not. For this
study, the discussion of repetition in beatrhyming is limited to just those pairs of lines
described earlier which are most likely to feature repetition (“cross-group” refers to
pairs of corresponding lines in the two eight-bar phrases). If the beatboxing beat pattern
is prioritized—whether because the two tasks
aren’t sensitive to each other at all or because the speech system accommodates beatboxing
through lyrical choices that result in an ideal beat pattern—then sequences of co-speech
beatboxing sounds should have similarly high repetitiveness compared to beat patterns
performed without speech. But if speech structure is prioritized, then the beat pattern is
predicted to sacrifice repetitiveness in exchange for supporting the speech task by matching
the intended constrictor and constriction degree of any speech segments being replaced.
1.2.3 Summary of hypotheses and predictions
The main hypothesis is that speech and beatboxing interact during beatrhyming to
accomplish their respective tasks, and the null hypothesis is that they do not. Support for the
first hypothesis could appear in two different forms, or possibly both at the same time. First,
if beatrhyming replacements are sensitive to the articulatory goals of the intended speech
sound being replaced, then the beatboxing sounds that replace speech sounds are likely to
match their targets in constrictor and constriction degree. Second, if beatboxing sequencing
is prioritized, then beatrhymed beat patterns
should exhibit the same structural repetitiveness as non-lyrical beatboxing sequences. Failing
to support either of these predictions would support the null hypothesis and the notion that
speech and beatboxing do not interact. Of course, the balance could be struck differently
between speech and beatboxing depending on the artist’s musical aims. The results of this
study should be taken as an account of one way that beatrhyming has been performed, but
not as the only way it can be done.
2. Method
This section describes how the data were collected and coded (section 2.1) and analyzed
(2.2).
2.1 Data
The data in this study come from a beatrhyming performance called "Dopamine", created
and performed by Kaila Mullady and made publicly available on YouTube (Mullady, 2017).
In “Dopamine”, beatboxing sounds are frequently produced both sequentially and
concurrently with speech. The lyrics of “Dopamine” were provided by Mullady over email.
Segmentation was performed at the level of words, phonemes (“phones”), beatboxing sounds
(“beatphones”), and the musical beat (“beats”) on which beatboxing sounds were performed.
For complete sound replacements, the start and end of the annotation for the interval of the
intended speech phone were the same as the start and end of the beatboxing beatphone
interval.
Five beatboxing sounds were used in the beatrhymed sections of "Dopamine": Kick
Drum {B}, Closed Hi-Hat {t}, PF Snare {PF}, Rimshot {K}, and K Snare {^K}. (It was not clear
from the acoustic signal whether the K Snares were Inward or Outward; a choice was made
to annotate them consistently as Inward {^K}. The choice of Inward or Outward does not
affect the outcome of this study which addresses only constrictor—which Inward and
Outward K Snares share). Each beatboxing sound was coded by its major constrictor: {B} and
{PF} were coded as “labial”, {t} was coded as “coronal” (tongue tip), and {K} and {^K} were
coded as “dorsal” (tongue body). Finally, the metrical position of each replacement was
annotated with points on a PointTier aligned to the beginning of beatboxing sound intervals.
2.2 Analysis
The mPraat software (Bořil & Skarnitzl, 2016) for MATLAB was used to count the number
of complete replacements (including cases in which one
beatboxing sound replaced two speech sounds) (n = 88). The constrictor of the originally
intended speech sound was then compared against the constrictor for the replacing
beatboxing sound, noting whether the constrictors were the same (matching) or different
(mismatching).
Constriction degree matching was likewise measured by counting how many speech
sounds of different constriction degrees were replaced—or in this case, different manners of
articulation. All the beatboxing sounds that made replacements were stops {B} or affricates
{PF, t, K’, (^)K}; higher propensity for constriction degree matching would be found if the
speech sounds being replaced were more likely to also be stops and affricates instead of
other manners of articulation.
Four sixteen-bar sections labeled B, C, D, and E were chosen for repetition analysis.
(“Dopamine” begins with a refrain, section A, that was not analyzed because it has repeated
lyrics that were expected to inflate the repetition measurements. The intent of the ratios is to
assess whether beat patterns in beatrhyming are as repetitive as beat patterns without lyrics,
not how many times the same lyrical phrase was repeated.) Sections B and D were
non-lyrical beat patterns (no words) between the refrain and the first verse and between the
first and second verses, respectively. Sections C and E were the beatrhymed (beatboxing with
words) first and second verses, respectively. The second verse was 24 measures long, but was
analyzed only over its first sixteen measures.
Repetitiveness was assessed using two different metrics. The first metric counted how
many unique measures occurred in a sixteen-bar
section of music. The more unique measures are found, the less repetition there is. Rhythmic
variations within a measure were ignored for this metric to accommodate artistic flexibility
in timing. For example, Figure 118 contains two two-bar phrases; of those four measures, this
metric would count three of them as unique: {B t t ^K}, {th PF ^K B}, and {B ^K B}. The first
measures of each two-bar phrase would be counted as the same because the sequence of
sounds in the measure is the same despite use of triplet timing on the lower line (using beats
1.67 and 2.33 instead of beats 1.5 and 2). This uniqueness metric provides an integer value
representing how much repetition there is over a sixteen-bar section; if beatrhyming beat
patterns resemble non-lyrical beatboxing patterns, each section’s uniqueness metric should
be similarly small.
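A minimal sketch of this uniqueness metric, assuming each measure is encoded as the ordered list of its SBN sound labels with beat timings already stripped (so that rhythmic variants of the same sound sequence collapse together):

```python
# Sketch of the uniqueness metric. Assumption: each measure is written as the
# ordered sequence of SBN sound labels, with beat timings removed so that
# rhythmic variants of the same sound sequence count as the same measure.

def count_unique_measures(measures):
    """Count distinct measures in a section, ignoring within-measure timing."""
    return len(set(tuple(m) for m in measures))

# The four measures of the Figure 118 example: the first measures of the two
# phrases share the same sound sequence despite their different timing, so
# only three measures are unique.
section = [
    ["B", "t", "t", "^K"],    # phrase 1, measure 1
    ["th", "PF", "^K", "B"],  # phrase 1, measure 2
    ["B", "t", "t", "^K"],    # phrase 2, measure 1 (triplet timing, same sounds)
    ["B", "^K", "B"],         # phrase 2, measure 2
]
print(count_unique_measures(section))  # → 3
```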
The second metric is a proportion called the repetition ratio. For a given pair of
two-bar phrases, the number of beats that had matching beatboxing sounds was divided by
the number of beats that hosted a beatboxing sound across both two-bar phrases. This
provides the proportion of beats in the two phrases that were the same, normalized by the
number of beats that could have been the same, excluding beats on which neither two-bar
phrase hosted a sound.
For example, the two two-bar phrases in Figure 118 are the same for 4/10 beats,
resulting in a repetition ratio of 0.4. In measure 1 the sounds of beats 1 and 3 match, but the
second two sounds of the first phrase are on beats 1.5 and 2 whereas the second two sounds
of the second phrase are performed with triplet timing on beats 1.67 and 2.33. Therefore in
the first measure, six beats have a beatboxing sound in either phrase—beats 1, 1.5, 1.67, 2, 2.33,
and 3—but only two of those beats have matching sounds. In the second measure, four beats
have a beatboxing sound in either phrase—beats 1, 2, 3, and 4. While two of those beats have
the same beatboxing sound in both phrases, beat 1 only has a sound in the first phrase and
beat 2 has a PF Snare in the first phrase but a Kick Drum in the bottom phrase. Looking at
the phrases overall, ten beats carry a beatboxing sound in either phrase but only four beats
have the same sound repeated in both phrases for a repetition ratio of 0.4.
This calculation penalizes cases like the first half of the example in Figure 118 in
which the patterns are identical except for a slightly different rhythm. The high sensitivity to
rhythm of this repetition ratio measurement was selected to complement the rhythmic
insensitivity of the previous technique for counting how many unique measures were in a
beat pattern. In practice, this penalty happened to only lower the repetition ratio for phrases
that were beatboxed without lyrics (co-speech beat patterns rarely had patterns with the
same sounds but different rhythms, so there were few opportunities to be penalized in this
way); despite this, the repetition ratios for beatrhymed patterns were still lower than the
repetition ratios for beatboxed patterns in the same song (see section 3.3 for more details).
Figure 118. Example of a two-line beat pattern. Both lines have a sound on beats 1 and 3 of
the first measure and beats 2, 3, and 4 of the second measure.
1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
------------------------------------------------------------------------------
B t t ^K th PF ^K B
B t t ^K B ^K B
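The repetition ratio arithmetic worked through above can be sketched as follows, using the Figure 118 phrases. The encoding of a phrase as a dictionary from (measure, beat) positions to SBN labels is an assumption made for this sketch.

```python
# Sketch of the repetition ratio. Assumption: each two-bar phrase is encoded
# as a dict mapping (measure, beat) positions to SBN sound labels.

def repetition_ratio(phrase_a, phrase_b):
    """Proportion of occupied beats whose sounds match across two phrases."""
    occupied = set(phrase_a) | set(phrase_b)        # beats with a sound in either phrase
    matching = {pos for pos in set(phrase_a) & set(phrase_b)
                if phrase_a[pos] == phrase_b[pos]}  # same sound at the same beat
    return len(matching) / len(occupied)

# The two phrases from Figure 118 (the second uses triplet timing in measure 1).
phrase_1 = {(1, 1): "B", (1, 1.5): "t", (1, 2): "t", (1, 3): "^K",
            (2, 1): "th", (2, 2): "PF", (2, 3): "^K", (2, 4): "B"}
phrase_2 = {(1, 1): "B", (1, 1.67): "t", (1, 2.33): "t", (1, 3): "^K",
            (2, 2): "B", (2, 3): "^K", (2, 4): "B"}

print(repetition_ratio(phrase_1, phrase_2))  # → 0.4
```

Beats occupied in only one phrase, or occupied by different sounds in the two phrases, count against the ratio, reproducing the 4/10 = 0.4 result computed by hand above.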
Within each section, the repetition ratio was calculated for three types of two-bar phrase
pairs: adjacent pairs (phrases 1 & 2, 3 & 4, 5 & 6, and 7 & 8), alternating pairs (phrases 1 & 3,
2 & 4, 5 & 7, and 6 & 8), and cross-group pairs (phrases 1 & 5, 2 & 6, 3 & 7, and 4 & 8).
Additionally, repetition ratio was calculated between sections B & D and between sections C
& E to see if musically related sections used the same beat pattern. Repetition ratios
measured for the beatboxed and beatrhymed sections were then compared pairwise to assess
the relative repetitiveness of the two performance modes. A transcription of each section was
prepared for both
measurement techniques. This transcription excluded phonation and trill sounds during the
beatboxing patterns because they extend over multiple beats and would inflate the number
of beats counted in the calculation of the repetition ratio. (The excluded beatboxing sounds
remained part of the performance; they were simply left out of both metrics.)
3. Results
Section 3.1 measures the extent to which the beatrhyming replacements were
constrictor-matched and section 3.2 does likewise for manner of articulation; both assess
whether the selection of beatboxing sounds accommodates the speech task. Section 3.3
quantifies the degree of repetition during beatrhyming to determine whether the selection of
3.1 Constrictor-matching
Section 3.1.1 shows that replacements are constrictor-matched overall. Section 3.1.2 considers
replacements in two groups, showing that there is a high degree of constrictor matching off
the back beat but little constrictor matching on the back beat. Section 3.1.3 offers possible
explanations for the few exceptional replacements that were off the back beat and not
constrictor-matched.
Figure 119 shows the number of times an intended speech sound was replaced by a
beatboxing sound of the same constrictor (blue bars, the left of each pair) or by a beatboxing
sound of a different constrictor (orange bars, the right of each pair) for every complete
replacement in “Dopamine.”
A majority of replacements (66/88) were constrictor-matched, appearing to support the hypothesis that speech and beatboxing interact in beatrhyming. But
while the majorities of intended labials and intended coronals were also replaced by
beatboxing sounds with matching labial or coronal constrictors, there was still a fairly large
number of mismatches for each (10/28 mismatches for labials, 10/31 mismatches for
coronals). This degree of mismatching is still less than the level of mismatching predicted by chance under a lack of interaction between the two systems.
Figure 119. Bar plot showing measured totals of constrictor matches and mismatches.
Table 34 shows the contingency table of replacements by constrictor. Highlighted cells along
the upper-left-to-bottom-right diagonal represent constrictor matches; all other cells are
constrictor mismatches. Reading across each row reveals how many times an intended
speech constriction was replaced by each beatboxing constrictor. For example, intended
speech labials were replaced by beatboxing labials 18 times, by beatboxing coronals 0 times,
and by beatboxing dorsals 10 times. A chi-squared test over this table rejects the null
hypothesis that beatboxing sounds replace intended speech sounds at random (χ² = 79.15, df = 4, p < 0.0001).
Table 34. Contingency table of beatboxing sound constrictors (top) and the speech sounds
they replace (left).
                        Replacing beatboxing sound
Intended speech sound   Labial   Coronal   Dorsal   Total
Labial                    18        0        10       28
Coronal                    2       21         8       31
Dorsal                     2        0        27       29
Total                     22       21        45       88
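The chi-squared statistic reported with Table 34 (χ² = 79.15, df = 4) can be reproduced directly from the observed counts. A minimal sketch in Python using only the standard library; the counts are copied from Table 34, and the variable names are ours:

```python
# Reproduce the chi-squared test over Table 34.
# Rows: intended speech constrictor (labial, coronal, dorsal);
# columns: replacing beatboxing constrictor (labial, coronal, dorsal).
observed = [
    [18, 0, 10],   # intended labial
    [2, 21, 8],    # intended coronal
    [2, 0, 27],    # intended dorsal
]

row_totals = [sum(row) for row in observed]        # [28, 31, 29]
col_totals = [sum(col) for col in zip(*observed)]  # [22, 21, 45]
grand = sum(row_totals)                            # 88

# Expected counts under the null hypothesis of random replacement.
expected = [[r * c / grand for c in col_totals] for r in row_totals]

chi_sq = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(3) for j in range(3)
)
df = (3 - 1) * (3 - 1)

print(round(chi_sq, 2), df)  # 79.15 4
```

The statistic matches the value reported in the text, confirming that the table and the test agree.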
All 10 labial mismatches and 8/10 coronal mismatches were made by a dorsal beatboxing
sound replacement. Each of those mismatches also happens to occur on beat 3 of the meter,
and the replacing beatboxing sound is always a K Snare {^K}. In beatboxing, beat 3
corresponds to the back beat and almost always features a snare. This conspiracy of so many
dorsal replacements being made on the back beat suggests that it would be more informative to consider replacements on and off the back beat separately.
A distinction can be made between replacements that occurred on beat 3 (n = 30) and
replacements made on any other beat or subdivision (n = 58). Figure 120 shows the counts of
matching and mismatching replacements excluding the back beat. With the inviolable back
beat K Snare out of the picture, 54 of 58 replacements have matching constrictors. This
distribution more closely matches the main hypothesis. Looking at just the replacements
made on the back beat (n = 30), however, appears to support the null hypothesis. Beatboxing
sounds on the back beat in "Dopamine" are restricted to the dorsal constrictor for the K
Snare {^K}. The replacements are fairly evenly distributed across all intended speech
constrictors, resembling the idealized prediction of no interaction between beatboxing
constrictions and intended speech constrictors (Figure 121). Taking this result together with the previous one provides evidence for a trading relationship: the speech task is achieved off the back beat, while the beatboxing task takes priority on the back beat.
One smaller finding obfuscated by the coarse constrictor types is that speech labials
and labiodentals tended to be constrictor-matched to the labial Kick Drum {B} and
labiodental PF Snare {PF}, respectively. PF Snares only ever replaced /f/s, and 4 out of 6
replaced /f/s were replaced by PF Snares. (The other two were on the back beat, and so
replaced by K Snares.) There were two /v/s off the back beat, both of which were in the same
metrical position and in the word "of", and both of which were replaced by Kick Drums.
Labio-dentals were grouped with the rest of the labials to create a simpler picture about
constrictor matching and because the number of labio-dental intended speech sounds was
fairly small. However, for future beatrhyming analysis, it may be useful to separate bilabial
and labio-dental articulations into separate groups rather than grouping them together under “labial”.
Figure 120. Bar plots with counts of the actual matching and mismatching constrictor
replacements everywhere except the back beat.
Figure 121. Bar plot with counts of the actual matching and mismatching constrictor
replacements on just the back beat.
There are four constrictor mismatches not on the back beat: two in which a labial beatboxing
sound replaces an intended speech coronal, and two in which a labial beatboxing sound replaces an intended speech dorsal.
Both labial-on-coronal cases are of a Kick Drum replacing the word "and", which we
assume (based on the style of the performance) would be pronounced in a reduced form like
[n]. Acoustically, the low frequency burst of a labial Kick Drum {B} is probably a better
match to the nasal murmur of the intended [n] (and thus the manner of the nasal) than the
higher frequency bursts of a Closed Hi-Hat {t}, K Snare {^K}, or Rimshot {K’}. All the other
nasals replaced by beatboxing sounds were on the back beat and therefore replaced by the K
Snare {^K}.
The two cases where a Kick Drum {B} replaced a dorsal sound can both be found in
the first four lines of the second verse (Figure 122). In one case, a {B} replaced the [g] in "got"
on the first beat 1 of line 3 (underlined in Figure 122). The reason may be a general
preference for having a Kick Drum on beat 1. Only 3 replacements were made on beat 1 in
"Dopamine", and all of them featured a Kick Drum {B}. (The overall scarcity of beat 1
replacements is due at least in part to the musical arrangement and style resulting in
relatively few words on beat 1.) The other case also involved a Kick Drum {B} replacing a
dorsal, this time the [k] in the word “come” on the second beat 2 of line 3 (also underlined).
The replacing {B} in this instance was part of a small recurring beatboxing pattern of {B B}
that didn't otherwise overlap with speech—it occurred on beats 1.5 and 2 of the second measure.
Figure 122. Four lines of beatrhyming featuring two replacement mismatches (underlined).
1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
----------------------------------------------------------------------------------------------------------------
{B t t B} {^K}an't you see {B B} {^K}ou are li- {K'}a
{B}mid- night s{^K}um- mer's {t}ream {B B} {^K}on- ly you
{B}o- {t}em {t}weet {^K}e- lo- {t}ies {B B}ome and {^K}lay with me
{B B} {^K}et's see what the {B}sky {t}urns {^K}in- to {B}
In short, tentative explanations are available for the few constrictor mismatches that occur
off the back beat: two mismatches could be because intended nasal murmur likely matches
the low frequency burst of a Kick Drum better than the burst of the other beatboxing sounds
available, and the other two could be due to established musical patterns specific to this
performance.
3.2 Constriction degree (manner of articulation) matching
If beatboxing sound selection accommodates the speech task, replacements should also tend to match in manner of articulation; Figure 123 shows that this is largely what happens. The sounds that made constrictor-matching
replacements—the Kick Drum {B}, PF Snare {PF}, Closed Hi-Hat {t}, and Rimshot {K’}—
collectively replaced 43 stops but replaced 0 approximants and only 2 nasals and 10 fricatives.
No affricates were replaced at all in the data set. The K Snare {^K} replaced 16 stops but also 7
nasals, 8 fricatives, and 2 approximants. For comparison, Figure 124 breaks down the
replacements by individual speech sound: stops [p b t d k g], nasals [m n] and [ŋ] (written as “ng”), fricatives [f v s z] and [ð] (written as “dh”), and others. It is not clear whether the intended speech sounds
have a uniform distribution or if stops are disproportionately high frequency across the
board. If many stops were in positions to be replaced by a beatboxing sound but were not
replaced, this finding would carry less weight. As of the time of writing, however, it was not
clear how to define which sounds in this song should be expected to be beatboxed; and as
this is the first major beatrhyming study, there was no precedent to draw from.
Figure 123. Counts of replacements by beatboxing sounds (bottom) against the manner of
articulation of the speech sound they replace (left).
Figure 124. Counts of replacements by beatboxing sounds (bottom) against the speech sound
they replace (left).
3.3 Repetition
The number of unique measures of beatboxing sound sequences in a 16-bar phrase indicates
how much overall repetition there is in that phrase. Sections B and D, the two 16-bar phrases
without lyrics (just beatboxing), had a combined total of just 3 unique measure-long
beatboxing sound sequences: the same three sound sequences were used over and over again.
Section C, the first beatrhymed verse, had 16 unique measures (no repeated measures), and
Section E, the second beatrhymed verse, had 13 unique measures (3 measures were repeated
once each). The beatrhymed sections therefore had far less repetition of measures than the
beatboxed sections. The unique sequences in each section are shown in Figure 125.
This is not to say that there was no repetition at all in the beatrhyming. Portions of
some beatboxed measures were repeated as subsets of some beatrhymed measures. The
beatboxed sequence A {B t t ^K}, for example, also appears within several beatrhymed sequences. But it turns out that even these subsequences are brief non-lyrical chunks within larger beatrhyming sections, which means that the repetition of sequences here is not tied to the speech content: sequences L and N (and part of D) are not attached to any beatrhymed lyrics, and the {^K}s are not constrictor-matching. Likewise, the {B B} of F, G, O, and U also have no lyrics, and the {^K}s do not necessarily constrictor-match with the sounds of the lyrics they replace.
3.3.2 Analysis 2: Repetition ratio
The complete set of two-bar lines for each of the four analyzed sections and their
corresponding repetition ratios are presented in Figure 126. The repetition ratios of
beatrhyming sections were much lower than the repetition ratios for beatboxing sections.
The repetition ratios for the beatboxed sections B & D are greater than the pairwise
corresponding repetition ratios for the beatrhymed sections C & E in all but one comparison
(31/32 comparisons). The mean repetition ratios calculated for verses C and E were 0.35
and 0.30, respectively, with a mean cross-section repetition ratio of 0.29. The mean repetition
ratios for the beatboxed sections B and D were 0.68 and 0.70, respectively, with a mean
cross-section repetition ratio of 0.96. The low repetition ratios for the beatrhymed sections
corroborate the observation from the unique measure count analysis that there is relatively little repetition in beatrhyming.
Figure 125. Four 16-bar beatboxing (sections B and D) and beatrhyming (sections C and E)
phrases with letter labels for each unique sound sequence. Only three measure types were
used between both beatboxing sections.
Section B - first beatboxing section Section D - second beatboxing section
A: {B t t ^K} A: {B t t ^K}
B: {th B in B} B: {th B in B}
C: {B h ^K t t} C: {B h ^K t t}
Section B Beatboxed A B A B A B A C’ A’ B A B A B A C
Section C Beatrhymed D E F G H I J K L M N O P Q R S
Section D Beatboxed A B A B A B A C’ A’ B A B A B A C’
Section E Beatrhymed T U V F A W F X Y Z Z K Y AA BB CC
Figure 126. Beat pattern display and repetition ratio calculations for sections B, C, D, and E.
Section B (first beatboxing section)
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t
            1&2    3&4    5&6    7&8    1&3    2&4    5&7    6&8    1&5    2&6    3&7    4&8
B)          8/8    4/10   6/10   4/9    8/8    4/10   6/10   4/9    6/10   8/8    8/8    7/11
(mean=0.68) 1.00   0.40   0.60   0.44   1.00   0.40   0.60   0.44   0.60   1.00   1.00   0.64
C)          3/10   5/11   6/12   3/11   5/12   3/11   3/12   3/11   5/10   5/12   3/13   3/10
(mean=0.35) 0.30   0.45   0.50   0.27   0.42   0.27   0.25   0.27   0.50   0.42   0.23   0.30
Section D (second beatboxing section)
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t
            1&2    3&4    5&6    7&8    1&3    2&4    5&7    6&8    1&5    2&6    3&7    4&8
D)          8/8    4/10   6/10   4/10   8/8    4/10   6/10   4/10   6/10   8/8    8/8    9/9
(mean=0.70) 1.00   0.40   0.60   0.40   1.00   0.40   0.60   0.40   0.60   1.00   1.00   1.00
E)          5/10   2/11   3/11   2/7    5/12   2/10   3/10   2/9    4/12   3/9    3/10   2/9
(mean=0.30) 0.50   0.18   0.27   0.29   0.42   0.20   0.30   0.22   0.33   0.33   0.30   0.22
Cross-section pairs
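The section means reported in section 3.3.2 can be checked directly from the pairwise ratios in Figure 126. A minimal sketch in Python; the fraction strings are copied from the figure, and the grouping into a dictionary is ours:

```python
from fractions import Fraction

# Pairwise repetition ratios copied from Figure 126
# (adjacent, alternating, and cross-group pairs, left to right).
ratios = {
    "B": ["8/8", "4/10", "6/10", "4/9", "8/8", "4/10",
          "6/10", "4/9", "6/10", "8/8", "8/8", "7/11"],
    "C": ["3/10", "5/11", "6/12", "3/11", "5/12", "3/11",
          "3/12", "3/11", "5/10", "5/12", "3/13", "3/10"],
    "D": ["8/8", "4/10", "6/10", "4/10", "8/8", "4/10",
          "6/10", "4/10", "6/10", "8/8", "8/8", "9/9"],
    "E": ["5/10", "2/11", "3/11", "2/7", "5/12", "2/10",
          "3/10", "2/9", "4/12", "3/9", "3/10", "2/9"],
}

# Average the twelve pairwise ratios for each section.
means = {
    section: sum(float(Fraction(r)) for r in rs) / len(rs)
    for section, rs in ratios.items()
}
for section, m in means.items():
    print(section, f"{m:.2f}")
# Matches the reported means: B 0.68, C 0.35, D 0.70, E 0.30 —
# the beatboxed sections (B, D) repeat far more than the beatrhymed ones (C, E).
```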
4. Discussion
The analysis above investigated whether beatboxing and speech do (the main hypothesis) or
do not (the null hypothesis) interact during beatrhyming in a way that supports both speech
and beatboxing tasks being achieved. The results provide evidence for the main hypothesis: off the back beat, the performer selects replacement beatboxing sounds that match the speech segment in vocal tract constrictor and
manner/constriction degree. This presumably serves to help the more global task of
communicating the speech message. But achieving the speech task comes at the cost of
inconsistent beat patterns during lyrical beatrhyming. Theoretically, both the speech task
and the beatboxing repetition task could have been achieved by careful selection of lexical
items whose speech-compatible beatboxing replacement sound would also satisfy repetition,
but this did not happen. Thus, beatboxing sounds are generally selected in such a way as to
optimize speech task achievement, but lexical items are not being selected so as to optimize
beatboxing repetition. That said, the task demands of other aspects of beatboxing do affect
beatboxing sound selection—witness the inviolable use of K Snares {^K} on beat 3 of each
measure to establish the fundamental musical rhythm, even at the expense of the dorsal
constriction of the K Snare not matching the constriction of the intended speech sound it
replaces. Thus the tasks do interact such that one or the other task has priority depending on metrical position.
4.1 Task interaction
Beatrhyming is the union of a beatboxing system and a speech system. Each system is
goal-oriented, defined by aesthetic tasks related to the musical genre, communicative tasks,
motor efficiency, and other tasks. These tasks act as forces that shape the organization of each system.
Ultimately, a core interest in the study of speech sounds is to understand how forces
like these influence speech. When answering questions of why the sounds in a language pattern the way they do, we invoke tasks like communication and motor efficiency almost axiomatically. But until we understand how these tasks manifest under a
wider variety of linguistic behaviors, we will not have a full sense of the speech system’s
flexibility or limitations. To that end, the contribution of this chapter is to show how the goals of speech can be satisfied by matching between an intended speech sound and the beatboxing sound replacing it, and dissatisfied when aesthetic beatboxing tasks take priority.
To close, section 4.2 demonstrates one way this musical linguistic behavior can be modeled: with a grammar specialized for beatrhyming.
The results show that when speech and beatboxing are interwoven in beatrhyming, the selection of beatboxing sounds generally accommodates the intended speech task and overrides the constraints of the beatboxing task, except in one
environment (beat 3) in which the opposite is true. Given that the selection of lexical items
does not appear to be sensitive to location in the beatboxing structure, the achievement of
both tasks simultaneously is not possible. The resulting optimization can therefore be
modeled by ranking the speech and beatboxing tasks differently in different environments,
which is exactly what Optimality Theory (Prince & Smolensky, 1993/2004) has been
designed to do.
The representations and constraints used in Optimality Theory are designed specifically to operate in the domain of speech, so some adaptation is needed to make a hybrid behavior like beatrhyming appropriate for a typical phonological model. This approach assumes that a grammar specialized for beatrhyming exists separately from grammars specialized for speech or beatboxing but draws on the representations from both systems—that is, speech and beatboxing representations are both available to a single beatrhyming phonology, but the constraints and their rankings are different from any other domain. Based on the results above, in beatrhyming the grammar takes speech representations as inputs and returns surface forms composed of both beatboxing and speech representations as output candidates. For the
purposes of this simple illustration, the computations are restricted to the selection of a
single beatboxing sound that replaces a single speech segment. (Presumably there are higher-ranking constraints that determine which input speech segment representations are eligible for replacement in the first place.)
Because the analysis requires reference to the metrical position of a sound, input
representations are tagged with the associated beat number as a subscript. The input / b3 /,
for example, symbolizes a speech representation for a voiced bilabial stop on the third beat
of a measure. Output candidates are marked with the same beat number as the
corresponding input; the input-output pairs / b3 / ~ { B3 } and / b3 / ~ { ^K3 } are both possible
in the system because they share the same subscript, but the input-output pair / b3 / ~ { B2 } is
never generated as an option because the input and output have different subscripts. We can then evaluate candidates with constraints referring to place and to metrical position. (The “Place” feature corresponds to the abstract conception of the constrictor: labial, coronal, and dorsal.) The tableaux in Figures 127 and 128 demonstrate how possible input-output pairs
like the ones just introduced might be selected differently by the grammar depending on their metrical position. The constraint *BackbeatWithoutSnare is ranked above *PlaceMismatch to ensure that beat 3 always has a K Snare. Given an input voiced bilabial stop on beat 3 / b3 / in Figure 127, the output candidate {B3} is constrictor-matched to the input but fatally violates *BackbeatWithoutSnare; the alternative output {^K3} violates *PlaceMismatch but is the more optimal candidate based on this constraint ranking. On the other hand, for an input / b1 /, which represents a voiced bilabial stop on beat 1, the constrictor-matched candidate {B1} violates no constraints and is selected (Figure 128).
Figure 127. Tableau in which a speech labial stop is replaced by a K Snare on the back beat.
/ b3 /           *BackbeatWithoutSnare   *PlaceMismatch
a.    {B3}                *!
b. ☞ {^K3}                                      *
Figure 128. Tableau in which a speech labial stop is replaced by a Kick Drum off the back
beat.
/ b1 /           *BackbeatWithoutSnare   *PlaceMismatch
a. ☞ {B1}
b.    {^K1}                                     *!
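The ranking illustrated in Figures 127 and 128 can be sketched as a tiny constraint evaluator. A minimal illustration in Python; the constrictor assignments ({B} labial, {t} coronal, {^K} dorsal) follow the chapter, but the three-candidate simplification and the function names are ours:

```python
# Constrictors for a simplified candidate set of beatboxing sounds
# (one representative sound per constrictor).
CONSTRICTOR = {"B": "labial", "t": "coronal", "^K": "dorsal"}
SPEECH_CONSTRICTOR = {"b": "labial", "n": "coronal", "g": "dorsal"}

def violations(speech_seg, beat, candidate):
    """Violation profile for one candidate under the ranked constraints.
    Ranking *BackbeatWithoutSnare >> *PlaceMismatch is encoded as a tuple
    compared left to right (higher-ranked constraint first)."""
    backbeat_without_snare = 1 if (beat == 3 and candidate != "^K") else 0
    place_mismatch = 1 if CONSTRICTOR[candidate] != SPEECH_CONSTRICTOR[speech_seg] else 0
    return (backbeat_without_snare, place_mismatch)

def optimal(speech_seg, beat):
    """Return the winning beatboxing replacement for a speech segment."""
    return min(CONSTRICTOR, key=lambda cand: violations(speech_seg, beat, cand))

print(optimal("b", 3))  # ^K : the back beat demands a K Snare
print(optimal("b", 1))  # B  : off the back beat, the constrictor matches
```

These two constraints alone reproduce the winner in both tableaux, mirroring the 95% coverage reported for the full data set.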
This phonological formalism is simple, but effective: just these two constraints produce the
desired outcome for 95% (84/88) of the replacements in this data set. The remaining 5%
described in section 3.1.3 may be accounted for either by additional constraints designed to
fit more specific conditions, by a related but more complicated model MaxEnt (Hayes &
Wilson, 2008), or by gradient symbolic representations (Smolensky et al., 2014) that permit
more flexibility in the input-output place relationships. It is with this optimism in mind that we nevertheless suggest below two reasons not to use symbolic representations in models of beatrhyming: the arbitrariness of mapping features across systems, and the static nature of the symbolic atomic unit.
In most symbolic models of phonology, the role of a place feature like [labial] (or, if not privative, [±labial]) is to encode linguistic information, and that information is defined by the feature’s contrastive relationship to other features
within the same linguistic system. Different phonological theories propose stronger or
weaker associations between a mental representation like [labial] and the physical lips that give the feature its phonetic meaning. But say that a language-like beatboxing {labial} feature did exist, defined according
to some iconic relationship with other beatboxing features and, like a linguistic [labial]
feature, associated to some degree with physical constriction of the lips. This {labial} and the linguistic [labial] would still play different information-bearing roles within their respective systems. Mapping abstract features across systems by label alone would be arbitrary: nothing in principle prevents mapping [labial] to {dorsal} or {ingressive}. The only reason to map [labial] with {labial} is because
they share an association to the physical lips. But in that case, the crux of the mapping—the
only property shared by both units—is a phonetic referent; the abstract symbolic units
themselves are moot. Given that the model is intended to be a phonological one, it seems
undesirable for the phonological units to have less importance than their phonetic output.
The second issue with symbols is that they are notoriously static, existing invariantly
outside of real time. When timing must be encoded in symbolic approaches, the
representations are laid out either in sequence or in other timing slots like autosegmental
tiers (Goldsmith, 1976). And segments are temporally indivisible—they cannot start at one
time, pause for a bit, then pick up again where they left off. This is not a problem for
phonological models of talking or many other varieties of speech, but Figure 129 illustrates a
beatrhyming example of precisely this kind of split-segment behavior. In this case, the word
“move” [muv] is pronounced [mu]{B}[uv], with a Kick Drum temporarily interrupting the
[u] vowel. The same phenomenon is shown in Figure 130 with the word “sky” pronounced as
[skak͡ʟ̝̊↓a] (the canonical closure to [i] is not apparent in the spectrogram). Figure 112 from
the beginning of this chapter shows a related example of the [i] in “dopamine” prematurely
cut off in the pronunciation of the word as {t}[o]{B}[əmi]{^K}[n]. These cases of beatboxing
sounds that interrupt speech sounds are impossible to represent in a symbolic phonological
model because in many cases they would require splitting an indivisible representation into parts. Even theories with richer subsegmental structure struggle with beatrhyming interruptions. Q-Theory (Shih & Inkelas 2014, 2018) may come the closest: it innovates on traditional segments by splitting them into three quantal subsegments, which can represent the movement, target achievement, and constriction release for a given sound, and which are especially
useful for representing a sound that has complex internal structure like a triphthong or a
three-part tone contour. It would be possible to represent the /u/ in “move” /muv/ as having
three sub-segmental divisions [u] [u] [u]. But based on our understanding of Q-Theory, it is
not possible to replace the middle sub-segment [u] with an entire and entirely different
segment {B}. Given enough time, it is inevitable that someone could imagine some phonetic or representational workaround for such cases.
Articulatory Phonology is the hypothesis that the fundamental units of language are
action units, called “gestures” (Browman & Goldstein, 1986, 1989). Unlike symbolic features
which are time-invariant and only reference the physical vocal tract abstractly (if at all),
gestures as phonological units are spatio-temporal entities with deterministic and directly
observable consequences in the vocal tract. Phonological phenomena that are stipulated in symbolic theories can instead emerge from the behavior of gestures modeled as dynamical systems in the framework of task dynamics (Browman & Goldstein, 1989; Saltzman &
Munhall, 1989). While a gesture is active, it exerts control over a vocal tract variable (e.g., lip
aperture) to accomplish some linguistic task (e.g., a complete labial closure for a bilabial stop). Gestures are defined by the vocal tract variable—and ultimately, the constrictor—they control. Gestures
are motor plans that leverage and tune the movement potential of the vocal articulators for
speech-specific purposes, but speech gestures are not the only action units that can control
the vocal tract. The vocal tract variables used for speech purposes are publicly available to
any other system of motor control, including beatboxing. This allows for a non-arbitrary
relationship between the fundamental phonological units of speech and beatboxing: a speech
unit and a beatboxing unit that both control lip aperture are inherently linked in a
beatboxing grammar because they control the same vocal tract variable.
Figure 129. Waveform, spectrogram, and text grid of the beatrhymed word “move” with a
Kick Drum splitting the vowel into two parts.
Figure 130. Waveform, spectrogram, and text grid of the beatrhymed word “sky” with a K
Snare splitting the vowel into two parts.
The cases in which a beatboxing sound temporarily interrupts a vowel can be modeled in
task dynamics with a parameter called gestural blending strength. When two gestures that
use the same constrictor overlap temporally, the movement plan during that time period
becomes the average of the two gestures’ spatial targets (and their time constants or
stiffness) weighted by their relative blending strengths. A stronger gesture exerts more
influence, and a gesture with very high relative blending strength will effectively override any
co-active gestures. For beatrhyming, the interrupting beatboxing sounds could be modeled as
having sufficiently high blending strength that the vowels they co-occur with are overridden
by the beatboxing sound; when the gestures for a beatboxing sound end, control of the vocal
tract returns solely to the vowel gesture. The Gestural Harmony Model (Smith, 2018) uses a similar blending-strength mechanism to model gestures that override or resist co-active gestures.
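The blending computation described above can be illustrated numerically. A minimal sketch in Python; the weighted-average formula follows the task-dynamic description, but the target and strength values are invented for illustration:

```python
def blended_target(gestures):
    """Weighted average of co-active gestures' spatial targets,
    weighted by relative blending strength (task-dynamic blending)."""
    total_strength = sum(strength for _, strength in gestures)
    return sum(target * strength for target, strength in gestures) / total_strength

# Hypothetical lip-aperture targets in arbitrary units:
vowel_u = (10.0, 1.0)     # open lips for [u], baseline blending strength
kick_drum = (0.0, 100.0)  # complete labial closure for {B}, very high strength

# While only the vowel gesture is active, the lips stay open...
print(blended_target([vowel_u]))  # 10.0
# ...but while the Kick Drum gesture is co-active, its high blending
# strength effectively overrides the vowel target:
print(round(blended_target([vowel_u, kick_drum]), 2))  # 0.1
```

When the Kick Drum gesture ends, only the vowel gesture remains active and the lip-aperture target returns to the vowel's value, matching the [mu]{B}[uv] pattern described above.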
5. Conclusion
Vocal music is a powerful lens through which to study speech, offering insights about speech
that may not be accessible from studies of talking. Beatrhyming in particular demonstrates
how the fundamental units of speech can interact with the units of a completely different
behavior—beatboxing—in a complex but organized way. When combined with speech, the
aesthetic goals of musical performance lead to sound patterns that push the limits of
phonological theory and may even cause widely accepted paradigms to break down. This is
the advantage to be gained by building and testing theories based on insights from a more
CHAPTER 8: CONCLUSION
This dissertation applied linguistic methods to an analysis of beatboxing and discovered that
beatboxing has a unit-level phonology rooted in the same types of fundamental mental representations and organization as linguistic phonology.
Chapter 3: Sounds argued that beatboxing sounds have meaning and word-like frequency.
Each sound is composed combinatorially from a reusable set of constrictions; because the constrictions are contrastive, exchanging one for another changes the meaning of a sound. This contrastiveness resembles the contrastive organization
of speech sounds within a language. But just like in speech, not every articulatory change is a
contrastive one. Chapter 5: Alternations shows that the Kick Drum and PF Snare, and
perhaps also Closed Hi-Hat, have different phonetic manifestations depending on their
context: they are glottalic egressive in most contexts, but percussive when performed in
proximity to certain other sounds (percussives are made with a tongue body closure and no glottalic airflow initiation).
Chapter 6: Harmony shows that these alternations are—like so often in speech—the result of
multiple constrictions overlapping temporally. Here the contrastive airstreams from Chapter 3 spread across neighboring sounds, a pattern analogous to phonological harmony.
Together, these findings indicate that beatboxing has a phonology rooted in the same types of fundamental mental representations and organization as linguistic phonology. These representations are united
with music cognition through rhythmic patterns, metrical organization, and sound classes
with patterning based on musical function (i.e., regularly placing snare-category sounds on the back beat). The finding that beatboxing exhibits signs of phonological cognition indicates that the
foundations of both beatboxing and speech (see below) collaborate with aspects of music
cognition, which indicates that the building blocks of different domains superimpose onto
each other in task-specific ways to create each vocal behavior. This can be accounted for in
both modular and integrated approaches to cognition. A story consistent with a modular
approach to cognition is that beatboxing takes mental representations and grammar from
speech, combines them with musical meaning and metrical organization, and thereby adapts the phonological system to express musical contrasts and to use natural classes as the currency of productive synchronic processes. A
different story consistent with a more integrated approach to cognition is that beatboxing
and phonology both, somewhat independently, are shaped by the interaction of the
capabilities of the vocal tract they share, the recruitment of some domain-general cognitive capacities, and the demands of communicative and aesthetic tasks. Regardless of the interpretation, the inescapable result is that linguistic
phonology is not particularly unique: beatboxing and speech share the same vocal tract and the same types of fundamental phonological representations.
One could choose to model beatboxing with adaptations of either features or gestures as its
fundamental units, and that choice of unit can serve a story of modular cognition or of
integrated cognition. But as Chapter 4: Theory discusses, gestures have the distinction of
explicitly connecting the tasks specific to speech or to beatboxing with the sound-making
potential of the vocal substrate they share, which in turn creates a direct link between speech
gestures and beatboxing gestures. This link is formalized at the graph level of the dynamical
systems by which gestures are defined. The analysis of the graph level theoretical embedding
in this dissertation was focused on individual beatboxing units, their temporal coordination,
and their paradigmatic organization. Future work could formalize the link between speech
and musical prosodic, hierarchical, metrical structure as a different part of the graph level, in
order to better capture the ability of the phonological unit system to integrate into different behavioral contexts.
The direct formal link between beatboxing and speech units makes predictions about
what types of phonological phenomena beatboxing and speech units are able to
exhibit—including the phonological properties described above. These predictions are borne
out insofar as beatboxing and speech phonological phenomena are both able to be accounted
for by the same theoretical mechanisms (e.g., intergestural timing and inhibition). Moreover,
it predicts that the phonological units of the two domains will be able to co-occur as they do
in Chapter 7: Beatrhyming, where phenomena that are challenging or impossible to represent symbolically receive a natural gestural account.
These advantages of the gestural approach for describing speech, beatboxing, and their combination suggest that, modular or not, the phonological system is certainly not encapsulated away from other cognitive systems: its units are directly linked to units in similar systems. This appears to fly in the face of conventional wisdom about
phonological units: at least as early as Sapir (1925), phonological units have been defined
exclusively by their psychological linguistic role—by their relationships with each other and
their synchronic patterning, but often without reference to any phonetic or social aspects of their manifestation, and certainly without ties to non-linguistic domains. But the gestural approach allows phonological units to have domain-specific meaning within their own system while remaining grounded in the shared physical vocal substrate.
The attributes that phonology shares with other domains allow it to manifest
flexibly—to be recruited into a multitude of speech behaviors while robustly fulfilling its
primary directives (e.g., communicating a linguistic message). This is different from, say, the
sensory processing involved in auditory spatial localization which is arguably a module in the
strongest sense—automatic, innate, and not (so far as we know) able to be tapped into for
different purposes by conscious cognitive thought (Liberman & Mattingly, 1989). Instead, the conversational speech typically studied in phonological research is continuous with many other speech behaviors and at different levels of
phonological structure. Prosodically, conversational speech is continuous with poetry,
rapping, chanting, and singing: just a few small adjustments to rhythm or intonation
transform conversational speech into any of an abundance of genres of vocal linguistic art. A
non-musical speech utterance can even become perceived as musical when it is repeated a
few times (the speech to song illusion; Deutsch et al., 2011). Speech modality is not limited to
the typically-studied arrangement of vocal articulators: surrogate speech like talking drums
(Beier, 1954; Akinbo, 2019), xylophones (McPherson, 2018), and whistle speech (Rialland,
2005) shift phonological expression to new sound systems which are often integrated with
musical structure. And phonological units and grammar are used not only in speech but also in vocal music such as scat singing (Shaw, 2008). And as beatrhyming shows, the conformation of even the most elemental phonological units can be reshaped by a concurrent musical task.
These different speech behaviors are collaborations between speech tasks and other
non-linguistic (e.g., musical) tasks, well-organized to maximize the satisfaction of all tasks
involved (or at least to minimize dissatisfaction). For vocal behaviors, these interactions are
constrained by the vocal substrate in which all of the tasks are active. In singing,
conversational speech prosody cannot manifest at the same time as sung musical melody
because they both require use of the larynx. Sustaining a note during a song therefore
requires selecting between a musical and speech-prosodic pitch and rhythm; but the
contrastive information and structure of the speech sound units are unperturbed—syllable
structure, sound selection, and relative sound order largely remain intact because they do not
compete with melody or rhythm. In some cases there is also text-to-tune alignment, where musical pitch and rhythm reflect the likely prosody of the utterance if it had been spoken non-musically (Hayes & Kaun, 1996). Similar text-to-tune alignment is active in languages
with lexical tone, with tone contours exerting greater influence on the musical melody to
avoid producing unintended tones (Schellenberg, 2013; McPherson & Ryan, 2018). And in
beatrhyming, the speech and beatboxing tasks share the vocal tract through a relationship that leverages their shared vocal apparatus to maximize their compatibility whenever possible. If there is anything special about speech, it is the speech tasks themselves and how they leverage all of
human vocal potential to flexibly produce these different behaviors. This is consistent with the anthropophonic perspective of Lindblom (1990), an ideology for non-circularly defining and explaining which sounds become speech sounds: a speech sound inventory emerges as the result of an interaction between the tasks of speech—in Lindblom (1990), "selection constraints"—and the total sound-making potential of the vocal tract. With respect to the present work, the question can be phrased as "How do the tasks of speech filter the whole vocal sound-making potential into a smaller, possibly finite set of speech sounds?" (Figure 131). As discussed in Chapter 4, beatboxing invites the same anthropophonic perspective.

Figure 131. The anthropophonic perspective.
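The filtering question illustrated in Figure 131 can be made concrete with a toy sketch: selection constraints act as predicates that filter the total sound-making potential down to a candidate inventory. (Every sound and constraint named below is a hypothetical placeholder, purely for illustration.)

```python
# Toy sketch of the anthropophonic view: task-driven "selection
# constraints" filter the vocal tract's total sound-making potential.
# All sounds and constraints here are hypothetical placeholders.

vocal_potential = {"plain stop", "vowel", "ejective",
                   "ingressive trill", "inward click-snare"}

def perceptually_distinct(sound):      # hypothetical constraint
    return sound != "ingressive trill"

def sequencable_in_syllables(sound):   # hypothetical constraint
    return "snare" not in sound

selection_constraints = [perceptually_distinct, sequencable_in_syllables]

# The speech inventory is whatever survives all the task constraints.
speech_inventory = {s for s in vocal_potential
                    if all(c(s) for c in selection_constraints)}
```

Swapping in a different constraint set models how a different task ensemble (beatboxing, beatrhyming) could select a different inventory from the same vocal potential.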
In light of the phonological system's evident flexibility, however, it must be emphasized that the selection constraints are not only the tasks of speech. There are many musical and
other non-linguistic tasks which shape behavior too—not to mention the social and affective
forces that incessantly impact speech production and phonological variation. A robust
account of phonology needs to be able to explain how the phonological system interacts
with these other forces via both their shared structures and their shared vocal substrate.
REFERENCES
Abbs, J. H., Gracco, V. L., & Cole, K. J. (1984). Control of Multimovement Coordination.
Journal of Motor Behavior, 16(2), 195–232.
[Link]
Anderson, S. R. (1981). Why Phonology Isn’t “Natural.” Linguistic Inquiry, 12(4), 493–539.
Archangeli, D., & Pulleyblank, D. (2015). Phonology without universal grammar. Frontiers in
Psychology, 6. [Link]
Archangeli, D., & Pulleyblank, D. (2022). Emergent phonology (Volume 7). Language Science
Press. [Link]
Ball, M. J., Esling, J. H., & Dickson, B. C. (2018). Revisions to the VoQS system for the
transcription of voice quality. Journal of the International Phonetic Association, 48(2),
165–171. [Link]
Ball, M. J., Esling, J., & Dickson, C. (1995). The VoQS System for the Transcription of Voice
Quality. Journal of the International Phonetic Association, 25(2), 71–80.
[Link]
Ball, M. J., Howard, S. J., & Miller, K. (2018). Revisions to the extIPA chart. Journal of the
International Phonetic Association, 48(2), 155–164.
[Link]
Ballard, K. J., Robin, D. A., & Folkins, J. W. (2003). An integrative model of speech motor
control: A response to Ziegler. Aphasiology, 17(1), 37–48.
[Link]
Beale, J. M., & Keil, F. C. (1995). Categorical effects in the perception of faces. Cognition,
57(3), 217–239. [Link]
Beier, U. (1954). The talking drums of the Yoruba. African Music: Journal of the International
Library of African Music, 1(1), 29–31.
Bidelman, G. M., Gandour, J. T., & Krishnan, A. (2011). Cross-domain Effects of Music and
Language Experience on the Representation of Pitch in the Human Auditory Brainstem.
Journal of Cognitive Neuroscience, 23(2), 425–434. [Link]
Blaylock, R., & Phoolsombat, R. (2019). Beatrhyming probes the nature of the interface
between phonology and beatboxing. The Journal of the Acoustical Society of America,
146(4), 3081–3081. [Link]
Blaylock, R., Patil, N., Greer, T., & Narayanan, S. S. (2017). Sounds of the Human Vocal Tract.
INTERSPEECH, 2287–2291. [Link]
Boersma, P., & Weenink, D. (1992-2022). Praat: Doing phonetics by computer (6.1.13)
[Computer software]. [Link]
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9/10), 341–345.
Bořil, T., & Skarnitzl, R. (2016). Tools rPraat and mPraat. In P. Sojka, A. Horák, I. Kopeček, &
K. Pala (Eds.), Text, Speech, and Dialogue (Vol. 9924, pp. 367–374). Springer International
Publishing. [Link]
Bresch, E., Nielsen, J., Nayak, K., & Narayanan, S. (2006). Synchronized and noise-robust
audio recordings during realtime magnetic resonance imaging scans. The Journal of the
Acoustical Society of America, 120(4), 1791–1794. [Link]
Browman, C. P., & Goldstein, L. (1988). Some notes on syllable structure in articulatory phonology. Phonetica, 45(2–4), 140–155.
Browman, C. P., & Goldstein, L. (1995). Gestural Syllable Position Effects in American
English. In Bell-Berti, F. & Raphael, L. J. (Eds.), Producing Speech: Contemporary Issues.
For Katherine Safford Harris. AIP Press: New York.
Byrd, D., & Saltzman, E. (1998). Intragestural dynamics of multiple prosodic boundaries.
Journal of Phonetics, 26(2), 173–199. [Link]
Byrd, D., & Saltzman, E. (2003). The elastic phrase: Modeling the dynamics of
boundary-adjacent lengthening. Journal of Phonetics, 31(2), 149–180.
[Link]
Coltheart, M. (1999). Modularity and cognition. Trends in Cognitive Sciences, 3(3), 115–120.
[Link]
Cummins, F., & Port, R. (1998). Rhythmic constraints on stress timing in English. Journal of
Phonetics, 26(2), 145–171. [Link]
Danner, S. G., Krivokapić, J., & Byrd, D. (2019). Co-speech movement behavior in
conversational turn-taking. The Journal of the Acoustical Society of America, 146(4),
3082–3082.
Dehais-Underdown, A., Vignes, P., Buchman, L. C., & Demolin, D. (2020). Human
Beatboxing: A preliminary study on temporal reduction. Proceedings of the 12th
International Seminar on Speech Production (ISSP), 142–145.
Dehais-Underdown, A., Vignes, P., Crevier-Buchman, L., & Demolin, D. (2021). In and out:
Production mechanisms in Human Beatboxing. 060005. [Link]
Deutsch, D., Henthorn, T., & Lapidis, R. (2011). Illusory transformation from speech to song.
The Journal of the Acoustical Society of America, 129(4), 2245–2252.
[Link]
Diehl, R. L. (1991). The Role of Phonetics within the Study of Language. Phonetica, 48(2–4),
120–134. [Link]
Diehl, R. L., & Kluender, K. R. (1989). On the Objects of Speech Perception. Ecological
Psychology, 1(2), 121–144. [Link]
Duckworth, M., Allen, G., Hardcastle, W., & Ball, M. (1990). Extensions to the International
Phonetic Alphabet for the transcription of atypical speech. Clinical Linguistics &
Phonetics, 4(4), 273–280. [Link]
Dunbar, E., & Dupoux, E. (2016). Geometric Constraints on Human Speech Sound
Inventories. Frontiers in Psychology, 7.
[Link]
Evain, S., Contesse, A., Pinchaud, A., Schwab, D., Lecouteux, B., & Henrich Bernardoni, N.
(2019). Beatbox Sounds Recognition Using a Speech-dedicated HMM-GMM Based
System.
Farmer, J. D. (1990). A Rosetta stone for connectionism. Physica D: Nonlinear Phenomena,
42(1), 153–187. [Link]
Feld, S., & Fox, A. A. (1994). Music and Language. Annual Review of Anthropology, 23,
25–53.
Flash, T., & Sejnowski, T. J. (2001). Computational approaches to motor control. Current
Opinion in Neurobiology, 11, 655–662.
Fukuda, M., Kimura, K., Blaylock, R., & Lee, S. (2022). Scope of beatrhyming: Segments or words. Proceedings of the AJL 6 (Asian Junior Linguists), 59–63.
[Link]
Gafos, A. I. (1996). The articulatory basis of locality in phonology [Ph.D., The Johns Hopkins
University].
[Link]
Gafos, A. I., & Benus, S. (2006). Dynamics of Phonological Cognition. Cognitive Science,
30(5), 905–943. [Link]
Gafos, A., & Goldstein, L. (2011). Articulatory representation and organization. In A. C. Cohn,
C. Fougeron, & M. K. Huffman (Eds.), The Oxford Handbook of Laboratory Phonology
(1st ed.). Oxford University Press.
[Link]
Goldstein, L., Byrd, D., & Saltzman, E. (2006). The role of vocal tract gestural action units in
understanding the evolution of phonology. In M. A. Arbib (Ed.), Action to Language via
the Mirror Neuron System (pp. 215–249). Cambridge University Press.
[Link]
Goldstein, L., Nam, H., Saltzman, E., & Chitoran, I. (2009). Coupled Oscillator Planning
Model of Speech Timing and Syllable Structure. In C. G. M. Fant, H. Fujisaki, & J. Shen
(Eds.), Frontiers in phonetics and speech science (p. 239-249). The Commercial Press.
[Link]
Greenwald, J. (2002). Hip-Hop Drumming: The Rhyme May Define, but the Groove Makes
You Move. Black Music Research Journal, 22(2), 259–271. [Link]
Guinn, D., & Nazarov, A. (2018, January). Evidence for features and phonotactics in
beatboxing vocal percussion. 15th Old World Conference on Phonology, University
College London, United Kingdom.
Hale, K., & Nash, D. (1997). Damin and Lardil phonotactics. Boundary Rider: Essays in Honor of Geoffrey O'Grady, 247–259. [Link]
Hale, M., & Reiss, C. (2000). Phonology as Cognition. Phonological Knowledge: Conceptual
and Empirical Issues, 161–184.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The Faculty of Language: What Is It, Who Has It, and How Did It Evolve? Science, 298, 1569–1579.
Hayes, B. (1984). The Phonology of Rhythm in English. Linguistic Inquiry, 15(1), 33–74.
Hayes, B., & Kaun, A. (1996). The role of phonological phrasing in sung and chanted verse.
The Linguistic Review, 13(3–4). [Link]
Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3), 379–440.
Hayes, B., Kirchner, R., & Steriade, D. (Eds.). (2004). Phonetically Based Phonology.
Cambridge University Press.
Himonides, E., Moors, T., Maraschin, D., & Radio, M. (2018). Is there potential for using
beatboxing in supporting laryngectomees? Findings from a public engagement project.
Hoyt, D. F., & Taylor, C. R. (1981). Gait and the energetics of locomotion in horses. Nature,
292(5820), 239–240. [Link]
Icht, M. (2018). Introducing the Beatalk technique: Using beatbox sounds and rhythms to
improve speech characteristics of adults with intellectual disability: Using beatbox sounds
and rhythms to improve speech. International Journal of Language & Communication
Disorders, 54. [Link]
Icht, M. (2021). Improving speech characteristics of young adults with congenital dysarthria:
An exploratory study comparing articulation training and the Beatalk method. Journal of
Communication Disorders, 93, 106147. [Link]
Icht, M., & Carl, M. (2022). Points of view: Positive effects of the Beatalk technique on speech
characteristics of young adults with intellectual disability. International Journal of
Developmental Disabilities, 1–5. [Link]
Jakobson, R., Fant, C. G., & Halle, M. (1951). Preliminaries to speech analysis: The distinctive
features and their correlates.
Kelso, J. A. S., & Tuller, B. (1984). A Dynamical Basis for Action Systems. In M. S. Gazzaniga
(Ed.), Handbook of Cognitive Neuroscience (pp. 321–356). Springer US.
[Link]
Kelso, J. A. S., Holt, K. G., Rubin, P., & Kugler, P. N. (1981). Patterns of Human Interlimb
Coordination Emerge from the Properties of Non-Linear, Limit Cycle Oscillatory
Processes. Journal of Motor Behavior, 13(4), 226–261.
[Link]
Kelso, J. A., & Tuller, B. (1984). Converging evidence in support of common dynamical
principles for speech and movement coordination. American Journal of
Physiology-Regulatory, Integrative and Comparative Physiology, 246(6), R928–R935.
[Link]
Kelso, J. S., Tuller, B., Vatikiotis-Bateson, E., & Fowler, C. A. (1984). Functionally specific
articulatory cooperation following jaw perturbations during speech: Evidence for
coordinative structures. Journal of Experimental Psychology: Human Perception and
Performance, 10(6), 812–832. [Link]
Krivokapić, J. (2014). Gestural coordination at prosodic boundaries and its role for prosodic
structure and speech planning processes. Philosophical Transactions of the Royal Society
B: Biological Sciences, 369(1658), 20130397. [Link]
Kröger, B. J., Schröder, G., & Opgen‐Rhein, C. (1995). A gesture‐based dynamic model
describing articulatory movement data. The Journal of the Acoustical Society of America,
98(4), 1878–1889. [Link]
Kugler, P. N., Kelso, J. A. S., & Turvey, M. T. (1980). On the Concept of Coordinative
Structures as Dissipative Structures: I. Theoretical Lines of Convergence. In G. E.
Stelmach & J. Requin (Eds.), Advances in Psychology (Vol. 1, pp. 3–47). North-Holland.
[Link]
Kuhl, P. K., & Miller, J. D. (1978). Speech perception by the chinchilla: Identification
functions for synthetic VOT stimuli. The Journal of the Acoustical Society of America,
63(3), 905–917. [Link]
Ladefoged, P. (1989). Representing Phonetic Structure (No. 73; Working Papers in Phonetics).
Phonetics Laboratory, Department of Linguistics, UCLA.
Lammert, A. C., Melot, J., Sturim, D. E., Hannon, D. J., DeLaura, R., Williamson, J. R.,
Ciccarelli, G., & Quatieri, T. F. (2020). Analysis of Phonetic Balance in Standard English
Passages. Journal of Speech, Language, and Hearing Research, 63(4), 917–930.
[Link]
Lammert, A. C., Proctor, M. I., & Narayanan, S. S. (2010). Data-Driven Analysis of Realtime
Vocal Tract MRI using Correlated Image Regions. Interspeech 2010, 1572–1575.
Lammert, A. C., Ramanarayanan, V., Proctor, M. I., & Narayanan, S. S. (2013). Vocal tract
cross-distance estimation from real-time MRI using region-of-interest analysis.
Interspeech 2013, 959–962.
Large, E. W., & Kolen, J. F. (1994). Resonance and the perception of musical meter.
Connection Science, 6(1), 177–208.
Lartillot, O., Toiviainen, P., & Eerola, T. (2008). A Matlab Toolbox for Music Information
Retrieval. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Data
Analysis, Machine Learning and Applications (pp. 261–268). Springer.
[Link]
Lartillot, O., Toiviainen, P., Saari, P., & Eerola, T. (n.d.). MIRtoolbox (1.7.2) [Computer
software]. [Link]
Lerdahl, F., & Jackendoff, R. (1983/1996). A Generative Theory of Tonal Music. MIT press.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised.
Cognition, 21(1), 1–36. [Link]
Liberman, A. M., & Mattingly, I. G. (1989). A Specialization for Speech Perception. Science,
243(4890), 489–494.
Liberman, A. M., Isenberg, D., & Rakerd, B. (1981). Duplex perception of cues for stop
consonants: Evidence for a phonetic mode. Perception & Psychophysics, 30(2), 133–143.
[Link]
Liberman, M., & Prince, A. (1977). On Stress and Linguistic Rhythm. Linguistic Inquiry, 8(2),
249–336.
Liljencrants, J., & Lindblom, B. (1972). Numerical Simulation of Vowel Quality Systems: The
Role of Perceptual Contrast. Language, 48(4), 839. [Link]
Lindblom, B. (1990). On the notion of “possible speech sound.” Journal of Phonetics, 18(2),
135–152. [Link]
Lindblom, B., & Maddieson, I. (1988). Phonetic universals in consonant systems. In Language,
speech and mind.
Lindblom, B., Lubker, J., & Gay, T. (1979). Formant frequencies of some fixed-mandible
vowels and a model of speech motor programming by predictive simulation. Journal of
Phonetics, 7(2), 147–161. [Link]
Lingala, S. G., Zhu, Y., Kim, Y.-C., Toutios, A., Narayanan, S., & Nayak, K. S. (2017). A fast and
flexible MRI system for the study of dynamic vocal tract shaping. Magnetic Resonance in
Medicine, 77(1), 112–125. [Link]
Maess, B., Koelsch, S., Gunter, T. C., & Friederici, A. D. (2001). Musical syntax is processed in
Broca’s area: An MEG study. Nature Neuroscience, 4(5), 540–545.
[Link]
Mann, V. A., & Liberman, A. M. (1983). Some differences between phonetic and auditory
modes of perception. Cognition, 14(2), 211–235.
[Link]
Martin, M., & Mullady, K. (n.d.). Education. Lightship Beatbox. Retrieved June 6, 2022, from
[Link]
McPherson, L. (2018). The Talking Balafon of the Sambla: Grammatical Principles and
Documentary Implications. Anthropological Linguistics, 60(3), 255–294.
[Link]
McPherson, L., & Ryan, K. M. (2018). Tone-tune association in Tommo So (Dogon) folk
songs. Language, 94(1), 119–156. [Link]
Mielke, J. (2011). Distinctive Features. In The Blackwell Companion to Phonology (pp. 1–25).
John Wiley & Sons, Ltd. [Link]
Moors, T., Silva, S., Maraschin, D., Young, D., Quinn V, J., Carpentier, J., Allouche, J., &
Himonides, E. (2020). Using Beatboxing for Creative Rehabilitation After Laryngectomy:
Experiences From a Public Engagement Project. Frontiers in Psychology, 10, 2854.
[Link]
Mullady, K. (2017, January 25). Beatboxing rapping and singing at the same time [Video]. YouTube. [Link]
Nam, H., & Saltzman, E. (2003). A Competitive, Coupled Oscillator Model of Syllable
Structure. Proceedings of the 15th International Congress of Phonetic Sciences.
Nam, H., Goldstein, L., & Saltzman, E. (2009). Self-organization of Syllable Structure: A
Coupled Oscillator Model. In F. Pellegrino, E. Marsico, I. Chitoran, & C. Coupé (Eds.),
Approaches to Phonological Complexity (pp. 297–328). Walter de Gruyter.
[Link]
Narayanan, S., Nayak, K., Lee, S., Sethy, A., & Byrd, D. (2004). An approach to real-time
magnetic resonance imaging for speech production. The Journal of the Acoustical Society
of America, 115(4), 1771–1776. [Link]
Oh, M., & Lee, Y. (2018). ACT: An Automatic Centroid Tracking tool for analyzing vocal tract
actions in real-time magnetic resonance imaging speech production data. The Journal of
the Acoustical Society of America, 144(4), EL290–EL296. [Link]
Ohala, J. J. (1980). Moderator’s summary of symposium on “Phonetic universals in
phonological systems and their explanation.” Proceedings of the 9th International
Congress of Phonetic Sciences, 3, 181–194.
Ohala, J. J. (1990). There is no interface between phonology and phonetics: A personal view.
Journal of Phonetics, 18(2), 153–171. [Link]
Ohala, J. J. (2008). Languages’ Sound Inventories: The Devil in the Details. UC Berkeley
Phonology Lab Annual Reports, 4. [Link]
O’Dell, M. L., & Nieminen, T. (1999). Coupled oscillator model of speech rhythm.
Proceedings of the 14th International Congress of Phonetic Sciences, 2, 1075–1078.
O’Dell, M. L., & Nieminen, T. (2009). Coupled oscillator model for speech timing: Overview
and examples. Prosody: Proceedings of the 10th Conference, 179–190.
Palmer, C., & Kelly, M. H. (1992). Linguistic Prosody and Musical Meter in Song. Journal of
Memory and Language, 31(4), 525–542.
Park, J. (2016, September 12). 80 Fitz | Build your basic sound arsenal | HUMAN BEATBOX.
HUMAN BEATBOX.
[Link]
Paroni, A., Henrich Bernardoni, N., Savariaux, C., Lœvenbruck, H., Calabrese, P., Pellegrini, T.,
Mouysset, S., & Gerber, S. (2021). Vocal drum sounds in human beatboxing: An acoustic
and articulatory exploration using electromagnetic articulography. The Journal of the
Acoustical Society of America, 149(1), 191–206. [Link]
Paroni, A., Lœvenbruck, H., Baraduc, P., Savariaux, C., Calabrese, P., & Bernardoni, N. H.
(2021). Humming Beatboxing: The Vocal Orchestra Within. MAVEBA 2021 - 12th
International Workshop Models and Analysis of Vocal Emissions for Biomedical
Applications, Universita Degli Studi Firenze.
Parrell, B., & Narayanan, S. (2018). Explaining Coronal Reduction: Prosodic Structure and
Articulatory Posture. Phonetica, 75(2), 151–181. [Link]
Patil, N., Greer, T., Blaylock, R., & Narayanan, S. S. (2017). Comparison of Basic Beatboxing
Articulations Between Expert and Novice Artists Using Real-Time Magnetic Resonance
Imaging. Interspeech 2017, 2277–2281. [Link]
Pike, K. L. (1943). Phonetics: A Critical Analysis of Phonetic Theory and a Technique for the
Practical Description of Sounds. University of Michigan Publications.
Pillot-Loiseau, C., Garrigues, L., Demolin, D., Fux, T., Amelot, A., & Crevier-Buchman, L.
(2020). Le human beatbox entre musique et parole: Quelques indices acoustiques et
physiologiques. Volume !, 16 : 2 / 17 : 1, 125–143. [Link]
Pouplier, M. (2012). The gaits of speech: Re-examining the role of articulatory effort in
spoken language. In M.-J. Solé & D. Recasens (Eds.), Current Issues in Linguistic Theory
(Vol. 323, pp. 147–164). John Benjamins Publishing Company.
[Link]
Proctor, M., Bresch, E., Byrd, D., Nayak, K., & Narayanan, S. (2013). Paralinguistic
mechanisms of production in human “beatboxing”: A real-time magnetic resonance
imaging study. The Journal of the Acoustical Society of America, 133(2), 1043–1054.
[Link]
Proctor, M., Lammert, A., Katsamanis, A., Goldstein, L., Hagedorn, C., & Narayanan, S. (2011).
Direct Estimation of Articulatory Kinematics from Real-Time Magnetic Resonance Image
Sequences. Interspeech 2011, 284–281.
Ravignani, A., Honing, H., & Kotz, S. A. (2017). Editorial: The Evolution of Rhythm
Cognition: Timing in Music and Speech. Frontiers in Human Neuroscience, 11.
[Link]
Roon, K. D., & Gafos, A. I. (2016). Perceiving while producing: Modeling the dynamics of
phonological planning. Journal of Memory and Language, 89, 222–243.
[Link]
Rose, S., & Walker, R. (2011). Harmony Systems. In The Handbook of Phonological Theory
(pp. 240–290). John Wiley & Sons, Ltd. [Link]
Saltzman, E. L., & Munhall, K. G. (1992). Skill Acquisition and Development: The Roles of
State-, Parameter-, and Graph-Dynamics. Journal of Motor Behavior, 24(1), 49–57.
[Link]
Saltzman, E., & Kelso, J. A. (1987). Skilled actions: A task-dynamic approach. Psychological
Review, 94(1), 84–106. [Link]
Saltzman, E., Nam, H., Goldstein, L., & Byrd, D. (2006). The Distinctions Between State,
Parameter and Graph Dynamics in Sensorimotor Control and Coordination. In M. L.
Latash & F. Lestienne (Eds.), Motor Control and Learning (pp. 63–73). Kluwer Academic
Publishers. [Link]
Saltzman, E., Nam, H., Krivokapic, J., & Goldstein, L. (2008). A task-dynamic toolkit for
modeling the effects of prosodic structure on articulation. Proceedings of the 4th
International Conference on Speech Prosody (Speech Prosody 2008), 175–184.
Schellenberg, M., & Gick, B. (2020). Microtonal Variation in Sung Cantonese. Phonetica,
77(2), 83–106. [Link]
Schyns, P. G., Goldstone, R. L., & Thibaut, J.-P. (1998). The development of features in object
concepts. Behavioral and Brain Sciences, 21(1), 1–17.
[Link]
Shaw, P. A. (2008). Scat syllables and markedness theory. Toronto Working Papers in
Linguistics, 27, 145–191.
Shih, S. S., & Inkelas, S. (2014). A Subsegmental Correspondence Approach to Contour Tone
(Dis)Harmony Patterns. Proceedings of the Annual Meetings on Phonology, 1(1), Article
1. [Link]
Shih, S. S., & Zuraw, K. (2017). Phonological conditions on variable adjective and noun word
order in Tagalog. Language, 93(4), e317–e352. [Link]
Smith, C. M. (2018). Harmony in Gestural Phonology [Ph.D., University of Southern
California].
[Link]
Smolensky, P., Goldrick, M., & Mathis, D. (2014). Optimization and quantization in gradient
symbol systems: A framework for integrating the continuous and the discrete in
cognition. Cognitive Science, 38(6), 1102–1138.
Sorensen, T., & Gafos, A. (2016). The Gesture as an Autonomous Nonlinear Dynamical
System. Ecological Psychology, 28(4), 188–215.
[Link]
Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17(1–2), 3–45.
[Link]
Stevens, K. N., & Keyser, S. J. (2010). Quantal theory, enhancement and overlap. Journal of
Phonetics, 38(1), 10–19. [Link]
Stowell, D., & Plumbley, M. D. (2008). Characteristics of the beatboxing vocal style (No.
C4DM-TR-08–01; pp. 1–4). Queen Mary, University of London.
Studdert-Kennedy, M., & Goldstein, L. (2003). Launching Language: The Gestural Origin of
Discrete Infinity. In M. H. Christiansen & S. Kirby (Eds.), Language Evolution (pp.
235–254). Oxford University Press.
[Link]
Tilsen, S. (2018, March 28). Three mechanisms for modeling articulation: Selection,
coordination, and intention. Cornell Working Papers in Phonetics and Phonology.
Tilsen, S. (2019). Motoric Mechanisms for the Emergence of Non-local Phonological Patterns.
Frontiers in Psychology, 10. [Link]
Tyte, G., & Splinter, M. (2014, September 18). Standard Beatbox Notation (SBN). HUMAN BEATBOX. [Link]
Tyte, G. and Splinter, M. (2002/2004). Standard Beatbox Notation (SBN). Retrieved
December 8, 2019 from
[Link]
Walker, R. (2005). Weak Triggers in Vowel Harmony. Natural Language & Linguistic Theory,
23(4), 917. [Link]
Walker, R., Byrd, D., & Mpiranya, F. (2008). An articulatory view of Kinyarwanda coronal
harmony. Phonology, 25(3), 499–535. [Link]
Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual
reorganization during the first year of life. Infant Behavior and Development, 7(1), 49–63.
[Link]
Westbury, J. R. (1983). Enlargement of the supraglottal cavity and its relation to stop
consonant voicing. The Journal of the Acoustical Society of America, 73(4), 1322–1336.
[Link]
Wyttenbach, R. A., May, M. L., & Hoy, R. R. (1996). Categorical Perception of Sound
Frequency by Crickets. Science, 273(5281), 1542–1544.
Ziegler, W. (2003a). Speech motor control is task-specific: Evidence from dysarthria and
apraxia of speech. Aphasiology, 17(1), 3–36. [Link]
Ziegler, W. (2003b). To speak or not to speak: Distinctions between speech and nonspeech
motor control. Aphasiology, 17(2), 99–105. [Link]
Zipf, G. K. (1949). Human Behavior And The Principle Of Least Effort. Addison-Wesley
Press, Inc. [Link]
de Torcy, T., Clouet, A., Pillot-Loiseau, C., Vaissière, J., Brasnu, D., & Crevier-Buchman, L.
(2014). A video-fiberscopic study of laryngopharyngeal behaviour in the human beatbox.
Logopedics Phoniatrics Vocology, 39(1), 38–48.
[Link]
APPENDIX: Harmony beat pattern drum tabs
b |x-----------x---|--x-----------x-|x-----------x---|--x-------------
B |------x---------|------x---x-----|------x---------|------x---x-----
t |----------------|----------------|----------------|------------x---
dc|----x-----x-----|x---x-------x---|----x-----x-----|x---x-----------
^K|--------x-------|--------x-------|--------x-------|--------x-----x-
CR|x~~~--------x~~~|--x~------------|x~~~--------x~~~|--x~------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
C |x---x-x---x-x-x-|x-x---x-x-x-xxx-|x---x-x---x-x-x-|x-x---x-x-x-xxx-
ex|x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Note: the exhale may be some kind of voicing, given the larynx activity.
b |------x---------|--x---x---------|x-----x-----x---|--x---x---x-x---
ac |--x-------x---x-|----------------|--x-------x---x-|----x-----------
dc |----x-----------|x---------------|----x-----------|x---------------
tbc|----------------|----x-----------|----------------|----------------
DM |x-----------x---|----------x-----|x-----------x---|------------x---
^K |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 4: Liproll showcase
Bars 1-4
b |x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
ac |----------x-----|----------x-----|----------x-----|--------x-------
dc |----------------|----x-----------|----------------|------------x---
tbc|----------------|----------------|----------------|----x-----------
pf |--------x-------|--------x-------|--------x-------|------x---------
LR |x~~~~~------x~~~|~~----------x~~~|x~~~~~------x~~~|~~--------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Bars 5-8
b |x---x-------x---|x---x-------x---|x---x-------x---|x---x-----------
ac |----------x-----|----------x-----|----------x-----|----------------
dc |----------------|--------------x-|----------------|----------------
tbc|----------------|----------------|----------------|----------------
pf |--------x-------|--------x-------|--------x-------|--------x-------
LR |x~~~x~~~----x~~~|x~~~x~~~--------|x~~~x~~~----x~~~|x~~~x~~~--------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
b |x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
dc |----x-----------|----------------|----x-----------|----------------
tll|----------------|x---------------|----------------|x---------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
b |x---------------|x---------------|x---------------|x---------------
ac |--x-------x---x-|--x-------x---x-|--x-------x---x-|--x-------x---x-
WDA|----x~~~---x~~--|----x~~~---x~~--|----x~~~---x~~--|----x~~~---x~~--
pf |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 7: Water Drop Tongue showcase
b |x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x---x-
WDT|--x-x---------x-|x---x-------x---|--x-x---------x-|x---x-------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
B |x---------------|----------------|----------------|----------------
b |------------x---|----------------|x-----------x---|----------------
SS |--------x-------|--------x-------|--------x-------|----------------
IB |x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 10: Unknown 1
Bars 1-4
B |x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
^LR|x~~~~~------x~~~|~~--------------|x~~~~~------x~~~|~~--------------
^K |----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|------------x~~~
b |----------------|------x---------|----------------|------x---------
dc |----------------|----------------|----------------|----------------
dac|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Bars 5-8
B |x-----x-----x---|--x-------------|x---------------|----------------
^LR|x~~~~~------x~~~|~~--------------|----------------|----------------
^K |----------------|----------------|----------------|------------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|----------------
b |----------------|------x---------|------x-----x---|--x---x---------
dc |----------------|----------------|--x-----------x-|----------------
dac|----------------|----------------|----x-----------|x---------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 11: Unknown 2
Bars 1-4
hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
b |x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---x---x-
B |----------------|----------------|----------------|----------------
dc |--x-----------x-|----------------|--x-------------|----------------
tll|----x-----------|x---------------|----x-----------|----------------
tbc|----------------|----x-----------|----------------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
WDT|----------------|------------x---|----------------|------------x---
PF |----------------|----------------|----------------|----------------
ta |----------------|----------------|----------------|----------------
^K |----------------|----------------|----------------|----------------
^LR|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Bars 5-8
hm |x---x-------x---|x---x-------x---|----------------|----------------
b |x-----x-----x---|--x-----------x-|----------------|----x-x---------
B |----------------|----------------|----------------|----------x-----
dc |--x-------------|----x-------x---|----x-----x---x-|--x-------------
tll|----x-----------|------x---------|----------------|----------------
tbc|----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|----------------|----------------
WDT|--------------x-|x---------------|----------------|----------------
PF |----------------|----------------|x-----x-----x---|x---------------
ta |----------------|----------------|--x-----x-------|----------------
^K |----------------|----------------|----------------|--------x-------
^LR|----------------|----------------|----------------|----------x~~~~~
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
ProQuest Number: 29323155