
BEATBOXING PHONOLOGY

by

Gifford Edward Reed Blaylock

A Dissertation Presented to the


FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(LINGUISTICS)

August 2022

Copyright 2022 Gifford Edward Reed Blaylock


For Ellen, sine qua non.

Acknowledgments

There are not enough pages to give everyone who made this dissertation a reality the thanks

they deserve, but I’ll try anyway.

The first thanks goes to Louis Goldstein for support and endless patience. His aptly

timed pearls of wisdom and nuggets of clarity have triggered more major shifts in my

thinking than I can count. And when I’ve finished reeling from a mental sea change (or even

when the waters are calm), little has been more comforting than his calm demeanor and

readiness to help me accept the new situations I find myself in. Louis, thank you for taking

the time to drag me toward some deeper understanding of language and life.

As for the other members of my committee, I am grateful to Khalil Iskarous for

showing me how to reconceptualize complicated topics into simple problems and for

reinforcing the understanding that the superficial differences we see in the world are only

illusions. And I am grateful to Jason Zevin for letting me contribute to his lab meetings

despite not knowing what I was talking about and for offering me summer research funding

even though I’m pretty sure the project moved backward instead of forward because of me.

And thanks to my committee as a whole who together, though perhaps without realizing it,

did something I would never in a million years have predicted: they sparked my interest in

history—a subject which only a few years ago I cared nothing at all for but which I now find

indispensable. Thanks to all three of you for helping me make it this far.

I have been lucky to have the guidance of USC Linguistics department faculty (some

now moved on to other institutions) like Dani Byrd, Elsi Kaiser, Rachel Walker, Karen

Jesney, and Mary Byram Washburn. Substantial credit for any of my accomplishments at
USC goes to Guillermo Ruiz: he has always worked hard to help me, even when I seemed

determined to shoot myself in the foot; he can never be compensated enough. Many of the

insights in this dissertation can be traced back to conversations with my fellow beatboxing

scientists: Nimisha Patil and Timothy Greer at USC and Seunghun Lee, Masaki Fukuda, and

Kosei Kimura at International Christian University in Tokyo. I have also greatly benefited

from the camaraderie and insights of many of my fellow USC graduate students in

Linguistics including Caitlin Smith, Brian Hsu, Jessica Campbell, Jessica Johnson, Yijing Lu,

Ian Rigby, Jesse Storbeck, Luis Miguel Toquero Perez, Adam Woodnutt, Yifan Yang, Hayeun

Jang, Miran Oh, Tanner Sorensen, Maury Courtland, Binh Ngo, Samantha Gordon Danner,

Alfredo Garcia Pardo, Yoonjeong Lee, Ulrike Steindl, Mythili Menon, Christina Hagedorn,

Lucy Kim, and Ben Parrell.

Special thanks to my cohort-mate Charlie O’Hara who gave me massive amounts of

support and whom I hope I supported in kind. Charlie encouraged my nascent

teaching-community endeavors and showed me—through both his teaching and

research—that it’s possible to actually do things and not just talk about them. Outside of

academia, I’m indebted to Charlie for coaxing me out of my reclusive comfort zone in our

first few grad school years with invitations to his holiday parties and improv performances;

Los Angeles had been intimidating, but Charlie made it less so and opened the door to

almost a decade of confident exploration (though I’ve barely scratched the surface of LA).

Thanks to the USC Speech Production and Articulation kNowledge (SPAN)

group—especially Shri Narayanan and Asterios Toutios—for the research opportunities, for

giving me some chances to flex my coding skills a little, and for collecting the beatboxing

data used in this dissertation (not to mention for securing the NIH and NSF grants that

made this work possible in the first place). Special thanks to Adam Lammert for teaching me

just enough about image analysis to be dangerous during my first year of grad school; our

more recent conversations about music, language, and teaching have been a treat and I hope

to continue them for as long as we can.

The members of the LSA’s Faculty Learning Community (FLC) have shown me time

and time again the importance of community for thoughtful, justice-anchored teaching and

for staying sane during a global pandemic. Our meetings have been the only stable part of

my schedule for years now, and if they ever end I doubt I’ll know what to do with myself.

Thanks to Kazuko Hiramatsu, Christina Bjorndahl, Evan Bradley, Ann Bunger, Kristin

Denham, Jessi Grieser, Wesley Leonard, Michael Rushforth, Rosa Vallejos, and Lynsey Wolter

for helping me understand more than I thought I could. I am particularly indebted to Michal

Temkin Martinez, my academic big sister, who generously created a me-sized opening in the

FLC and who has been an unending font of encouragement since the moment we met.

My colleagues from the USC Ballroom Dance Team are responsible for years of

happiness and growth. I am grateful to Jeff Tanedo, Sanika Bhargaw, Katy Maina, Kim

Luong, Alison Eling, Dana Dinh, Andrew Devore, Alex Hazen, Max Pflueger, Eric

Gauderman, Mark Gauderman, Ashley Grist, Rachel Adams, Sayeed Ahmed, Zoe Schack,

Queenique Dinh, Michael Perez, and so many others for their leadership, support, and

camaraderie during our time together. Tasia Dedenbach was a superb dance partner and

friend; she gave me the confidence to trust my eye in my academic visualizations and also

gave me my first academic poster template which I abuse to this day. Alexey Tregubov is an

absolute gem of a human being who, simply by demonstrating his own work ethic, is the

reason I was able to actually finish any of my dissertation chapters. Sara Kwan has been the

most thoughtful friend a person could ask for. She is a brilliant and patient sounding-board

for all of my overthinking—whether it be about personal crises, professional crises, or crises

that develop as we indulge our mutual fondness for board games—and her well-timed snack

deliveries (like the mint chocolate Milano cookies I’m eating at this very moment) are always

appreciated. Lorena Bravo and Jonathan Atkinson have gone from being just my dance

teachers to being cherished friends. Thank you for the encouragement, the life lessons, and

for showing me how to be an Angeleno.

Sarah Harper deserves special recognition, as all who know her can attest. There have

been many consequences of her decade-long scheme of unabashedly hijacking my social

circles at two different universities. Along with all the good times we’ve had with the friends

we now share, she has also had perhaps the most punishing job of any of my friends—having

to deal with me in my crankiest moods. Sarah, thank you for your unwavering support

through it all.

Thanks to Joyce Soares, Ed Soares, Kethry Soares, and Gunnar Jaffarian for treating

me like family even when I wasn’t officially family yet. Lane Stowell, thank you for being a

true friend to Erin for these last several years, and for taking care of her when I have been

unable. And thanks to Angela Boyer, who holds the record for being friends with me the

longest despite three time zones and a bunch of miles coming between us when I moved to

California. It was Angela who suggested I take my first introductory Linguistics class, which

in turn triggered all the events that led me to writing these words.

Because this is a Linguistics dissertation, this is the paragraph where I am supposed to

thank my parents for instilling in me an early passion for language and learning and thank

my grandmother for teaching me to love reading—all of which is perfectly true and for which

I am indeed grateful. But in the context of this particular dissertation, perhaps even more

credit is owed to them for passing along to me their love of music.

Mom and Dad, I know you were embarrassed, when I was small and you took me to a

concert where Raffi started asking kids what kind of music they listened to at home, because

you thought that we didn’t listen to very much music at all. But my life has been filled with

music thanks to you: listening to Dad’s tuba on his bands’ cassette tapes or at Tuba

Christmas; hearing Mom ring out the descant of my favorite hymns; listening to the two of

you harmonizing on songs from the ancient past; singing in Mom-Mom’s choir at Christmas;

learning musical mnemonics in children’s choir that haunt me to this day; and watching in

awe (and listening in some agony) as Dad started learning the violin during a mid-life stroke

of inspiration. All of this, plus the fourteen years of piano lessons you paid for—for what

little use I made of them—and now a dissertation about vocal music. Altogether, I’d say you

can safely put any worries of my musical impoverishment to rest.

That leaves two very important women to thank. Erin Soares, thank you for being

patient with me every time I moved the goalposts on you; I am happy to report with some

confidence that my dissertation is finally, truly finished. I owe that to you: you have given me

a lifestyle that makes me feel safe and comfortable enough to write a dissertation—no small

feat. With you I am confident, capable, and loved, and I can’t wait to spend the rest of my life

making you feel the same. And Mairym Llorens Monteserín, thank you for… everything. I

couldn’t have done this without you.

Table of contents

Dedication ii

Acknowledgments iii

List of tables x

List of figures xii

Abstract xvii

Chapter 1: Introduction 1

Chapter 2: Method 21

Chapter 3: Sounds 44

Chapter 4: Theory 124

Chapter 5: Alternations 153

Chapter 6: Harmony 176

Chapter 7: Beatrhyming 242

Chapter 8: Conclusion 286

References 293

Appendix 308

List of tables

Table 1. Notation and descriptions of the most frequent beatboxing sounds. 67


Table 2. The most frequent beatboxing sounds displayed according to constrictor
(top) and airstream (left). 67
Table 3. The most frequent sounds displayed according to constrictor (top) and
constriction degree (left). 68
Table 4. The most frequent sounds displayed according to constrictor (top) and
musical role (left). 68
Table 5. Notation and descriptions of the medium-frequency beatboxing sounds. 77
Table 6. High and medium frequency beatboxing sounds displayed by constrictor
(top) and airstream mechanism (left). 78
Table 7. High and medium frequency sounds displayed by constrictor (top) and
constrictor degree (left). Medium frequency sounds are bolded. 78
Table 8. High and medium frequency beatboxing sounds displayed by constrictor
(top) and musical role (left). Medium frequency sounds are bolded. 79
Table 9. Notation and description of the low-frequency beatboxing sounds. 89
Table 10. High, medium, and low (bolded) frequency sounds displayed by
constrictor (top) and airstream mechanism (left). 90
Table 11. High, medium, and low (bolded) frequency sounds displayed by
constrictor (top) and constriction degree (left). 91
Table 12. High, medium, and low (bolded) frequency sounds displayed by
constrictor (top) and musical role (left). 91
Table 13. Notation and descriptions for the lowest frequency beatboxing sounds. 104
Table 14. All the described beatboxing sounds that could be placed on a table,
arranged by constrictor (top) and airstream mechanism (left). 106
Table 15. All the described beatboxing sounds that could be placed on a table,
arranged by constrictor (top) and constriction degree (left). 107
Table 16. All the described beatboxing sounds that could be placed on a table,
arranged by constrictor (top) and musical role (left). 108
Table 17. 22 beatboxing sounds/sound families, 37 minimal differences. 116
Table 18. 21 sounds with maximal dispersion, 20 minimal differences. 116
Table 19. 23 English consonants, 57 minimal differences ([l] conflated with [r]).
Voiceless on the left, voiced on the right. 117
Table 20. Summary of the minimal sound pair and entropy (place) analyses for
beatboxing, a hypothetical maximally distributed system, and English
consonants. 117
Table 21. Non-exhaustive lists of state-, parameter-, and graph-level properties for
dynamical systems used in speech. 132

Table 22. Unforced Kick Drum environments. 174
Table 23. Kick Drum environment type observations. 174
Table 24. Kick Drum token observations. 175
Table 25. Summary of the five beat patterns analyzed. 188
Table 26. The beatboxing sounds used in this chapter. 191
Table 27. Sounds of beatboxing used in beat pattern 5. 192
Table 28. Sounds of beatboxing used in beat pattern 9. 199
Table 29. Sounds of beatboxing used in beat pattern 4. 202
Table 30. Sounds of beatboxing used in beat pattern 10. 207
Table 31. Sounds of beatboxing used in beat pattern 1. 213
Table 32. Non-exhaustive lists of state-, parameter-, and graph-level properties for
dynamical systems used in speech. 231
Table 33. Sounds of beatboxing used in this chapter. 248
Table 34. Contingency table of beatboxing sound constrictors (top) and the speech
sounds they replace (left). 266

xi
List of figures

Figure 1. A simple hierarchical tree structure with alternating strong-weak nodes. 24


Figure 2. Hierarchical strong-weak alternations in which one level (“beats”) is
numbered. 24
Figure 3. Hierarchical strong-weak alternations. 25
Figure 4. Two levels below the beat level have further subdivisions. 26
Figure 5. A metrical grid of the rhythmic structure of the first two lines of an
English limerick. 26
Figure 6. A metrical grid representation of the metrical structure of Figure 4. 27
Figure 7. A metrical grid representation in which each beat has three subdivisions. 27
Figure 8. A metrical grid in which beats 1 and 3 have four sub-divisions while beats
2 and 4 have three sub-divisions. 28
Figure 9. A metrical grid of the beatboxing sequence {B t PF t B B B PF t}. 28
Figure 10. A drum tab representation of the beat pattern in Figure 9, including a
label definition for each sound. 29
Figure 11. A simplification of a drum tab from Chapter 5: Alternations. 30
Figure 12. Waveform, spectrogram, and text grid of three Kick Drums produced at
relatively long temporal intervals. 33
Figure 13. LAB region, unfilled during a Vocalized Tongue Bass (left) and filled
during the Kick Drum that followed (right). 37
Figure 14. LAB2 region filled during a Liproll (left) and empty after the Liproll is
complete (right). 38
Figure 15. COR region, filled by an alveolar tongue tip closure for a Closed Hi-Hat
{t} (left), filled by a linguolabial closure {tbc} (center), and empty (right). 38
Figure 16. DOR region, filled by a tongue body closure during a Clickroll (left) and
empty when the tongue body is shifted forward for the release of an
Inward Snare (right). 39
Figure 17. FRONT region for Liproll outlined in red, completely filled at the
beginning of the Liproll (left) and empty at the end of the Liproll (right). 39
Figure 18. VEL region demonstrated by a Kick Drum, completely empty while the
velum is lowered for the preceding sound (left) and filled while the Kick
Drum is produced (right). 40
Figure 19. LAR region demonstrated by a Kick Drum (an ejective sound),
completely empty before laryngeal raising (left) and filled at the peak of
laryngeal raising (right). 40
Figure 20. Beatboxing sounds organized by maximal dispersion in a continuous
phonetic space (top) vs organization along a finite number of phonetic
dimensions (bottom). 47
Figure 21. Rank-frequency plot of beatboxing sounds. 55
Figure 22. Histogram of the residuals of the power law fit. 55
Figure 23. Scatter plot of the residuals of the power law fit (gray) against the
expected values (black). 56
Figure 24. Log-log plot of the token frequencies (gray) against the power law fit
(black). 56
Figure 25. The discrete cumulative density function for the token frequencies of the
sounds in this data set (gray) compared to the expected function for
sounds following a power law distribution (black). 57
Figure 26. The discrete cumulative density function (token frequency) of sounds in
this beat pattern (gray, same as Figure 25) against the density function of
the same sounds re-ordered by beat pattern frequency order (black). 58
Figure 27. The forced Kick Drum. 61
Figure 28. The PF Snare. 62
Figure 29. The Inward K Snare. 62
Figure 30. The unforced Kick Drum. 63
Figure 31. The Closed Hi-Hat. 64
Figure 32. The dental closure. 70
Figure 33. The linguolabial closure (dorsal). 70
Figure 34. The linguolabial closure (non-dorsal). 71
Figure 35. The alveolar closure. 71
Figure 36. The alveolar closure (frames 1-2) vs the Water Drop (Air). 71
Figure 37. The Spit Snare. 72
Figure 38. The Throat Kick. 73
Figure 39. The Inward Liproll. 74
Figure 40. The Tongue Bass. 75
Figure 41. Humming. 80
Figure 42. The Vocalized Liproll, Inward. 80
Figure 43. The Closed Tongue Bass. 81
Figure 44. The Liproll. 82
Figure 45. The Water Drop (Tongue). 82
Figure 46. The (Inward) PH Snare. 83
Figure 47. The Inward Clickroll. 84
Figure 48. The Open Hi-Hat. 84
Figure 49. The lateral alveolar closure. 85
Figure 50. The Sonic Laser. 85
Figure 51. The labiodental closure. 86
Figure 52. The Clop. 92
Figure 53. The D Kick. 93
Figure 54. The Inward Bass. 93
Figure 55. The Low Liproll. 94
Figure 56. The Hollow Clop. 94
Figure 57. The Tooth Whistle. 95
Figure 58. The Voiced Liproll. 95
Figure 59. The Water Drop (Air). 96
Figure 60. The Clickroll. 96
Figure 61. The D Kick Roll. 97
Figure 62. The High Liproll. 97

Figure 63. The Inward Clickroll with Liproll. 98
Figure 64. The Lip Bass. 98
Figure 65. tch. 99
Figure 66. The Liproll with Sweep Technique. 99
Figure 67. The Sega SFX. 100
Figure 68. The Trumpet. 100
Figure 69. The Vocalized Tongue Bass. 101
Figure 70. The High Tongue Bass. 101
Figure 71. The Kick Drum exhale. 102
Figure 72. Histogram of 10,000 random sound pair trials in a 6 x 7 x 2 matrix. 118
Figure 73. Histogram of 10,000 random sound pair trials in a 4 x 7 x 2 matrix. 119
Figure 74. A lip closure time function for a spoken voiceless bilabial stop [p], taken
from real-time MRI data. 133
Figure 75. Schematic example of a spring restoring force point attractor. 134
Figure 76. Schematic example of a critically damped mass-spring system. 135
Figure 77. Schematic example of a critically damped mass-spring system with a
soft spring. 136
Figure 78. Position and velocity time series for labial closures for a beatboxing Kick
Drum {B} (left) and a speech voiceless bilabial stop [p] (right). 144
Figure 79. Parameter values tuned for a specific speech unit are applied to a point
attractor graph, resulting in a gesture. 150
Figure 80. Speech-specific and beatboxing-specific parameters can be applied
separately to the same point attractor graph, resulting in either a speech
action (a gesture) or a beatboxing action. 150
Figure 81. Forced/Classic Kick Drum. Larynx raising, no tongue body closure. 156
Figure 82. Unforced Kick Drum. Tongue body closure, no larynx raising. 158
Figure 83. Spit Snare vs Unforced Kick Drum. 159
Figure 84. Forced Kick Drum beat patterns. 165
Figure 85. Unforced Kick Drum beat patterns. 166
Figure 86. Beat patterns with both forced and unforced Kick Drums. 168
Figure 87. An excerpt from a PointTier with humming. 171
Figure 88. A sequence of a lateral alveolar closure {tll}, unforced Kick Drum {b},
and Spit Snare {SS}. 176
Figure 89. A beat pattern that demonstrates the beatboxing technique of humming
with simultaneous oral sound production. 180
Figure 90. This beat pattern contains five sounds: a labial stop produced with a
tongue body closure labeled {b}, a dental closure {dc}, a lateral closure
{tll}, and a lingual egressive labial affricate called a Spit Snare {SS}. All of
the sounds are made with a tongue body closure. 181
Figure 91. Drum tab of beat pattern 5. 193
Figure 92. Regions for beat pattern 5. 194
Figure 93. Time series of vocal tract articulators used in beat pattern 5, captured
using a region of interest technique. 195
Figure 94. Time series and rtMRI snapshots of forced and unforced Kick Drums. 196

Figure 95. Drum tab of beat pattern 9. 200
Figure 96. Time series and gestures of beat pattern 9. 200
Figure 97. Drum tab notation for beat pattern 4. 203
Figure 98. Regions used to make time series for the Liproll beat pattern. 204
Figure 99. Time series of the beat pattern 4 (Liproll showcase). 206
Figure 100. Drum tab for beat pattern 10. 208
Figure 101. The regions used to make the time series for beat pattern 10. 210
Figure 102. Time series of beat pattern 10. 211
Figure 103. Drum tab notation for beat pattern 1. 214
Figure 104. Regions for beat pattern 1 (Clickroll showcase). 216
Figure 105. Time series of beat pattern 1. 217
Figure 106. The DOR region for the Clickroll showcase (beat pattern 1) in the first
{CR dc B ^K}. 218
Figure 107. Each forced Kick Drum in the beat pattern in order of occurrence. 218
Figure 108. Time series and real-time MRI snapshots of forced and unforced Kick
Drums. 219
Figure 109. A schematic coupling graph and gestural score of a Kick Drum and Spit
Snare. 234
Figure 110. A schematic coupling graph and gestural score of a Kick Drum,
humming, and a Spit Snare. 235
Figure 111. A schematic coupling graph and gestural score of a {b CR B ^K}
sequence. 237
Figure 112. Waveform, spectrogram, and text grid of the beatrhymed word
“dopamine”. 248
Figure 113. Bar plot of the expected counts of constrictor matching with no task
interaction. 251
Figure 114. Bar plot of the expected counts of constrictor matching with task
interaction. 251
Figure 115. Bar plots of the expected counts of K Snare constrictor matching with
no task interaction. 253
Figure 116. Bar plots of the expected counts of K Snare constrictor matching with
task interaction. 253
Figure 117. Serial and hierarchical representations of a 16-bar phrase (8 lines with 2
measures each). 256
Figure 118. Example of a two-line beat pattern. 263
Figure 119. Bar plot showing measured totals of constrictor matches and
mismatches. 265
Figure 120. Bar plots with counts of the actual matching and mismatching
constrictor replacements everywhere except the back beat. 268
Figure 121. Bar plot with counts of the actual matching and mismatching
constrictor replacements on just the back beat. 269
Figure 122. Four lines of beatrhyming featuring two replacement mismatches
(underlined). 270

Figure 123. Counts of replacements by beatboxing sounds (bottom) against the
manner of articulation of the speech sound they replace (left). 272
Figure 124. Counts of replacements by beatboxing sounds (bottom) against the
speech sound they replace (left). 272
Figure 125. Four 16-bar beatboxing (sections B and D) and beatrhyming (sections
C and E) phrases with letter labels for each unique sound sequence. 275
Figure 126. Beat pattern display and repetition ratio calculations for sections B, C,
D, and E. 276
Figure 127. Tableau in which a speech labial stop is replaced by a K Snare on the
back beat. 283
Figure 128. Tableau in which a speech labial stop is replaced by a Kick Drum off
the back beat. 283
Figure 129. Waveform, spectrogram, and text grid of the beatrhymed word “move”
with a Kick Drum splitting the vowel into two parts. 287
Figure 130. Waveform, spectrogram, and text grid of the beatrhymed word “sky”
with a K Snare splitting the vowel into two parts. 288
Figure 131. The anthropophonic perspective. 296

Abstract

Beatboxing is a type of non-linguistic vocal percussion that can be performed as an

accompaniment to linguistic music or as a standalone performance. This dissertation is the

first major effort to begin to probe beatboxing cognition—specifically beatboxing

phonology—and to develop a theoretical framework relating representations in speech and

beatboxing that can account for phonological phenomena that speech and beatboxing share.

In doing so, it contributes to the longstanding debate about the domain-specificity of

language: because hallmarks of linguistic phonology like contrastive units (Chapter 3),

alternations (Chapter 5), and harmony (Chapter 6) also exist in beatboxing, beatboxing

phonology provides evidence that beatboxing and speech share not only the vocal tract but

also organizational foundations, including a certain type of mental representation and

coordination of those representations.

Beatboxing has phonological behavior based in its own phonological units and

organization. One could choose to model beatboxing with adaptations of either features or

gestures as its fundamental units. But as Chapter 4: Theory discusses, a gestural approach

captures both domain-specific aspects of phonology (learned targets and parameter settings

for a given constriction) and domain-general aspects (the ability of gestural representations

to contrast, to participate in class-based behavior, and to undergo qualitative changes).

Gestures have domain-specific meaning within their own system (speech or beatboxing)

while sharing a domain-general conformation with other behaviors. Gestures can do this by

explicitly connecting the tasks specific to speech or to beatboxing with the sound-making

potential of the vocal substrate they share; this in turn creates a direct link between speech
gestures and beatboxing gestures. This link is formalized at the graph level of the dynamical

systems by which gestures are defined.
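To make the dynamical-systems formalization concrete, the following is a minimal numerical sketch (illustrative only; the function name and all parameter values are invented for this example, not taken from the dissertation) of the critically damped mass-spring point attractor standardly used to model a single gesture. The same system graph, given different parameter settings, yields different movements; this is the sense in which a speech action and a beatboxing action can share a graph while differing in parameters.

```python
import math

def gesture_trajectory(x0, target, k, dt=0.001, dur=0.3, m=1.0):
    """Simulate a critically damped mass-spring point attractor:
    m*x'' + b*x' + k*(x - target) = 0, with b = 2*sqrt(m*k).
    Returns the position time series (simple forward-Euler integration)."""
    b = 2.0 * math.sqrt(m * k)  # critical damping: reach target without overshoot
    x, v = x0, 0.0
    traj = []
    for _ in range(int(dur / dt)):
        a = (-b * v - k * (x - target)) / m
        v += a * dt
        x += v * dt
        traj.append(x)
    return traj

# Same graph (a point attractor), different parameter settings:
# e.g., a hypothetical lip-closure gesture vs. a stiffer, faster one.
speech_traj = gesture_trajectory(x0=10.0, target=0.0, k=200.0)
beatbox_traj = gesture_trajectory(x0=10.0, target=0.0, k=800.0)
```

The stiffer system approaches its target faster within the same duration, while both trajectories decay smoothly toward the target without overshooting, as expected of a critically damped point attractor.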

The direct formal link between beatboxing and speech units makes predictions about

what types of phonological phenomena beatboxing and speech units are able to

exhibit—including phonological alternations and harmony mentioned above. It also predicts

that the phonological units of the two domains will be able to co-occur, with beatboxing and

speech sounds interwoven together by a single individual. This type of behavior is known as

“beatrhyming” (Chapter 7: Beatrhyming).

These advantages of the gestural approach for describing speech, beatboxing, and

beatrhyming underscore a broader point: that regardless of whether phonology is modular or

not, the phonological system is not encapsulated away from other cognitive domains, nor

impermeable to connections with other domains. On the contrary, phonological units are

intrinsically related to beatboxing units—and, presumably, to other units in similar

systems—via the conformation of their mental representations. As beatrhyming helps to

illustrate, the properties that the phonological system shares with other domains are also the

foundation of the phonological system’s ability to flexibly integrate with other (e.g., musical)

domains.

CHAPTER 1: INTRODUCTION

Beatboxing is a type of non-linguistic vocal percussion that can be performed as an

accompaniment to linguistic music (e.g. rapping or a cappella singing) or as a standalone

performance—the latter being primarily the focus here. Beatboxers are increasingly

recognized in both scientific and popular literature as artists who push the limits of the vocal

tract with unspeechlike vocal articulations that have only recently been captured with

modern imaging technology. Scientific study of beatboxing is valuable on its own merits,

especially for beatboxers hoping to teach and learn beatboxing more effectively. But much of

beatboxing science also serves as a type of speech and linguistic science, aimed at

incorporating beatboxing into innovative speech therapy techniques or understanding the

nature of speech.

This dissertation contributes to both beatboxing science and speech science. As a

piece of beatboxing science, the contribution is the first major effort (that I know of) to

begin to probe beatboxing cognition—specifically beatboxing phonology, including the

discovery of phonological alternations, phonological harmony, and the development of a

theoretical framework relating representations in speech and beatboxing that can account for

these findings. As a type of linguistic science, the dissertation contributes to the longstanding

debate about the domain-specificity of language: because hallmarks of linguistic phonology

like alternations and harmony also exist in beatboxing, beatboxing phonology provides

further evidence that phonology is rooted in domain-general cognition (rather than existing

as, say, a unique and innate component of a modular language faculty).

Section 1 introduces the art of beatboxing and briefly summarizes the current state of

beatboxing science, particularly with an eye to beatboxing cognition. Section 2 establishes

the context for how research on a distinctly non-linguistic behavior like beatboxing can be

considered relevant to linguistic inquiry.

1. The art and science of beatboxing

1.1 Beatboxing art

The foundation of beatboxing lies in hip hop. The “old school” of beatboxing began as

human mimicry of the sounds of a beat box, a machine that synthesizes percussion sounds

and other sound effects. The beat box created music that an MC could rap over; when a beat

box wasn’t available, a human could perform the role of a beat box by emulating it vocally.

The two videos below demonstrate how beatboxing was used by early artists like Doug E.

Fresh and Buffy to give other MCs a beat to rap over.

Doug E. Fresh and Slick Rick, “La Di Da Di” (1985)

[Link]

(The beat pattern starts in earnest around 0:48. Before that, you can hear Doug E. Fresh

using his signature Clickrolls—a lateral lingual ingressive trill.)

Fat Boys, “Human Beat Box” (1984)

[Link]

(Buffy was well-known for his “bass-heavy breathing technique” (source) that you can hear

from 0:10-0:15.)

The last four decades have given beatboxers plenty of time to innovate in both artistic

composition and beatbox battles that demonstrate mechanical skill. Modern beatboxing

performances often stand alone: if there are any words, they are only occasional and woven

by the beatboxer into the beat pattern rather than said by a second person. (There are art

forms like beatrhyming where singing/rapping and beatboxing are fully integrated, but this is

a different vocal behavior; see Chapter 7: Beatrhyming. Combining words or other vocal

behaviors into beatboxing is sometimes called multi-vocalism.) The next two videos show

that beat patterns in the “new school” of beatboxing may be faster, reflecting contemporary

popular music styles.

Reeps One “Metal Jaw” (2013)

[Link]

Mina Wening (2017)

[Link]

Beatboxing evolves through innovation of new sounds or sound variations, patterns (e.g.,

combinations of sounds or styles of breathing), and integration with other behaviors (e.g.,

beatboxing flute, beatboxing cello, beatrhyming, beatboxing with other beatboxers). For

novice beatboxers, the goal is to learn how to sound as good as experts; for expert

beatboxers, the goal is to create art through innovation while keeping up with trends. This

innovation is constrained by both physical and cultural forces. The major physical constraint

is the vocal tract itself which limits the speed and quality (i.e., constriction degree and

location) of possible movements; new beatboxing sounds and patterns are thought to arise

from testing these physical limitations. As for cultural forces, both the musical genres that

inspire beatboxing and the preferences of beatboxers themselves have a role. Three examples

follow.

First, beatboxing started without words, and today most beatboxers still rarely speak

during a beatboxing performance. Though it is not uncommon to hear a word or short

phrase during a beat pattern, usually with non-modal phonation, the fact that beatrhyming

has its own name to distinguish it from beatboxing implies that it is not the same art form.

Second, since the initial role of beatboxing was to provide a clear beat by emulating drum

sounds, non-continuants like stops and affricates became very common while continuants like

vowels are almost never used. When drawing on inspiration from other musical sources,

related genres like electronic dance music would have been appealing for their percussive

similarities. Contemporary beat patterns keep the percussive backbone, though some

sustained sounds (i.e., modal or non-modal phonation for pitch) can be used concurrently as

well. Third, and more broadly, beatboxing shares musical properties with a broad range of

(Western) genres, resulting in common patterns. One common property is 4/4 time, which

signifies that the smallest musical phrases each contain four main events (which can be

thought of as being grouped into two pairs). Another common property is the placement

of emphasis on the “back beat” (beat 3 in 4/4 time) via snare sounds (Greenwald, 2002).

These types of properties, together with the vocal modality, shape the musical style and

evolution of beatboxing. Innovation in beatboxing is done within these constraints.
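The 4/4 structure and back-beat constraint described above can be sketched concretely. The following toy representation is my own illustration, not part of the dissertation's analysis; the sound labels anticipate the curly-bracket notation introduced below. It treats a one-bar beat pattern as four beats of four sixteenth-note slots:

```python
# Toy sketch of a one-bar beat pattern in 4/4 time: four beats, each
# subdivided into four sixteenth-note slots. Labels are illustrative:
# "B" = Kick Drum, "t" = Closed Hi-Hat, "PF" = PF Snare, "." = rest.
PATTERN = [
    ["B", ".", "t", "."],   # beat 1
    ["t", ".", "B", "."],   # beat 2
    ["PF", ".", "t", "."],  # beat 3 carries the snare (the "back beat")
    ["t", "B", "t", "."],   # beat 4
]

def has_backbeat_snare(pattern, snare_sounds=("PF",)):
    """Check the constraint that beat 3 begins with a snare sound."""
    return pattern[2][0] in snare_sounds

assert has_backbeat_snare(PATTERN)
```

A grid like this makes a metrical constraint such as "beat 3 must have a snare sound" checkable in the same way phonotactic restrictions are checked over segment strings.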

Common advice in beatboxing pedagogy is to learn incrementally. New aspiring

beatboxers are encouraged to start by drilling the fundamentals: basic sounds like Kick

Drums {B} [p’], Closed Hi-Hats {t} [t’], and PF Snares {PF} [pf’] should become familiar

first in isolation, then in combos and beat patterns to practice them in a rhythmic context.

(Curly bracket notation indicates a beatboxing sound, while square bracket notation

indicates International Phonetic Alphabet notation.) Once the relatively small set of sounds

is secure, it is time to learn new sounds that facilitate breath management—this is important

for performing progressively more complex and intensive beat patterns that demand more

air regulation. At the same time, new beatboxers also need to focus on “technicality”, a jargon

word in the beatboxing community that refers to how accurately and precisely a sound is

performed. Reference to and imitation of other beatboxers is common for establishing ideals

and task targets. All of these basics are the foundations from which a beatboxer can start to

innovate by making novel sounds and beat patterns; and, beatboxers continue to revisit these

different facets of their art to make improvements at multiple time scales (e.g., improving a

single sound, improving a combination of sounds, developing a flow or a new style). As a

consequence of all this, beatboxers are often aware of, or actively focused on, some facet of their

beatboxing as they perform, in a way that fluent speakers of a language typically are not aware of

their own performance; moreover, beatboxers at different stages in the learning process (or

even at the same stage) may beatbox very differently depending on the sounds they know

and which facet of beatboxing they are practicing.

All of these details are important for later chapters. The fact that beatboxers are

aiming for particular sound qualities and flow patterns means that we should expect to find

beatboxing patterns that balance aesthetics and motor efficiency (Chapter 6: Harmony). The

lack of words in beatboxing, the interest in imitating instruments/sound effects, and the

drive to innovate through the use of new sounds are all hints that beatboxing phonology is

not a variation of speech phonology but a sound organization system in its own right. The

metrical patterns of sounds (e.g., Snares on beat 3) frame observations about beatboxing sound

alternations (Chapter 5: Alternations) and the relationship between speech and beatboxing

sounds in beatrhyming (Chapter 7: Beatrhyming). And the fact that beatboxers are actively

focusing on different things and cultivating different styles goes a long way to explaining

qualitative variation among beatboxers, including differences in their sound inventories and

the productions of individual sounds (Chapter 3: Sounds).

1.2 Beatboxing science

A guiding theme in beatboxing science is the study of vocal agility and capability

(Dehais-Underdown, 2021). The complex unspeechlike sounds and patterns of beatboxing

inform our understanding of what kinds of vocal sound-producing movements and patterns

can be performed efficiently—and sometimes surprise us when we see articulations that we

didn’t think were possible. This in turn offers a better general phonetic framework for

studying the relationship between linguistic tasks, cognitive limitations, physical limitations,

and motor constraints in the evolution of speech.

Likewise, knowing more about the physical abilities of the vocal tract also informs

our understanding of disordered or otherwise non-normative speech production strategies.

Some researchers advocate for using beatboxing for speech therapy (Pillot-Loiseau et al.,

2021). The BeaTalk strategy has been used to improve speech in adults (Icht, 2018, 2021; Icht

& Carl, 2022); and beatboxers Martin & Mullady (n.d.) use beatboxing in their work with

children. (See also Himonides et al., 2018; Moors et al., 2020.) Although beatboxing

interventions for therapeutic purposes are still quite new, the tantalizingly obvious

connection between beatboxing and speech as vocal behaviors has been generating interest

within the beatboxing and academic communities.

Crucial to both these branches of inquiry but almost completely undeveloped within

the field is a theory of beatboxing cognition. The literature offers just three claims about

beatboxing cognition so far, none of which are firmly established: one about the intentions of

beatboxers, and two about the fundamental units of beatboxing. There is a general consensus

that, based on the origins of beatboxing as a tool for supporting hip hop emcees, a

beatboxer’s primary intention is to imitate the sounds of a drum kit, electronic beat box, and

a variety of other sound effects (Lederer, 2005; Stowell & Plumbley, 2008; Pillot-Loiseau et

al., 2020). But treating beatboxing as simple imitation is reductive and does a disservice to the

primacy of the art form (Woods, 2012). Even in the earliest days, old school beatboxers

established distinctive vocal identities that were surely not just attempts to mimic different

electronic beat boxes. The new school of beatboxing has come a long way since then and

shows rapidly evolving preferences in artistic expression that a drive for pure imitation seems

unlikely to motivate.

As for the cognitive representations of the sounds themselves, Evain et al. (2019) and

Paroni et al. (2021) posit the notion of a “boxeme” by analogy to the phoneme—an

acoustically and articulatorily distinct building block of a beatboxing sequence. While they

imply that boxemes are meant to be a hypothesis about cognitive units, they do not address

other questions raised by the phoneme analogy (Dehais-Underdown, 2021). Are boxemes

the smallest compositional units or are they composed of even smaller elements, as

phonemes are thought to be composed of features? Does beatboxing exhibit phonological

patterns that require a theory with some degree of abstraction? And are boxemes symbolic

units, action units, or something else? Separately, Guinn & Nazarov (2018) argue for the

active role of phonological features in beatboxing based on evidence from variations in beat

patterns and phonotactic place restrictions (an absence of beatboxing coronals in prominent

metrical positions). They do not link features back to larger (i.e., segment-sized) units; while

they offer the possibility that speech and beatboxing features are linked (perhaps in the same

way that the features of a language learned later in life are linked to the features of a

language spoken from birth), it remains unclear whether or how speech representations and

beatboxing representations (whatever they are) should be considered cognitively linked.

The lack of work on beatboxing cognition is understandable: the field of beatboxing

science is still in its infancy with less than 20 years of research, and the few scientists

involved in the field have had their hands full with other more tractable questions. But it will

be difficult to use beatboxing to inform an account of the physical and cognitive factors that

shape speech without both physical and cognitive accounts of beatboxing. And while the

viability of beatboxing as a tool for speech therapy is ultimately an empirical question, a

theory of beatboxing cognition that is explicit about whether and how speech and

beatboxing sounds are cognitively related should help decide what interventions are more or

less likely to work.

This dissertation’s major contribution to beatboxing science is the initiation of a

systematic inquiry into beatboxing cognition—specifically, a hypothesis about the

fundamental units of beatboxing phonology. Chapter 3: Sounds describes beatboxing sounds

and the articulatory properties along which they are organized. Chapter 4: Theory lays out

the hypothesis that those articulatory properties can be formalized as the fundamental

cognitive units of beatboxing, akin to the fundamental linguistic gestures of Articulatory

Phonology (Browman & Goldstein, 1986, 1989). Rooting beatboxing cognition in gesture-like

units offers two benefits: the same types of empirically-testable predictions as Articulatory

Phonology, and a theoretical link between the cognitive units of speech and beatboxing. Both

benefits are advantageous for developing theories of speech informed by beatboxing and for

developing therapeutic beatboxing interventions. Chapter 5: Alternations and Chapter 6:

Harmony support this hypothesis with an example of beatboxing phonology—phonological

harmony complete with triggers, undergoers, and blockers—and offer an account based on

gestures. Finally, Chapter 7: Beatrhyming goes a step further to provide evidence for a direct

link between the cognitive units of speech and beatboxing via the art of simultaneous

production of beatboxing and singing known as beatrhyming.

2. Beatboxing as a lens for linguistic inquiry

With respect to linguistic inquiry, the longstanding debate addressed here is one of

domain-specificity: Does the human capacity for language consist only of a specialized

composite of other cognitive systems, or is there some component that is unique to language

and cannot be attributed to specialization of other cognitive systems (Anderson, 1981)? The

question has been central in the development of major linguistic paradigms over the last

several decades, including the Minimalist program that views the human language faculty as

only minimally domain-specific (the language faculty in the narrow sense) and otherwise

composed of a unique assembly of other cognitive functions (e.g., Hauser et al., 2002; Collins,

2017 provides an overview).

One of the strongest theories of domain-specificity in cognition comes from Fodor

(1983), who offers a modular approach in which a cognitive domain constitutes its own

system. In the original conception, modules are low-level (mostly sensory input) systems

which are likely to be encapsulated, automatic, innate, and which perform computations

exclusively over inputs relevant to their domain—hence, domain-specific. Modules are

distinct from the non-specific handling of general cognitive processing. Liberman &

Mattingly’s (1985) Motor Theory couched speech perception as a linguistic module built

around the relationship between intended phonetic gestures and their acoustic output. The

Motor Theory proposes that speech perception is a parallel system to general auditory

processing, a claim supported by duplex perception tasks (Liberman et al., 1981; Mann &

Liberman, 1983). Modularity has been conceived of in many different ways by now, and

whether or not a system like language shows all of the typical traits (e.g., encapsulation,
innateness) is open to empirical testing, but domain-specificity remains key to the modular

theory (Coltheart, 1999). Even when phonology is not considered a module in the strictest

sense, it is still common to make reference to the modular “interface” between phonetics and

phonology, which implies that the linguistic system of sounds is distinct from the physical

implementation of sounds (cf. Ohala, 1990).

One of the key arguments in favor of domain-specificity is tied up with innateness:

there are substantial barriers for the infant attempting to learn language, including lack of

segmentability and lack of invariance in the acoustic signal of the ambient language(s); given

how quickly and effectively newborns learn speech production and perception, it stands to

reason that humans may be born with a language faculty that provides a universal starting

point for the acquisition process. This language faculty is domain-specific insofar as the

innate cognitive scaffolding is tailored to address linguistic issues. Werker & Tees (1984) and

related work demonstrated that infants are born with the ability to distinguish speech

sound contrasts, including contrasts not used by adult speakers of the ambient language.

Others argue in favor of accounting for speech patterns using only

language-independent, domain-general information, without relying on an innate,

species-specific language capacity (Universal Grammar) (e.g., Lindblom, 1983; Archangeli &

Pulleyblank, 2015, 2022). This approach has foregrounded major questions in phonology

over the last few decades, all shaped around developing an understanding of how phonetics

shapes phonology. Quantal Theory (Stevens, 1989; Stevens & Keyser, 2010) derives common

phonological categories from quantal regions in the vocal tract where coarticulation is less

likely to interfere with perception. The Theory of Vowel Dispersion (Liljencrants &

Lindblom, 1972; Lindblom et al., 1979) generates typologically common vowel patterns using

the principle of maximal contrast but without presupposing any particular phonological

categories. Likewise, proponents of the Auditory Enhancement Hypothesis (Diehl &

Kluender, 1989; Diehl et al., 1991) argue that the common covariation of certain phonological

features is explained by their mutual compatibility in enhancing perceptual contrasts. The

frame/content theory (MacNeilage, 1998) posits that the origins of speech come not from a

spontaneous mutation but rather evolved from homeostatic motor functions; in this case,

phonological syllable structure (the frame) descended from the chewing action.

The question of domain-specificity is an undercurrent of much research in cognitive

science and evolutionary psychology and often involves comparing speech and language to

other types of human or non-human cognition (Hauser et al., 2002). Categorical perception

has been found in chinchillas (Kuhl & Miller, 1978) and crickets (Wyttenbach et al., 1996), as

well as for human perception of non-speech sounds (Fowler & Rosenblum, 1990) and faces

(e.g., Beale & Keil, 1995). Language and music share certain rhythmic (see Ravignani et al.,

2017 for a recent discussion), syntactic (Lerdahl & Jackendoff, 1983), and neurological

qualities (Maess et al., 2001), with other apparently cross-domain ties (Feld & Fox, 1994;

Bidelman et al., 2011). Comparison of neurotypical speech and disordered speech contributes

to a neurological aspect of the discussion such as whether the motor planning in speech uses

specialized or domain-general circuitry (Ballard et al., 2003; Ziegler, 2003a, 2003b).

Despite the evidence suggesting that language and phonology may not have a

domain-specific component, domain-specific generative models remain the norm in much of

phonological theory. This dissertation’s contribution to the domain-specificity conversation

is to argue that domain-general models of phonology have more predictive power than

domain-specific models for modeling phonological behavior that exists both in and outside

of speech.

Models of a theory help scientists describe and explain natural phenomena, and in

doing so also predict what related phenomena we should expect to find. Domain-specific

models are meant to describe and predict only phenomena within their own domain: in a

domain-specific computational phonological model, for example, the inputs and outputs are

exclusively linguistic and the grammar operates only over those linguistic elements. If the

same model were used to try to account for the inputs and outputs of a different cognitive

domain, then by definition the model would either fail or be subject to alterations that make

it no longer domain-specific.1 And when the model predicts phenomena that are not

observed within its domain, the model is said to be imperfect because it overgenerates. As a

consequence, domain-specific models are unable to describe, explain, or predict phenomena

outside of their domain.

The domain-specificity of computational phonological models was entrenched at

least as early as the divorcing of phonetics from phonology (de Saussure, 1916; Baudouin de

Courtenay, 1972) which led to interest in only those aspects of phonology which are

essentially linguistic (Sapir, 1925; Hockett, 1955; Ladefoged, 1989). In programs descended

from this tradition, the features and grammar of phonological theory are domain-specific

because they deal exclusively with phonological inputs, outputs, and processes. The inputs

1 If a domain-specific model needs to be used to account for the phenomena in a different domain,
domain-specificity can be preserved by copying the model’s form and adapting its units/computations to the
new context. This would result in two non-overlapping domain-specific models. This might happen in a case of
cognitive parasitism; see below for more discussion on this point.
and outputs are typically expressed as phonological features—atomic representations of

linguistic information defined by their relationship with each other, whose purpose it is to

encode meaningful contrast, and which are the basis of phonological change (Dresher, 2011;

Mielke, 2011). Phonological features are meant to be representations of linguistic meaning

and organization—they are crucially not meant to be representations of any other domain.

Depending on the strictness of a model’s commitment to domain-specificity,

sometimes explanation in phonology may come from outside language. Widespread interest

in the relationship between phonetics and phonology was renewed with the advent of

acoustically-grounded distinctive features (Jakobson et al., 1951) and the mapping of gradient

phonetic features to scalar phonetic (phonological) features in SPE (Chomsky & Halle, 1968;

see Keating, 1996 for the dual role of phonetics in SPE). Phonological grammars commonly

use phonetic grounding to constrain their outputs (Prince & Smolensky, 1993/2004; Hayes,

Kirchner, & Steriade, 2004). On the other hand, other programs based on strict

domain-specific modularity argue that phonetics should have no role in the makeup of the

grammar (e.g., Hale & Reiss, 2000). But in neither case is phonology expected to explain

anything about phonetics—except perhaps at the phonetics-phonology interface where

outputs from the phonological system are transduced into the inputs of the phonetic system

(Keating, 1996; Cohn, 2007). Even then, the interface is not intended to account for any

phonetic phenomenon that is not clearly the result of a linguistic intent, nor is it capable of

doing so without becoming a domain-general model. Regardless of whether there is an overt

commitment to an innate Universal Grammar, the resulting phonological systems are

domain-specific by design.

A domain-specific model can of course be of great practical benefit in the interest of

developing a scientific account of phonology. But the issue of the domain-specificity of

language is a hypothesis (not a fact) about the relationship between language and the rest of

human cognition. If we were to discover that phonological phenomena typically described

with a domain-specific approach are also present in another nonlinguistic behavior, then a

single model that encompasses both domains may be preferable to two domain-specific

models that provide separate accounts of their shared phenomena. For this dissertation, the

search for nonlinguistic phonological behavior takes place in the domain of beatboxing.

Beatboxing is particularly useful in the search for the nonlinguistic presentation of

phonology because beatboxing and speech have many qualitative articulatory properties in

common. For both beatboxing and speech, sound is produced when the vocal tract

articulators make constrictions that manipulate air pressure. As discussed in Chapter 3:

Sounds, many of these articulations have similar constriction locations and degrees to

articulations in speech. Most beatboxing sounds require coordination among multiple

articulators. Like speech sounds, beatboxing sounds have a domain-specific classification

system, in this case based on their musical function (e.g., “snare”, “bass”, “kick”) and their

articulation (see Chapter 3: Sounds). The sounds of beatboxing can be combined and

recombined into an unlimited number of different beat patterns—hierarchically structured

phrases of beatboxing sounds produced sequentially—but with certain phonotactic

restrictions as discussed earlier (e.g., “beat 3 must have a snare sound”). And, some common

beatboxing sounds resemble speech sounds enough that they can replace speech sounds in

an utterance (Chapter 7: Beatrhyming). Given the articulatory and organizational similarities

between beatboxing and speech, beatboxing is an ideal nonlinguistic behavior against which

to compare speech in the search for phenomena that are unique to phonology (if any).

Assuming for the moment that beatboxing does exhibit phonology-like patterns (a

claim which this dissertation attempts to support), the different approaches to

domain-specificity and domain-generality in phonology described above offer two

explanations for how beatboxing ended up looking phonological. One way starts with

domain-generality as a baseline assumption: phonology and beatboxing are grounded in the

same cognitive capacities, so whatever their shared capacities provide as a publicly available

resource (e.g., phonological harmony) will automatically be available to both phonology and

beatboxing—though not every language or beatboxer will necessarily use it.

On the other hand, phonology could be a domain-specific system from which

beatboxing copies cognitive properties. In this view, beatboxing is parasitic on phonology.

Evidence from this dissertation shows that the strongest sense of parasitism, where

beatboxing copies the actual phonological representations and grammar from phonology,

cannot be true: though there are similarities in the composition of sounds and phonological

behavior, the beatboxing sound system uses cognitive representations that are not used as

phonological units (neither in the beatboxer’s language nor in any universal feature system).

The beatboxing system must be more innovative than strict parasitism allows for.

The weaker hypothesis of parasitism is that beatboxing might take certain qualities of

phonological units and grammar—like the combinatorial nature of the representations and

the framework of a computational grammar (e.g., Optimality Theory)—and re-use them to

create beatboxing representations and beatboxing grammar. Beatboxing would not be

constrained to be essentially identical to speech as in the strong parasitic hypothesis, but its

beatboxing-phonological phenomena would be constrained by the limitations of the

representations and grammar whose form it borrowed. Those aspects which beatboxing

borrowed would then technically be domain-general, at least for those two domains, even if

they did not start that way. The weaker parasitic hypothesis is more plausible than the strong

one. Neophyte beatboxers commonly learn beatboxing sound patterns from adaptations of

speech phrases (e.g., “boots and cats” → {B t ^K t}; see Chapter 3: Sounds for a description of

the symbols). Using the physical vocal apparatus to perform similar maneuvers (Chapter 3:

Sounds, Chapter 4: Theory) could in some sense “unlock” access to phonological potential.

(Hauser, Chomsky, & Fitch [2002] suggest that recursion may have similarly been adopted

into speech from domain-specific use in another cognitive domain like navigation.)

This dissertation makes no attempt to provide evidence that distinguishes between

the domain-general hypothesis and the weaker parasitic hypothesis. The difference doesn’t

matter because both approaches arrive at the same (almost paradoxical) conclusion: that

beatboxing and speech share many properties and yet are qualitatively completely different

behaviors governed by non-overlapping intentions and tasks. Instead, this dissertation

focuses on developing a single-model approach that encompasses both domains and predicts

their shared behavior (as opposed to creating two purely domain-specific models). The

starting point for this model is Articulatory Phonology.

Articulatory Phonology (Browman & Goldstein, 1986, 1989) is the hypothesis that the fundamental cognitive

units of phonology are not symbolic features, but actions called “gestures”. Gestures have

been argued to be advantageous for phonological theory because they unite the discrete,

context-invariant properties usually attributed to phonological units with the dynamic,

continuous, context-dependent properties observed in speech. These two sides of gestures

are encoded together in the language of dynamical systems: the system parameters are

invariant during the execution of a speech action, but the state of the system changes

continuously (Fowler, 1980). Chapter 4: Theory argues that dynamical systems also

simultaneously contain domain-specific and domain-general properties. This is because, as

actions, gestures are not unique to speech but they are specialized for speech: by design, the

dynamical equations in the task dynamic framework of motor control can characterize any

goal-oriented action from any domain (Saltzman & Munhall, 1989). This means that gestures

are on the one hand domain-general because the dynamical system that defines them can

serve as the basis for any goal-oriented action, but on the other hand domain-specific

because a given gesture is specialized for a speech-specific (and language-specific) goal

(Browman & Goldstein, 1991:314-315):

“Second, we should note that the use of dynamical equations is not restricted
to the description of motor behavior in speech but has been used to describe
the coordination and control of skilled motor actions in general (Cooke, 1980;
Kelso, Holt, Rubin, & Kugler, 1981; Kelso & Tuller, 1984a, 1984b; Kugler, Kelso,
& Turvey, 1980). Indeed, in its preliminary version the task dynamic model we
are using for speech was exactly the model used for controlling arm
movements, with the articulators of the vocal tract simply substituted for those
of the arm. Thus, in this respect the model is not consistent with Liberman
and Mattingly’s (1985) concept of language or speech as a separate module,
with principles unrelated to other domains. However, in another respect, the
central role of the task in task dynamics captures the same insight as the
“domain-specificity” aspect of the Modularity hypothesis—the way in which
vocal tract articulators is yoked is crucially affected by the task to be achieved
(Abbs, Gracco, & Cole, 1984; Kelso, Tuller, Vatikiotis-Bateson, & Fowler,
1984).”
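The dynamical characterization at issue can be made concrete with a small numerical sketch. In the task dynamic framework (Saltzman & Munhall, 1989), a gesture is standardly modeled as a critically damped point attractor. The simulation below is my own illustration of that general idea, not code or parameter values from the dissertation: the parameters (target, stiffness, damping) stay fixed over the gesture's activation while the state evolves continuously toward the target.

```python
import math

def simulate_gesture(x_init=0.0, target=1.0, k=100.0, dt=0.001, steps=2000):
    """Semi-implicit Euler integration of a critically damped point attractor:
    x'' = -k * (x - target) - b * x', with b = 2 * sqrt(k) (mass normalized to 1).
    The parameters (target, k, b) are invariant; the state (x, v) is not."""
    b = 2.0 * math.sqrt(k)  # critical damping: approach the target without oscillation
    x, v = x_init, 0.0
    trajectory = []
    for _ in range(steps):
        a = -k * (x - target) - b * v
        v += a * dt
        x += v * dt
        trajectory.append(x)
    return trajectory

traj = simulate_gesture()
assert abs(traj[-1] - 1.0) < 1e-3  # the state settles at the invariant target
```

The same equation, with the tract variable x standing for lip aperture, tongue-tip constriction degree, or an arm position, is what makes the unit domain-general in form while each particular gesture is specialized by its task parameters.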

For an approach to phonological theory that can also describe non-linguistic behaviors,

dynamical action units should be preferred over features (or other purely domain-specific

phonological units) because they have domain-general roots but can be specialized for any

domain. When specialized for speech, these action units are gestures; when specialized for

another domain, they are the gesture-like building blocks of that domain instead.

Beyond their descriptive power, however, gestures can also make predictions about

the organization of sounds in other domains whereas features cannot. Assuming that

beatboxing has gesture-like fundamental units of cognition, any behavior of gestures

determined by their domain-general side is predicted to be relevant to beatboxing as well

(Chapter 4: Theory). Chapter 6: Harmony demonstrates this in the phenomenon of

beatboxing harmony: beatboxing harmony has signature traits of speech harmony including

trigger, undergoer, and blocker sounds, the behavior of all of which is predicted by gestural

approaches to harmony. The gestural model also predicts the possibility of multi-tasking by

using speech and beatboxing gestures simultaneously. Chapter 7: Beatrhyming shows not

only that beatboxing and speech can be produced simultaneously, but also that their

fundamental cognitive units are related to each other through their tasks of

making constrictions in the vocal apparatus they share.
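The trigger/undergoer/blocker behavior invoked above can be illustrated schematically. The toy function below is my own construction, not the dissertation's analysis of beatboxing harmony; it spreads a harmonizing property rightward from a trigger through undergoers until a blocker halts it:

```python
def spread_rightward(sequence, triggers, blockers):
    """Mark which positions carry the harmonizing property: a trigger turns
    spreading on, a blocker turns it off, and undergoers in between harmonize."""
    active = False
    marked = []
    for sound in sequence:
        if sound in triggers:
            active = True
        elif sound in blockers:
            active = False
        marked.append(active)
    return marked

# T = trigger, U = undergoer, B = blocker
assert spread_rightward(list("UTUUBU"), {"T"}, {"B"}) == [False, True, True, True, False, False]
```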

In contrast, domain-specific phonological models make no predictions about whether

beatboxing harmony could exist or what traits it might have because the features and

grammar are designed only to target linguistic information. Generative linguistic grammars

also cannot generate beatrhyming because they cannot deal with non-linguistic sounds. Of

course, there are ways around these limitations—new models can be constructed that use

beatboxing features and beatboxing grammars to generate beatboxing harmony, and

speech-beatboxing cognitive interfaces can be postulated that do computations over the joint

domain of speech and beatboxing sounds. But ultimately all these strategies require making

multiple separate models to account for phenomena that speech and beatboxing share;

compared to a gestural approach that accounts for both speech and beatboxing without any

additional theoretical overhead, the domain-specific starting point is inferior.

CHAPTER 2: METHOD

1. Participants and data acquisition

Two novice beatboxers, one intermediate beatboxer, and two expert beatboxers were asked

to produce beatboxing sounds in isolation and in musical rhythms (“beat patterns”), and to

speak several passages while lying supine in the bore of a 1.5 T MRI magnet. Skill level

designations were given by the intermediate beatboxer who had also contacted the

beatboxers, was present for the collection of their data, and provided a beatboxer’s insight at

several points in the earlier stages of analysis. Of those five beatboxers, the productions of

just one expert are reported in the present study. The two novices and the intermediate

beatboxer are not discussed because the aim of this dissertation is to characterize expert

beatboxing, not beatboxing acquisition. (See Patil et al., 2017 for a brief study of the basic

sounds of all five beatboxers.) Data from the second expert beatboxer are not reported

because the beatboxer exhibited large head movements during image acquisition, making

kinematic analysis using the methods described below impossible. The beatboxer studied

here reported being a monolingual speaker of English.

Each beatboxer was asked in advance to provide a list of sounds they know written

with orthographic notation they would recognize. During the scanning session, each sound

label they had written was presented back to them as a visual stimulus. For each sound,

beatboxers were asked to produce the sound three times slowly and three times quickly, and

then to produce the sound in a beat pattern (sometimes referred to hereafter as a “showcase”

beat pattern). The beatboxers were also invited to perform beat patterns of their choosing

that were not meant to showcase any particular sound. For the analyzed expert beatboxer,

there were over 50 different showcase or freestyle beat patterns. The beatboxers were paid

for participation in the experiment.

Data were collected using an rtMRI protocol developed for the dynamic study of

vocal tract movements, especially during speech production (Narayanan et al., 2004; Lingala

et al., 2017). The subjects’ upper airways were imaged in the midsagittal plane using a

gradient echo pulse sequence (TR = 6.004 ms) on a conventional GE Signa 1.5 T scanner

(Gmax = 40 mT/m; Smax = 150 mT/m/ms), using an 8-channel upper-airway custom coil.

The slice thickness for the scan was 6 mm, located midsagittally over a 200 mm × 200 mm

field-of-view; image size in the sagittal plane was 84 × 84 pixels, resulting in a spatial

resolution of 2.4 × 2.4 mm. The scan plane was manually aligned with the midsagittal plane

of the subject’s head. The frames were retrospectively reconstructed to a temporal resolution

of 12 ms (2 spirals per frame, 83 frames per second) using a temporal finite difference

constrained reconstruction algorithm (Lingala et al., 2017) and an open-source library

(BART). Audio was recorded at a sampling frequency of 20 kHz inside the MRI scanner

while the subjects were imaged, using a custom fiber-optic microphone system. The audio

recordings were noise-canceled, then reintegrated with the reconstructed MR-imaged video

(Bresch et al., 2008). The result allows for dynamic visualization and synchronous audio of

the performers’ vocal tracts.

2. Annotation methods

Beat patterns from the real-time MR videos were annotated using a concise plaintext

percussion notation called “drum tabs” and point tier TextGrids in Praat (Boersma &

Weenink, 1992-2022). Beat patterns are performed with a rhythmic structure related to a

musical meter, so each annotation included labels for the beat pattern sounds and the

metrical position of that sound. This section explains how each annotation style was created,

but first begins with an introduction to musical meter.

2.1 Musical meter

Just as a sequence of syllables in languages with alternating stress can be grouped

hierarchically into prosodic feet, words, and phrases, so too is musical meter composed of

strong-weak alternations hierarchically grouped into measures and phrases. But music and

beatboxing are performed isochronously, meaning that there is roughly consistent temporal

spacing between events at the same level of the hierarchy.
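Since isochrony is a quantitative claim about temporal spacing, it can be checked directly from annotated event times. The sketch below is purely illustrative (it is not part of the analysis pipeline reported here); it computes inter-onset intervals and their coefficient of variation, which approaches zero as a performance approaches perfect isochrony.

```python
# Illustrative sketch: quantify isochrony from event onset times (seconds).
# A coefficient of variation (CV) near 0 means nearly equal spacing.
def isochrony(onset_times):
    """Return (mean inter-onset interval, coefficient of variation)."""
    iois = [b - a for a, b in zip(onset_times, onset_times[1:])]
    mean = sum(iois) / len(iois)
    sd = (sum((x - mean) ** 2 for x in iois) / len(iois)) ** 0.5
    return mean, sd / mean

# Beats spaced roughly 0.5 s apart (about 120 beats per minute):
mean_ioi, cv = isochrony([0.00, 0.50, 1.01, 1.49])
```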

The rhythmic structure of the beatboxing under consideration here can be

represented as a binary tree structure resulting in strength alternations (Lerdahl &

Jackendoff, 1983; Palmer & Kelly, 1992; Figure 1). Each branch has two end nodes: a Strong

node (S) on the left, and a Weak node (W) on the right. And, each node can be the parent of

another Strong-Weak pair.

Figure 1. A simple hierarchical tree structure with alternating strong-weak nodes.
    S
   / \
  S   W
 / \ / \
 S W S W

Strong and Weak events at a certain level are sometimes called “beats” and are often marked

with the numbers 1, 2, 3, and 4; the process of finding these beats, say in order to move to

them in dance, is sometimes called beat-induction (Large, 2000). Musical phrases often last

for more than four beats, but it is customary to reset the count back to 1 instead of

continuing on to 5 (Figure 2). When counting music at this level, a musician is likely to say

“one, two, three, four, one, two, three, four, one…”. Each beat 1 is the beginning of a musical

chunk called a “measure.” Since counting the beat resets to 1 after every 4, musicians reading

musical notation might refer to a specific beat in the meter by both measure number and

beat number, as in “measure 2, beat 3.”

Figure 2. Hierarchical strong-weak alternations in which one level (“beats”) is numbered.


       S
      / \
     /   \
    /     \
   S       W
  / \     / \
 S   W   S   W
/ \ / \ / \ / \
S W S W S W S W
1 2 3 4 1 2 3 4

Each beat can be further divided into sub-beats in which the Strong node retains the

numerical label of its parent and the Weak node is called “and” (here abbreviated to “+”)

(Figure 3). When speaking the meter aloud at this level, a musician would say “one and two

and three and four and one and two and three and four and one and…”.

Figure 3. Hierarchical strong-weak alternations. The beat level is numbered as in Figure 2.
The child nodes of that level inherit the same numbering on the strong nodes and a + on the
weak nodes.
               S
        ______/ \______
       /               \
       S               W
      / \             / \
     /   \           /   \
    /     \         /     \
   S       W       S       W
  / \     / \     / \     / \
 S   W   S   W   S   W   S   W
 1   2   3   4   1   2   3   4
/ \ / \ / \ / \ / \ / \ / \ / \
S W S W S W S W S W S W S W S W
1 + 2 + 3 + 4 + 1 + 2 + 3 + 4 +

These sub-beats can be divided even more. In these sub-sub-beats, the Strong nodes once

again retain the label of the parent node, while the Weak nodes are given different names

(Figure 4). The Weak sub-sub-beat between the beat node (a number) and the "and" node is

called “y” (pronounced [i]), and the Weak sub-sub-beat between the “and” and the next beat

node is called "a" (pronounced [ə]). When a musician speaks the meter at this level of

granularity, they say “one y and a two y and a three y and a four y and a one y and a two y

and a three y and a four y and a…”.
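The counting convention can be summarized programmatically. The following sketch is illustrative only (the function name and interface are mine, not part of any annotation tool used here); it generates the spoken count for one four-beat measure at a chosen depth of subdivision.

```python
# Generate the spoken count for one 4-beat measure.
# subdivisions: 1 = beats only, 2 = eighth notes ("+"),
# 4 = sixteenth notes ("y", "+", "a").
def count_measure(subdivisions):
    suffixes = {1: [""], 2: ["", "+"], 4: ["", "y", "+", "a"]}[subdivisions]
    count = []
    for beat in (1, 2, 3, 4):
        for suffix in suffixes:
            count.append(str(beat) if suffix == "" else suffix)
    return count

count_measure(2)   # ['1', '+', '2', '+', '3', '+', '4', '+']
```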

Figure 4. Two levels below the beat level have further subdivisions.
                              S
               ______________/ \______________
              /                               \
              S                               W
       ______/ \______                 ______/ \______
      /               \               /               \
      S               W               S               W
     / \             / \             / \             / \
    /   \           /   \           /   \           /   \
   /     \         /     \         /     \         /     \
   S       W       S       W       S       W       S       W
  / \     / \     / \     / \     / \     / \     / \     / \
  1       2       3       4       1       2       3       4
 / \     / \     / \     / \     / \     / \     / \     / \
S   W   S   W   S   W   S   W   S   W   S   W   S   W   S   W
1   +   2   +   3   +   4   +   1   +   2   +   3   +   4   +
|\  |\  |\  |\  |\  |\  |\  |\  |\  |\  |\  |\  |\  |\  |\  |\
S W S W S W S W S W S W S W S W S W S W S W S W S W S W S W S W
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a

Metrical Phonology uses a more compact representation for hierarchical metrical structure, a

notation with stacks of Xs called a metrical grid (Liberman & Prince, 1977; Hayes, 1984;

Figure 5). In each column, the number of Xs represents the strength of a metrical position

relative to the other metrical positions in the same phrase. In the example below, the lowest

row of Xs corresponds to the syllables, the Xs above those to the head of each trisyllabic

and the top Xs to binary groups of feet.

Figure 5. A metrical grid of the rhythmic structure of the first two lines of an English
limerick.
x x x x
x x x (x) x x x
x x x x x x x x x (x) (x) (x) x x x x x x x x x (x)
There once was a man from Nantucket who kept all his cash in a bucket

The example in Figure 6 below is the metrical grid notation of the metrical tree example in

Figure 4.

Figure 6. A metrical grid representation of the metrical structure of Figure 4.
x
x                               x
x               x               x               x
x       x       x       x       x       x       x       x
x   x   x   x   x   x   x   x   x   x   x   x   x   x   x   x
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a

Just as speech can have trisyllabic feet, in some cases a sub-division of a beat has three

terminal nodes instead of two (or instead of sub-dividing further to four nodes). These

sequences of three are called “triplets” and are counted by musicians as “one and a two and a

three and a four and a one and a…”. For the purposes of this research, it is not important

whether a subdivision with three terminal nodes is a ternary-branching tree or a structure

with two levels; but for simplicity in the metrical grid the two weaker sub-beats in a triplet

are marked as equally weak (Figure 7).

Figure 7. A metrical grid representation in which each beat has three subdivisions.
x
x           x
x     x     x     x
x x x x x x x x x x x x
1 + a 2 + a 3 + a 4 + a

If triplets occur in a beat pattern in this research, they are often mixed in among binary

divisions. In the example in Figure 8, beats 1 and 3 have full binary branching while beats 2

and 4 branch into triplets.

Figure 8. A metrical grid in which beats 1 and 3 have four sub-divisions while beats 2 and 4
have three sub-divisions.
x
x             x
x       x     x       x
x   x   x     x   x   x
x x x x x x x x x x x x x x
1 y + a 2 + a 3 y + a 4 + a

The preceding description of musical structure has been looking at metrical positions—slots

of abstract time. But not all metrical positions are necessarily used in a beatboxing

performance. For example, in the beat pattern in Figure 9 below each beat (1, 2, 3, or 4) holds

a musical event, but the available metrical positions after each beat (“y + a”) are silent—with

the exception of the “a” of the first 4 on which musical event {B} is produced just before

another {B} on the next beat 1. ({B}, {t}, and {PF} are the beatboxing sounds Kick Drum,

Closed Hi-Hat, and PF Snare, respectively; beatboxing shorthand is denoted by curly

brackets as described in Chapter 3: Sounds.)

Figure 9. A metrical grid of the beatboxing sequence {B t PF t B B B PF t}. All sounds except
the second {B} are produced on a major beat; the second {B} is produced on the fourth
sub-division of beat 4 of the first measure.
x
x                               x
x               x               x               x
x       x       x       x       x       x       x       x
x   x   x   x   x   x   x   x   x   x   x   x   x   x   x   x
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a
B       t       PF      t     B B       B       PF      t

2.2 Drum tab notation

Metrical grids are useful for representing the relative strength of each metrical position

compared to the others in its phrase. But since the relative strengths of positions in the

metrical structure of beatboxing are highly regular (1 > 3 > {2, 4} > "+" > {"y", "a"}), a more

consolidated type of metrical notation can be used. For beatboxing and some other

percussive music that does not require pitch to be encoded, a drum tab may be used (e.g.,

Figure 10).

Figure 10. A drum tab representation of the beat pattern in Figure 9, including a label
definition for each sound.
B |x--------------x|x---x-----------
t |----x-------x---|------------x---
PF|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

B=Kick Drum (labial ejective)
t=Closed Hi-Hat (coronal ejective)
PF=PF Snare (labial ejective affricate)

Drum tablature (or drum tabs) is an unstandardized form of drum beat/pattern notation

(Drum tablature, 2022; DrumTabs, n.d.). Each drum tab (the whole figure) represents a

musical utterance. Except for the last row, which marks out the meter, each drum tab row

indicates the timing of a particular musical event in the meter. Drum tab notation has two

major advantages over metrical grid notation. First, the metrical pattern of each sound is

easier to see because it sits alone on its tier. Second, multiple events can be marked as

occurring on the same metrical position—a common occurrence in many musical

performances including beatboxing. (The metrical grid notation, on the other hand, only

permits a single musical event per metrical position).

The first symbol of each row (except the last row) is the abbreviation for a beatboxing

sound in Standard Beatbox Notation or a different notation if no Standard Beatbox Notation

exists for that sound (Stowell, 2003; Tyte & SPLINTER, 2014). The names of the sounds

corresponding to each symbol are listed beneath the drum tab in a key. The symbol x on a

drum tab row marks the occurrence of a sound, and the symbol - (hyphen) indicates that the

sound represented in that row is not performed at that metrical position. When a sound is

sustained, the initiation of the sound is marked with an x and the duration of its sustainment

is marked with ~ (tilde). For example, the Liproll {LR} in the drum tab in Figure 11

(simplified from a longer and more complicated sequence for illustrative purposes) is

sustained for a full beat or slightly longer each time it is produced. (The sounds {b} and {pf}

are alternants of {B} and {PF} as discussed in Chapter 5: Alternations.)

Figure 11. A simplification of a drum tab from Chapter 5: Alternations. Sounds sustained
across multiple beat sub-divisions are marked by tildes “~”.
b |x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
pf|--------x-------|--------x-------|--------x-------|------x---------
LR|x~~~~~------x~~~|~~----------x~~~|x~~~~~------x~~~|~~--------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

The bottom row of the drum tab shows the metrical positions available in the musical

utterance. The first beat of the meter is marked with the number 1. The rest of the beats of

the tactus are marked 2, 3, and 4, with the “+” of each beat evenly spaced between them. As

described in the previous section, each beat can be divided as much as required. Generally in

this research, the labels for the “y” and “a” of each beat are omitted in an attempt to improve

overall legibility of the meter, but their positions exist in the space between the numbered

beats and their “+”s. Pipes (|) visually separate each group of four beats from the next

(separate “measures”) but do not have any role in the meter.
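To make the encoding concrete, the sketch below parses a single drum tab row into labeled metrical events. It is a hypothetical helper, not the annotation tooling actually used in this dissertation, and it assumes sixteen columns per measure with measures separated by pipes.

```python
# Sixteenth-note position names within one measure, with the normally
# omitted "y" and "a" sub-divisions made explicit.
POSITIONS = [f"{beat}{sub}" for beat in "1234" for sub in ("", "y", "+", "a")]

def parse_tab_row(row):
    """Parse one drum tab row, e.g. 'B |x---x-----------'.

    Returns (label, measure, position) tuples for each onset 'x',
    assuming 16 columns per measure separated by '|'.
    """
    label, _, cells = row.partition("|")
    label = label.strip()
    events, measure, col = [], 1, 0
    for ch in cells:
        if ch == "|":          # bar line: start a new measure
            measure, col = measure + 1, 0
            continue
        if ch == "x":
            events.append((label, measure, POSITIONS[col]))
        col += 1
    return events

parse_tab_row("B |x--------------x|x---x-----------")
# [('B', 1, '1'), ('B', 1, '4a'), ('B', 2, '1'), ('B', 2, '2')]
```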

A drum tab transcription was created for each beat pattern in the data set by repeated

audio-visual inspection of each beat pattern’s real-time MRI video. Portions of beat patterns

with rapid or unclear articulations were examined frame by frame using the Vocal Tract ROI

Toolbox (Blaylock, 2021). Articulations in the beat pattern were matched to the articulations

of sounds the beatboxer had named and performed in isolation (see Chapter 3: Sounds) in

order to establish which sound labels to use in the drum tab. In many cases, it was easiest to

start by identifying the sounds at the beginning of a phrase (which were often Kick Drums)

and the snare sounds (which fall on the back beat, notated in this dissertation as beat 3),

then look at the sounds in between. Sounds in the beat pattern that did not clearly match a

sound the beatboxer had performed in isolation were identified by cross-reference to

beatboxing tutorial videos and insight from other beatboxers; in cases where the sound could

not be identified, a new symbol and descriptive name was created for it. Initial drum tab

transcriptions were revised based on feedback from spectrogram and waveform

visualizations of the audio while making text grids in Praat and from time series created from

regions of interest in the rtMR videos (see below).

2.3 Praat TextGrid

After creating transcriptions of the beat patterns in drum tabs, MIR Toolbox (v1.7.2)

(specifically the mirevents(..., ‘Attack’) function) was used to automatically find acoustic

events in the audio channel of each video in the data set (Lartillot et al., 2008; n.d.). These

events were converted into points on a Praat PointTier using mPraat (Bořil & Skarnitzl,

2016). MIR Toolbox sometimes identified events that did not correspond to beatboxing

sounds, mostly because the MRI audio (or its reconstruction) led to many sounds having an

“echo” in the signal. For example, Figure 12 shows that the acoustic release of a Kick Drum

was followed by several similar but lower amplitude pulses which were not related to any

articulatory movements and which create the illusion that there are several quieter Kick

Drums. Events determined not to be associated with the articulation of a beatboxing sound

(including these duplicate/extra events) were manually removed, keeping only the event with

the highest amplitude (which was also usually the first).

Less commonly, MIR Toolbox sometimes failed to identify low-amplitude events.

When a sound was made but no event was found by MIR Toolbox, a point was manually

placed on the Praat PointTier by selecting a portion of the spectrogram that corresponded to

the sound in question (confirmed by audio inspection). The intensity of that selection was

then extracted, the time point of maximum intensity queried (in Praat, using Praat defaults),

and a point placed on the PointTier at that time. (In a small sample of comparisons between

the result of this method and points that MIR toolbox had already found, this Praat method

placed points 1-3 ms after the points placed by MIR Toolbox.) If this method failed (either

because the selection window was too small or because Praat's intensity signal lacked a

maximum), a point was manually placed by visual inspection of the waveform at the highest

amplitude point of a stop/affricate release or the middle of a brief moment of phonation.
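This fallback placement can be sketched as follows. The code is illustrative only: the 5 ms analysis window is an assumed value, and the plain RMS measure is a stand-in for Praat's actual intensity algorithm.

```python
# Place an event point at the time of maximum short-time intensity
# within a selected window [t0, t1] of the waveform.
def peak_intensity_time(samples, fs, t0, t1, win=0.005):
    """samples: sequence of floats; fs: sampling rate in Hz."""
    half = max(1, int(win * fs / 2))
    best_t, best_rms = t0, -1.0
    for n in range(int(t0 * fs), int(t1 * fs)):
        seg = samples[max(0, n - half):n + half]
        rms = (sum(s * s for s in seg) / len(seg)) ** 0.5
        if rms > best_rms:
            best_t, best_rms = n / fs, rms
    return best_t
```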

A label was added to each event in the PointTier corresponding to the appropriate

sound in the drum tab transcription, resulting in a one-to-one correspondence between

drum tab and PointTier labels. A second point tier with meter labels for each musical event

in a beat pattern was created automatically using mPraat: for each beatboxing event, the time

of the event in the label point tier was duplicated onto a meter PointTier and assigned the

corresponding beat value from the drum tab transcription.

In some cases, one beat was judged to correspond to multiple events. For example,

a Kick Drum and the beginning of a Liproll might both occur on beat 1. In all such cases it

was possible to annotate distinct acoustic events for each sound on that beat. On the meter

tier, the beat (1 here) would be used for both events—in this example, both the Kick Drum

label point and Liproll label point.

Figure 12. Waveform, spectrogram, and text grid of three Kick Drums produced at relatively
long temporal intervals. The text grid label of each sound is associated with the true acoustic
release of the sound; the subsequent smaller bursts are artefacts from audio reconstruction.

3. Kinematic visualizations

3.1 Time series from regions of interest

Time series were created from rtMR video pixel intensities using a region of interest method

(Lammert et al., 2010; Blaylock, 2021). Regions of interest reduce the complexity of image

processing by isolating relatively small sets of pixels for analysis. The regions distill the

intensities (brightnesses) of all their pixels into a single value (or in the case of a centroid

method, two values). In a video, the region of interest is static but its pixel intensities change

frame by frame; assembling the frame-by-frame intensity aggregates into a list creates a time

series. Regions are generally devised so that pixel intensity changes reflect changes in the

state of a single constriction type relevant to the articulation of a sound. For example, a Kick

Drum {B} is a labial ejective stop (see Chapter 3: Sounds) and so requires a region for lip

aperture and another for larynx height. As the tissue of the relevant articulator(s) moves into

the space encoded by the pixels in a region, the region’s overall pixel intensity increases.
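The core computation is compact enough to sketch. The code below is an illustrative numpy rendering of the average-intensity method, not the VocalTract ROI Toolbox implementation.

```python
import numpy as np

def roi_time_series(video, mask):
    """Average pixel intensity of the masked region in each frame.

    video: (n_frames, height, width) array of pixel intensities.
    mask:  (height, width) boolean array marking the region's pixels.
    """
    return video[:, mask].mean(axis=1)

# Toy example: a 2x2 region of a 4x4 video gradually "fills" with tissue.
video = np.zeros((3, 4, 4))
video[1, :2, :2] = 0.5   # articulator partway into the region
video[2, :2, :2] = 1.0   # constriction fully formed
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
roi_time_series(video, mask)   # array([0. , 0.5, 1. ])
```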

The region of interest analysis technique is versatile, with different region shape types

and time series calculation methods that can be highly effective when used appropriately.

The VocalTract ROI Toolbox (Blaylock, 2021) offers three region shapes: rectangular regions,

pseudocircular regions (Lammert et al., 2013), and regions formed by automatically finding

groups of pixels which covary in intensity (Lammert et al., 2010). (Pseudocircular regions are

“pseudo”-circular because actual circles cannot be constructed from arrangements of square

pixels.) In this dissertation, rectangular regions were used for articulator movements

designated as horizontal or vertical with respect to their absolute orientation in the video,

pseudocircular regions were used for oblique articulator movements, and statistically

correlated regions were used for especially large tongue body movements (i.e., in the Liproll).

Time series calculation methods include averaging the intensities of all the pixels in the

region, transforming the intensities into a binary mode, and tracking the centroid of tissue

within the region (Oh & Lee, 2018). This dissertation uses only average pixel intensity time

series.

When regions of interest tracking average pixel intensity are used for capturing the

kinematics of movement along a given vocal dimension (see below), each region needs to be

placed so that it covers the widest aperture of its intended tract variable. At the lowest

average pixel intensity in the region, the relevant articulator(s) should be just outside the

region; pixel intensity will then increase as the relevant articulator(s) move into the region,

up to maximum intensity at the narrowest/tightest constriction. For laryngeal movements

used in glottalic egressive sounds, the region should have maximum intensity when the

arytenoids are at their maximum height; in their lowest position, the arytenoids should be

just below the lower edge of the region. Defining the regions in this way ensures that the

time series will capture—as accurately as possible—the temporal landmarks corresponding to

the start of an articulator’s movement into a constriction, the maximum velocity of the

articulator as it moves into its constriction, and the moment of maximum constriction.

Regions were placed manually as follows:

LAB. (Figure 13.) A rectangular region to measure lip aperture. Vertically, the region

was arranged so that the upper and lower lip were just outside the region at their widest

aperture. Horizontally, the region was wide enough to include the full width of the lips

during bilabial closures as well as the protrusion of the lips during labiodental closures.

LAB2. (Figure 14.) A rectangular region for measuring labial constrictions in which

the lips are pulled inward between the upper and lower teeth. The region is placed adjacent

to, posterior of, and non-overlapping with LAB. The width and height of the region

encompassed the pixels of the retracted portions of the upper and lower lip.

COR. (Figure 15.) A rectangular region for measuring alveolar, dental, and

linguolabial tongue tip constrictions. The region is placed so that the anterior edge is

adjacent to the lips and the posterior edge is far enough back that the tongue tip is not

inside the region while the tongue is pulled back or down. The upper edge of the region is level with the

alveolar ridge.

DOR. (Figure 16.) A pseudocircular region for measuring tongue body constrictions

near the velum. The region is placed adjacent to the lowered velum such that the region is

filled when the tongue body connects with the lowered velum or for narrow tongue body

constrictions while the velum is raised.

FRONT. (Figure 17.) A region for the most anterior tongue body position of the

Liproll. The region was designed so that the anterior edge of the region traced the anterior

edge of the tongue body during its most anterior Liproll constriction, the upper edge of the

region traced the air-tissue boundary along the palate, and the lower/posterior edge traced

the anterior edge of the tongue body at its most posterior Liproll constriction. This shape

was most successfully generated from the aggregate of two adjacent regions of statistically

correlated pixels, one of which contained the front of the tongue body in only its most

anterior Liproll constriction and the other of which contained the front of the tongue body

only while the tongue was in the velar closure posture it adopted during that beat pattern

(see Chapter 6: Harmony).

VEL. (Figure 18.) A region for tracking velum height. This was a pseudocircular region

of radius 2 pixels placed over the pixels that contained the velum in its most raised position

and adjacent to the pixels containing the velum in its most lowered state.

LAR. (Figure 19.) A rectangular region placed on the pixels containing the arytenoid

cartilages in their most elevated position.

A default subset of regions was created from inspection of the first few beat patterns

JR performed, including beat patterns that highlighted the Kick Drum and Closed Hi-Hat.

These regions were modified for other videos as needed—usually to account for head

movement between videos.

Figure 13. LAB region, unfilled during a Vocalized Tongue Bass (left) and filled during the
Kick Drum that followed (right).

Figure 14. LAB2 region filled during a Liproll (left) and empty after the Liproll is complete
(right).

Figure 15. COR region, filled by an alveolar tongue tip closure for a Closed Hi-Hat {t} (left),
filled by a linguolabial closure {tbc} (center), and empty (right).

Figure 16. DOR region, filled by a tongue body closure during a Clickroll (left) and empty
when the tongue body is shifted forward for the release of an Inward K Snare (right).

Figure 17. FRONT region for Liproll outlined in red, completely filled at the beginning of the
Liproll (left) and empty at the end of the Liproll (right).

Figure 18. VEL region demonstrated by a Kick Drum, completely empty while the velum is
lowered for the preceding sound (left) and filled while the Kick Drum is produced (right).

Figure 19. LAR region demonstrated by a Kick Drum (an ejective sound), completely empty
before laryngeal raising (left) and filled at the peak of laryngeal raising (right).

3.2 Gestural scores

In Articulatory Phonology, gestural scores represent the temporal organization of

fundamental phonological elements called “gestures” (Browman & Goldstein, 1986, 1989).

Gestures are defined with respect to a dynamical system (Chapter 4: Theory). At the level

they can be observed, gestures typically involve the motion of a single constriction system

called a vocal tract variable (like the lips or tongue tip) toward a task-relevant goal—often a

spatial target in the vocal tract in terms of some constriction location (where a constriction is

being made in the vocal tract) and degree (how constricted the vocal tract is in that

location). A gesture has a finite life span; while a gesture is active, the dynamical system

parameters that determine a gesture’s behavior (like its intended spatial goal) remain

invariant, but its influence over a tract variable causes continuous articulatory changes

(Fowler, 1980).

Gestural scores are visual representations of the gestures active in a given utterance. A

gestural score often includes two things: a kinematic time series for each tract variable that

estimates the continuous change of that tract variable; and, inferences about when a gesture

is thought to be active and exerting control over a tract variable—its finite duration,

represented by a box or shading accompanying the time series. Gestural scores are here used

to visualize beatboxing movements, though in this case the “gestures” found are intended to

represent only the interval of time during which a constriction is formed and released within

a given region of interest—these constriction intervals do not necessarily correspond to

theoretical beatboxing gestures (though Chapter 4: Theory argues in favor of this

interpretation).

Gestures were found semi-automatically from time series generated by the region of

interest method (Blaylock, 2021). Each beatboxing sound was associated with one or more

regions of interest in a lookup table; for example, the Kick Drum is a glottalic egressive

bilabial stop and so was associated to the LAB (labial) and LAR (laryngeal) regions. Each

beatboxing sound in a beat pattern was marked by a point on a Praat point tier as described

earlier. For each sound, the point was used as the basis for automatic use of the DelimitGest

function (Tiede, 2010) on each of the time series associated with that sound. The algorithm

defines seven temporal landmarks for each gesture based on the velocity of the time series

(calculated via the central difference) within a specified search range—in this case, the entire

time series was the search range. The time of maximum constriction (MAXC) is the time of

the velocity minimum nearest that sound’s time point from the point tier. The times of peak

velocity into (PVEL) and out of (PVEL2) the constriction are the times of the nearest

velocity maxima greater than 10% of the maximum velocity of the search range before and

after MAXC, respectively. The time of the onset of movement (GONS) is the time at which

movement velocity is 20% of the range of velocities between the peak velocity into the

constriction and the nearest preceding local velocity minimum; the time of movement end

(GOFFS) was calculated the same way but for the range of velocity between the peak

velocity out of the constriction and the nearest following velocity minimum. Finally, the time

of constriction attainment (NONS) was calculated as the time at which the velocity was 20%

of the range between the peak velocity into a constriction and the minimum velocity

associated with the time of MAXC; the time at which a constriction began to be released (NOFFS) was

likewise calculated as the time of the same velocity threshold but between the velocity

associated with the time of MAXC and the peak velocity out of the constriction.
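The landmark-finding logic can be illustrated in miniature. The sketch below is not Tiede's DelimitGest implementation; it simplifies several details (a fixed search of five frames around the anchor, a global rather than local preceding velocity minimum) but follows the same velocity-threshold idea for the movement into a constriction.

```python
import numpy as np

def landmarks(ts, anchor):
    """Find (GONS, PVEL, MAXC) frame indices for movement into a constriction.

    ts: ROI pixel-intensity time series; anchor: frame index near the sound.
    Release-side landmarks would be found symmetrically after MAXC.
    """
    vel = np.gradient(np.asarray(ts, dtype=float))   # central difference
    # MAXC: smallest |velocity| near the anchor (articulator momentarily still)
    lo = max(0, anchor - 5)
    maxc = lo + int(np.argmin(np.abs(vel[lo:anchor + 6])))
    # PVEL: peak velocity into the constriction, before MAXC
    pvel = int(np.argmax(vel[:maxc]))
    # GONS: last frame before PVEL where velocity falls to 20% of the
    # rise from the preceding velocity minimum (simplified to a global min)
    vmin = vel[:pvel].min() if pvel > 0 else vel[0]
    threshold = vmin + 0.2 * (vel[pvel] - vmin)
    gons = pvel
    while gons > 0 and vel[gons - 1] > threshold:
        gons -= 1
    return gons, pvel, maxc
```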

In some cases the automatic gesture-finding algorithm defined gestural landmarks

that were grossly unaligned with the actual articulator movement. Often this was because the

MAXC time point taken from the point tier was placed on a local minimum in pixel intensity

rather than a local maximum. Those gestures were manually corrected via the MviewRT GUI

(Tiede, 2010) using the same DelimitGest function and parameters by selecting different

starting frames than the ones generated from the Praat point tier. Some of those manually

placed gestures had temporal landmarks that were still grossly unaligned with their expected

relative pixel intensity values, as when a gestural offset landmark was placed halfway into the

constriction of a later gesture; these landmarks were corrected in MviewRT by eye. Gestural

scores and their time series were plotted for dissertation figures in MATLAB using a branch

of the VocalTract ROI Toolbox. In these plots, manually-adjusted gestures are marked by a

black box around the gesture.

CHAPTER 3: SOUNDS

This chapter introduces some of the most frequent sounds of beatboxing and identifies

critical phonetic dimensions along which the inventory of beatboxing sounds appears to be

distributed. There are three major conclusions. First, the sounds of beatboxing have a

roughly Zipf’s Law (power law) token frequency distribution, a pattern that has been

identified for word frequency in texts and corpora but not for sound frequency; this is

interpreted as a reflection of the status of individual beatboxing sounds as meaningful

vocabulary items in the beatboxer’s sound inventory. Second, beatboxing sounds are

organized combinatorially insofar as they can largely be described as combinations of a

relatively small set of articulatory dimensions, though the organization of these dimensions

is not as periodic or economic as the organization of sounds in speech. Third, beatboxing

sounds are contrastive with one another because changing one of the articulatory

components of a sound generally leads to a change in the sound’s meaning. Speech and

beatboxing therefore appear to share not just the vocal apparatus but also distributional and

compositional properties—even though the beatboxing sounds have no meaningful relation

to a beatboxer’s phonological or lexical knowledge.

1. Introduction

At least at some level of representation, beatboxing sounds have intrinsic meaning. The

meaning of a sound often refers to the musical referent it imitates, whether that be part of a

drum kit like a Kick Drum {B} or a synthetic sound effect like a laser (e.g., Sonic Laser

{SonL}). One could therefore compile a list of beatboxing sounds and structure it so that

sounds with similar musical roles are listed near each other: kicks, hi-hats, snares, basses,

rolls, sound effects, and more. Catalogs of sounds like this have been assembled by

beatboxers. The boxeme of Paroni et al. (2021) seems to refer to sounds at this level of

granularity, which experienced beatboxers are likely able to distinguish and use appropriately

in the context of a beat pattern—a beatboxing utterance. Beatboxing sounds are cognitively

organized by their musical function, at the very least.

But perhaps there is more to the organization of beatboxing sounds than just their

musical function. Other cognitive domains have been described as using a sort of “mental

chemistry” (Schyns et al., 1998:2) in which a few domain-relevant dimensions are variously

combined to create a myriad of representations. Speech is one such system: the sounds of a

language are composed of discrete choices along a relatively small set of phonetic

dimensions like voicing, place, and duration; these dimensions are thought to encode

linguistic meaning through contrast, and are often considered to be the compositional,

cognitive building blocks of speech (i.e., features or gestures).

Abler (1989; see also Studdert-Kennedy & Goldstein, 2003) describes three properties

shared by self-diversifying systems like the systems of speech sounds, genes, and chemical

elements: multiple levels of organization, sustained variation via combinations instead of

blending, and periodicity—the repeated use of a relatively small set of dimensions in

different combinations (referred to by other scholars as feature economy (Ohala, 1980, 2008;

Clements, 2003; Dunbar & Dupoux, 2016)). Beatboxing does have at least two levels of

organization—the meanings of the sounds themselves in terms of musical roles (e.g., kick,

snare) and their organization into hierarchically structured beat patterns. Less clear is

whether beatboxing sounds are composed of combinatorial, periodically organized units.

If meaningful beatboxing sounds are also composed of smaller units, they should be

classifiable over repeated use of a small set of dimensions—some of which may happen to

overlap with the dimensions along which speech sounds are classified because they share the

same phonetic potential via the vocal tract. Alternatively, if beatboxing sounds are not

composed combinatorially, they might instead be organized dispersively throughout the

vocal tract so that each sound is maximally distinct from the others. This would be

reminiscent of Ohala’s (1980:184) “deliberately provocative” suggestion that, if consonants

are maximally dispersed within a language’s phonological inventory as vowels often seem to

be (Liljencrants & Lindblom, 1972; Lindblom, 1986), then consonant systems like [ɗ k’ ts ɬ m

r ǀ] should be typologically common (which they are not; see Lindblom & Maddieson, 1988).

If beatboxing sounds are organized to be distinctive but not combinatorial, then beatboxing

sounds should not be classifiable over repeating dimensions. Figure 20 schematically

demonstrates these types of organization.

Note that being able to classify beatboxing sounds along articulatory dimensions is

not enough to claim that those dimensions constitute cognitive beatboxing units. The

properties of compositionality and periodicity do not guarantee that the composite

dimensions play a role in the cognitive representation and functioning of the system. In

linguistics, evidence for the cognitive reality of organizing atomic features comes from the

different behavioral patterns speech sounds exhibit depending on which features they are

composed of. This chapter only goes so far as to describe and analyze the articulatory

dimensions along which beatboxing sounds appear to be dispersed; later chapters revisit the

question of the cognitive status of some of these dimensions.

Figure 20. Beatboxing sounds organized by maximal dispersion in a continuous phonetic


space (top) vs organization along a finite number of phonetic dimensions (bottom).

This chapter presents two novel analyses of beatboxing sound organization. The first

(Analysis 1) measures the token and beat pattern frequency of beatboxing sounds, providing

the first quantitative account of beatboxing sound frequency. The second (Analysis 2) builds

on the first by evaluating whether higher frequency beatboxing sounds can be analyzed as

composites of choices from a relatively small set of phonetic dimensions. In the process, the

chapter contributes to the still-expanding literature of the phonetic documentation of

beatboxing sound production.

2. Method

Describing the organization of beatboxing sounds and assessing whether they are composed

combinatorially requires first making a list of beatboxing sounds to analyze. New beatboxing

sounds continue to be invented, and there is no fully comprehensive documentation of

beatboxing in which to find a list of all the sounds that have been invented so far (though

resources like [Link] offer an attempt). The list of sounds for this analysis was

assembled through inspection of real-time MRI videos of a single expert beatboxer,

supplemented by discussion with other beatboxers and YouTube tutorials of beatboxers

explaining how to produce various sounds. Two particular methodological concerns about

this process merit discussion in advance: how to decide which of a beatboxer's articulations

are and are not beatboxing sounds, and how to determine which of those sounds to include

in an analysis of beatboxing sound organization.

The decision of what counts as a beatboxing sound is rooted in the observations and

opinions of the beatboxer and the analyst (who may be the same but in this case are not). In

the process of data collection for this study, each beatboxer was asked to make a list of

sounds they can produce and then showcase each one in a beat pattern. But more sounds

might be used in those beat patterns than were listed and showcased by the beatboxer, either

because the beatboxer forgot to list them or does not overtly recognize them as a distinct

sound. Likewise, a beatboxer might distinguish between two or more sounds that the analyst

detects no difference between—because the differences were either not imageable,

nonexistent, or not detected. And, some sounds may be different only in ornamentation,

with secondary articulations used to create different aesthetics without fundamentally


changing the nature of the sound. The analyst must choose to either rely only on a

beatboxer’s overt knowledge of their sound inventory or add and remove sounds in the list

based on the analysis of their usage. Thus a catalog of a beatboxer’s sounds is biased by the

beatboxer’s knowledge and the analyst’s assumptions, and therefore not likely to be a

complete or fully accurate representation of a beatboxer’s cognitive sound inventory.

The second methodological issue is deciding which of those beatboxing sounds to

include in the analysis, as not all sounds of a beatboxer's inventory have the same

status—just as not all the sounds of a language are equally contrastive (Hockett, 1955). Some

beatboxing sounds may just be entering or leaving the inventory, and some may be less

common than others. If the whole sound inventory is analyzed equally, less stable beatboxing

sounds may throw off the analysis by muddying the dimensions that compose more stable

sounds.

At the same time, if beatboxing sounds are organized combinatorially, beatboxing

sound inventories may fill open “holes” in the inventory over time: given the current state of

sounds in a beatboxer’s inventory, they may be more likely to next learn a sound that is

composed of phonetic dimensions already under cognitive control than to learn a sound

requiring the acquisition of one or more new phonetic dimensions. There is not sufficient

diachronic data in the corpus to measure this directly; however, if we assume that a

beatboxing sound’s corpus frequency is proportional to how early it was learned (higher

frequency indicating earlier acquisition) then cataloging beatboxing sounds from high

frequency to low frequency should yield a growing phonetic feature space. The highest

frequency sounds would be expected to differ along relatively few phonetic dimensions; as

sounds of lesser frequency are added to the inventory, we would expect to find that they tend

to fill gaps in the existing phonetic dimension space when possible before opening new

phonetic dimensions. But if the sounds are dispersed non-combinatorially, we may instead

expect to find that even the earliest or most frequent sounds make use of as many phonetic

dimensions as possible to maximize their distinctiveness, with the rest of the sounds fitting

into the spaces between.

The initial list of sounds was designed to be as encompassing as possible in this study.

The 39 sounds which the beatboxer overtly identified and 16 more which were determined

by the analyst to have qualitatively distinct articulation were combined into a list of 55

sounds. The frequency of each sound was calculated by counting how many times it

appeared in the data set overall (token frequency) and how many separate beat patterns it

appeared in (beat pattern frequency). The token frequency distribution analysis of the full

set of sounds is presented in section 3.1. To minimize the impact of infrequent sounds on the

featural/dimensional organization analysis (section 3.2), the list of sounds is presented in

four groups from high to low beat pattern frequency.

Details about the acquisition and annotation of beatboxing real-time MR videos can

be found in Chapter 2: Method. Counts of each beatboxing sound were collected from 46

beat patterns based on their drum tab annotations. Each ‘x’ in a drum tab corresponded to

one token of a sound. Each sound was counted according to both its token frequency (how

many times it shows up in the whole data set) and its beat pattern frequency (how many

different beat patterns it shows up in).
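The two counts can be illustrated with a toy example. The drum tabs and sound labels below are invented for this sketch; only the counting logic mirrors the procedure described above (one token per 'x', one beat-pattern count per pattern in which the sound appears):

```python
from collections import Counter

def count_frequencies(beat_patterns):
    """Return (token_frequency, beat_pattern_frequency) as Counters.
    Each beat pattern is a dict mapping a sound label to its drum-tab line,
    where every 'x' marks one token of that sound."""
    token_freq, pattern_freq = Counter(), Counter()
    for tabs in beat_patterns:
        for sound, tab in tabs.items():
            tokens = tab.count('x')
            token_freq[sound] += tokens
            if tokens > 0:
                pattern_freq[sound] += 1  # the sound appears in this pattern
    return token_freq, pattern_freq

# Two hypothetical beat patterns (labels {B}, {t}, {PF} as in the text;
# the tab lines themselves are made up for illustration).
patterns = [
    {'B': 'x---x---x---x---', 't': '--x---x---x---x-'},
    {'B': 'x-------x-------', 'PF': '----x-------x---'},
]
tok, pat = count_frequencies(patterns)
# Here {B} has token frequency 6 but beat pattern frequency 2.
```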

Certain sounds and articulations were labeled in drum tabs but excluded from the

analyses: audible exhalations and inhalations, lip licks (linguolabial tongue tip closures)

which presumably help to regulate moisture on the lips but not to make sound,

non-sounding touches of the tongue to the teeth or alveolar ridge, and lip

spreading/constricting, akin to lip rounding in speech and useful for raising and lowering the

frequency of higher amplitude spectral energy. None of these were identified by the

beatboxer as distinct sounds, nor were they clearly associated with the articulation of any

nearby sounds.

3. Results

Section 3.1 examines the overall frequency distribution of the beatboxing sounds in the data

set. Section 3.2 digs further into the production of the most frequent sounds in order to

evaluate whether they are organized combinatorially.

3.1 Frequency distribution

Figure 21 shows the token frequency of each beatboxing sound in decreasing order of

frequency. Lighter shaded bars show the token frequency for sounds that only occurred in

one beat pattern in the data set, and the darker bars are sounds that occurred in two or more

beat patterns. Beat pattern frequency does not factor into the power law fitting procedure,

but will be used in section 3.2. The most frequent sound appears much more often than any

of the others; the next few most frequent sounds rapidly decrease in frequency from there.

The bulk of the sounds have relatively low and gradually decreasing frequency.

There are many different types of frequency distributions, but one commonly

associated with language that results in a similar distribution is Zipf's Law—a discrete power

law (zeta distribution) with a particular relationship among the relative frequencies of the

items observed (Zipf, 1949). A distribution is Zipfian when the second most frequent item is

half as frequent as the most frequent item, the third most frequent item is one third as

frequent as the most frequent item, and so on. To put numbers to it, if there were 100

instances of the most common item in a corpus, the second most common item should occur

50 times, the third most common item 33 times, the fourth 25, and so on. With respect to

language, Zipf’s Law is known for describing the frequency distribution of words in a corpus:

function words tend to be very frequent, accounting for large portions of the token

frequency, while other words have relatively low frequency. On the other hand, the

distribution of sound types (phones) in linguistic corpora is non-Zipfian: Zipf's Law

overestimates the frequencies of both the highest and lowest frequency phones while

under-estimating the frequencies of phones in the middle (Lammert et al., 2020).

Zipf’s Law is expressed mathematically below, where 𝑛 represents an item’s frequency

rank (i.e., the third most frequent item is 𝑛 = 3) and 𝑥𝑛 represents the frequency of the 𝑛th

item.

𝑥𝑛 = 𝑥1 · (1/𝑛)

With respect to this data set, a Zipfian rank-frequency distribution of beatboxing sounds is

predicted to fit the equation above with 𝑥1 = 330 because there were 330 instances of the

most frequent sound, the Kick Drum.

𝑥𝑛 = 330 · (1/𝑛)
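For concreteness, the expected Zipfian counts for the first few ranks can be computed directly. This small Python sketch (not part of the original analysis) simply evaluates the equation above:

```python
def zipf_expected(x1, n_ranks):
    """Expected token frequency x_n = x1 * (1/n) for ranks 1..n_ranks."""
    return [x1 / n for n in range(1, n_ranks + 1)]

# With x1 = 330 (the Kick Drum count), a perfectly Zipfian inventory would
# have its next-ranked sounds occur 165, 110, 82.5, and 66 times.
freqs = zipf_expected(330, 5)
```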

Power laws take the more general form in the equation below; Zipf’s Law is the special case

where 𝑏 = 1 (and 𝑎 = 𝑥1). In this form, the parameters 𝑎 and 𝑏 can be estimated by

non-linear least squares regression using MATLAB’s fit function set to “power1”.

𝑓(𝑥) = 𝑎 · 𝑥^(−𝑏)
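The dissertation's estimates come from MATLAB's fit function with the "power1" model (non-linear least squares). As a rough illustration only, the sketch below estimates 𝑎 and 𝑏 by ordinary least squares in log-log space, a simpler (and differently weighted) estimator than the one actually used, so its output will not match the reported values exactly:

```python
import math

def fit_power_law(freqs):
    """Estimate a and b in f(n) = a * n**(-b) from rank-frequency data by
    ordinary least squares on log-transformed values. Illustrative only:
    log-log regression weights the data differently than the non-linear
    least squares fit used in the dissertation."""
    n = len(freqs)
    xs = [math.log(rank) for rank in range(1, n + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope                       # power-law exponent
    a = math.exp(my - slope * mx)    # scale factor
    return a, b
```

On perfectly Zipfian data (𝑥𝑛 = 𝑥1/𝑛) this recovers 𝑏 = 1 and 𝑎 = 𝑥1; on real rank-frequency data the two estimators can diverge, which is one reason the fit is interpreted cautiously below.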

It is difficult to demonstrate conclusively that the frequency distribution of beatboxing

sounds actually follows Zipf’s Law, or even that it follows a power law versus, say, a sum of

exponentials or log-normal distribution—any distribution that is similar in general but with

somewhat different mathematical properties. Even so, estimating the parameters 𝑎 and 𝑏

from the data as described above yields 𝑎 = 325.6 (316.9, 334.3) and

𝑏 = 1.025 (0.996, 1.054), putting the hypothesized model parameters 𝑎 = 330 and

𝑏 = 1 within the 95% confidence intervals of both parameter estimates of Zipf’s Law. The fit

has a sum-squared error of 1152.1 and a root-mean-square error of 4.66, with 𝑅² = 0.9914

(adjusted 𝑅² = 0.9912, dfe = 53). A visualization of the Zipf's Law parameters is overlaid on

the token frequencies in Figure 21 as a black line.

The goodness of fit to the power law can be evaluated from other graphs. Figures 22

and 23 show the residuals of the fit: the fit model slightly underestimates tokens of frequency

rank 11-21, then slightly over-estimates the rest of the sounds in the long tail of the

distribution. The systematicity of the residuals suggests that the model may not be an ideal

fit, though overestimating the frequency of items in the tail is a relatively common finding in

other domains where Zipf's Law is said to apply. Figure 24 shows the log-log plot of the

frequency distribution and the Zipf's Law fit. Power laws plotted this way resemble a straight

line with a slope equal to the exponent in power law notation; distributions with Zipf’s Law

should therefore resemble a line with a slope of -1. Figure 25 shows the cumulative

probability of the sounds, representing for each sound type (x axis) what proportion of all

the tokens in the data set is that sound or a more frequent sound. The benefit of the

cumulative probability graph is to quickly estimate how much of the data can be accounted

for with groups of sounds of a certain frequency or higher; for example, the five most

frequent sounds account for over 50% of tokens. Again, the first few most frequent sounds

are disproportionately represented in the data while the majority of sound types appear only

rarely. The figure also shows the cumulative probability of the power law fit to the data,

again as a black line.
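As a sketch of how the cumulative probabilities behind Figure 25 are computed, the fragment below uses the five token counts reported in Table 1; the corpus total of 1408 tokens is an assumption inferred from the reported percentages, not a figure stated in the text:

```python
def cumulative_probability(counts, total):
    """Running share of all tokens covered by the sounds so far; counts
    should be listed in decreasing token-frequency order."""
    running, shares = 0, []
    for c in counts:
        running += c
        shares.append(running / total)
    return shares

top_five = [330, 136, 117, 91, 70]  # Kick, PF, unforced Kick, ^K, Closed Hi-Hat
total_tokens = 1408                 # assumed corpus total (illustrative)
shares = cumulative_probability(top_five, total_tokens)
# With these numbers, the five most frequent sounds together cover more
# than half of all tokens, as stated in the text.
```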

Regardless of the specific nature of the distribution, the frequency distribution of

beatboxing sounds seems to resemble the Zipfian frequency distribution of words, but not

the frequency distribution of phones.

Figure 21. Rank-frequency plot of beatboxing sounds. Beatboxing sound frequencies roughly
follow a power law: the few most frequent sounds are very frequent and most of the sounds
are much less frequent.

Figure 22. Histogram of the residuals of the power law fit. Most sounds have a token
frequency within 5 tokens of their expected frequency.

Figure 23. Scatter plot of the residuals of the power law fit (gray) against the expected values
(black). The middle-frequency sounds are a little under-estimated and the lower-frequency
sounds are a little over-estimated.

Figure 24. Log-log plot of the token frequencies (gray) against the power law fit (black).

Figure 25. The discrete cumulative density function for the token frequencies of the sounds
in this data set (gray) compared to the expected function for sounds following a power law
distribution (black).

3.2 Sounds and compositionality

In this section, beatboxing sounds are presented in decreasing order of beat pattern

frequency instead of token frequency under the premise that the most stable and flexible

beatboxing sounds will occur in multiple beat patterns. Sounds with low beat pattern

frequency often have low token frequency, but certain high token frequency sounds that were

only performed in one pattern are omitted (like a velar closure {k}, which is the 7th most

frequent token) or deferred until a later section in accordance with their beat pattern

frequency (like the Clop {C} which is the 12th most frequent token). Whenever reference is

made to a sound's relative frequency or to the cumulative frequency of a set of sounds,

however, those high token frequency sounds are still part of the calculation. Figure 26 shows

a revision of the cumulative probability distribution in which sounds are ordered by beat

pattern frequency (black) instead of token frequency (lighter gray).

Figure 26. The discrete cumulative density function of the token frequency of sounds in this
data set (gray, same as Figure 25) against the density function of the same sounds
re-ordered by beat pattern frequency order (black).

The analysis of the compositionality of beatboxing sounds is presented in five parts. Sections

3.2.1-3.2.4 introduce beatboxing sounds with articulatory descriptions, then summarize the

phonetic dimensions involved in making those sounds. The sounds are presented according

to their beat pattern frequency: section 3.2.1 presents the five sounds that appear in more

than 10 beat patterns each, covering more than 50% of the cumulative token frequency of the

data set; section 3.2.2 adds seven sounds that appear in four or more beat patterns; and,

section 3.2.3 introduces ten sounds that each appear in two or more beat patterns. Section 3.2.4

adds another 20 lowest-frequency sounds for a total of 43 sounds. Section 3.2.5 summarizes

with an account of the overall compositional makeup of all the presented beatboxing sounds.

Articulatory descriptions of each sound are accompanied by images from real-time

MRI videos representing stages in the articulation of the sound (see Chapter 2: Method for

details of video acquisition and sound elicitation). Usually the images come from one

instance of the sound performed in isolation; some sounds were only performed in beat

patterns, so for those sounds the images come from one instance of a sound in a beat pattern.

Some of the videos from which these images were taken are available online at

[Link]

While most articulatory descriptions will rely on well-established phonetic

terminology, the phonetic dimension of constriction degree will involve three terms that are

not usually used or may be unfamiliar: compressed, contacted, and narrow. A compressed

constriction degree involves a vocal closure in which an articulator pushes itself into another

surface (or in the case of labial sounds, the lips may push each other). Compressed

constriction degree is used for many speech stops and affricates, and will be a key property of

many beatboxing sounds as well. Contacted constriction degree refers to a lighter closure in

the vocal tract which results in a trill when air is passed through it. Narrow constriction

degree refers to a constriction that is sufficiently tight to cause airflow to become turbulent;

it is used the same way in Articulatory Phonology (Browman & Goldstein, 1989).

Abbreviations for the sounds are provided in two notation formats: IPA and BBX.

Transcription in IPA notation incorporates symbols from the extensions to the International

Phonetic Alphabet for disordered speech (Duckworth et al., 1990; Ball et al., 2018b) and the

VoQS System for the Transcription of Voice Quality (Ball et al., 1995; Ball et al., 2018a). The

BBX notation (an initialism deriving from the word “beatbox”) is the author’s variant of

Standard Beatbox Notation (SBN; Stowell, 2003; Tyte & SPLINTER, 2014). At the time of

writing, Standard Beatbox Notation does not include annotations for many newer or less

common sounds. BBX is not meant to contribute to standardization, but simply to provide

functional labels for the sounds under discussion. In a few cases, BBX uses alternative labels

for sounds that SBN already has a symbol for (for example, the Inward Liproll in SBN is

{BB^BB} and in BBX is {LR}). BBX and SBN notations are indicated with curly brackets {}.

Unlike IPA transcriptions in which a single symbol is intended to correspond to a single

sound, BBX and SBN annotations frequently use multiple symbols to denote a single sound

(e.g., {PF} to represent a single PF Snare).

3.2.1 High-frequency sounds

[Link] Articulatory description of high-frequency sounds


Table 1 at the end of this section summarizes the high frequency beatboxing sounds in list

form. Tables 2-4 show the organization of the sounds based on their place of articulation,

constriction degree, airstream mechanism, and musical role. Unless otherwise indicated, the

MRI images presented in the figures below represent a sequence of snapshots at successive

temporal stages in the production of a sound.

Kick Drum

Figure 27. The forced Kick Drum.

The Kick Drum {B} mimics the kick drum sound of a standard drum set. It is one of the most

well-studied sounds in beatboxing science literature, and consistently described as a voiceless

glottalic egressive bilabial plosive (Proctor et al., 2013; de Torcy et al., 2014; Blaylock et al.,

2017; Patil et al., 2017; Dehais-Underdown, 2019). First a complete closure is made at the lips

and glottis, then larynx raising increases intraoral pressure so that a distinct “popping” sound

is produced when lip compression is released. The high-frequency rank of the Kick Drum is

likely due to a variety of factors: it is common in the musical genres on which beatboxing is

based; it replaces the [b] in the “boots and cats” phrase commonly used to introduce new

English beatboxers to their first beat pattern; and it is frequently co-produced with other

sounds like trills (basses and rolls).

PF Snare

Figure 28. The PF Snare.

The PF Snare {PF} is a labial affricate; it begins with a full labial closure, then transitions to a

brief labio-dental fricative. That the PF Snare is a glottalic egressive sound is evidenced by

the raised larynx height in the third image.

Inward K Snare

Figure 29. The Inward K Snare.

The Inward K {^K} (sometimes referred to simply as a K Snare due to its high frequency) is a

voiceless pulmonic ingressive lateral velar affricate. In producing the Inward K, the tongue

body initially makes a closure against the palate. It then shifts forward, with at least one side

lowering to produce a moment of pulmonic ingressive frication. The lateral quality is not

directly visible in these midsagittal images; however, laterality can be deduced by observing

that the tongue body does not lose contact with the palate in the midsagittal plane: if the

tongue is blocking the center of the mouth, then air can only enter the mouth past the sides

of the tongue.

Unforced Kick Drum

Figure 30. The unforced Kick Drum.

The Kick Drum is sometimes referred to as a “forced” sound. An “unforced” version of the

Kick Drum has also been observed in some beatboxing productions. This unforced Kick

Drum {b} has no observable larynx closure and raising like that of the forced Kick Drum;

instead, it is produced with a dorsal closure along with the closure and release of the lips.

Note however that the tongue body does not generally shift forward or backward during the

production of this unforced Kick Drum; the airstream is therefore neither lingual egressive

nor lingual ingressive, but neutral—a “percussive”, a term for a sound lacking airflow

initiation due to pressure or suction buildup. The source of the sound in a percussive is the

noise produced by the elastic compression then release of the contacting surfaces (Catford,

1977). Section 3.2.2 expands the scope of percussive sounds slightly in the context of

beatboxing to include sounds with a relatively small amount of tongue body retraction which

signals the presence of lingual ingressive airflow (“relatively” here compared to other lingual

ingressive sounds which have much larger tongue body retraction).

The extensions to the IPA (Ball et al., 2018) offer the symbol [ʬ] for bilabial

percussives. The unforced Kick Drum is likely a context-dependent alternative form of the

more common forced Kick Drum, as discussed at greater length in Chapter 5: Alternations.

(The same chapter also includes an articulatory comparison between three compressed

bilabial sounds—the forced Kick Drum, the unforced Kick Drum, and the Spit Snare.)

Closed Hi-Hat

Figure 31. The Closed Hi-Hat.

The Closed Hi-Hat {t} is a voiceless glottalic egressive apical alveolar affricate. The tongue tip

rises to the alveolar ridge to make a complete closure while the vocal folds close and the

larynx lifts to increase intraoral pressure.

[Link] Composition summary of high-frequency sounds


Tables 2-4 present the five most common beatboxing sounds in this data set, all of

which appear in at least 10 beat patterns and which collectively make up more than 50% of

the cumulative token frequency. These frequently used sounds are spread across three

primary constrictors: labial (bilabial, labio-dental), coronal (alveolar), and dorsal. Three of

the sounds {B, PF, t} are glottalic egressive, one {^K} is pulmonic ingressive, and one {b} is

percussive (Table 2). Some beatboxers also use glottalic egressive dorsal sounds (e.g.,

Rimshot), but the Inward K Snare is commonly used as a way to inhale while vocalizing. The

unforced Kick Drum appears to be a context-dependent variety of Kick Drum (see Chapter

5: Alternations), indicating that the glottalic egressive Kick Drum is the default form. With

respect to airstreams, this effectively places the most common beatboxing sounds along two

airstreams: glottalic egressive for the majority, with pulmonic ingressive for the important

inhalation function of the Inward K Snare.

Of the same sounds, three {PF, t, ^K} are phonetically affricates, and two {B, b} are

stops (Table 3). Proctor et al. (2013) describe the Kick Drum {B} as another affricate; its

production may vary among beatboxers. But the phonological distinction between affricate

and stop that exists in some languages does not have as clear a role in beatboxing; with only

five sounds under consideration so far that mostly vary by constrictor, a simpler description

is that all of these sounds are produced with a compressed closure similar to what both stops

and affricates in speech require. The nature of the release—briefly sustained or not—likely

enhances the similarity of each sound to its musical referent on the drum kit, but may not be

a phonetic dimension along which beatboxing sounds vary.

If beatboxing sounds were organized to maximize distinctiveness without any other

organizational constraints, these five most frequent sounds should be expected to be

completely different with respect to common articulatory dimensions like constriction

degree (similar to manner of articulation), constrictor (place of articulation), coordination of

65
primary and pressure-change-initiator actions (airstream mechanism), as well perhaps as

duration, nasality, voicing, or other phonetic dimensions. Instead, the sounds vary by

constrictor but share the same qualitative constriction degree, lack of nasality, lack of

voicing, and all but one share the same airstream mechanism.

Table 1. Notation and descriptions of the most frequent beatboxing sounds.
Sound name | BBX | IPA | Description | Token frequency | Cumulative probability | Beat pattern frequency

Forced Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop 330 23.44% 34

PF Snare {PF} [p͡f '] Voiceless glottalic egressive labiodental affricate 136 33.10% 23

Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar 91 39.56% 16
affricate

Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop 117 47.87% 14

Closed Hi-Hat {t} [ts’] Voiceless glottalic egressive alveolar affricate 70 52.94% 12

Table 2. The most frequent beatboxing sounds displayed according to constrictor (top) and
airstream (left).
Airstream Bilabial Labiodental Coronal (alveolar) Dorsal

Glottalic egressive B PF t

Pulmonic ingressive ^K

Percussive b

Table 3. The most frequent sounds displayed according to constrictor (top) and constriction
degree (left).
Constriction degree Bilabial Labiodental Coronal (alveolar) Dorsal

Compressed B, b PF t ^K

Table 4. The most frequent sounds displayed according to constrictor (top) and musical role
(left).
Musical role Bilabial Labiodental Coronal (alveolar) Dorsal

Kick B, b

Hi-Hat t

Snare PF ^K

3.2.2 Medium-frequency sounds

[Link] Articulatory description of medium frequency sounds

Dental closure, linguolabial closure, and alveolar closure

The dental closure, linguolabial closure, and alveolar closure were not identified as distinct

sounds by this beatboxer, and therefore were not given names referring to any particular

musical effect. They are each categorized as a percussive coronal stop, made with the tongue

tip just behind the teeth (dental), touching the alveolar ridge (alveolar), or placed between

the lips (linguolabial).

“Percussive” may be somewhat misleading for these sounds. Each of these sounds is

produced with a posterior dorsal constriction, just like the percussive unforced Kick Drum.

But unlike the unforced Kick Drum, in each of these sounds there is a relatively small

amount of tongue body retraction. This makes them phonetically lingual ingressive sounds

rather than true percussives which are described as sounds produced without inward or

outward airflow. (The linguolabial closure is also found without a dorsal closure, and in

those cases is definitely not lingual ingressive.)

Earlier, the choice was made to not distinguish between constriction release types

stop and affricate because there is no evidence here that beatboxing requires such a

distinction. For the dental, linguolabial, and alveolar clicks, however, there is evidence to

suggest that they should not be grouped with other lingual ingressive sounds that will enter

the sound inventory in section 3.2.3. Articulatorily, there is a great difference between these

“percussives” and other lingual ingressive sounds with respect to the magnitude of their

tongue body retraction. The image sequence in Figure 36 shows the production of an alveolar

closure followed immediately by a Water Drop (Air). Both sounds have tongue body

retraction that indicates a lingual ingressive airstream, but the movement of the tongue body

in the alveolar closure (frames 1-2) is practically negligible compared to the movement of the

tongue body in the Water Drop (Air) (frames 3-4). The same holds for the other sounds

coded as lingual ingressive in this chapter. In later chapters, we will also see evidence that the

dental closure and perhaps some other of these “percussive” sounds are context-dependent

variants of other more common sounds (the Closed Hi-Hat and PF Snare).

Figure 32. The dental closure.

Figure 33. The linguolabial closure (dorsal).

Figure 34. The linguolabial closure (non-dorsal).

Figure 35. The alveolar closure.

Figure 36. The alveolar closure (frames 1-2) vs the Water Drop (Air). The jaw lowering and
tongue body retraction for the alveolar closure is of lesser magnitude.

Spit Snare

Figure 37. The Spit Snare.

The Spit Snare corresponds to the Humming Snare of Paroni et al. (2021), which seems to

have two variants in the beatboxing community: the first, which Paroni et al. (2021)

reasonably describe as a lingual egressive bilabial stop with a brief high frequency trill

release; and the second, sometimes also called a Trap Snare, BMG Snare, or Döme Snare

(due to its popularization by beatboxing artists BMG and Döme (Park, 2017)), which appears

to be a bilabial affricate. The latter articulation is the one described here.

This Spit Snare is a lingual egressive bilabial affricate, produced by squeezing air

through a tight lip compression, creating a short spitting/squirting sound reminiscent of a

hand clap. To create the high oral air pressure that pushes the air through the lip closure, the

volume of the oral cavity is quickly reduced by tongue body fronting and jaw raising. The

lips appear to bulge slightly during this sound, either due to the high air pressure or to the

effort exerted in creating the lip compression.

The IPA annotation for the Spit Snare is composed of the symbol for a bilabial click

(lingual ingressive) tied to the symbol for a voiceless bilabial fricative (pulmonic egressive)

followed by an upward arrow. The upward arrow was part of the extensions to the IPA until

the 2008 version, meant to be used as a diacritic in combination with pre-existing click

symbols to represent “reverse clicks” (Ball et al., 2018:159), but was removed in later versions

because such articulations are rarely encountered even in disordered speech (Ball et al.,

2018). The same notation of a bilabial click with an upward arrow was used by Hale & Nash

(1997) to represent the lingual egressive bilabial “spurt” attested in the ceremonial language

Damin. Note that the downward arrow is not complementarily used for lingual ingressive

sounds; instead, its use both in the extensions to the IPA and here is to mark pulmonic

ingressive sounds (designated “Inward” sounds by beatboxers) like the Inward K Snare.

Throat Kick

Figure 38. The Throat Kick.

Another member of the Kick family of sounds is the Throat Kick (also called a Techno Kick,

808 Kick, or Techno Bass). Throat Kicks are placeless

implosives: while there is always an oral closure coproduced with glottal adduction, lowering,

and voicing, it does not seem to matter where the oral constriction is made. In isolation, this

beatboxer produces the Throat Kick with full oral cavity closure from lips to velum; in the

beat pattern showcasing the Throat Kick, the oral closure is an apical alveolar one. (This

latter articulation is the origin of the chosen IPA notation for this sound, an unreleased

alveolar implosive [ɗ̚]). Supralaryngeal cavity expansion (presumably to aid the brief voicing

and also to create a larger resonance chamber) is achieved through tongue root fronting,

slight retraction of the pharynx, and lowering of the larynx.

Inward Liproll

Figure 39. The Inward Liproll.

The Inward Liproll is a voiceless pulmonic ingressive bilabial trill. It is usually performed

with lateral labial contact. Note that in this example, as in others, the Inward Liproll is

initiated by a forced Kick Drum. Frames 1-3 show the initial position of the vocal tract, the

initiation of the Kick Drum, and the release of the Kick Drum. In frame 4, the

lips—particularly the lower lip—have been pulled inward over the teeth. Frame 5 shows the

final position the tongue body adopts during this sound.

Tongue Bass

Figure 40. The Tongue Bass.

The Tongue Bass is a voiceless pulmonic egressive alveolar trill. The tongue tip makes loose contact

with the alveolar ridge, then air is expelled from the lungs through the alveolar closure,

causing the tongue tip to vibrate. The arytenoid cartilages appear to be in frame in the later

images, but the thyroarytenoid muscles (which would appear as a bright spot separating the

trachea from the supralaryngeal airway) are not; this means that the sound is voiceless. This

beatboxer distinguishes between the Tongue Bass here and a Vocalized Tongue Bass which

does have voicing (as well as a High Tongue Bass in which the thyroarytenoid muscles are

even clearer).

Composition summary of medium-frequency sounds


Table 5 adds the next 7 most common beatboxing sounds (a total of 12 sounds), all of which

appear in four or more beat patterns in the data set and comprise about 70% of the

cumulative token frequency. The introduction of these seven sounds expands the system along

three dimensions relative to the five most frequent sounds. First, a new constriction degree: in

addition to the earlier compressed closures, now light contact that results in trills is used as

well. Second, whereas the tongue tip was earlier responsible for only one sound (an

alveolar closure), it now performs five sounds—three alveolar, and two with different

constriction location targets. Third is the addition of glottalic ingressive,

pulmonic egressive, and lingual egressive airstreams for the Throat Kick, Tongue Bass, and

Spit Snare respectively.

Five of the seven sounds use the same compressed constriction degree type as the

most frequent sounds while filling out different constriction location options—though

bilabial and coronal sounds are more popular than the others. The Tongue Bass and Inward

Liproll open a new constriction degree value of light contact but capitalize on the bilabial

and alveolar constrictor locations that already host the most compressed sounds, doubling

down on these two particular constriction locations.

Airstream mechanism is expanded by these sounds. Whereas the five most common

sounds used three airstreams (and only two if you don’t count the percussive unforced Kick

Drum because it almost always occurs in restricted environments), adding the new sounds

increases the number of airstream mechanism types to six (or five, if the percussives are again treated as

alternants of other sounds). The airstream expansions do not follow any particular trend: the

glottalic ingressive sound is a laryngeal kick, the pulmonic egressive sound is a coronal bass,

and the lingual egressive sound is a bilabial snare.

Overall, places of articulation and the compressed constriction degree established by

the highest frequency sounds continue to be used by the medium frequency sounds, but the

new sounds also expand the system’s dimensions in a few directions.

Table 5. Notation and descriptions of the medium-frequency beatboxing sounds.
Sound name BBX IPA Description Token frequency Cumulative probability Beat pattern frequency

Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop 37 55.47% 9

Linguolabial closure {tbc} [ʘ̺, t̼] Voiceless percussive linguolabial stop 23 57.10% 9

Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate 29 59.16% 6

Throat Kick {u} [ɗ̚] Voiced glottalic ingressive unreleased placeless stop 50 62.71% 5

Inward Liproll {^LR} [ʙ̥↓] Voiceless pulmonic ingressive bilabial trill 31 64.91% 5

Tongue Bass {TB} [r̥] Voiceless pulmonic egressive alveolar trill 27 66.83% 5

Alveolar closure {ac} [k͜ǃ] Voiceless percussive alveolar stop 27 68.75% 4

Table 6. High and medium frequency beatboxing sounds displayed by constrictor (top) and
airstream mechanism (left). Medium frequency sounds are bolded.
Airstream Bilabial Labiodental Coronal Dorsal Laryngeal

Linguolabial Dental Alveolar

Glottalic egressive B PF t

Glottalic ingressive u

Pulmonic egressive TB

Pulmonic ingressive ^LR ^K

Lingual egressive SS

Percussive b tbc dc ac

Table 7. High and medium frequency sounds displayed by constrictor (top) and constrictor
degree (left). Medium frequency sounds are bolded.
Constriction degree Bilabial Labiodental Coronal Dorsal Laryngeal

Linguolabial Dental Alveolar

Compressed B, b, SS PF tbc dc t, ac ^K u

Contacted ^LR TB

Table 8. High and medium frequency beatboxing sounds displayed by constrictor (top) and
musical role (left). Medium frequency sounds are bolded.
Musical role Bilabial Labiodental Coronal Dorsal Laryngeal

Linguolabial Dental Alveolar

Kick B, b u

Hi-Hat (tbc) (dc) t, (ac)

Snare SS PF ^K

Roll ^LR

Bass TB

3.2.3 Low-frequency sounds

Articulatory description of low-frequency sounds

Humming

Figure 41. Humming.

Humming is phonation that occurs when there is a closure in the oral cavity but air can be

vented past a lowered velum through the nose. This beatboxer did not identify humming as a

distinct sound per se, but did identify a beat pattern that featured “Humming while

Beatboxing” which is discussed more in Chapter 6: Harmony. The example of humming

shown here was co-produced with an unforced Kick Drum.

Vocalized Liproll, Inward

Figure 42. The Vocalized Liproll, Inward.

This sound is a voiced pulmonic ingressive labial trill. Like some other trills in this data set, it

generally begins with a Kick Drum (frames 1-3).


Closed Tongue Bass

Figure 43. The Closed Tongue Bass.

The Closed Tongue Bass is a glottalic egressive alveolar trill performed behind a labial

closure. As with phonation (or any other vibration of this nature), air pressure behind the

closure must be greater than air pressure in front of the closure. Egressive trills usually have

higher air pressure behind the trilling constriction because atmospheric pressure is relatively

low; for the Closed Tongue Bass, the area between the lips and the tongue tip is where

relatively low pressure must be maintained. This appears to be accomplished by allowing the

lips (and possibly cheeks) to expand, increasing the volume of the chamber while it fills with

air. In the beat pattern that features the Closed Tongue Bass, the beatboxer also uses glottalic

egressive alveolar trills with spread lips, presumably as a non-closed variant of the Closed

Tongue Bass.
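The pressure differential described here can be illustrated with an idealized isothermal (Boyle's law) calculation; the volumes below are invented for illustration, not measurements:

```python
def pressure_after_expansion(p_initial_kpa, v_initial_ml, v_final_ml):
    """Isothermal ideal-gas (Boyle's law) approximation: P1 * V1 = P2 * V2."""
    return p_initial_kpa * v_initial_ml / v_final_ml

# Expanding a sealed chamber from a hypothetical 10 ml to 12 ml drops its
# pressure from atmospheric (~101.3 kPa) to ~84.4 kPa, so air behind the
# trilling constriction flows toward the lower-pressure front chamber.
pressure_after_expansion(101.3, 10, 12)
```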

Liproll

Figure 44. The Liproll.

The Liproll is a voiceless lingual ingressive bilabial trill. It begins with the lips closed together and

the tongue body pressed into the palate. The tongue body then shifts backward, creating a

vacuum into which air flows across the lips, initiating a labial trill.

Water Drop (Tongue)

Figure 45. The Water Drop (Tongue).

The Water Drop (Tongue) is one of two strategies in this data set for producing a water drop

sound effect, the other being the Water Drop (Air). The Water Drop (Tongue) is a lingual

ingressive palatoalveolar stop with substantial lip rounding. With rounded lips, the tongue

body makes a closure by the velum, and the tongue tip makes a closure at the alveolar ridge;

the tongue tip constriction is then released, mimicking the sound of the first strike of a water

droplet. The narrow rounding of the lips may create a turbulent sound, similar to whistling.

(Inward) PH Snare

Figure 46. The (Inward) PH Snare.

The (Inward) PH Snare or Inward Classic Snare is a pulmonic ingressive bilabial affricate. In

these beat patterns, it was always followed by an Inward K Snare. A PH Snare closely

followed by an Inward K Snare is sometimes referred to as a PK Snare, and the beatboxer in

this study only explicitly identified the PK Snare as a sound they knew, not the PH Snare.

The choice was made to identify the PH Snare as a distinct sound because the few other

combination sounds in this data set—like the D Kick Roll and Inward Clickroll with

Whistle—also have their component pieces identified separately. (Note: the alternative

choice to treat the combo of PH Snare and Inward K Snare as a single PK Snare would

reduce the number of Inward K Snares in the data set from 91 to 78; re-assessing the power

law fit yields a slightly stronger correlation [R-squared = 0.9957, adjusted R-squared =

0.9956] but an exponent of b=1.032 [confidence interval (1.053, 1.011)] which is slightly larger

than the theoretical b=1).
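The power-law re-fit mentioned in this note can be reproduced in outline as a least-squares regression in log-log rank-frequency space; the function below is a sketch, not the dissertation's analysis code:

```python
import math

def fit_power_law(frequencies):
    """Fit frequency ~ C * rank**(-b) by least squares on log(rank), log(freq).

    Returns (b, r_squared), with rank 1 assigned to the most frequent sound."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return -slope, 1 - ss_res / ss_tot

# e.g., with a partial list of token counts drawn from the tables in this chapter:
b, r2 = fit_power_law([136, 117, 91, 70, 50, 37, 32, 31, 29, 28, 27, 27, 23, 23, 19, 19])
```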

Inward Clickroll

Figure 47. The Inward Clickroll.

The Inward Clickroll (also called Inward Tongue Roll) is a voiceless pulmonic ingressive

central sub-laminal retroflex trill. The tongue tip curls backward so that the underside is

against the palate, and sides of the tongue press against the side teeth so that the only air

passage is across the center of the tongue. The lungs expand, pulling air from outside the

body between the underside of the tongue blade and the palate, initiating a trill.

Open Hi-Hat

Figure 48. The Open Hi-Hat.

The Open Hi-Hat is a voiceless central alveolar affricate with a sustained release. The initial

closure release is ejective, but the part of the release that is sustained to produce frication is

pulmonic egressive.

Lateral alveolar closure

Figure 49. The lateral alveolar closure.

The lateral alveolar closure is a percussive lateral alveolar stop.

Sonic Laser

Figure 50. The Sonic Laser.

The Sonic Laser is a pulmonic egressive labiodental fricative with an initial apical alveolar

tongue tip closure followed by a narrow palatal constriction of the tongue body during the

fricative.

Labiodental closure

Figure 51. The labiodental closure.

The labiodental closure is a voiceless percussive labiodental affricate. It is usually

accompanied by the tongue moving forward toward an alveolar closure, though it is not clear

if this tongue movement is related to the labiodental closure or the alveolar closure that

typically follows the labiodental closure. Later chapters suggest that the labiodental closure is

a percussive variant of the PF Snare.

Composition summary of low-frequency sounds


Of these new sounds, all but one use the same two constriction degrees introduced by the

high and medium frequency sounds—compressed (stops/affricates) and contacted (for trills).

The remaining sound is the Sonic Laser {SonL}; it, as well as perhaps the Water Drop

(Tongue) {WDT}, uses a narrow constriction degree akin to speech fricatives. The majority

(7/12) of these sounds are bilabial or alveolar constrictions, following the trend from the

previous section that those two constriction locations hold more sounds than the others.

Labiodental and laryngeal constrictions were also augmented, but only one new place was

added (retroflex). This set of sounds also added the final airstream type, lingual ingressive.

Less obvious in Tables 10-12 is that these sounds introduce new phonetic dimensions

that apply to certain sound pairs. The lateral alveolar closure {tll} and alveolar closure differ

by laterality, not by place, constriction degree, or airstream. Likewise, the Inward Liproll

{^LR} and Vocalized Inward Liproll {^VLR} differ by voicing, while the Closed Hi-Hat {t}

and Open Hi-Hat {ts} differ by duration (with the latter adopting a secondary pulmonic

egressive airstream to support its length). These three dimensional additions—laterality,

voicing, and duration—are not leveraged distinctively by most beatboxing sounds.

The difficulty of capturing all the phonetic dimensions a sound uses when placing it

in an IPA-style table (or in this case, tables) is more than an issue of convenience. Using a

tabular structure for sounds is sometimes a useful proxy for assessing their periodicity

(Abler, 1989)—the degree to which sounds can be organized into groups that share similar

behavior—but relies on a certain degree of procrusteanism (Catford, 1977)—a willingness to

force the sounds into a predetermined pattern at the expense of nuanced descriptions, and a

strategy that only becomes less adequate as the beatboxing sound inventory expands. Some

consonants on the IPA table suffer from the same issue: double-articulated sounds like [w]

and non-pulmonic sounds (clicks, ejectives, implosives) do not fit into the reductive

single-articulator, pulmonic-only structure of the major IPA consonants table.

Of the sounds in this section, the Water Drop (Tongue), Sonic Laser, Open Hi-Hat,

and Closed Tongue Bass all use two values on some phonetic dimension which makes them

impossible to place on these tables. The Water Drop (Tongue), Sonic Laser, and Closed

Tongue Bass all use multiple constriction locations, and the Open Hi-Hat uses both glottalic

egressive and pulmonic egressive airstreams. Sounds of this nature can be left out of the

tables, like [w] in the IPA. Otherwise, there are three ways to include these sounds on the

tables. The first way is to add a sound to multiple locations on the table to show its

multiple-articulation; this helps somewhat in small doses, but quickly gets confusing when

many sounds must be placed on the table two or more times. The second way is to add new

rows or columns or slots for double-valued dimensions; this might be a new “glottalic

egressive + pulmonic egressive” row in the airstream mechanism dimension, or a new “labial

+ coronal” column for the constrictor/place of articulation dimension. But double-valued

dimensions miss the point of having tables in the first place: the aim of the game is to look

for repetition of phonetic features in sounds, but adding new rows and columns only creates

more sparseness and hides repetition. The third way of adding double-valued sounds to the

tables is to assume that one of the dimension values is more important than the other(s) and

place the sound accordingly. This is the epitome of procrusteanism, and for simplicity it is

also the approach adopted in this chapter.

The point here, and even more importantly going forward into the lowest frequency

sounds, is that hard-to-place sounds often flesh out combinatorial possibilities by using

articulations that are already in the system to produce entirely novel sounds. But this will

sometimes not show up in analyses of the IPA-style tables because the sounds cannot be

represented adequately this way.

Table 9. Notation and description of the low-frequency beatboxing sounds.
Sound name BBX IPA Description Token frequency Cumulative probability Beat pattern frequency

Humming {hm} [C̬] Pulmonic egressive nasal voicing 32 71.02% 2

Vocalized Liproll, Inward {^VLR} [ʙ↓] Voiced pulmonic ingressive bilabial trill 23 72.66% 2

Closed Tongue Bass {CTB} [r'̚] Voiceless glottalic egressive alveolar trill with optional labial closure 19 74.01% 2

Liproll {LR} [ʙ̥↓] Voiceless lingual ingressive bilabial trill 19 75.36% 2

Water Drop (Tongue) {WDT} [ǂʷ] Voiceless lingual ingressive labialized palatoalveolar stop 16 76.49% 2

(Inward) PH Snare {^Ph} [p͡ɸ↓] Voiceless pulmonic ingressive bilabial affricate 13 77.41% 2

Labiodental closure {pf} [ʘ̪] Voiceless percussive labiodental stop 12 78.27% 2

Inward Clickroll {^CR} [ɽ↓] Voiceless pulmonic ingressive retroflex trill 8 78.84% 2

Open Hi-Hat {ts} [t’s:] Voiceless glottalic egressive alveolar affricate with sustained pulmonic egressive release 8 79.40% 2

Lateral alveolar closure {tll} [ǁ] Voiceless percussive lateral alveolar stop 7 79.90% 2

Sonic Laser {SonL} Pulmonic egressive labiodental fricative with a narrow tongue body constriction 6 80.33% 2

Table 10. High, medium, and low (bolded) frequency sounds displayed by constrictor (top)
and airstream mechanism (left).
Airstream Bilabial Labiodental Coronal Dorsal Laryngeal

Linguolabial Dental Alveolar Retroflex

Glottalic egressive B PF t, CTB, ts

Glottalic ingressive u

Pulmonic egressive SonL TB hm

Pulmonic ingressive ^LR, ^VLR, ^Ph ^CR ^K

Lingual egressive SS

Lingual ingressive LR WDT

Percussive b pf tbc dc ac, tll

Table 11. High, medium, and low (bolded) frequency sounds displayed by constrictor (top)
and constriction degree (left).
Constriction Bilabial Labiodental Coronal Dorsal Laryngeal
degree
Linguolabial Dental Alveolar Retroflex

Compressed B, b, SS, ^Ph PF, pf tbc dc t, ts, ac, WDT, tll ^K u

Contacted ^LR, ^VLR, LR TB, CTB ^CR hm

Narrow SonL

Table 12. High, medium, and low (bolded) frequency sounds displayed by constrictor (top)
and musical role (left).
Musical role Bilabial Labiodental Coronal Dorsal Laryngeal

Linguolabial Dental Alveolar Retroflex

Kick B, b u

Hi-Hat (tbc) (dc) t, ts, (ac, tll)

Snare SS, ^Ph PF, pf ^K

Roll ^LR, ^VLR, LR ^CR

Bass TB, CTB

Sound Effect SonL WDT hm

3.2.4 Lowest-frequency sounds

The previous three sections assigned categorical phonetic descriptions to the set of

beatboxing sounds that appear in more than one beat pattern in this data set. Part of the aim

of doing so was to show what types of sounds are used most frequently in beatboxing, to

avoid making generalizations that weigh a Kick Drum equally with, say, a trumpet sound

effect. This section tests the generalizations of the previous three sections by looking at

another 20 sounds, bringing the total number of sounds described from 23 to 43 (out of a

total 55 sounds, the remainder of which could not be satisfactorily articulatorily described).

If beatboxing sounds are using a somewhat limited set of the many phonetic dimensions

available to a beatboxer, then the same most common phonetic dimensions should be

re-used by these next 20 beatboxing sounds.
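That prediction can be phrased as a simple set-containment check over the dimension values already attested; the values below are illustrative examples, not the full coding:

```python
# Illustrative reuse check: which phonetic dimension values of a new sound are
# already attested among the higher-frequency sounds? (Example values only.)
attested = {
    "airstream": {"glottalic egressive", "glottalic ingressive", "pulmonic egressive",
                  "pulmonic ingressive", "lingual egressive", "lingual ingressive",
                  "percussive"},
    "place": {"bilabial", "labiodental", "alveolar", "retroflex", "dorsal"},
}

def novel_values(sound):
    """Dimension values used by `sound` that no higher-frequency sound uses."""
    return {dim: vals - attested.get(dim, set())
            for dim, vals in sound.items() if vals - attested.get(dim, set())}

# The Clop reuses an attested airstream but introduces a new place (palatal):
clop = {"airstream": {"lingual ingressive"}, "place": {"palatal"}}
```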

Articulatory description of lowest-frequency sounds

Clop

Figure 52. The Clop.

The Clop is a voiceless lingual ingressive palatal stop.

D Kick

Figure 53. The D Kick.

The D Kick is a voiceless glottalic egressive retroflex stop. The underside of the tongue tip

presses against the alveolar ridge, flipping back to an upright position upon release.

Inward Bass

Figure 54. The Inward Bass.

The Inward Bass is pulmonic ingressive voicing. The base of the tongue root participates in

the constriction which may indicate that some other structure than (or in addition to) the

vocal folds is vibrating, such as the ventricular folds. The sound is akin to a growl. In this

case, the pulmonic airflow is directed through the nose rather than the mouth.

Low Liproll

Figure 55. The Low Liproll.

The Low Liproll is a voiceless glottalic ingressive bilabial trill. The vocal airway is quite wide,

lowering the overall resonance behind the trill to create a deeper sound. Frames 1-2 show the

forced Kick Drum that occurs at the beginning of this sound; frames 3-4 show the lips

retracted and the tongue body pulled back.

Hollow Clop

Figure 56. The Hollow Clop.

The Hollow Clop is a glottalic ingressive alveolar stop. It appears to function similarly to a

click (e.g., the Water Drop Tongue) with the tongue tip making an alveolar closure as the

front part of a seal. In this case, however, the back of the seal is glottalic, not lingual.

Retraction of the tongue and lowering of the larynx expand the cavity directly behind the

seal, resulting in the distinctive position of the tongue tip sealed to the alveolar ridge (frame

3) just before it releases quickly into a wide, open vocal posture.

Tooth Whistle

Figure 57. The Tooth Whistle.

The Tooth Whistle is a labiodental whistle, which in this analysis is treated along with

fricatives as a narrow constriction.

Voiced Liproll

Figure 58. The Voiced Liproll.

The Voiced Liproll is a voiced glottalic ingressive bilabial trill, similar to the Low Liproll and

High Liproll. The tongue body retracts during the Voiced Liproll and creates a large cavity

behind the labial constriction.

Water Drop (Air)

Figure 59. The Water Drop (Air).

The Water Drop (Air) is a voiceless lingual ingressive palatal stop with subsequent tongue

body fronting. The tongue front and tongue body make a closure, then the tongue body

moves backward to eventually pull the tongue front away from its closure as expected for a

click. Following the release of the tongue front closure, however, the tongue body shifts

forward again. This, combined with lip rounding throughout, creates the sound of a water

drop from a pop that starts with a low resonant frequency and quickly shifts to a higher

resonant frequency.

Clickroll

Figure 60. The Clickroll.

The Clickroll is a voiceless lingual egressive alveolar trill. The tongue tip and tongue body

make a closure as they would for a click. Instead of the tongue body shifting backward or

down to widen the seal, the tongue gradually fills the seal to push air past the alveolar

contact, initiating vibration.

D Kick Roll

Figure 61. The D Kick Roll.

The D Kick Roll is a combination of the D Kick and a Closed (but in this case not actually

closed) Tongue Bass. It begins with a voiceless glottalic egressive retroflex stop (the D Kick).

When the tongue tip flips upright again, it makes light contact against the alveolar ridge; the

larynx continues to rise during this closure, pushing air through to make a trill.

High Liproll

Figure 62. The High Liproll.

The High Liproll is a voiceless glottalic ingressive bilabial trill. The vocal tract airway is narrow

for the duration of the trill, raising the resonant frequencies behind the trill for a higher

sound.

Inward Clickroll with Liproll

Figure 63. The Inward Clickroll with Liproll.

The Inward Clickroll with Liproll is a combination of the Inward Clickroll and an Inward

Liproll. The Inward Clickroll begins the sound as a pulmonic ingressive retroflex trill; the lips

subsequently curl inward to make another trill vibrating over the same pulmonic ingressive

airflow.

Lip Bass

Figure 64. The Lip Bass.

The Lip Bass is a pulmonic egressive bilabial trill.

tch

Figure 65. tch.

The tch is a voiceless glottalic egressive laminal alveolar stop. The connection between the

tongue and the alveolar ridge begins with just an apical constriction but quickly transitions

to a laminal closure. The larynx rises at that point, pushing air past the closure into the tch

snare.

Sweep Technique

Figure 66. The Liproll with Sweep Technique.

The Sweep Technique is a Liproll variant in which the tongue tip connects with the

underside of the lower lip to change the frequency of the bilabial vibration.

Sega SFX

Figure 67. The Sega SFX.

The Sega SFX (abbreviation for sound effect) is composed of an Inward Clickroll and a

labiodental fricative. The lower lip is pulled farther back across the lower teeth during the

course of the sound to change the fricative frequency.

Trumpet

Figure 68. The Trumpet.

The Trumpet is a voiced pulmonic egressive bilabial (or possibly labiodental with the

connection between the upper teeth and the back of the lower lip) fricative. The tongue tip

makes intermittent alveolar closures to separate the Trumpet into notes with distinct onsets

affiliated with the musical meter.

Vocalized Tongue Bass

Figure 69. The Vocalized Tongue Bass.

The Vocalized Tongue Bass is a voiced pulmonic egressive alveolar trill.

High Tongue Bass

Figure 70. The High Tongue Bass.

The High Tongue Bass is a voiced pulmonic egressive alveolar trill, made with a higher

laryngeal position and narrower airway to raise the resonant frequency behind the trill.

Kick Drum exhale

Figure 71. The Kick Drum exhale.

The Kick Drum exhale is a forced Kick Drum produced with pulmonic egressive airflow in

addition to the usual glottalic egressive airflow. There are only two tokens of it in the data

set, and they might both be more appropriately analyzed as a true forced Kick Drum (frames

1-2) followed by a bilabial or labiodental fricative (frame 3).

Composition summary of lowest-frequency sounds


Many of the new sounds fill in gaps left by the earlier sounds. The additions of the Voiced

Liproll {VLR} and Lip Bass {LB} fill out the bilabial place column, while the additions of the

Hollow Clop {HC} and Clickroll {CR} put a sound in every airstream of the alveolar place

column except pulmonic ingressive (which may be a practically unusable combination—the

Inward Clickroll {^CR} might be better treated typologically as an alveolar that manifests as

retroflex because of the aerodynamics required to make an ingressive trill).

Just as in the previous section, several of the sounds introduced in this section do not

fit into distinctive slots in the IPA-style tables we have established so far. The tch {tch} is a

glottalic egressive alveolar sound like the Closed Hi-Hat {t} except that it uses a laminal

closure instead of an apical closure. (It may also have a release qualitatively similar to a [tʃ].)

The Low Liproll {LLR}, High Liproll {HLR}, and Voiced Liproll {VLR} differ with respect

to the area of the vocal airway behind the labial constriction, as do the Tongue Bass {TB} and

High Tongue Bass {HTB}. The Clop {C} and Water Drop (Air) {WDA} differ by the absence

or presence of a tongue fronting movement. These were placed in the tables procrusteanly by

ignoring the apical/laminal distinction and constrictions that one might judge as secondary

by comparison with speech sounds—this is for convenience of a tabular representation only

and not to be taken as an assumption about the actual nature of beatboxing sounds.

Six of the lowest frequency sounds were not placed on Tables 14-16 because they were

clearly composed of two major tongue and lip constrictions and were judged not to be able

to fit into a single cell: D Kick Roll {DR}, Inward Clickroll and Whistle {^CRW}, Sega SFX

{SFX}, Trumpet {T}, Loud Whistle {LW}, and Sweep Technique {st}. Each involves

constrictions from both the tongue tip and the lips.

Table 13. Notation and descriptions for the lowest frequency beatboxing sounds.
Sound name BBX Description Token frequency Beat pattern frequency

Clop C Voiceless lingual ingressive palatal stop 28 1

D Kick D Voiceless glottalic egressive retroflex stop 17 1

Inward Bass IB Pulmonic ingressive phonation 16 1

Low Liproll LLR Voiceless glottalic ingressive bilabial trill 13 1

Hollow Clop HC Voiceless glottalic ingressive alveolar stop 12 1

Tooth Whistle TW Voiceless pulmonic egressive labiodental whistle 12 1

Voiced Liproll VLR Voiced glottalic ingressive bilabial trill 10 1

Water Drop (Air) WDA Voiceless lingual ingressive palatal stop 8 1

Clickroll CR Voiceless lingual egressive alveolar trill 6 1

D Kick Roll DR Voiceless glottalic egressive retroflex stop with alveolar trill 6 1

High Liproll HLR Voiceless glottalic ingressive bilabial trill 6 1

Inward Clickroll with Liproll ^CRL Voiceless pulmonic ingressive retroflex trill and bilabial trill 6 1

Lip Bass LB Pulmonic egressive bilabial trill 6 1

tch tch Voiceless glottalic egressive laminal alveolar stop 6 1

Sweep technique st 4 1

Sega SFX SFX Voiceless pulmonic ingressive retroflex trill with labial 4 1
fricative

Trumpet T 4 1

Vocalized VTB Voiced pulmonic egressive alveolar trill 4 1


Tongue Bass

High Tongue HTB Voiced pulmonic egressive alveolar trill with narrowed 3 1
Bass airway behind the constriction

Kick Drum Bx Voiceless pulmonic egressive bilabial stop 2 1


exhale

Table 14. All the described beatboxing sounds that could be placed on a table, arranged by constrictor (top) and airstream mechanism (left). The lowest-frequency sounds are bolded.
Airstream | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Palatal | Dorsal | Laryngeal
Glottalic egressive | B | PF | | | t, CTB, ts, tch | D | | |
Glottalic ingressive | LLR, VLR, HLR | | | | HC | | | | u
Pulmonic egressive | LB, Bx | SonL, TW | | | TB, VTB, HTB | | | | hm
Pulmonic ingressive | ^LR, ^VLR, ^Ph | | | | | ^CR | | ^K | IB
Lingual egressive | SS | | | | CR | | | |
Lingual ingressive | LR | | | | WDT | | C, WDA | |
Percussive | b | pf | tbc | dc | ac, tll | | | |
(Linguolabial through retroflex are the coronal columns; the palatal column is front.)

Table 15. All the described beatboxing sounds that could be placed on a table, arranged by constrictor (top) and constriction degree (left). The lowest-frequency sounds are bolded.
Constriction degree | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Palatal | Dorsal | Laryngeal
Compressed | B, b, SS, ^Ph, Bx | PF, pf | tbc | dc | t, ts, ac, tll, WDT, HC, tch | D | C, WDA | ^K | u
Contacted | ^LR, ^VLR, LR, LLR, VLR, HLR, LB | | | | CTB, TB, CR, VTB, HTB | ^CR | | | hm, IB
Narrow | | SonL, TW | | | | | | |

Table 16. All the described beatboxing sounds that could be placed on a table, arranged by constrictor (top) and musical role (left). The lowest-frequency sounds are bolded.
Musical role | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Palatal | Dorsal | Laryngeal
Kick | B, b, Bx | | | | | D | | | u
Hi-Hat | | | (tbc) | (dc) | t, ts, (ac, tll) | | | |
Snare | SS, ^Ph | PF, pf | | | tch | | | ^K |
Roll | ^LR, ^VLR, LR, LLR, VLR, HLR | | | | CR | ^CR | | |
Bass | LB | | | | TB, CTB, VTB, HTB | | | | IB
Sound Effect | | SonL, TW | | | WDT, HC | | C, WDA | | hm

3.2.5 Quantitative periodicity analysis

Section 1 highlighted the difference between a system that is organized periodically with

combinatorial units (like speech) and a system that is organized to maximize distinctiveness

without repeated use of a small set of elements. So far we have seen that beatboxing sounds

do make repeated use of some phonetic properties. This means that beatboxing sounds are

combinatorial, and it also suggests that the sounds are not organized to maximize

distinctiveness by minimizing phonetic overlap. However, we have not established whether

the sounds are arranged periodically—that is, whether they appear to maximize the use of a

relatively small set of phonetic properties or appear to be distributed randomly in the

phonetic space they occupy.

The following quantitative assessment compares the periodicity of beatboxing sounds

against the periodicity of Standard American English consonants. The English consonant

system was chosen for convenience and because it has a similar number of sounds (22

beatboxing sounds will be used in this analysis; see below) and major phonetic dimensions:

23 English consonants spread across four manners of articulation, seven places of

articulation, and two voicing types (Table 19). The sound [l] is usually the 24th sound and

assumed to contrast with [r] in laterality, but since it is the only sound contrasting in

laterality it is set aside.

If beatboxing sounds are arranged periodically, then at least some sounds should be

expected to differ along only a single phonetic dimension. Two sounds that differ along only

a single dimension are a minimal sound pair. English minimal sound pairs include [p/b],

[b/m], and [t/s]. In beatboxing, the Kick Drum {B} is a minimal sound pair with the PF
Snare {PF}, Closed Hi-Hat {t}, and D Kick {D} in constrictor/place of articulation: all are

glottalic egressive and formed with a compressed constriction degree, but each is made with

different points of contact in the vocal tract. The Kick Drum is also in a minimal sound pair

with the Spit Snare {SS} and the Inward PH Snare {^Ph} along the dimension of airstream

mechanism. The first analysis (section [Link]) compares the minimal sound pair counts of

beatboxing and the English consonant system.

Periodic organization may also manifest as relatively high concentrations of sounds

along some phonetic dimensions and relatively few sounds in others. In a maximally

distributed system, on the other hand, no phonetic dimension should be used more than the

others. The second analysis (section [Link]) uses Shannon entropy as a metric of how

distributed the sounds are along different phonetic dimensions.

These analyses set aside some of the beatboxing sounds that arguably constitute

varieties of a single sound. The Open Hi-Hat {ts} could be considered a variety of Closed

Hi-Hat {t} that differs only in duration of the release. The unforced Kick Drum {b}, as well as

the percussives {pf} and {dc, ac}, are argued in Chapter 5: Alternations and Chapter 6:

Harmony to be context-dependent alternants of the glottalic egressive forced Kick Drum {B},

PF Snare {PF}, and Closed Hi-Hat {t}, respectively. Vocalized Liprolls (Inward or Outward),

as well as high/low Liprolls, are voiced variations on the theme of Liproll and Inward Liproll

(though Vocalized Liproll, High Liproll, and Low Liproll all require the Liproll to be

performed as glottalic ingressive rather than as lingual ingressive). The same goes for the

Vocalized Tongue Bass and High Tongue Bass as variants of the Tongue Bass. All sound sets

like these were consolidated into a single sound for these analyses. In the interest of more

closely matching the speech sound dimensions, the two narrow sounds Sonic Laser {SonL}

and Tooth Whistle {TW} were removed. Thus, the two-way voicing contrast of English

consonants matches the now-two-valued beatboxing constriction degree dimension. The

Water Drop (Air) {WDA} was also removed as it was not distinguishable from the Clop {C}

in this reduced feature system, as was {tch} for its similarity to {t}. From the set of sounds in

section 3.2.4, this analysis excludes {SonL, TW, b, pf, tbc, dc, ac, tll, ts, LLR, HLR, VTB, HTB,

^VLR, WDA, tch}. The 22 beatboxing sounds used in this analysis are shown in Table 17.

These final sound systems sacrifice some nuance. Many of the excluded beatboxing

sounds could be analyzed as genuine minimal sound pairs with each other and the remaining

sounds; their exclusion is meant to make the analysis as conservative as possible while

simplifying the minimal sound pair search method by trimming rarely used phonetic

dimensions. Likewise, there are simplifications to both the speech and beatboxing feature

spaces. Phonetically in speech, [f, v] are labiodental while [p, b, m] are bilabial, and [tʃ, dʒ]

are affricates not stops; consolidating them into labial and stop categories reduces the

number of dimensions available in the analysis. Similar choices were made throughout this

chapter for the beatboxing sounds—for example, the Spit Snare {SS} and PF Snare {PF} have

qualitatively different releases compared to the Kick Drum {B} but all are grouped under the

compressed constriction degree. In future analyses, it would be important to explore the

beatboxing dimension space more thoroughly.

[Link] Minimal sound pairs


Consider a hypothetical maximally distributed system of 21 sounds in a three-dimensional

phonetic system (a 6 x 7 x 2 matrix of airstream x place x constriction degree). Maximal

dispersion can be created by linearizing the three-dimensional space into an 84-element

one-dimensional vector, then assigning the 21 elements to the vector at every fourth location.

That is, starting with the first position, [ X _ _ _ X _ _ _ X _ …]. The vector is then

de-linearized back into a 6 x 7 x 2 matrix, resulting in the arrangement of elements shown in

Table 18. Minimal sound pairs are found by taking the Hamming distance of each element’s

three properties: airstream, place, and constriction degree. The Hamming distance counts

how many properties of two elements are different. For example, the first two elements

assigned into the maximally distributed matrix are a compressed glottalic egressive bilabial

sound and a compressed glottalic egressive palatal sound; since they differ only by the place

dimension, their Hamming distance would be 1 and they would be listed as a minimal sound

pair. (In the matrix these are encoded as [1 1 1] and [1 5 1], respectively; the only difference is

the middle number.) The third element assigned is a compressed glottalic ingressive

labiodental sound ([2 2 1] in the matrix) which has a Hamming distance of 2 with each of

the first two sounds—no minimal sound pairs there. The maximally distributed system yields

20 minimal sound pairs from 21 sounds in a 6 x 7 x 2 space (Table 18).
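The construction can be sketched in a few lines of Python (my own illustration, not the dissertation's code; it linearizes with constriction degree slowest, then airstream, then place, which matches the worked example of [1 1 1], then [1 5 1], then [2 2 1] above):

```python
# One way to realize the maximal-dispersion construction (illustrative only).
# Linearize the 6 x 7 x 2 space with degree slowest, then airstream, then
# place, and keep every fourth cell of the 84-element vector.
cells = [(a, p, d) for d in range(2) for a in range(6) for p in range(7)]
system = cells[::4]  # positions 1, 5, 9, ... of the linearized vector

# The spread is perfectly even across the seven places of articulation.
place_counts = [sum(1 for (a, p, d) in system if p == place) for place in range(7)]
print(len(system), place_counts)  # → 21 [3, 3, 3, 3, 3, 3, 3]
```

Spreading the 21 sounds this way puts exactly three at each place of articulation, which is what gives the maximally dispersed system its place entropy of log2 7 ≈ 2.81 bits in the entropy analysis later in this chapter.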

Calculated using the same distance technique, the actual distribution of 22

beatboxing sounds in the same 6 x 7 x 2 space yields 37 minimal sound pairs (Table 17). The

Standard American English consonant system has 23 sounds in a 4 x 7 x 2 (manner x place x

voicing) space with a total of 57 minimal sound pairs (Table 19). The speech system has

fewer dimensions and more sounds, both of which increase the likely number of minimal

sound pairs. Even so, just these three minimal sound pair counts on their own do not give a

sense of whether the beatboxing and English consonant sound systems are more periodic

than if they were arranged by chance. To gain a better sense of the periodicity, random sound

distributions were created to find the likelihood of the beatboxing and speech systems

having 37 and 57 minimal sound pairs, respectively, given the number of sounds and

dimensions in their systems.
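The Hamming-distance counting can be reproduced with a short script. The feature encoding below is my own reading of Table 17 (airstream, place, constriction degree for each sound), not the author's code:

```python
from itertools import combinations

# The 22 sounds of Table 17 encoded as (airstream, place, constriction degree).
sounds = {
    "B": ("glot-egr", "bilabial", "compressed"),
    "PF": ("glot-egr", "labiodental", "compressed"),
    "t": ("glot-egr", "alveolar", "compressed"),
    "CTB": ("glot-egr", "alveolar", "contacted"),
    "D": ("glot-egr", "retroflex", "compressed"),
    "VLR": ("glot-ing", "bilabial", "contacted"),
    "HC": ("glot-ing", "alveolar", "compressed"),
    "u": ("glot-ing", "laryngeal", "compressed"),
    "Bx": ("pulm-egr", "bilabial", "compressed"),
    "LB": ("pulm-egr", "bilabial", "contacted"),
    "TB": ("pulm-egr", "alveolar", "contacted"),
    "hm": ("pulm-egr", "laryngeal", "contacted"),
    "^Ph": ("pulm-ing", "bilabial", "compressed"),
    "^LR": ("pulm-ing", "bilabial", "contacted"),
    "^CR": ("pulm-ing", "retroflex", "contacted"),
    "^K": ("pulm-ing", "dorsal", "compressed"),
    "IB": ("pulm-ing", "laryngeal", "contacted"),
    "SS": ("ling-egr", "bilabial", "compressed"),
    "CR": ("ling-egr", "alveolar", "contacted"),
    "LR": ("ling-ing", "bilabial", "contacted"),
    "WDT": ("ling-ing", "alveolar", "compressed"),
    "C": ("ling-ing", "palatal", "compressed"),
}

def hamming(a, b):
    """Number of dimensions on which two feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))

# A minimal sound pair differs on exactly one of the three dimensions.
minimal_pairs = [(s1, s2)
                 for (s1, f1), (s2, f2) in combinations(sounds.items(), 2)
                 if hamming(f1, f2) == 1]
print(len(minimal_pairs))  # → 37
```

The same function applies to any inventory encoded as fixed-length feature tuples.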

Ten thousand (10,000) random sound systems were created for each domain using

the same method as the maximally distributed system except that the elements were placed

randomly instead of at every fourth location. For simulations of beatboxing sound

distributions, 22 sounds were arranged randomly in a 6 x 7 x 2 matrix; for simulations of

speech sound distributions, 23 sounds were randomly distributed in a 4 x 7 x 2 matrix.

Figures 72 (beatboxing) and 73 (English consonants) show histograms of how many

minimal sound pairs were found across all trials. The purple bar in each figure marks the

actual number of minimal sound pairs calculated from Tables 17 (beatboxing) and 19

(speech). The probability of the beatboxing sound system having 37 or more minimal sound

pairs is 17.69% (about 1 standard deviation from the mean); the probability of the English

consonant system having 57 or more minimal sound pairs is 0.16% (about 3 standard

deviations from the mean). Though not marked, the hypothetical maximally dispersed

system (~20 minimal sound pairs in Figure 72) is roughly as unlikely as the number of

minimal sound pairs in the English consonant system.
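The simulation can be sketched as follows (an illustrative re-implementation, not the original scripts; with a fixed seed the estimated mean and tail probability should land near the reported 33.34 and 17.69%):

```python
import random
from itertools import combinations, product

random.seed(1)  # fixed seed so the sketch is reproducible

# All 84 cells of the 6 airstream x 7 place x 2 constriction-degree space.
cells = list(product(range(6), range(7), range(2)))

def count_minimal_pairs(system):
    """Pairs of sounds that differ on exactly one of the three dimensions."""
    return sum(sum(x != y for x, y in zip(a, b)) == 1
               for a, b in combinations(system, 2))

# 10,000 random 22-sound systems, mirroring the beatboxing simulation.
counts = [count_minimal_pairs(random.sample(cells, 22)) for _ in range(10_000)]

mean = sum(counts) / len(counts)
p_37_or_more = sum(c >= 37 for c in counts) / len(counts)
print(round(mean, 2), round(p_37_or_more, 4))
```

Swapping in a 4 x 7 x 2 space with 23 sounds gives the corresponding simulation for the English consonant comparison.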

The number of minimal sound pairs found in beatboxing sounds (37) is somewhat

higher than the expected value of minimal sound pairs (mean=33). Compared to the

hypothetical maximally distributed system, this beatboxer’s sound system errs on the side of

being more periodic. However, the beatboxing distribution exceeds its chance expectation far

less than the well-ordered English consonant system exceeds its own. (For the

beatboxing system to be as periodic as the English consonant system in this analysis, there

would have needed to be 45 minimal beatboxing sound pairs.) Assuming that other

languages’ consonant systems share a similar well-orderedness (as has often been claimed),

beatboxing sounds are distributed less periodically than speech consonants.

[Link] Shannon entropy


Entropy is sometimes used as a metric for the diversity of a system, with higher values

representing greater dispersion (less predictability) (Shannon, 1948). As Table 17 shows, the

22 beatboxing sounds are mostly concentrated into labial (8 sounds) and alveolar (6 sounds)

constrictions, with the remaining 8 sounds spread across labiodental (1 sound), retroflex (2

sounds), palatal (1 sound), dorsal (1 sound), and laryngeal (3 sounds) constrictions.

Compared to the other systems’ place distributions, beatboxing has the lowest entropy (2.36

bits) which means it re-uses place features the most. The English consonants are slightly less

predictable (2.56 bits), and the maximally dispersed system has the greatest entropy (2.81

bits).
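The entropy values can be checked directly from the place-of-articulation counts (the tallies below are my own reading of the tables):

```python
from math import log2

def shannon_entropy(counts):
    """H = -sum(p * log2(p)) over the category proportions."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Place-of-articulation tallies:
beatboxing = [8, 1, 6, 2, 1, 1, 3]  # Table 17: bilabial ... laryngeal, 22 sounds
english = [5, 2, 6, 4, 1, 4, 1]     # Table 19: labial ... glottal, 23 consonants
maximal = [3] * 7                   # Table 18: 21 sounds spread evenly, 3 per place

for name, tally in [("beatboxing", beatboxing), ("English", english),
                    ("maximal", maximal)]:
    print(name, round(shannon_entropy(tally), 4))
# → beatboxing 2.3565, English 2.5618, maximal 2.8074
```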

It is not clear whether entropy is a useful metric of comparison for the other phonetic

dimensions. Take constriction degree as an example: the most straightforward comparison

would be between beatboxing’s three major constriction degrees (compressed, contacted,

and narrow; 1.33 bits) and a similar three-way system for English consonants—compressed

(stops, affricates, and nasals), narrow (fricatives), and approximants (1.42 bits). (This brings

the {SonL} and {TW} sounds back into the mix for a total of 24 beatboxing sounds.) This

comparison suggests that beatboxing sounds are slightly more predictable/less evenly

distributed along the constriction degree dimension. But the set of English consonants is

arguably more informative along the dimension of manner of articulation, not constriction

degree, and it makes less sense to compare distributions over two different parameter spaces.

The same goes for voicing (which English consonants often use contrastively but beatboxing

sounds do not) and airstream mechanism (where beatboxing sounds are distributed along

6-7 values while English consonants have one).

The safest conclusion to draw is that this beatboxer’s beatboxing sounds are more

unevenly distributed along the place dimension than the set of English consonants are,

suggesting that beatboxing has some periodicity but that it manifests more strongly along

some dimensions than others.

Table 17. 22 beatboxing sounds/sound families, 37 minimal differences. Compressed on the left, contacted on the right.
Airstream | Bilabial | Labiodental | Alveolar | Retroflex | Palatal | Dorsal | Laryngeal
Glottalic egressive | B | PF | t, CTB | D | | |
Glottalic ingressive | VLR | | HC | | | | u
Pulmonic egressive | Bx, LB | | TB | | | | hm
Pulmonic ingressive | ^Ph, ^LR | | | ^CR | | ^K | IB
Lingual egressive | SS | | CR | | | |
Lingual ingressive | LR | | WDT | | C | |

Table 18. 21 sounds with maximal dispersion, 20 minimal differences. Compressed on the
left, contacted on the right.
Airstream Bilabial Labiodental Alveolar Retroflex Palatal Dorsal Laryngeal

Glottalic egressive X X X X

Glottalic ingressive X X X

Pulmonic egressive X X X X

Pulmonic ingressive X X X

Lingual egressive X X X X

Lingual ingressive X X X

Table 19. 23 English consonants, 57 minimal differences ([l] conflated with [r]). Voiceless on the left, voiced on the right.
Manner | Labial | Dental | Alveolar | Postalveolar | Palatal | Velar | Glottal
Stop | p b | | t d | tʃ dʒ | | k g |
Nasal | m | | n | | | ŋ |
Fricative | f v | θ ð | s z | ʃ ʒ | | | h
Approximant | | | r | | j | w |

Table 20. Summary of the minimal sound pair and entropy (place) analyses for beatboxing, a hypothetical maximally distributed system, and English consonants.
System | # Sounds | # Min. sound pairs | Phonetic dimensions | Place entropy (bits)
Beatboxing | 22 | 37 | 7 place x 6 airstream x 2 constriction degree | 2.3565
Maximally distributed | 21 | 20 | 7 place x 6 airstream x 2 constriction degree | 2.8074
English consonants | 23 | 57 | 7 place x 4 manner x 2 voicing | 2.5618

Figure 72. Histogram of 10,000 random minimal sound pair trials in a 6 x 7 x 2 matrix. The
probability of a random distribution of 22 sounds having 37 (purple) or more (darker gray)
minimal sound pairs is 17.69% (95% confidence interval: 17.08–18.30%).
Range: 20-53. Mean: 33.34. Median: 33. Standard deviation: 3.95. Skewness: 0.31. Kurtosis: 3.20.

Figure 73. Histogram of 10,000 random minimal sound pair trials in a 4 x 7 x 2 matrix. The
probability of a random distribution of 23 sounds having 57 (purple) or more (darker gray)
minimal sound pairs is 0.16% (95% confidence interval: 0.14–0.19%). (The colors are not
visible because the bars counting random distributions with 57 minimal sound pairs are
vanishingly small.)
Range: 36-69. Mean: 46. Median: 46. Standard deviation: 3.73. Skewness: 0.38. Kurtosis: 3.26.

4. Discussion

4.1 Summary of analyses

Two analyses were performed to investigate the organization of beatboxing sounds: a

frequency distribution analysis and a phonetic feature analysis. The sounds of this

beatboxer’s beat patterns form a Zipfian frequency distribution, similar to the Zipfian

distribution of words in language corpora. Both systems rely on a few high-frequency items

that support the rest of the utterance. In English, these are function words (e.g., “the” or “and”)

that can be deployed in a wide variety of utterances and are likely to be used multiple times

in a single utterance. Words with lower frequency, on the other hand, are more informative

because they are less predictable—words like “temperature” are typically used in a relatively

restricted set of conversational contexts. In beatboxing, the most frequent sounds are the

Kick Drum, Closed Hi-Hat, PF Snare, and Inward K Snare. These sounds form the backbone

of musical performances and can be used flexibly in many different beat patterns. Infrequent

sounds like the Inward Clickroll add variety to beat patterns but may not be suitable

aesthetically for all beat patterns or prolonged use.
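To illustrate what the Zipfian claim amounts to, hypothetical token counts (illustrative only, not the dissertation's data) that fall off as 1/rank show the diagnostic log-log linearity with a slope near -1:

```python
from math import log

# Hypothetical token counts falling off as 1/rank (illustrative only; the
# dissertation's actual counts live in its frequency tables).
counts = [round(1200 / r) for r in range(1, 11)]

# A Zipfian distribution is roughly linear in log-log space: log f = c - s*log r.
xs = [log(r) for r in range(1, len(counts) + 1)]
ys = [log(f) for f in counts]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
print(round(slope, 2))  # → -1.0, the classic Zipfian signature
```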

As for the phonetic feature analysis, the primary aim was to determine whether or

not beatboxing sounds are composed combinatorially—and the answer seems to be that they

are. As described by Abler (1989), hallmarks of self-diversifying systems like speech and

chemical elements include sustained variation via combinations of elements (instead of

blending) and periodicity of those elements. This study does not provide evidence about

whether or how beatboxing sounds sustain variation, but it does provide evidence that

beatboxing sounds are composed of combinations of phonetic features. Beatboxing has

existed for at least two broadly defined generations (the old school and the new school), to

say nothing of the rapid rate at which beatboxing developed as an art form with cycles of

teaching and learning; since the beatboxer studied here is from the new school of

beatboxing, we can conclude that either the system has recently developed into a

combinatorial one or that the old school of beatboxing was also combinatorial and has

remained so over time. At the very least, no sounds in the inventory are a blend (i.e., an

average) of other sounds; on the contrary, sounds like the D Kick Roll and Inward Clickroll

with Liproll demonstrate that new sounds can be created by non-destructively combining

two existing sounds. That is, the components involved in the sounds separately are still

observable when the sounds are combined.

Section 3.2.5 showed that while beatboxing sounds are not organized with maximal

dispersion, they are also not nearly as periodic as the set of English consonants. In some

sense, the periodicity of the system diminishes as lower frequency sounds are added: the

most frequent sounds are all compressed sounds arranged neatly along major places of

articulation, and all but one (or two, if you count the unforced Kick Drum) are glottalic

egressive; the pulmonic ingressive outlier, the Inward K Snare, only deviates from the others

because it has a crucial homeostatic role to play. Although there is a tendency for later

sounds to pattern into either bilabial or alveolar constrictor and compressed or contacted

constriction degree, still the initial phonetic dimensions are broadened and more

dimensions are added without filling all the available phonetic space.

One reason for this may be that beatboxers do not learn beatboxing sounds like they

learn speech sounds. Speech is ubiquitous in hearing culture; when a child learns one or

more languages, they have an abundance of examples to learn from. Beatboxing is not

ubiquitous, so someone trying to learn beatboxing must usually actively seek out new

vocabulary items to add to their beatboxing inventory; and since it seems many beatboxers

do not start learning to beatbox until at least adolescence, the process of learning even a

single sound may be relatively slow. For a beatboxer who learns this way, their sound

inventory is more likely to be less periodic because there is no overt incentive to learn

minimal sound pairs. On the contrary, in the interest of broadening their beatboxing sound

inventory a beatboxer may be more motivated to learn sounds less like the others they

currently know.

As previewed at the ends of sections 3.2.3 and 3.2.4, a major shortcoming of this

periodicity analysis is the reliance on a fixed table structure. Sounds like the Water Drop

(Tongue), Water Drop (Air), Sonic Laser, D Kick Roll, Inward Clickroll with Liproll, Sweep

Technique, Sega SFX, and Trumpet use multiple constrictions that are relatively common

among the sounds but do not manifest in a tabular periodicity measurement. To take the

Water Drop (Tongue) as an example: it uses both labial and coronal constrictors with a

lingual ingressive (tongue body closure and retraction) airstream. Placing it in only the

coronal constrictor column causes the analysis to under-count the labial articulation; but

placing the sound in the labial column too would inflate the number of sounds that use

lingual ingressive airstream. Rather than looking for periodicity in whole sounds, it would be

better in the future to look for periodicity among individual vocal constrictions. Chapter 4:

Theory discusses this issue more and the possibility of treating these combinatorial

constrictions as cognitive gestures.

4.2 Implications

4.2.1 Contrastiveness in beatboxing

The notion that speech sounds have a relationship with each other—and are in fact defined

by this relationship—is a major insight of pre-generative phonology. Sapir (1925) for example

emphasized that speech sounds (unlike non-speech sounds) form a well-defined set within

which each speech sound has a “psychological aloofness” (1925:39) from the others, creating

relational gaps that encode linguistic information through contrast. Many phonological

theories assume that the fundamental informational units of speech are aligned to specific

phonetic dimensions and combine to make a larger unit (a segment). We have seen that

beatboxing sounds have meaning and that there is even a Zipfian organization to the use of

beatboxing sounds which implies that they have word-like meanings—that is, their meanings

are directly accessible by the speaker or beatboxer, as opposed to the featural or segmental

information of speech sounds which speakers generally do not have awareness of. Since

beatboxing sounds are combinatorial, does that make the individual phonetic dimensions

contrastive? Cognitive?

Beatboxing sounds clearly do not encode the same literal information as speech

sounds because beatboxing cannot be interpreted as speech. But 37 minimal sound pairs

were identified in a reductive three-dimensional framework of 22 beatboxing sounds, and the

less reductive system of over 40 sounds includes minimal differences in parameters like

voicing, double articulations, and double airstreams. The analysis in section [Link] may not

have found evidence for robust periodicity, but it did find that there are far more minimal

sound pairs in this beatboxer’s inventory than if the sounds were carefully arranged to

minimize compositionality. Changing one phonetic property of a beatboxing sound may

change the meaning of that sound just as changing one phonetic property of a word may

change the meaning of the word (e.g., changing the nasality of the final sound in “ban”

[bæn] results in “bad” [bæd]). In this sense, yes: the sounds of beatboxing are in a

contrastive relationship with each other.

Because the sounds of this beatboxer are not arranged very periodically, the contrasts

are not as neatly arranged as they are in speech. But even in speech contrast is a gradient

rather than a categorical phenomenon (Hockett, 1955): sounds may encode contrasts to

different degrees depending on the phonetic dimensions involved (e.g., the laterality of [l] in

English applies to only that one sound) or their role in larger constructions (e.g., [ŋ] never

occurs word-initially in English and so is not contrastive word-initially, whereas [n] is

contrastive word-initially and word-finally). Beatboxing sounds can contrast with each other

even if the contrasting system is not as dimensionally-efficient as a language’s contrastive

sound system.

Less clear is whether the differences between beatboxing sounds are also cognitive

differences. The answer depends in part on whether beatboxing sounds have phonological

patterning that is predictable based on certain phonetic dimensions. For example, velum

lowering is generally considered a cognitive gesture for nasality because nasality is active in

phonological behavior (e.g., spreading in phonological harmony); the velum raising that

makes oral sounds possible, on the other hand, is often considered inert because it does not

appear to play a role in phonological behavior. Whether any of the combinatorial dimensions

of beatboxing sounds are cognitive is taken up in detail in Chapter 6: Harmony.

4.2.2 Domain-general explanation for similar phonetic dimensions

Despite their phonetic similarities, beatboxing is not an offshoot of or parasite on phonology

(cf the vocal art form scatting which does draw on phonological well-formedness conditions

for the production of non-linguistic music; Shaw, 2008). For one thing, the lack of vowels

precludes the possibility that the near-universal CV syllable could exist in beatboxing. For

another, if beatboxing sounds were composed of linguistic phonological units then there

would be no pulmonic ingressive or lingual egressive beatboxing sounds because those do

not exist in language either (Eklund, 2008; cf. Hale & Nash, 1997 for lingual egressive sounds

in Damin).

Even so, we have seen conspicuous overlap between the combinatorial phonetic

dimensions leveraged by speech and beatboxing—shared constriction locations and

constriction degrees, some use of voicing and laterality, and overlapping airstream

mechanisms. This may be easily explained by domain-general approaches to speech (and

beatboxing) cognition. For example, the Quantal Theory (Stevens, 1989; Stevens & Keyser,

2010) deduces common phonological features by searching for regions in the vocal tract that

afford stable relationships between articulation and acoustics; the apparent universality of

features in speech is thus explained as arising from humans sharing the same vocal tract

physiology. But the relationship between articulation and acoustics in the vocal tract is not

special to speech—it is simply a property of the human vocal instrument, and so could just as

easily apply to beatboxing. The prediction would be that beatboxing and speech would share

many of the same phonetic features, which is indeed what we found here. Auditory theories

of speech could likewise apply to beatboxing audition, though to my knowledge there is no

work on beatboxing perception.

Chapter 4: Theory offers an explicit gesture-based approach to phonology and

beatboxing which capitalizes on the domain-general properties the systems share. That

chapter also includes a brief discussion of how a gestural description might encode

beatboxing contrast more effectively than the procrustean tables of sounds used here. Since

speech and beatboxing units are informationally unrelated to each other, purely

domain-specific theories of phonology cannot offer any explanation for why beatboxing and

speech might have similar structural units.

CHAPTER 4: THEORY

This chapter introduces a theoretical framework under which speech and beatboxing

phonological units are formally linked. Specifically, in the context of the task-dynamics

framework of skilled motor control, speech and beatboxing are argued to have atomic units

that share the same graph (that is, the same fundamental architecture) but may differ

parametrically in task-driven ways. Under the hypothesis from Articulatory Phonology that

information-bearing action units are the fundamental cognitive (phonological) units of

language, the graph-level link between speech and beatboxing actions becomes a cognitive

relationship. This cognitive link permits the formation of hypotheses about similarities and

differences between beatboxing and speech actions.

1. Introduction

At present, there is no theory of the cognitive structure of beatboxing or its fundamental

(motor) units. Therefore, there is no theoretically-motivated basis for drawing comparisons

between the atoms of speech and beatboxing or their organization. This chapter aims to

sketch such a theory of beatboxing fundamental units and their organization that can

provide a way of formally relating units in speech and beatboxing.

Dynamical systems are here used as the basis for understanding beatboxing units and

organization. The framework of task dynamics (Saltzman & Munhall, 1989) is commonly

used in Articulatory Phonology (Browman & Goldstein, 1986, 1989) to model the

coordination of a set of articulators in achieving the motor tasks (gestures) into which

speech can be decomposed. These task-based gestures are hypothesized to be isomorphic

with the fundamental cognitive units of speech. The coordination of the multiple units

composing speech is in turn modeled by coupling the activation dynamics of these units

(Nam & Saltzman, 2003; Goldstein et al., 2009; Nam et al., 2009). But task dynamics and the

coupling model are not speech-specific; they are inspired by nonlinguistic behaviors and can

be used to model any skilled motor task. Section 2 introduces concepts from dynamical

systems that will be the foundation of the link between speech and beatboxing. Section 3

argues that beatboxing sounds may be composed of gestures, and section 4 illustrates the

specific hypothesis that the fundamental units of beatboxing and speech share the same

domain-general part of the equations of task dynamics (the graph level). This establishes a

formal link between the cognitive units of speech and beatboxing that can serve as the basis

for comparison and hypothesis testing.

2. Dynamical systems and their role in speech

Articulatory Phonology hypothesizes that the fundamental units of phonology are action

units called “gestures” (Browman & Goldstein, 1986, 1989). Unlike symbolic features which

make no reference to time and only reference the physical vocal tract abstractly (if at all),

gestures as phonological action units vary in space and over time according to an invariant

differential equation (Saltzman & Munhall, 1989) that predicts directly observable

consequences in the vocal tract. While a gesture is active, it exerts control over a vocal tract

task variable (e.g., lip aperture) through coordinated activity in a set of articulators, in order

to accomplish some phonological task (e.g., a complete labial closure for the production of a

labial stop) as specified by the parameters of its differential equation. Phonological

phenomena that are stipulated through computational processes in other models emerge in

Articulatory Phonology from the coordinated overlap in time of gestures in an utterance.

Section 2.1 describes dynamical systems in terms of state, parameter, and graph levels.

Section 2.2 explains different point attractor dynamical systems and the usefulness of point

attractors as phonological units in speech.

2.1 State, parameter, and graph levels

The dynamical systems used to model phonological units and their organization can be

characterized with three levels: the state level, the parameter level, and the graph level

(Farmer, 1990; Saltzman & Munhall, 1992; see Saltzman et al., 2006 for a more thorough

introduction). The dynamical system in Equation 1 characterizes the movement of a

damped mass-spring and is commonly used as the basic equation for gestures in Articulatory

Phonology (Saltzman & Munhall, 1989):

Equation 1. 𝑥̈ = −𝑏𝑥̇ − 𝑘(𝑥 − 𝑥0)

State level. In Equation 1, the variables 𝑥, 𝑥̇, and 𝑥̈ all encode the instantaneous value of the

state variable(s) of the system: the first represents its position, the second represents its

velocity, and the third represents its acceleration. The state variables generally are the vocal

tract task variables referred to above, such as the distance between the tongue body and the

palate or pharynx (tongue body constriction degree) or the distance between the upper and

lower lip (lip aperture). The values of those state variables change continuously as vocal tract

articulators move to achieve the goal of the system.

Parameter level. The task goal of the system (𝑥0) is defined at the parameter level: it

does not change while this gesture is active. Other parameters in this equation are 𝑏 (a

damping coefficient) and 𝑘 (which determines the stiffness of the system—that is, how fast

the system moves toward its goal). Each phonological gesture is associated with its own

distinct parameters. For example, the lip aperture gestures for a voiceless bilabial stop [p] and a

voiceless bilabial fricative [ɸ] are different primarily in their aperture goal 𝑥0: the goal of the

stop is lip compression (parameterized as a negative value for 𝑥0), while the goal of the

fricative is a light closure or slight space between the lips (parameterized as a value for 𝑥0

near 0). (Parameter values change more slowly over time, such as when a person moves to a

new community and adapts to a new variety of their language). Thus, Equation 1 states that a

fixed relation defined by the phonological parameters holds among the physical state

variables at every moment in time that the gesture is active. This fixed relationship defines a

phonological unit.

Graph level. The graph level is the architecture of the system. Part of the architecture

is the relationship between states and parameters in an equation (Saltzman et al., 2006). For

example, notice that the term for the spring restoring force 𝑘(𝑥 − 𝑥0) in the mass-spring

system above is subtracted from the damping force 𝑏𝑥̇; if it were added instead, that would

be a change in the graph level of this dynamical system. The system’s graph architecture also

includes the number and composition of the equations in a system (Saltzman & Munhall,

1992). With respect to speech, composition crucially includes the specification of which tract

variables are active at any time.

Different graphs can result in qualitatively different behaviors. Changing the number

of equations in a system can create entirely different sounds. For example, the graph for an

oral labial stop [b] uses a lip aperture tract variable, but the graph for a nasal labial stop [m]

uses tract variables for both lip aperture and velum position. Alternatively, changing the

relationship between terms in an equation can affect how the same effector moves. Equation

2 shows the graph of a periodic attractor (Saltzman & Kelso, 1987); this type of dynamical

system describes the behavior of a repetitive action like rhythmic finger tapping or turning a

crank, which is qualitatively different from a point attractor system with a goal of a single

point in space. The graph for the periodic attractor in Equation 2 is modified from the

damped mass-spring system in Equation 1 by the addition of the term 𝑓(𝑥, 𝑥̇) which adds or

removes energy from the system to sustain the intended movement amplitude.

Equation 2. 𝑥̈ = 𝑓(𝑥, 𝑥̇) − 𝑏𝑥̇ − 𝑘(𝑥 − 𝑥0)

Taken together, state, parameter, and graph levels characterize a dynamical system. In

Articulatory Phonology and task-dynamics, the mental units of speech and their organization

are dynamical systems, and so the state, parameter, and graph levels characterize a speaker’s

phonology. Table 21 summarizes the roles of state, parameter, and graph levels in gestures

(i.e., Equation 1).

Table 21. Non-exhaustive lists of state-, parameter-, and graph-level properties for dynamical
systems used in speech.

System type: Gesture

State level: Position; Velocity; Acceleration; Activation strength; Blending strength

Parameter level: Target state; Stiffness; Strength of other movement forces (e.g., damping)

Graph level: System topology (e.g., point attractor); Tract variable selection; Selection of and relationship between parameter and state variables
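The three levels can be illustrated with a minimal numerical sketch. In the code below, the function fixes the graph (the architecture of Equation 1), a dictionary supplies gesture-specific parameters, and the loop evolves the state by Euler integration. The parameter values and the `lip_closure` name are illustrative, not fitted to articulatory data.

```python
import math

# Graph level: the architecture of Equation 1, x'' = -b*x' - k*(x - x0).
# This function fixes how states and parameters relate; changing it
# (e.g., flipping a sign) would be a graph-level change.
def gesture_acceleration(x, v, params):
    return -params["b"] * v - params["k"] * (x - params["x0"])

# Parameter level: values fixed while the gesture is active.
# These are illustrative, not fitted to articulatory data.
lip_closure = {"k": 100.0, "b": 2 * math.sqrt(100.0), "x0": 0.0}  # critically damped

# State level: position and velocity evolve continuously (Euler integration).
def simulate(x, v, params, dt=0.001, steps=5000):
    for _ in range(steps):
        a = gesture_acceleration(x, v, params)
        x, v = x + v * dt, v + a * dt
    return x

# Different initial apertures converge on the same target (equifinality).
final_wide = simulate(x=10.0, v=0.0, params=lip_closure)
final_narrow = simulate(x=2.0, v=0.0, params=lip_closure)
```

Running the same gesture from two different initial states drives both trajectories to the same goal, previewing the equifinality property discussed in section 2.2.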

2.2 Point attractors

Vocal tract movements in speech have certain characteristics that suggest what an

appropriate dynamical topology (graph) may be. Speech is produced by forming an

overlapping series of constrictions and releases in the vocal tract; each constriction affects

the acoustic properties of the vocal instrument in a specific, unique way such that

constrictions of different magnitudes and at different locations in the vocal tract create

distinctive acoustic signals. Each speech action can therefore be characterized as having a

relatively fixed spatial target for the location and degree of constriction. Moreover, speech

movements exhibit equifinality: they reach their targets regardless of the initial states of the

articulators creating the constriction or perturbations from external forces—as long as there

is enough time for the constriction to be completed and the same articulators are not being

used to try to achieve two incompatible constrictions simultaneously. Figure 74 demonstrates

how position and velocity change as a function of time during a spoken labial closure.

Figure 74. A lip closure time function for a spoken voiceless bilabial stop [p], taken from
real-time MRI data.

Point attractor dynamical systems generate precisely these qualities. Several different

differential equations can be used to model point attractor dynamics, and their goodness of

fit to the data can be assessed by comparing the model kinematics against the real-world

kinematics. For example, consider the first-order point attractor in Equation 3 in which 𝑥 is

the current spatial state of the system, 𝑥̇ is the system velocity, 𝑥0 is the system’s spatial

target, and 𝑘 > 0 is a constant that determines how quickly the system state changes.

Regardless of the starting value of 𝑥, the state always moves (asymptotically) toward 𝑥0—that

is, it is attracted to the target point 𝑥0.

Equation 3. 𝑥̇ = −𝑘(𝑥 − 𝑥0)

Figure 75. Schematic example of a spring restoring force point attractor.

But comparing the spring restoring force (Figure 75) to an actual speech movement (Figure

74) reveals that the details of the velocity state variable for this first-order point attractor are

not a good fit for speech kinematics. Speech movements generally start with 0 velocity and

have increasing velocity until they reach peak velocity sometime in the middle of the

movement trajectory, but this first-order spring system in Equation 3 begins at maximum

velocity and has only decreasing velocity over time. A kinematic profile that starts at peak

velocity is not an accurate portrayal of speech kinematics which tend to start at 0 velocity.
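This shortcoming of Equation 3 can be verified with a few lines of simulation (a sketch with illustrative parameter values):

```python
# First-order point attractor (Equation 3): velocity magnitude is maximal
# at movement onset and decays monotonically -- unlike speech movements,
# which begin at zero velocity. Parameter values are illustrative.
k, x0, x_init = 3.0, 0.0, 1.0
x, dt = x_init, 0.001
speeds = []
for _ in range(2000):
    v = -k * (x - x0)        # Equation 3
    speeds.append(abs(v))
    x += v * dt
peak_index = speeds.index(max(speeds))  # the very first sample is the peak
```

The state still converges on the target, but the peak of the speed profile sits at the first sample, confirming that the first-order system starts at maximum velocity.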

A better choice for modeling the dynamics of speech atoms is the damped

mass-spring system from Equation 1. When critically damped (𝑏 = 2√𝑘), the damped

mass-spring system acts as a point attractor: regardless of the initial starting state 𝑥, the state

of the system will converge toward its goal 𝑥0 and stay at that goal for as long as the system is

active. The position time series for a critically damped mass-spring system (Figure 76) results

in a somewhat better fit to the characteristic kinematic properties of speech movements

(Figure 74): velocity starts at 0, increases until it peaks, then gradually decreases again as 𝑥

approaches 𝑥0. However, the observed speech movement exhibits a more

symmetric velocity profile, with peak velocity about halfway through the gesture; the time

of peak velocity for the mass-spring equation in Equation 1 (Figure 76) is much earlier,

indicating that a different equation may have a better fit.

Figure 76. Schematic example of a critically damped mass-spring system.

A third point attractor with a different graph is the damped mass-spring system with a “soft

spring” that has been suggested to more accurately model the kinematics of vocal movement

(Sorensen & Gafos, 2016). This equation (Equation 4) has the same pieces as Equation 1,

plus a cubic term 𝑑(𝑥 − 𝑥0)³ that weakens the spring restoring force when the current state

is relatively far from the target state. In other words, the system won’t move as quickly

toward its target state at the beginning of its trajectory.

Equation 4. 𝑥̈ = −𝑏𝑥̇ − 𝑘(𝑥 − 𝑥0) + 𝑑(𝑥 − 𝑥0)³

Figure 77. Schematic example of a critically damped mass-spring system with a soft spring.

One of the most noticeable differences between the damped mass-spring systems with

(Figure 77) and without (Figure 76) the soft spring is the difference in the relative timing of

peak velocity. Both systems start out with 0 velocity and gradually increase velocity until

velocity reaches its peak; however, the system with the soft spring reaches its peak velocity

later than the system without the soft spring, which Sorensen and Gafos (2016) show is a

better fit to speech data (compare for example against the speech labial closure in Figure 74).

The critically-damped mass-spring system without the soft spring can result in this

kinematic profile if gestures have ramped activation—that is, rather than treating gestures as

if they turn on and off like a light switch, increasing a gesture’s control over the vocal tract

gradually like a dimmer switch also delays the time to peak velocity (Kröger et al., 1995; Byrd

& Saltzman, 1998, 2003). Sorensen & Gafos (2016) argue that the dynamical system with the

soft spring term should be preferred to the simpler damped mass-spring system with ramped

activation to preserve a gesture’s intrinsic timing (see Fowler, 1980).
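The delay in time to peak velocity can be checked with a small simulation comparing the two graphs (parameter values are illustrative, not those fitted by Sorensen & Gafos, 2016):

```python
import math

# Time-to-peak-velocity for the plain critically damped system (Equation 1,
# d = 0) versus the soft-spring variant (Equation 4, d > 0). Parameter
# values are illustrative, not those fitted by Sorensen & Gafos (2016).
def time_of_peak_velocity(d, k=10.0, x0=1.0, dt=0.0005, steps=20000):
    b = 2 * math.sqrt(k)            # critical damping for the linear spring
    x, v = 0.0, 0.0
    t_peak, v_peak = 0.0, 0.0
    for i in range(steps):
        a = -b * v - k * (x - x0) + d * (x - x0) ** 3
        x, v = x + v * dt, v + a * dt
        if abs(v) > v_peak:
            v_peak, t_peak = abs(v), (i + 1) * dt
    return t_peak

t_plain = time_of_peak_velocity(d=0.0)  # Equation 1
t_soft = time_of_peak_velocity(d=8.0)   # weaker restoring force far from target
```

Because the cubic term weakens the restoring force early in the movement, the soft-spring system builds velocity more slowly and reaches its velocity peak later than the plain critically damped system.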

Details of equation architecture (graph) aside, point attractor dynamical systems are

useful as speech units for many reasons. A variety of speech phenomena can be accounted

for by specifying the temporal interval during which a point attractor exerts control in the

vocal tract. For example, if a gesture’s dynamical system does not last long enough for the

gesture to reach its goal, gestural undershoot might lead to phonological alternation e.g.,

between a stop and a flap or fricative (Parrell & Narayanan, 2018). Alternatively, if a gesture

remains active after reaching its target state, the gesture is prolonged; this is one account of

the articulation of geminates (Gafos & Goldstein, 2011). The temporal coordination of two or

more gestures can result in spatio-temporal overlap that may account for certain types of

phonological contrasts and alternations (Browman & Goldstein, 1992). Some types of

phonological assimilation, harmony, and epenthesis can be described as resulting from the

temporal overlap of gestures (Browman & Goldstein, 1992). And when two gestures are active

at the same time over the same vocal tract variable(s), those gestures blend together,

resulting in coarticulation.

All in all, point attractor topologies are advantageous as models of gestures for a

variety of reasons. Section 4 argues that the point attractor system that gestures share are

advantageous for beatboxing as well.

3. Beatboxing units as gestures

Beatboxers have mental representations of beatboxing sounds. Beatboxers are highly aware

of the beatboxing sounds in their inventory and the differences between many of those

sounds. They give names to most of the sounds—though the names may differ from language

to language or beatboxer to beatboxer as they do for the “power kick” (Paroni et al., 2021)

and “kick drum” which are both names for a bilabial ejective that fulfills a particular musical

role. Likewise, a crucial component of beatboxing pedagogy is associating a name and a

sound; beatboxers who want to learn a sound they heard someone else perform first need to

know the name of the sound so they can ask for instruction. The naming and identification

of beatboxing sounds suggest that skilled beatboxers can distinguish a wide variety of vocal

articulations within the context of beatboxing. (Unsupervised classifier models can also

reliably group beatboxing sounds into clusters based on the acoustic signal via MFCCs;

Paroni et al., 2021). As distinct objects, each associated with some meaning and available to

abstract thought, beatboxing sounds can be thought of as segment-sized mental

representations.

Chapter 3: Sounds motivates looking at the phonological-level patterning of

beatboxing in terms of gestures rather than more traditional phonetic dimensions. Gestures

in speech are cognitive representations that fill the role of the most elementary abstract,

compositional units of phonological information. Information implies phonological contrast:

changing, removing, adding, or re-timing a gesture can often change the meaning of a word.

Chapter 3: Sounds argued that beatboxing sounds are composed compositionally in the same

way that speech sounds are—though without the same degree of periodicity/feature

economy—and that there are a not insubstantial number of minimal sound pairs for which

changing one articulator task, one gesture, can change the meaning of the sound.

Superficially, at least, this is similar to the use of speech gestures.

Gestures are particularly advantageous for describing beatboxing patterns because a

gestural description of a sound can incorporate time and complex multi-articulator

combinations that are difficult or impossible to manage in a symbolic, IPA chart-type

phonetic notation (though foregoing symbols and charts sacrifices some brevity). But the

actual number of tract variable gestures used to create the different beatboxing sounds is

relatively small. Sounds like the D Kick Roll and Open Hi-Hat do not fit into a table simply

because they use too many tract variables to occupy any one cell.

Others like the Water Drop (Air) re-use the same tract variable multiple times (in this case,

the tongue body constriction location). A gestural approach to describing beatboxing can

account for these types of cases in a way that looking at the sounds in a table cannot. If the

number of tract variables used really is small, then a gestural perspective might even show

that the periodicity of beatboxing sounds is comparable to the periodicity of speech sounds.

Beyond the descriptive convenience, evidence for gestures as primitives of

information in the cognitive system underlying beatboxing would require a more complete

inventory of contrastive beatboxing gestures as well as evidence that these gestures play a

role in defining natural classes for characterizing beatboxing alternations (on analogy to

phonological alternations in speech). The role of gestures in characterizing beatboxing

natural classes is investigated in Chapter 5: Alternations and Chapter 6: Harmony. A

complete inventory is beyond the scope of this work, and the task might in principle be

impossible—beatboxing sounds appear to be an open set, and the gestures themselves might

come and go as a beatboxer’s sound inventory changes.

That said, it is possible to hazard some educated guesses about what a set of

beatboxing gestures might include. Frequently-used constrictors like the lips and the tongue

tip are likely to be associated with beatboxing gestures at compressed and contacted

constriction degrees (that can be encoded in task dynamics as constriction degree goals of a

negative value and 0, respectively). Since beatboxing involves a wider range of mechanisms

for controlling pressures in the vocal tract than does speech, contrasting gestures to initiate

pressure changes in the vocal tract would seem to be required: pulmonic, laryngeal, and

lingual tasks, along with contrasts in the goal value of such gestures (increased or decreased

pressure). Voicing seems to make a difference in some sounds too, and at the very least is the

only clearly identifiable property of humming and the Inward Bass.

Suspiciously, almost all of these hypothetical gestures use the same vocal tract

variables as speech and with ultimately similar pressure control aims, though not necessarily

with speechlike targets or configurations. The pulmonic task (for increased vs. decreased

pressure) is the only one of these not attested to be used contrastively in speech, although

pulmonic ingressive airflow is used somewhat commonly around the world as a sort of

pragmatic contrast to non-ingressive speech (Eklund, 2008). But the point is not that speech

and beatboxing are built on the same set of gestures—the contrastive use of pulmonic

airstreams as well as the use of lateral labial constrictions (not reported in this dissertation

because there was only a midsagittal view) rules out the possibility that beatboxing is a

reconfiguration of speech units. Rather, the point is that a gesture-based account may work

well for both speech and beatboxing. In this sense, just as Articulatory Phonology is based

around the hypothesis that gestures are the fundamental units of speech production and

perception, a similar hypothesis for beatboxing phonology is that beatboxing actions

controlled by task dynamic differential equations are the fundamental units of beatboxing

production and perception. The next sections are dedicated to developing an understanding

of how gestures are recruited differently for speech and beatboxing while simultaneously

linking the two domains through the potential of the vocal instrument.

4. A dynamical link between speech and beatboxing

The state, parameter, and graph levels of the differential equations in task dynamics provide

an explicit way to formally compare and link speech and beatboxing sounds. Beatboxing and

speech actions use the same vocal tract articulators to create sound, which means they are

constrained by the same physical limits of the vocal apparatus. In task dynamics, these

limitations constrain the graph dynamics and parameter space of actions available to a vocal

system. Functional speech-specific and beatboxing-specific constraints can further delineate

and refine each domain’s graph dynamics and parameter space; even so, as this section

argues, the actions in both domains appear to use the same point attractor topologies, tract

variables, and coordination, all of which indicate that speech and beatboxing share the same

graph.

4.1 Hypothesis: speech and beatboxing share the same graph

These graph properties appear to be shared by speech and beatboxing: the individual actions

are point attractors (section 4.1.1) operating mostly over the same tract variables as speech

gestures (section 4.1.2) with similar timing relationships (section 4.1.3). In addition, coupled

oscillator models of prosodic structure have been used to account for both speech and

musical timing, making them a good fit for beatboxing as well (section 4.1.4).

4.1.1 Point attractor topology

Point attractors have been used as models of action units for behaviors other than speech,

even behaviors without the kinds of phonological patterns that speech has (Shadmehr, 1998;

Flash & Sejnowski, 2001). Goldstein et al. (2006:218) “view the control of these units of

action in speech to be no different from that involved in controlling skilled movements

generally, e.g., reaching, grasping, kicking, pointing, etc.”

Beatboxing and speech sounds leverage the same vocal tract physics: wider

constrictions (as in sonorants) alter the acoustic resonances of the vocal tract, and narrower

constrictions or closures obstruct the flow of air to create changes in mean flowrate and

intraoral pressure and generate acoustic sources. Moreover, beatboxing and speech both have

discrete sound categories, like a labial stop [p] in speech and a labial stop Kick Drum {B} in

beatboxing. Creating discrete sounds requires vocal constrictions with specific targeted

constriction locations and degrees (Browman & Goldstein, 1989). As discussed in section 2.2,

point attractors are ideal for such actions.

The kinematics of many beatboxing movements bear a strong resemblance to the

kinematics created by speech point attractor gestures: they start slow, increase velocity until a

peak somewhere near the middle of the movement, and slow down again as the target

constriction is attained (Figure 78). This suggests that beatboxing actions share both

qualitative and quantitative point attractor graphs with speech gestures.

4.1.2 Tract variables

Part of the graph level is the specification of which task variables are active at any time.

Beatboxing and speech operate over the same vocal tract organs and therefore have access to

(and are limited to) the same vocal tract variables. In Chapter 3: Sounds it was established

that many beatboxing sounds resemble speech sounds in constriction degree and location.

The specific tract variables used by each behavior may not completely overlap. Beatboxers

for example sometimes use lateral bilabial constrictions but there are no speech task

variables for controlling laterality—due partly to the difficulty of acquiring relevant lateral

data to know what a lateral task variable might be, but also to the fact that laterals in speech

are always coronal and can be modeled by adding an appropriate dorsal gesture to a coronal

gesture. Such a strategy would not work for modeling lateral labials. Overall, though, speech

and beatboxing movements are more similar than they are different.

Figure 78. Position and velocity time series for labial closures for a beatboxing Kick Drum {B}
(left) and a speech voiceless bilabial stop [p] (right). Movements were produced by the same
individual, tracked using the same rectangular region of interest that encompassed both the
upper and lower lips. Average pixel intensity time series in the region of interest were
smoothed using locally weighted linear regression (kernel = 0.9, Proctor et al., 2011; Blaylock,
2021), and velocity was calculated using the central difference method as implemented in
the DelimitGest function (Tiede, 2010). Both movements were extracted from a longer,
connected utterance (a beatbox pattern with the Kick Drum and the phrase “good pants”
from a sentence produced by the same beatboxer). See Chapter 2: Method for details of data
acquisition.
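The core of the velocity computation described in the caption is a central-difference estimate, sketched below in simplified form (the actual pipeline additionally applies the smoothing described above; this is not the DelimitGest implementation itself):

```python
# Central-difference velocity estimation for a sampled position signal:
# v[i] = (x[i+1] - x[i-1]) / (2*dt), with one-sided differences at the edges.
def central_difference(x, dt):
    n = len(x)
    v = [0.0] * n
    for i in range(1, n - 1):
        v[i] = (x[i + 1] - x[i - 1]) / (2 * dt)
    v[0] = (x[1] - x[0]) / dt
    v[-1] = (x[-1] - x[-2]) / dt
    return v

dt = 0.01
positions = [(i * dt) ** 2 for i in range(100)]  # x(t) = t^2, so v(t) = 2t
velocities = central_difference(positions, dt)
```

Central differences are exact for quadratic signals, which makes this toy example a convenient sanity check.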

The physics of sound manipulation in the vocal tract are the same for speech and

beatboxing: different constriction magnitudes and locations along the vocal tract result in

different acoustics. Some regions of the vocal tract are more stable than others, meaning that

variation of constriction location within some regions results in little acoustic change; these

stable regions are argued to shape the set of distinctive contrasts in a language so that

coarticulation does not dramatically alter the acoustic signal and lead to unwanted percepts

(Stevens, 1989; Stevens & Keyser, 2010). Though beatboxing does not have linguistically

contrastive features to convey, parity must still be achieved between an expert beatboxer and

a novice beatboxer in order for learning to occur. Beatboxers exploit the same vocal physics

to maximize transmission of the beatboxing signal, resulting in beatboxers leveraging the

same vocal tract variables.

4.1.3 Intergestural coupling relationships

The relative timing of two speech gestures can make a meaningful difference within a word.

For example, the timing of a velum lowering gesture makes all the difference between “mad”

[mæd] (velum timed to lower at the beginning of the word) and “ban” [bæn] (velum timed to

lower closer to the end of the word). Timing between gestures can also be contrastive

within a single segment, like the relative timing of the oral closure gesture and laryngeal

lowering that distinguishes voiced plosives and voiced implosives (Oh, 2021).

A common model of intergestural timing in Articulatory Phonology is a system of

coupled periodic timing oscillators or “clocks” (Nam & Saltzman, 2003; Goldstein et al.,

2009; Nam et al., 2009). While a clock is running, its state (the phase of the clock)

continually changes just like the hands on a clock move around a circle. Each clock is

responsible for triggering the activation of its associated gesture(s) in time; the triggering

occurs when a clock’s state is equal to a particular activation phase. Thinking back to the

graph level, coupling two oscillators means that the dynamical equation for each oscillator

includes a term corresponding to the state (phase) of the oscillator(s) to which it is coupled;

thus, the phase of each oscillator at any time depends on the phases of the other oscillators

to which it is coupled. This inter-clock dependency is a major advantage of the oscillator

model of intergestural timing: the phases of coupled clocks settle into different modes like

in-phase (0 degree difference in phase) or anti-phase (180 degree difference in phase) that

result in gestures being triggered synchronously or sequentially, respectively. The state,

parameter, and graph components of the coupled oscillator model are given in Table 22.

Table 22. Non-exhaustive lists of state-, parameter-, and graph-level properties for coupled
timing oscillators (periodic attractors).

System type: Coupled oscillators

State level: Phase

Parameter level: Activation/deactivation phase; Oscillator frequency; Coupling strength & direction; Coupling type (in-phase, anti-phase)

Graph level: Number of tract variables; Intergestural coupling; Selection of and relationship between parameter and state variables

In-phase coupling between a consonant gesture and a vowel gesture results in a CV syllable

or mora; it is also used intrasegmentally for consonants with more than two gestures, for

example a voiceless stop with both an oral constriction gesture and a glottal opening gesture.

Anti-phase coupling between a vowel and consonant typically results in a sequential

nucleus-coda syllable structure. Anti-phase coupling may also exist in some languages

between consonants in an onset cluster, with all the consonants coupled in-phase to the

vowel but anti-phase to each other, resulting in what has been described as the C-Center

effect (Browman & Goldstein, 1988).
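The settling of coupled clocks into a stable phase mode can be sketched numerically. In the sketch below (frequency and coupling strength are illustrative), each oscillator's phase rate includes a term depending on the other's phase, and the relative phase is attracted to 0 (in-phase); flipping the sign of `a` would make anti-phase (a 180 degree difference) the stable mode instead.

```python
import math

# Two coupled timing "clocks": each phase rate depends on the other clock's
# phase. With this coupling sign the relative phase is attracted to 0
# (in-phase); flipping the sign of `a` makes anti-phase stable instead.
# Frequency and coupling strength are illustrative.
omega, a, dt = 2 * math.pi, 1.5, 0.001
phi1, phi2 = 2.0, 0.0              # start 2 radians apart
for _ in range(20000):             # 20 seconds of settling
    dphi1 = omega + a * math.sin(phi2 - phi1)
    dphi2 = omega + a * math.sin(phi1 - phi2)
    phi1 += dphi1 * dt
    phi2 += dphi2 * dt
relative_phase = (phi1 - phi2) % (2 * math.pi)  # settles near 0 (in-phase)
```

Subtracting the two rate equations shows why: the relative phase ψ = φ1 − φ2 obeys ψ̇ = −2a sin ψ, which has a stable fixed point at ψ = 0.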

The specific timing relations needed to model beatboxing are unclear at the moment,

and it is not clear if beatboxing needs a coupled oscillator model of timing per se. On the one

hand, beatboxing does not usually feature wide vowel-like constrictions, so there does not

appear to be anything quite like a CV syllable in beatboxing, much less something like a

syllable coda; in general, beatboxing sounds are coordinated with the alternating rhythmic

beats (section 4.1.4), so intergestural coupling relations might usually be relevant only among

the component gestures of a given beatboxing sound. On the other hand, there is clear

evidence for intra-segmental timing relationships that may benefit from a coupled oscillator

approach. Some of the most common beatboxing sounds are ejectives, and these require

careful coordination between the release of an oral constriction and the laryngeal

closing/raising action that increases intraoral pressure (Oh, 2021); the same is likely true for

lingual and pulmonic beatboxing sounds. In addition, some beat patterns feature two

beatboxing sounds coordinated to the same metrical beat, resulting in sound clusters like a

Kick Drum followed closely by some kind of trill. This kind of relationship between sounds

and the meter suggests that the beatboxing sounds in these clusters may be coupled with

each other in some way.

4.1.4 Coupled timing oscillators

Hierarchical prosodic structure in speech has also been modeled using coupled oscillators,

including syllable- and foot-level oscillators (Cummins & Port, 1998; Tilsen, 2009; Saltzman

et al., 2008; O’Dell & Nieminen, 2009). The cyclical nature of oscillators matches the ebb and

flow of prominence in some languages, including stress languages that alternate (more or

less regularly) stressed and unstressed syllables.

In Chapter 2: Method, it was shown that the musical meter in styles related to

beatboxing have highly regular, hierarchically nested strong-weak alternations. Coupled

oscillators are well-suited for modeling these types of rhythmic alternations in music (e.g.,

Large & Kolen, 1994): each oscillator contributes to alternations at one level of the hierarchy,

and the oscillators to which it is coupled have either half its frequency (hierarchically
“above”, with slower alternations) or double its frequency (hierarchically “below”, with

faster alternations), yielding a stable 1:2 frequency coupling relationship between each level.

Other rhythmic structures like triplets can be modeled by temporarily changing oscillator

frequencies (a parameter level change).
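A minimal sketch of 1:2 entrainment between two metrical levels follows (oscillator names and values are illustrative). A faster "beat" oscillator is entrained at double the frequency of a slower "bar" oscillator through a coupling term defined over the generalized relative phase ψ = φ_fast − 2φ_slow:

```python
import math

# A faster "beat" oscillator entrained at double the frequency of a slower
# "bar" oscillator via a 1:2 coupling term; the generalized relative phase
# psi = phi_fast - 2*phi_slow is driven toward 0. Values are illustrative.
omega_slow, a, dt = 2 * math.pi, 2.0, 0.0005
phi_slow, phi_fast = 0.0, 1.0      # start misaligned by 1 radian
for _ in range(40000):             # 20 seconds
    dslow = omega_slow
    dfast = 2 * omega_slow - a * math.sin(phi_fast - 2 * phi_slow)
    phi_slow += dslow * dt
    phi_fast += dfast * dt
psi = (phi_fast - 2 * phi_slow) % (2 * math.pi)  # settles near 0
```

Here ψ̇ = −a sin ψ, so the fast level locks stably at exactly twice the slow level's frequency; changing the frequency ratio (e.g., for triplets) would be a parameter-level change, as noted above.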

4.2 Tuning the graph with domain-specific parameters

Speech and beatboxing share the same set of vocal organs, each of which has its own

mechanical potential and limitations for any movement. Tasks are constrained by the

physical abilities of the effectors that implement them; in the task dynamics model, this is

represented as a constraint on the range of values of each dynamical parameter that fits into

a given graph. Therefore, speech and beatboxing share both their graph structures and a

physically-constrained parameter space.

Within that physically-constrained parameter space, the difference between two

speech gestures that use the same tract variable is encoded by different parameter values.

Different constriction targets (represented as x0 in Equation 1) can lead to different manners

of articulation, with a narrow constriction target for a fricative, a lightly closed constriction

target for a trill, or a compression target for a stop. For a given sound, the selection of a tract

variable (or tract variables) and the associated learned parameter values are part of a

person’s knowledge about their language (and may differ slightly from person to person for a

given language).

Gestures can be viewed as available “pre-linguistically” (Browman & Goldstein, 1989):

the action units that become gestures are not inherently linguistic, but are harnessed by the

language-user to be used as phonological units. This is accomplished by tuning the

parameters of a gesture to generate a movement pattern appropriate for the information

contrasts relevant to language being spoken. The same pre-linguistic actions can be

harnessed for non-linguistic purposes, including beatboxing; they may simply require

different parameter tuning associated with their domain-specific function.

This tuning of speech gestures to functionally-determined values is spelled out within

the task-dynamics framework (Saltzman & Munhall, 1989) as the specification of values at

the parameter level of a dynamical equation described in section 2.1. When a gesture is

implemented, the task-specific parameter values for that gesture are applied to the system

graph. This application is depicted in Figure 79. The point attractor graph space on the left

represents the untuned dynamical system that is (by hypothesis) the foundational structure

of every gesture (Saltzman & Munhall, 1989). Learned parameters associated with a

particular speech task are summoned from the phonological lexicon to tune the dynamical

system, like the intention to form a labial constriction for a /b/, represented in the figure as

an unfilled (dark) circle. The result of this tuning is a speech action—a phonological gesture,

represented in the figure as a filled (light) circle.

Figure 79. Parameter values tuned for a specific speech unit are applied to a point attractor
graph, resulting in a gesture.

Figure 80. Speech-specific and beatboxing-specific parameters can be applied separately to the same point attractor graph, resulting in either a speech action (a gesture) or a beatboxing action. Applying appropriately tuned parameters to a graph specializes the action for one domain or another.

As argued above, speech and beatboxing actions can both be described as point attractors

operating over a shared set of tract variables, though the use of those tract variables

sometimes differs between the two domains. With respect to parameter tuning in

task-dynamics, this simply means that beatboxing actions use the same point attractor graph

as speech but with beatboxing-specific parameter values (Figure 80). This is one way of

establishing a formal link between beatboxing and speech in task dynamics: the atomic

actions of each behavior share the same graph, but differ by domain-specific parameter

values.2
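The separation depicted in Figure 80 between a shared graph and domain-specific parameter values can be sketched in code. The class names and numeric values below are hypothetical illustrations, not constructs from the task-dynamics literature:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PointAttractorGraph:
    """The untuned dynamical system: a point attractor defined over
    a tract variable, shared by speech and beatboxing."""
    tract_variable: str

@dataclass(frozen=True)
class Parameters:
    """Domain-specific tuning; target and stiffness values here are
    invented for illustration."""
    domain: str
    target: float
    stiffness: float

def tune(graph, params):
    """Applying tuned parameters to the shared graph yields a
    domain-specialized action (a gesture, if the domain is speech)."""
    return {"tract_variable": graph.tract_variable, **vars(params)}

lip_aperture = PointAttractorGraph("lip aperture")
speech_b = tune(lip_aperture, Parameters("speech", target=-1.0, stiffness=80.0))
kick_drum = tune(lip_aperture, Parameters("beatboxing", target=-2.0, stiffness=120.0))
# Same graph, different parameter values -> different domain-specific actions.
```

The point of the sketch is structural: the graph object is reused unchanged, and only the parameter bundle distinguishes a speech /b/ from a beatboxing Kick Drum.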

What determines the parameter values for speech sounds and beatboxing sounds?

The answer lies in the intention behind each behavior: beatboxing actions create musical

information, speech actions communicate lexico-semantic information, and the dynamical

parameters are tuned accordingly. For example, beatboxing and some languages both feature

bilabial ejectives in their system of sounds. A beatboxing bilabial ejective is a Kick Drum, and

has a particular aesthetic quality to convey, so its labial and laryngeal gestures may have

different targets, stiffnesses, and inter-gestural coordination compared to a speech bilabial

ejective which contributes to the communication of a linguistic message.

Beatboxing phonology—including its fundamental contrastive and cognitive units and

the interplay between those units—arises from the interaction between the physiological

constraints of vocal sound production and the broader tasks of beatboxing, just as the

fundamental contrastive and cognitive units of speech and the interplay between those units

arise from the interaction between the same constraints and the tasks of speech. Gestures are a useful way of modeling this interaction in both domains because they encode both domain-specific intention and domain-general abilities/constraints. The possible parameter values for a given gesture are constrained both by the physical limitations of the system and by domain-specific task requirements.

2 As noted earlier, an alternative hypothesis is that beatboxing is “parasitic” on speech, recombining whole speech gestures—including existing phonological parameterizations—into the set of beatboxing sounds. This seems unlikely because the tract variables and target values used by speech and beatboxing do not fully overlap. Beatboxing does not adopt the speech gestures used for making approximants and vowels. More to the point, English-speaking beatboxers use lateral labial gestures, constrictions that make trills, and a variety of non-pulmonic-egressive airstreams, none of which are attested in the phonology of English. Even if one were to assume an innate, universal set of phonological elements for beatboxing to pilfer from, the lack of attestation of phonologically contrastive pulmonic ingressive and lingual egressive units rules them out from the set of universal features—since beatboxing has them, it must have gotten them from somewhere else besides speech. For illumination by comparison: there are vocal music genres like scatting (Shaw, 2008) that do seem to be parasitic on speech gestures and phonological patterns; these behaviors sound speechlike, and beatboxing does not.

This approach to speech and beatboxing is in some sense a formalization of the

anthropophonic perspective of speech sound. The term anthropophonics originated with Jan

Baudouin de Courtenay as part of the distinction between the physical (anthropophonic)

and the psychological (psychophonic) properties of speech sounds. Catford (1977) defines

anthropophonics as a person’s total sound-producing potential, referring to all the vocal

sound possibilities that can be described (general phonetics) of which the whole set of

speech possibilities is only a subset (linguistic phonetics). Lindblom (1990) adopted the

anthropophonic perspective as part of the broader program of deducing the properties of

speech from non-speech phonetic principles, specifically with respect to the question of how

to define a possible sound of speech (cf. Ladefoged, 1989). Particularly as used in the vein of

Catford and Lindblom, anthropophonics is about taking domain-general vocal potential—all

of the possible vocal sound-making strategies and configurations—and understanding how

domain-specific tasks filter all that potential into a coherent system. The dynamical

formalization accomplishes this by encoding domain-general possibilities at the graph level

and domain-specific tasks in the control parameters.

5. Predictions of the shared-graph hypothesis

The argument so far is that speech and beatboxing are domain-specific tunings of a shared

graph. Moreover, by the hypothesis of Articulatory Phonology that the actions composing

speech are also the fundamental cognitive units of speech, the graph-level link between

speech and beatboxing is a domain-general cognitive link between speech and beatboxing

sounds. This is how similarities and differences between speech and beatboxing phonology

can be predicted: any phenomenon that could emerge due to the nature of the graph in one

domain is fair game for the other (but task-specific phenomena, including which units are

selected for production and the task-specific parameters of those units, are not). Likewise,

any hypotheses made about speech graphs may therefore manifest in the beatboxing graph

as well, and vice-versa. For example, the Gestural Harmony Model (Smith, 2018)

hypothesizes two new graph elements: a persistence parameter that allows a gesture to have

no specified ending activation phase, and an inhibitive type intergestural coupling

relationship by which one gesture inhibits the activation of another. In doing so, the model

simultaneously makes predictions about the parameter space and coupling graph options

that beatboxing has access to as well. It turns out that beatboxing fulfills these predictions as

described in Chapter 6: Harmony.

The proposed graph-level link also introduces a new behavioral possibility: that

speech and beatboxing sounds may co-mingle and be coordinated as part of the same motor

plan. After all, no part of the framework outlined above precludes the simultaneous use of a

point attractor with speech parameters and a point attractor with beatboxing parameters.

People do not spontaneously or accidentally beatbox in the middle of a typical sentence, but
during vocal play speakers may for fun mix sounds that are otherwise unattested in their

language variety into their utterances, and beatboxers sometimes use words or phrases

as part of their music. But the clearest evidence for the existence of speech-and-beatboxing

behavior (and support for the graph-level link) is the art form known as beatrhyming, the

simultaneous production of speech (i.e., singing or rapping) and beatboxing by an individual.

Beatrhyming shows that humans can take full advantage of the flexibility of the motor

system to blend two otherwise distinct tasks into a brand new task. Beatrhyming is discussed

more thoroughly in Chapter 7: Beatrhyming.

There are alternatives to gestures as the fundamental beatboxing units. Paroni et al.

(2021) suggest the term boxeme be used to mean a distinct unit of beatboxing sound,

analogous to a phoneme. Boxemes are posited as the building blocks of beatboxing

performances; since beatboxers explicitly refer to these individual sounds in the composition

of a beat pattern, the notion seems to be that every sound that can be differentiated from

another sound (by name, acoustics, or articulation) is a boxeme candidate. Given the

evidence that beatboxing sounds are composites of smaller units, a phoneme-like boxeme

could be said to be composed of symbolic beatboxing features. (Paroni et al., 2021 do not

commit to either a symbolic or dynamical approach, and “boxeme” may simply be a useful,

theory-agnostic way to refer to a meaningful segment-sized beatboxing sound unit; for the

sake of argument, we assume that the clear connection to “phoneme” is meant to imply a

symbolic perspective.)

As mental representations for speech, gestures and phonemes are two very different

hypotheses for the encoding of abstract phonological information: phonemes are purely

domain-specific, abstract, symbolic representations composed of atomic phonological

features that are not deterministic with respect to the physical manifestation of a sound.

Gestures on the other hand are simultaneously abstract and concrete (domain-specific and

domain-general) by virtue of their dynamical representation—a specific differential equation

that is predicted to be observably satisfied at every point in time during which a gesture is

being produced. Gestures are particularly advantageous for treating timing relationships (at

multiple time scales) as part of a person’s phonological knowledge. In this sense, the

difference between a beatboxing gesture and a beatboxing feature would similarly be a

difference between units that are both domain-specific and domain-general and units that

are purely domain-specific. As discussed in Chapter 1: Introduction, gestures are the

preferred choice of representation when attempting to draw comparisons between speech

and beatboxing because their partly domain-general nature creates explicit, testable links

between the domains. Symbolic boxemes and phonemes, on the other hand, have no basis

for comparison with each other, no intrinsic links to each other, and no basis for one making

predictions about the other because they are defined purely with respect to their own

domain.

CHAPTER 5: ALTERNATIONS

This section addresses whether “forced” {B} and “unforced” {b} varieties of Kick Drum are

cognitively distinct sound categories or cognitively related, context-dependent alternatives of

a single sound category. It is shown that forced and unforced Kick Drums fulfill the same

rhythmic role in a beat pattern, with unforced Kick Drums generally occurring between

sounds with dorsal constrictions and forced Kick Drums generally occurring elsewhere. The

forced and unforced Kick Drums therefore appear to be context-dependent alternations of a

single Kick Drum category, similar to phonological alternations observed in speech.

1. Introduction to Kick Drums

The Kick Drum mimics the kick drum sound of a standard drum set. It is typically

performed as a voiceless glottalic egressive bilabial plosive, also known as a bilabial ejective

(de Torcy et al. 2013, Proctor et al. 2013, Blaylock et al. 2017, Patil et al. 2017, Underdown

2018). Figure 81 illustrates how one expert beatboxer from the rtMRI beatboxing corpus

produces a classic, ejective Kick Drum. First a complete closure is made at the lips and glottis

(Figure 81a), then larynx raising increases intraoral pressure so that a distinct “popping”

sound is produced when lip compression is released (Figure 81b).

Figure 81. Forced/Classic Kick Drum. Larynx raising, no tongue body closure.

a. b.

Many labial articulations produced by this beatboxer during connected beatboxing

utterances (“beat patterns”) were clearly identifiable as classic ejective Kick Drums during

the transcription process based on observations of temporally proximal labial closures and

larynx raisings. These Kick Drums in beat patterns qualitatively matched the production of

the Kick Drum in isolation (albeit with some quantitative differences, e.g., in movement

magnitude of the larynx).

However, some sounds produced with labial closures in the beat patterns of this data

set did not match the expected Kick Drum articulation—nor were they the same as other

labial articulations like the PF Snare (a labio-dental ejective affricate) or Spit Snare (a

buccal-lingual egressive bilabial affricate). These “mystery” sounds had labial closures and

release bursts most similar to those of the Kick Drum, but were generally produced with a

tongue body closure and without any larynx raising. These differences are visible in a

comparison of Figure 81 (the Kick Drum) with Figure 82 (the mystery labial): in Figure 81,

the tongue body never makes a constriction against the palate or velum, and a bright spot at

the top of the trachea indicates that the vocal folds are closed; but in Figure 82, the tongue

body is pressed against a lowered velum, and the lack of a bright spot indicates that the vocal

folds are spread apart.

Figure 82. Unforced Kick Drum. Tongue body closure, no larynx raising.

a. b. c.

Based both on consultation with beatboxers and on the analysis that follows below, this

mystery labial sound has been identified as what is known in the beatboxing community as

an “unforced Kick Drum”—a “weaker” alternative to the more classic ejective “forced” Kick

Drum, and which does not have a common articulatory definition (compared to the forced

Kick Drum, which beatbox researchers have established is commonly an ejective) (Tyte &

SPLINTER, 2014; Human Beatbox, 2018). Given the clear dorsal closure, one might expect

that the unforced Kick Drum would be performed as a lingual (velaric) ingressive (clicklike)

or egressive sound. However, preliminary analysis suggests that the unforced Kick Drum is a

“percussive” (Pike, 1943), referring to a lack of ingressive or egressive airstream during the

production of this sound (not to be confused with percussion in the musical sense). Figure

83 illustrates this via comparison to the Spit Snare, a lingual egressive bilabial sound: the Spit

Snare reduces the volume of the chamber in front of the tongue through tongue fronting and

jaw raising (Figure 83, left), whereas the unforced Kick Drum does neither (Figure 83, right).

Figure 83. Spit Snare vs Unforced Kick Drum. The Spit Snare (left) and unforced Kick Drum
(right) are both bilabial obstruents made with lingual closures. The top two images of each
sound are frames representing time of peak velocity into the labial closure and initiation of
movement out of the labial closure (found with the DelimitGest function of Tiede [2010]).
The difference between frames (bottom) was generated using the imshowpair function in
MATLAB’s Image Processing Toolbox. In both images, purple pixels near the lips indicate
that the lips are closer together in the later frame than in the first. For the Spit Snare, the
purple pixels near the tongue indicate that the tongue moved forward between the two
frames, and the green pixels near the jaw indicate that the jaw rose. For the unforced Kick
Drum, the relative lack of color around the tongue and jaw indicates that the tongue and jaw
did not move much between these two frames.

Not all beatboxers appear to be aware of the distinction between forced and unforced Kick

Drums—or if they are aware, they do not necessarily feel the need to specify which type of

Kick Drum they are using. Hence, while the expert beatboxer in this study did not identify

the difference between forced and unforced Kick Drums and chose to produce only forced

Kick Drums in isolation, they made liberal use of both Kick Drum types in beat patterns

throughout the data acquisition session, as shown in Chapter 3: Sounds.

For another example of beatboxers not distinguishing between forced and unforced

Kick Drums: during an annotation session in the early days of this research, a

researcher-beatboxer of self-assessed intermediate skill involved with this project

demonstrated a beat pattern featuring only sounds with dorsal articulations (a common

strategy used for the practice of phonating while beatboxing, as discussed in Chapter 6:

Harmony). In the beat pattern, she produced several of what we now recognize as unforced

Kick Drums—sounds that act as Kick Drums but have a dorsal articulation instead of an

ejective one. But when asked to name the sound, she simply called it “a Kick Drum,” not

specifying whether it was forced or unforced and apparently not noticing (or caring about,

for that beat pattern) the difference.

The parallel to similar observations about speech is striking. English speakers who

have a sense that words are composed of sounds can often recognize the existence of a

category of sounds like /t/, but may not be aware that it manifests differently (sometimes

very differently) in production depending on a variety of factors including its phonological

environment. In the same way, beatboxers are aware of the Kick Drum sound category but

may not always be aware of the different ways it manifests in production. In symbolic

approaches to phonology, this type of observation has been used to argue for the existence of

abstract phonological categories (e.g., phonemes) with context-dependent alternants

(allophones). In Articulatory Phonology, much of allophony is accounted for by gestural

overlap: instead of categorical changes from one allophone to another depending on context,

the gestures for a given sound are invariant and only appear to change when co-produced

with gestures from another sound (Browman & Goldstein, 1992; see Gafos & Goldstein, 2011

for a review). In either approach, there is a single sound category (a phoneme or gestural

constellation) the manifestation of which varies predictably and unconsciously based on the

sounds in its environment.

Do beatboxers treat forced and unforced Kick Drums as alternate forms of the same

sound category? If so, forced and unforced Kick Drums would be expected to be members of

the same class of sounds and to occur in complementary distribution, conditioned by their

phonetic environments. Articulatory Phonology’s account of allophony via temporal overlap

furthermore predicts that the constriction that makes the difference between the sounds will

come from a nearby sound’s gesture. Assuming that the forced Kick Drum is the default

sound because it was the one produced in isolation by the beatboxer, the tongue body

closure characterizing the unforced Kick Drum is predicted to be a gesture associated with

another sound nearby. Establishing the first criterion, that the forced and unforced Kick

Drums are members of the same class of sounds, is done with a musical analysis. A

subsequent phonetic analysis looks for evidence that the two Kick Drums are in complementary distribution, conditioned by tongue body closures of nearby sounds. Both

analyses are summarized below.

The musical analysis takes into account that beatboxing sounds are organized into

meaningful musical classes. Musical classes of sounds have aesthetically-conditioned metrical

constraints that can be satisfied by any sound in the class; for example, although snare

sounds as a class are generally required on beat 3 (the back beat) of any beatboxing

performance, the requirement can be accomplished with any sound from the class of snares

including a PF Snare, a Spit Snare, or an Inward K Snare. The members of a musical class of

sounds are not necessarily alternations of the same sound—PF Snares and Inward K Snares

are not argued here to be context-dependent variants of an abstract snare category. But for

forced and unforced Kick Drums to be alternants of the same category, they minimally must

belong to the same musical class. Because sounds in a musical class have metrical occurrence

restrictions, a test of musical class membership is to observe whether forced and unforced

Kick Drums are performed with the same rhythmic patterns and metrical distributions. If

they are not, then they are not members of the same musical class and therefore cannot be

alternants of a single abstract category.3 (The names of the sounds clearly imply that

beatboxers treat the forced Kick Drum and unforced Kick Drum as two members of the Kick

Drum musical class; the musical analysis below illustrates this relationship in detail.)

The phonetic analysis is to note the phonetic environment of each Kick Drum type

and look for patterns in the gestures of those environments. Complementary distribution is

found if the phonetic environments of the two types of Kick Drum are

non-overlapping—that is, the selection of a forced or unforced Kick Drum should be

predictable based on its phonetic environment. This type of analysis is performed in many

introductory phonology classes where complementary distribution is often taken as evidence

for the existence of phonemes with multiple allophones.
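The logic of such a complementary-distribution test can be sketched mechanically. The toy tokens and the dorsal sound set below are invented for illustration; the actual analysis in Sections 2 and 3 is based on the transcribed rtMRI data.

```python
# Each token: (kick_type, preceding sound, following sound).
# The sound labels and the DORSAL set are hypothetical illustrations.
DORSAL = {"K", "Inward K Snare", "g"}

tokens = [
    ("forced", "PF Snare", "Closed Hi-Hat"),
    ("forced", "rest", "PF Snare"),
    ("unforced", "K", "g"),
    ("unforced", "g", "Inward K Snare"),
]

def environment(prev, nxt):
    """Classify an environment as inter-dorsal or elsewhere."""
    return "inter-dorsal" if prev in DORSAL and nxt in DORSAL else "elsewhere"

envs = {"forced": set(), "unforced": set()}
for kind, prev, nxt in tokens:
    envs[kind].add(environment(prev, nxt))

# Complementary distribution holds if the two environment sets do not overlap.
complementary = envs["forced"].isdisjoint(envs["unforced"])
```

In this toy data the unforced tokens occur only inter-dorsally and the forced tokens only elsewhere, so the test succeeds; real data would require a richer environment classification.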

Sections 2 and 3 below establish that in this data set, forced and unforced Kick Drums

are in fact environmentally-conditioned alternations of a Kick Drum sound category: they

share the same rhythmic patterning (Section 2.1), but unforced Kick Drums are mostly found between two dorsal sounds whereas forced Kick Drums have a wider distribution (Section 2.2). The unforced Kick Drum therefore appears to be a Kick Drum that has assimilated to an inter-dorsal environment (and lost its laryngeal gesture in the process). This account of the data will be reinforced in Chapter 6: Harmony when it is shown that unforced Kick Drums often emerge due to tongue body harmony.

3 It may be useful in future analyses to consider the possibility that some sounds vary by metrical position or otherwise exhibit positional allophony. Guinn & Nazarov (2018) suggest that phonotactic restrictions on place prevent coronals from occurring in metrically strong positions; perhaps those restrictions are part of a broader pattern of allophony.

2. Analyses

Beat patterns were transcribed into drum tab notation from real-time MRI videos as

described in Chapter 2: Method. Based on those transcriptions, section 2.1 shows that

unforced Kick Drums have a similar rhythmic distribution to forced Kick Drums,

particularly on beat 1 of a beat pattern. Section 2.2 shows that unforced Kick Drums appear to

have a fairly restricted environment, occurring mostly between two dorsal sounds. The two

findings combined suggest that forced and unforced Kick Drums are alternative

contextually-conditioned manifestations of a Kick Drum category (discussed in Section 3).

From this point forward, the ejective (classic/forced Kick Drum) version will be

written in Standard Beatbox Notation {B}, whereas the unforced Kick Drum will be written

in Standard Beatbox Notation {b} (Tyte & SPLINTER, 2014). (Note that uppercase vs

lowercase in Standard Beatbox Notation cannot always be interpreted as a forced vs unforced

distinction. For example, the Closed Hi-Hat is considered a forced sound, but is written with

a lowercase {t}.)

2.1. Rhythmic patterns of Kick Drums

Forty beat patterns were identified as containing a forced Kick Drum, unforced Kick Drum,

or both. One beat pattern with forced Kick Drums was omitted because it also included

unusually breathy (possibly Aspirated) Kick Drums, which are not the subject of this analysis.

Of the remaining thirty-nine beat patterns, all but six were exactly four measures long; for

this analysis, the six longer beat patterns were truncated to just the first four measures. An

exception was made for beat pattern 38 (Figure 86), which comes from the same

performance as beat pattern 28 (Figure 84). The originating beat pattern was 32 measures

long; the first section (measures 1-4, beat pattern 28) used forced Kick Drums whereas the

last section (measures 29-32, beat pattern 38) used both forced and unforced Kick Drums,

and the two sections were judged to have sufficiently distinctive beat patterns that they could

both be included in the analysis.

A total of 40 four-measure Kick Drum patterns were sorted into three groups: 28 beat

patterns that only contain forced Kick Drums (Figure 84), 7 beat patterns that only contain

unforced Kick Drums (Figure 85), and 5 beat patterns that contain both forced and unforced

Kick Drums (Figure 86).

There are many possible forced Kick Drum patterns (Figure 84), but three particular

details will facilitate comparison to unforced Kick Drums. First, in all beat patterns but one

the forced Kick Drum occurs on the very first beat of the very first measure (27/28 cases,

96.4%, beat patterns 2-28). Second, in several cases the Kick Drum occurs on beats 1, 2+, and

4 of the first and third measures (9/28 cases, 32.1%, beat patterns 18-26). And third, 7 of those

same 9 beat patterns feature Kick Drums on 1+ and 2+ of measure 2, with similar patterns in

measure 4 (beat patterns 19-25). There are fewer beat patterns that use unforced Kick Drums

to the exclusion of forced Kick Drums (Figure 85), but the unforced Kick Drums in these

beat patterns have similar patterns to the ones just described for forced Kick Drums above.

First, in all but one beat pattern the unforced Kick Drum occurs on beat 1 of measure 1 (6/7

cases, 85.7%, beat patterns 30-35). Second, the Kick Drum tends to also occur on beats 1, 2+,

and 4 of the first and third measures (5/7 cases, 71.4%, beat patterns 31-35). And third, 4 of

those same 5 beat patterns feature Kick Drums on 1+ and 2+ of measure 2, with similar

patterns in measure 4 (beat patterns 32-35).

Figure 84. Forced Kick Drum beat patterns.


1) B|------x---------|--x---x---------|------x---------|--x---x---------
2) B|x---------------|----------------|x-----------x---|----------------
3) B|x---------------|----x-----------|x---------------|----x-----------
4) B|x---------------|----x-----------|x---------------|----x-----------
5) B|x---------------|x---------x-----|x---------------|x-----x---------
6) B|x--------------x|x---x-----------|x--------------x|x---x-----------
7) B|x--------------x|x---x-----------|x--------------x|x---x-----------
8) B|x--------------x|x---x-----------|x--------------x|x---x-----------
9) B|x--------------x|x---x-----------|x--------------x|x---x-----------
10) B|x-------------x-|----x-----------|x-------------x-|----x-----------
11) B|x-----------x---|----------------|x-----------x---|----------------
12) B|x-----------x---|----------------|x-----------x---|----------------
13) B|x-----------x---|----x-----------|x-----------x---|----x-----------
14) B|x-----------x---|----x-----------|x-----------x---|----x-----------
15) B|x-----------x---|----x-----------|x-----------x---|----x-----------
16) B|x-----------x---|----x-------x---|x-----------x---|----x-----------
17) B|x-----------x---|----x-----x---x-|------------x---|----x-----------
18) B|x-----x-----x---|--x-------------|x-----x-----x---|--x-x-----------
19) B|x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---------
20) B|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x---x-----x---
21) B|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x---x-----x---
22) B|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x---x-----x---
23) B|x-----x-----x---|--x---x---x-----|x-----x-----x---|--x---x---x-----
24) B|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x-------x-----
25) B|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x-----
26) B|x-----x-----x---|-x----x---------|x-----x-----x---|------x---------
27) B|x---x-----x---x-|--x-x-----x-----|x---x-----x---x-|--x-x-----x-x---
28) B|x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
| Measure 1 | Measure 2 | Measure 3 | Measure 4

Figure 85. Unforced Kick Drum beat patterns.
29) b|------x---------|--x---x---------|x-----x-----x---|--x---x---x-x---
30) b|x---------------|x---------------|x---------------|x---------------
31) b|x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
32) b|x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---x---x-
33) b|x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
34) b|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x---x-
35) b|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
| Measure 1 | Measure 2 | Measure 3 | Measure 4
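The metrical labels cited above (beats 1, 2+, 4, and so on) can be read directly off a drum tab line. A sketch, assuming sixteen sixteenth-note slots per '|'-delimited measure as in the tabs shown here:

```python
def beat_positions(tab_line):
    """Map each 'x' in a drum tab line to (measure, beat label),
    assuming 16 sixteenth-note slots per measure, labeled
    1, 1e, 1+, 1a, 2, 2e, ... as in standard drum tab counting."""
    suffix = {0: "", 1: "e", 2: "+", 3: "a"}
    hits = []
    for m, measure in enumerate(tab_line.split("|"), start=1):
        for slot, ch in enumerate(measure):
            if ch == "x":
                hits.append((m, f"{slot // 4 + 1}{suffix[slot % 4]}"))
    return hits

# Beat pattern 31 (unforced Kick Drums, Figure 85):
line31 = "x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---"
hits31 = beat_positions(line31)
# Measure 1 hits fall on beats 1, 2+, and 4, matching the description above.
```

This is the same bookkeeping used informally in the prose: slot 0 is beat 1, slot 6 is beat 2+, slot 12 is beat 4.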

Even the two beat patterns in which a Kick Drum does not occur on beat 1 of measure 1

(beat pattern 1 of Figure 84 and beat pattern 29 of Figure 85) are similar: both have a single

Kick Drum on beat 2+ of measure 1, followed by two Kick Drums on beats 1+ and 2+ of

measure 2. (These beat patterns without Kick Drums on the first beat seem exceptional

compared to the rest of the beat patterns that do have Kick Drums on beat 1. Examining the

real-time MRI reveals that there are, in fact, labial closures on beats 1 and 4 of measure 1 in

both of these beat patterns, mimicking the common pattern of Kick Drums on beats 1, 2+,

and 4 of the first measure. The labial closures on beats 1 and 4 are co-produced with other

sounds on the same beat—a Lip Bass in the case of the forced Kick Drum (Figure 84, beat

pattern 1), and a Duck/Meow sound effect in the case of the unforced Kick Drum (Figure 85,

beat pattern 29). While many of the other beat patterns also feature Kick Drums

co-produced with other sounds on the same beat, the labial closures on beats 1 and 4 in these

two exceptional beat patterns have no acoustic release corresponding to the sound of a Kick

Drum, and so are absent from the drum tab transcription.)

Figure 86 shows five cases of beat patterns with both forced and unforced Kick

Drums. Each beat pattern is presented with both forced {B} and unforced {b} Kick Drum

drum tab lines as well as a “both” drum tab line that is the superposition of the two types of

Kick Drum. Notice that the two types of Kick Drum never interfere with each other (i.e., by

occurring on the same beat); on the contrary, they are spaced apart from each other in ways

that create viable Kick Drum patterns. This is especially noticeable in beat patterns 36, 37,

and 40: the Kick Drums collectively create a pattern of Kick Drums on beats 1, 2+, and 4 of the

first measure, one of the common patterns described above (Figure 84, patterns 18-26); but

neither the forced nor the unforced Kick Drums accomplish this pattern alone—the pattern

is only apparent when the two Kick Drum types are combined on the same drum tab line.

Beat patterns 38 and 39 demonstrate that even inconsistent selection of forced and

unforced Kick Drums can still yield an appropriate Kick Drum beat pattern. In beat pattern

38, the first two measures feature mostly forced Kick Drums while the second two measures

feature mostly unforced Kick Drums; despite this, the resulting Kick Drum beat pattern is

clearly repeated with Kick Drums on beats 1, 2+, and 4 of the first and third measures as well

as beats 1+ and 2+ of the second and fourth measures. Likewise in beat pattern 39: even

though the penultimate Kick Drum is the only unforced Kick Drum, it contributes to

repeating the beat pattern from the first two measures.

Figure 86. Beat patterns with both forced and unforced Kick Drums.
36) B|x-----------x---|----x-------x---|x-----------x---|----x-------x---
b|------x---------|----------------|------x---------|----------------
both|x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

37) B|x-----------x---|--x-------------|x-----------x---|--x-------------
b|------x---------|----------------|------x---------|----------------
both|x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

38) B|x-----x-----x---|--x-------------|x---------------|----------------
b|----------------|------x---------|------x-----x---|--x---x---------
both|x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

39) B|x-----x-----x---|--x---x---x-----|x-----x-----x---|------x---------
b|----------------|----------------|----------------|--x-------------
both|x-----x-----x---|--x---x---x-----|x-----x-----x---|--x---x---------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

40) B|------x---------|------x---x-----|------x---------|------x---x-----
b|x-----------x---|--x-----------x-|x-----------x---|--x-------------
both|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x-----
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

In summary: forced and unforced Kick Drums fill the same metrical positions. When they

occur together in the same beat pattern, their joint patterning resembles typical Kick Drum

patterns—that is, they fill in each other’s gaps. For a beatboxer, this finding is probably

unsurprising. After all, the sounds are both just varieties of “Kick Drum”, so it makes sense

that their occurrences in musical performances would be similar.

But notice now that out of 40 beat patterns, only 5 used both forced and unforced

Kick Drums to build Kick Drum patterns; the remaining 35 beat patterns used either forced

or unforced Kick Drums, but not both. In fact, even in 3 of the 5 beat patterns with both

types of Kick Drums, the metrical distribution of Kick Drums is highly regular. For example,

in beat pattern 36 of Figure 86, unforced Kick Drums only occur on beat 2+ of measures 1

and 3. If forced and unforced Kick Drums are both fulfilling the role of Kick Drum in these

beat patterns, why do they not appear together in the same beat pattern more often? Why do

they not occur in free variation? The next section demonstrates that although forced Kick

Drums and unforced Kick Drums are members of the same musical class, their distribution

is conditioned by the articulations of the musical events around them—similar to some

phonological alternations.

2.2 Phonological environment

2.2.1 Method

Beat patterns were encoded as PointTiers as described in Chapter 2: Method. The PointTier

linearizes beat pattern events into sequences, even when two events are metrically on the

same beat. Most of the time this is desirable; even though a Kick Drum and Liproll may

occur on the same beat, the Kick Drum is in fact produced first in time followed quickly by

the Liproll. However, this linearization is undesirable for laryngeal articulations like

humming which may in fact be simultaneous with co-produced oral sounds, not sequential.

Figure 87 shows a sample waveform and spectrogram in which acoustic noise and the release

of oral closures may hide the true onset of voicing. Humming articulations that were

annotated in drum tabs as co-occurring on the same beat as an oral sound were removed,

leaving only oral articulations. Each beat pattern’s PointTier representation was converted to

a string in MATLAB using mPraat (Bořil & Skarnitzl, 2016).

Environment types. Similar to some classical phonological analyses, trigram

environments were created from these beat patterns (i.e., {C X D}, where {C} and {D} are two

beat pattern events and {X} is a forced or unforced Kick Drum). Each unique trigram in the

corpus of beat patterns is called an environment type. To ensure that each Kick Drum was in

the middle of an environment type, each beat pattern was prefixed with an octothorpe (“#”)

to represent the beginning of a beat pattern and suffixed with a dollar sign (“$”) to represent

the end of a beat pattern. An utterance-initial unforced Kick Drum before a Clickroll {CR}

might therefore appear as the trigram {# b CR}, and an utterance-final forced Kick Drum

after a Closed Hi-Hat would be {t B $}. The set of unique environment types was generated

using the Text Analytics MATLAB toolbox. Forced Kick Drums were found in 141

environment types; unforced Kick Drums were found in 54 environment types.
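The trigram construction can be sketched outside of MATLAB as well. The following Python fragment is an illustrative analogue of the mPraat/Text Analytics pipeline, not the code actually used; it pads a linearized beat pattern with "#" and "$" and counts the trigram environment types around each Kick Drum:

```python
from collections import Counter

def kick_drum_environments(events, targets=("B", "b")):
    """Collect trigram environment types {C X D} around each Kick Drum.

    The linearized beat pattern is prefixed with "#" and suffixed with "$"
    so that every Kick Drum sits in the middle of a trigram. Co-produced
    hums {hm} are assumed to have been removed already."""
    seq = ["#"] + list(events) + ["$"]
    return Counter(
        (seq[i - 1], seq[i], seq[i + 1])
        for i in range(1, len(seq) - 1)
        if seq[i] in targets
    )

envs = kick_drum_environments(["b", "dc", "tbc", "b", "SS"])
print(envs)  # Counter({('#', 'b', 'dc'): 1, ('tbc', 'b', 'SS'): 1})
```

The keys of the resulting counter are the unique environment types; the values are their token frequencies.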

Environment classes. Since a major articulatory difference between the forced and

unforced Kick Drums appears to be the presence (for unforced Kick Drums) or absence (for

forced Kick Drums) of a dorsal articulation, the unique trigram environment types were

grouped into environment classes4 based on the dorsal-ness of the sounds adjacent to the

Kick Drum. These environment classes are generalizations that highlight the patterns of Kick

Drum distribution with respect to dorsal-ness.

4
Linguists would traditionally be looking for “natural” classes here. The term “environment class” skates around
issues of “naturalness” in speech and beatboxing, but the methodological approach to classifying a sound’s
phonological environment is essentially the same.
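As a concrete sketch of this grouping step, the fragment below (Python, illustrative only) maps a trigram environment type to its environment class. The set of dorsal-closure sounds is an assumed subset of the inventory from Chapter 3: Sounds, not the full list:

```python
# Illustrative subset of sounds produced with a tongue body (dorsal) closure.
DORSAL = {"SS", "CR", "LR", "WDT", "WDA", "dc", "tll", "tbc", "^K"}

def side(symbol):
    """Classify one trigram neighbor as a boundary, [+ dorsal], or [- dorsal]."""
    if symbol in ("#", "$"):
        return symbol
    return "[+ dorsal]" if symbol in DORSAL else "[- dorsal]"

def environment_class(trigram):
    """Generalize a trigram environment type {C X D} to its environment class."""
    before, _, after = trigram
    return f"{side(before)} __ {side(after)}"

print(environment_class(("SS", "b", "dc")))  # [+ dorsal] __ [+ dorsal]
print(environment_class(("#", "B", "t")))    # # __ [- dorsal]
```

With a boundary symbol or a binary dorsal value on each side, nine (3 * 3) classes are logically possible, matching the count given below.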

Figure 87. An excerpt from a PointTier with humming. In this beat pattern, the oral
articulators produce the sequence {b dc tbc b SS}, where {b} is an unforced Kick Drum, {dc}
and {tbc} are dental and interlabial clicks, and {SS} is a Spit Snare. The initial unforced Kick
Drum {b} and the interlabial click {tbc} are both co-produced with an upward pitch sweep
marked as {hm} and called “humming”. These hums were removed for this analysis, leaving
only the oral articulations. (Note that this audio signal was significantly denoised from its
original recording associated with the real-time MRI data acquisition, but a few artefacts
remain as echoes that follow most sounds in the recording.)

For example, consider two hypothetical trigram environment types: {SS b dc}, which is an

unforced Kick Drum between a Spit Snare {SS} and dental closure {dc}, and {^K b LR},

which is an unforced Kick Drum between an Inward K Snare {^K} and a Liproll {LR}. The

Spit Snare, dental closure, Inward K Snare, and Liproll all involve dorsal articulations, so the

environment types {SS b dc} and {^K b LR} would both be members of the environment

class {[+ dorsal] __ [+ dorsal]}. (The +/- binary feature notation style used here is for

convenience to represent the existence or absence of a dorsal closure and should not be

taken as an implication that this is a symbolic featural analysis). The options [+ dorsal], [-

dorsal], and utterance-boundary (“#” or “$”) can occur in both the before and after positions

for a Kick Drum environment, resulting in nine (3 * 3 = 9) logically possible Kick Drum

environment classes; two of these nine did not have any Kick Drum tokens in them, leaving

seven Kick Drum environment classes listed in Tables 23 and 24. Not all environment classes

were used by either type of Kick Drum.

2.2.2 Results

Tables 21 and 22 present the forced and unforced Kick Drum frequency distributions across

environments by token frequency (how many Kick Drums of a given kind were in each

environment class) and type frequency (how many unique trigram environment types of a

given Kick Drum kind were in each environment class). Table 21 shows the results of the

analysis for the forced Kick Drum environments, and Table 22 shows the results for the

unforced Kick Drum environments.

Table 21 summarizes the distribution of 330 forced Kick Drum tokens across 141

unique trigram environment types, which generalize to six environment classes. The

majority of forced Kick Drum tokens and environment types did not include proximity to a

dorsal sound ("Not near a dorsal" in Table 21). The forced Kick Drums that did occur near

dorsals tended to have a non-dorsal sound on their opposite side (i.e., {[- dorsal] B [+

dorsal]} or {[+ dorsal] B [- dorsal]}). As shown in Table 22, the vast majority (93.9%) of

unforced Kick Drum tokens occurred in environment classes that included one or more

dorsal sounds near the unforced Kick Drum (the “Near a dorsal” classes), with most of those

(83.3%) featuring dorsal sounds on both sides of the unforced Kick Drum. This is essentially

the reverse of the distribution of forced Kick Drums which were highly unlikely to occur

between dorsal sounds.


Tables 23 and 24 show contingency tables for observations of forced and unforced

Kick Drum environment types (Table 23) and tokens (Table 24). Fisher’s exact tests on these

tables were significant (p < 0.001 in both cases), meaning that the frequency distribution of

Kick Drums in these environments deviated from the expected frequencies—that is, Kick

Drum types appeared often in some environments and sparsely in others. Tables 23 and 24

highlight in green the cells with the highest frequencies and which correspond to the

observations in Tables 21 and 22: forced Kick Drums tend to occur between non-dorsal

sounds while unforced Kick Drums tend to occur between dorsal sounds.

Table 21. Forced Kick Drum environments.

                                                 Number of             Tokens in
Environment class (before __ after)              environment types     environment class

Near a dorsal       [+ dorsal] B [+ dorsal]        8     5.7%            18     5.5%
                    # B [+ dorsal]                 1     0.7%             1     0.3%
                    [+ dorsal] B [- dorsal]       28    19.9%            60    18.2%
                    [- dorsal] B [+ dorsal]       20    14.2%            42    12.7%
Not near a dorsal   [- dorsal] B [- dorsal]       63    44.7%           183    55.5%
                    # B [- dorsal]                21    14.9%            26     7.9%

Total                                            141     100%           330     100%

Table 22. Unforced Kick Drum environments.

                                                 Number of             Tokens in
Environment class (before __ after)              environment types     environment class

Near a dorsal       [+ dorsal] b [+ dorsal]       42    76.4%            95    83.3%
                    [- dorsal] b [+ dorsal]        1     1.8%             1     0.9%
                    # b [+ dorsal]                 5     9.1%             7     6.1%
                    [+ dorsal] b [- dorsal]        2     3.6%             2     1.8%
                    [+ dorsal] b $                 2     3.6%             2     1.8%
Not near a dorsal   [- dorsal] b [- dorsal]        3     5.5%             7     6.1%

Total                                             55     100%           114     100%

Table 23. Kick Drum environment type observations. Forced Kick Drum trigram environment
types were most likely to be of the {[- dorsal] B [- dorsal]} environment class, while unforced
Kick Drum environment types were most likely to be of the {[+ dorsal] b [+ dorsal]}
environment class.

                            Forced Kick Drum      Unforced Kick Drum
Environment class           environment types     environment types      Total

[+ dorsal] X [+ dorsal]             8                     41               45
[+ dorsal] X [- dorsal]            28                      2               25
[- dorsal] X [+ dorsal]            20                      1               18
[- dorsal] X [- dorsal]            63                      3               78
# X [+ dorsal]                      1                      5                6
# X [- dorsal]                     21                      0               21
[+ dorsal] X $                      0                      2                2

Total                             141                     54              195

Table 24. Kick Drum token observations. Forced Kick Drum tokens were most likely to occur
in the {[- dorsal] B [- dorsal]} environment class, while unforced Kick Drum tokens were
most likely to occur in the {[+ dorsal] b [+ dorsal]} environment class.

                            Forced Kick Drum      Unforced Kick Drum
Environment class           token frequency       token frequency        Total

[+ dorsal] X [+ dorsal]            18                     95              106
[+ dorsal] X [- dorsal]            60                      2               55
[- dorsal] X [+ dorsal]            42                      1               39
[- dorsal] X [- dorsal]           183                      7              208
# X [+ dorsal]                      1                      7                8
# X [- dorsal]                     26                      0               26
[+ dorsal] X $                      0                      2                2

Total                             330                    114              444

Figure 88 shows the time series for a sequence of a lateral alveolar closure, unforced Kick

Drum, and Spit Snare {tll b SS}. The sounds surrounding the unforced Kick Drum both have

tongue body closure: the lateral alveolar closure is a percussive like the unforced Kick Drum,

which in this case means it has tongue body closure but no substantial movement of the

tongue body forward or backward to cause a change in air pressure; the Spit Snare on the

other hand is a lingual egressive sound, requiring a tongue body closure and subsequent

squeezing of air past the lips. The tongue body maintains a high closure throughout the

sequence, as represented by consistently high values for pixel intensity in the DOR region,

indicating that the Kick Drum may be unforced because of gestural overlap with one or more

tongue body closures intended for a nearby sound like the Spit Snare. The LAR time series

for larynx height is also included to confirm that there is no ejective-like action here that

would correspond to a forced Kick Drum.

Figure 88. A sequence of a lateral alveolar closure {tll}, unforced Kick Drum {b}, and Spit
Snare {SS}. The DOR region of the tongue body has relatively high pixel intensity
throughout the sequences, and the LAR region of the larynx has low pixel intensity.

3. Conclusion

Forced and unforced Kick Drums are in complementary distribution: unforced Kick Drums,

which were described earlier as having a dorsal articulation in addition to a labial closure,

tend to occur near dorsal sounds; forced Kick Drums do not share this dorsal articulation,

and tend to occur near non-dorsal sounds. Based on this context-dependent complementary

distribution and their similar rhythmic patterning, the forced and unforced Kick Drums

seem to be cognitively related as alternations of a single Kick Drum category.

Given the matching dorsal or non-dorsal quality of a Kick Drum and its

surroundings, it seems likely that the alternations are specifically participating in a

phonological agreement/assimilation phenomenon. The tongue body does not appear to

release its closure between the unforced Kick Drum and the sound before or after it. In a

traditional phonological analysis, one could posit a phonological rule to characterize this

distribution such as: “Kick Drums are unforced (dorsal) between dorsal sounds and forced

(ejective) elsewhere.” (Forced Kick Drums are the elsewhere case because their occurrence is

distributed somewhat more evenly over more environment classes.)

{B} —> {b} / {+ dorsal} ___ {+ dorsal}
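The rule can be rendered as a toy string rewrite (Python, illustrative only; the dorsal set is an assumed subset of the Chapter 3 inventory, and no claim is made that beatboxers compute anything like this):

```python
# Illustrative subset of sounds produced with a tongue body (dorsal) closure.
DORSAL = {"SS", "CR", "LR", "WDT", "dc", "tll", "tbc", "^K"}

def apply_kick_rule(seq):
    """{B} -> {b} / {+ dorsal} __ {+ dorsal}: a forced Kick Drum
    surfaces as unforced between two dorsal sounds; forced elsewhere."""
    out = list(seq)
    for i, sound in enumerate(seq):
        if (sound == "B" and 0 < i < len(seq) - 1
                and seq[i - 1] in DORSAL and seq[i + 1] in DORSAL):
            out[i] = "b"
    return out

print(apply_kick_rule(["tll", "B", "SS"]))  # ['tll', 'b', 'SS']
print(apply_kick_rule(["t", "B", "t"]))     # ['t', 'B', 't'] (elsewhere case)
```

The elsewhere case simply leaves the forced Kick Drum unchanged, mirroring the rule statement above.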

The Articulatory Phonology analysis is roughly the same, if not so featural: Kick Drums are

unforced if they overlap with a tongue body closure. These interpretations assume a causal

relationship in which the Kick Drum is altered by its environment, but an alternative story

reverses the causation: forced and unforced Kick Drums are distinct sound categories that

trigger dorsal assimilation in the sounds nearby. The analysis of beatboxing phonological

harmony in Chapter 6: Harmony provides further evidence that the Kick Drum is subject to

change depending on the sounds nearby—including non-adjacent dorsal harmony

triggers—and not the other way around.

Kick Drums are not the only sound in the data set to show this type of pattern,

though their relatively high token frequency makes them the only sounds to show it so

robustly. As Chapter 3: Sounds listed, there are two labio-dental compression sounds: a

glottalic egressive PF Snare and a percussive labio-dental sound. As its name implies, the PF

Snare fulfills the musical role of a snare by occurring predominantly on the back beat of a

beat pattern. Suspiciously, the labio-dental percussive also appears on the back beat in the

two beat patterns it occurs in, and just like the unforced Kick Drum it occurs surrounded by

sounds with tongue body closures. The same goes for the Closed Hi-Hat and some of the

coronal percussives, though the pattern is confounded somewhat by the percussives being

distributed over several places of articulation while the Closed Hi-Hat is a distinctly alveolar

sound. Taking the Kick Drum, PF Snare, and Closed Hi-Hat together suggests that the

phenomenon discussed in this chapter is actually part of a general pattern that causes some

ejectives to become percussives when other sounds with tongue body closures are nearby.

Again, Chapter 6: Harmony addresses this in more detail.

CHAPTER 6: HARMONY

Some beatboxing patterns include sequences of sounds that share a tongue body closure, a

type of agreement that in speech might be called phonological harmony. This chapter

demonstrates that beatboxing harmony has many of the signature attributes that

characterize harmony in phonological systems in speech: sounds that are harmony triggers,

undergoers, and blockers. In beatboxing, the function of a sound in harmony is predictable

based on the phonetic dimension of airstream initiator. This analysis of beatboxing harmony

provides the first evidence for the existence of sub-segmental cognitive units of beatboxing

(vs whole segment-sized beatboxing sounds). These patterns also show that the harmony

found in spoken phonological systems is not unique to phonology.

1. Introduction

A common type of beat pattern in beatboxing involves the simultaneous production of

obstruent beatboxing sounds and phonation (which may not always be modal). This type of

"humming while beatboxing" beat pattern is well-known by beatboxers and treated as a skill

to be developed in the pursuit of beatboxing expertise (Stowell & Plumbley, 2008; Park, 2016;

WIRED, 2020).

Figure 89. A beat pattern that demonstrates the beatboxing technique of humming with
simultaneous oral sound production. This beat pattern contains five sounds: an unforced
Kick Drum {b}, a dental closure {dc}, a linguolabial closure {tbc}, a Spit Snare {SS}, and brief
moment of phonation/humming {hm}. In this beat pattern, humming co-occurs with other
beatboxing sounds on most major beats (i.e., 1, 2, 3, and 4, but not their subdivisions).
b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Without knowing the articulation in advance, a humming while beatboxing beat pattern is

a pneumatic paradox: humming requires a lowered velum to keep air pressure low above

the vocal folds while they vibrate and to allow air to escape through the nose, but glottalic

and pulmonic obstruents—which many beatboxing sounds are (see Chapter 3:

Sounds)—require a raised velum so air pressure can build up behind an oral closure. The

production of voiced stops in speech comes with similar challenges; languages with voiced

stops use a variety of strategies such as larynx lowering to decrease supraglottal pressure

(Catford, 1977; Ohala, 1983; Westbury, 1983). Real-time MRI examples later in this chapter

show that beatboxers use a different strategy to deal with the humming vs obstruent

antagonism: separating the vocal tract into two uncoupled chambers with a tongue body

closure (see also Dehais-Underdown et al., 2020; Paroni, 2021b). Behind the tongue body

closure, the velum is lowered and phonation can occur freely with consistently low

supraglottal pressure. In front of the tongue body closure, air pressure is manipulated by the

coordination of the tongue body and the lips or tongue tip. In speech, a similar articulatory

arrangement is used for the production of voiced or nasal clicks.

The examples above of speech remedies for voiced obstruents operate over a

relatively short time span near when voicing is desired. Notice, however, that phonation {hm}

in the beat pattern from Figure 89 is neither sustained nor co-produced with every oral

beatboxing sound, yet every sound in the pattern is produced with a tongue body closure. It

turns out that other beat patterns like the one in Figure 90 also feature many sounds with

tongue body closures even when the beat pattern has no phonation at all; the humming

while beatboxing example is just one of several beat pattern types in which multiple sounds

share the property of being produced with a tongue body constriction. When multiple

sounds share the same attribute in speech, the result is phonological “harmony”.

This chapter demonstrates the existence of harmony in beatboxing, and in doing so

offers deep insights about the makeup of the fundamental units of beatboxing cognition. The

remainder of this section provides a basic overview of local (vowel-consonant) harmony in

speech (section 1.1) and previews some of the major theoretical issues at stake in the

description of tongue body closure harmony for beatboxing (section 1.2).

Figure 90. This beat pattern contains four sounds: a labial stop produced with a tongue body
closure labeled {b}, a dental closure {dc}, a lateral closure {tll}, and a lingual egressive labial
affricate called a Spit Snare {SS}. All of the sounds are made with a tongue body closure.
b |x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
dc |----x-----------|----------------|----x-----------|----------------
tll|----------------|x---------------|----------------|x---------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

1.1 Speech harmony

Harmony in speech occurs when multiple distinct phonological segments can be said to

“agree” with each other by expressing the same particular phonological property. There are a

few different types of harmony patterns in speech, but the most relevant to this study is

“local harmony” in which the sounds that agree with each other occur in an uninterrupted

sequence. (Local harmony is also known as “vowel-consonant harmony” because it affects

both vowels and consonants). Rose & Walker (2011) describe a few types of local harmony

including nasal harmony, emphatic (pharyngeal) harmony, and retroflex harmony.

As a phonological phenomenon, harmony is ultimately governed by the goals of

speech—specifically, the task of communicating a linguistic message. Part of accomplishing

this task is to create messages that have a high likelihood of being accurately recovered by

someone perceiving the message. Harmony is one of several mechanisms that have been

hypothesized for strengthening contrasts that may otherwise be perceptually weak:

perceptually weak phonological units are more likely to be heard if they last longer and

overlap with multiple segments (Kaun, 2004; Walker, 2005; Kimper, 2011). Less teleologically,

others suppose that (local) harmony is the diachronic phonologization of coarticulation

(Ohala, 1994) or stochastic motor control variation (Tilsen, 2019), which may have

perceptual benefits. In either view, local harmony is initiated by a “trigger” segment which

has some phonological property to spread (i.e., a feature or gesture). Through harmony, that

property is shared with other nearby segments (“targets” or “undergoers”) so that they end

up expressing the same phonological information as the trigger segment.

The same overarching task of producing a perceptually recoverable message which

may motivate harmony also constrains which phonological properties of a sound will spread

and how. Harmony must be unobtrusive enough that it does not destroy other crucial

phonological contrasts; tongue body closure harmony, for example, is unattested in speech

because it would destroy too much information by turning all vowels and consonants into

velar stops (Gafos, 1996; Smith, 2018). Likewise, sounds that would be disrupted by harmony

should be able to resist harmonizing and prevent its spread; these types of sounds are called

“blockers”. In other languages, some sounds might be “transparent” instead, meaning that

they neither undergo nor block the harmony.

Theoretical accounts generally treat local harmony as the spreading of a single

phonological property to other sounds (Rose & Walker 2011). In featural accounts, this is

often done by formally linking a feature to adjacent segments according to some rule,

constraint, or other grammatical or dynamical force. In gestural accounts, local harmony has

been modeled as maintaining a particular vocal tract constriction over the course of multiple

segments (Gafos, 1996; Walker et al., 2008; Smith, 2018).

In sum, local/vowel-consonant harmony in speech is observed when multiple sounds

in a row share the same feature or gesture. Harmony is analyzed as a feature or gesture

spreading from a trigger unit onto or through adjacent segments called undergoers, though

some segments may also block harmony or be transparent to it. To the extent that harmony

is goal-oriented, it is likely motivated by a speech goal of promoting perceptual recoverability

of a linguistic message; harmony supports this goal by providing a listener more

opportunities to perceive what might otherwise be a perceptually weak feature or gesture.

1.2 Beatboxing harmony

Figures 89 and 90 provided examples of beatboxing sequences in which each sound has a

tongue body closure. While these beat patterns may be harmonious in the sense that the

sounds agree on some property, this does not mean that beatboxing harmony has the same

traits as speech harmony. The overarching goals of beatboxing are more aesthetic than

communicative, so beatboxing harmony may be related to less meaningful—but still

perceptually salient—aesthetic goals. For example, the humming while beatboxing pattern

described earlier allows the beatboxer to add melody to a beat pattern. Even without

phonation, it may sometimes be desirable to make many sounds with a tongue body closure

to create a consistent sound quality from the shorter resonating chamber in front of the

tongue body. Given the completely different tasks that drive speech and beatboxing

harmonies, they could in principle arise from completely distinct motivations using

completely distinct mechanisms, such that any resemblance between them is purely

superficial.

One way to determine whether beatboxing harmony bears only superficial similarity

to harmony in speech or a deeper one based on a partly shared cognitive system

underlying sequence production is to see whether or not beatboxing harmony exhibits the

signature properties of speech harmony beyond the existence of sequences that share some

properties, namely: triggers, undergoers, and blockers. For example, consider a beatboxing

sequence like *{CR WDT SS WDT}. (The asterisk on that beat pattern indicates that it is not

a sequence found in this data set, which is not quite the same thing as saying that it is an

ill-formed beatboxing sequence.) In that sequence, each sound requires a tongue body

closure, so there may be a separate tongue body closure for each sound rather than a

prolonged tongue body closure that would be expected in speech harmony. Either way, none

of the sounds would have to trigger or undergo a tongue body closure assimilation to create

harmony because they all have tongue body closures in any context in which they appear;

and if there is no evidence for triggers, there could be no evidence for blockers either.

Alternatively, evidence could suggest that harmony in speech and beatboxing share

some deeper principles. Local harmony in speech involves prolonged constrictions; since

plenty of other nonspeech behaviors involve holding a body part in one place for an

extended period of time, beatboxing could do that too in order to create a prolonged

aesthetic effect. And if a beatboxer holds a tongue body closure for an extended period of

time during a beat pattern, the closure would temporally overlap with other sounds and

ensure that they are made with a tongue body closure too—even if they weren’t necessarily

selected to have one and wouldn’t have the tongue body closure in other contexts. Thus,

beatboxing might have triggers and undergoers (alternants, as in Chapter 5: Alternations).

Furthermore, if some beatboxing sounds in the same pattern cannot be produced with a

tongue body closure without radically compromising their character, those sounds might

block the tongue body closure harmony. Beatboxing harmony might present all the same

signature properties as speech harmony but for different aims.

Finding evidence in beatboxing for sustained constrictions and sounds with signature

harmony properties is not enough to claim that beatboxing harmony is like speech harmony.

Phonological harmony is a sound pattern. It’s predictable. Triggers, undergoers, and blockers

are classes of sounds organized by sub-segmental properties they share. If beatboxing has the

same type of harmony, then the sounds of beatboxing harmony must be organized along

similarly sub-segmental lines. Chapter 3: Sounds used analytic dimensions to describe the

phonetic organization of beatboxing sounds. The aim of the current chapter is to test

whether any of these dimensions play a role in the active cognitive patterning of beatboxing.

If beatboxing can be shown to exhibit harmony, then the roles of the sounds in a harmony

pattern—triggers, undergoers, blockers—should be predictable by some phonetic dimension

along which they are distributed. In turn, those same phonetic dimensions must be

sub-segmental cognitive units for beatboxing.

In the context of the larger question of domain-specificity of language cognition, the

analyses of this chapter aim at answering whether or not harmony is unique to language.

Theories of phonological harmony are designed only to account for language data; but if

beatboxing also has harmony, then a theory is needed that accounts for the shared or

overlapping cognitive structures of speech and beatboxing. The shared-graph hypothesis in

Chapter 4: Theory represents an initial attempt to do that.

In summary, beatboxing harmony may resemble speech harmony one of two

different ways: in only the superficial sense that sequences of sounds share similar properties,

or in the more profound sense that harmony is governed by phonological principles similar

to those found for speech. In the latter case, beatboxing sounds that participate in harmony

patterns should be reliably classifiable into roles like trigger, undergoer, and blocker.

Furthermore, if these roles can be predicted by one or more phonetic attributes, then

harmony in beatboxing is also evidence for the existence of cognitive sub-segmental

beatboxing units. Like speech harmony, beatboxing harmony should then be able to be

accounted for using phonological models of harmony.

Section 2 introduces the method by which the beatboxing corpus was probed to

discover and analyze beatboxing harmony examples. Section 3 describes a subset of the

harmony examples in terms of the evidence for triggers, undergoers, and blockers. Section 4

argues for the existence of cognitive sub-segmental beatboxing elements relating to airflow

initiators and provides an account of beatboxing harmony patterns using gestures made

possible via Chapter 4: Theory.

2. Method

See Chapter 2: Method for details of how the rtMR videos were acquired and annotated then

converted to time series and gestural scores for the analysis below.

The videos and drum tabs of each beat pattern were visually inspected in order to

identify those which had sequences of sounds produced with tongue body closures. Eleven

such beat patterns were identified. For this analysis, each of those 11 beats patterns was

examined more closely to evaluate the constriction state of the tongue body during and

between the articulation of sounds in the beat pattern. These observations were

supplemented and corroborated by region-of-interest time series analysis.

Most of the beat patterns in the database were performed to showcase a particular

beatboxing sound. Seven of the eleven beat patterns exhibiting persistent tongue body

closure were from these showcase beat patterns, each of which features a sound that is

produced with a tongue body closure: Clickroll {CR}, Clop {C}, Duck Meow SFX, Liproll

{LR}, Spit Snare {SS}, Water Drop Air {WDA}, and Water Drop Tongue {WDT}. Two other

beat patterns showcasing the Inward Bass and the Humming while Beatboxing pattern were

also performed with a persistent tongue body closure; both of these beat patterns included

the Spit Snare {SS}. The final two beat patterns did not showcase any beatboxing sound in

particular: one was a long beat pattern featuring the Spit Snare, in which the last few

measures were made with a persistent tongue body closure; the other included both the Spit

Snare and the Water Drop Tongue.

3. Results: Description of beatboxing harmony patterns

Five of the eleven beat patterns with harmony are discussed in this section to illustrate how

beatboxing harmony manifests and to test the hypothesis that beatboxing harmony exhibits

some of the signature properties of speech harmony discussed above.

These five are the Spit Snare {SS} showcase (beat pattern 5), the humming while beatboxing

pattern (beat pattern 9), the Clickroll {CR} showcase (beat pattern 1), the Liproll {LR}

showcase (beat pattern 4), and a freestyle beat pattern that was not produced with the

intention of showcasing any particular beatboxing sound (beat pattern 10). As summarized

in Table 25, these beat patterns depict a beatboxing harmony complete with sounds that

trigger the bidirectional spreading of a lingual closure, sounds that undergo alternations due

to that closure, and sounds that block the spread of harmony.

Table 25. Summary of the five beat patterns analyzed.

Section 3.1. Beat pattern 5 — Spit Snare {SS} showcase
Observation: The tongue body rises into a velar closure at the beginning of the utterance and stays there until the end of the utterance. Kick Drums in the scope of this velar closure lose their larynx raising movement.
Analysis: The Spit Snare triggers bidirectional tongue body closure harmony. Kick Drums in the environment of the harmony lose their larynx raising movement when they gain their tongue body closure, and therefore exhibit an alternation from a glottalic egressive to percussive airstream.

Section 3.2. Beat pattern 9 — Humming while beatboxing
Observation: A velar tongue body closure splits the vocal tract into two chambers so that percussion and voicing can be produced independently.
Analysis: Tongue body closure harmony is triggered again by the Spit Snare. It does not restrict all laryngeal activity—it allows vocal fold adduction for voicing (humming), but eliminates the larynx raising movements associated with Kick Drums.

Section 3.3. Beat pattern 4 — Liproll {LR} showcase
Observation: Tongue body harmony is again achieved by maintaining a closure against the upper airway. However, the location of that closure moves back and forth between the palate and the uvula as required by the Liproll. When the Liproll is not active, the tongue body adopts a velar position.
Analysis: Tongue body closure harmony does not require a static tongue posture; it allows variability in constriction location so long as the constriction degree remains a closure. The Liproll is the harmony trigger this time, and PF Snares undergo harmony.

Section 3.4. Beat pattern 10 — Freestyle pattern 1
Observation: Some sequences of sounds agree in tongue body closure, but these groups are separated from each other by sounds without tongue body closure, including the Inward Liproll and High Tongue Bass. Kick Drums near these two sounds retain their larynx raising movements.
Analysis: The Spit Snare is once again a harmony trigger, but the Inward Liproll {^LR} and High Tongue Bass {HTB} block the spread of harmony. Both blocking sounds are pulmonic, indicating that harmony is blocked by pulmonic airflow. Temporal proximity to the harmony blockers prevents the Kick Drums from harmonizing.

Section 3.5. Beat pattern 1 — Clickroll {CR} showcase
Observation: Brief sequences agreeing in tongue body closure are broken up by forced Kick Drums and Inward K Snares. The tongue body is elevated during the forced Kick Drums, but an air channel over the tongue is created by raising the velum.
Analysis: The Clickroll triggers tongue body closure harmony and the pulmonic Inward K Snare blocks harmony. As with beat pattern 10, Kick Drums close to the harmony blocker are not susceptible to harmonizing. The elevated tongue body position during forced Kick Drums is argued to be anticipatory coarticulation from the Inward K Snare.

As for the other six beat patterns not discussed: the Clop {C} showcase (beat pattern 2) was

not analyzed because it only contains one oral sound—the Clop {C}; the Duck Meow SFX

{DM} showcase was not analyzed because a complete phonetic description of the Duck

Meow SFX could not be given in Chapter 3: Sounds, making an articulatory

analysis unfeasible. The remaining beat patterns for the Water Drop Air {WDA} showcase

(beat pattern 6), Water Drop Tongue {WDT} showcase (beat pattern 7), Inward Bass {IB}

showcase (beat pattern 8), and second freestyle pattern (beat pattern 11) all exhibit

bidirectional spreading like beat pattern 5. Beat pattern 7 is additionally confounded by the

presence of two sounds that use tongue body closures when performed in isolation.

Table 26 lists the beatboxing sounds used in the remainder of this chapter, along with

their transcription in BBX notation (see Chapter 3: Sounds). Transcription in notation from

the International Phonetic Alphabet is also provided, incorporating symbols from the

extensions to the International Phonetic Alphabet for disordered speech (Duckworth et al.,

1990; Ball et al., 2018) and the VoQS System for the Transcription of Voice Quality (Ball et

al., 1995; Ball et al., 2018). An articulatory description of each sound is also given in prose.

The table groups the sounds by their role in beatboxing harmony (which the subsequent

analysis provides evidence for). Note that “percussives” are sounds made with a posterior

tongue body closure but without the tongue body fronting or retraction associated with

lingual airstream sounds.

Table 26. The beatboxing sounds used in this chapter.


Name BBX IPA Description

Triggers

Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate

Clickroll {CR} [*] Voiceless lingual egressive alveolar trill

Liproll {LR} [ʙ̥↓] Voiceless lingual ingressive bilabial trill

Blockers

Inward Liproll {^LR} [ʙ̥↓] Voiceless pulmonic ingressive bilabial trill

High Tongue Bass {HTB} [r] Voiced pulmonic egressive alveolar trill (high pitch)

Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar affricate

Undergoers (alternants of other sounds)

Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop

Labiodental closure {pf} [ʘ̪] Voiceless percussive labiodental stop

Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop

Other

Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop

Closed Hi-Hat {t} [t’] Voiceless glottalic egressive alveolar stop

Humming {hm} [C̬] Pulmonic egressive nasal voicing

Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop

Dental-alveolar closure {dac} Voiceless percussive laminal dental stop

Alveolar closure {ac} [k͜ǃ] Voiceless percussive alveolar stop

Lateral alveolar closure {tll} [ǁ] Voiceless percussive lateral alveolar stop

3.1 Beat pattern 5—Spit Snare showcase

Beat pattern 5 showcases the Spit Snare {SS}. Section 3.1.1 demonstrates how the tongue

body makes a velar closure throughout the entire performance, making this a relatively

simple case of tongue body closure harmony. The tongue body closure results in alternations

from forced (ejective) to unforced (percussive) sounds as well as a lack of laryngeal

movement associated with ejectives. Section 3.1.2 analyzes the pattern in terms of a tongue

body harmony trigger and undergoers. Table 27 re-lists the beatboxing sounds used in beat

pattern 5 for reference.

Table 27. Sounds of beatboxing used in beat pattern 5.


Name BBX IPA Description

Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop

Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop

Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate

Lateral alveolar closure {tll} [ǁ] Voiceless percussive lateral alveolar stop

3.1.1 Description of beat pattern 5

Beat pattern 5 is a relatively simple example of tongue body closure harmony in beatboxing.

As the drum tab (Figure 91) and time series (Figure 93) below show, the tongue body makes

a closure against the velum for the entire duration of the beat pattern.

Drum tab


The Spit Snare is metrically positioned as expected on the back beat (beat 3 of each

measure), and the unforced Kick Drum occurs in a relatively common pattern on beats 1, 2+,

and 4 of the first measure and beats 2 and 4 of the second measure, repeating the two

189
measure pattern for measures 3 and 4. The dental closure occurs on beat 2 of the first and

third measures, and the lateral alveolar closure occurs on beat 1 of the second and fourth

measures. All the sounds in this beat pattern share the trait of being made with a tongue

body closure. Agreement like this in speech would likely be considered a type of local

harmony.

Figure 91. Drum tab of beat pattern 5.


b |x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
dc |----x-----------|----------------|----x-----------|----------------
tll|----------------|x---------------|----------------|x---------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
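Drum tab lines like those in Figure 91 can also be read programmatically. The following is a hypothetical helper, not part of the tooling used for this dissertation; it assumes one character per sixteenth note, "x" marking an onset, "~" marking a sustained sound, and "|" separating the label and measures.

```python
# Minimal parser for one drum tab line: returns the sound label and the
# sixteenth-note indices at which that sound has an onset.
def parse_tab_line(line):
    label, _, tab = line.partition("|")
    grid = tab.replace("|", "")  # concatenate measures into one grid
    onsets = [i for i, ch in enumerate(grid) if ch == "x"]
    return label.strip(), onsets

# The Spit Snare line of beat pattern 5 (first two measures): with 16
# slots per measure, slot 8 is beat 3, the back beat.
label, onsets = parse_tab_line("SS |--------x-------|--------x-------")
print(label, onsets)  # SS [8, 24]
```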

Time series


The articulator movements for beat pattern 5 are illustrated in Figure 93 with four time

series: one for labial closures (LAB), one for alveolar closures (COR), one for tongue body

closures (DOR), and one for larynx height (LAR). The labial (LAB) time series includes the

gestures for the unforced Kick Drum {b} and the Spit Snare {SS}, while the coronal (COR)

time series features the gestures for the dental closure {dc} and the lateral alveolar closure

{tll}. The tongue body (DOR) time series shows that the tongue body stays raised

throughout the beat pattern: the tongue body starts from a lower position at the very

beginning of the beat pattern, represented by low pixel intensity (close to the bottom of the

y-axis), but it quickly moves upward at the beginning of the beat pattern to make a closure

(high pixel intensity, closer to the top of the y-axis) in time for the first unforced Kick Drum

{b}.

Figure 92. Regions for beat pattern 5. From top to bottom: the labial (LAB) region for the
unforced Kick Drum {b} and Spit Snare {SS}; the coronal (COR) region for the dental
closure {dc} and lateral alveolar closure {tll}; the dorsal (DOR) region to show tongue body
closure; and the laryngeal (LAR) region to show lack of laryngeal activity.

Unforced Kick Drum and Spit Snare

Dental closure and lateral alveolar closure

Dorsal closure during Spit Snare and empty larynx region during unforced Kick Drum

Figure 93. Time series of vocal tract articulators used in beat pattern 5, captured using a
region of interest technique. From top to bottom, the time series show average pixel intensity
for labial (LAB), coronal (COR), dorsal (DOR), and laryngeal (LAR) regions.

The time series in Figure 93 capture the results of the alternation of forced Kick Drums to

unforced Kick Drums. As discussed in Chapter 5: Alternations, the default forced Kick

Drums are ejectives, which means the laryngeal time series of Kick Drums would show an

increase from low intensity to high intensity as a rising larynx enters the region of interest.

The alternative Kick Drum form, the unforced Kick Drum, is made in front of a tongue body

closure, so it is expected to exhibit activity in the dorsal time series. Tongue body closures are

not antithetical to laryngeal movement: they may occur at the same time, and often do for

dorsal ejectives in speech. Yet beat pattern 5 shows that the Kick Drums do have a tongue

body closure but do not have a laryngeal movement (Figure 94).

Figure 94. Upper left: Labial and laryngeal gestures for an ejective/forced Kick Drum at the
beginning of a beat pattern. Upper right: Labial gesture for a non-ejective/unforced Kick
Drum at the beginning of beat pattern 5. A larynx raising gesture occurs with the forced Kick
Drum, but not the unforced Kick Drum. (Pixel intensities for each time series were scaled
[0-1] relative to the other average intensity values in that region; the labial closure of the
forced Kick Drum looks smaller than the labial closure of the unforced Kick Drum because it
was scaled relative to other sounds in its beat pattern with even brighter pixel intensity
during labial closures. Both labial gestures in this figure are full closures.) Lower left: At the
time of maximum labial constriction for the ejective Kick Drum, the vocal folds are closed
(visible as tissue near the top of the trachea) and the airway above the larynx is open; the
velum is raised. Lower right: At the time of maximum labial constriction for the non-ejective
unforced Kick Drum, the vocal folds are open and the tongue body connects with a lowered
velum to make a velar closure.
Forced (ejective) Kick Drum Unforced (lingual) Kick Drum

Forced (ejective) Kick Drum Unforced (lingual) Kick Drum
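The [0-1] scaling noted in the caption is ordinary min-max normalization over each region's run of average intensities. A sketch with invented values shows why relative scaling can make comparable closures look different across beat patterns:

```python
# Min-max normalization of a region's intensity series to [0, 1].
# The values below are invented for illustration.
import numpy as np

def scale_01(series):
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)

# The same raw closure intensity (0.8) maps to different scaled values
# depending on the brightest value elsewhere in its own series, which is
# why the forced Kick Drum's full labial closure can look "smaller" than
# the unforced one in the plots even though both are full closures.
a = scale_01([0.0, 0.8])        # 0.8 is this series' maximum
b = scale_01([0.0, 0.8, 1.6])   # a brighter closure exists elsewhere
print(a[1], b[1])  # 1.0 0.5
```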

From the perspective of aerodynamic mechanics, this is sensible: laryngeal movement behind

the tongue body closure has no effect on the size of the chamber between the lips and the

tongue body, so it makes no difference whether the larynx moves or not; better to save

energy and not move the larynx. From the perspective of beatboxing phonology, this

example is illuminating: if one assumes based on Chapter 5: Alternations that the forced Kick

Drum was selected for this beat pattern and undergoes an alternation into an unforced Kick

Drum, then the phonological model must provide not only a way to spread the tongue body

closure but also a way to get rid of the larynx raising. (Section 4 addresses this in more

detail.)

3.1.2 Analysis of beat pattern 5

Harmony patterns in speech are defined by articulations that spread from a single trigger

sound to other sounds nearby, causing them to undergo assimilation to that articulation. In

beat pattern 5, the Spit Snare is the origin of a lengthy tongue body closure gesture and other

sounds like the Kick Drum assimilate to that dorsal posture as well. The sounds

agree by sharing a tongue body closure, and in this sense they are harmonious.

Harmony undergoers


As established in Chapter 5: Alternations, the unforced Kick Drum is an alternation of the

Kick Drum that mostly appears in environments with surrounding dorsal closures. This was

implicitly characterized as local agreement: the unforced Kick Drum adopts a tongue body

closure when adjacent sounds also have a tongue body closure. Looking beyond the unforced

Kick Drum’s immediate environment, however, and considering the pervasive tongue body

closure in this beat pattern, the Kick Drum alternation in this beat pattern seems more aptly

described as the result of tongue body harmony: the Kick Drum is not just accidentally

sandwiched between two dorsal sounds—all the sounds, nearby and not, have tongue body

closures. The unforced Kick Drum is a forced Kick Drum that undergoes tongue body

closure harmony.

Harmony trigger


Of all the sounds in a beat pattern, only the ones that are always produced with a tongue

body closure, even in isolation, could be triggers of harmony. Of the sounds in this particular

beat pattern, only the Spit Snare was ever performed in isolation or identified as a distinct

beatboxing sound by the beatboxer; as the only sound in this beat pattern known to require a

tongue body closure, the Spit Snare is therefore the most likely candidate for a harmony

trigger. In fact, the Spit Snare is associated with long tongue body closures in all the beat

patterns it appears in, and in most cases is the only sound in that pattern known to be

produced with a tongue body closure.

Assuming the Spit Snare is a harmony trigger, the tongue body closure harmony

in this beat pattern extends bidirectionally: it is regressive from beat 2 of the first measure to

begin with the first unforced Kick Drum {b}, but also progressive from beat 4 of the last

measure to co-occur with the final unforced Kick Drum.

3.2 Beat pattern 9—Humming while beatboxing

Beat pattern 9 is an example of the “humming while beatboxing” described at the beginning

of this chapter. Section 3.2.1 describes this humming while beatboxing pattern with drum tab

notation and articulatory time series. The humming is intermittent in this particular beat

pattern, and there is no need to keep a tongue body closure when humming is not

active—yet as the time series shows, the tongue body closure persists for the entire beat

pattern, suggesting a sustained posture like the ones exhibited in speech harmony. This is

discussed in section 3.2.2 in terms of triggers (the Spit Snare) and undergoers (the

non-humming sounds). For reference, the sounds of this beat pattern are listed in Table 28.

Table 28. Sounds of beatboxing used in beat pattern 9.


Name BBX IPA Description

Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop

Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop

Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate

Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop

Humming {hm} [C̬] Pulmonic egressive nasal voicing

3.2.1 Description of beat pattern 9

Drum tab


Beat pattern 9 showcases the strategy of humming {hm} while beatboxing (Figure 95). As in

beat pattern 5, the four supralaryngeal sounds in this beat pattern are the unforced Kick

Drum {b}, a Spit Snare {SS}, and two additional percussive closures—one dental {dc} and one

linguolabial {tbc}. The additional humming {hm} sound is a brief upward pitch sweep that

occurs on most beats. (If humming occurs with the first three Spit Snares, it is acoustically

occluded in the audio data of this beat pattern and therefore was not marked.)

Figure 95. Drum tab of beat pattern 9.
b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Time series


The time series were generated by the same regions used in section 3.1 (the Spit Snare

showcase). The DOR time series shows that the tongue body is raised consistently

throughout the beat pattern. Laryngeal activity on most major beats (LAR time series)

corresponds to voicing {hm}. There is also activity during the three Spit Snares that are not

marked for voicing in the drum tab; if this is voicing, it may not be apparent in the acoustic

signal due to some combination of the noise reduction method used in audio processing and

the high amplitude of the Spit Snare itself.

Figure 96. Time series and gestures of beat pattern 9.

3.2.2 Analysis of beat pattern 9

The main point of note in this beat pattern is that the larynx is not necessarily inactive

during tongue body closure harmony. The description of beat pattern 5 in section 3.1 noted

that when forced Kick Drums undergo tongue body closure harmony, their unforced

alternants do not have a larynx raising gesture. A phonological model needs to be able to

“turn off” the larynx movement of the forced Kick Drums to generate the observed unforced

Kick Drums. But as beat pattern 9 shows, a blanket ban on laryngeal activity during tongue

body closure harmony would not be an appropriate choice for the phonological model

because the vocal folds can still phonate.

The musical structures of beat patterns 5 and 9 are different in sounds and rhythms,

but the rest of the analysis is essentially the same. Once again, the tongue body closure that

persists throughout the beat pattern is most likely to be associated with the Spit Snares: none

of the other sounds in this beat pattern were produced in isolation by the beatboxer, which

suggests that they are tongue-body alternations of sounds without tongue body gestures (like

the Unforced Kick Drum is an alternation of the Kick Drum) or sounds that are

phonotactically constrained to only occur in the context of a sound with a tongue body

closure—in either case, not independent instigators of a sustained tongue body closure.

Again, the harmony would be bidirectional, spreading leftward to the first sounds of the beat

pattern and rightward until the end.

3.3 Beat pattern 4—Liproll showcase

Beat pattern 4 showcases the Liproll {LR}. The Liproll triggers tongue body harmony just

like the Spit Snare did in the previous examples; but unlike the Spit Snare, the tongue body

constriction location during the Liproll changes dramatically during the Liproll’s

production—from the front of the palate all the way to the uvula in one smooth glide.

Tongue body closure harmony is maintained during the Liproll because the constriction

degree of the tongue body stays at a constant closure. When the Liproll is not being

produced, the tongue body adopts a static velar closure. Section 3.3.1 presents the beat

pattern in drum tab and time series forms, and section 3.3.2 analyzes the pattern in terms of a

tongue body harmony trigger (the Liproll) and undergoers (everything else).

Table 29. Sounds of beatboxing used in beat pattern 4.


Name BBX IPA Description

Liproll {LR} [ʙ̥↓] Voiceless lingual ingressive bilabial trill

Alveolar closure {ac} [k͜ǃ] Voiceless percussive alveolar stop

Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop

Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop

Labiodental closure {pf} [ʘ̪] Voiceless percussive labiodental stop

Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop

3.3.1 Description of beat pattern 4

Drum tab


Beat pattern 4 (Figure 97; split into two parts) is composed of six sounds: the unforced Kick

Drum {b}, the Liproll {LR}, and percussive alveolar {ac}, dental {dc}, labiodental {pf}, and

linguolabial {tbc} closures. The onsets of Liprolls are metrically synchronous with unforced

Kick Drums as represented by the “x” symbols, though the time series shows that they are

not simultaneous—a Kick Drum is made first and a Liproll follows quickly thereafter. The “~”

symbol signifies that the labial trill of the Liproll is extended across multiple beats. The

labiodental closure {pf} serves the role of the snare by occurring consistently and exclusively

on beat 3 of each measure; since it was never produced in isolation by the beatboxer, the {pf}

is analyzed as an alternant of the glottalic egressive {PF} snare.

Figure 97. Drum tab notation for beat pattern 4.


b |x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
ac |----------x-----|----------x-----|----------x-----|--------x-------
dc |----------------|----x-----------|----------------|------------x---
tbc|----------------|----------------|----------------|----x-----------
pf |--------x-------|--------x-------|--------x-------|------x---------
LR |x~~~~~------x~~~|~~----------x~~~|x~~~~~------x~~~|~~--------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

b |x---x-------x---|x---x-------x---|x---x-------x---|x---x-----------
ac |----------x-----|----------x-----|----------x-----|----------------
dc |----------------|--------------x-|----------------|----------------
tbc|----------------|----------------|----------------|----------------
pf |--------x-------|--------x-------|--------x-------|--------x-------
LR |x~~~x~~~----x~~~|x~~~x~~~--------|x~~~x~~~----x~~~|x~~~x~~~--------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Time series


The time series representation for beat pattern 4 (Figure 99) comprises five time series. The first

three (LAB, LAB2, and FRONT) have movements relevant to the production of sounds in

this pattern. Labial closures of the unforced Kick Drum {b} and labiodental closure {pf} are

in the LAB time series; labial closures during which the lips are pulled inward over the teeth

for the Liproll {LR} are in LAB2; and the anterior region of the vocal tract into which the

tongue shifts forward at the beginning of a Liproll is represented by FRONT. (A coronal time

series for the alveolar, dental, and linguolabial closures is not included.) The dorsal DOR and

laryngeal LAR time series are included to show the consistently high tongue body posture

and the lack of laryngeal activity, respectively.

Figure 98. Regions used to make time series for the Liproll beat pattern.

Unforced Kick Drum (left) and labiodental closure (right) in LAB region.

Liproll retraction of lower lip over the teeth into LAB2 region.

Liproll tongue body in (left) and out of (right) the FRONT region

The tongue body makes a closure with the velum in the DOR region during the labiodental
closure (left) and there is no laryngeal activity in the LAR region (right).

Figure 99. Time series of the beat pattern 4 (Liproll showcase).

3.3.2 Analysis of beat pattern 4

The Liproll triggers tongue body closure harmony in beat pattern 4, causing both Kick

Drums and PF Snares to be produced with tongue body closures instead of glottalic egressive

airflow. Figure 98 shows snapshots of the different positions of the tongue body during this

beat pattern: the tongue body adopts a resting position closed against the velum during most

sounds but shifts forward and backward (right image) to create the Liproll.

3.4 Beat pattern 10—Freestyle beat pattern

Beat pattern 10 is a freestyle beat pattern not intended to showcase any particular sound. The

Spit Snare is once again a harmony trigger as it was in beat patterns 5 and 9, but here the

harmony does not spread throughout the whole beat pattern as it did in those earlier ones. In

the first six measures of the beat pattern, tongue body closures triggered by a Spit Snare do

not extend through the Inward Liproll or High Tongue Bass. These two pulmonic sounds are

analyzed as harmony blockers.

Table 30. Sounds of beatboxing used in beat pattern 10.


Name BBX IPA Description

Inward Liproll {^LR} [ʙ̥↓] Voiceless pulmonic ingressive bilabial trill

Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop

Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar affricate

Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate

Dental-alveolar closure {dac} Voiceless percussive laminal dental stop

Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop

Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop

High Tongue Bass {HTB} [r] Voiced pulmonic egressive alveolar trill (high pitch)

Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop

3.4.1 Description of beat pattern 10

Drum tab


The Spit Snare {SS} occurs on beat three of each measure of this beat pattern. In measures 2,

4, 6, and 7-8 the Spit Snare follows a linguolabial closure {tbc} and unforced Kick Drum {b},

indicating that some harmony is occurring. In the same measures, however, there are also

forced Kick Drums and High Tongue Basses that did not undergo harmony. And in measures

1, 3, and 5 the Spit Snare is the only tongue body closure sound around. Only in the final two

measures does the pattern return to a sequence of tongue body closure sounds.

Figure 100. Drum tab for beat pattern 10.
B |x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
^LR|x~~~~~------x~~~|~~--------------|x~~~~~------x~~~|~~--------------
^K |----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|------------x~~~
b |----------------|------x---------|----------------|------x---------
dc |----------------|----------------|----------------|----------------
dac|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

B |x-----x-----x---|--x-------------|x---------------|----------------
^LR|x~~~~~------x~~~|~~--------------|----------------|----------------
^K |----------------|----------------|----------------|------------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|----------------
b |----------------|------x---------|------x-----x---|--x---x---------
dc |----------------|----------------|--x-----------x-|----------------
dac|----------------|----------------|----x-----------|x---------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Time series


Nine beatboxing sounds manifest along six time series. The forced Kick Drum {B}, unforced

Kick Drum {b}, and Spit Snare {SS} all go on the labial closure LAB time series. The Inward

Liproll {^LR} goes on the LAB2 time series which responds to pixel intensity when the lower

lip retracts over the bottom teeth. (It also responds to tongue tip movement in the same

pixels, but there are no meaningful movements highlighted in that case.) The High Tongue

Bass {HTB}, linguolabial closure {tbc}, and dental closure {dc}, are in the COR tongue tip

time series. The Inward K Snare {^K} goes on the DOR region, and the LAR region has the

laryngeal movements for the forced Kick Drum. The dental-alveolar closure {dac} was

captured in a separate region that is not pictured. Black boxes surround movements that

were partially or completely manually corrected.

Most of the Kick Drums near the Inward Liproll and High Tongue Bass are marked as

forced because laryngeal closure was apparent when visually inspecting the image frames of

those sounds. A forced Kick Drum was also observed in the production of the Inward Liproll

in isolation. But in this beat pattern, the laryngeal activity during most forced Kick Drums is

minimal. In some instances the laryngeal region brightens for a moment and then darkens

again with no apparent vertical movement. Unusually high pixel brightness near the lips and

tongue tip may drown out the details of whatever laryngeal closure/raising there may be. At

other times, there is clear vertical laryngeal movement during a subsequent Spit Snare; Spit

Snares after forced Kick Drums co-occur with larynx raising, while Spit Snares after unforced

Kick Drums do not.

The relationship between sounds in beatboxing clusters—like the Kick Drums and

Inward Liprolls organized to the same beat—is unknown territory for beatboxing science, so

it is not clear how those Kick Drums should be expected to manifest. For this analysis, the

presence of any laryngeal closure at all during these Kick Drums is taken as indication that

they are forced, and the lack of noticeable vertical movement is attributed to undershoot (not

enough time for noticeable movement). Laryngeal movements marked on the time series

correspond to visual observations of laryngeal activity. At the very least, the Kick Drums just

before linguolabial closures {tbc} have clear laryngeal closure/raising.

As shown in the DOR time series, the tongue body is sometimes raised into an

extended closure and sometimes not. The tongue body is elevated overall because the DOR

region has at least some brightness at all times except during the Inward K Snare {^K} when

the tongue body completely leaves the region. The aperture of tongue body constriction

widens during most Inward Liprolls and High Tongue Basses, then decreases again as the

tongue body moves back into its closure before and after Spit Snares.

Figure 101. The regions used to make the time series for beat pattern 10.

Forced Kick Drum (left), unforced Kick Drum (center), and Spit Snare (right) in LAB region.

Inward Liproll in LAB2 region.

High Tongue Bass (left) and linguolabial closure (right) in COR region.

Inward K Snare (left) outside of the DOR region and (local) maximum larynx height during
a forced Kick Drum in the LAR region (right).

Figure 102. Time series of beat pattern 10.

3.4.2 Analysis of beat pattern 10

The domain of the Spit Snare’s harmony extends bidirectionally up to an Inward Liproll

{^LR} or High Tongue Bass {HTB}, then halts. As non-nasal pulmonic sounds, the Inward

Liproll and High Tongue Bass cannot be made with a tongue body closure because a tongue

body closure would prevent the pulmonic airflow from passing over the relevant oral
constriction. In speech harmony, sounds with this kind of physical antagonism to harmony

that also seem to stop the spread of harmony are generally analyzed as harmony blockers.

Alternatively, some sounds are analyzed as transparent to harmony, meaning they do not

prevent harmony from spreading but they also do not undergo a qualitative harmonious shift

either. It could be that the Inward Liproll and High Tongue Bass are transparent—tongue

body closure harmony continues through them, but the need for pulmonic airflow

temporarily trumps the tongue body closure.

The blocking analysis works slightly better here because of the presence of forced

Kick Drums. As we have seen in every other beat pattern so far, tongue body closure

harmony seems to trigger a qualitative shift in which forced Kick Drums become unforced,

losing their laryngeal closure/raising gestures and gaining a tongue body closure. Here

however there are some forced Kick Drums near pulmonic sounds. If harmony were not

blocked, then the Kick Drums should undergo harmony; since they do not, either they

are exceptional Kick Drums that are intrinsically resistant to harmony or they are defended

from harmony by other sounds that block harmony.5 There is no other reason to think that

any Kick Drums should be exceptional compared to others. A phonological analysis with

unexplained exceptionality is less appealing than an analysis that explains everything, so

5 This would be a problem in a traditional phonological analysis that treats sounds as sequential symbol strings. Consider the sequence {... ^LR B tbc b SS HTB …} in which tongue body harmony has spread regressively from the Spit Snare {SS} as indicated by underlining beneath the Spit Snare and undergoers. In this format, blocking from the Inward Liproll must “jump” over the forced Kick Drum to stop the harmony from affecting the forced Kick Drum and making *{... ^LR b tbc b SS HTB …}. In theories where sounds exist in time and can overlap, however, this is not as big an issue. If those Kick Drums are sufficiently temporally proximal to the blockers—and indeed many of the Kick Drums in this beat pattern partially overlap with the pulmonic sounds—then the harmonizing tongue body closure may simply still be blocked during them.

blocking is the preferred analysis over transparency here. The beat pattern in section 3.5

reinforces the blocking analysis.
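The blocking analysis can be made concrete with a toy model. The sketch below is an illustration only, not the dissertation's formal analysis, and it is subject to the symbol-string caveat in footnote 5: it spreads a harmonizing tongue body closure bidirectionally from each trigger until a blocker halts it. The category sets follow the analysis in this chapter, while the function itself is hypothetical.

```python
# Illustrative sketch: bidirectional harmony over a symbol string with
# triggers, undergoers, and blockers. Sound categories follow the text;
# the spreading procedure is a simplification for exposition.

TRIGGERS = {"SS", "CR", "LR"}        # lingual-airstream sounds
BLOCKERS = {"^LR", "HTB", "^K"}      # pulmonic sounds
UNDERGO = {"B": "b", "t": "dc"}      # forced Kick -> unforced; Hi-Hat -> dental

def harmonize(seq):
    """Spread tongue body closure outward from each trigger until blocked."""
    out = list(seq)
    for i, s in enumerate(seq):
        if s not in TRIGGERS:
            continue
        # spread leftward, then rightward, stopping at a blocker or the edge
        for rng in (range(i - 1, -1, -1), range(i + 1, len(seq))):
            for j in rng:
                if seq[j] in BLOCKERS:
                    break
                out[j] = UNDERGO.get(seq[j], out[j])
    return out

print(harmonize(["B", "SS", "B", "^K", "B"]))
# → ['b', 'SS', 'b', '^K', 'B']
# the Kick Drum after the Inward K Snare stays forced: harmony is blocked
```

Under the transparency analysis, by contrast, the final Kick Drum would also surface unforced, which is not what the articulatory data show.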

3.5 Beat pattern 1—Clickroll showcase

Beat pattern 1 is a Clickroll {CR} showcase beat pattern. Section 3.5.1 presents the beat

pattern in drum tab and time series forms, illustrating an example of tongue body harmony

that is periodically interrupted by Inward K Snares. Section 3.5.2 analyzes the pattern in

terms of a tongue body harmony trigger (the Clickroll), undergoers (the unforced Kick

Drum and dental closure), and a blocker (Inward K Snare).

Table 31. Sounds of beatboxing used in beat pattern 1.


Name BBX IPA Description

Clickroll {CR} [*] Voiceless lingual egressive alveolar trill

Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop

Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop

Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar affricate

Closed Hi-Hat {t} [t’] Voiceless glottalic egressive alveolar stop

Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop

3.5.1 Description of beat pattern 1

Drum tab


Beat pattern 1 (Figure 103) is composed of six sounds: the unforced and forced Kick Drums

{b} and {B}, Closed Hi-Hat {t}, dental closure {dc}, Inward K Snare {^K}, and Clickroll {CR}.

The Kick Drums follow a two-measure pattern of occurrence—beats 1, 2+, and 4 of the first

measure, then the “and”s of each beat in the second measure. The pattern repeats in the latter

half of the beat pattern except that the final Kick Drum is replaced by an Inward K Snare.

Inward K Snares additionally appear on beat 3 of each measure. Clickrolls in this beat

pattern are always co-produced on the same beat as an unforced Kick Drum, though the

reverse is not true (i.e., an unforced Kick Drum at the end of the second measure is not

co-produced with a Clickroll). The dental closure also follows a two-measure pattern with

occurrences on the 2 and 3+ of the first measure and beats 1, 2, and 4 of the second measure;

this pattern repeats in the latter half of the beat pattern, but a Closed Hi-Hat occurs where

the last dental closure is expected.

Figure 103. Drum tab notation for beat pattern 1.


b |x-----------x---|--x-----------x-|x-----------x---|--x-------------
B |------x---------|------x---x-----|------x---------|------x---x-----
t |----------------|----------------|----------------|------------x---
dc|----x-----x-----|x---x-------x---|----x-----x-----|x---x-----------
^K|--------x-------|--------x-------|--------x-------|--------x-----x-
CR|x~~~--------x~~~|--x~------------|x~~~--------x~~~|--x~------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
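The drum tab notation above is regular enough to read programmatically. As a rough illustration (not part of the dissertation's method), the following sketch converts tab rows like those in Figure 103 into (sound, measure, slot) onset events; the function name and the two-row subset are hypothetical.

```python
# Minimal sketch of a parser for the drum-tab notation used in this chapter.
# Each row is "<sound>|<16 slots per measure>|..."; 'x' marks an onset at
# that sixteenth-note slot, and '~' (a sustain continuation) is ignored here.

def parse_drum_tab(rows):
    """Return a list of (sound, measure, slot) onset events."""
    events = []
    for row in rows:
        name, _, grid = row.partition("|")
        name = name.strip()
        for measure, bar in enumerate(grid.strip("|").split("|"), start=1):
            for slot, ch in enumerate(bar):
                if ch == "x":
                    events.append((name, measure, slot))
    return events

tab = [
    "B |------x---------|------x---x-----",
    "^K|--------x-------|--------x-------",
]
print(parse_drum_tab(tab))
# → [('B', 1, 6), ('B', 2, 6), ('B', 2, 10), ('^K', 1, 8), ('^K', 2, 8)]
```

With four slots per beat, slot 6 corresponds to the "and" of beat 2, matching the forced Kick Drum placement described above.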

Time series


The time series representation for beat pattern 1 (Figure 105) follows five distinct time series:

labial closures (LAB), alveolar closures (COR), dorsal closures (DOR), velum position

(VEL), and larynx height (LAR). Note that in this beat pattern, the dental closure is usually

the release of a coronal closure caused by a Clickroll or Inward K Snare and does not have its

own closing action. The DOR time series illustrates that the tongue body is raised near the

velum throughout beat pattern 1 except during the Inward K Snare and after the penultimate

Inward K Snare. Figure 104 shows that even what appears to be tongue body lowering during

the Inward K Snares is actually tongue fronting; whether for the Inward K Snare or for

harmony, the tongue body is elevated until close to the end.

Surprisingly, there are several forced Kick Drums in the beat pattern despite the

consistently raised tongue body posture. Tongue body closure and larynx raising are not

physically impossible to produce together, but every example thus far has shown that tongue

body closures cancel larynx closure/raising gestures during harmony. Here, forced Kick

Drums before Inward K Snares are produced with both laryngeal raising and a raised tongue

body. The velum (VEL) time series shows how a Kick Drum can be forced even when the

tongue body is high. In this beat pattern, persistent tongue body closures are made by the

tongue body and velum coming together; among other things, this allows the beatboxer to

breathe through the nose while simultaneously using the mouth to create sound. During the

forced Kick Drums, harmony ends not by lowering the tongue body but rather by raising the

velum in preparation for the pulmonic ingressive Inward K Snare (Figure 105). This directs

laryngeal air pressure manipulations through the air channel now extant over the tongue.

The resulting Kick Drums therefore have larynx raising without tongue body closure, giving

them the form typically expected for Kick Drums. The last forced Kick Drum differs from

the rest: it is the only one for which the tongue body does not appear to be raised toward the

velum—nor does it appear to be making any particular constriction at all (Figure 107).

Figure 104. Regions for beat pattern 1 (Clickroll showcase).

Labial (LAB) closures for forced Kick Drum (left) and unforced Kick Drum (right).

Tongue tip (COR) closures for the Clickroll (left), dental closure (center), and Closed
Hi-Hat (right).

The tongue body is out of the DOR region during the Inward K Snare.

Larynx (LAR) region filled at the MAXC of the larynx raising associated with a forced Kick
Drum (left). The right image was taken from the PVEL2 of the tongue tip release (COR time
series).

Figure 105. Time series of beat pattern 1.

Figure 106. The DOR region for the Clickroll showcase (beat pattern 1) in the first {CR dc B
^K}. Left: The tongue body is raised and the velum is lowered during the Clickroll {CR},
leaving no air channel over the tongue body; pixel intensity in the region is high. Center: The
tongue body is raised during a forced Kick Drum, but the velum is also raised so there is a
gap between the tongue body and the velum through which air can pass; pixel intensity in
the region is high. Right: The tongue body is shifted forward during the lateral release of an
Inward K Snare; pixel intensity in the region is low.

Figure 107. Each forced Kick Drum in the beat pattern in order of occurrence. The image was
taken from the frame of the LAB region’s peak velocity (change in pixel intensity)
corresponding to PVEL2 as described in Chapter 2: Method. The final Kick Drum (far right)
signals that harmony has ended because the tongue body is not making a narrow velar
constriction.

Figure 108. Upper left: Labial and laryngeal gestures for an ejective/forced Kick Drum before
an Inward K Snare in beat pattern 1. Upper right: Labial gesture for a non-ejective/unforced
Kick Drum in beat pattern 1. Lower left: Near the time of maximum labial constriction for
the ejective Kick Drum, the vocal folds are closed (visible as tissue near the top of the
trachea) and the airway above the larynx is open, including a narrow passage over the tongue
body which is raised but not making a closure with the velum; the velum is raised. Lower
right: At the time of maximum labial constriction for the non-ejective unforced Kick Drum,
the vocal folds are open and the tongue body connects with a lowered velum to make a velar
closure.
Forced (ejective) Kick Drum Unforced (lingual) Kick Drum

3.5.2 Analysis of beat pattern 1

Harmony trigger


Harmony was readily apparent in beat patterns 5 (section 3.1) and 9 (section 3.2) in the form

of a spreading tongue body closure triggered by a Spit Snare, and in beat pattern 4 (section

3.3) from a Liproll trigger. In those beat patterns, the tongue body constriction degree is

consistent throughout—every sound is made with a tongue body closure. Beat pattern 1 is

more challenging to analyze as harmony because the tongue body closure is frequently

interrupted, making it relatively more difficult to spot prolonged tongue body closures or to

know what sounds might have triggered them. For this beat pattern the Clickroll is the only

sound made with a tongue body closure when performed in isolation, making it the most

likely trigger.

Harmony undergoers


Chapter 5: Alternations used metrical patterning to motivate the analysis that the unforced

Kick Drum is a Kick Drum alternation that occurs in dorsal environments. This beat pattern

provides evidence that the dental closure {dc} may be an alternation of the Closed Hi-Hat.

The drum tab in Figure 103 shows that the Closed Hi-Hat appears at the end of the

performance in precisely the metrical position where a dental closure is expected. This would

make three clear harmony undergoers from the beat patterns analyzed so far: the forced Kick

Drum to the unforced Kick Drum, the PF Snare to a labiodental closure, and the Closed

Hi-Hat to a dental closure.

But given the frequent tongue body closure interruptions and use of several forced

Kick Drums, it is not clear whether the unforced Kick Drums and dental closure alternants

should be considered the result of harmony or more simply a consequence of local

assimilation. All the unforced Kick Drums in this beat pattern save one are produced in the

same metrical position as a Clickroll; the tongue body must be raised in anticipation of the

Clickroll, so the Kick Drum on the same beat as a Clickroll must be co-produced with a

tongue body closure—there is not enough time between the release of the labial closure and

the onset of trilling to raise the tongue body and create the necessary pocket of air between

the tongue body and the tongue tip. Likewise, the starting conditions for most dental

closures are the result of tongue closures for a Clickroll or Inward K Snare. Dental closures

may occur as alternants of Closed Hi-Hats in this environment simply because they are

mechanically advantageous given the current state of the vocal tract, not because of a

harmonizing tongue body closure.

Two aspects of the data suggest that this is a harmony pattern. First, the absence of

laryngeal closure/raising gestures: if the Kick Drums simply became percussives because of a

concurrent but non-inhibiting tongue body closure gesture, there should still be larynx

raising—which there is not. Second, there is the sequence {... ^K B dc b | b-CR ...}

which begins from beat 3 of the second measure (the pipe character | indicates the divide

between measures 2 and 3, and the hyphen in {b-CR} indicates that the sounds are made on

the same beat). This sequence features a dental closure {dc} and unforced Kick Drum {b}

(both underlined) that are made without an adjacent Clickroll or even an adjacent Inward K

Snare—that is, without any sounds nearby that require a tongue body closure. If the

alternations from forced Kick Drum to unforced Kick Drum and from Closed Hi-Hat to

dental closure in this beat pattern were due only to coproduction, then these particular

dental closure and unforced Kick Drum should have been ejective Closed Hi-Hat and forced

Kick Drum instead. The presence of tongue body closure here, despite there being no

immediate coarticulatory origin for it, indicates harmony. Extrapolating this to the rest of the

beat pattern, the unforced Kick Drums and dental closures in this pattern can be described

as the result of the same bidirectional tongue body closure harmony that appeared in beat

patterns 5, 9, and 4.

Harmony blocker


As noted earlier, there are forced Kick Drums here which by definition are produced without

a tongue body closure; there are also Inward K Snares which move the tongue body closure

to a different location, lateralize it, and bring air flowing pulmonic ingressively through the

mouth and over the tongue. Neither sound is participating in harmony, either as a trigger or

as an undergoer. Section 3.4 suggested that pulmonic sounds like the Inward Liproll and

High Tongue Bass are harmony blockers that defend the Kick Drums from harmonizing too;

if so, then the pulmonic ingressive Inward K Snare can also be analyzed the same way. Just as

in section 3.4, these forced Kick Drums are close enough temporally to the Inward K Snare

that they can also benefit from the blocking of the tongue body harmony.

The last measure of this beat pattern provides perhaps the clearest demonstration

that harmony is blocked by the Inward K Snare. Figure 107 illustrates that all but the last

forced Kick Drum are co-produced with a tongue body constriction. Suspiciously, all but the

last forced Kick Drum also fall somewhere between a Clickroll and an Inward K Snare,

precisely where harmony is predicted to be trying to spread, whereas the last forced Kick

Drum has no Clickroll in its vicinity. The penultimate Inward K Snare blocks harmony for

the last time and so all following sounds are made without influence from a tongue body

closure. This notably includes the Closed Hi-Hat which has never appeared so far in

a harmony pattern but occurs frequently outside of harmony (see Chapter 3: Sounds); its

appearance in this beat pattern is another indicator that harmony has ended.

4. Theoretical accounts and implications

The introduction of this chapter posed two main questions. First, descriptively, does

beatboxing exhibit signature properties of phonological harmony like triggers, undergoers,

and blockers? And second, what can be concluded about beatboxing cognition from the

description of beatboxing harmony? With respect to the first question, section 3 found that

there are indeed beat patterns with sustained tongue body closure that can be described as

bidirectional harmony. Those patterns include sounds associated with those tongue body

closures that act as triggers, sounds that undergo qualitative change because of the harmony,

and sounds that block the spread of harmony. The remainder of this chapter addresses the

second question about the implications for beatboxing cognition in two parts: the evidence

for cognitive sub-segmental units (section 4.1), and a discussion of how beatboxing harmony

might be implemented in gestural and featural frameworks (section 4.2).

4.1 Evidence for cognitive sub-segmental beatboxing units

It is hopefully uncontroversial that beatboxing sounds are (or have) cognitive

representations. Beatboxers learn categories of sounds and overtly or covertly organize them

by their musical role; they can also name many of the sounds they can produce, and likewise

produce a sound they know when prompted with its name. All of this knowledge is necessary

and inevitable for skilled beatboxers. Less clear is the nature and composition of those

representations. The question at hand is whether there is evidence for cognitive units

different from whole beatboxing sounds (sub-segmental units), like gestures.

Chapter 3: Sounds characterized beatboxing sounds along a relatively small set of

phonetic dimensions, but cautioned that finding observable dimensions does not imply the

cognitive reality of those dimensions. The atoms of speech—units the size of features or

gestures—are argued to be cognitive because of many years of observational data and more

recent (40-50) years of experimental data showing that sounds pattern along these phonetic

dimensions. In almost all cases the patterns occur for sounds of a particular “natural” class,

which is to say that the sounds involved share one or more phonetic properties.

If there is any cognitive reality to the phonetic dimensions of beatboxing sounds, then

beatboxing sounds belonging to a given class defined by one or more phonetic dimensions

should share a certain pattern of behavior. Beatboxing harmony provides a window through

which to assess the possibility of sub-segmental cognitive beatboxing units. Triggers,

undergoers, and blockers have complementary behavior in harmony; if they also have

complementary phonetic dimensions relevant to the harmony, then those dimensions will

satisfy the criteria above for being cognitively real.

Table 26 (reprinted). The beatboxing sounds involved in harmony organized by their
harmony role.
Name BBX IPA Description

Triggers

Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate

Clickroll {CR} [*] Voiceless lingual egressive alveolar trill

Liproll {LR} [ʙ̥↓] Voiceless lingual ingressive bilabial trill

Blockers

Inward Liproll {^LR} [ʙ̥↓] Voiceless pulmonic ingressive bilabial trill

High Tongue Bass {HTB} [r] Voiced pulmonic egressive alveolar trill (high pitch)

Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar affricate

Undergoers (alternants of other sounds)

Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop

Labiodental closure {pf} [ʘ̪] Voiceless percussive labiodental stop

Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop

Other

Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop

Closed Hi-Hat {t} [t’] Voiceless glottalic egressive alveolar stop

Humming {hm} [C̬] Pulmonic egressive nasal voicing

Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop

Dental-alveolar closure {dac} Voiceless percussive laminal dental stop

Alveolar closure {ac} [k͜ǃ] Voiceless percussive alveolar stop

Lateral alveolar closure {tll} [ǁ] Voiceless percussive lateral alveolar stop

Table 26 (reprinted) lists the sounds that participate in the five analyzed beat patterns with

harmony according to their function in the harmony pattern. The sounds in the “other”

group are sounds which were either prevented from undergoing harmony by nearby blocking

sounds (the forced Kick Drum and Closed Hi-Hat) or for which there is not sufficient

evidence to say what their role is (humming, and some percussives). Within each group, the

sounds do not belong to the same musical category (i.e., snare, kick, roll) and do not have the

same primary constrictors. Though the undergoers all happen to be made with compressed

oral closures (i.e., as stops), neither the triggers nor the blockers pattern by constriction

degree within their groups. The only phonetic dimension along which all three groups

pattern complementarily is their general airstream type: triggers have a lingual airstream,

undergoers are percussives, and blockers have pulmonic airstream.

As discussed in section 3 and in Chapter 5: Alternations, the percussive undergoers

were never identified by this beatboxer as distinctive sounds, are restricted to occurring near

other sounds with tongue body closures, and pattern metrically like their glottalic egressive

counterparts {B}, {PF} (glottalic egressive labiodental affricate), and {t}. (The four coronal

percussives in the “Other” group in Table 26 may also be alternants of a coronal sound like

the Closed Hi-Hat {t}, but there is not enough metrical data to be sure.) Based on this, the

sounds that undergo harmony are likely intended to be glottalic sounds but because of the

harmony are produced with a tongue body closure and without a laryngeal gesture.

Re-phrasing the airstream conclusion from the previous paragraph: triggers have lingual

airstream, undergoers shift from glottalic airstream to percussive, and blockers have

pulmonic airstream.

An equivalent way to characterize the pattern is that the triggers are all composed of

a tongue body closure gesture and another more anterior constriction whereas the rest of the

sounds do not have tongue body closures—and in the case of the Inward K Snare, do not

have an additional simultaneous anterior constriction. Pulmonic sounds, the blockers, may

override tongue body closure harmony because they fulfill both musical and homeostatic

roles (keeping the beatboxer alive long enough to finish their performance). The remaining

sounds, which happen to be glottalic, would not benefit homeostatically from blocking the

spread of the tongue body closure (since they do not afford breathing in any case due to

their usual glottal closure) and in undergoing the harmony they lose their laryngeal raising

since it is rendered inert with respect to pressure regulation by the tongue body closure.

The criteria for a phonetic dimension to be counted as a cognitive dimension were

that there must 1) be a class of sounds sharing that dimension which 2) collectively

participate in some behavior. Not only do the trigger sounds analyzed in these five beat

patterns all share the lingual airstream dimension, but so also do the showcased sounds in

the beat patterns not analyzed above—the Clop, Duck Meow SFX, Water Drop Air, and

Water Drop Tongue are all either lingual egressive or lingual ingressive and are the most

likely candidates for triggering harmony in their beat patterns. These seven lingual sounds

are also the complete set of lingual airstream sounds for this beatboxer: every lingual

airstream sound performed by this beatboxer is a likely trigger for tongue body closure

harmony. (See the appendix for drum tabs of every harmony-containing beat pattern.) The

triggers therefore constitute a natural class within the set of beatboxing sounds this

beatboxer knows. With respect to the two criteria, harmony triggers 1) share the dimension

of lingual airstream and 2) collectively trigger harmony. There is not enough data to say for

sure whether every pulmonic sound is a harmony blocker, but the evidence so far predicts as

much. Tongue body closure can therefore be analyzed as a sub-segmental cognitive

representation because it places the trigger sounds in a cognitive relation with each other; in

doing so, it also places the triggers (lingual), blockers (pulmonic), and undergoers (other) in

a complementary cognitive relationship with each other. Section 4.2 offers a theoretical

account of tongue body closure harmony in a gesture-based framework, notably positing

pulmonic gestures for beatboxing which act as blockers of tongue body closure gestures.

4.2 Gestural implementation of beatboxing harmony

Having established that there are sub-segmental cognitive units of beatboxing, the next step

is to develop a theoretical account of harmony based on those units. A theoretical account of

beatboxing harmony needs to account for the behavior of triggers and their prolonged

tongue body closures, undergoers which lose a laryngeal raising gesture when the extended

tongue body closure spreads through them, and pulmonic blockers that disrupt the

spreading of the tongue body closure. This section compares two gestural

accounts—the Gestural Harmony Model (Smith, 2018) and Tilsen’s (2019) extension to the

selection-coordination model—and, briefly, a symbolic account.

4.2.1 Gestural Harmony Model

Chapter 4: Theory provides the basis for an action-based account of beatboxing phonology.

Speech and beatboxing movements share certain constriction tasks and kinematic properties,

suggesting that the fundamental cognitive units of beatboxing are the same types of actions

as speech units—albeit with different purposes. In the language of dynamical systems, this

equivalence is expressed through the graph level which speech gestures and beatboxing

actions are hypothesized to share.

The Gestural Harmony Model (Smith, 2018) provides the means for generating these

beatboxing harmony phenomena. The Gestural Harmony Model extends the gestures of task

dynamics with a new parameter for persistent or non-persistent activation, and extends the

coupled oscillator model of intergestural coordination with an option for intergestural

inhibition.6 In speech, persistent activation allows a gesture to last until it is inhibited by

another gesture or until the end of the word containing the gesture. These additions to the

model are new parameters; because the graph level deals with selection of dynamical system

parameters and the relationship of those parameters to each other and to dynamical state

variables, the addition of new parameters to the model is a graph-level change (Table 32).

Under the shared-graph hypothesis, the Gestural Harmony Model’s revisions to speech

graphs must also be reflected in the graphs of beatboxing actions and their coordination. All

of the different gestural arrangements possible under the Gestural Harmony

Model—including pathological patterns that are unattested in speech, as discussed

below—are predicted to be available to beatboxing as well. So just like speech, beatboxing

can have persistent actions which last until they are inhibited by another beatboxing action

6 The coupled oscillator model does not have a mechanism for starting a tongue body closure early, stretching it
regressively. Typically a gesture’s activation is associated with a particular phase of its oscillator; the oscillators
settle into a stable relative phase relationship based on their couplings before an utterance is produced, giving
later activation times to gestures associated with later sounds. The Gestural Harmony Model uses constraints in
an OT grammar to shift the onset of activation of a persistent gesture earlier in an utterance. A similar strategy
could be used for beatboxing harmony, or else a more dynamical method of selecting coupling relationships. In
either case, the force that causes harmony to happen in a theoretical model must be related to the aesthetic
principles that shape beatboxing—here perhaps the drive to create a cohesive aesthetic through a consistently
sized resonance chamber (the oral chamber in front of the tongue body). The formalization of that force is left
for future work.

or until the end of a musical phrase. The next few paragraphs schematize how the Gestural

Harmony Model accounts for the harmony patterns in beatboxing.

Harmony triggers and undergoers


To start, consider the simplest sequence of two sounds: a Kick Drum and a Spit Snare (see

beat pattern 5 for an example). The Kick Drum is an ejective composed of a labial

compression action and a laryngeal closing and raising action, and the Spit Snare is a lingual

egressive sound composed of a labial compression action and a tongue body compression

action. These compositions are laid out in a coupling graph at the top of Figure 109 with

coupling connections between the paired actions for each sound—the specific nature of these

connections in a coupled oscillator model determines the relative timing of these actions and

contributes to the perception of multiple gestures as part of the same segment; for present

purposes, however, the important coupling relationship to watch for is the inhibitory

coupling.

Section 3 characterized the Spit Snare as a harmony trigger, so the tongue body

closure of the Spit Snare needs to turn the Kick Drum into an unforced Kick Drum via

temporal overlap. This is accomplished by flagging the Spit Snare’s tongue body closure

gesture as persistent—marked with arrow heads on the top and bottom of the oscillator in

the coupling graph—causing it to extend temporally as far as possible both forward and

backward. By extending backward, the tongue body closure is activated before or around the

same time as the labial closure of the Kick Drum, resulting in the production of a Kick Drum

that has adopted a tongue body closure (an unforced Kick Drum). The gestural score below

the coupling graph in Figure 109 shows this temporal organization.

Table 32. Non-exhaustive lists of state-, parameter-, and graph-level properties for dynamical
systems used in speech from Chapter 4: Theory. Parameter additions to the system from the
Gestural Harmony Model are marked with an asterisk. Because the graph level is responsible
for the selection of and relationship between parameter and state variables, the addition of
persistence and inhibition to the parameter space is a graph-level change.

System type: Gesture
  State level: Position; Velocity; Acceleration; Activation strength
  Parameter level: Target state; Stiffness; Strength of other movement forces (e.g., damping); Blending strength; *Persistence
  Graph level: System topology (e.g., point attractor); Tract variable selection; Selection of and relationship between parameter and state variables

System type: Coupled oscillators
  State level: Phase
  Parameter level: Activation/deactivation phase; Oscillator frequency; Coupling strength & direction; Coupling type (in-phase, anti-phase); *Inhibition
  Graph level: Number of tract variables; Intergestural coupling; Selection of and relationship between parameter and state variables

Section 3 also showed that the laryngeal raising gesture of the Kick Drum disappears when it

harmonizes to the tongue body closure of the Spit Snare. This can be accomplished with an

inhibitory coupling relationship: if an inhibitor sound is scheduled to activate before an

inhibitee sound to which it is coupled, then the inhibitee is prevented from activating at all.

The coupling graph in Figure 109 shows an inhibitory coupling relationship between the

tongue body closure of the Spit Snare and the larynx raising gesture of the Kick Drum; since

the tongue body closure starts before the laryngeal gesture, the laryngeal gesture never

activates. The gestural score in Figure 109 shows the “ghost” of the laryngeal gesture as a

visual indication that it was intended but never produced.
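As a rough illustration, the persistence and inhibition mechanics just described can be stated procedurally. The sketch below is a simplification (the gesture names, timestamps, and data layout are invented for this example, not measured values): a persistent gesture extends across the whole planned span, and an inhibitee never activates if its coupled inhibitor is scheduled to begin at or before it.

```python
# Sketch of persistent-gesture extension and inhibitory coupling.
# Gesture names, times, and the data layout are invented for illustration.

def realize(gestures, inhibition):
    """gestures: {name: (onset, offset, persistent?)}
    inhibition: (inhibitor, inhibitee) coupling relations.
    Returns {name: (onset, offset)} for gestures that actually activate."""
    start = min(on for on, off, p in gestures.values())
    end = max(off for on, off, p in gestures.values())
    active = {}
    for name, (on, off, persistent) in gestures.items():
        # A persistent gesture extends as far as possible both ways.
        active[name] = (start, end) if persistent else (on, off)
    for inhibitor, inhibitee in inhibition:
        # If the inhibitor is scheduled to activate at or before the
        # inhibitee, the inhibitee never activates at all.
        if inhibitor in active and inhibitee in active:
            if active[inhibitor][0] <= active[inhibitee][0]:
                del active[inhibitee]
    return active

# Kick Drum (labial closure + laryngeal raising) followed by a Spit Snare
# whose persistent tongue body closure inhibits the laryngeal gesture:
score = realize(
    {"KD_labial": (0.0, 0.1, False),
     "KD_larynx": (0.0, 0.1, False),
     "SS_tongue_body": (0.2, 0.3, True)},
    inhibition=[("SS_tongue_body", "KD_larynx")])
# score == {"KD_labial": (0.0, 0.1), "SS_tongue_body": (0.0, 0.3)}:
# the laryngeal gesture is never produced, yielding an unforced Kick Drum.
```

The tongue body closure stretches back to time 0.0 and so precedes the laryngeal gesture, suppressing it entirely while leaving the labial closure intact.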

Why does this inhibitory relationship exist in the first place? Laryngeal activity isn’t

antagonistic to tongue body closure—dorsal ejectives are well-attested in languages, so

clearly laryngeal closure/raising and tongue body closure action can even be collaborative.

And, we have seen that this canceling of the laryngeal closure/raising gesture is not a blanket

inhibition on all laryngeal activity. Figure 110 depicts the same relationship between Kick

Drum and Spit Snare as Figure 109, but with the addition of a humming phonation gesture

from the humming-while-beatboxing pattern (section 3.2). The persistent tongue body

closure from the Spit Snare inhibits the laryngeal raising gesture of the Kick Drum, just as it

did in Figure 109; however, the humming gesture has no inhibitory coupling relations, so it is

free to manifest at the appropriate time. The result is an unforced Kick Drum coarticulated

with humming and followed by a Spit Snare. (The humming gesture is depicted with

in-phase coupling to the labial closure of the Kick Drum as a way of showing that the

humming and the Kick Drum occur together on the same beat. In a more expansive account,

they might not be coupled to each other directly but instead share the activation phase of

some metrical oscillator.)

One answer is that closing the vocal folds reduces opportunities to manage the

volume of air in the lungs. Expert beatboxing requires constant breath management because

the ability to produce a given sound in an aesthetically pleasing manner requires specific air

pressure conditions. We have seen that beat patterns can include sound sequences with many

different types of airflow; in planning the whole beat pattern (or chunks of the beat pattern),

beatboxers must be prepared to produce a variety of airstream types and so are likely to try

to maintain breath flexibility. Laryngeal closures prevent the flow of air into and out of the

lungs for breath management purposes, and therefore are antagonistic not to the tongue

body closure but to breath control. By this explanation, the inhibition of the laryngeal

closure/raising gesture by the tongue body closure gesture is a formalization of a qualitative

shift in how airflow is managed in the vocal tract.

A different explanation is that cancellation of the laryngeal closure/raising gesture is

an adaptive coordinative pattern that dynamical approaches to speech predict as a hallmark

of skill (Pouplier, 2012). The coordination of the body’s end effectors changes as the

organism’s intentions change, sometimes resulting in qualitative shifts in coordinative

patterns; this has been notably recognized in quadrupeds like horses which switch into

distinct but roughly equally efficient gaits for different rates of movement (Hoyt & Taylor,

1981). In the case of laryngeal closure and raising in beatboxing, expert beatboxers are likely

to recognize that the laryngeal gesture they usually associate with a forced Kick Drum (or

other glottalic sounds that undergo harmony) has no audible consequence during tongue

body closure harmony. From this feedback, a beatboxer would learn a qualitative shift in

behavior—to not move the larynx while the tongue is making a closure. A similar thing

happens in speech in the context of assimilation due to overlap: Browman & Goldstein

(1995) provide measurements showing that when a speaker produces the phrase “tot puddles” there is

wide variation in the magnitude of the final [t] tongue tip constriction gesture, including

effective deletion. In this example, the speaker reduces or deletes their gesture when it would

have no audible consequence anyway. The same could be said of the laryngeal closure and

raising gesture in beatboxing when overlapped with tongue body closure harmony.

Figure 109. A schematic coupling graph and gestural score of a Kick Drum and Spit Snare.
The tongue body closure (TBCD) gesture of the Spit Snare overlaps with and inhibits the
closure and raising gesture of the larynx (LAR).

Figure 110. A schematic coupling graph and gestural score of a Kick Drum, humming, and a
Spit Snare. The tongue body closure (TBCD) gesture overlaps with and inhibits the closure
and raising gesture of the larynx (LAR) as in Figure 109, but the humming LAR gesture is
undisturbed.

Harmony blockers


The apparent blocking behavior of the Inward K Snare can also be accounted for with

inhibitory coupling. Figure 111 shows an example from beat pattern 1 with a {b CR B ^K}

sequence. The Inward K Snare requires a tongue body closure and lung expansion to draw

air inward over the sides of the tongue body, which is incompatible with a full tongue body

closure triggered by the Clickroll. This lung expansion action ends the persistent tongue

body gesture associated with a harmony trigger—if it didn’t, then the tongue body closure

would block the inward airflow and the Inward K Snare couldn’t be produced. Inhibiting the

persistent tongue body closure also prevents the persistent tongue body closure gesture from

inhibiting the laryngeal gesture of the Kick Drum between the Clickroll and the Inward K

Snare. As a result, the first Kick Drum that does overlap with the persistent tongue body

closure gesture has its laryngeal closure/raising gesture inhibited, but the second Kick Drum

will not.

Positing a breathing task is a major departure from the typical tract variables of

Articulatory Phonology. Lung movements are not considered a gesture in Articulatory

Phonology, and reasonably so—no language uses pulmonic ingressive airflow to make a

phonological contrast (Eklund, 2008). Pulmonic egressive airflow, on the other hand, is

practically ubiquitous in speech, which means that it does not really operate contrastively

either. Either way, there has been no need to posit any kind of pulmonic gesture for speech.

But in beatboxing, pulmonic activity is contrastive (see Beatboxing sound frequencies)

and appears to contribute to productive sound patterns, indicating that it is cognitive too.

Figure 111. A schematic coupling graph and gestural score of a {b CR B ^K} sequence. The
tongue body closure (TBCD) gesture of the Clickroll overlaps with and inhibits the laryngeal
closing and raising gesture (LAR) of the first Kick Drum. The lung expansion (PULM)
gesture coordinated with the Inward K Snare inhibits the TBCD gesture of the Clickroll
before the TBCD gesture can inhibit the second LAR gesture.

The shared-graph hypothesis of Chapter 4: Theory predicts that beatboxing and speech will

exhibit similar patterns of behavior permitted by the dynamical graph structures they use.

The Gestural Harmony Model augments the graphs of the task dynamics framework and the

coupling graph system to include gestural persistence and inhibition options; any predictions

of possible action patterns made by the Gestural Harmony Model should therefore also be

predictions about possible beatboxing patterns. The finding that beatboxing harmony exists

in such a speechlike form provides evidence in favor of both the shared-graph hypothesis

and the Gestural Harmony Model.

The support is all the stronger because the gestural analysis of beatboxing harmony

includes patterns that are predicted by the Gestural Harmony Model but unattested in

speech. As Smith (2018:204-206) discusses, intergestural inhibition may not be constrained

enough for speech: inhibition is introduced specifically so that an inhibitor gesture can block

the spread of a persistent inhibitee gesture, but it is equally possible in the model that a

persistent gesture could inhibit non-persistent gestures—even though such a thing never

appears to occur in speech. Within the narrow domain of speech, the Gestural Harmony

Model over-generates inhibition patterns. But beatboxing uses those patterns when

persistent tongue body closure gestures inhibit laryngeal raising gestures; under the

shared-graph hypothesis, the predictions of the Gestural Harmony Model are met with

evidence. (It is of course possible, maybe even likely, that the lack of attestation of this

particular inhibitory pattern is simply due to a relative scarcity of articulatory

investigations of the types of speech harmony that could exhibit this. If this pattern were

found in speech, it would mean that the Gestural Harmony Model does not over-generate

patterns and that speech and beatboxing harmony have one more thing in common.)

4.2.2 Extension to selection-coordination-intention

Tilsen (2019) offers two different gesture-based accounts for the origins of non-local

phonological processes as emerging from stochastic motor control variability within an

extension to the selection-coordination-intention model (Tilsen, 2018)—that is, in this

system, harmony is thought to start off more or less accidentally because of how

domain-general motor control works and to later become phonologized into a regular part of

speech. In this model, the selection of gestures or groups of gestures is modeled as a

competitive gating process of activation in a dynamical field: when a group of gestures is

selected, their excitation level ramps up high enough to trigger the dynamic motor processes
associated with that set of gestures; other gestures that have not been selected yet or which

were already selected and subsequently “demoted” are still present but are not excited

strongly enough to be selected or to influence motor planning fields in any way. The

continuous process of selection is discretized into static “epochs” that describe a snapshot

view of the whole state of the system and the gestures therein. One cause of demotion is

inhibition—gestures are conceived of as having both excitatory and inhibitory manifestations

in the dynamical field. Gestural antagonism is formalized as one gesture’s excitatory side

conflicting with another’s inhibitory side; when two antagonistic gestures would be selected

into the same epoch, the inhibitory gesture demotes the excited gesture from the selection

pool.

Tilsen’s account of local spreading harmony (which we have argued is the nature of

beatboxing harmony above) arises from “selectional dissociation”, a theoretical mechanism

by which a gesture may be selected early or de-selected late relative to the epoch it would

normally be selected into. Blocking in this model occurs when a gesture which has been

selectionally dissociated conflicts with an inhibitory gesture in another epoch. In the case of

nasal harmony, for example, a velum lowering gesture might fail to be suppressed, causing it

to remain selected in later epochs; this would be progressive/rightward spreading of a

gesture. The velum lowering would be blocked if it were ever extended into an epoch in

which an antagonistic, inhibiting velum closing gesture was also selected: the inhibitory

velum raising gesture would demote the velum lowering gesture, causing the lowering

gesture to slip below the threshold of influence over the vocal tract.

The tongue body closure spreading and pulmonic airflow blocking in beatboxing can

be accounted for by similar means, with the tongue body closure gesture being anticipatorily

de-gated (selected early) and remaining un-suppressed unless it conflicts with the selection

of an antagonistic pulmonic gesture (e.g., from an Inward K Snare) in a later epoch. This has

the advantage of providing an explicit explanation for why some Kick Drums in sequences

like {CR dc B ^K} do not undergo harmony: if the Kick Drum and Inward K Snare are

selected during the same epoch, then the Inward K Snare’s pulmonic gesture blocks the

spread of harmony during that whole epoch, effectively defending the Kick Drum from

harmony.
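As a caricature of the selection dynamics described above (not the model’s actual dynamical field equations; the epoch granularity and gesture labels here are invented for illustration), the spreading-and-blocking pattern can be sketched as epoch-by-epoch bookkeeping:

```python
# Caricature of Tilsen-style epoch selection: a spreading gesture stays
# selected across later epochs unless an epoch co-selects an antagonistic
# blocker, which demotes it. Epoch granularity and labels are invented.

def spread(epochs, spreader, blocker):
    out, carrying = [], False
    for selected in epochs:
        selected = set(selected)
        if spreader in selected:
            carrying = True                  # selectional dissociation begins
        if carrying:
            if blocker in selected:
                selected.discard(spreader)   # antagonist demotes the spreader
                carrying = False
            else:
                selected.add(spreader)       # failure of suppression
        out.append(selected)
    return out

# The Clickroll's tongue body closure (TBCD) spreads onto later epochs
# until the Inward K Snare's pulmonic gesture (PULM) blocks it:
separate = spread([{"TBCD"}, {"dc"}, {"B"}, {"PULM"}], "TBCD", "PULM")
# separate == [{"TBCD"}, {"dc", "TBCD"}, {"B", "TBCD"}, {"PULM"}]

# If the Kick Drum shares an epoch with the Inward K Snare, it is
# "defended" from harmony for that whole epoch:
shared = spread([{"TBCD"}, {"dc"}, {"B", "PULM"}], "TBCD", "PULM")
# shared == [{"TBCD"}, {"dc", "TBCD"}, {"B", "PULM"}]
```

The second call shows why a Kick Drum co-selected with the Inward K Snare escapes harmony: the pulmonic gesture demotes the spreading closure for the entire epoch.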

As hinted above, the selection-coordination-intention model offers a second

mechanism for dealing with non-local phonological agreement patterns: “leaky” gating. A

gesture that is gated is not selected and therefore exerts no influence on the tract variable

planning fields—and therefore, has no influence on the vocal tract. But if a gesture is

imperfectly gated, its influence can leak into the tract variable planning field even though it

hasn’t been selected. Leaky gating cannot be blocked because blocking is formalized as a

co-selection restriction; since the leaking gesture has not actually been selected, it cannot be

demoted. Local spreading harmony—including the beatboxing examples above—often

features blocking behavior, which makes leaky gating inappropriate for the crux of a

spreading harmony analysis (but useful as an analysis of long-distance agreement harmony

which is generally not blocked). But there is nothing to say that leaky gating can’t be used

with selectional dissociation; on the contrary, if a spreading gesture has an intrinsically high

excitation level, it would be all the more likely to lurk beneath the selection threshold,

leaking its influence into the vocal tract without antagonizing the currently selected gestures.

This could explain why the tongue body remains elevated during most of the forced Kick

Drums in the complex example in section 3.5.1: the pulmonic gesture of the Inward K Snare

blocks the spreading tongue body closure gesture by demoting it to sub-selection levels, but

the tongue body closure gesture leakily lingers and keeps the tongue body relatively high.

Only near the end of the beat pattern does the tongue body closure gesture stop leaking,

presumably because there are no more Clickrolls to reinforce its excitation.

So far as we can tell, however, the loss of the laryngeal closing/raising gestures during

Kick Drums and other harmony-undergoer sounds cannot be accounted for in this model.

Laryngeal closure/raising is not physically antagonistic to tongue body closure, so there is no

reason to posit a pre-phonologized inhibitory relationship between a tongue body gesture

and a laryngeal closure/raising gesture. If antagonism is defined phonologically instead of

phonetically, the complementary behavior of triggers, undergoers, and blockers may be

enough to set up phonological antagonism between their respective airstream initiator

gestures—but it is not clear in the model what the nature of the antagonism is or how this

type of antagonism might be learned without a physical antagonism first.

4.2.3 Domain-specific approaches

To conclude this theoretical accounting of beatboxing harmony, recall from section 1 that

models of phonological harmony that only account for linguistic harmony should be

dispreferred to models that can accommodate beatboxing harmony as well. What about a

more traditional, purely domain-specific phonological framework based around symbolic

features instead of gestures?

Most computational approaches are likely to be able to provide an account of

beatboxing harmony, though great care would need to be taken in order to define sensible

features for beatboxing. One might posit a set of complementary airstream features {+

pulmonic} and {+ lingual} for sounds with either pulmonic or lingual airstream. An Inward K

Snare would be defined as {+ pulmonic} for airstream and, because it is made with the

tongue body, {+ dorsal} for its place feature (the primary constrictor, when mapped to

phonetics). Because pulmonic and lingual airstreams are complementary, the Inward K Snare

would also be {- lingual}. Though not a deal-breaker per se, it would be a little strange in a

phonetically grounded model for a sound to be both {+ dorsal} and {- lingual}: there is no

qualitative distinction between a tongue body closure used for a pulmonic dorsal sound on

the one hand and a tongue body closure for a lingual egressive, lingual ingressive, or

percussive airstream sound on the other—in either case, the tongue body’s responsibility is to

stop airflow between the oral cavity and the pharynx. There would also need to be a

mechanism for preventing boxeme inputs that are simultaneously {+ lingual, + dorsal}

because the tongue body can’t manipulate air pressure behind itself. The gestural approach

has none of these issues: both an Inward K Snare and a lingual airstream sound simply

use a tongue body constriction degree gesture.
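The stipulations such a featural account would need can be made explicit. In the sketch below (the feature names follow the hypothetical set above; the filter function is my own illustration, not a proposal from the literature), the *{+ lingual, + dorsal} restriction has to be stated as an arbitrary co-occurrence filter:

```python
# Hypothetical binary features for boxemes, following the feature set
# sketched above. The co-occurrence restriction *{+lingual, +dorsal} has
# to be stipulated outright: a lingual-airstream sound cannot have the
# tongue body as its primary constrictor, since the tongue body cannot
# manipulate air pressure behind itself.

def well_formed(boxeme):
    if boxeme.get("lingual") and boxeme.get("dorsal"):
        return False  # stipulated filter, not derivable from the features
    return True

inward_k_snare = {"pulmonic": True, "lingual": False, "dorsal": True}
impossible = {"pulmonic": False, "lingual": True, "dorsal": True}

ok = well_formed(inward_k_snare)  # True
bad = well_formed(impossible)     # False
```

Note that the filter is pure stipulation: nothing in the feature system itself explains why the combination is impossible, whereas the gestural account never generates it in the first place.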

To the main point, there is the question of whether most featural accounts of

linguistic harmony have any justification for extending to beatboxing harmony. We have

seen already that gestures are defined by both their domain-specific task and the

domain-general system for producing constriction actions in the vocal tract; by the

hypothesis laid out in Chapter 4: Theory, the domain-general capacity of the graph level to

implement linguistic harmony predicts that gesture-ish beatboxing units should come with

the same ability. Beatboxing harmony is thus predicted from linguistic harmony in a gestural

framework. But computational features are traditionally defined domain-specifically: the

features are concerned exclusively with their encoding of linguistic contrast and linguistic

patterns, and are historically removed from phonetics and the physical world by design

(though they have become more and more phonetically-grounded over time). The grammars

that operate over those features are intended to operate exclusively over linguistic inputs and

outputs. Phonological features and grammar could be adapted to beatboxing, but every part of

their nature suggests that they should not be.

5. Conclusion

Phonological harmony is not unique to speech: common beat patterns in beatboxing like the

humming while beatboxing pattern have the signature properties of phonological harmony

including triggers, undergoers, and blockers. This suggests that phonology (or at least

phonological harmony) is not a special part of language but rather a specialization of a

domain-general capacity for harmonious patterns. The existence of beatboxing harmony

provides evidence for sub-segmental cognitive units in beatboxing. The articulatory

manifestation of beatboxing harmony is amenable to an analysis based on gestures. The

notion that speech and beatboxing phonology are each specializations of a domain-general

harmony ability is expressed this way because gestures are essentially domain-general action

units specialized for a particular behavior.

CHAPTER 7: BEATRHYMING

Beatrhyming is a type of multi-vocalism performed by simultaneous production of

beatboxing and speech (i.e., singing or rapping) by a single individual. This case study of a

beatrhyming performance demonstrates how the tasks of beatboxing and speech interact to

create a piece of art. Aside from being marvelous in its own right, beatrhyming offers new

insights that challenge the fundamentals of phonological theories built to describe talking

alone.

1. Introduction

1.1 Beatrhyming

One of many questions in contemporary research in phonology is how the task of speech

interacts with other concurrent motor tasks. Co-speech manual gestures (Krivokapić, 2014;

Danner et al., 2019), co-speech ticcing from speakers with vocal Tourette’s disorder (Llorens,

in progress), and musical performance (Hayes & Kaun, 1996; Rialland, 2005; Schellenberg,

2013; Schellenberg & Gick, 2020; McPherson & Ryan, 2018) are just a few examples of

behaviors which may not be under the purview of speech in the strictest traditional sense

but which all collaborate with speech to yield differently organized speech performance

modalities. Studying these and other multi-task behaviors illuminates the flexibility of

speech units and their organization in a way that studying talking alone cannot.

This chapter introduces beatrhyming, a type of speech that has not previously been

investigated from a linguistic perspective (see Blaylock & Phoolsombat, 2019 for the first

presentation of this work, and also Fukuda, Kimura, Blaylock, and Lee, 2021). Beatrhyming is

a type of multi-vocalism performed by simultaneous production of beatboxing and speech

(i.e., singing or rapping) by a single individual. Notable beatrhyming performers include Kid

Lucky, Rahzel, and Kaila Mullady, though more and more beatboxers are taking up

beatrhyming as well. In terms of tasks, beatrhyming is an overarching task for artistic

communication that is composed of a beatboxing task and a speech task. The question at

hand is: how do the speech and beatboxing tasks interact in beatrhyming?

A beatrhyming performance contains words and beatboxing sounds interspersed with

each other. Artists use beatboxing sounds differently in their beatrhyming. For

example, Rahzel’s beatrhyme performance “If Your Mother Only Knew” (an adaptation of

Aaliyah’s “If Your Girl Only Knew”) uses mostly Kick Drums ([Link]), whereas Kaila Mullady (whose

beatrhyming is analyzed in this chapter) more often uses a variety of beatboxing sounds in

her beatrhyming.

Words and beatboxing sounds may interact in different ways in beatrhyming. In

some cases, words and beatboxing sounds are produced sequentially. Taking the word “got”

[gat] as an example, a sequence of beatboxing and speech sounds would be transcribed as

{B}[gat] (a Kick Drum, followed by the word “got”). In other cases, words and beatboxing

sounds may overlap, as in {K}[at] (with a Rimshot completely replacing the intended [g] in

“got”) or [ga]{^K}[at] (with a K Snare interrupting the [a] vowel of “got”).
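The bracketing convention used here is regular enough to tokenize mechanically. The small sketch below (the tokenizer is my own illustration, not part of SBN or the chapter’s method) separates beatboxing spans from speech spans:

```python
import re

# Tokenize a beatrhyming transcription: {...} spans are beatboxing sounds
# in SBN, [...] spans are speech in IPA, following the chapter's notation.
TOKEN = re.compile(r"\{(?P<beat>[^}]+)\}|\[(?P<speech>[^\]]+)\]")

def tokenize(transcription):
    return [("beat", m.group("beat")) if m.group("beat") is not None
            else ("speech", m.group("speech"))
            for m in TOKEN.finditer(transcription)]

tokenize("{B}[gat]")      # [("beat", "B"), ("speech", "gat")]
tokenize("[ga]{^K}[at]")  # [("speech", "ga"), ("beat", "^K"), ("speech", "at")]
```

A sequence like {B}[gat] thus yields one beatboxing token followed by one speech token, while [ga]{^K}[at] yields a speech span interrupted by a beatboxing sound.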

Complete replacement is illustrated by two examples in the acoustic segmentation of

the word “dopamine” /dopəmin/ in Figure 112: the Closed Hi-Hat {t} replaces the intended

speech sound /d/ and Kick Drum {B} replaces the intended speech sound /p/. In both cases,

the /d/ and /p/ were segmented on the phoneme (“phones”) tier with the same temporal

interval as the replacing beatboxing sound (on the “beatphones” tier). The screenshot also

features one example of partial overlap, a K Snare {^K} that begins in the middle of an [i]

(annotated "iy").

For reference, Table 33 below lists the five main beatboxing sounds that will be

referred to in this chapter. Each beatboxing sound is presented with both Standard Beatbox

Notation (SBN) (TyTe & Splinter, 2019) in curly brackets and IPA in square brackets. (The

IPA notation for the Inward K Snare uses the downward arrow [↓] from the extIPA symbols

for disordered speech to indicate pulmonic ingressive airflow, and should not be confused

with the similar arrow in IPA that indicates downstep.)

Sections 1.2-1.3 present hypotheses and predictions about how beatboxing and

speech may (or may not) cooperate to support the achievement of their respective tasks.

Section 2 presents the method used for analysis and section 3 describes the results. Finally,

section 4 suggests that more studies of musical speech and other understudied linguistic

behaviors can offer new insights that challenge phonological theories based solely on talking.

Figure 112. Waveform, spectrogram, and text grid of the beatrhymed word “dopamine”.

Table 33. Sounds of beatboxing used in this chapter.

Name              SBN   IPA      Description
Kick Drum         {B}   [p’]     Voiceless ejective bilabial stop
PF Snare          {PF}  [p͡f’]    Voiceless ejective labio-dental affricate
Closed Hi-Hat     {t}   [t’]     Voiceless ejective alveolar stop
Rimshot           {K}   [k’]     Voiceless ejective velar stop
(Inward) K Snare  {^K}  [k͡ʟ̝̊↓]   Voiceless pulmonic ingressive lateral velar affricate

1.2 Hypotheses and predictions

1.2.1 Constrictor-matching

Depending on the nature of the replacements, cases like the complete replacement of /d/ and

/p/ in the word “dopamine” from Figure 112 could be detrimental to the tasks of speech

production. In the production of the word "got" [gat], the [g] is intended to be performed as

a dorsal stop. If the [g] were replaced by a beatboxing dorsal stop, perhaps a velar ejective

Rimshot {K’}, at least part of the speech task could be achieved while simultaneously

beatboxing. On the other hand, replacing the intended [g] with a labial Kick Drum {B}

would deviate farther from the intended speech tasks for [g]. If the difference were great

enough, making replacements that do not support the intended speech goals might lead to

listeners misperceiving beatrhyming lyrics—in this case, perhaps hearing “bot” [bat] instead of

“got”.

So then, if the speech task and the beatboxing task can influence each other during

beatrhyming, the speech task may prefer that beatrhyming replacements match the intended

speech signal as often as possible and along as many phonetic dimensions as possible. This

chapter investigates whether replacements support the speech task by making replacements

that match intended speech sounds in constrictor type (i.e., the lips, the tongue tip, the

tongue body) and constriction degree (approximated by manner of articulation). Lederer

(2005) offers the similar hypothesis that beatboxing sounds collectively sound as

un-speechlike as possible to differentiate beatboxing from speech—except during

simultaneous beatboxing and singing when perception might be maximized if beatboxing

sounds have the same place of articulation as the speech sounds they replace.

To summarize: the main hypothesis is that speech and beatboxing interact with each

other in beatrhyming in a way that supports the accomplishment of intended speech tasks.

This predicts that beatboxing sounds and the intended speech sounds they replace are likely

to match in constrictor and constriction degree. Conversely, the null hypothesis is that the

two systems do not interact in a way that supports the intended speech tasks, predicting that

beatboxing sounds replace speech sounds with no regard for the intended constrictor or

constriction degree.

The predictions of these hypotheses for constrictor matching are depicted in Figures 113

and 114. Imagine a beatrhyming performance in which 90 intended speech sounds—30

intended speech labials, 30 intended speech coronals, and 30 intended speech dorsals—are

replaced by beatboxing sounds. The replacing beatboxing sounds come from a similar

distribution: 30 beatboxing labials, 30 beatboxing coronals, and 30 beatboxing dorsals. If

replacements are made with no regard to the constrictor of intended speech sounds

(following from the null hypothesis), constrictor matches should occur at chance. Each

replacement would have a 1 in 3 chance of having a constrictor match, resulting in 10

constrictor matches and 20 constrictor mismatches per intended constrictor as depicted in

Figure 113. But if replacements are sensitive to the intended constrictor (following from the

main hypothesis), then most beatboxing sounds should match the constrictor of the

intended speech sound they replace (Figure 114).
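The chance baseline in this thought experiment can be computed directly as a contingency table of independent marginals. The sketch below uses the hypothetical 30/30/30 counts from the scenario above:

```python
# Expected constrictor-match counts under the null hypothesis: the
# replacing beatboxing constrictor is independent of the intended speech
# constrictor. Counts follow the hypothetical 30/30/30 scenario above.
intended = {"labial": 30, "coronal": 30, "dorsal": 30}
replacing = {"labial": 30, "coronal": 30, "dorsal": 30}
total = sum(intended.values())  # 90 replaced sounds

expected = {(i, r): intended[i] * replacing[r] / total
            for i in intended for r in replacing}

matches = sum(n for (i, r), n in expected.items() if i == r)
mismatches = sum(n for (i, r), n in expected.items() if i != r)
# Every cell is 10.0, so each intended constrictor expects 10 matches and
# 20 mismatches by chance (matches == 30.0, mismatches == 60.0 overall).
```

Observed counts that depart strongly from this uniform 10-per-cell pattern, concentrating on the diagonal, would support the main hypothesis of task interaction.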

Figure 113. Bar plot of the expected counts of constrictor matching with no task interaction.

Figure 114. Bar plot of the expected counts of constrictor matching with task interaction.

Consider also the predicted distributions for any single beatboxing constrictor (Figure 115).

For example, if 30 dorsal beatboxing replacements (i.e., K Snares) are made with no regard to

intended speech constrictor (following from the null hypothesis), then 10 of those

replacements should mismatch to intended speech labials, 10 should mismatch to intended

speech coronals, and 10 should match to intended speech dorsals. But if replacements are

sensitive to intended constrictor (following from the main hypothesis), then all 30

beatboxing dorsals are expected to replace intended speech dorsals (Figure 116).

The prediction of constriction degree matching follows a similar line of thinking. If

beatrhyming replacements are made with an aim of satisfying speech tasks, then

replacements are more likely to occur between speech sounds and beatboxing sounds that have

similar constriction degrees. Since beatboxing sounds are stops and trills (see

Beatboxing sound frequencies), and since “Dopamine” is performed in a variety of

English that has no phonological trills, the prediction of the main hypothesis is that speech

stops will be replaced more frequently than speech sounds of other manners of articulation.

On the other hand, the null hypothesis would be supported by finding that beatboxing

sounds replace all manners of speech sounds equally.

Figure 115. Bar plots of the expected counts of K Snare constrictor matching with no task
interaction.

Figure 116. Bar plots of the expected counts of K Snare constrictor matching with task
interaction.

1.2.2 Beat pattern repetition

As established in earlier chapters, beatboxing beat patterns have their own predictable sound

organization within a beat pattern. The presence of a snare drum sound on the back beat

(beat 3 of each measure) of a beat pattern in particular is highly consistent, but beat patterns

are also often composed of regular repetition at larger time scales. Speech utterances are

highly structured as well, but the sequence of words (and therefore sounds composing those

words) is determined less by sound patterns and more by syntax (cf. Shih & Zuraw, 2017).

However, artistic speech (i.e., poetry, singing) is sometimes composed alliteratively or with

other specific sound patterns in mind, leveraging the flexibility of language to express similar

ideas with a variety of utterances.

There are (at least) two ways beatboxing and speech could interact while maximizing

constrictor matching as hypothesized in section 1.2.1. First, the words of the song could be

planned without any regard for the resulting beat pattern. Any co-speech beatboxing sounds

would be planned based on the words of the song, prioritizing faithfulness to the intended

spoken utterance. Alternatively, the lyrics could be planned around a beatboxing beat

pattern, prioritizing the performance of an aesthetically appealing beat pattern. The counts

of constrictor matches described in section 1.2.1 could look the same either way, but the two

hypotheses predict that the resulting beat patterns will be structured differently. Specifically,

prioritizing the beatboxing beat pattern predicts that beatrhyming will feature highly

regular/repetitive beatboxing sound sequences characteristic of beatboxing music, whereas

prioritizing the speech structure would lead to irregular/non-repeating beatboxing sound

sequences. The rest of this section discusses these predictions in more detail.

A sequence of beatboxing sounds often repeats itself after just two measures of

music—that is, a two-measure or “two-bar” phrase (and also in this study, a “line” of music)

might be performed several times. For example, Figure 117 shows a sixteen-bar beatboxed

(non-lyrical) section of “Dopamine”. As a sixteen-bar phrase, it is composed of eight smaller

two-bar phrases. Each two-bar phrase could be distinct from the others, but in fact there are

only two types of two-bar phrases: AB and AC, where A, B, and C each refer to a sequence of

sounds in a single measure of music. The two-bar phrase AB occurs six times in the beat

pattern on lines 1, 2, 3, 5, 6, and 7. Lines 4 and 8 of the beat pattern feature the two-bar

phrase AC.

The depiction of the sixteen-bar phrase in Figure 117 appears sequential, but is in fact

hierarchical: pairs of two-bar phrases compose four-bar phrases, pairs of four-bar phrases

compose eight-bar phrases, and a pair of eight-bar phrases composes the entire sixteen-bar

phrase. In fact, one way to model the creation of this structure is to merge progressively

larger repeating units. That is, given an initial two-bar phrase, a four-bar phrase can be

created by assembling two instances of that two-bar phrase into a larger unit. Likewise, an

eight-bar phrase can be thought of as copy-and-merge of a four-bar phrase with itself.

There is room for variation here, and lines may change based on the artist’s musical

choices. In Figure 117, the end of the first eight-bar phrase deviates from the rest of the

pattern, possibly to musically signal the end of the phrase. In this case, the whole eight-bar

phrase is then copied to create a sixteen-bar phrase, resulting in repetition of that deviation

at the end of both eight-bar phrases.
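The copy-and-merge process just described can be sketched in a few lines of Python (a minimal illustration using the measure labels of Figure 117, with variant marks like A* collapsed into their plain labels):

```python
# Build a 16-bar phrase by repeatedly copying and merging smaller phrases.
# Measure labels follow Figure 117 (variant marks like A* collapsed to A).

def copy_and_merge(phrase):
    """Return a phrase twice as long, made of two copies of the input."""
    return phrase + phrase

two_bar = ["A", "B"]                     # initial two-bar phrase (one line)
four_bar = copy_and_merge(two_bar)       # A B A B
eight_bar = copy_and_merge(four_bar)     # A B A B A B A B
eight_bar[-1] = "C"                      # deviation signaling the phrase end
sixteen_bar = copy_and_merge(eight_bar)  # deviation recurs at the end of both halves
```

Because the eight-bar phrase is copied after the deviation is introduced, the C measure closes both eight-bar phrases, as in Figure 117.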

This hierarchical composition can be used to predict where repeating two-bar phrases

are most likely to be found in a sixteen-bar beat pattern. The initial repetition of a two-bar

phrase to make a four-bar phrase predicts that lines 1 & 2 should be similar (where each line

is a two-bar phrase). Likewise, repetition of that four-bar phrase to make an eight-bar phrase

would predict repetition between lines 3 & 4; at a larger time scale, this would also predict

that lines 1 & 3 should be similar to each other, as should lines 2 & 4. In the sixteen-bar

phrase composed of two repeating eight-bar phrases, the repetition relationships from the

previous eight-bar phrase would be copied over (line pairs 5 & 6, 7 & 8, 5 & 7, and 6 & 8);

repetition would also be expected between corresponding lines of these two eight-bar

phrases, predicting similarity between lines 1 & 5, 2 & 6, 3 & 7, and 4 & 8.

Figure 117. Serial and hierarchical representations of a 16-bar phrase (8 lines with 2 measures
each).
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t

16-bar Phrase
  8-bar Phrase:
    4-bar Phrase:  Line 1 = A B    Line 2 = A B
    4-bar Phrase:  Line 3 = A B    Line 4 = A C*
  8-bar Phrase:
    4-bar Phrase:  Line 5 = A* B   Line 6 = A B
    4-bar Phrase:  Line 7 = A B    Line 8 = A C

Because deviations from the initial two-bar pattern are expected to occur in the interest of

musical expression, some pairs of two-bar phrases are more likely to exhibit clear repetition

than others. Consider a four-bar phrase composed of two two-bar phrases AB and AC—their

first measures (A) are identical, but their second measures (B and C) are different. If this

four-bar phrase is repeated to make an eight-bar phrase, the result would be AB-AC-AB-AC.

In this example, lines 1 & 3 match as do lines 2 & 4, but lines 2 & 3 and 1 & 4 do not. For this

study, the discussion of repetition in beatrhyming is limited to just those pairs of lines

described earlier which are most likely to feature repetition (“cross-group” refers to

corresponding lines in two different eight-bar phrases):

● Adjacent two-bar phrases—lines 1 & 2, 3 & 4, 5 & 6, and 7 & 8

● Alternating two-bar phrases—lines 1 & 3, 2 & 4, 5 & 7, and 6 & 8

● Cross-group two-bar phrases—lines 1 & 5, 2 & 6, 3 & 7, and 4 & 8
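These three pair types fall out mechanically from the line numbering; a small sketch (an illustrative helper, not part of the study's analysis code) generates them:

```python
# Line pairs predicted to repeat, for lines numbered 1-8 (two 8-bar phrases).
adjacent    = [(i, i + 1) for i in (1, 3, 5, 7)]  # sisters within a 4-bar phrase
alternating = [(i, i + 2) for i in (1, 2, 5, 6)]  # corresponding lines of sister 4-bar phrases
cross_group = [(i, i + 4) for i in (1, 2, 3, 4)]  # corresponding lines across the 8-bar phrases
```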

If beatboxing structure is prioritized in beatrhyming—either because beatboxing and speech

aren’t sensitive to each other at all or because the speech system accommodates beatboxing

through lyrical choices that result in an ideal beat pattern—then sequences of co-speech

beatboxing sounds should have similarly high repetitiveness compared to beat patterns

performed without speech. But if speech structure is prioritized, then the beat pattern is

predicted to sacrifice repetitiveness in exchange for supporting the speech task by matching

the intended constrictor and constriction degree of any speech segments being replaced.

1.2.3 Summary of hypotheses and predictions

The main hypothesis is that speech and beatboxing interact during beatrhyming to

accomplish their respective tasks, and the null hypothesis is that they do not. Support for the

first hypothesis could appear in two different forms, or possibly both at the same time. First,

if beatrhyming replacements are sensitive to the articulatory goals of the intended speech

sound being replaced, then the beatboxing sounds that replace speech sounds are likely to

match their targets in constrictor and constriction degree. Second, if beatboxing sequencing

patterns are prioritized in beatrhyming, then sequences of beatrhyming sound replacements

should exhibit the same structural repetitiveness as non-lyrical beatboxing sequences. Failing

to support either of these predictions would support the null hypothesis and the notion that

speech and beatboxing have no cognitive relationship during beatrhyming.

Note that different beatrhyming performances may feature different relationships

between speech and beatboxing depending on the artist’s musical aims. The results of this

study should be taken as an account of one way that beatrhyming has been performed, but

not necessarily the only way to beatrhyme.

2. Method

This section describes how the data were collected and coded (section 2.1) and analyzed

(2.2).

2.1 Data

The data in this study come from a beatrhyming performance called "Dopamine", created

and performed by Kaila Mullady and made publicly available on YouTube (Mullady, 2017).

The composition of "Dopamine" includes sections of beatboxing in isolation and beatboxing

concurrently with speech. The lyrics of "Dopamine" were provided by Mullady over email.

An undergraduate in the Linguistics program at USC performed manual acoustic

segmentation of the beatrhymed portions of “Dopamine” using Praat (Boersma, 2001).

Segmentation was performed at the level of words, phonemes (“phones”), beatboxing sounds

(“beatphones”), and the musical beat (“beats”) on which beatboxing sounds were performed.

For complete sound replacements, the start and end of the annotation for the interval of the

intended speech phone were the same as the start and end of the beatboxing beatphone

interval.

Five beatboxing sounds were used in the beatrhymed sections of "Dopamine": Kick

Drum {B}, Closed Hi-Hat {t}, PF Snare {PF}, Rimshot {K’}, and K Snare {^K}. (It was not clear

from the acoustic signal whether the K Snares were Inward or Outward; a choice was made

to annotate them consistently as Inward {^K}. The choice of Inward or Outward does not

affect the outcome of this study which addresses only constrictor—which Inward and

Outward K Snares share). Each beatboxing sound was coded by its major constrictor: {B} and

{PF} were coded as “labial”, {t} was coded as “coronal” (tongue tip), and {K} and {^K} were

coded as “dorsal” (tongue body). Finally, the metrical position of each replacement was

annotated with points on a PointTier aligned to the beginning of beatboxing sound intervals.

2.2 Analysis

2.2.1 Constrictor-matching analysis

The mPraat software (Bořil & Skarnitzl, 2016) for MATLAB was used to count the number

of complete one-to-one replacements (excluding partial replacements or cases where one

beatboxing sound replaced two speech sounds) (n = 88). The constrictor of the originally

intended speech sound was then compared against the constrictor for the replacing

beatboxing sound, noting whether the constrictors were the same (matching) or different

(mismatching).

Constriction degree matching was likewise measured by counting how many speech

sounds of different constriction degrees were replaced—or in this case, different manners of

articulation. All the beatboxing sounds that made replacements were stops {B} or affricates

{PF, t, K’, (^)K}; higher propensity for constriction degree matching would be found if the

speech sounds being replaced were more likely to also be stops and affricates instead of

nasals, fricatives, or approximants.
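The constrictor coding and match counting described above can be sketched as follows (the coding dictionary restates section 2.1; the two replacement pairs at the end are hypothetical examples, not the study's data):

```python
# Constrictor coding from section 2.1: each beatphone maps to its major constrictor.
CONSTRICTOR = {
    "B": "labial",    # Kick Drum
    "PF": "labial",   # PF Snare (labiodental, grouped with the labials)
    "t": "coronal",   # Closed Hi-Hat (tongue tip)
    "K'": "dorsal",   # Rimshot (tongue body)
    "^K": "dorsal",   # K Snare (tongue body)
}

def count_matches(replacements):
    """Count matching vs. mismatching replacements.

    replacements: iterable of (intended_constrictor, beatphone) pairs,
    one pair per complete one-to-one replacement.
    """
    pairs = list(replacements)
    matches = sum(1 for intended, bp in pairs if CONSTRICTOR[bp] == intended)
    return matches, len(pairs) - matches

# Hypothetical fragment: a /b/ replaced by a Kick Drum (match) and an /n/
# replaced by a K Snare on the back beat (mismatch).
matched, mismatched = count_matches([("labial", "B"), ("coronal", "^K")])
```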

2.2.2 Repetition analysis

Four sixteen-bar sections labeled B, C, D, and E were chosen for repetition analysis.

(“Dopamine” begins with a refrain, section A, that was not analyzed because it has repeated

lyrics that were expected to inflate the repetition measurements. The intent of the ratios is to

assess whether beat patterns in beatrhyming are as repetitive as beat patterns without lyrics,

not how many times the same lyrical phrase was repeated.) Sections B and D were

non-lyrical beat patterns (no words) between the refrain and the first verse and between the

first and second verses, respectively. Sections C and E were the beatrhymed (beatboxing with

words) first and second verses, respectively. The second verse was 24 measures long, but was

truncated to 16 measures for the analysis.

Repetitiveness was assessed using two different metrics. The first metric counted how

many unique measure-long sequences of beatboxing sounds were performed as part of a

section of music. The more unique measures are found, the less repetition there is. Rhythmic

variations within a measure were ignored for this metric to accommodate artistic flexibility

in timing. For example, Figure 118 contains two two-bar phrases; of those four measures, this

metric would count three of them as unique: {B t t ^K}, {th PF ^K B}, and {B ^K B}. The first

measures of each two-bar phrase would be counted as the same because the sequence of

sounds in the measure is the same despite use of triplet timing on the lower line (using beats

1.67 and 2.33 instead of beats 1.5 and 2). This uniqueness metric provides an integer value

representing how much repetition there is over a sixteen-bar section; if beatrhyming beat

patterns resemble non-lyrical beatboxing patterns, each section’s uniqueness metric should

be approximately the same.
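In code, this uniqueness metric amounts to collecting each measure's sound sequence (with beat timings stripped) into a set; the measures below restate the Figure 118 example:

```python
# Uniqueness metric: number of distinct measure-long sound sequences,
# ignoring rhythm. The data restate the four measures of Figure 118.
measures = [
    ("B", "t", "t", "^K"),    # line 1, measure 1
    ("th", "PF", "^K", "B"),  # line 1, measure 2
    ("B", "t", "t", "^K"),    # line 2, measure 1 (triplet timing ignored)
    ("B", "^K", "B"),         # line 2, measure 2
]
unique_count = len(set(measures))  # 3 unique measures
```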

The second metric is a proportion called the repetition ratio. For a given pair of

two-bar phrases, the number of beats that had matching beatboxing sounds was divided by

the number of beats that hosted a beatboxing sound across both two-bar phrases. This

provides the proportion of beats in the two phrases that were the same, normalized by the

number of beats that could have been the same, excluding beats for which neither two-bar

phrase had a beatboxing sound.

For example, the two two-bar phrases in Figure 118 are the same for 4/10 beats,

resulting in a repetition ratio of 0.4. In measure 1 the sounds of beats 1 and 3 match, but the

second two sounds of the first phrase are on beats 1.5 and 2 whereas the second two sounds

of the second phrase are performed with triplet timing on beats 1.67 and 2.33. Therefore in

the first measure, six beats have a beatboxing sound in either phrase—beats 1, 1.5, 1.67, 2, 2.33,

and 3—but only two of those beats have matching sounds. In the second measure, four beats

have a beatboxing sound in either phrase—beats 1, 2, 3, and 4. While two of those beats have

the same beatboxing sound in both phrases, beat 1 only has a sound in the first phrase and

beat 2 has a PF Snare in the first phrase but a Kick Drum in the second phrase. Looking at

the phrases overall, ten beats carry a beatboxing sound in either phrase but only four beats

have the same sound repeated in both phrases for a repetition ratio of 0.4.
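This worked example can be checked with a short sketch in which each two-bar phrase is a mapping from (measure, beat) to the sound performed there (an illustrative representation, not the transcription format used in the study):

```python
# Repetition ratio: beats with the same sound in both phrases, divided by
# beats carrying a sound in either phrase. The data restate Figure 118.
def repetition_ratio(p1, p2):
    sounded = set(p1) | set(p2)  # beats with a sound in either phrase
    matches = sum(1 for b in sounded if b in p1 and b in p2 and p1[b] == p2[b])
    return matches / len(sounded)

line1 = {(1, 1): "B", (1, 1.5): "t", (1, 2): "t", (1, 3): "^K",
         (2, 1): "th", (2, 2): "PF", (2, 3): "^K", (2, 4): "B"}
line2 = {(1, 1): "B", (1, 1.67): "t", (1, 2.33): "t", (1, 3): "^K",
         (2, 2): "B", (2, 3): "^K", (2, 4): "B"}

ratio = repetition_ratio(line1, line2)  # 4 matching beats / 10 sounded beats = 0.4
```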

This calculation penalizes cases like the first half of the example in Figure 118 in

which the patterns are identical except for a slightly different rhythm. The high sensitivity to

rhythm of this repetition ratio measurement was selected to complement the rhythmic

insensitivity of the previous technique for counting how many unique measures were in a

beat pattern. In practice, this penalty happened to only lower the repetition ratio for phrases

that were beatboxed without lyrics (co-speech beat patterns rarely had patterns with the

same sounds but different rhythms, so there were few opportunities to be penalized in this

way); despite this, the repetition ratios for beatrhymed patterns were still lower than the

repetition ratios for beatboxed patterns in the same song (see section 3.3 for more details).

Figure 118. Example of a two-line beat pattern. Both lines have a sound on beats 1 and 3 of
the first measure and beats 2, 3, and 4 of the second measure. (Beat positions are given in
parentheses; "|" separates the two measures.)
Line 1: B(1) t(1.5) t(2) ^K(3) | th(1) PF(2) ^K(3) B(4)
Line 2: B(1) t(1.67) t(2.33) ^K(3) | B(2) ^K(3) B(4)

Within each section, the repetition ratio was calculated for three types of two-bar phrase

pairs: adjacent pairs (phrases 1 & 2, 3 & 4, 5 & 6, and 7 & 8), alternating pairs (phrases 1 & 3,

2 & 4, 5 & 7, and 6 & 8), and cross-group pairs (phrases 1 & 5, 2 & 6, 3 & 7, and 4 & 8).

Additionally, repetition ratio was calculated between sections B & D and between sections C

& E to see if musically related sections used the same beat pattern. Repetition ratios

measured for the beatboxed and beatrhymed sections were then compared pairwise to assess

whether the beatrhymed sections were as repetitive as the beatboxed sections.

A transcription of the beatboxing sounds of “Dopamine” was used for both

measurement techniques. This transcription excluded phonation and trill sounds during the

beatboxing patterns because they extend over multiple beats and would inflate the number

of beats counted in the calculation of the repetition ratio. (The excluded beatboxing sounds

were repeated as consistently as the other sounds in the beatboxing section.)

3. Results

Section 3.1 measures the extent to which the beatrhyming replacements were

constrictor-matched and section 3.2 does likewise for manner of articulation; both assess

whether the selection of beatboxing sounds accommodates the speech task. Section 3.3

quantifies the degree of repetition during beatrhyming to determine whether the selection of

lyrics accommodated the beatboxing task.

3.1 Constrictor-matching

Section 3.1.1 shows that replacements are constrictor-matched overall. Section 3.1.2 considers

replacements in two groups, showing that there is a high degree of constrictor matching off

the back beat but little constrictor matching on the back beat. Section 3.1.3 offers possible

explanations for the few exceptional replacements that were off the back beat and not

constrictor-matched.

3.1.1 All replacements

Figure 119 shows the number of times an intended speech sound was replaced by a

beatboxing sound of the same constrictor (blue bars, the left of each pair) or by a beatboxing

sound of a different constrictor (orange bars, the right of each pair) for every complete

replacement in “Dopamine.”

The intended speech dorsals were predominantly replaced by beatboxing dorsals,

appearing to support the hypothesis that speech and beatboxing interact in beatrhyming. But

while the majorities of intended labials and intended coronals were also replaced by

beatboxing sounds with matching labial or coronal constrictors, there was still a fairly large

number of mismatches for each (10/28 mismatches for labials, 10/31 mismatches for

coronals). This degree of mismatching is still lower than the chance level predicted by a lack of

interaction between beatboxing and speech—the expectation at chance was 20 mismatches

per constrictor, not 10.

Figure 119. Bar plot showing measured totals of constrictor matches and mismatches.

Table 34 shows the contingency table of replacements by constrictor. Highlighted cells along

the upper-left-to-bottom-right diagonal represent constrictor matches; all other cells are

constrictor mismatches. Reading across each row reveals how many times an intended

speech constriction was replaced by each beatboxing constrictor. For example, intended

speech labials were replaced by beatboxing labials 18 times, by beatboxing coronals 0 times,

and by beatboxing dorsals 10 times. A chi-squared test over this table rejects the null

hypothesis that beatboxing sounds replace intended speech sounds at random (χ² = 79.15,

df = 4, p < 0.0001).
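The reported statistic can be reproduced from the contingency table with a pure-Python Pearson chi-squared computation (shown here only as a check on the reported value; the software used for the original test is not specified):

```python
# Pearson chi-squared statistic for the 3x3 constrictor contingency table
# (Table 34), computed without external libraries.
table = [
    [18, 0, 10],  # intended labial
    [2, 21, 8],   # intended coronal
    [2, 0, 27],   # intended dorsal
]
n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

chi2 = sum(
    (table[i][j] - expected) ** 2 / expected
    for i in range(3)
    for j in range(3)
    for expected in [row_totals[i] * col_totals[j] / n]
)
df = (len(table) - 1) * (len(table[0]) - 1)  # df = 4; chi2 comes to about 79.15
```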

Table 34. Contingency table of beatboxing sound constrictors (top) and the speech sounds
they replace (left).

                          Replacing beatboxing sound
Intended speech sound     Labial    Coronal    Dorsal    Total
Labial                        18          0        10       28
Coronal                        2         21         8       31
Dorsal                         2          0        27       29
Total                         22         21        45       88

3.1.2 Replacements on and off the back beat

All 10 labial mismatches and 8/10 coronal mismatches were made by a dorsal beatboxing

sound replacement. Each of those mismatches also happens to occur on beat 3 of the meter,

and the replacing beatboxing sound is always a K Snare {^K}. In beatboxing, beat 3

corresponds to the back beat and almost always features a snare. This conspiracy of so many

dorsal replacements being made on the back beat suggests that it would be more informative

to split the analysis into two pieces.

A distinction can be made between replacements that occurred on beat 3 (n = 30) and

replacements made on any other beat or subdivision (n = 58). Figure 120 shows the counts of

matching and mismatching replacements excluding the back beat. With the inviolable back

beat K Snare out of the picture, 54 of 58 replacements have matching constrictor. This

distribution more closely matches the main hypothesis. Looking at just the replacements

made on the back beat (n = 30), however, yields a picture that appears to support the null hypothesis. Beatboxing

sounds on the back beat in "Dopamine" are restricted to the dorsal constrictor for the K

Snare {^K}. The replacements are fairly evenly distributed across all intended speech

constrictors, resembling the idealized prediction of no interaction between beatboxing

constrictions and intended speech constrictors (Figure 121). Taking this result with the

previous, this provides evidence for a trading relationship: the speech task is achieved during

replacements under most circumstances, but not on the back beat.

One smaller finding obfuscated by the coarse constrictor types is that speech labials

and labiodentals tended to be constrictor-matched to the labial Kick Drum {B} and

labiodental PF Snare {PF}, respectively. PF Snares only ever replaced /f/s, and 4 out of 6

replaced /f/s were replaced by PF Snares. (The other two were on the back beat, and so

replaced by K Snares.) There were two /v/s off the back beat, both of which were in the same

metrical position and in the word "of", and both of which were replaced by Kick Drums.

Labiodentals were grouped with the rest of the labials to create a simpler picture of

constrictor matching and because the number of labiodental intended speech sounds was

fairly small. However, for future beatrhyming analysis, it may be useful to separate bilabial

and labiodental articulations into separate groups rather than covering both with "labial".

Figure 120. Bar plots with counts of the actual matching and mismatching constrictor
replacements everywhere except the back beat.

Figure 121. Bar plot with counts of the actual matching and mismatching constrictor
replacements on just the back beat.

3.1.3 Examining mismatches more closely

There are four constrictor mismatches not on the back beat: two in which a labial beatboxing

sound replaces an intended speech coronal, and two in which a labial beatboxing sound

replaces an intended speech dorsal.

Both labial-on-coronal cases are of a Kick Drum replacing the word "and", which we

assume (based on the style of the performance) would be pronounced in a reduced form like

[n]. Acoustically, the low frequency burst of a labial Kick Drum {B} is probably a better

match to the nasal murmur of the intended [n] (and thus the manner of the nasal) than the

higher frequency bursts of a Closed Hi-Hat {t}, K Snare {^K}, or Rimshot {K’}. All the other

nasals replaced by beatboxing sounds were on the back beat and therefore replaced by the K

Snare {^K}.

The two cases where a Kick Drum {B} replaced a dorsal sound can both be found in

the first four lines of the second verse (Figure 122). In one case, a {B} replaced the [g] in "got"

on the first beat 1 of line 3 (underlined in Figure 122). The reason may be a general

preference for having a Kick Drum on beat 1. Only 3 replacements were made on beat 1 in

"Dopamine", and all of them featured a Kick Drum {B}. (The overall scarcity of beat 1

replacements is due at least in part to the musical arrangement and style resulting in

relatively few words on beat 1.) The other case also involved a Kick Drum {B} replacing a

dorsal, this time the [k] in the word “come” on the second beat 2 of line 3 (also underlined).

The replacing {B} in this instance was part of a small recurring beatboxing pattern of {B B}

that didn't otherwise overlap with speech—it occurred on beats 1.5 and 2 of the second

measure of lines 1-3 as well as in the first measure of line 4.

Figure 122. Four lines of beatrhyming featuring two replacement mismatches (underlined).
1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
----------------------------------------------------------------------------------------------------------------
{B t t B} {^K}an't you see {B B} {^K}ou are li- {K'}a
{B}mid- night s{^K}um- mer's {t}ream {B B} {^K}on- ly you
{B}o- {t}em {t}weet {^K}e- lo- {t}ies {B B}ome and {^K}lay with me
{B B} {^K}et's see what the {B}sky {t}urns {^K}in- to {B}

In short, tentative explanations are available for the few constrictor mismatches that occur

off the back beat: two mismatches could be because intended nasal murmur likely matches

the low frequency burst of a Kick Drum better than the burst of the other beatboxing sounds

available, and the other two could be due to established musical patterns specific to this

performance.

3.2 Constriction degree (manner of articulation) matching

Section 2.2.1 predicted that if constriction degree is matched, the replaced speech sounds should be predominantly stops and affricates; Figure 123 shows that this is what happens. The sounds that made constrictor-matching

replacements—the Kick Drum {B}, PF Snare {PF}, Closed Hi-Hat {t}, and Rimshot {K’}—

collectively replaced 43 stops but replaced 0 approximants and only 2 nasals and 10 fricatives.

No affricates were replaced at all in the data set. The K Snare {^K} replaced 16 stops but also 7

nasals, 8 fricatives, and 2 approximants. For comparison, Figure 124 breaks down the

replacements by intended speech segment, ordered top to bottom as stops [p b t

d k g], nasals [m n] and [ŋ] (written as “ng”), fricatives [f v s z] and [ð] (written as “dh”),

and approximants [l j].

In a future study, it would be good to check if the non-replaced beatboxable sounds

have a uniform distribution or if stops are disproportionately high frequency across the

board. If many stops were in positions to be replaced by a beatboxing sound but were not

replaced, this finding would carry less weight. As of the time of writing, however, it was not

clear how to define which sounds in this song should be expected to be beatboxed; and as

this is the first major beatrhyming study, there was no precedent to draw from.

Figure 123. Counts of replacements by beatboxing sounds (bottom) against the manner of
articulation of the speech sound they replace (left).

Figure 124. Counts of replacements by beatboxing sounds (bottom) against the speech sound
they replace (left).

3.3 Repetition

3.3.1 Analysis 1: Unique measure count

The number of unique measures of beatboxing sound sequences in a 16-bar phrase indicates

how much overall repetition there is in that phrase. Sections B and D, the two 16-bar phrases

without lyrics (just beatboxing), had a combined total of just 3 unique measure-long

beatboxing sound sequences: the same three sound sequences were used over and over again.

Section C, the first beatrhymed verse, had 16 unique measures (no repeated measures), and

Section E, the second beatrhymed verse, had 13 unique measures (3 measures were repeated

once each). The beatrhymed sections therefore had far less repetition of measures than the

beatboxed sections. The unique sequences in each section are shown in Figure 125.

This is not to say that there was no repetition at all in the beatrhyming. Portions of

some beatboxed measures were repeated as subsets of some beatrhymed measures. The

beatboxed sequence A {B t t ^K}, for example, is also part of the beatrhymed sequences

D {B t t ^K ^K}, L {B t t ^K B}, and N {B t t ^K K’}; similarly, sequence F {B B ^K}

can also be found in sequences G {B B ^K t K’}, O {B B ^K B B}, U {B B ^K K’}, and W {t B B

^K}. But it turns out that even these subsequences are brief non-lyrical chunks within larger

beatrhyming sections, which means that the repetition of sequences here is not related to the

organization of constrictor-matching or -mismatching replacements. The {B t t} portions of

sequences L and N (and partly of D) are not attached to any beatrhymed lyrics, and the

{^K}s are not constrictor-matching. Likewise, the {B B} of F, G, O, and U also have no lyrics

and the {^K}s do not necessarily constrictor-match with the sound of the lyrics they replace.

3.3.2 Analysis 2: Repetition ratio

The complete set of two-bar lines for each of the four analyzed sections and their

corresponding repetition ratios are presented in Figure 126. The repetition ratios of

beatrhyming sections were much lower than the repetition ratios for beatboxing sections.

The repetition ratios for the beatboxed sections B & D are greater than the pairwise

corresponding repetition ratios for the beatrhymed sections C & E in all but one comparison

(31/32 comparisons). The mean repetition ratios calculated for verses C and E were 0.35

and 0.30, respectively, with a mean cross-section repetition ratio of 0.29.

ratios for the beatboxed sections B and D were 0.68 and 0.70, respectively, with a mean

cross-section repetition ratio of 0.96. The low repetition ratios for beatrhymed sections

corroborate the observation from the unique measure count analysis that there is relatively

little repetition among beatboxing sounds during beatrhyming.

Figure 125. Four 16-bar beatboxing (sections B and D) and beatrhyming (sections C and E)
phrases with letter labels for each unique sound sequence. Only three measure types were
used between both beatboxing sections.

Sections B and D (first and second beatboxing sections):
A: {B t t ^K}
B: {th B in B}
C: {B h ^K t t}

Section C (first verse), sixteen unique measures:
D: {B t t ^K ^K}
E: {PF ^K}
F: {B B ^K}
G: {B B ^K t K’}
H: {B t t B ^K K’ B}
I: {B ^K B t}
J: {K’ B ^K t B}
K: {B ^K}
L: {B t t ^K B}
M: {B t ^K K’}
N: {B t t ^K K’}
O: {B B ^K B B}
P: {B ^K K’ t}
Q: {t t ^K K’}
R: {B K’ K’ ^K}
S: {t ^K t}

Section E (second verse), in order of performance (repeated labels mark repeated measures):
T: {B t t B ^K}
U: {B B ^K K’}
V: {B ^K t}
F: {B B ^K}
A: {B t t ^K}
W: {t B B ^K}
F: {B B ^K}
X: {B t ^K B}
Y: {B ^K B}
Z: {B ^K B t PF}
Z: {B ^K B t PF}
K: {B ^K}
Y: {B ^K B}
AA: {PF ^K B}
BB: {t t ^K t}
CC: {B}

Measure sequences by section:
Section B (beatboxed): A B A B A B A C’ A’ B A B A B A C
Section C (beatrhymed): D E F G H I J K L M N O P Q R S
Section D (beatboxed): A B A B A B A C’ A’ B A B A B A C’
Section E (beatrhymed): T U V F A W F X Y Z Z K Y AA BB CC

Figure 126. Beat pattern display and repetition ratio calculations for sections B, C, D, and E.
Section B (first beatboxing section)
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t

Section C (first beatrhymed verse)


Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K ^K PF ^K
Line 2 | B B ^K B B ^K t K'
Line 3 | B t t B ^K K' B B ^K B t
Line 4 | K' B ^K t B B ^K
|
Line 5 | B t t ^K B B t ^K K'
Line 6 | B t t ^K K' B B ^K B B
Line 7 | B ^K K' t t t ^K K'
Line 8 | B K' K' ^K t ^K t

                 Adjacent pairs                Alternating pairs             Cross-group pairs
          1&2    3&4    5&6    7&8      1&3    2&4    5&7    6&8      1&5    2&6    3&7    4&8     mean
B)        8/8    4/10   6/10   4/9      8/8    4/10   6/10   4/9      6/10   8/8    8/8    7/11
          1.00   0.40   0.60   0.44     1.00   0.40   0.60   0.44     0.60   1.00   1.00   0.64    0.68
C)        3/10   5/11   6/12   3/11     5/12   3/11   3/12   3/11     5/10   5/12   3/13   3/10
          0.30   0.45   0.50   0.27     0.42   0.27   0.25   0.27     0.50   0.42   0.23   0.30    0.35

Section D (second beatboxing section)
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t

Section E (second beatrhymed verse)


Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t B ^K B B ^K K'
Line 2 | B ^K t B B ^K
Line 3 | B t t ^K t B B ^K
Line 4 | B B ^K B t ^K B
|
Line 5 | B ^K B B ^K B t PF
Line 6 | B ^K B t PF B ^K
Line 7 | B ^K B PF ^K B
Line 8 | t t ^K t B

                 Adjacent pairs                Alternating pairs             Cross-group pairs
          1&2    3&4    5&6    7&8      1&3    2&4    5&7    6&8      1&5    2&6    3&7    4&8     mean
D)        8/8    4/10   6/10   4/10     8/8    4/10   6/10   4/10     6/10   8/8    8/8    9/9
          1.00   0.40   0.60   0.40     1.00   0.40   0.60   0.40     0.60   1.00   1.00   1.00    0.70
E)        5/10   2/11   3/11   2/7      5/12   2/10   3/10   2/9      4/12   3/9    3/10   2/9
          0.50   0.18   0.27   0.29     0.42   0.20   0.30   0.22     0.33   0.33   0.30   0.22    0.30

Cross-section pairs

          1&1    2&2    3&3    4&4    5&5    6&6    7&7    8&8     mean
B & D     8/8    8/8    8/8    9/9    8/8    8/8    8/8    7/11
          1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.64    0.96
C & E     5/9    5/9    4/15   2/10   2/12   3/12   2/10   1/9
          0.56   0.56   0.27   0.20   0.17   0.25   0.20   0.11    0.29

4. Discussion

The analysis above investigated whether beatboxing and speech do (the main hypothesis) or

do not (the null hypothesis) interact during beatrhyming in a way that supports both speech

and beatboxing tasks being achieved. The results provide evidence for the main hypothesis.

Speech tasks are achieved, in a local sense, in beatrhyming by generally selecting

replacement beatboxing sounds that match the speech segment in vocal tract constrictor and

manner/constriction degree. This presumably serves to help the more global task of

communicating the speech message. But achieving the speech task comes at the cost of

inconsistent beat patterns during lyrical beatrhyming. Theoretically, both the speech task

and the beatboxing repetition task could have been achieved by careful selection of lexical

items whose speech-compatible beatboxing replacement sound would also satisfy repetition,

but this did not happen. Thus, beatboxing sounds are generally selected in such a way as to

optimize speech task achievement, but lexical items are not being selected so as to optimize

beatboxing repetition. That said, the task demands of other aspects of beatboxing do affect

beatboxing sound selection, as seen in the inviolable use of K Snares {^K} on beat 3 of each

measure to establish the fundamental musical rhythm, even at the expense of the dorsal

constriction of the K Snare not matching the constriction of the intended speech sound it

replaces. Thus the tasks do interact such that one or the other task achievement has priority

over the other at different moments in time.

4.1 Task interaction

Beatrhyming is the union of a beatboxing system and a speech system. Each system is

goal-oriented, defined by aesthetic tasks related to the musical genre, communicative tasks,

motor efficiency, and other tasks. These tasks act as forces that shape the organization of the

sounds of speech, beatboxing, and beatrhyming.

Ultimately, a core interest in the study of speech sounds is to understand how forces

like these influence speech. When answering questions of why sounds in a language pattern

a particular way, we turn to explanations of effective message transmission and motor

efficiency almost axiomatically. But until we understand how these tasks manifest under a

wider variety of linguistic behaviors, we will not have a full sense of the speech system’s

flexibility or limitations. To that end, the contribution of this chapter is to show how the goal

of message transmission is active in the linguistic behavior of beatrhyming: it is satisfied during

beatrhyming sound replacements by matching the constrictor of an intended speech sound

and the beatboxing sound replacing it, and dissatisfied when aesthetic beatboxing tasks take

priority on the back beat.

To close, section 4.2 demonstrates one way this musical linguistic behavior can

impact phonological theory by briefly introducing a simple phonological model of

beatrhyming.

4.2 Beatrhyming phonology

The results show that when speech and beatboxing are interwoven in beatrhyming, the

selection of beatboxing sounds to replace a speech sound is generally constrained by the

intended speech task and overrides the constraints of the beatboxing task, except in one

environment (beat 3) in which the opposite is true. Given that the selection of lexical items

does not appear to be sensitive to location in the beatboxing structure, the achievement of

both tasks simultaneously is not possible. The resulting optimization can therefore be

modeled by ranking the speech and beatboxing tasks differently in different environments,

which is exactly what Optimality Theory (Prince & Smolensky, 1993/2004) has been

designed to do.

In Optimality Theory, ranked constraints guide the prediction of a surface output

representation from an underlying input representation. The representations and constraints

used in Optimality Theory are designed specifically to operate in the domain of speech and

phonology, so representations and constraints involving beatboxing sounds are not

appropriate for a typical phonological model. This approach assumes that a grammar

specialized for beatrhyming exists separately from grammars specialized for speech or

beatboxing but draws on the representations from both systems—that is, speech and

beatboxing representations are the same as they would be in a speech or beatboxing

phonology, but the constraints and their rankings are different from any other domain. Based

on this chapter’s interpretation that beatboxing sounds replace speech sounds in

beatrhyming, the grammar takes speech representations as inputs and returns surface forms

composed of both beatboxing and speech representations as output candidates. For the

purposes of this simple illustration, the computations are restricted to the selection of a

single beatboxing sound that replaces a single speech segment. (Presumably there are

higher-ranking constraints that determine which input speech segment representations

should be replaced by a beatboxing sound in the output.)

Because the analysis requires reference to the metrical position of a sound, input

representations are tagged with the associated beat number as a subscript. The input / b3 /,

for example, symbolizes a speech representation for a voiced bilabial stop on the third beat

of a measure. Output candidates are marked with the same beat number as the

corresponding input; the input-output pairs / b3 / ~ { B3 } and / b3 / ~ { ^K3 } are both possible

in the system because they share the same subscript, but the input-output pair / b3 / ~ { B2 } is

never generated as an option because the input and output have different subscripts. We can

use two loosely defined constraints:

*BackbeatWithoutSnare - Assign a violation to outputs on beat three ({X3})

that are not snares.

*PlaceMismatch - Assign a violation to an output whose Place feature does not

match the Place feature of the corresponding input.

(“Place” feature corresponds to the abstract conception of the constrictor: labial, coronal, and

dorsal.) The tableaux in Figures 127 and 128 demonstrate how possible input-output pairs

like the ones just introduced might be selected differently by the grammar depending on the

beat associated with the input sound. *BackbeatWithoutSnare is ranked above

*PlaceMismatch to ensure that beat 3 always has a K Snare. Given an input voiced bilabial

stop on beat 3 / b3 / in Figure 127, the output candidate {B3} is constrictor-matched to the

input and satisfies *PlaceMismatch but violates high-ranking *BackbeatWithoutSnare; the

alternative output {^K3} violates *PlaceMismatch, but is a more optimal candidate than {B3}

based on this constraint ranking. On the other hand, for an input / b1 / which represents a

voiced bilabial stop on beat 1, the constrictor-matched candidate {B1} violates no constraints

and harmonically bounds {^K1} which violates *PlaceMismatch (Figure 128).

Figure 127. Tableau in which a speech labial stop is replaced by a K Snare on the back beat.

        / b3 /    *BackbeatWithoutSnare   *PlaceMismatch
  a.    {B3}      *!
  b. ☞  {^K3}                             *

Figure 128. Tableau in which a speech labial stop is replaced by a Kick Drum off the back
beat.

        / b1 /    *BackbeatWithoutSnare   *PlaceMismatch
  a. ☞  {B1}
  b.    {^K1}                             *!
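The evaluation shown in the two tableaux above can be sketched computationally. This is a minimal illustration, not part of the dissertation's analysis: the constraint names follow the text, but the candidate encoding (symbol, place, snare status, beat) and the strict-ranking evaluation via lexicographic comparison of violation profiles are assumptions made for the sketch.

```python
# Each candidate is a tuple: (symbol, place, is_snare, beat).
# Candidate sets per beat are assumed for illustration.
CANDIDATES = {
    1: [("{B1}", "labial", False, 1), ("{^K1}", "dorsal", True, 1)],
    3: [("{B3}", "labial", False, 3), ("{^K3}", "dorsal", True, 3)],
}

def backbeat_without_snare(cand, inp):
    """*BackbeatWithoutSnare: violation for a non-snare output on beat 3."""
    _symbol, _place, is_snare, beat = cand
    return 1 if beat == 3 and not is_snare else 0

def place_mismatch(cand, inp):
    """*PlaceMismatch: violation when output Place differs from input Place."""
    _symbol, place, _is_snare, _beat = cand
    return 0 if place == inp["place"] else 1

# Higher-ranked constraints come first; strict ranking is modeled by
# lexicographic comparison of the violation vectors.
RANKING = [backbeat_without_snare, place_mismatch]

def optimal(inp):
    cands = CANDIDATES[inp["beat"]]
    return min(cands, key=lambda c: [con(c, inp) for con in RANKING])[0]

print(optimal({"place": "labial", "beat": 3}))  # {^K3}  (Figure 127)
print(optimal({"place": "labial", "beat": 1}))  # {B1}   (Figure 128)
```

Because the constraints are evaluated lexicographically, a single *BackbeatWithoutSnare violation outweighs any number of *PlaceMismatch violations, mirroring the ranking in the tableaux.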

This phonological formalism is simple, but effective: just these two constraints produce the

desired outcome for 95% (84/88) of the replacements in this data set. The remaining 5%

described in section 3.1.3 may be accounted for either by additional constraints designed to

fit more specific conditions, by a related but more complicated model, MaxEnt (Hayes &

Wilson, 2008), or by gradient symbolic representations (Smolensky et al., 2014) that permit

more flexibility in the input-output place relationships. It is with this optimism in mind that

we suggest below two reasons not to use symbolic representations in models of beatrhyming:

the arbitrariness of speech-beatboxing feature mappings and the impossibility of splitting an

atomic unit.

In most symbolic models of phonology, the vocal constriction plan executed by the

motor system is not part of a phonological representation. The purpose of a phonological

place feature like [labial] (or if not privative, [±labial]) is to encode linguistic information,

and that information is defined by the feature’s contrastive relationship to other features

within the same linguistic system. Different phonological theories propose stronger or

weaker associations between a mental representation like [labial] and the physical lips

themselves, but there is an inherent duality that separates abstract phonological

representations from the concrete phonetic constrictors that implement them.

It is not clear what a mental representation of beatboxing should look like—especially

compared to speech representations—because beatboxing sounds do not encode contrastive

meaning. But say that a language-like beatboxing {labial} feature did exist, defined according

to some iconic relationship with other beatboxing features and, like a linguistic [labial]

feature, associated to some degree with physical constriction of the lips. This {labial}

beatboxing feature and [labial] phonological feature would have no meaningful

correspondence or inherent connection because they would be defined by completely

different information-bearing roles within their respective systems. Mapping abstract features

[labial] to {labial} would be arbitrary and just as computationally efficient as mapping

[labial] to {dorsal} or {ingressive}. The only reason to map [labial] with {labial} is because

they share an association to the physical lips. But in that case, the crux of the mapping—the

only property shared by both units—is a phonetic referent; the abstract symbolic units

themselves are moot. Given that the model is intended to be a phonological one, it seems

undesirable for the phonological units to have less importance than their phonetic output.

The second issue with symbols is that they are notoriously static, existing invariantly

outside of real time. When timing must be encoded in symbolic approaches, the

representations are laid out either in sequence or in other timing slots like autosegmental

tiers (Goldsmith, 1976). And, segments are temporally indivisible—they cannot start at one

time, pause for a bit, then pick up again where they left off. This is not a problem for

phonological models of talking or many other varieties of speech, but Figure 129 illustrates a

beatrhyming example of precisely this kind of split-segment behavior. In this case, the word

“move” [muv] is pronounced [mu]{B}[uv], with a Kick Drum temporarily interrupting the

[u] vowel. The same phenomenon is shown in Figure 130 with the word “sky” pronounced as

[skak͡ʟ̝̊↓a] (the canonical closure to [i] is not apparent in the spectrogram). Figure 112 from

the beginning of this chapter shows a related example of the [i] in “dopamine” prematurely

cut off in the pronunciation of the word as {t}[o]{B}[əmi]{^K}[n]. These cases of beatboxing

sounds that interrupt speech sounds are impossible to represent in a symbolic phonological

model because in many cases they would require splitting an indivisible representation into

two parts to achieve the appropriate output representation.

Even theories that permit a certain amount of intra-segment temporal flexibility

struggle with beatrhyming interruptions. Q-Theory (Shih & Inkelas 2014, 2018) may come

the closest: it innovates on traditional segments by splitting them into three quantal

sub-segmental pieces. These sub-segments roughly correspond articulatorily to the onset of

movement, target achievement, and constriction release for a given sound, and are especially

useful for representing a sound that has complex internal structure like a triphthong or a

three-part tone contour. It would be possible to represent the /u/ in “move” /muv/ as having

three sub-segmental divisions [u] [u] [u]. But based on our understanding of Q-Theory, it is

not possible to replace the middle sub-segment [u] with an entire and entirely different

segment {B}. Given enough time, it is inevitable that someone could imagine some phonetic

implementation rules or different flavor of symbolic representation that generates these

kinds of interruptions. In the meantime, we consider these interruptions and the

speech-beatboxing constrictor mapping discussed earlier as evidence against symbolic units

and in favor of gestural units as described next.

Articulatory Phonology is the hypothesis that the fundamental units of language are

action units, called “gestures” (Browman & Goldstein, 1986, 1989). Unlike symbolic features

which are time-invariant and only reference the physical vocal tract abstractly (if at all),

gestures as phonological units are spatio-temporal entities with deterministic and directly

observable consequences in the vocal tract. Phonological phenomena that are stipulated

through computational processes in other models emerge in Articulatory Phonology from

the coordination of gestures in an utterance. Gestures are commonly defined as dynamical

systems in the framework of task dynamics (Browman & Goldstein, 1989; Saltzman &

Munhall, 1989). While a gesture is active, it exerts control over a vocal tract variable (e.g., lip

aperture) to accomplish some linguistic task (e.g., a complete labial closure for the

production of a labial stop) as specified by the parameters of the system.
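The dynamical-system view of a gesture can be illustrated with a toy simulation. The critically damped point-attractor form below is the standard textbook version of a task-dynamic gesture; the specific parameter values (stiffness, time step, initial aperture) are assumptions chosen only for the sketch, not values from the dissertation.

```python
def simulate_gesture(x=10.0, v=0.0, target=0.0, k=400.0, dt=0.001, steps=200):
    """Integrate x'' = -k (x - target) - b x' with critical damping b = 2*sqrt(k).

    x is a tract variable (e.g., lip aperture in mm); while the gesture is
    active, x is driven toward the gesture's spatial target.
    """
    b = 2 * k ** 0.5  # critical damping: no overshoot past the target
    for _ in range(steps):
        a = -k * (x - target) - b * v  # restoring force minus damping
        v += a * dt                    # semi-implicit Euler update
        x += v * dt
    return x

# After 200 ms the lip aperture has moved most of the way from 10 mm
# toward the full-closure target of 0 mm (roughly 0.9 mm remains).
print(round(simulate_gesture(), 2))
```

The point of the sketch is that a gesture is not a static symbol but a control law: deactivating it mid-trajectory (or letting another gesture take over the same tract variable) is well-defined, which matters for the beatrhyming interruptions discussed in this chapter.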

Constrictor-matching emerges from a gestural framework because gestures are

defined by the vocal tract variable—and ultimately, the constrictor—they control. Gestures

are motor plans that leverage and tune the movement potential of the vocal articulators for

speech-specific purposes, but speech gestures are not the only action units that can control

the vocal tract. The vocal tract variables used for speech purposes are publicly available to

any other system of motor control, including beatboxing. This allows for a non-arbitrary

relationship between the fundamental phonological units of speech and beatboxing: a speech

unit and a beatboxing unit that both control lip aperture are inherently linked in a

beatboxing grammar because they control the same vocal tract variable.

Figure 129. Waveform, spectrogram, and text grid of the beatrhymed word “move” with a
Kick Drum splitting the vowel into two parts.

Figure 130. Waveform, spectrogram, and text grid of the beatrhymed word “sky” with a K
Snare splitting the vowel into two parts.

The cases in which a beatboxing sound temporarily interrupts a vowel can be modeled in

task dynamics with a parameter called gestural blending strength. When two gestures that

use the same constrictor overlap temporally, the movement plan during that time period

becomes the average of the two gestures’ spatial targets (and their time constants or

stiffness) weighted by their relative blending strengths. A stronger gesture exerts more

influence, and a gesture with very high relative blending strength will effectively override any

co-active gestures. For beatrhyming, the interrupting beatboxing sounds could be modeled as

having sufficiently high blending strength that the vowels they co-occur with are overridden

by the beatboxing sound; when the gestures for a beatboxing sound end, control of the vocal

tract returns solely to the vowel gesture. The Gestural Harmony Model (Smith, 2018) uses a

similar approach to account for transparent segments in phonological harmony.
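The weighted-average blending just described can be made concrete with a short sketch. The formula (targets weighted by relative blending strength) follows the task-dynamics description above; the numeric targets and strengths for [u] and {B} are illustrative assumptions, not measured values.

```python
def blended_target(gestures):
    """Effective target for co-active gestures on one tract variable.

    gestures: list of (spatial_target, blending_strength) pairs.
    Returns the blending-strength-weighted average of the targets.
    """
    total = sum(w for _, w in gestures)
    return sum(t * w for t, w in gestures) / total

vowel_u = (12.0, 1.0)   # [u]: open lip aperture (mm), ordinary strength
kick_B = (0.0, 100.0)   # {B}: full labial closure, very high strength

# While {B} is co-active, its high blending strength effectively
# overrides the vowel's open-aperture target...
print(round(blended_target([vowel_u, kick_B]), 2))  # 0.12
# ...and when {B} ends, control returns solely to the vowel gesture.
print(blended_target([vowel_u]))  # 12.0
```

With a strength ratio of 100:1, the blended target sits almost at full closure, which is how a Kick Drum can momentarily interrupt a vowel like the [u] in "move" without the vowel gesture ever being deleted.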

5. Conclusion

Vocal music is a powerful lens through which to study speech, offering insights about speech

that may not be accessible from studies of talking. Beatrhyming in particular demonstrates

how the fundamental units of speech can interact with the units of a completely different

behavior—beatboxing—in a complex but organized way. When combined with speech, the

aesthetic goals of musical performance lead to sound patterns that push the limits of

phonological theory and may even cause widely accepted paradigms to break down. This is

the advantage to be gained by building and testing theories based on insights from a more

diverse set of linguistic behaviors.

CHAPTER 8: CONCLUSION

This dissertation applied linguistic methods to an analysis of beatboxing and discovered that

beatboxing has a unit-level phonology rooted in the same types of fundamental mental

representations and organization as the phonology of speech, while embedded in a

performance task whose metrical structure is governed by musical organization principles.

Chapter 3: Sounds argued that beatboxing sounds have meaning and word-like frequency.

Each sound is composed combinatorially from a reusable set of constrictions; because the

sounds have meaning, these constrictions are contrastive—changing a constriction usually

changes the meaning of a sound. This contrastiveness resembles the contrastive organization

of speech sounds within a language. But just like in speech, not every articulatory change is a

contrastive one. Chapter 5: Alternations shows that the Kick Drum and PF Snare, and

perhaps also Closed Hi-Hat, have different phonetic manifestations depending on their

context: they are glottalic egressive in most contexts, but percussive when performed in

proximity to other sounds (made with a tongue body closure and no glottalic airflow initiation).

Chapter 6: Harmony shows that these alternations are—like so often in speech—the result of

multiple constrictions overlapping temporally. Here the contrastive airstreams from Chapter

3: Sounds participate actively as triggers, undergoers, and blockers in a process akin to

phonological harmony.

Taken together, the combinatorial contrastiveness of the constrictions that form

beatboxing sounds, the context-dependent alternations of beatboxing sounds, and the

class-based patterning of beatboxing sounds based on their combinatorial constrictions all

indicate that beatboxing has a phonology rooted in the same types of fundamental mental
representations and organization as linguistic phonology. These representations are united

with music cognition through rhythmic patterns, metrical organization, and sound classes

with patterning based on musical function (i.e., regularly placing snare-category sounds in

specific metrical positions).

As discussed in Chapter 1: Introduction, the interaction and overlap of different

cognitive pieces is related to a question of domain specificity, a topic which is sometimes

related to a theoretical dichotomy of modular cognition versus integrated cognition. The

finding that beatboxing exhibits signs of phonological cognition indicates that the

fundamental structure of phonology is not domain-specific. Furthermore, the phonological

foundations of both beatboxing and speech (see below) collaborate with aspects of music

cognition, which indicates that the building blocks of different domains superimpose onto

each other in task-specific ways to create each vocal behavior. This can be accounted for in

both modular and integrated approaches to cognition. A story consistent with a modular

approach to cognition is that beatboxing takes mental representations and grammar from

speech, combines them with musical meaning and metrical organization, and thereby adapts

them to a new use. Borrowing representations enables beatboxing to create phonological

contrasts and to use natural classes as the currency of productive synchronic processes. A

different story consistent with a more integrated approach to cognition is that beatboxing

and phonology both, somewhat independently, are shaped by the interaction of the

capabilities of the vocal tract they share, the recruitment of some domain-general

computations (i.e., combinatorial mental units), and their respective communicative or

aesthetic tasks. Regardless of the interpretation, the inescapable result is that linguistic

phonology is not particularly unique: beatboxing and speech share the same vocal tract and

organizational foundations, including mental representations and coordination of those

representations.

Beatboxing has phonological behavior based in phonological units and organization.

One could choose to model beatboxing with adaptations of either features or gestures as its

fundamental units, and that choice of unit can serve a story of modular cognition or of

integrated cognition. But as Chapter 4: Theory discusses, gestures have the distinction of

explicitly connecting the tasks specific to speech or to beatboxing with the sound-making

potential of the vocal substrate they share, which in turn creates a direct link between speech

gestures and beatboxing gestures. This link is formalized at the graph level of the dynamical

systems by which gestures are defined. The analysis of the graph level theoretical embedding

in this dissertation was focused on individual beatboxing units, their temporal coordination,

and their paradigmatic organization. Future work could formalize the link between speech

and musical prosodic, hierarchical, metrical structure as a different part of the graph level, in

order to better capture the ability of the phonological unit system to integrate in different

ways with music cognition.

The direct formal link between beatboxing and speech units makes predictions about

what types of phonological phenomena beatboxing and speech units are able to

exhibit—including the phonological properties described above. These predictions are borne

out insofar as beatboxing and speech phonological phenomena are both able to be accounted

for by the same theoretical mechanisms (e.g., intergestural timing and inhibition). Moreover,

it predicts that the phonological units of the two domains will be able to co-occur as they do

in Chapter 7: Beatrhyming, where phenomena that are challenging or impossible to

represent with symbolic units are easily represented using gestures.

These advantages of the gestural approach for describing speech, beatboxing, and

beatrhyming underscore a broader point: that regardless of whether phonology is modular or

not, the phonological system is certainly not encapsulated away from other cognitive

domains, nor impermeable to connections with other domains. On the contrary,

phonological units are intrinsically related to beatboxing units—and, presumably, to other

units in similar systems. This appears to fly in the face of conventional wisdom about

phonological units: at least as early as Sapir (1925), phonological units have been defined

exclusively by their psychological linguistic role—by their relationships with each other and

their synchronic patterning, but often without any phonetic or social aspects of their

manifestation and certainly without ties to non-linguistic domains. But the gestural approach

allows phonological units to have domain-specific meaning within their own system while

sharing a domain-general conformation with other behaviors.

The attributes that phonology shares with other domains allow it to manifest

flexibly—to be recruited into a multitude of speech behaviors while robustly fulfilling its

primary directives (e.g., communicating a linguistic message). This is different from, say, the

sensory processing involved in auditory spatial localization which is arguably a module in the

strongest sense—automatic, innate, and not (so far as we know) able to be tapped into for

different purposes by conscious cognitive thought (Liberman & Mattingly, 1989). Instead, the

conversational or laboratory-style speech that is the subject of the bulk of phonological

research is continuous with many other speech behaviors and at different levels of

phonological structure. Prosodically, conversational speech is continuous with poetry,

rapping, chanting, and singing: just a few small adjustments to rhythm or intonation

transform conversational speech into any of an abundance of genres of vocal linguistic art. A

non-musical speech utterance can even become perceived as musical when it is repeated a

few times (the speech to song illusion; Deutsch et al., 2011). Speech modality is not limited to

the typically-studied arrangement of vocal articulators: surrogate speech like talking drums

(Beier, 1954; Akinbo, 2019), xylophones (McPherson, 2018), and whistle speech (Rialland,

2005) shift phonological expression to new sound systems which are often integrated with

musical structure. And phonological units and grammar are not only used in speech

contexts—scat singing is utterly non-linguistic but follows phonological restrictions anyway

(Shaw, 2008). And as beatrhyming shows, the conformation of the most elemental

phonological units affords connections to similar units in beatboxing.

These different speech behaviors are collaborations between speech tasks and other

non-linguistic (e.g., musical) tasks, well-organized to maximize the satisfaction of all tasks

involved (or at least to minimize dissatisfaction). For vocal behaviors, these interactions are

constrained by the vocal substrate in which all of the tasks are active. In singing,

conversational speech prosody cannot manifest at the same time as sung musical melody

because they both require use of the larynx. Sustaining a note during a song therefore

requires selecting between a musical and speech-prosodic pitch and rhythm; but the

contrastive information and structure of the speech sound units are unperturbed—syllable

structure, sound selection, and relative sound order largely remain intact because they do not

compete with melody or rhythm. In some cases there is also text to tune alignment where

musical pitch and rhythm reflect the likely prosody of the utterance if it had been spoken

non-musically (Hayes & Kaun, 1996). Similar text to tune alignment is active in languages

with lexical tone, with tone contours exerting greater influence on the musical melody to

avoid producing unintended tones (Schellenberg, 2013; McPherson & Ryan, 2018). And in

beatrhyming, the speech and beatboxing tasks share the vocal tract through a relationship

that leverages their shared vocal apparatus to maximize their compatibility when possible

through constrictor matching.

In short, flexibility is a defining characteristic of the phonological system. If there is

anything special about speech, it is the speech tasks themselves and how they leverage all of

human vocal potential to flexibly produce these different behaviors. This is consistent with

an anthropophonic perspective of linguistic inquiry, initially framed by Catford (1977) and

Lindblom (1990) as an ideology for non-circularly defining and explaining which sounds

could be possible speech sounds. It is a deductive approach to explaining speech phenomena

as the result of an interaction between the tasks of speech—in Lindblom (1990), “selection

constraints"—and the total sound-making potential of the vocal tract. With respect to the

question of “What is a possible speech sound?”, the anthropophonic perspective re-frames

the question as “How do the tasks of speech filter the whole vocal sound-making potential

into a smaller, possibly finite set of speech sounds?” (Figure 131). As discussed in Chapter 4:

Theory, gestures as phonological units are in a sense a formalization of the anthropophonic

perspective.

Figure 131. The anthropophonic perspective.

In light of the clear flexibility of the phonological system, however, it must be made clear

that the selection constraints are not only the tasks of speech. There are many musical and

other non-linguistic tasks which shape behavior too—not to mention the social and affective

forces that incessantly impact speech production and phonological variation. A robust

account of phonology needs to be able to explain how the phonological system interacts

with these other forces via both their shared structures and their shared vocal substrate.

REFERENCES

Abbs, J. H., Gracco, V. L., & Cole, K. J. (1984). Control of Multimovement Coordination.
Journal of Motor Behavior, 16(2), 195–232.
[Link]

Abler, W. (1989). On the particulate principle of self-diversifying systems. Journal of Social
and Biological Systems, 12(1), 1–13. [Link]

Akinbo, S. (2019). Representation of Yorùbá Tones by a Talking Drum: An Acoustic Analysis.
Linguistique et Langues Africaines, 5, 11–23. [Link]

Anderson, S. R. (1981). Why Phonology Isn’t “Natural.” Linguistic Inquiry, 12(4), 493–539.

Archangeli, D., & Pulleyblank, D. (2015). Phonology without universal grammar. Frontiers in
Psychology, 6. [Link]

Archangeli, D., & Pulleyblank, D. (2022). Emergent phonology (Volume 7). Language Science
Press. [Link]

Ball, M. J., Esling, J. H., & Dickson, B. C. (2018). Revisions to the VoQS system for the
transcription of voice quality. Journal of the International Phonetic Association, 48(2),
165–171. [Link]

Ball, M. J., Esling, J., & Dickson, C. (1995). The VoQS System for the Transcription of Voice
Quality. Journal of the International Phonetic Association, 25(2), 71–80.
[Link]

Ball, M. J., Howard, S. J., & Miller, K. (2018). Revisions to the extIPA chart. Journal of the
International Phonetic Association, 48(2), 155–164.
[Link]

Ballard, K. J., Robin, D. A., & Folkins, J. W. (2003). An integrative model of speech motor
control: A response to Ziegler. Aphasiology, 17(1), 37–48.
[Link]

Baudouin de Courtenay, J. (1972). Selected Writings of Baudouin de Courtenay. Stankiewicz,
E. (Ed.). Bloomington: Indiana University Press.

Beale, J. M., & Keil, F. C. (1995). Categorical effects in the perception of faces. Cognition,
57(3), 217–239. [Link]

Beier, U. (1954). The talking drums of the Yoruba. African Music: Journal of the International
Library of African Music, 1(1), 29–31.

Bidelman, G. M., Gandour, J. T., & Krishnan, A. (2011). Cross-domain Effects of Music and
Language Experience on the Representation of Pitch in the Human Auditory Brainstem.
Journal of Cognitive Neuroscience, 23(2), 425–434. [Link]

Blaylock, R. (2021). VocalTract ROI Toolbox. Available online at [Link]

Blaylock, R., & Phoolsombat, R. (2019). Beatrhyming probes the nature of the interface
between phonology and beatboxing. The Journal of the Acoustical Society of America,
146(4), 3081–3081. [Link]

Blaylock, R., Patil, N., Greer, T., & Narayanan, S. S. (2017). Sounds of the Human Vocal Tract.
INTERSPEECH, 2287–2291. [Link]

Boersma, P., & Weenink, D. (1992-2022). Praat: Doing phonetics by computer (6.1.13)
[Computer software]. [Link]

Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International,
5(9/10), 341–345.

Bořil, T., & Skarnitzl, R. (2016). Tools rPraat and mPraat. In P. Sojka, A. Horák, I. Kopeček, &
K. Pala (Eds.), Text, Speech, and Dialogue (Vol. 9924, pp. 367–374). Springer International
Publishing. [Link]

Bresch, E., Nielsen, J., Nayak, K., & Narayanan, S. (2006). Synchronized and noise-robust
audio recordings during realtime magnetic resonance imaging scans. The Journal of the
Acoustical Society of America, 120(4), 1791–1794. [Link]

Browman, C. P., & Goldstein, L. (1986). Towards an Articulatory Phonology. Phonology
Yearbook, 3, 219–252.

Browman, C. P., & Goldstein, L. (1988). Some notes on syllable structure in articulatory
phonology. Phonetica, 45(2-4), 140-155.

Browman, C. P., & Goldstein, L. (1989). Articulatory gestures as phonological units.
Phonology, 6(2), 201–251. [Link]

Browman, C. P., & Goldstein, L. (1991). Gestural Structures: Distinctiveness, Phonological
Processes, and Historical Change. In I. G. Mattingly & M. Studdert-Kennedy (Eds.),
Modularity and the Motor Theory of Speech Perception: Proceedings of a Conference to
Honor Alvin M. Liberman (pp. 313–338).

Browman, C. P., & Goldstein, L. (1992). Articulatory Phonology: An Overview. Phonetica,
49(3–4), 155–180. [Link]

Browman, C. P., & Goldstein, L. (1995). Gestural Syllable Position Effects in American
English. In Bell-Berti, F. & Raphael, L. J. (Eds.), Producing Speech: Contemporary Issues.
For Katherine Safford Harris. AIP Press: New York.

Byrd, D., & Saltzman, E. (1998). Intragestural dynamics of multiple prosodic boundaries.
Journal of Phonetics, 26(2), 173–199. [Link]

Byrd, D., & Saltzman, E. (2003). The elastic phrase: Modeling the dynamics of
boundary-adjacent lengthening. Journal of Phonetics, 31(2), 149–180.
[Link]

Catford, J. C. (1977). Fundamental problems in phonetics. Indiana University Press.

Chomsky, N., & Halle, M. (1968). The Sound Pattern of English.

Clements, G. N. (2003). Feature economy in sound systems. Phonology, 20(3), 287–333.
[Link]

Cohn, A. C. (2007). Phonetics in Phonology and Phonology in Phonetics. Working Papers of
the Cornell Phonetics Laboratory, 16, 1–31.

Collins, J. (2017). Faculties and Modules: Chomsky on Cognitive Architecture. In J.
McGilvray (Ed.), The Cambridge Companion to Chomsky (2nd ed., pp. 217–234).
Cambridge University Press. [Link]

Coltheart, M. (1999). Modularity and cognition. Trends in Cognitive Sciences, 3(3), 115–120.
[Link]

Cooke, J. D. (1980). The Organization of Simple, Skilled Movements. In G. E. Stelmach & J.
Requin (Eds.), Advances in Psychology (Vol. 1, pp. 199–212). North-Holland.
[Link]

Cummins, F., & Port, R. (1998). Rhythmic constraints on stress timing in English. Journal of
Phonetics, 26(2), 145–171. [Link]

Danner, S. G., Krivokapić, J., & Byrd, D. (2019). Co-speech movement behavior in
conversational turn-taking. The Journal of the Acoustical Society of America, 146(4),
3082–3082.

Dehais-Underdown, A., Buchman, L., & Demolin, D. (2019, August). Acoustico-Physiological
coordination in the Human Beatbox: A pilot study on the beatboxed Classic Kick Drum.
19th International Congress of Phonetic Sciences.
[Link]

Dehais-Underdown, A., Vignes, P., Buchman, L. C., & Demolin, D. (2020). Human
Beatboxing: A preliminary study on temporal reduction. Proceedings of the 12th
International Seminar on Speech Production (ISSP), 142–145.

Dehais-Underdown, A., Vignes, P., Crevier-Buchman, L., & Demolin, D. (2021). In and out:
Production mechanisms in Human Beatboxing. 060005. [Link]

Deutsch, D., Henthorn, T., & Lapidis, R. (2011). Illusory transformation from speech to song.
The Journal of the Acoustical Society of America, 129(4), 2245–2252.
[Link]

Diehl, R. L. (1991). The Role of Phonetics within the Study of Language. Phonetica, 48(2–4),
120–134. [Link]

Diehl, R. L., & Kluender, K. R. (1989). On the Objects of Speech Perception. Ecological
Psychology, 1(2), 121–144. [Link]

Dresher, B. E. (2011). The Phoneme. In The Blackwell companion to phonology (pp.
241–266).

Drum tablature. (2022). In Wikipedia. [Link]

DrumTabs—DRUM TABS. (n.d.). Retrieved June 3, 2022, from [Link]

Duckworth, M., Allen, G., Hardcastle, W., & Ball, M. (1990). Extensions to the International
Phonetic Alphabet for the transcription of atypical speech. Clinical Linguistics &
Phonetics, 4(4), 273–280. [Link]

Dunbar, E., & Dupoux, E. (2016). Geometric Constraints on Human Speech Sound
Inventories. Frontiers in Psychology, 7.
[Link]

Eklund, R. (2008). Pulmonic ingressive phonation: Diachronic and synchronic
characteristics, distribution and function in animal and human sound production and in
human speech. Journal of the International Phonetic Association, 38(3), 235–324.
[Link]

Episode 4 | When Art Meets Therapy. (2019, March 23). [Link]

Evain, S., Contesse, A., Pinchaud, A., Schwab, D., Lecouteux, B., & Henrich Bernardoni, N.
(2019). Beatbox Sounds Recognition Using a Speech-dedicated HMM-GMM Based
System.

Farmer, J. D. (1990). A Rosetta stone for connectionism. Physica D: Nonlinear Phenomena,
42(1), 153–187. [Link]

Feld, S., & Fox, A. A. (1994). Music and Language. Annual Review of Anthropology, 23,
25–53.

Flash, T., & Sejnowski, T. J. (2001). Computational approaches to motor control. Current
Opinion in Neurobiology, 11, 655–662.

Fodor, J. A. (1983). The Modularity of Mind. MIT Press.

Fowler, C. A. (1980). Coarticulation and theories of extrinsic timing. Journal of Phonetics,
8(1), 113–133. [Link]

Fowler, C. A., & Rosenblum, L. D. (1990). Duplex perception: A comparison of monosyllables
and slamming doors. Journal of Experimental Psychology: Human Perception and
Performance, 16(4), 742–754. [Link]

Fukuda, M., Kimura, K., Blaylock, R., & Lee, S. (2022). Scope of beatrhyming:
Segments or words. Proceedings of the AJL 6 (Asian Junior Linguists), 59–63.
[Link]

Gafos, A. I. (1996). The articulatory basis of locality in phonology [Ph.D., The Johns Hopkins
University].
[Link]

Gafos, A. I., & Benus, S. (2006). Dynamics of Phonological Cognition. Cognitive Science,
30(5), 905–943. [Link]

Gafos, A., & Goldstein, L. (2011). Articulatory representation and organization. In A. C. Cohn,
C. Fougeron, & M. K. Huffman (Eds.), The Oxford Handbook of Laboratory Phonology
(1st ed.). Oxford University Press.
[Link]

Goldsmith, J. A. (1976). Autosegmental phonology [PhD, Massachusetts Institute of
Technology]. [Link]

Goldstein, L., Byrd, D., & Saltzman, E. (2006). The role of vocal tract gestural action units in
understanding the evolution of phonology. In M. A. Arbib (Ed.), Action to Language via
the Mirror Neuron System (pp. 215–249). Cambridge University Press.
[Link]

Goldstein, L., Nam, H., Saltzman, E., & Chitoran, I. (2009). Coupled Oscillator Planning
Model of Speech Timing and Syllable Structure. In C. G. M. Fant, H. Fujisaki, & J. Shen
(Eds.), Frontiers in phonetics and speech science (p. 239-249). The Commercial Press.
[Link]

Greenwald, J. (2002). Hip-Hop Drumming: The Rhyme May Define, but the Groove Makes
You Move. Black Music Research Journal, 22(2), 259–271. [Link]

Guinn, D., & Nazarov, A. (2018, January). Evidence for features and phonotactics in
beatboxing vocal percussion. 15th Old World Conference on Phonology, University
College London, United Kingdom.

Hale, K., & Nash, D. (1997). Damin and Lardil phonotactics. In Boundary Rider: Essays
in Honor of Geoffrey O’Grady (pp. 247–259). [Link]

Hale, M., & Reiss, C. (2000). Phonology as Cognition. Phonological Knowledge: Conceptual
and Empirical Issues, 161–184.

Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The Faculty of Language: What Is It, Who
Has It, and How Did It Evolve? Science, 298, 1569–1579.

Hayes, B. (1984). The Phonology of Rhythm in English. Linguistic Inquiry, 15(1), 33–74.

Hayes, B., & Kaun, A. (1996). The role of phonological phrasing in sung and chanted verse.
The Linguistic Review, 13(3–4). [Link]

Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic
learning. Linguistic Inquiry, 39(3), 379–440.

Hayes, B., Kirchner, R., & Steriade, D. (Eds.). (2004). Phonetically Based Phonology.
Cambridge University Press.

Himonides, E., Moors, T., Maraschin, D., & Radio, M. (2018). Is there potential for using
beatboxing in supporting laryngectomees? Findings from a public engagement project.

Hockett, C. F. (1955). A manual of phonology (Vol. 21). Indiana University Publications in
Anthropology and Linguistics.

Hoyt, D. F., & Taylor, C. R. (1981). Gait and the energetics of locomotion in horses. Nature,
292(5820), 239–240. [Link]

Human Beatbox. (2014, September 16). Unforced. HUMAN BEATBOX. [Link]

Icht, M. (2018). Introducing the Beatalk technique: Using beatbox sounds and rhythms to
improve speech characteristics of adults with intellectual disability: Using beatbox sounds
and rhythms to improve speech. International Journal of Language & Communication
Disorders, 54. [Link]

Icht, M. (2021). Improving speech characteristics of young adults with congenital dysarthria:
An exploratory study comparing articulation training and the Beatalk method. Journal of
Communication Disorders, 93, 106147. [Link]

Icht, M., & Carl, M. (2022). Points of view: Positive effects of the Beatalk technique on speech
characteristics of young adults with intellectual disability. International Journal of
Developmental Disabilities, 1–5. [Link]

Jakobson, R., Fant, C. G., & Halle, M. (1951). Preliminaries to speech analysis: The distinctive
features and their correlates.

Kaun, A. R. (2004). The typology of rounding harmony. In B. Hayes, R. Kirchner, & D.
Steriade (Eds.), Phonetically based phonology (pp. 87–116).

Keating, P. A. (1996). The Phonetics-Phonology Interface. UCLA Working Papers in
Phonetics, 92, 45–60.

Kelso, J. A. S., & Tuller, B. (1984). A Dynamical Basis for Action Systems. In M. S. Gazzaniga
(Ed.), Handbook of Cognitive Neuroscience (pp. 321–356). Springer US.
[Link]

Kelso, J. A. S., Holt, K. G., Rubin, P., & Kugler, P. N. (1981). Patterns of Human Interlimb
Coordination Emerge from the Properties of Non-Linear, Limit Cycle Oscillatory
Processes. Journal of Motor Behavior, 13(4), 226–261.
[Link]

Kelso, J. A., & Tuller, B. (1984). Converging evidence in support of common dynamical
principles for speech and movement coordination. American Journal of
Physiology-Regulatory, Integrative and Comparative Physiology, 246(6), R928–R935.
[Link]

Kelso, J. S., Tuller, B., Vatikiotis-Bateson, E., & Fowler, C. A. (1984). Functionally specific
articulatory cooperation following jaw perturbations during speech: Evidence for
coordinative structures. Journal of Experimental Psychology: Human Perception and
Performance, 10(6), 812–832. [Link]

Kimper, W. A. (2011). Competing Triggers: Transparency and Opacity in Vowel Harmony
[PhD Dissertation]. University of Massachusetts Amherst.

Krivokapić, J. (2014). Gestural coordination at prosodic boundaries and its role for prosodic
structure and speech planning processes. Philosophical Transactions of the Royal Society
B: Biological Sciences, 369(1658), 20130397. [Link]

Kröger, B. J., Schröder, G., & Opgen‐Rhein, C. (1995). A gesture‐based dynamic model
describing articulatory movement data. The Journal of the Acoustical Society of America,
98(4), 1878–1889. [Link]

Kugler, P. N., Kelso, J. A. S., & Turvey, M. T. (1980). On the Concept of Coordinative
Structures as Dissipative Structures: I. Theoretical Lines of Convergence. In G. E.
Stelmach & J. Requin (Eds.), Advances in Psychology (Vol. 1, pp. 3–47). North-Holland.
[Link]

Kuhl, P. K., & Miller, J. D. (1978). Speech perception by the chinchilla: Identification
functions for synthetic VOT stimuli. The Journal of the Acoustical Society of America,
63(3), 905–917. [Link]

Ladefoged, P. (1989). Representing Phonetic Structure (No. 73; Working Papers in Phonetics).
Phonetics Laboratory, Department of Linguistics, UCLA.

Lammert, A. C., Melot, J., Sturim, D. E., Hannon, D. J., DeLaura, R., Williamson, J. R.,
Ciccarelli, G., & Quatieri, T. F. (2020). Analysis of Phonetic Balance in Standard English
Passages. Journal of Speech, Language, and Hearing Research, 63(4), 917–930.
[Link]

Lammert, A. C., Proctor, M. I., & Narayanan, S. S. (2010). Data-Driven Analysis of Realtime
Vocal Tract MRI using Correlated Image Regions. Interspeech 2010, 1572–1575.

Lammert, A. C., Ramanarayanan, V., Proctor, M. I., & Narayanan, S. S. (2013). Vocal tract
cross-distance estimation from real-time MRI using region-of-interest analysis.
Interspeech 2013, 959–962.

Large, E. W. (2000). On synchronizing movements to music. Human Movement Science,
19(4), 527–566. [Link]

Large, E. W., & Kolen, J. F. (1994). Resonance and the perception of musical meter.
Connection Science, 6(1), 177–208.

Lartillot, O., Toiviainen, P., & Eerola, T. (2008). A Matlab Toolbox for Music Information
Retrieval. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Data
Analysis, Machine Learning and Applications (pp. 261–268). Springer.
[Link]

Lartillot, O., Toiviainen, P., Saari, P., & Eerola, T. (n.d.). MIRtoolbox (1.7.2) [Computer
software]. [Link]

Lederer, K. (2005/2006). The Phonetics of Beatboxing: Introduction (The Phonetics of
Beatboxing). [Link]

Lederer, K. (2005/2006). The Phonetics of Beatboxing. [Link]

Lerdahl, F., & Jackendoff, R. (1983/1996). A Generative Theory of Tonal Music. MIT press.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised.
Cognition, 21(1), 1–36. [Link]

Liberman, A. M., & Mattingly, I. G. (1989). A Specialization for Speech Perception. Science,
243(4890), 489–494.

Liberman, A. M., Isenberg, D., & Rakerd, B. (1981). Duplex perception of cues for stop
consonants: Evidence for a phonetic mode. Perception & Psychophysics, 30(2), 133–143.
[Link]

Liberman, M., & Prince, A. (1977). On Stress and Linguistic Rhythm. Linguistic Inquiry, 8(2),
249–336.

Liljencrants, J., & Lindblom, B. (1972). Numerical Simulation of Vowel Quality Systems: The
Role of Perceptual Contrast. Language, 48(4), 839. [Link]

Lindblom, B. (1983). Economy of Speech Gestures. In P. F. MacNeilage (Ed.), The Production
of Speech (pp. 217–245). Springer New York. [Link]

Lindblom, B. (1986). Phonetic universals in vowel systems. In Experimental phonology (pp.
13–44).

Lindblom, B. (1990). On the notion of “possible speech sound.” Journal of Phonetics, 18(2),
135–152. [Link]

Lindblom, B., & Maddieson, I. (1988). Phonetic universals in consonant systems. In Language,
speech and mind.

Lindblom, B., Lubker, J., & Gay, T. (1979). Formant frequencies of some fixed-mandible
vowels and a model of speech motor programming by predictive simulation. Journal of
Phonetics, 7(2), 147–161. [Link]

Lingala, S. G., Zhu, Y., Kim, Y.-C., Toutios, A., Narayanan, S., & Nayak, K. S. (2017). A fast and
flexible MRI system for the study of dynamic vocal tract shaping. Magnetic Resonance in
Medicine, 77(1), 112–125. [Link]

Llorens, M. (In progress). Dissertation, University of Southern California.

MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production.
Behavioral and Brain Sciences, 21(4), 499–511. [Link]

Maess, B., Koelsch, S., Gunter, T. C., & Friederici, A. D. (2001). Musical syntax is processed in
Broca’s area: An MEG study. Nature Neuroscience, 4(5), 540–545.
[Link]

Mann, V. A., & Liberman, A. M. (1983). Some differences between phonetic and auditory
modes of perception. Cognition, 14(2), 211–235.
[Link]

Martin, M., & Mullady, K. (n.d.). Education. Lightship Beatbox. Retrieved June 6, 2022, from
[Link]

McPherson, L. (2018). The Talking Balafon of the Sambla: Grammatical Principles and
Documentary Implications. Anthropological Linguistics, 60(3), 255–294.
[Link]

McPherson, L., & Ryan, K. M. (2018). Tone-tune association in Tommo So (Dogon) folk
songs. Language, 94(1), 119–156. [Link]

Mielke, J. (2011). Distinctive Features. In The Blackwell Companion to Phonology (pp. 1–25).
John Wiley & Sons, Ltd. [Link]

Moors, T., Silva, S., Maraschin, D., Young, D., Quinn V, J., Carpentier, J., Allouche, J., &
Himonides, E. (2020). Using Beatboxing for Creative Rehabilitation After Laryngectomy:
Experiences From a Public Engagement Project. Frontiers in Psychology, 10, 2854.
[Link]

Mullady, K. (2017, January 25). Beatboxing rapping and singing at the same time [Video].
YouTube. [Link]

Nam, H., & Saltzman, E. (2003). A Competitive, Coupled Oscillator Model of Syllable
Structure. Proceedings of the 15th International Congress of Phonetic Sciences.

Nam, H., Goldstein, L., & Saltzman, E. (2009). Self-organization of Syllable Structure: A
Coupled Oscillator Model. In F. Pellegrino, E. Marsico, I. Chitoran, & C. Coupé (Eds.),
Approaches to Phonological Complexity (pp. 297–328). Walter de Gruyter.
[Link]

Narayanan, S., Nayak, K., Lee, S., Sethy, A., & Byrd, D. (2004). An approach to real-time
magnetic resonance imaging for speech production. The Journal of the Acoustical Society
of America, 115(4), 1771–1776. [Link]

Oh, M. (2021). Articulatory Dynamics and Stability in Multi-Gesture Complexes [Ph.D.,
University of Southern California].
[Link]

Oh, M., & Lee, Y. (2018). ACT: An Automatic Centroid Tracking tool for analyzing vocal tract
actions in real-time magnetic resonance imaging speech production data. The Journal of
the Acoustical Society of America, 144(4), EL290–EL296. [Link]

Ohala, J. J. (1980). Moderator’s summary of symposium on “Phonetic universals in
phonological systems and their explanation.” Proceedings of the 9th International
Congress of Phonetic Sciences, 3, 181–194.

Ohala, J. J. (1983). The Origin of Sound Patterns in Vocal Tract Constraints. [Link]

Ohala, J. J. (1990). There is no interface between phonology and phonetics: A personal view.
Journal of Phonetics, 18(2), 153–171. [Link]

Ohala, J. J. (1994). Towards a universal, phonetically-based, theory of vowel harmony. The
3rd International Conference on Spoken Language Processing, ICSLP, Yokohama, Japan.

Ohala, J. J. (2008). Languages’ Sound Inventories: The Devil in the Details. UC Berkeley
Phonology Lab Annual Reports, 4. [Link]

O’Dell, M. L., & Nieminen, T. (1999). Coupled oscillator model of speech rhythm.
Proceedings of the 14th International Congress of Phonetic Sciences, 2, 1075–1078.

O’Dell, M. L., & Nieminen, T. (2009). Coupled oscillator model for speech timing: Overview
and examples. Prosody: Proceedings of the 10th Conference, 179–190.

Palmer, C., & Kelly, M. H. (1992). Linguistic Prosody and Musical Meter in Song. Journal of
Memory and Language, 31(4), 525–542.

Park, J. (2016, September 12). 80 Fitz | Build your basic sound arsenal | HUMAN BEATBOX.
HUMAN BEATBOX.
[Link]

Park, J. (2017, March 22). Spit Snare—HUMAN BEATBOX. HUMAN BEATBOX. [Link]

Paroni, A., Henrich Bernardoni, N., Savariaux, C., Lœvenbruck, H., Calabrese, P., Pellegrini, T.,
Mouysset, S., & Gerber, S. (2021). Vocal drum sounds in human beatboxing: An acoustic
and articulatory exploration using electromagnetic articulography. The Journal of the
Acoustical Society of America, 149(1), 191–206. [Link]

Paroni, A., Lœvenbruck, H., Baraduc, P., Savariaux, C., Calabrese, P., & Bernardoni, N. H.
(2021). Humming Beatboxing: The Vocal Orchestra Within. MAVEBA 2021 - 12th
International Workshop Models and Analysis of Vocal Emissions for Biomedical
Applications, Universita Degli Studi Firenze.

Parrell, B., & Narayanan, S. (2018). Explaining Coronal Reduction: Prosodic Structure and
Articulatory Posture. Phonetica, 75(2), 151–181. [Link]

Patil, N., Greer, T., Blaylock, R., & Narayanan, S. S. (2017). Comparison of Basic Beatboxing
Articulations Between Expert and Novice Artists Using Real-Time Magnetic Resonance
Imaging. Interspeech 2017, 2277–2281. [Link]

Pike, K. L. (1943). Phonetics: A Critical Analysis of Phonetic Theory and a Technique for the
Practical Description of Sounds. University of Michigan Publications.

Pillot-Loiseau, C., Garrigues, L., Demolin, D., Fux, T., Amelot, A., & Crevier-Buchman, L.
(2020). Le human beatbox entre musique et parole: Quelques indices acoustiques et
physiologiques. Volume !, 16:2/17:1, 125–143. [Link]

Pouplier, M. (2012). The gaits of speech: Re-examining the role of articulatory effort in
spoken language. In M.-J. Solé & D. Recasens (Eds.), Current Issues in Linguistic Theory
(Vol. 323, pp. 147–164). John Benjamins Publishing Company.
[Link]

Prince, A., & Smolensky, P. (1993/2004). Optimality Theory: Constraint Interaction in
Generative Grammar. Manuscript, Rutgers University and University of Colorado
Boulder. Published 2004 by Blackwell Publishing.

Proctor, M., Bresch, E., Byrd, D., Nayak, K., & Narayanan, S. (2013). Paralinguistic
mechanisms of production in human “beatboxing”: A real-time magnetic resonance
imaging study. The Journal of the Acoustical Society of America, 133(2), 1043–1054.
[Link]

Proctor, M., Lammert, A., Katsamanis, A., Goldstein, L., Hagedorn, C., & Narayanan, S. (2011).
Direct Estimation of Articulatory Kinematics from Real-Time Magnetic Resonance Image
Sequences. Interspeech 2011, 281–284.

Ravignani, A., Honing, H., & Kotz, S. A. (2017). Editorial: The Evolution of Rhythm
Cognition: Timing in Music and Speech. Frontiers in Human Neuroscience, 11.
[Link]

Rialland, A. (2005). Phonological and phonetic aspects of whistled languages. Phonology,
22(2), 237–271. [Link]

Roon, K. D., & Gafos, A. I. (2016). Perceiving while producing: Modeling the dynamics of
phonological planning. Journal of Memory and Language, 89, 222–243.
[Link]

Rose, S., & Walker, R. (2011). Harmony Systems. In The Handbook of Phonological Theory
(pp. 240–290). John Wiley & Sons, Ltd. [Link]

Saltzman, E. L., & Munhall, K. G. (1989). A Dynamical Approach to Gestural Patterning in
Speech Production. Ecological Psychology, 1(4), 333–382.
[Link]

Saltzman, E. L., & Munhall, K. G. (1992). Skill Acquisition and Development: The Roles of
State-, Parameter-, and Graph-Dynamics. Journal of Motor Behavior, 24(1), 49–57.
[Link]

Saltzman, E., & Kelso, J. A. (1987). Skilled actions: A task-dynamic approach. Psychological
Review, 94(1), 84–106. [Link]

Saltzman, E., Nam, H., Goldstein, L., & Byrd, D. (2006). The Distinctions Between State,
Parameter and Graph Dynamics in Sensorimotor Control and Coordination. In M. L.
Latash & F. Lestienne (Eds.), Motor Control and Learning (pp. 63–73). Kluwer Academic
Publishers. [Link]

Saltzman, E., Nam, H., Krivokapic, J., & Goldstein, L. (2008). A task-dynamic toolkit for
modeling the effects of prosodic structure on articulation. Proceedings of the 4th
International Conference on Speech Prosody (Speech Prosody 2008), 175–184.

Sapir, E. (1925). Sound Patterns in Language. Language, 1(2), 37–51.

Schellenberg, M. H. (2013). The Realization of Tone in Singing in Cantonese and Mandarin.
The University of British Columbia.

Schellenberg, M., & Gick, B. (2020). Microtonal Variation in Sung Cantonese. Phonetica,
77(2), 83–106. [Link]

Schyns, P. G., Goldstone, R. L., & Thibaut, J.-P. (1998). The development of features in object
concepts. Behavioral and Brain Sciences, 21(1), 1–17.
[Link]

Shadmehr, R. (1998). The Equilibrium Point Hypothesis for Control of Movements.
Baltimore, MD: Department of Biomedical Engineering, Johns Hopkins University.

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical
Journal, 27(3), 379–423. [Link]

Shaw, P. A. (2008). Scat syllables and markedness theory. Toronto Working Papers in
Linguistics, 27, 145–191.

Shih, S. S., & Inkelas, S. (2014). A Subsegmental Correspondence Approach to Contour Tone
(Dis)Harmony Patterns. Proceedings of the Annual Meetings on Phonology, 1(1), Article
1. [Link]

Shih, S. S., & Inkelas, S. (2018). Autosegmental Aims in Surface-Optimizing Phonology.
Linguistic Inquiry, 50(1), 137–196. [Link]

Shih, S. S., & Zuraw, K. (2017). Phonological conditions on variable adjective and noun word
order in Tagalog. Language, 93(4), e317–e352. [Link]

Smith, C. M. (2018). Harmony in Gestural Phonology [Ph.D., University of Southern
California].
[Link]

Smolensky, P., Goldrick, M., & Mathis, D. (2014). Optimization and quantization in gradient
symbol systems: A framework for integrating the continuous and the discrete in
cognition. Cognitive Science, 38(6), 1102–1138.

Sorensen, T., & Gafos, A. (2016). The Gesture as an Autonomous Nonlinear Dynamical
System. Ecological Psychology, 28(4), 188–215.
[Link]

Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17(1–2), 3–45.
[Link]

Stevens, K. N., & Keyser, S. J. (2010). Quantal theory, enhancement and overlap. Journal of
Phonetics, 38(1), 10–19. [Link]

Stowell, D. (2003). The Beatbox Alphabet. The Beatbox Alphabet. [Link]

Stowell, D., & Plumbley, M. D. (2008). Characteristics of the beatboxing vocal style (No.
C4DM-TR-08–01; pp. 1–4). Queen Mary, University of London.

Studdert-Kennedy, M., & Goldstein, L. (2003). Launching Language: The Gestural Origin of
Discrete Infinity. In M. H. Christiansen & S. Kirby (Eds.), Language Evolution (pp.
235–254). Oxford University Press.
[Link]

Tiede, M. (2010). MVIEW: Multi-channel visualization application for displaying dynamic
sensor movements.

Tilsen, S. (2009). Multitimescale Dynamical Interactions Between Speech Rhythm and
Gesture. Cognitive Science, 33(5), 839–879. [Link]

Tilsen, S. (2018, March 28). Three mechanisms for modeling articulation: Selection,
coordination, and intention. Cornell Working Papers in Phonetics and Phonology.

Tilsen, S. (2019). Motoric Mechanisms for the Emergence of Non-local Phonological Patterns.
Frontiers in Psychology, 10. [Link]

Tyte, & SPLINTER. (2014, September 18). Standard Beatbox Notation (SBN). HUMAN
BEATBOX. [Link]

Tyte, G., & Splinter, M. (2002/2004). Standard Beatbox Notation (SBN). Retrieved
December 8, 2019, from [Link]

WIRED. (2020, March 17). 13 Levels of Beatboxing: Easy to Complex | WIRED. [Link]

Walker, R. (2005). Weak Triggers in Vowel Harmony. Natural Language & Linguistic Theory,
23(4), 917. [Link]

Walker, R., Byrd, D., & Mpiranya, F. (2008). An articulatory view of Kinyarwanda coronal
harmony. Phonology, 25(3), 499–535. [Link]

Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual
reorganization during the first year of life. Infant Behavior and Development, 7(1), 49–63.
[Link]

Westbury, J. R. (1983). Enlargement of the supraglottal cavity and its relation to stop
consonant voicing. The Journal of the Acoustical Society of America, 73(4), 1322–1336.
[Link]

Woods, K. J. (2012). (Post)Human Beatbox Performance and the Vocalisation of Electronic
and Mechanically (Re)Produced Sounds.

Wyttenbach, R. A., May, M. L., & Hoy, R. R. (1996). Categorical Perception of Sound
Frequency by Crickets. Science, 273(5281), 1542–1544.

Ziegler, W. (2003a). Speech motor control is task-specific: Evidence from dysarthria and
apraxia of speech. Aphasiology, 17(1), 3–36. [Link]

Ziegler, W. (2003b). To speak or not to speak: Distinctions between speech and nonspeech
motor control. Aphasiology, 17(2), 99–105. [Link]

Zipf, G. K. (1949). Human Behavior And The Principle Of Least Effort. Addison-Wesley
Press, Inc. [Link]

de Saussure, F. (1916). Cours de linguistique générale. Payot.

de Torcy, T., Clouet, A., Pillot-Loiseau, C., Vaissière, J., Brasnu, D., & Crevier-Buchman, L.
(2014). A video-fiberscopic study of laryngopharyngeal behaviour in the human beatbox.
Logopedics Phoniatrics Vocology, 39(1), 38–48.
[Link]

APPENDIX: Harmony beat pattern drum tabs

Beat pattern 1: Clickroll showcase

b |x-----------x---|--x-----------x-|x-----------x---|--x-------------
B |------x---------|------x---x-----|------x---------|------x---x-----
t |----------------|----------------|----------------|------------x---
dc|----x-----x-----|x---x-------x---|----x-----x-----|x---x-----------
^K|--------x-------|--------x-------|--------x-------|--------x-----x-
CR|x~~~--------x~~~|--x~------------|x~~~--------x~~~|--x~------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
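Each row of these tabs is a sixteenth-note grid: "x" marks a sound onset, "~" marks the sustained continuation of the preceding sound, "-" marks silence, and "|" separates bars, with the bottom row labeling the beats and offbeats. As a minimal sketch of how the notation can be read programmatically (the helper name and code are illustrative, not part of the dissertation):

```python
def tab_onsets(row: str) -> list[int]:
    """Return the 0-indexed sixteenth-note positions where a row has an onset.

    Everything before the first '|' is the sound label; bar separators '|'
    are dropped, and only 'x' counts as a new onset ('~' sustains, '-' rests).
    """
    label, _, grid = row.partition("|")
    cells = [c for c in grid if c != "|"]
    return [i for i, c in enumerate(cells) if c == "x"]

# Example: the kick row of Beat pattern 1, bar 1 only — onsets on beat 1
# and the "+" of beat 4 (positions 0 and 12 of the 16-cell bar).
print(tab_onsets("b |x-----------x---"))  # [0, 12]
```

Sustained sounds like the clickroll (CR) row then read out as a single onset followed by held cells, e.g. `tab_onsets("CR|x~~~--------x~~~|--x~------------")` yields positions 0, 12, and 18 across the two bars.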

Beat pattern 2: Clop showcase

C |x---x-x---x-x-x-|x-x---x-x-x-xxx-|x---x-x---x-x-x-|x-x---x-x-x-xxx-
ex|x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Note: exhale may be some kind of voicing, given the larynx activity

Beat pattern 3: Duck Meow SFX showcase

b |------x---------|--x---x---------|x-----x-----x---|--x---x---x-x---
ac |--x-------x---x-|----------------|--x-------x---x-|----x-----------
dc |----x-----------|x---------------|----x-----------|x---------------
tbc|----------------|----x-----------|----------------|----------------
DM |x-----------x---|----------x-----|x-----------x---|------------x---
^K |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Beat pattern 4: Liproll showcase

Bars 1-4

b |x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
ac |----------x-----|----------x-----|----------x-----|--------x-------
dc |----------------|----x-----------|----------------|------------x---
tbc|----------------|----------------|----------------|----x-----------
pf |--------x-------|--------x-------|--------x-------|------x---------
LR |x~~~~~------x~~~|~~----------x~~~|x~~~~~------x~~~|~~--------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Bars 5-8

b |x---x-------x---|x---x-------x---|x---x-------x---|x---x-----------
ac |----------x-----|----------x-----|----------x-----|----------------
dc |----------------|--------------x-|----------------|----------------
tbc|----------------|----------------|----------------|----------------
pf |--------x-------|--------x-------|--------x-------|--------x-------
LR |x~~~x~~~----x~~~|x~~~x~~~--------|x~~~x~~~----x~~~|x~~~x~~~--------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Beat pattern 5: Spit Snare showcase

b |x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
dc |----x-----------|----------------|----x-----------|----------------
tll|----------------|x---------------|----------------|x---------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Beat pattern 6: Water Drop Air showcase

b |x---------------|x---------------|x---------------|x---------------
ac |--x-------x---x-|--x-------x---x-|--x-------x---x-|--x-------x---x-
WDA|----x~~~---x~~--|----x~~~---x~~--|----x~~~---x~~--|----x~~~---x~~--
pf |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Beat pattern 7: Water Drop Tongue showcase

b |x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x---x-
WDT|--x-x---------x-|x---x-------x---|--x-x---------x-|x---x-------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Beat pattern 8: Inward Bass showcase

B |x---------------|----------------|----------------|----------------
b |------------x---|----------------|x-----------x---|----------------
SS |--------x-------|--------x-------|--------x-------|----------------
IB |x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Beat pattern 9: Humming while Beatboxing showcase

b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Beat pattern 10: Unknown 1

Bars 1-4

B |x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
^LR|x~~~~~------x~~~|~~--------------|x~~~~~------x~~~|~~--------------
^K |----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|------------x~~~
b |----------------|------x---------|----------------|------x---------
dc |----------------|----------------|----------------|----------------
dac|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Bars 5-8

B |x-----x-----x---|--x-------------|x---------------|----------------
^LR|x~~~~~------x~~~|~~--------------|----------------|----------------
^K |----------------|----------------|----------------|------------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|----------------
b |----------------|------x---------|------x-----x---|--x---x---------
dc |----------------|----------------|--x-----------x-|----------------
dac|----------------|----------------|----x-----------|x---------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Beat pattern 11: Unknown 2

Bars 1-4

hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
b |x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---x---x-
B |----------------|----------------|----------------|----------------
dc |--x-----------x-|----------------|--x-------------|----------------
tll|----x-----------|x---------------|----x-----------|----------------
tbc|----------------|----x-----------|----------------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
WDT|----------------|------------x---|----------------|------------x---
PF |----------------|----------------|----------------|----------------
ta |----------------|----------------|----------------|----------------
^K |----------------|----------------|----------------|----------------
^LR|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

Bars 5-8

hm |x---x-------x---|x---x-------x---|----------------|----------------
b |x-----x-----x---|--x-----------x-|----------------|----x-x---------
B |----------------|----------------|----------------|----------x-----
dc |--x-------------|----x-------x---|----x-----x---x-|--x-------------
tll|----x-----------|------x---------|----------------|----------------
tbc|----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|----------------|----------------
WDT|--------------x-|x---------------|----------------|----------------
PF |----------------|----------------|x-----x-----x---|x---------------
ta |----------------|----------------|--x-----x-------|----------------
^K |----------------|----------------|----------------|--------x-------
^LR|----------------|----------------|----------------|----------x~~~~~
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +

ProQuest Number: 29323155
