David Allen

Mrs. Jenkins

English IV

02/25/20

Neural Networks in Music

A computer is like a soldier. You tell it go and it goes, do and it does. But the machine

can’t improvise on unclear instructions, nor can it look at subtle patterns and form its own

conclusions, unlike the human brain. This is where humans vastly surpass computers in

functioning. If we think about it, humans have been taking in information through their eyes since birth, to the point where they can instantly identify dogs, cats, and road signs on sight. Why can’t modern machines do this?

Viewed primitively as a machine, the human brain takes input through the five senses and turns it into information and action. Computer scientists are actively discovering ways to replicate this process in machines, a field known as artificial intelligence. I plan to make a simple version of this intelligence in what is known as a neural network.

Neural networks have incredible potential in themselves. They can paint unique artwork,

identify your mood, and predict stocks and weather. As a guitar player, I’m passionate about the

musical application of artificial intelligence, and I’ve been fortunate to have had exposure to computer programming since beginning in the third grade. Specifically, I want to research the

ways computer science can help musicians learn the complexities of music theory and ear

training, and provide a resource for many beginners to reach higher levels of skill. I asked: can

neural networks revolutionize the way we create music and open a pathway for aspiring artists to

learn their craft?

Artificial neural networks have their origin in the study of the human brain. The first

recorded explanation of biological neurons was in 1943 when “neurophysiologist Warren

McCulloch and mathematician Walter Pitts wrote a paper on how neurons might work”

(“History”). The theory hit the computer scene in 1959 with Stanford’s development of

ADALINE and MADALINE, the most primitive forms of artificial neural networks produced on

a machine, which were able to predict the next bit in a streaming phone line (“History”). The

first unsupervised multiple-layer neural network was achieved in 1975 (“History”). The

technology is still shockingly new, but it irrevocably impacts the world. Still, most people are

totally unaware of what a neural network is.

The point of a neural network is to accomplish the difficult tasks a human can perform by

mimicking the architecture of the brain. The principal way neural networks imitate biological

neurons is in the way they receive input and output. A single brain neuron cell has what are

called dendrites that take in electrical or chemical signals from adjacent neurons and send them

to the nucleus of the cell (“Brain Basics”). After being deciphered in the nucleus, the signal is

sent down the cell’s axon into the axon terminals. These emit output signals to other neurons

through a tiny gap in between cells called the synapse (“Brain Basics”). An artificial neuron

emulates this process: it takes in a numerical input, runs it through a function, compresses the

answer, and sends it to other “neurons.” By chaining neurons together, the network is able to

make incredibly fast and complicated computations using relatively simple construction.

An example artificial neuron takes in values between zero and one. These values together

are subjected to what are called weights and biases, which alter the value depending on how

influential each input is to the final result (Nielsen). For example, when we look at a tree we

don’t expect to see the color blue. Therefore, seeing the color blue should be given a strongly

negative weight to discourage the computer from concluding it sees a tree, and this line of

reasoning applies to all other factors, like size and shape. Once the inputs are weighted, the function adds a bias. A bias mimics the brain by preventing the neuron from firing unless a positive enough

value is reached. A relatively high bias causes the neuron to fire in most cases, while a low bias

tends to bar the neuron from firing. Although these numbers may seem arbitrary at first, they are

critical for the network to make calculated judgments and to improve its own thinking. The result

of the process is then condensed back into a value between zero and one, typically using a

sigmoid function, S(x) = 1/(1 + e⁻ˣ) (Nielsen). The numbers can then be sent to the next neurons and the

cycle continues.
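To make this concrete, here is a minimal Python sketch of a single artificial neuron; the inputs, weights, and bias below are made-up values for illustration, not numbers from a trained network.

```python
import math

def sigmoid(x):
    # Compress any number into a value between zero and one: S(x) = 1 / (1 + e^-x)
    return 1 / (1 + math.exp(-x))

def neuron(inputs, weights, bias):
    # Weight each input, add the bias, then squash the total with the sigmoid
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)

# Made-up example: three inputs between zero and one, arbitrary weights and bias
print(neuron([0.9, 0.1, 0.4], weights=[2.0, -3.0, 1.0], bias=-0.5))
```

The output is itself a value between zero and one, ready to be passed along to the next neuron.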

A neural network is composed of layers of these neurons. The first layer is the input

layer, followed by a discretionary number of hidden layers, ending in an output layer. This is parallel to the brain, which has sensory neurons, interneurons, and motor neurons (“Brain

Basics”). Any network with more than one hidden layer is known as a deep neural network

(Sturm et al.). A music example may be one where the input is the waveform data in a five-second

sound clip and the output is the name of the instrument being played. The hidden layers do all

the heavy lifting, weighing each input and passing informative results on to other neurons. How

do these weights get set? There is no golden formula that dictates what each value should be.

What it takes to fine-tune the results of a neural network is the same as for any human: relentless

practice.
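As a rough sketch of how these layers connect (the layer sizes and random starting weights here are invented purely for illustration), a forward pass through one hidden layer might look like this in Python:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)

# Invented sizes: 4 input values, 5 hidden neurons, 3 output neurons
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)   # hidden layer -> output layer

def feed_forward(x):
    hidden = sigmoid(W1 @ x + b1)       # the hidden layer does the "heavy lifting"
    return sigmoid(W2 @ hidden + b2)    # the output layer gives the final answer

print(feed_forward(np.array([0.2, 0.7, 0.1, 0.9])))
```

With random weights the answer is meaningless; training is what turns these numbers into something useful.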

While our brains can easily distinguish handwritten characters and song lyrics, computers

have trouble making sense of pixels and frequencies. In order to get them up to speed, thousands

of tests must be administered to train a neural network to see patterns that we can identify in an

instant. When a test is fed forward through the network, it produces an output,

whether correct or completely false. The computer determines how far off it is from its expected

outcome using a loss function. The loss is essential to improving the network so it can “learn,” and it is simply the difference between the expected and actual outputs (Loy). All the weights and biases

of the network are arranged into a matrix, and each one is shifted according to the outcome of

the loss function inserted into the gradient descent function (Nielsen). Gradient descent is a

function that uses multivariate calculus to find local minima in a multidimensional

curve--visualize a ball rolling down to the lowest point it finds on a hilly terrain--and outputs a

vector containing every slight and grand directional movement the network needs to make to get

closer to the truth (Nielsen). This whole process is called backpropagation, because the machine

is moving backwards through the network to update itself (Loy). Testing the network thousands

of times and using backpropagation on a sample of the results allows it to minimize its loss and

produce more accurate outcomes (Rocca). The most remarkable part is that the computer does

this entirely on its own, only depending on simple math.
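A toy Python sketch of the idea (not a full backpropagation implementation, and the numbers are invented): gradient descent nudges a single weight downhill until the loss between the expected and actual output shrinks toward zero.

```python
# One made-up training example: input x should map to expected output y
x, y = 2.0, 10.0
w = 0.0                    # start with an arbitrary weight
learning_rate = 0.05

for step in range(100):
    prediction = w * x
    loss = (prediction - y) ** 2          # loss: squared difference of expected and actual
    gradient = 2 * (prediction - y) * x   # slope of the loss with respect to w
    w -= learning_rate * gradient         # roll the "ball" a small step downhill

print(w)   # w settles near 5.0, where the prediction matches the expected output
```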

Unlike images, sounds have multiple attributes that increase the difficulty of the

computer’s task of differentiation. Images are made up of a number of pixels which have

individual, quantifiable RGB values, responsible for all the data in the image. Sounds, on the

other hand, have multiple factors, including pitch, tone, and timbre, some of which are not so easy to

record (Nave). While pitch can be measured with frequency, using Hertz, tone is dependent on

the quality of a sound as well as its pitch and volume. Timbre can be called the complexity of a

sound wave, and is the main element in determining the type of instrument being played (Yun and Bi). Sounds also have what is called attack and decay, which depict the change in a note over

time, and can be useful for humans to distinguish sounds like cymbals crashing versus a trash

can falling over (Nave). There is also the tempo or BPM of a sound to consider. Altogether,

computers need a way to deal with the immense variation in the raw data. Luckily, there are

already a few methods for dealing with this.

One way to convert the complexities of sound into computational information is called a

Fourier transformation. It exists on the principle that “All waveforms, no matter what you scribble or observe in the universe, are actually just the sum of simple sinusoids of different

frequencies” (Bevelacqua). What that means is that all sound waves that could ever possibly be

made can be turned into a discrete mathematical function, which is incredibly important for a

computer to be able to glean information from. Many researchers interested in music recognition

using data science will use a form of this called the Short-Time Fourier Transform, or STFT, which essentially maps out audio signals over short windows of time (Bevelacqua). Although the STFT yields more complex data than the raw audio information alone, it is often the better, more professional method of reading music data. Therefore, most projects use mid-level transformations like the STFT (Nair).
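As an illustration of what this looks like in code, SciPy offers a short-time Fourier transform; the 440 Hz sine wave below is a made-up stand-in for a real recording.

```python
import numpy as np
from scipy.signal import stft

sample_rate = 22050                       # audio samples per second
t = np.linspace(0, 2, 2 * sample_rate, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)       # a pure 440 Hz tone standing in for a recording

# STFT: which frequencies are present in each short window of time
frequencies, times, coefficients = stft(audio, fs=sample_rate, nperseg=1024)
print(coefficients.shape)                 # (frequency bins, time windows)
```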

Music is dealt with differently from images in other ways as well. Instead of typical

neural networks, sound clips are often analyzed using Recurrent Neural Networks, or RNNs. An

RNN “creates some level of memory in the network” by feeding certain outcomes recursively

again through the algorithm, and for this reason it works exceedingly well with problems that

involve time periods and loops, like music (Hadad). Advantages of this kind of network include

its potential to take any length of input while the architecture remains the same size; however, it

takes significantly longer to compute (“Recurrent Neural Networks Cheatsheet”). Variable size

input is incredibly important in analyzing the timbre of an instrument because of the

aforementioned attack and decay; the beginning of the sound differs from the middle and end of the note. For example, when the bow first hits the violin string, there is initially a plucking sound preceding the tremulous tune. A recurrent network ensures all aspects of the sound are

considered over the entire time domain (Franklin). Alternatively, the sound could be broken up into beginning, middle, and end, then modeled separately, which would improve the accuracy of the network (Anderson). Notwithstanding, a recurrent network is well suited to almost any scenario involving audio.
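The core idea of a recurrent neuron can be sketched in a few lines of Python; the sizes and random weights below are invented, but they show how the previous hidden state is fed back in alongside each new input, giving the network its “memory.”

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented sizes: 8 input features per time step, a hidden "memory" of 16 values
W_in = rng.normal(size=(16, 8))
W_hidden = rng.normal(size=(16, 16))
bias = np.zeros(16)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous hidden state
    return np.tanh(W_in @ x_t + W_hidden @ h_prev + bias)

h = np.zeros(16)                          # memory starts empty
sequence = rng.normal(size=(100, 8))      # e.g. 100 time steps of audio features
for x_t in sequence:                      # any sequence length works without resizing the network
    h = rnn_step(x_t, h)
print(h)
```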

Neural networks have already been used in fascinating ways: to recognize chord progressions and even distinguish music genres (Ghosal and Kolekar). It seems there are

abundant opportunities already for the world of computer science to collide with those of music

and art. For an algorithm that recognizes the instrument being played in a sound, a network model needs to be constructed; but there is no need to reinvent the wheel. Google has developed an open-source API called TensorFlow for those interested in machine learning (Hadad). The

product makes it significantly simpler to focus on gathering data and training the network, while

most of the advanced calculus is encapsulated inside prewritten code. Most of the relevant code

is contained in a library of functions called Keras, which significantly eases the process of

constructing deep learning in code (“Recurrent Layers”). Keras can be imported in Python and

easily utilized in a program.
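A minimal sketch of how such a model could be assembled with Keras follows; the input shape, layer sizes, and the assumption of ten instrument classes are placeholders for illustration, not a tested architecture.

```python
from tensorflow import keras

# Assumed input: 100 time steps of 128 spectrogram features per sound clip
model = keras.Sequential([
    keras.Input(shape=(100, 128)),
    keras.layers.LSTM(64),                          # recurrent layer keeps memory over time
    keras.layers.Dense(32, activation="relu"),      # hidden layer
    keras.layers.Dense(10, activation="softmax"),   # assumed: 10 possible instruments
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # loss function used during training
              metrics=["accuracy"])
model.summary()
```

Training the model would then be a matter of feeding it labeled sound clips with model.fit, while Keras handles the calculus behind the scenes.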

Unfortunately, computer science is woefully under-taught in today’s society. According

to the CSE Coalition, “Only one out of four K-12 schools teach any computer science” in the

United States, meaning that many children never even get the chance to discover this incredibly

important and rapidly expanding field of study. Experts in the field have repeatedly emphasized the coming prevalence of machine learning (Benaich and Hogarth).

Yaron Hadad, one such expert, has been working with machine learning for 20 years. He

has a PhD in Mathematics and Physics. His company, Nutrino, takes billions of nutritional data

points to provide personalized meal plans for consumers. When asked about the future of

artificial intelligence, he made it clear that “we will see companies incorporating A.I. in pretty

much every single industry that exists”, like in the medical field, where an algorithm that can

locate tumors in an x-ray was recently developed, as well as the music industry. The reality is

that “a lot of things that we naturally have been doing for decades will be enhanced by A.I.”.

Like a batch of yeast, the tech world is working its way through the modern market. It is not a

question of whether or not computers will take over: technology has “taken over for about 100

years...It's been happening since the industrial revolution”. Many people worry that A.I. will

harm society. In terms of the workforce, it is true that “certain jobs will get displaced, but they

will be replaced by other things”, as has been the case throughout history. However, because the technology has an unknown and unprecedented potential to interfere in daily life, “AI should be regulated”; but this only means that machine learning deserves more awareness, not

less.

Ultimately, the question of whether neural networks will be instrumental in the music industry can be answered with a straightforward assertion. Stanford computer scientists Allen

Huang and Raymond Wu have already begun tackling the problem of generating music

completely hands-free. In their published research paper, they recorded that their programs “were

able to learn meaningful musical structure” and that, in general, many volunteer respondents couldn’t distinguish the computer-generated music from a novice composer’s (Huang and Wu). This technology continues to develop, and “more intricate music has been learned as the state of

the art in recurrent networks improves” (Sturm et al.). Furthermore, scientists from across the

globe have worked together to construct a basic model with the goal of creating a network that

transcribes original music. A developed version of this product could aid musicians by providing

inspiration when composing music (Sturm et al.). Another application of the network could be to

transcribe pieces in order to help musicians learn how to play, the way many jazz musicians

imitate their favorite solos to become proficient themselves (Anderson). The applications of artificial intelligence to sound are expansive, and the opportunities are innumerable.

Kence Anderson, musician and principal program manager at Microsoft AI & Research,

interacts with neural networks every day. He remarks that it is an “awesome time to live”

regarding the modern advancement of A.I. Not only are we using machine learning in an

increasing measure to vitalize every industry, we are also making deep learning more accessible

to those without specialized computer science knowledge. It is evident that the conceptual

background required to fully grasp and implement artificial intelligence is daunting, to say the

least; however, “through machine teaching at Bonsai and Microsoft, we're able to take

[mechanical engineers] and allow them to train these very complex neural networks without

[them] knowing how to architect a neural network”, which in turn allows a higher number of

people to contribute to new projects. According to Kence, the “deep reinforcement learning”

technology that makes this accessibility for laymen possible has only been around since 2015, when Google first pioneered it. What this means for the future is that almost all autonomous systems

will run on A.I. algorithms. As for the music industry, Anderson hopes for A.I. that can

collaborate with artists the way musicians play off each other.

In my project I discovered only shoreline seashells compared to the deep ocean of machine learning. As high and wide and deep as the math and science behind neural network models may be, grasping them is worthwhile for the incredible, imagination-surpassing progress being brought to our society. Musicians will absolutely benefit from the expansion of this technology,

as well as every other industry in our economy. Researchers have already begun to assist artists

in both learning and composing music, and I hope to join them as I develop my computer science

knowledge. The world faces a bright yet cloudy future. What would benefit society most is for

upcoming generations to participate in the inevitable technological revolution, whether in

advancing A.I. or setting boundaries for it. Therefore I also plan to be active in teaching

computer principles to younger students in order to expose them to the silently growing field

early on. At the end of my research, I am inspired by the creative possibilities of neural networks, but also more aware of the way I think, identify patterns, and make connections. It seems that as we build these increasingly sophisticated machines, we simultaneously build a deeper

understanding of ourselves. And that is truly deep learning.



Works Cited

Anderson, Kence. Principal Program Manager at Microsoft A.I. & Research. Personal interview.

11 March 2020.

Bevelacqua, Pete. “Fourier Transforms.” Fourier Transform, www.thefouriertransform.com/.

Benaich, Nathan, and Ian Hogarth. “State of AI Report 2019.” State of AI, Air Street Capital, 28

June 2019, www.stateof.ai/.

Franklin, Judy A. “Recurrent Neural Networks for Music Computation.” INFORMS Journal on Computing, INFORMS, 1 Aug. 2006,

pubsonline.informs.org/doi/abs/10.1287/ijoc.1050.0131.

Ghosal, Deepanway, and Maheshkumar H Kolekar. “Music Genre Recognition Using Deep

Neural Networks and Transfer Learning.” Interspeech 2018, Indian Institute of

Technology, 6 Sept. 2018,

www.isca-speech.org/archive/Interspeech_2018/pdfs/2045.pdf.

Hadad, Yaron. Chief Scientist and Co-founder of Nutrino. Phone interview. 1 March 2020.

Huang, Allen, and Raymond Wu. “Deep Learning for Music.” Allenh.pdf, Stanford University,

cs224d.stanford.edu/reports/allenh.pdf.

Loy, James. “How to Build Your Own Neural Network.” Towards Data Science, 14 May 2018, towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6.

Nair, Amal, et al. “Step By Step Guide To Audio Visualization In Python.” Analytics India

Magazine, Pvt Ltd., 16 Dec. 2019,

analyticsindiamag.com/step-by-step-guide-to-audio-visualization-in-python/.

National Institute of Neurological Disorders and Stroke. “Brain Basics: The Life and Death of a

Neuron | National Institute of Neurological Disorders and Stroke.” National Institute of Health, 16 Dec. 2019, www.ninds.nih.gov/Disorders/Patient-Caregiver-Education/Life-and-Death-Neuron.

Nave, R. “Timbre.” Sound Quality or Timbre, GSU,

hyperphysics.phy-astr.gsu.edu/hbase/Sound/timbre.html.

“Neural Networks - History.” Stanford CS, Stanford University, cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history1.html. Accessed 11 March 2020.

Nielsen, Michael A. Neural Networks and Deep Learning. Determination Press, 2015,

neuralnetworksanddeeplearning.com/index.html.

“Recurrent Layers.” Keras Documentation, GitHub, 17 Sept. 2019, keras.io/layers/recurrent/.

“Recurrent Neural Networks Cheatsheet.” Stanford CS, Stanford University,

stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks.

Rocca, Baptiste. “Handling Imbalanced Datasets in Machine Learning.” Towards Data Science, Medium, 30 Mar. 2019, towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28.

Sturm, Bob L., et al. “Music Transcription Modelling and Composition Using Deep Learning.” ArXiv, ArXiv, 29 Apr. 2016, arxiv.org/pdf/1604.08723.pdf.

Yun, Mingqing, and Jing Bi. “Deep Learning for Musical Instrument Recognition.” University of Rochester, pdfs.semanticscholar.org/ad42/01d862fd0952d8028697d505ad7697337292.pdf.

