David Allen

Mrs. Jenkins

English IV

02/25/20

Neural Networks in Music

A computer is like a soldier. You tell it go and it goes, do and it does. But the machine

can’t improvise on unclear instructions, nor can it look at subtle patterns and form its own

conclusions, unlike the human brain. This is where humans vastly surpass computers in

functioning. If we think about it, humans have been taking in information through their eyes since birth, to the point where they can instantly identify dogs, cats, and road signs on sight. Why can’t modern machines do this?

Viewed primitively as a machine, the human brain takes input through the five senses and turns it into information and action. Computer scientists are actively discovering ways to replicate this process in machines, a field known as artificial intelligence. I plan to make a simple version of this intelligence in what is known as a neural network.

Neural networks have incredible potential in themselves. They can paint unique artwork,

identify your mood, and predict stocks and weather. As a guitar player, I’m passionate about the

musical application of artificial intelligence, and I’ve been fortunate to have had exposure to computer programming since beginning in the third grade. Specifically, I want to research the

ways computer science can help musicians learn the complexities of music theory and ear

training, and provide a resource for many beginners to reach higher levels of skill. I asked: can

neural networks revolutionize the way we create music and open a pathway for aspiring artists to

learn their craft?

Artificial neural networks have their origin in the study of the human brain. The first

recorded explanation of biological neurons was in 1943 when “neurophysiologist Warren

McCulloch and mathematician Walter Pitts wrote a paper on how neurons might work”

(“History”). The theory hit the computer scene in 1959 with Stanford’s development of

ADALINE and MADALINE, the most primitive forms of artificial neural networks produced on

a machine, which were able to predict the next bit in a streaming phone line (“History”). The

first unsupervised multiple-layer neural network was achieved in 1975 (“History”). The

technology is still shockingly new, but it irrevocably impacts the world. Still, most people are

totally unaware of what a neural network is.

The point of a neural network is to accomplish the difficult tasks a human can perform by

mimicking the architecture of the brain. The principal way neural networks imitate biological

neurons is in the way they receive input and output. A single brain neuron cell has what are

called dendrites that take in electrical or chemical signals from adjacent neurons and send them

to the nucleus of the cell (“Brain Basics”). After being deciphered in the nucleus, the signal is

sent down the cell’s axon into the axon terminals. These emit output signals to other neurons

through a tiny gap in between cells called the synapse (“Brain Basics”). An artificial neuron

emulates this process: it takes in a numerical input, runs it through a function, compresses the

answer, and sends it to other “neurons.” By chaining neurons together, the network is able to

make incredibly fast and complicated computations using relatively simple construction.

An example artificial neuron takes in values between zero and one. These values together

are subjected to what are called weights and biases, which alter the value depending on how

influential each input is to the final result (Nielsen). For example, when we look at a tree we

don’t expect to see the color blue. Therefore, seeing the color blue should be given a strongly

negative weight to discourage the computer from concluding it sees a tree, and this line of

reasoning applies to all other factors, like size and shape. Once the inputs are weighted, the function adds a bias. A bias mimics the brain by preventing the neuron from firing unless a positive enough

value is reached. A relatively high bias causes the neuron to fire in most cases, while a low bias

tends to bar the neuron from firing. Although these numbers may seem arbitrary at first, they are

critical for the network to make calculated judgments and to improve its own thinking. The result

of the process is then condensed back into a value between zero and one, typically using a

sigmoid function, S(x) = 1/(1 + e⁻ˣ) (Nielsen). The numbers can then be sent to the next neurons and the

cycle continues.
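To make this concrete, here is a minimal Python sketch of a single artificial neuron; the inputs, weights, and bias below are made-up values for illustration, not numbers from a trained network.

```python
import math

def sigmoid(x):
    # Compress any number into a value between zero and one: S(x) = 1 / (1 + e^-x)
    return 1 / (1 + math.exp(-x))

def neuron(inputs, weights, bias):
    # Weight each input, add the bias, then squash the total with the sigmoid
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)

# Made-up example: three inputs between zero and one, arbitrary weights and bias
print(neuron([0.9, 0.1, 0.4], weights=[2.0, -3.0, 1.0], bias=-0.5))
```

The output is itself a value between zero and one, ready to be passed along to the next neuron.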

A neural network is composed of layers of these neurons. The first layer is the input

layer, followed by a discretionary number of hidden layers, ending in an output layer. This is parallel to the brain, which has sensory neurons, interneurons, and motor neurons (“Brain

Basics”). Any network with more than one hidden layer is known as a deep neural network

(Sturm et al.). A music example may be one where the input is the waveform data in a five-second

sound clip and the output is the name of the instrument being played. The hidden layers do all

the heavy lifting, weighing each input and passing informative results on to other neurons. How

do these weights get set? There is no golden formula that dictates what each value should be.

What it takes to fine-tune the results of a neural network is the same as for any human: relentless

practice.
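As a rough sketch of how these layers connect (the layer sizes and random starting weights here are invented purely for illustration), a forward pass through one hidden layer might look like this in Python:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)

# Invented sizes: 4 input values, 5 hidden neurons, 3 output neurons
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)   # hidden layer -> output layer

def feed_forward(x):
    hidden = sigmoid(W1 @ x + b1)       # the hidden layer does the "heavy lifting"
    return sigmoid(W2 @ hidden + b2)    # the output layer gives the final answer

print(feed_forward(np.array([0.2, 0.7, 0.1, 0.9])))
```

With random weights the answer is meaningless; training is what turns these numbers into something useful.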

While our brains can easily distinguish handwritten characters and song lyrics, computers

have trouble making sense of pixels and frequencies. In order to get them up to speed, thousands

of tests must be administered to train a neural network to see patterns that we can identify in an

instant. When a test is fed forward through the network, it produces an output,

whether correct or completely false. The computer determines how far off it is from its expected

outcome using a loss function. The loss is essential to improving the network so it can “learn,” and it is simply the difference between the expected and actual outputs (Loy). All the weights and biases

of the network are arranged into a matrix, and each one is shifted according to the outcome of

the loss function inserted into the gradient descent function (Nielsen). Gradient descent is a

function that uses multivariate calculus to find local minima in a multidimensional

curve--visualize a ball rolling down to the lowest point it finds on a hilly terrain--and outputs a

vector containing every slight and grand directional movement the network needs to make to get

closer to the truth (Nielsen). This whole process is called backpropagation, because the machine

is moving backwards through the network to update itself (Loy). Testing the network thousands

of times and using backpropagation on a sample of the results allows it to minimize its loss and

produce more accurate outcomes (Rocca). The most remarkable part is that the computer does

this entirely on its own, only depending on simple math.
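A toy Python sketch of the idea (not a full backpropagation implementation, and the numbers are invented): gradient descent nudges a single weight downhill until the loss between the expected and actual output shrinks toward zero.

```python
# One made-up training example: input x should map to expected output y
x, y = 2.0, 10.0
w = 0.0                    # start with an arbitrary weight
learning_rate = 0.05

for step in range(100):
    prediction = w * x
    loss = (prediction - y) ** 2          # loss: squared difference of expected and actual
    gradient = 2 * (prediction - y) * x   # slope of the loss with respect to w
    w -= learning_rate * gradient         # roll the "ball" a small step downhill

print(w)   # w settles near 5.0, where the prediction matches the expected output
```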

Unlike images, sounds have multiple attributes that increase the difficulty of the

computer’s task of differentiation. Images are made up of a number of pixels which have

individual, quantifiable RGB values, responsible for all the data in the image. Sounds, on the

other hand, have multiple factors, including pitch, tone, and timbre, some of which are not so easy to

record (Nave). While pitch can be measured with frequency, using Hertz, tone is dependent on

the quality of a sound as well as its pitch and volume. Timbre can be called the complexity of a

sound wave, and is the main element in determining the type of instrument being played (Yun and Bi). Sounds also have what is called attack and decay, which depict the change in a note over

time, and can be useful for humans to distinguish sounds like cymbals crashing versus a trash

can falling over (Nave). There is also the tempo or BPM of a sound to consider. Altogether,

computers need a way to deal with the immense variation in the raw data. Luckily, there are

already a few methods for dealing with this.

One way to convert the complexities of sound into computational information is called a

Fourier transformation. It exists on the principle that “All waveforms, no matter what you scribble or observe in the universe, are actually just the sum of simple sinusoids of different

frequencies” (Bevelacqua). What that means is that all sound waves that could ever possibly be

made can be turned into a discrete mathematical function, which is incredibly important for a

computer to be able to glean information from. Many researchers interested in music recognition

using data science will use a form of this called the Short-Time Fourier Transform, or STFT, which essentially maps out audio signals over short windows of time (Bevelacqua). Although the STFT yields more complex data than the raw audio information alone, it is often the better, more professional method of reading music data. Therefore, most projects use mid-level transformations like the STFT (Nair).
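As an illustration of what this looks like in code, SciPy offers a short-time Fourier transform; the 440 Hz sine wave below is a made-up stand-in for a real recording.

```python
import numpy as np
from scipy.signal import stft

sample_rate = 22050                       # audio samples per second
t = np.linspace(0, 2, 2 * sample_rate, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)       # a pure 440 Hz tone standing in for a recording

# STFT: which frequencies are present in each short window of time
frequencies, times, coefficients = stft(audio, fs=sample_rate, nperseg=1024)
print(coefficients.shape)                 # (frequency bins, time windows)
```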

Music is dealt with differently from images in other ways as well. Instead of typical

neural networks, sound clips are often analyzed using Recurrent Neural Networks, or RNNs. An

RNN “creates some level of memory in the network” by feeding certain outcomes recursively

again through the algorithm, and for this reason it works exceedingly well with problems that

involve time periods and loops, like music (Hadad). Advantages of this kind of network include

its potential to take any length of input while the architecture remains the same size; however, it

takes significantly longer to compute (“Recurrent Neural Networks Cheatsheet”). Variable size

input is incredibly important in analyzing the timbre of an instrument because of the

aforementioned attack and decay; the beginning of the sound differs from the middle and end of the note. For example, when the bow first hits the violin string, there is initially a plucking sound preceding the tremulous tune. A recurrent network ensures all aspects of the sound are

considered over the entire time domain (Franklin). Alternatively, the sound could be broken up into beginning, middle, and end, then modeled separately, which would improve the accuracy of the network (Anderson). Notwithstanding, a recurrent network is well suited to almost any scenario involving audio.
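The core idea of a recurrent neuron can be sketched in a few lines of Python; the sizes and random weights below are invented, but they show how the previous hidden state is fed back in alongside each new input, giving the network its “memory.”

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented sizes: 8 input features per time step, a hidden "memory" of 16 values
W_in = rng.normal(size=(16, 8))
W_hidden = rng.normal(size=(16, 16))
bias = np.zeros(16)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous hidden state
    return np.tanh(W_in @ x_t + W_hidden @ h_prev + bias)

h = np.zeros(16)                          # memory starts empty
sequence = rng.normal(size=(100, 8))      # e.g. 100 time steps of audio features
for x_t in sequence:                      # any sequence length works without resizing the network
    h = rnn_step(x_t, h)
print(h)
```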

Neural networks have already been used in fascinating ways: to recognize chord progressions and even distinguish music genres (Ghosal and Kolekar). It seems there are

abundant opportunities already for the world of computer science to collide with those of music

and art. For an algorithm that recognizes the instrument being played in a sound, a network model needs to be constructed; but there is no need to reinvent the wheel. Google has developed an open-source API called TensorFlow for those interested in machine learning (Hadad). The

product makes it significantly simpler to focus on gathering data and training the network, while

most of the advanced calculus is encapsulated inside prewritten code. Most of the relevant code

is contained in a library of functions called Keras, which significantly eases the process of

constructing deep learning in code (“Recurrent Layers”). Keras can be imported in Python and

easily utilized in a program.
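A minimal sketch of how such a model could be assembled with Keras follows; the input shape, layer sizes, and the assumption of ten instrument classes are placeholders for illustration, not a tested architecture.

```python
from tensorflow import keras

# Assumed input: 100 time steps of 128 spectrogram features per sound clip
model = keras.Sequential([
    keras.Input(shape=(100, 128)),
    keras.layers.LSTM(64),                          # recurrent layer keeps memory over time
    keras.layers.Dense(32, activation="relu"),      # hidden layer
    keras.layers.Dense(10, activation="softmax"),   # assumed: 10 possible instruments
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # loss function used during training
              metrics=["accuracy"])
model.summary()
```

Training the model would then be a matter of feeding it labeled sound clips with model.fit, while Keras handles the calculus behind the scenes.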

Unfortunately, computer science is woefully under-taught in today’s society. According

to the CSE Coalition, “Only one out of four K-12 schools teach any computer science” in the

United States, meaning that many children never even get the chance to discover this incredibly

important and rapidly expanding field of study. Experts in the field have repeatedly emphasized the coming prevalence of machine learning (Benaich and Hogarth).

Yaron Hadad, one such expert, has been working with machine learning for 20 years. He

has a PhD in Mathematics and Physics. His company, Nutrino, takes billions of nutritional data

points to provide personalized meal plans for consumers. When asked about the future of

artificial intelligence, he made it clear that “we will see companies incorporating A.I. in pretty

much every single industry that exists”, like in the medical field, where an algorithm that can

locate tumors in an x-ray was recently developed, as well as the music industry. The reality is

that “a lot of things that we naturally have been doing for decades will be enhanced by A.I.”.

Like a batch of yeast, the tech world is working its way through the modern market. It is not a

question of whether or not computers will take over: technology has “taken over for about 100

years...It's been happening since the industrial revolution”. Many people worry that A.I. will

harm society. In terms of the workforce, it is true that “certain jobs will get displaced, but they

will be replaced by other things”, as has been the case throughout history. However, because the technology has an unknown and unprecedented potential to interfere in daily life, “AI should be regulated”; but this only means that machine learning deserves more awareness, not

less.

Ultimately, the question of whether neural networks will be instrumental in the music industry can be answered with a straightforward assertion. Stanford computer scientists Allen

Huang and Raymond Wu have already begun tackling the problem of generating music

completely hands-free. In their published research paper, they recorded that their programs “were

able to learn meaningful musical structure” and that, in general, many volunteer respondents couldn’t distinguish the computer-generated music from a novice composer’s (Huang and Wu). This technology continues to develop, and “more intricate music has been learned as the state of

the art in recurrent networks improves” (Sturm et al.). Furthermore, scientists from across the

globe have worked together to construct a basic model with the goal of creating a network that

transcribes original music. A developed version of this product could aid musicians by providing

inspiration when composing music (Sturm et al.). Another application of the network could be to

transcribe pieces in order to help musicians learn how to play, the way many jazz musicians

imitate their favorite solos to become proficient themselves (Anderson). The applications of artificial intelligence to sound are expansive, and the opportunities are innumerable.

Kence Anderson, musician and principal program manager at Microsoft AI & Research,

interacts with neural networks every day. He remarks that it is an “awesome time to live”

regarding the modern advancement of A.I. Not only are we using machine learning in an

increasing measure to vitalize every industry, we are also making deep learning more accessible

to those without specialized computer science knowledge. It is evident that the conceptual

background required to fully grasp and implement artificial intelligence is daunting, to say the

least; however, “through machine teaching at Bonsai and Microsoft, we're able to take

[mechanical engineers] and allow them to train these very complex neural networks without

[them] knowing how to architect a neural network”, which in turn allows a higher number of

people to contribute to new projects. According to Kence, the “deep reinforcement learning”

technology that makes this accessibility for laymen possible has only been around since 2015, when Google first pioneered it. What this means for the future is that almost all autonomous systems

will run on A.I. algorithms. As for the music industry, Anderson hopes for A.I. that can

collaborate with artists the way musicians play off each other.

In my project I discovered only shoreline seashells compared to the deep ocean of machine learning. As high and wide and deep as the math and science behind neural network models may be, grasping them is worthwhile for the incredible, imagination-surpassing progress being brought to our society. Musicians will absolutely benefit from the expansion of this technology,

as well as every other industry in our economy. Researchers have already begun to assist artists

in both learning and composing music, and I hope to join them as I develop my computer science

knowledge. The world faces a bright yet cloudy future. What would benefit society most is for

upcoming generations to participate in the inevitable technological revolution, whether in

advancing A.I. or setting boundaries for it. Therefore I also plan to be active in teaching

computer principles to younger students in order to expose them to the silently growing field

early on. At the end of my research, I am inspired by the creative possibilities of neural networks, but also more aware of the way I think, identify patterns, and make connections. It seems that as we build these increasingly sophisticated machines, we simultaneously build a deeper

understanding of ourselves. And that is truly deep learning.



Works Cited

Anderson, Kence. Principal Program Manager at Microsoft A.I. & Research. Personal interview.

11 March 2020.

Bevelacqua, Pete. “Fourier Transforms.” Fourier Transform, www.thefouriertransform.com/.

Benaich, Nathan, and Ian Hogarth. “State of AI Report 2019.” State of AI, Air Street Capital, 28

June 2019, www.stateof.ai/.

Franklin, Judy A. “Recurrent Neural Networks for Music Computation.” INFORMS Journal on Computing, INFORMS, 1 Aug. 2006,

pubsonline.informs.org/doi/abs/10.1287/ijoc.1050.0131.

Ghosal, Deepanway, and Maheshkumar H Kolekar. “Music Genre Recognition Using Deep

Neural Networks and Transfer Learning.” Interspeech 2018, Indian Institute of

Technology, 6 Sept. 2018,

www.isca-speech.org/archive/Interspeech_2018/pdfs/2045.pdf.

Hadad, Yaron. Chief Scientist and Co-founder of Nutrino. Phone interview. 1 March 2020.

Huang, Allen, and Raymond Wu. “Deep Learning for Music.” Allenh.pdf, Stanford University,

cs224d.stanford.edu/reports/allenh.pdf.

Loy, James. “How to Build Your Own Neural Network.” Towards Data Science, 14 May 2018, towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6.

Nair, Amal, et al. “Step By Step Guide To Audio Visualization In Python.” Analytics India

Magazine, Pvt Ltd., 16 Dec. 2019,

analyticsindiamag.com/step-by-step-guide-to-audio-visualization-in-python/.

National Institute of Neurological Disorders and Stroke. “Brain Basics: The Life and Death of a

Neuron | National Institute of Neurological Disorders and Stroke.” National Institute of Health, 16 Dec. 2019, www.ninds.nih.gov/Disorders/Patient-Caregiver-Education/Life-and-Death-Neuron.

Nave, R. “Timbre.” Sound Quality or Timbre, GSU,

hyperphysics.phy-astr.gsu.edu/hbase/Sound/timbre.html.

“Neural Networks - History.” Stanford CS, Stanford University, cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history1.html. Accessed 11 March 2020.

Nielsen, Michael A. Neural Networks and Deep Learning. Determination Press, 2015,

neuralnetworksanddeeplearning.com/index.html.

“Recurrent Layers.” Keras Documentation, GitHub, 17 Sept. 2019, keras.io/layers/recurrent/.

“Recurrent Neural Networks Cheatsheet.” Stanford CS, Stanford University,

stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks.

Rocca, Baptiste. “Handling Imbalanced Datasets in Machine Learning.” Towards Data Science, Medium, 30 Mar. 2019, towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28.

Sturm, Bob L., et al. “Music Transcription Modelling and Composition Using Deep Learning.” ArXiv, ArXiv, 29 Apr. 2016, arxiv.org/pdf/1604.08723.pdf.

Yun, Mingqing, and Jing Bi. “Deep Learning for Musical Instrument Recognition.” University of Rochester, pdfs.semanticscholar.org/ad42/01d862fd0952d8028697d505ad7697337292.pdf.

