
Harmonic Mixing: Key & Beat Detection Algorithms
M.Eng Individual Project - Final Report

Christopher Roebuck

Project Supervisor: Iain Phillips


Project Directory: http://www.doc.ic.ac.uk/~cjr03/Project
1. Abstract
Harmonic mixing is the art of mixing together two songs based on their key. In order for a DJ to perform
harmonic mixing of two songs, their key and tempo must be known in advance. The aim of this project is
to automate the process of detecting the key and tempo of a song, so that a DJ can select two songs
which will ‘sound good’ when mixed together.

The result will be a program which, given a song, can detect its key and tempo automatically and enable
the user to mix two songs together based on these features. This document outlines the research into
various key and beat detection algorithms and the design, implementation and evaluation of such a
program.

2. Acknowledgements
I would like to thank my supervisor, Iain Phillips, for proposing the project in the first place and taking
the time to meet me regularly throughout the course of the project.

I would also like to thank Christopher Harte for sending me his paper on a Quantised Chromagram, and
Kyogu Lee for responding to my emails about the Harmonic Product Spectrum.

Thanks also to Peter Littlewood, Rachel Lau and Tiana Kordbacheh for creating chord samples on which
to test my algorithm.

3. Contents
1. Abstract .................................................................................................................................................................................... 2
2. Acknowledgements ................................................................................................................................................................ 3
3. Contents ................................................................................................................................................................................... 4
4. Table of Figures...................................................................................................................................................................... 6
5. Introduction ............................................................................................................................................................................ 7
5.1 Motivation for this project ......................................................................................................................................... 7
5.2 Major Objectives .......................................................................................................................................................... 7
5.3 Deeper Objectives ....................................................................................................................................................... 8
5.4 Report Layout ............................................................................................................................................................... 8
6. Background ............................................................................................................................................................................. 9
6.1 History of DJ Mixing .................................................................................................................................................. 9
6.2 Illustration of Beat Mixing ....................................................................................................................................... 10
6.3 Key Detection Algorithms ....................................................................................................................................... 14
6.3.1 Musical key extraction from audio..................................................................................................................... 14
6.3.2 Chord Segmentation and Recognition using EM-Trained Hidden Markov Models ................................ 15
6.3.3 Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile .................................... 16
6.3.4 A Robust Predominant-F0 Estimation Method for Real-Time Detection of Melody and Bass Lines in
CD Recording........................................................................................................................................................ 18
6.3.5 A computational model of harmonic chord recognition ............................................................................... 20
6.4 Beat Detection Algorithms ...................................................................................................................................... 20
6.4.1 Tempo and Beat Analysis of Acoustic Musical Signals.................................................................................. 20
6.4.2 Analysis of the Meter of Acoustic Musical Signals ......................................................................................... 22
6.4.3 Audio Analysis using the Discrete Wavelet Transform ................................................................................. 23
6.4.4 Statistical streaming beat detection .................................................................................................................... 24
6.5 Similar Projects / Software ...................................................................................................................................... 25
6.5.1 Traktor DJ Studio by Native Instruments........................................................................................................ 25
6.5.2 Rapid Evolution 2 ................................................................................................................................................. 26
6.5.3 Mixed in Key ......................................................................................................................................................... 27
6.5.4 MixMeister ............................................................................................................................................................. 28
7. Design .................................................................................................................................................................................... 29
7.1 System Architecture................................................................................................................................................... 29
7.2 Key Detection Algorithm Design Rationale ......................................................................................................... 29
7.3 Beat Detection Algorithm Design Rationale ........................................................................................................ 30
8. Implementation .................................................................................................................................................................... 32
8.1 System Implementation ............................................................................................................................................ 32
8.2 Detecting the Key ...................................................................................................................................................... 34
8.3 Detecting the Beats.................................................................................................................................................... 37
8.4 Calculating BPM Value ............................................................................................................................................. 38

8.5 Automatic Beat Matching ......................................................................................................................................... 39
8.6 Generating and animating the waveforms............................................................................................................. 39
9. Testing .................................................................................................................................................................................... 41
9.1 Parameter Testing – Key Detection Algorithm.................................................................................................... 41
9.1.1 Bass threshold frequency..................................................................................................................................... 41
9.1.2 Choice of FFT window length ........................................................................................................................... 41
9.1.3 Harmonic Product Spectrum .............................................................................................................................. 42
9.1.4 Weighting System.................................................................................................................................................. 44
9.1.5 Time in between overlapping frames ................................................................................................................ 46
9.1.6 Downsampling ...................................................................................................................................................... 47
9.2 Parameter Evaluation – Beat Detection Algorithm ............................................................................................. 48
9.2.1 Size of Instant Energy .......................................................................................................................................... 48
9.2.2 Size of Average Energy ........................................................................................................................................ 48
9.2.3 Beat Interval ........................................................................................................................................................... 49
9.2.4 Low Pass Filtering ................................................................................................................................................ 50
10. Evaluation ........................................................................................................................................................................ 51
10.1 Quantitative Evaluation ............................................................................................................................................ 51
10.1.1 Key Detection Accuracy Test with Dance Music ...................................................................................... 51
10.1.2 Key Detection Accuracy Test with Classical Music .................................................................................. 52
10.1.3 Beat Detection Accuracy Test ....................................................................................................................... 54
10.1.4 Performance Evaluation ................................................................................................................................. 56
10.2 Qualitative Evaluation ............................................................................................................................................... 56
10.2.1 Automatic Beat Matching............................................................................................................................... 56
10.2.2 Graphical User Interface ................................................................................................................................ 56
10.2.3 Pitch Shift and Time Stretching Functions ................................................................................................. 58
10.2.4 Overall Evaluation ........................................................................................................................................... 58
11. Conclusion........................................................................................................................................................................ 59
11.1 Appraisal ...................................................................................................................................................................... 59
11.2 Further Work .............................................................................................................................................................. 60
12. Bibliography ..................................................................................................................................................................... 61
13. Appendix .......................................................................................................................................................................... 63
Appendix A: Introduction to Digital Signal Processing ..................................................................................................... 63
Appendix B: Specification ....................................................................................................................................................... 65
Aims of the project .............................................................................................................................................................. 65
Core Specification ................................................................................................................................................................ 65
Extended Specification........................................................................................................................................................ 66
Appendix C: User Guide ......................................................................................................................................................... 67
Loading a track into a deck ................................................................................................................................................ 67
Detecting the Key of a track .............................................................................................................................................. 69
Mixing two tracks ................................................................................................................................................................. 70

4. Table of Figures
Figure 1: Crossfader in the left position ................................................................................................................. 10
Figure 2: Beats, Bars and Loops .............................................................................................................................. 10
Figure 3: Tracks in sync but not in phase ................................................................................................................ 11
Figure 4: Train wreck mix......................................................................................................................................... 11
Figure 5: Tracks in sync and in phase ..................................................................................................................... 11
Figure 6: Crossfader in central position ................................................................................................................. 12
Figure 7: Crossfader in right hand position ........................................................................................................... 12
Figure 8: Circle of Fifths and Camelot Easymix System ..................................................................................... 13
Figure 9: Flow diagram of the algorithm from Sheh et al(9) ................................................................................ 15
Figure 10: PCP vector of a C major triad .............................................................................................................. 16
Figure 11: Pitch Class Profile of A minor triad..................................................................................................... 17
Figure 12: Harmonic Product Spectrum ................................................................................................................ 17
Figure 13: Comparison of PCP and EPCP vectors from Lee(11)........................................................................ 18
Figure 14: Flow diagram of Goto’s algorithm (12).................................................................................................. 19
Figure 15: Overview of Scheirer's Algorithm(14) ................................................................................................... 21
Figure 16: Waveform showing Tatum, Tactus and Measure .............................................................................. 22
Figure 17: Overview of algorithm from Klapuri et al (15) .................................................................................... 22
Figure 18: Block diagram of algorithm from Tzanetakis et al (16) ....................................................................... 23
Figure 19: Traktor DJ Studio ................................................................................................................................... 25
Figure 20: Rapid Evolution 2 ................................................................................................................................... 26
Figure 21: Mixed in Key............................................................................................................................................ 27
Figure 22: MixMeister ............................................................................................................................................... 28
Figure 23: Overview of System Architecture ........................................................................................................ 29
Figure 24: System Overview..................................................................................................................................... 32
Figure 25: Key Detection Algorithm Flow Chart ................................................................................................. 34
Figure 26: Output from the STFT .......................................................................................................................... 35
Figure 27: Chroma Vector of C Major chord and its correlation with key templates .................................... 36
Figure 28: Overlapping of waveform images ........................................................................................................ 40
Figure 29: Illustration of the Harmonic Product Spectrum taken from (30) ..................................................... 43
Figure 30: Chroma Vector showing close correlation between many different key templates ..................... 45
Figure 31: F minor is detected correctly with the weighting system enabled................................................... 46
Figure 32: C minor is detected without the weighting system enabled ............................................................. 46
Figure 33: Too many beats detected with 50ms beat interval ............................................................................ 49
Figure 34: Beats being detected correctly with beat interval of 350ms ............................................................. 49
Figure 35: Sound energy variations detected as beats in silent areas of Quivver – Space Manoeuvres....... 55
Figure 36: The spacing between these detected beats is closer, leading to higher BPM calculation ............ 55
Figure 37: Sampling of a signal for 4-bit PCM...................................................................................................... 63
Figure 38: How FMOD stores audio data ............................................................................................................. 64
Figure 39: The Main Screen ..................................................................................................................................... 67
Figure 40: Loading Sasha - Magnetic North into Deck A................................................................... 68
Figure 41: The Deck Control ................................................................................................................... 68
Figure 42: Key Detection progress/results............................................................................................................ 69
Figure 43: Crossfader in left hand position ........................................................................................................... 70
Figure 44: Crossfader in central position ............................................................................................................... 71
Figure 45: Crossfader in right hand position ......................................................................................................... 71

5. Introduction
This section sets out the aims and motivation for the project and introduces some of the concepts which
will be discussed in greater detail later in the report.

5.1 Motivation for this project


Beat mixing (or beat-matching) is a process employed by DJs to transition between two songs: the tempo
of a new track is changed to match that of the currently playing track, the beats of one track are aligned
perfectly with the beats of the other, and the two are then mixed or cross-faded together so that there is
no pause between songs. This keeps the flow of the music constant for the pleasure of the listener, who
can appreciate both the quality of the mix between records and the variety in melody and rhythm that
tracks played back to back provide to dance to.

Today's DJ software has greatly simplified the task of beat mixing; however, very few notable products
have addressed the idea of harmonic mixing.

Two tracks can be beat-mixed together perfectly and still sound ‘off’. This is usually because the two
tracks are out of tune with each other: their harmonic elements are in incompatible keys, causing the
melodies to clash. Harmonic mixing sets out to address this problem.

Harmonic mixing is the natural evolution of beat mixing: mixing in compatible keys. The idea is that the
currently playing song should only be beat-mixed with another song of a compatible key, making the
transition between the two songs sound pleasurable to the listener. This gives the DJ more creative
freedom when performing a mix: rather than relying on large segments of regular beats to make a
transition between two songs, they can now overlay melody sections which are in harmony with each
other.

People with perfect pitch find it easy to identify the key of a song (after years and years of practice), but
there appears to be no automatic process, analogous to beat detection algorithms, that would save DJs
from manually finding the key of every song in a collection of 1000+ tracks. Even then, two songs with
compatible keys will still not necessarily match, since changing the tempo of a song to achieve the
same speed will also change its key. For example, a 6% increase or decrease in tempo, measured in
beats per minute (BPM), causes a shift of one semitone in key, say from C minor to Db minor. Time-
stretching algorithms are therefore essential: they lock the key of the track and allow the BPM of the track
to be altered independently of pitch/key. Conversely, pitch-shifting algorithms change the pitch/key of a
track without affecting the BPM.
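The tempo/pitch relationship above can be quantified: playing a recording at r times its original speed multiplies every frequency by r, and in equal temperament one semitone corresponds to a frequency ratio of 2^(1/12). The sketch below is purely illustrative and is not part of the project's implementation:

```python
import math

def semitone_shift(tempo_ratio):
    """Pitch shift in semitones caused by playing a track back at
    tempo_ratio times its original speed with no time-stretching.
    One semitone is a frequency ratio of 2**(1/12)."""
    return 12 * math.log2(tempo_ratio)

# A 6% tempo increase shifts the pitch by roughly one semitone,
# e.g. C minor up to Db minor.
print(round(semitone_shift(1.06), 2))  # 1.01
```

This is why a time-stretching algorithm, which decouples tempo from pitch, is needed: without it, matching a 120 BPM track to 130 BPM would detune it by `12 * log2(130/120)`, about 1.4 semitones.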

5.2 Major Objectives


The primary aim of this project is to design, implement and optimise a key detection algorithm which can
work on polyphonic real-world audio. No key detection algorithm thus far can claim to be 100% accurate
and as such there have been many different attempts at solving the problem with greater accuracy, each
with their own strengths and weaknesses.

As the finished program is aimed mainly at DJs, the key detection algorithm should be able to
accurately extract the key from various types of dance music. The main problem associated with this
genre of music is its heavy emphasis on the bass line and bass drum, which may make it
difficult for the key detection to give an accurate result. The key detection algorithm will also be tested on
classical music, for which it should be able to give a more accurate result.

7
The other two major problems, the detection of beats and the calculation of an appropriate BPM value,
can be considered solved. There has been much research into the various ways of detecting the
BPM of a piece of music, and the main challenge is to find a suitable algorithm which is able to
detect beats and calculate a BPM value in the shortest amount of time while maintaining an acceptable
level of accuracy.

As this project aims to help DJs perform harmonic mixing, I will also attempt to implement an automatic
real-time beat mixing algorithm, which will enable the DJ to beat-mix two tracks together based on their
tempos and the positions of their beats. The success of this feature will, of course, rely heavily on the
accuracy of the beat detection algorithm stated above.

5.3 Deeper Objectives


There are deeper objectives to this project than simply providing a DJ with an automatic key and tempo
detection tool.

This project aims to show that academic and state of the art music analysis techniques can be applied to
real world problems in an efficient and reliable manner. Part of this will be to show that disparate areas of
research can be combined together successfully.

Finally, the project aims to be more than just a research study of feasibility. The result of successful
completion will be an application of sufficient reliability and quality that it can be released to, and used by,
untrained computer users. This report only lightly touches on this facet of the project, as creating usable,
polished applications is a reasonably well-solved problem, and the least interesting area of this project.

5.4 Report Layout


• The remainder of the report begins with a brief history of DJ mixing and illustrates the concepts
of beat mixing and harmonic mixing in Chapter 6 (Background). We then discuss the main
literature on beat detection and BPM calculation, along with selected literature on key extraction
from music. Finally we compare the strengths and weaknesses of the state of the art to the aims
of this project.

• Chapter 7 (Design) gives a brief overview of the overall system design and the rationale behind
the design of the algorithms.

• A detailed description of the interesting or problematic areas of the project implementation is given
in Chapter 8 (Implementation). Trivial and/or uninteresting areas of the project are not
mentioned and can be considered to have been implemented successfully.

• The tests performed to determine the optimal values for the parameters of the algorithms are
stated in Chapter 9 (Testing).

• A quantitative and qualitative evaluation of the final product is made in Chapter 10
(Evaluation). Analysis of any anomalous results is given.

• The report concludes with Chapter 11 (Conclusion) which covers the strengths and weaknesses
of the project and details possible future work.

6. Background
Fully understanding this project requires a basic understanding of the process a DJ performs
behind the turntables. First, a brief history of DJ mixing will explain the evolution of DJ mixing and
the advancements in technology and music culture which brought us to where we are today. Then a more
in-depth look at the ‘science’ of beat and harmonic mixing will follow, explaining in detail the
concept of beat mixing and the extra constraints that harmonic mixing implies.

A discussion of the most applicable literature for detecting the beats and extracting the key from a song is
given, followed by an overview of software projects of a similar nature. An Introduction to Digital Signal
Processing, explaining some of the techniques used in the literature, is given in the Appendix.

6.1 History of DJ Mixing


The art of DJ mixing has come a long way since its early appearances. In general, its journey can be
said to have passed through four basic stages. Before there was any mixing or blending together of songs,
there was the Non-Mixing Jukebox DJ(1). Working with just one deck (or turntable), this DJ's primary skill
was to entertain an audience whilst playing requested music, usually at a wedding or some other
celebration.

The first dimension of mixing (Basic Fades)(1) occurred as DJs replaced bands as the primary form of club
entertainment. The DJ, now working with two decks and a mixer, would fade a new song over the end of
the currently playing song, usually with calamitous results. As neither the beats nor the keys were in sync,
the overlays would sound like train-wrecks. A train-wreck occurs when two tracks are playing at the
same time but their beats are not synchronised, i.e. when your tracks cross, your train will crash(2). To the
audience, this sounds like incoherent beats occurring at odd times and making no musical sense.

The second dimension of mixing (Cutting and Scratching)(1) coincided with the appearance of rap as a
distinct vocal form. High torque turntables now allowed DJs to ‘cut’ by inserting short musical sections
from a second source and ‘scratch’ by rapidly and rhythmically repeating single beats from a second
source usually by manipulating a vinyl record as it played with their hand.

Technological improvements brought about the third dimension of mixing (Beat Mixing)(1). By now
turntables had accurate speed stability thanks to the arrival of the direct-drive turntable motor, replacing
older belt-driven motors which over time would wear out, causing records to turn at warped, inconsistent
speed and affecting tempo. Technics(3) introduced the SL-1200 turntable in 1972 and by 1984 had
added features suited to the needs of DJs wanting to beat-mix(4). These included pitch control, which
allowed the DJ to adjust the speed of tracks to match one another, and a start button which immediately
brought the turntable to the desired speed, giving the DJ more confidence in starting a new track at
exactly the right point and at the correct speed. The SL-1200 also allowed the vinyl to be spun backwards
for the first time, so that a DJ could carefully and precisely cue the starting position of the track to fall
exactly on the onset of a desired beat.

Separate from the advancements in the equipment on which DJs played their sets (live performances)
were the advancements in dance music production technology. Most dance music used electronic drums,
which locked in a consistent tempo indefinitely. The speed stability of both the music and the turntables
allowed DJs to overlay long segments of different records, as long as they could be synchronised, and
beat mixing was born.

However, the limitation of beat mixing was that if the desired segments of both songs contained
melodies, the result was usually unpleasant because of clashing keys. The DJ therefore either used trial
and error to find songs with compatible keys or relied on each song having a beat-only intro and outro
section. This is why most dance music has two to three minutes of continuous beats at the beginning
and end of the song: to give the DJ enough time to beat-match tracks together.

Harmonic mixing brings the fourth dimension, harmony, to DJ mixing technique: different melodies may
only be played simultaneously if they are in compatible keys. The gradual shift away from analogue
vinyl records to digital audio formats such as CDs and MP3s, combined with the development of
time-stretching algorithms, made harmonic mixing possible. A time-stretching algorithm locks in a certain
key whilst allowing the tempo to be altered independently. More and more DJs nowadays let computers
do the beat mixing for them, focusing their attention on being more artistic and creative with their
mixes.

6.2 Illustration of Beat Mixing

1. The DJ first starts their set with a song, we shall call it Song A. Song A has a tempo of 130BPM.
Whilst Song A is playing, the DJ decides which song to play next, we will call this Song B. Song B
has a lower BPM than Song A of 120BPM. The crossfader (part of the mixer) allows multiple
audio outputs to be blended together into one output. At the start the crossfader will be in the
left position as shown in Figure 1 below so that only the output from Deck A will be heard by
the audience.
[Diagram: Deck A is playing Song A, which is the only output audible to the audience; Song B is loaded
on Deck B and can only be heard by the DJ through headphones.]
Figure 1: Crossfader in the left position

2. As the DJ listens to Song B through their headphones, they detect that its tempo is slower than
Song A. The DJ increases the tempo of Song B to 130BPM to match that of Song A. Nearly all
modern dance music is written in the 4/4 common time signature i.e. 4 beats to every bar. A
typical dance track contains a series of loops of n bars, where n is a power of 2, usually 4, 8, 16 or
32. Assume that both Song A and Song B contain a series of 4-bar loops with 4 beats to every
bar as illustrated in Figure 2:

[Diagram: sixteen consecutive beats grouped into four bars of four beats each, forming one loop; the
downbeat is beat 1 of each loop.]
Figure 2: Beats, Bars and Loops

The downbeat is the first beat of a loop and is usually signified by an extra sound or accent, such as
a cymbal crash. The DJ finds a downbeat towards the beginning of Song B (the first beat of Song
B is normally used) and pauses Song B just before the onset of that downbeat. Song B is now
cued and ready.

3. The DJ now waits for a downbeat to occur in Song A after the main melody has played out. Song
B is started at exactly the same time as the onset of the downbeat in Song A. Song B is still only
audible through the DJ's headphones. The DJ ensures that the two beats are in time and in phase:
to be in time, the beats of the two tracks must occur at the same time; to be in phase, the
downbeats of each track must occur at the same time. Below is an example where the two tracks
are in time but not in phase:

[Diagram: the beats of Song A and Song B coincide, but Song B's downbeat falls in the middle of
Song A's loop.]
Figure 3: Tracks in sync but not in phase

4. If the two tracks have different BPMs they will eventually go out of time and out of phase, as the
interval between the beats of each track drifts further and further apart. The following diagram
shows this scenario, with Song B being the slower of the two tracks. This is a train-wreck mix:

[Diagram: Song A and Song B start together, but Song B's beats gradually fall behind Song A's.]
Figure 4: Train wreck mix

When the two tracks are in phase and have the same BPM they should be aligned like this:

[Diagram: the beats and downbeats of Song A and Song B coincide exactly.]
Figure 5: Tracks in sync and in phase

5. Once the tracks are in time and in phase, the DJ fades in the output from Deck B by sliding the
crossfader to the middle. Both tracks are now audible and are being mixed.

[Diagram: both decks playing; the crossfader is centred so the output from both decks is equally
audible.]
Figure 6: Crossfader in central position

6. Finally after an arbitrary amount of time (or when Song A ends) the crossfader is moved all the
way to the right so that only Song B is audible. Song A is taken off the Deck and the DJ chooses
another song, i.e. Song C to replace Song A on Deck A. Song B will then be mixed into Song C.
Thus we are now back at stage 1, and the cycle continues.

[Diagram: the crossfader is fully right so only Deck B (Song B) is audible; Song C is loaded onto
Deck A.]
Figure 7: Crossfader in right hand position

Harmonic mixing adds constraints to the selection of the next track in stage one of the above cycle. The
next track must be in a compatible key to the currently playing track. The circle of fifths illustrates
relationships between compatible keys and is used by composers to build correct-sounding chord
progressions(5). Any song is compatible with another song of the same key, its perfect fourth, fifth or
relative major/minor.

Using the circle of fifths, a song in C Major is compatible with another song of C Major, a song in F
Major, a song in G Major or a song in A Minor. To make this easier to use, Camelot Sound came up with
the ‘Easymix’ system where each key is assigned a keycode, 1-12A for Minor keys and 1-12B for Major
keys(6). Using the easymix chart, a song with keycode 1A (A-Flat Minor) can be mixed together with
another song of keycode 1A, 2A (E-Flat Minor), 12A (D-Flat Minor) or 1B (B Major).

However, altering the tempo of a track by roughly +/- 6% will shift its key by a semitone (moving its
keycode 7 steps around the wheel). A song in E-Flat Minor (keycode 2A) becomes an E Minor song
(keycode 9A) after a 6% increase.
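As an illustration, the two rules above (compatible neighbours on the Camelot wheel, and the 7-step jump caused by a semitone shift) can be captured in a couple of small Python helpers. These are my own illustrative functions, not part of the Camelot system itself:

```python
def compatible_keycodes(keycode):
    """Keycodes that mix harmonically with the given one: the same code,
    one step either way around the wheel (wrapping 1..12), or the same
    number with the letter swapped (the relative major/minor)."""
    num, letter = int(keycode[:-1]), keycode[-1]
    up = num % 12 + 1                 # e.g. 12A -> 1A
    down = (num - 2) % 12 + 1         # e.g. 1A -> 12A
    other = 'B' if letter == 'A' else 'A'
    return {keycode, f"{up}{letter}", f"{down}{letter}", f"{num}{other}"}


def shift_semitone(keycode, semitones=1):
    """A semitone pitch shift moves a keycode 7 steps around the wheel."""
    num, letter = int(keycode[:-1]), keycode[-1]
    return f"{(num - 1 + 7 * semitones) % 12 + 1}{letter}"
```

For example, `compatible_keycodes('1A')` returns {'1A', '2A', '12A', '1B'}, matching the A-Flat Minor example above, and `shift_semitone('2A')` returns '9A', matching the E-Flat Minor example.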

Figure 8: Circle of Fifths and Camelot Easymix System

Assume Song A in step 1 has a key of C Major (8B) at 130BPM and Song B has a key of F Major (7B)
at 120BPM. The songs are compatible if played at their original tempos, but Song B has to be sped up
to 130BPM to match Song A. This is an 8.33% increase in tempo, so Song B's key shifts up roughly a
semitone to F-Sharp Major (2B), which is incompatible with Song A. Time-stretching algorithms solve
this problem: the tempo of Song B can be increased to match Song A while Song B's key remains at
F Major, which is harmonically compatible with Song A.

6.3 Key Detection Algorithms

The extraction of key from audio is not a new problem, but it is not often reported in the literature.
Many algorithms found in the literature only work on symbolic data (e.g. MIDI or notated music), where
the notes of the incoming signal are already known. For this project, the algorithm will need to work
directly on incoming audio data with no prior knowledge of the notes which make up the song. A variety
of methods are used, ranging from heavy use of spectral analysis, to statistical modelling, to modelling
inner hair cell dynamics. The algorithms presented below give a flavour of the on-going research into
this challenging problem.

6.3.1 Musical key extraction from audio

A key extraction algorithm that works directly on raw audio data is presented by Pauws(7). Its
implementation is based on models of human auditory perception and music cognition. It is relatively
straightforward and has minimal computing requirements.

For each 100 millisecond section of the signal, the audio is first down-sampled to around 10kHz, which
significantly reduces the computing cost and also cuts off any frequencies above 5kHz. It is assumed
that these high frequencies will not contribute to the pitches in the lower frequency ranges. The
remaining samples in a frame are multiplied by a Hamming window, zero-padded, and the amplitude
spectrum is calculated from a 1024-point FFT.

A 12-dimensional chroma vector (chromagram) is then calculated from the frequency spectrum, mapping
the frequencies in the spectrum onto the 12 musical notes, e.g. for pitch class C, this comes down
to the six spectral regions centred around the pitch frequencies for C1 (32.7 Hz), C2 (65.4 Hz), C3
(130.8Hz), C4 (261.6 Hz), C5 (523.3 Hz) and C6 (1046.5 Hz). The chroma vector is normalised to show
the relative ratios of each musical note in the frequency spectrum.

Eventually there is a chroma vector for each 100 millisecond section of the song. These are
correlated with Krumhansl's key profiles(8), and the key profile with the maximum correlation over all
the computed chroma vectors is taken as the most likely key.

An evaluation with 237 CD recordings of classical piano sonatas indicated a classification accuracy of
75.1%. When the exact, relative, dominant, sub-dominant and parallel keys are all counted as correct,
the accuracy rises to 94.1%.

The algorithm is quite basic and, whilst it has the benefit of being fast, it suffers from its use of the
FFT: although useful for detecting the frequency spectrum of a stationary signal, such as a constantly
held chord, the FFT is not well suited to extracting the frequencies of a non-stationary signal, such as
any normal song, whose dominant frequencies change rapidly. The method of scoring the most likely key
could also be improved by weighting the winning key according to how far ahead it was of the next most
likely key. This way, if two or more keys correlate almost equally highly for a single chromagram, the
winner is penalised with a low weighting, since it only just beat some other key; if one key dominates
the correlation, it is rewarded with a larger weighting and is therefore more likely to become the
overall maximum key.

6.3.2 Chord Segmentation and Recognition using EM-Trained Hidden
Markov Models

Sheh et al.(9) describe a method of recognising the major chords in a piece of music using pitch class
profiles and Hidden Markov Models (HMMs) trained using the Expectation Maximisation (EM)
algorithm.

The pitch class profile (PCP) was first proposed by Fujishima(10) and is the same idea as the previous
algorithm's 'chroma vector': the Fourier transform intensities are mapped onto the twelve semitone
pitch classes corresponding to musical notes.

First the input signal is transformed to the frequency domain using the short-time Fourier transform
(STFT). The STFT has the advantage over a single FFT of being able to track frequency changes over
time, rather than taking one snapshot of the frequencies across a whole time span. The STFT is
therefore better suited to frequency analysis of non-stationary signals.

The STFT is mapped to the Pitch Class Profile (PCP) features, which traditionally consist of 12-
dimensional vectors, with each dimension corresponding to the intensity of a semitone class (chroma).
The procedure collapses pure tones of the same pitch class, independent of octave, to the same PCP bin.
The PCP vectors are normalised to show the intensities of each pitch class relative to one another.
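The STFT-to-PCP mapping can be sketched as follows (the window and hop sizes are my own assumptions, and the STFT is written out with plain NumPy for clarity):

```python
import numpy as np

def stft_pcp(signal, fs, win=2048, hop=1024):
    """Short-time Fourier transform mapped to one normalised 12-bin PCP
    (chroma) vector per analysis frame."""
    window = np.hanning(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    valid = freqs > 27.5                       # ignore bins below A0
    # pre-compute each FFT bin's pitch class (C=0 .. B=11)
    classes = np.zeros(len(freqs), dtype=int)
    classes[valid] = np.round(
        12 * np.log2(freqs[valid] / 440.0) + 69).astype(int) % 12
    pcps = []
    for start in range(0, len(signal) - win + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + win] * window))
        # collapse all octaves of the same pitch class into one PCP bin
        pcp = np.bincount(classes[valid], weights=spectrum[valid], minlength=12)
        total = pcp.sum()
        pcps.append(pcp / total if total > 0 else pcp)
    return np.array(pcps)                      # shape: (n_frames, 12)
```

Each row of the result is one normalised PCP vector; a pure 440 Hz tone, for example, produces frames that all peak at pitch class A.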

Pre-determined PCP vectors are used as features to train an HMM with one state for each chord
distinguished by the system. The EM algorithm estimates the mean and variance vectors and the
transition probabilities for each chord HMM. The Viterbi algorithm is then used either to forcibly align
or to recognise these chord labels. The chord whose PCP vector aligns best with the PCP vectors
computed from the song is chosen as the most likely key.

This algorithm performs well, but coding a hidden Markov model and the algorithms required to train
it would be too time consuming for this project, and comparable results can be achieved using much
simpler template matching techniques. The algorithm is also computationally expensive, and as such
only a short segment of a song is used for key detection. One valuable idea in this paper is the use of
the STFT to analyse the frequencies and map them to the PCP / chroma vector. This is much more
accurate than a single FFT, and this part of the algorithm can be reused as part of a different key
detection algorithm.

Figure 9: Flow diagram of the algorithm from Sheh et al(9)

6.3.3 Automatic Chord Recognition from Audio Using Enhanced Pitch
Class Profile

This algorithm (11) sets out to improve on other key detection algorithms which use a chromagram/PCP
as the feature vector to identify chords. Some use a template matching algorithm to correlate the PCP
with pre-determined PCP vectors for the 24 chords; others use a probabilistic model such as HMMs.

The problem with the PCP in the template matching algorithm is that the templates against which the
PCP is matched are binary, i.e. since a C major triad comprises three notes at C (root), E (third) and G
(fifth), the template for a C major triad is [1,0,0,0,1,0,0,1,0,0,0,0], where the chord labelling is
[C,C#,D,D#,E,F,F#,G,G#,A,A#,B]. However, the PCP from real world recordings will never be exactly
binary, because acoustic instruments produce overtones as well as fundamental tones. The PCP / chroma
vector of a C major triad played on a piano is shown in Figure 10.

Figure 10: PCP vector of a C major triad

In Figure 10, even though the strongest peaks are found at C, E, and G, we can see that the chroma
vector has nonzero intensities at all 12 pitch classes due to the overtones generated by the chord tones.
This noisy chroma vector may confuse recognition systems with binary templates, especially if two
chords share one or more notes: a C major triad and a C minor triad (its parallel minor) share two
notes, C and G, and a C major triad and an A minor triad (its relative minor) have the notes C and E
in common.
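The binary template matching described above is simple to sketch; generating the 24 templates by rotating a major and a minor prototype is my own shorthand, not notation from the paper:

```python
import numpy as np

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
MAJOR = np.array([1., 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0])  # root, major third, fifth
MINOR = np.array([1., 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0])  # root, minor third, fifth

def match_triad(pcp):
    """Correlate a PCP vector with all 24 binary triad templates and return
    the best-matching chord label."""
    best, best_r = None, -2.0
    for root in range(12):
        for template, quality in ((MAJOR, 'major'), (MINOR, 'minor')):
            r = np.corrcoef(pcp, np.roll(template, root))[0, 1]
            if r > best_r:
                best, best_r = f"{NOTES[root]} {quality}", r
    return best
```

A clean PCP matches its own template exactly, but a noisy real-world PCP whose non-chord tones are strong can be pulled towards a chord that merely shares notes with it, which is precisely the failure mode discussed here.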

Figure 11 shows an A minor triad and its correlation with the 24 keys. The A minor triad correlates
highest with a C major chord and is identified incorrectly as C major. This is due to the fact that the
intensity of the G in the A minor triad, which is not a chord tone, is greater than that of the A, which is a
chord tone.

Figure 11: Pitch Class Profile of A minor triad

To overcome this problem, Lee has suggested taking the harmonic product spectrum (HPS) of the
frequency spectrum, before computing the Enhanced Pitch Class Profile (EPCP) from the HPS. The
algorithm for computing the HPS is very simple and is based on the harmonicity of the signal. Since
most acoustic instruments and the human voice produce sounds with harmonics at integer multiples of
the fundamental frequency, a copy of the magnitude spectrum decimated by an integer factor will also
exhibit a peak at the fundamental frequency, and multiplying such decimated copies together reinforces
the fundamental. This should in theory eliminate the overtones and amplify the pure tones, leading to a
more binary-like EPCP. The following figure demonstrates how the HPS amplifies the main peak of the
FFT whilst reducing the number of overtone frequencies.

Figure 12: Harmonic Product Spectrum
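A minimal sketch of the HPS computation follows (the number of harmonics is a tunable assumption; Lee's paper may use different settings):

```python
import numpy as np

def harmonic_product_spectrum(spectrum, n_harmonics=3):
    """Multiply the magnitude spectrum by copies of itself decimated by
    2..n_harmonics, so that only a peak supported by its harmonics survives."""
    length = len(spectrum) // n_harmonics
    hps = np.array(spectrum[:length], dtype=float)
    for r in range(2, n_harmonics + 1):
        # spectrum[::r] folds the r-th harmonic back onto the fundamental bin
        hps *= spectrum[::r][:length]
    return hps
```

For a tone with a fundamental at bin k and harmonics at 2k and 3k, the product leaves a dominant peak at bin k, while a lone spurious peak with no harmonic support is multiplied by (near-)zero values and vanishes.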

In Figure 13 below, the EPCP vector from the above example (A minor) and its correlation with the 24
major/minor triad templates are shown, with the conventional PCP vector and its correlation overlaid in
dotted lines for comparison. We can clearly see from the figure that the non-chord tones are suppressed
enough to emphasise only the chord tones, which are A, C and E in this example. This removes the
ambiguity with its relative major triad, and the resulting correlation identifies the chord correctly as A
minor.

Figure 13: Comparison of PCP and EPCP vectors from Lee(11)

This technique seems useful and could be used to improve any key detection algorithm which uses
PCP/chroma vectors. I am a little concerned about its use with dance music, however, which contains
intense low frequencies, such as the bass line and bass drum, that may not be in key. These frequencies
could be amplified instead of the melodic frequencies I intend to amplify, skewing the results in favour
of the key of the bass line rather than the key of the melody.

6.3.4 A Robust Predominant-F0 Estimation Method for Real-Time
Detection of Melody and Bass Lines in CD Recordings

Goto(12) describes a method called PreFEst (Predominant-F0 Estimation Method) which can detect the
melody and bass lines in complex real-world audio signals. F0 is shorthand for the fundamental
frequency.

The PreFEst obtains traces of the melody and bass lines under the following assumptions:

• The melody and bass sounds have a harmonic structure; the F0's own frequency component need not
itself be present.

• The melody line has the most predominant harmonic structure in middle and high frequency regions
and the bass line has the most predominant harmonic structure in a low frequency region.

• The melody and bass lines tend to have temporally continuous traces.

The diagram below shows an overview of the PreFEst. It first calculates instantaneous frequencies by
using multi-rate signal processing techniques and extracts candidate frequency components on the basis
of an instantaneous-frequency-related measure.

The PreFEst basically estimates the F0 which is supported by predominant harmonic frequency
components within an intentionally limited frequency range; by using two band-pass filters it limits the
frequency range to middle and high regions for the melody line and low region for the bass line.

It then forms a probability density function (PDF) of the F0, which represents the relative dominance of
every possible harmonic structure. To form this F0’s PDF, it regards each set of the filtered frequency
components as a weighted mixture of all possible harmonic-structure tone models and estimates their
weights that can be interpreted as the F0’s PDF: the maximum-weight model corresponds to the most
predominant harmonic structure. This estimation is carried out by using the Expectation Maximisation
algorithm, which is an iterative technique for computing maximum likelihood estimates from incomplete
data.

Finally, multiple agents track the temporal trajectories of salient promising peaks in the F0’s PDF and the
output F0 is determined on the basis of the most dominant and stable trajectory.

Figure 14: Flow diagram of Goto’s algorithm(12)

6.3.5 A computational model of harmonic chord recognition

Walsh et al.(13) investigate the perception of harmonic chords by peripheral auditory processes and
auditory grouping. The frequency selectivity of the auditory system is modelled using a bank of
overlapping band-pass filters and a model of inner hair cell dynamics. By computing intervals between
different classes of pitch, the model achieves considerable success in recognizing major, minor, dominant
seventh, diminished and augmented chords.

Part of the algorithm relies on an existing computational model of mechanical to neural transduction
based on the hair cell-auditory-nerve fibre synapse. The output excitation function in response to an
acoustic stimulus is a stream of spike events precisely located in time. The model describes the
production, movement and dissipation of a transmitter substance in the region of the hair cell-auditory-
nerve fibre synapse.

It is probably not feasible to implement the algorithm described in this paper; the aim here is just to
demonstrate the wide variety of methods and theories that researchers are trying to apply to the problem
of extracting the key from polyphonic audio signals.

6.4 Beat Detection Algorithms

There are many different approaches to detecting the beats in a song. An overview of each algorithm is
given below, along with a brief discussion of its accuracy and applicability. The reader is encouraged to
read each of the papers in full for more detail; the aim here is to give a brief introduction to the many
ways in which beat detection can be performed. All diagrams in this section come from their
respective papers.

6.4.1 Tempo and Beat Analysis of Acoustic Musical Signals

Scheirer’s paper(14) is one of the most frequently referenced papers on beat detection. The paper details
the implementation of a fast, close to real time, beat detection system for music of many genres.

The algorithm works by first dividing the music into six different frequency bands using a filterbank. This
filterbank can be constructed by combining a low-pass and high-pass filter with many band-pass filters in
between.

The envelope of each frequency band is then calculated. The envelope is a highly smoothed
representation of the positive values in a waveform.

The differentials of each of the six envelopes are then calculated; they are highest where the slopes of
the envelope are steepest. The peaks of the differentials would give a good estimate of the beats in the
music, but the algorithm in the paper uses a different method.

Each differential is passed to a bank of comb filter resonators. In each bank of resonators, one of the
comb filters will phase lock with the signal, where the resonant frequency of the filter matches the
periodic modulation of the differential.

The outputs of all of the comb filters are examined to see which ones have phase locked, and this
information is tabulated for each frequency band.

Summing this data across the frequency bands gives a tempo (BPM) estimate for the music. Referring
back to the peak points in the comb filters allows the exact occurrence of each beat to be determined.
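The resonator idea can be illustrated with a single feedback comb filter per candidate tempo: the filter whose delay matches the beat period phase locks and accumulates the most energy. This is only a sketch of the principle (one band, integer sample delays, an assumed feedback gain), not Scheirer's full system:

```python
import numpy as np

def comb_tempo(onsets, fs, bpm_range=range(60, 181), alpha=0.5):
    """Return the BPM whose comb filter resonates most strongly with the
    onset (envelope differential) signal, sampled at fs Hz."""
    best_bpm, best_energy = None, -1.0
    for bpm in bpm_range:
        delay = int(round(fs * 60.0 / bpm))   # samples per beat at this tempo
        y = np.zeros(len(onsets))
        for n in range(len(onsets)):
            feedback = y[n - delay] if n >= delay else 0.0
            y[n] = alpha * feedback + (1 - alpha) * onsets[n]
        energy = float(np.sum(y ** 2))
        if energy > best_energy:
            best_bpm, best_energy = bpm, energy
    return best_bpm
```

With an impulse train of onsets spaced 0.5 seconds apart, the 120 BPM filter's delay lines up with every beat, so its output energy dominates the other candidates.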

The beat detection strategy used in this paper has demonstrated high accuracy and has been implemented
many times by different parties. It can cope with a wide variety of music genres and fits the requirements
of this project. The speed of the algorithm may also be beneficial to this project.

The algorithm is very complex and may be time consuming to implement. By working on the music as a
stream, it fails to take advantage of the ability to analyse the whole song at once. This means that
while its accuracy may be good enough to tap along with the music in real time, it may not be able to
determine the BPM to a sufficient accuracy for this project.

Figure 15: Overview of Scheirer's Algorithm(14)

6.4.2 Analysis of the Meter of Acoustic Musical Signals

Klapuri et al(15) describe a method which analyses the meter of acoustic musical signals at the tactus,
tatum, and measure pulse levels illustrated in Figure 16. The target signals are not limited to any particular
music type but all the main Western genres, including classical music, are represented in the validation
database.

Figure 16: Waveform showing Tatum, Tactus and Measure

An overview of the method is shown below. For the time-frequency analysis part, a technique is proposed
which aims at measuring the degree of accentuation in a music signal.

Figure 17: Overview of algorithm from Klapuri et al (15)

Feature extraction for estimating the pulse periods and phases is performed using comb filter resonators
very similar to those used by Scheirer in the above paper. This is followed by a probabilistic model where
the period-lengths of the tactus, tatum, and measure pulses are jointly estimated and temporal continuity
of the estimates is modelled. At each time instant, the periods of the pulses are estimated first and act as
inputs to the phase model. The probabilistic models encode prior musical knowledge and lead to a more
reliable and temporally stable meter tracking.

An important aspect of this algorithm lies in the feature list creation block: the differentials of the
loudness in 36 frequency sub-bands are combined into 4 ‘accent bands’, measuring the ‘degree of musical
accentuation as a function of time’.

The goal in this procedure is to account for subtle energy changes that might occur in narrow frequency
sub-bands (e.g. harmonic or melodic changes) as well as wide-band energy changes (e.g. drum
occurrences).

The algorithm presented in this paper produces good results across a wide variety of musical genres.
However, due to the complexity of the many parts which make it up, it goes somewhat beyond the
simple beat detection which this project aims to achieve. Since the system is designed only to work
with music containing a prominent, distinguishable beat, implementing this algorithm would amount to
over-engineering the project and would use up valuable time ensuring that everything worked properly.

6.4.3 Audio Analysis using the Discrete Wavelet Transform

Tzanetakis et al (16) describe an algorithm based on the DWT that is capable of automatically extracting
beat information from real world musical signals with arbitrary timbral and polyphonic complexity.

The beat detection algorithm is based on detecting the most salient periodicities of the signal. The signal
is first decomposed into a number of octave frequency bands using the DWT. After that the time domain
amplitude envelope of each band is extracted separately. This is achieved by low pass filtering each band,
applying full wave rectification and down-sampling. The envelopes of each band are then summed
together and an autocorrelation function is computed. The peaks of the autocorrelation function
correspond to the various periodicities of the signal’s envelope.

The first five peaks of the autocorrelation function are detected and their corresponding periodicities in
BPM are calculated and added in a histogram. This process is repeated by iterating over the signal. The
periodicity corresponding to the most prominent peak of the final histogram is the estimated tempo in
BPM of the audio file. A block diagram of the beat detection algorithm is shown below.

Figure 18: Block diagram of algorithm from Tzanetakis et al (16)


Key: WT: Wavelet Transform, LPF: Low Pass Filter, FWR: Full wave rectification, ↓: Downsampling, Norm:
Normalisation, ACR: Autocorrelation, PKP: Peak Picking, Hist: Histogram
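A single-band, single-pass sketch of this pipeline in Python (full-wave rectification, low-pass smoothing by moving average, autocorrelation, peak picking); the real algorithm runs per DWT band and accumulates a histogram of peaks over the whole signal:

```python
import numpy as np

def autocorr_tempo(signal, fs, min_bpm=60, max_bpm=180):
    """Estimate the tempo from the periodicity of the signal's envelope."""
    width = max(fs // 20, 1)
    envelope = np.convolve(np.abs(signal),          # full-wave rectification
                           np.ones(width) / width,  # low-pass (moving average)
                           mode='same')
    envelope -= envelope.mean()
    acf = np.correlate(envelope, envelope, mode='full')[len(envelope) - 1:]
    lo = int(fs * 60.0 / max_bpm)     # shortest beat period considered
    hi = int(fs * 60.0 / min_bpm)     # longest beat period considered
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return 60.0 * fs / lag
```

The autocorrelation peaks at lags equal to the beat period and its multiples; restricting the search to a plausible BPM range and picking the strongest lag gives the tempo estimate.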

To evaluate the algorithm's performance it was compared to the BPM detected manually by tapping the
mouse in time with the music; the average time difference between the taps was used as the manual beat
estimate. Twenty files containing a variety of music styles were used to evaluate the algorithm (5 Hip-
Hop, 3 Rock, 6 Jazz, 1 Blues, 3 Classical, 2 Ethnic). For most of the files (13/20) the prominent beat
was detected clearly (i.e. the beat corresponded to the highest peak of the histogram). For 5/20 files
the beat was detected as a histogram peak but not the highest, and for 2/20 no peak corresponding to
the beat was found. In the pieces where the beat was not detected there was no dominant periodicity
(these pieces were either classical music or jazz). In such cases humans rely on higher level information
such as grouping, melody and harmonic progression to perceive the primary beat from the interplay of
multiple periodicities.

This algorithm differs from the others in that it uses the DWT, rather than the FFT, to decompose the
incoming signal into separate frequency bands. Whether this improves the beat detection is debatable,
and the DWT is still a relatively new technique. The test results show that the algorithm performs well
on music containing a constant beat, which is fine for this project; however, the algorithm may also be
too time consuming to implement.

6.4.4 Statistical streaming beat detection

The human listening system determines the rhythm of music by detecting a pseudo-periodical
succession of beats. The signal intercepted by the ear carries a certain energy, which is converted into
an electrical signal that the brain interprets. Obviously, the more energy the sound transports, the
louder it will seem, but a sound will only be heard as a beat if its energy is much greater than the
sound's recent energy history. Therefore if the ear intercepts a quiet sound with occasional large
energy peaks it will detect beats, whereas a continuous loud sound produces no perceived beats at all.
This algorithm assumes that beats are large variations in sound energy.

Patin(17) presents a model whereby beats are detected by computing the average sound energy of the
signal and comparing it to the instant sound energy. The instant energy is the energy contained in 1024
samples; 1024 samples represent about five hundredths of a second, which is close enough to 'instant'.
The average energy should not be computed over the entire song, as some songs have both intense
passages and calmer parts. The instant energy must be compared to a nearby average energy: for
example, if a song has an intense ending, the energy contained in that ending shouldn't influence beat
detection at the beginning.

A beat is detected only when the instant energy is greater than a local average energy, computed over,
say, 44032 samples, which is about 1 second. That is to say, we assume that the hearing system only
remembers 1 second of the song when detecting a beat. This 1 second window (44032 samples) is what
we could call the human ear's energy persistence; it is a compromise between being too long, taking
into account energies too far away, and being too short, becoming too close to the instant energy to
make a worthwhile comparison.
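A sketch of this scheme with the window sizes quoted above; the fixed threshold constant c is my own simplification of the comparison:

```python
import numpy as np

def detect_beats(samples, instant=1024, history=44032, c=1.3):
    """Flag each 1024-sample block whose energy exceeds c times the average
    energy of the preceding ~1 second (43 blocks) of the signal."""
    n_history = history // instant
    energies = [float(np.sum(samples[i:i + instant] ** 2))
                for i in range(0, len(samples) - instant + 1, instant)]
    beats = []
    for i in range(n_history, len(energies)):
        local_avg = sum(energies[i - n_history:i]) / n_history
        if local_avg > 0 and energies[i] > c * local_avg:
            beats.append(i * instant)       # beat onset position in samples
    return beats
```

A quiet, steady background with a single loud burst yields exactly one detected beat at the burst, while a uniformly loud signal yields none, matching the energy-persistence argument above.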

6.5 Similar Projects / Software
6.5.1 Traktor DJ Studio by Native Instruments

Traktor DJ Studio(18) is state-of-the-art proprietary software enabling DJs to mix together up to four
different tracks at the same time. Traktor's beat detection system enables two tracks to be automatically
beat-synchronised and detects the beats well in most tracks with a prominent, regular beat. However, it
does not produce good results with music of other genres, such as classical and rock.

Traktor offers a visualisation of the playing track and highlights the detected beats with visual beat
markers. It has support for time-stretching of tracks and also basic tempo adjustment. Extra features of
Traktor include a whole host of real time effects, such as reverb, delay, flange which can be applied, plus a
selection of low-, mid- and high-pass filters. A file browser displays information about files for easy
dragging and dropping of them onto the decks, and the program allows you to record and save your own
mix as it happens, capturing any effects applied.

Traktor is commercial software aimed at the professional DJ; however, it is missing a couple of features
which this project aims to include. Traktor does not have a key detection algorithm capable of
extracting the key from a digital audio file. Pitch-shifting, which enables the pitch of a track to be
adjusted without altering its tempo, is an aim of this project but is also not present in Traktor.

Figure 19: Traktor DJ Studio

6.5.2 Rapid Evolution 2

Rapid Evolution 2(19) is free software which allows the user to import their music files and have them
analysed in order to detect the BPM and the key of each track. Based on the BPM and key extracted
from an audio file, the system indicates which other songs would go well with the analysed song to
produce a good harmonic mix. A unique element of Rapid Evolution is its virtual piano, which can play
the chord of the key detected in a song. This can be used to judge qualitatively how accurate the key
detection was, and would be a valuable feature in any program aimed at harmonic mixing. The program
allows simultaneous playback of two files and has time-stretching functionality.

Although this product strives to generate and display a lot of useful information for the harmonic
mixing DJ, the graphical user interface is not the most intuitive. For example, it is not obvious what the
difference is between some of the buttons, such as 'import' and 'add song(s)', and many of the same
controls and pieces of information are displayed in more than one area, making inefficient use of screen
real-estate and confusing the user. The program does not have an automatic beat matching algorithm,
although this is planned for a future release.

Figure 20: Rapid Evolution 2

6.5.3 Mixed in Key

Mixed in Key (20) is a small commercial application whose sole purpose is to analyse files, extract the
key and BPM from them, and store the information in the files' metadata. Mixed in Key uses Camelot's
easymix system to display the key as well as the formal musical notation. The software licenses a key-
detection algorithm named tONaRT from zplane development(21) to detect the key from the audio file.
The application is geared towards batch processing of several files at once. The software does not provide
any way of playing a song, and as such it does not support features such as pitch-shifting and time-
stretching.

Figure 21: Mixed in Key

6.5.4 MixMeister

MixMeister (22) is commercial DJ mixing software which allows users to 'design' a mix rather than
create one in real time. With its unique timeline function it allows users to visualise the overlapping
of two (or more) songs which they want to mix, enabling them to refine the mix so that, for example, the
beats are perfectly aligned. It is much easier to create a perfect mix this way, as you have full control over
the tempos of the tracks and over when each should start and finish. The downside is that you
could not use MixMeister in a live situation, as it takes trial and error to align the songs
perfectly. MixMeister is therefore aimed at people who want to create mixes for later use, such as
their own mix CD.

MixMeister has seemingly accurate BPM and key detection, making use of the Camelot notation to
display the detected keys as Camelot keycodes. On the whole it is a solid application which introduces a
unique technique for DJ mixing, one that would not be possible without computer-based music analysis.

Figure 22: MixMeister

7. Design
This section of the report gives a brief description of the design and architecture of the project,
followed by the rationale behind the design of each algorithm.

7.1 System Architecture


The system was designed with the user in mind. As such the system was based around the need for a
responsive, intuitive user interface. This meant keeping the graphical user interface (GUI) separate from
the sound processing and from the main algorithms. The result is that the system has a modular
architecture which can be broken down into three main areas: GUI, Core and Algorithms.

The GUI comprises those classes which the user interacts with, and which the system uses to feed back
information to the user about the state of the system.

The core contains the functions which process the audio files when called upon by the user interacting
with the interface.

The algorithms are separated from the core logic as they apply specific routines on an audio file. When
running, these routines should not hamper the smooth running of the program. They should work in the
background independently of the core logic.

For a more in-depth discussion on these areas see the implementation section.


Figure 23: Overview of System Architecture

The algorithms are separated into the key detection and beat detection algorithms. The rationale behind
the design decisions for each algorithm is explained below.

7.2 Key Detection Algorithm Design Rationale


Any key detection algorithm inevitably involves conversion of the signal from the time domain into the
frequency domain, using either the Fourier, Constant Q or Wavelet transforms.

Initially, I planned to write the entire algorithm in C# using the FFTW(23) (Fastest Fourier Transform in
the West) library, which, as the name suggests, claims to perform the FFT in the shortest
amount of time. However, I was getting peculiar frequency spectra which showed high intensities at
very high frequencies (i.e. greater than 20kHz). Additionally, I learned that the FFT is not the best
transform for non-stationary signals, and so I started looking for an efficient way of performing one
of the other transforms better suited to such signals.

The transform I decided to use to convert the signal from the time to the frequency domain was the short
time Fourier transform (STFT) which is essentially the FFT applied to small sections, or windows, of the
signal at a time.

Eventually I chose to use Matlab to develop the majority of the key detection algorithm. Matlab is a
matrix-based programming language with excellent support for digital signal processing, and it uses
the FFTW library to perform Fourier transformations. Recent versions of Matlab include the 'Builder for
.NET' tool, which conveniently converts Matlab code into a C#-compatible dynamic link library that
can be called from the rest of my C# project.

To determine the key, the output from the STFT is mapped to a chroma vector. There are then two main
techniques for matching the chroma vector to a key. Pattern matching techniques correlate the chroma
vector against a series of pre-programmed key templates and record the highest-correlating key.
Probabilistic models involve developing and training a hidden Markov model (HMM), and recording the
template which best aligns with the chroma vector. Pattern matching was chosen for the design of
this algorithm because it has been shown to give similar results to probabilistic methods without the
extra development time needed to program and train an HMM.

The speed constraints on the key detection algorithm are not as tight as for the beat detection algorithm:
once a song's key has been detected, that information is stored in the song's ID3 tag and can simply be
read by the program in future. Even so, it is still desirable for the process to take as little time as
possible.

7.3 Beat Detection Algorithm Design Rationale

The beat detection algorithm is based on the method set out by Patin(17) in ‘Statistical streaming beat
detection’. This algorithm iteratively compares the instant energy of a piece of music with the average
energy calculated over the past second. A beat is detected if this instant energy is significantly greater than
the average energy. The concept is similar to the human hearing system in that when we listen to music,
we only remember the past second or so of music.

We are designing the algorithm primarily to be used with dance music. It is assumed that this type of
music will have a consistent tempo throughout. This assumption means that the algorithm is unlikely to
give good results when applied to music without a consistent tempo.

It is also assumed that beats in this type of music are produced by a bass instrument such as a bass drum
with low frequency. Because the algorithm does not convert the signal to the frequency domain, and
works entirely in the time domain, the energies are based on the amplitude over the whole frequency
spectrum. This means that a significant sound variation in the high frequencies could be detected as a
beat just as much as one in the low frequencies. Applying a low pass filter to the signal should reduce the
impact that high frequencies have on the detection of beats.

The required accuracy for the beat detection in this project is within +/- 1.5% of the actual BPM.
Bearing in mind that the majority of time was devoted to developing the key detection algorithm to a high
standard, the method chosen was dictated by the time constraints of the project. Nevertheless, Patin
claims the algorithm gives good results with songs containing a dominant, consistent beat, so it is well
suited to this project, which is intended for use with dance music.

Patin’s method does not explicitly suggest a method for calculating a BPM value from the beats detected.
Finding the BPM is not as simple as counting the number of beats detected in a minute. A comb filter
could be used. This is a special type of filter that resonates at a certain frequency when a signal is passed
through it; that frequency is then used to calculate the BPM. Due to time constraints a more basic
method was used to calculate the BPM: the average interval between similarly spaced beats is found and
converted to a BPM value.

The beat detection algorithm had to be both accurate and fast, because each time a file is
loaded into a deck, the program detects its beats. Increasing the speed of a beat detection algorithm
usually implies a trade-off in its accuracy, so it was important to strike the right balance
between speed and accuracy.

8. Implementation
This section describes in detail the implementation of the system and algorithms, along with other
interesting implementation details. It is not a complete account of every part of the project;
many small details are omitted and can be assumed to have been successfully implemented. This approach
was taken to improve the readability of this report.

8.1 System Implementation


The system was implemented in C# with the FMOD sound processing library(24) in mind.
FMOD is an advanced, platform-independent front end to Microsoft's DirectShow and DirectX APIs. It
makes developing a multimedia program much easier than using these APIs directly. It is aimed at
the games industry, is used by many high-profile game developers, and is free for non-commercial
use.

Figure 24 illustrates a simplified overview of the system showing the main classes and their relationships.

Figure 24: System Overview

FMOD defines three main types which are used throughout the program:

• The System object initialises the FMOD engine, handles the creation and playing of Sound
objects and is used to set global parameters for the FMOD engine, such as changing the size and
type of buffers used by FMOD. There should only be one System object initialised throughout
the whole program, for efficiency reasons, and I decided to keep this object in the core class and
let other objects access it if and when they need it. This is the intuition behind the centralised
design.
• The Sound object holds information on the type of audio file loaded, i.e. its length in samples,
bytes and milliseconds, the number of channels (mono or stereo), and the bit-rate. It also reads
the audio data in the sound file into a byte buffer, enabling custom operations and analysis to be
performed on the raw audio data.
• The Channel object handles the parameters in which the sound is played, such as its volume,
playback rate (tempo), pitch and current position.

For each deck, the Core class contains a corresponding Sound and Channel object. The GUI classes fire
off events when certain actions are performed on them; these events are handled by the Core class, which
calls the appropriate FMOD function on the Sound and Channel objects corresponding to that deck. For
example, when the Play button is pressed on deck A, an event is fired and sent to the Core class; the Core
can tell from the message passed that deck A fired the event, so it knows to call FMOD's play
function on the Sound object corresponding to deck A.

The GUI is made up of the following classes:

• Deck – encapsulates the behaviour of a turntable i.e. loading, playing and pausing of sounds as
well as controlling pitch and tempo. Each deck has a unique id which corresponds to the id of
the relevant FMOD Sound and Channel objects.
• WaveForm – the Deck class contains a WaveForm class, which presents a zoomed-in
animated visualisation of the currently loaded track. This visualisation contains beat markers
which mark the precise location of where the beat detection algorithm detected a beat. The
visualisation can be dragged forwards and backwards, mimicking the bi-directional rotation of a
vinyl on a turntable. The waveform also displays a representation of the whole track, enabling
the user to quickly skip to a certain position in the track.
• Mixer – blends the output from the two currently playing decks using the crossfader. Also adds
functionality to filter out high or low frequencies for each track.
• MusicBrowser – displays the music files supported by the program on the user’s computer,
and their corresponding metadata information, such as the BPM and Key that was detected by
the program.

Both algorithms run asynchronously, in threads separate from the GUI and Core classes. This means that
they run in the background and do not block the GUI thread. This enables the user, for example, to
play a track in one deck while loading a track in the other deck, whilst detecting the
key of a third track. Obviously, the more activities the user performs simultaneously, the
slower the system becomes, as the different threads all compete for CPU time.

The structure of both algorithms is the same. The 'worker' class starts the main routine asynchronously
and receives progress updates from it, which allow the worker to update the relevant progress bars.
The worker is notified when the main routine has completed, causing the 'results' class to return the
relevant results from the algorithm. For beat detection, this is more than just the estimated BPM.
Since the waveform generation happens in the same loop as the beat detection, the results class also
returns arrays containing the values to be drawn onto the waveform, as well as an array of the beat
positions so that beat markers can be placed in the waveforms at the appropriate times.

8.2 Detecting the Key


The audio file is broken down into non-overlapping sections of approximately 5.5 seconds, and the flow
diagram (Figure 25) shows the process applied to each section of the song before a key for the whole
song is chosen.

Figure 25: Key Detection Algorithm Flow Chart

In order to save computation time, my approach starts by converting the audio section to mono and
downsampling it to 11025Hz. Converting to mono involves taking the average of every two consecutive
samples in the signal, reducing the number of samples by a factor of two. Downsampling further reduces
the number of samples in the audio stream whilst still conveying enough information to perform accurate
key detection. A side effect of downsampling is that frequency content above 5512.5Hz is discarded,
due to the Nyquist sampling theorem. However, frequencies above this limit contribute little to the
harmonic content of a song; the highest-frequency note detectable by the human ear is D# in
octave 8, with a frequency of 4978.03Hz.
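The pre-processing step above can be sketched as follows. This is a naive Python sketch, not the project's C# code, and the function name is illustrative; a production implementation would also low-pass filter before decimating to avoid aliasing:

```python
def to_mono_and_downsample(stereo, factor=4):
    """Average each interleaved L/R pair to mono (halving the sample count),
    then keep every `factor`-th sample (44100Hz / 4 = 11025Hz)."""
    mono = [(stereo[i] + stereo[i + 1]) / 2 for i in range(0, len(stereo) - 1, 2)]
    return mono[::factor]
```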

After the pre-processing stage, the signal is passed to Matlab, which performs an STFT of the signal using
a Hamming window of length 8192 samples. This is approximately 0.74s, which is a relatively long analysis
window in terms of musical harmony. To improve time resolution, frames are therefore overlapped by
one eighth of a window length, giving a time resolution of 0.093s per frame. The STFT returns a
spectrogram, a time-frequency plot showing the intensities of frequencies at different time
slices throughout the section of the song. Figure 26 shows a spectrogram of a C major chord played on
the piano. The most intense frequencies lie in roughly the 250 - 1500Hz range, and the
intensities gradually decay as time increases.

Figure 26: Output from the STFT

The next stage is to scan through the output from the STFT and map the frequencies in Hz to pitch
classes or musical notes. The result will be a chroma vector, also called a Pitch Class Profile (PCP) or
chromagram, which is traditionally a 12-dimensional vector, with each dimension corresponding
to the intensity of a semitone class (chroma). The procedure collapses pure tones of the same pitch class,
independent of octave, to the same chroma vector bin; for complex tones, the harmonics also fall into
particular, related bins. Frequency to pitch mapping is achieved using the logarithmic characteristic of the
equal temperament scale. STFT bins k are mapped to chroma vector bins p according to:

p(k) = round( 12 · log2( (k · fs / N) / fref ) ) mod 12                Equation 1

where fref is the reference frequency corresponding to the first index in the chroma vector (p = 0). I
chose fref = 440Hz, which is the frequency of pitch class A. fs is the sampling rate (11025Hz) and N is
the size of the FFT in samples (8192).
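Equation 1 can be expressed directly in code. The following Python sketch (function name illustrative, not from the implementation) maps an STFT bin index to a chroma bin, with index 0 corresponding to pitch class A:

```python
import math

def chroma_bin(k, fs=11025, n_fft=8192, f_ref=440.0):
    """Map STFT bin k to a chroma (pitch class) bin as in Equation 1."""
    f = k * fs / n_fft                              # centre frequency of bin k
    return round(12 * math.log2(f / f_ref)) % 12    # 0 = A, 3 = C, ...
```

For example, bin 327 (about 440Hz) maps to chroma bin 0 (A), and bin 194 (about 261.6Hz) maps to bin 3 (C).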

For each time slice, we calculate the value of each chroma vector element by summing the magnitudes of
all frequency bins that correspond to that pitch class, i.e. for p = 0, 1, ..., 11:

C(p) = Σ_{k : p(k) = p} |X(k)|                Equation 2

Once we have our normalised chroma vector, we need to match it against pre-defined templates
representing the 24 possible keys (12 major, 12 minor). These templates are also 12-dimensional, where
each bin represents a pitch class. They are binary, i.e. each bin is either 1 or 0. A C major chord
consists of the notes C (root), E (third) and G (fifth), therefore, the template for the key of C Major
would be [0,0,0,1,0,0,0,1,0,0,1,0] where the labelling of the template is
[A,A#,B,C,C#,D,D#,E,F,F#,G,G#]. A G Major chord consists of the notes G, B and D, and so its
template would be [0,0,1,0,0,1,0,0,0,0,1,0]. As can be seen from these examples, every major triad
template is just a shifted version of the others. The minor key templates are the same as the major key
templates but with the third shifted one place to the left. The template for a C minor chord (C, D#, G) is
therefore [0,0,0,1,0,0,1,0,0,0,1,0], and the other minor keys are shifted versions of this template.
Templates for augmented, diminished or 7th chords can be defined in a similar way. We deal only with
major and minor keys here, as the Camelot easymix system does not recognise modes other than these.
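Since every template is a rotation of the C major or C minor pattern, all 24 can be generated from the triad intervals. A Python sketch (names illustrative):

```python
NOTES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def triad_template(root, minor=False):
    """Binary 12-bin template with 1s at the root, third and fifth."""
    third = 3 if minor else 4          # minor third = 3 semitones, major = 4
    template = [0] * 12
    for interval in (0, third, 7):     # root, third, perfect fifth
        template[(root + interval) % 12] = 1
    return template

# All 24 key templates, e.g. templates["C"] and templates["Cm"].
templates = {NOTES[r]: triad_template(r) for r in range(12)}
templates.update({NOTES[r] + "m": triad_template(r, minor=True) for r in range(12)})
```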

Figure 27 shows the chroma vector of a C major chord played on the piano and its correlation with the
24 key templates.

Figure 27: Chroma Vector of C Major chord and its correlation with key templates

We now correlate the computed chroma vector with each of the 24 key templates, obtaining a
correlation coefficient for each key. The correlation coefficient is calculated using:

r = Σm Σn (Amn − Ā)(Bmn − B̄) / sqrt( [Σm Σn (Amn − Ā)²] · [Σm Σn (Bmn − B̄)²] )                Equation 3

where A and B are matrices of size m × n; in our case these are simply vectors of size 12.
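For 12-element vectors, Equation 3 reduces to the ordinary Pearson correlation coefficient, which can be sketched as:

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length vectors (Equation 3)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den
```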

We assign a weighting to the key with the highest correlation, corresponding to the difference between
its coefficient and the second-highest correlation coefficient. For the weighting to be fair we
need to normalise the correlation coefficients so that the highest value becomes 1. The weighting
penalises the highest-correlating key when the chroma vector also correlates closely with other keys and
the difference between them is minute, meaning the key could plausibly have been one of the other highly
correlating keys. It rewards the highest-correlating key when its coefficient is by far the highest value.

When we have reached the end of the song we will have several weightings, one for each 5.5 second
segment of the song. To find the most likely key, we simply sum the weightings for each key and the key
with the highest value at the end is selected as the most likely key.
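The weighting-and-voting scheme described above can be sketched as follows (function and variable names are illustrative, not taken from the implementation):

```python
def pick_song_key(segment_coeffs, key_names):
    """Each segment votes for its best key with a weight equal to the gap
    between the top two normalised correlation coefficients; the key with
    the largest total weight wins."""
    totals = {name: 0.0 for name in key_names}
    for coeffs in segment_coeffs:
        top = max(coeffs)
        norm = sorted(c / top for c in coeffs)     # highest becomes 1.0
        weight = norm[-1] - norm[-2]               # 1 minus the runner-up
        totals[key_names[coeffs.index(top)]] += weight
    return max(totals, key=totals.get)
```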

The detected key is then stored in the ID3 tag of the song so that in future it can be read straight away,
without having to go through the whole process described above again. The ID3 tag is written using the
Ultra ID3 Lib library(25).

8.3 Detecting the Beats


The basic intuition behind the beat detection algorithm is to find points in the music where the instant
energy of the signal is greater than some scaling of the average energy of the signal over roughly the
previous second of music. The assumption is that the instant energy of a signal will be much
greater on a beat than between beats. This assumption is reasonable for songs with heavy down beats
and little mid- and high-frequency "noise".

The audio file is first split into manageable sections. The reason for splitting the file up is simply
that reading the whole file as one big chunk of data requires a lot of memory for the large buffer
containing the samples. It also creates a bottleneck for the entire system, as reading the
entire song takes up the majority of the CPU usage at that moment.

The audio data is first converted to mono as in the key detection algorithm, but is not downsampled. We
then iteratively apply the following process to the signal. First we calculate the instant energy, E, which is
the energy contained in 512 samples; at 44100Hz this is roughly a hundredth of a second, which is close
enough to instant. The instant energy is calculated using the following formula, where X is the signal:

E = Σ_{k=0..511} X[k]²                Equation 4

We then need to calculate the average energy. This is not calculated over the entire song, since a song may
have an intense passage and also a calm part. The average energy is calculated over the last 44032 samples,
which is just short of a second. 44032 samples are chosen instead of 44100 because it is then more
convenient to calculate the average energy by simply summing the past 86 instant energy readings (as 86 x
512 = 44032) and taking their average. Equation 5 illustrates the calculation of the average energy, ⟨E⟩,
where H is a history buffer of length 86 containing the past 86 instant energy readings:

⟨E⟩ = (1/86) Σ_{i=0..85} H[i]                Equation 5

Next we compare the current instant energy to the average energy over the past second multiplied by
some constant C. To obtain C we first compute the variance of the past 86 instant energies:

V = (1/86) Σ_{i=0..85} ( H[i] − ⟨E⟩ )²                Equation 6

The constant C is then computed from V using a linear regression:

C = −0.0025714 · V + 1.5142857                Equation 7

A beat is detected only if the instant energy is greater than the average energy multiplied by C, and only
if the chosen beat interval has elapsed since the last detected beat. The beat interval is a
minimum time separating adjacent beats; if a beat is detected but the beat interval has not elapsed
since the previous beat, the beat is rejected and not recorded.

We continue this cycle by shifting the history buffer, H, one index along, making room for a new
instant energy value whilst flushing the oldest. The new instant energy reading is placed in the first index
of H. ⟨E⟩ is recalculated, as well as C, and we compare the new instant energy to ⟨E⟩ multiplied by C
again, and so on until we reach the end of the section of the song. We repeat the whole process for each
section of the song until the end is reached.
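Equations 4-7 and the detection loop can be put together in a simplified Python sketch. This is not the project's C# implementation, and the 300ms minimum beat interval is an assumed value:

```python
def detect_beats(samples, fs=44100, min_gap_ms=300):
    """Energy-based beat detection after Patin; returns beat times in ms."""
    block = 512
    history = []                       # past instant energies (~1 s when full)
    beats, last_beat = [], -min_gap_ms
    for start in range(0, len(samples) - block + 1, block):
        e = sum(x * x for x in samples[start:start + block])    # Equation 4
        if len(history) == 86:
            avg = sum(history) / 86                             # Equation 5
            var = sum((h - avg) ** 2 for h in history) / 86     # Equation 6
            c = -0.0025714 * var + 1.5142857                    # Equation 7
            t_ms = start * 1000 / fs
            if e > c * avg and t_ms - last_beat >= min_gap_ms:
                beats.append(t_ms)
                last_beat = t_ms
            history.pop(0)                                      # flush the oldest
        history.append(e)
    return beats
```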

Each time a beat is detected its position is recorded so that a visual beat marker can be added at the
correct location in the waveform. FMOD Sound objects can store markers called syncpoints; when the
sound is played back and a syncpoint is reached, a callback is generated. So, each time a beat is detected,
a syncpoint is automatically added to the Sound object at the corresponding position in the song. This
comes in useful later for the automatic beat matching function.

8.4 Calculating BPM Value


The BPM value was originally calculated by counting the number of beats detected in a 15 second
section of the song and multiplying this number by 4 to get the number of beats in a minute. However,
this was very inaccurate and very sensitive to the area of the song chosen: some 15 second sections of a
song may contain many more beats than others, so a more accurate method was needed.

I noticed that the most accurate BPM estimates come from passages where beats are consistently
detected one after the other; in a dance song this is usually at the beginning and end of the track. To
exploit this, we first keep track of the time span between adjacent detected beats in milliseconds, which I
will refer to as the gap value from now on.

The gap value is compared to the previous gap value. If they are equal, meaning the beats are exactly
evenly spaced, a similarity counter is incremented; if not, the counter is reset. Depending on the value of
the counter, the gap value is added to one of several arrays, each corresponding to a range of the
similarity counter. The higher the similarity count, the more accurate the BPM estimate should be.

At the end of the song, we take the average of the array corresponding to the highest range of the
similarity counter; if this array is empty, we take the average of the next array, and so on until we find a
non-empty array. Eventually we have an average gap value for the song, taking into account only those
gap values which were similar. To convert the average gap value (in milliseconds) into a BPM estimate,
we use the following formula:

BPM = 60 / (average gap value / 1000)                Equation 8
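The gap-bucketing scheme can be sketched as follows. The similarity-count ranges here are assumed for illustration, as the exact boundaries are not specified:

```python
def estimate_bpm(beat_times_ms, ranges=((0, 4), (4, 8), (8, 10 ** 9))):
    """Average the gaps seen at the highest similarity range, then apply Equation 8."""
    buckets = [[] for _ in ranges]
    similar, prev_gap = 0, None
    for a, b in zip(beat_times_ms, beat_times_ms[1:]):
        gap = b - a
        similar = similar + 1 if gap == prev_gap else 0   # reset on a new gap
        prev_gap = gap
        for i, (lo, hi) in enumerate(ranges):
            if lo <= similar < hi:
                buckets[i].append(gap)
    for bucket in reversed(buckets):                      # highest range first
        if bucket:
            avg_gap_ms = sum(bucket) / len(bucket)
            return 60 / (avg_gap_ms / 1000)               # Equation 8
    return None
```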

8.5 Automatic Beat Matching
The beat synchronisation consists of two elements: first, bringing the two tracks to the same tempo;
second, starting the incoming track on the correct downbeat, at the same moment a beat occurs in the
outgoing track.

When both decks are loaded and the sync button is activated on one of the decks, the difference between
its BPM and the other track's BPM is found and converted into the appropriate tempo change, which is
applied to the track loaded in the deck. Both tracks are then at the same tempo according to the BPM
estimate given by the beat detection algorithm.
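The tempo change itself is a simple ratio of the two BPM estimates (a sketch; the function name is illustrative):

```python
def tempo_change(track_bpm, target_bpm):
    """Playback-rate multiplier (and percentage change) that brings a track
    to the target tempo."""
    ratio = target_bpm / track_bpm
    return ratio, (ratio - 1) * 100
```

For example, syncing a 128 BPM track to a 132 BPM track gives a playback rate of 1.03125, a +3.125% tempo change.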

To get the incoming track to start at the same time as a beat in the outgoing track, the
incoming track is cued up just before its first beat. With the sync button still enabled, when the play
button is pressed, the track will start playing from that beat only when the track in the other deck
reaches its next beat. We can tell when the next beat is reached thanks to the syncpoints that have
previously been added, and the callbacks that are generated when the current position of the playing
track equals the position of a syncpoint.

If the two tracks start to come out of sync due to slight differences in their ‘real’ tempo, there is a
function which snaps both tracks to their next beat so that the two tracks will be back in sync. This is
implemented by finding the nearest syncpoint in each track to the current play position, then finding the
next syncpoint on from the current position in each track, and setting both tracks to the position of the
next syncpoint. Since the position of a syncpoint corresponds to the position of a beat, both tracks
should now skip to the position of their next beat and be in sync again.

8.6 Generating and animating the waveforms


Although this area of the project does not concern the actual beat and key detection algorithms, the
waveforms were an interesting and challenging programming problem, and I will document the most
interesting aspects here.

The generation of the waveforms is done at the same time as the beat detection, to avoid scanning
through the song twice. For the zoomed-in waveform I chose to display 6 seconds of the song at any
instant. The current position of the song is marked by a vertical bar in the centre of the waveform
display; this means that half of the waveform shows the 3 seconds of the song that have just
played, and the other half shows the 3 seconds about to play.

Each pixel in the waveform corresponds to a number of samples in the song. The peak positive and peak
negative amplitudes found within these samples are plotted on the waveform. This process is carried
out for each pixel until an image of the entire song has been built up.
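The per-pixel peak computation can be sketched as follows (a Python sketch with illustrative names, not the C# implementation):

```python
def waveform_columns(samples, n_pixels):
    """Return (min, max) amplitude pairs, one per pixel column of the waveform."""
    per_px = max(1, len(samples) // n_pixels)
    columns = []
    for x in range(n_pixels):
        chunk = samples[x * per_px:(x + 1) * per_px]
        if not chunk:
            break
        columns.append((min(chunk), max(chunk)))
    return columns
```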

This image is placed into an image holder using the built-in .NET PictureBox control. Animating the
waveforms involves using a timer to find the current position of the track every 100 milliseconds and
then translating the image. The amount by which to translate the image is calculated by mapping the
current play position onto the width of the image, giving a value in pixels. The translation is
performed using matrix transformations.

The problem with using the PictureBox control is that it has a limit on the size of images it can handle.
This meant that loading a song longer than eight and a half minutes would cause an exception in the
PictureBox control, because such a song generates a waveform image too wide for the control to handle.

To solve the problem, the images are broken down into smaller segments and placed in the PictureBox
one after the other. But as only one image at a time can be translated, once the end of an image is reached
there will be a gap before the next image is displayed by the PictureBox.

To solve this further problem, the images are overlapped by a certain amount, so that when one image
is about to end, the next is swapped in, eliminating the gap. The waveform appears to the user as one
continuous image, but it is actually several images overlapped at certain positions.

Figure 28: Overlapping of waveform images

9. Testing
This section covers the tests performed to determine the optimum values for the parameters used in the
key and beat detection algorithms.

9.1 Parameter Testing – Key Detection Algorithm


This section explains the tests performed to determine the optimum values for parameters in the key
detection algorithm.

9.1.1 Bass threshold frequency

When we scan through the STFT mapping frequency components to pitch classes, we start from
some lower bound, so that all frequencies below this bound are ignored. In effect, we are trying to stop
the bass line and bass drum from influencing the key detection result. The keys are displayed below with
their corresponding Camelot keycode and formal notation, where C = C Major, C m = C Minor,
# = Sharp, b = Flat.

Table 1: Effect of frequency cut-off on key detection

Tracks (expected key): T1 = Lucid - I Can't Help Myself (4A); T2 = Sasha & Emerson - Scorchio (4A);
T3 = Quivver - She Does [Quivver Mix] (4A); T4 = Skip Raiders - Another Day [Perfecto Remix] (3A);
T5 = Pulp Victim - The World 99 [Lange Remix] (12A/9A); T6 = Warrior - Don't you want me (11A);
T7 = William Orbit - Ravels Pavane [FC Remix] (11A)

Low Pass Cut-off | T1      | T2       | T3      | T4       | T5        | T6        | T7
None (1Hz)       | 4A / Fm | 7B / F   | 4A / Fm | 5B / Eb  | 8A / Am   | 11A / F#m | 1A / Abm
32Hz             | 4A / Fm | 7B / F   | 4A / Fm | 7B / F   | 8A / Am   | 11A / F#m | 1A / Abm
64Hz             | 4A / Fm | 3A / Bbm | 4A / Fm | 3A / Bbm | 3B / Db   | 3B / Db   | 3A / Bbm
96Hz             | 4A / Fm | 4A / Fm  | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
98Hz             | 4A / Fm | 4A / Fm  | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
100Hz            | 4A / Fm | 3B / Db  | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 3B / Db
128Hz            | 4A / Fm | 4A / Fm  | 4A / Fm | 3A / Bbm | 9A / Em   | 11A / F#m | 11A / F#m
250Hz            | 4A / Fm | 4B / Ab  | 4A / Fm | 3A / Bbm | 1A / Abm  | 11A / F#m | 3A / Bbm

We can see that the cut-off point has no effect on some tracks, whereas on others it alters the key
detection result quite considerably. The most successful cut-off point seems to be around the 100Hz
mark. It is interesting that such a small change in this cut-off point can swing the key detection result one
way or the other: going from 98Hz to 100Hz changes the detected key of the Sasha and William Orbit
tracks. 98Hz is used in the final implementation, as it was suggested in some literature as being quite
reliable and it gives the correct reading for all the tracks tested.

9.1.2 Choice of FFT window length

Choosing the correct size (N) for the FFT can affect the algorithm in the following ways. First, in order to take advantage of the computational efficiency of the FFT algorithm, we want N to be a power of 2. Secondly, we want to choose a value of N that will not misrepresent the data. The larger we make N, the more data is analysed in one FFT. We do not want to make N so big that it takes in too much of the signal at a time, as this defeats the point of using the STFT. If N is too small, the STFT may not capture enough harmonic data, which may lead to misinterpretation of the signal. It should be noted that the actual window function is a Hamming window, which is fixed by the Matlab implementation of the STFT.
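The trade-off can be made concrete with a small helper (an illustrative sketch of my own, using the 44100Hz sample rate of the source files, which matches the window times listed in Table 2):

```python
# Illustrative sketch: time span and frequency resolution of an N-point
# FFT at a 44100 Hz sample rate (the rate of the source files).
def fft_window_tradeoff(n, fs=44100):
    """Return (seconds of audio covered, Hz per frequency bin)."""
    return n / fs, fs / n

for n in (1024, 2048, 4096, 8192, 16384):
    span, res = fft_window_tradeoff(n)
    print(f"N={n:5d}: {span:.3f}s of audio, {res:.2f} Hz per bin")
```

Doubling N doubles the window's time span but halves the width of each frequency bin, which is exactly the balance discussed above.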

Table 2: Effect of FFT Length on key detection

FFT Length in samples (Time) | Lucid - I Can't Help Myself (4A) | Sasha & Emerson - Scorchio (4A) | Quivver - She Does [Quivver Mix] (4A) | Skip Raiders - Another Day [Perfecto Remix] (3A) | Pulp Victim - The World 99 [Lange Remix] (12A/9A) | Warrior - Don't you want me (11A) | William Orbit - Ravels Pavane [FC Remix] (11A)
1024 (0.023s) | 4A / Fm | 8A / Am | 4B / Ab | 4B / Ab | 12A / Dbm | 11B / A | 2A / Ebm
2048 (0.046s) | 4A / Fm | 4B / Ab | 4A / Fm | 3A / Bbm | 12A / Dbm | 12A / Dbm | 3A / Bbm
4096 (0.093s) | 4A / Fm | 4B / Ab | 4A / Fm | 3A / Bbm | 12B / E | 11A / F#m | 2B / F#
8192 (0.186s) | 4A / Fm | 4A / Fm | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
16384 (0.372s) | 4A / Fm | 3A / Bbm | 4A / Fm | 7B / F | 12A / Dbm | 11A / F#m | 3B / Db

Once again, the Lucid track is not affected by the length of the FFT, but all of the other tracks are. In most cases, the key is mistaken for its relative minor or major. An FFT length of 1024 samples, corresponding to 0.023s of audio data, is too short for the analysis to capture enough of the harmonic data. 16384 samples, corresponding to 0.372s of audio data, is too long, as the signal is unlikely to remain stationary for this duration. The FFT size of 8192 gives the best results, probably because it strikes the right balance between capturing enough harmonic data for a good analysis without being so long that the signal changes significantly within the window. Therefore, the FFT in the final implementation is 8192 samples long.

9.1.3 Harmonic Product Spectrum

Using the harmonic product spectrum to enhance the chroma vector is a technique suggested by Lee in ‘Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile’, detailed in the background section. It aims to remove the non-harmonic overtones which are inevitably produced by certain instruments and sounds, whilst amplifying the harmonic overtones of the signal.

Most acoustic instruments and the human voice produce a sound with harmonics at integer multiples of the fundamental frequency, so decimating the frequency spectrum by an integer factor still yields a peak at the fundamental; multiplying the decimated spectra together therefore reinforces the fundamental frequency. The HPS is calculated from the magnitude spectrum X of the signal using the following formula:
following formula:
HPS(f) = ∏_{m=1}^{M} |X(m·f)|

Equation 9

Figure 29: Illustration of the Harmonic Product Spectrum taken from (30)

Figure 29 illustrates what is happening in Equation 9, for M = 5.

In the chord recognition application, however, decimating the original spectrum by powers of two turned out to work better than decimating by consecutive integers, according to Lee. This is because harmonics that are not at powers of two of the fundamental frequency (i.e. not octave equivalents) may contribute energy to pitch classes other than those that comprise the chord tones, undermining the enhancement of the spectrum. For example, the fifth harmonic of A in octave 3 is C# in octave 6, which is not a chord tone in an A minor triad. The equation is therefore modified as follows:
HPS(f) = ∏_{m=1}^{M} |X(2^(m-1)·f)|

Equation 10
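Both decimation schemes can be sketched in a few lines (a minimal pure-Python illustration; X is assumed to be a sampled magnitude spectrum and the function names are mine, not the project's):

```python
# Equation 9: product of the spectrum decimated by integers 1..M.
def hps_integer(X, M=3):
    n = len(X) // M  # highest bin whose M-th harmonic is still in range
    out = []
    for k in range(n):
        p = 1.0
        for m in range(1, M + 1):
            p *= abs(X[k * m])  # |X(m * f_k)|
        out.append(p)
    return out

# Equation 10: product of the spectrum decimated by powers of two.
def hps_power_of_two(X, M=3):
    n = len(X) // 2 ** (M - 1)
    out = []
    for k in range(n):
        p = 1.0
        for m in range(M):
            p *= abs(X[k * 2 ** m])  # |X(2^m * f_k)|
        out.append(p)
    return out
```

With a synthetic spectrum whose fundamental sits in bin 8 (harmonics in bins 16, 24, 32), both variants produce their largest product at bin 8, sharpening the fundamental relative to the raw spectrum.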

I tested the Harmonic Product Spectrum with M = 3 for integer decimation and power-of-two
decimation. The results are shown below:

Table 3: Effect of Harmonic Product Spectrum on key detection

HPS | Lucid - I Can't Help Myself (4A) | Sasha & Emerson - Scorchio (4A) | Quivver - She Does [Quivver Mix] (4A) | Skip Raiders - Another Day [Perfecto Remix] (3A) | Pulp Victim - The World 99 [Lange Remix] (12A/9A) | Warrior - Don't you want me (11A) | William Orbit - Ravels Pavane [FC Remix] (11A)
None | 4A / Fm | 4A / Fm | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
Integer spacing | 3B / Db | 11A / F#m | 3A / Bbm | 2A / Ebm | 11B / A | 10B / D | 1A / Abm
Power-of-two spacing | 4A / Fm | 4A / Fm | 4A / Fm | 7B / F | 12A / Dbm | 4B / Ab | 3B / Db

The HPS with integer spacing does not work well with the songs in the test set. The HPS with power-of-two spacing recognises the correct key in four of the seven tracks, but is still not perfect. It was decided, after further experimentation, that the HPS should not be used to detect the key of dance music tracks in the final implementation of the algorithm.

The reason the HPS is not so successful with the test set may be because few acoustic instruments are used in the tracks: dance music tends to use electronically generated sounds, which may or may not produce harmonics in the same way as acoustic instruments. The HPS may be better suited to detecting the key of classical music, or any genre which uses more acoustic instruments than dance music does.

9.1.4 Weighting System

The weighting system described in the implementation section assigns a weight to the highest correlating
key for every section of the song. This weighting is the difference between the highest correlation
coefficient and the second highest. Without the weighting system, the highest correlating key is always
given a weighting of 1, no matter how close or remote the other correlating keys are.
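The scheme can be sketched as follows (illustrative code, not the project's exact implementation: each analysis frame votes for its best-matching key template, weighted by the margin over the runner-up):

```python
def weighted_vote(correlations):
    """correlations: one coefficient per key template (24 in the project).
    Returns (index of best key, weight = margin over the second best)."""
    ranked = sorted(range(len(correlations)),
                    key=correlations.__getitem__, reverse=True)
    best, second = ranked[0], ranked[1]
    return best, correlations[best] - correlations[second]

def detect_key(frames):
    """Accumulate weighted votes over all frames; return the winning key."""
    totals = {}
    for corr in frames:
        key, weight = weighted_vote(corr)
        totals[key] = totals.get(key, 0.0) + weight
    return max(totals, key=totals.get)
```

A key that narrowly wins many frames can still lose overall to a key that wins fewer frames decisively, which is exactly the behaviour described for the Lustral track below.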

The results show the effect of the weighting system on the keys detected:

Table 4: Effect of the weighting system on key detection

Weighting System | Lucid - I Can't Help Myself (4A) | Sasha & Emerson - Scorchio (4A) | Quivver - She Does [Quivver Mix] (4A) | Skip Raiders - Another Day [Perfecto Remix] (3A) | Pulp Victim - The World 99 [Lange Remix] (12A/9A) | Warrior - Don't you want me (11A) | William Orbit - Ravels Pavane [FC Remix] (11A)
On | 4A / Fm | 4A / Fm | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
Off | 5A / Cm | 4B / Ab | 4A / Fm | 5A / Cm | 12A / Dbm | 11A / F#m | 11A / F#m

The weighting system does make a difference to the key detected. Without it, the algorithm detects only four of the seven tracks correctly. Therefore the weighting system is used in the final implementation.

The reason the weighting system works well is that the chroma vector sometimes correlates highly with more than one key template. In the diagram below, the chroma vector has a strong reading in the A# pitch class and smaller peaks at the pitch classes of C, F and G. Correlating it with the 24 key templates gives the highest correlation with A# minor, but five other key templates also correlate highly with the chroma vector. The weighting system recognises that the A# minor key has correlated the most, but because the result is so close to another key, the weighting assigned is small: in this case, the difference between the correlation coefficients of A#m and Gm.

Figure 30: Chroma Vector showing close correlation between many different key templates

For the Lustral song in the test section, there is a competition between C minor and F minor for the highest detected key. Without the weighting system, C minor wins the competition and is chosen as the most likely key. With the weighting system on, F minor wins convincingly. This shows that although C minor may win the correlation most often, it must do so when other alternatives correlate almost as highly, and so it receives a smaller weighting overall. On the other hand, when F minor correlates highest, it must do so when there is less doubt about which key template the chroma vector matches, and so it is assigned a higher weight. The bar charts below show the results of the key detection for the Lustral track with and without the weighting system.

Figure 31: F minor is detected correctly with the weighting system enabled

Figure 32: C minor is detected without the weighting system enabled

9.1.5 Time in between overlapping frames

The time between overlapping FFT frames of the STFT, also called the hopsize or stride, is another important parameter which can affect the overall key result. A small hopsize of, say, 50ms equates to taking 20 FFTs per second, whereas a hopsize of 1000ms means taking one overlapping FFT every second. Generally, the smaller the hopsize, the finer the time resolution of the analysis; however, performing more FFTs per second increases the computation time.

Table 5: Effect of hopsize on key detection

Hopsize | Lucid - I Can't Help Myself (4A) | Sasha & Emerson - Scorchio (4A) | Quivver - She Does [Quivver Mix] (4A) | Skip Raiders - Another Day [Perfecto Remix] (3A) | Pulp Victim - The World 99 [Lange Remix] (12A/9A) | Warrior - Don't you want me (11A) | William Orbit - Ravels Pavane [FC Remix] (11A)
50ms | 4A / Fm | 4A / Fm | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
75ms | 4A / Fm | 4A / Fm | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
100ms | 4A / Fm | 4A / Fm | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
200ms | 4A / Fm | 4A / Fm | 4A / Fm | 7B / F | 12A / Dbm | 11A / F#m | 11A / F#m
500ms | 4A / Fm | 4A / Fm | 4A / Fm | 7B / F | 12A / Dbm | 11A / F#m | 11A / F#m
1000ms | 4A / Fm | 4A / Fm | 4A / Fm | 7B / F | 12A / Dbm | 11A / F#m | 11A / F#m

The hopsize does not affect the key detection result in any of the tracks except the Skip Raiders track, at sizes greater than 100ms. 100ms is the final value chosen, which means taking an FFT frame every tenth of a second. This gives the STFT good time resolution without a big performance hit.

9.1.6 Downsampling

The downsampling of the song is done to speed the computation of the STFT up. I wanted to see the
effect it would have, if any, on the actual key detected.
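A naive sketch of the 4x downsampling step (44100Hz to 11025Hz) is shown below. This is illustrative only, not the project's exact method: a real decimator would apply a proper anti-aliasing low-pass filter first, and here a simple four-sample mean stands in for that filter.

```python
# Naive 4x downsampler: average each group of four samples.
# (A proper implementation would low-pass filter before decimating.)
def downsample4(samples):
    return [sum(samples[i:i + 4]) / 4
            for i in range(0, len(samples) - 3, 4)]
```

Each call shrinks the signal to a quarter of its length, which is what makes the subsequent STFT roughly four times cheaper.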

Table 6: Effect of downsampling on key detection

Downsample Rate | Lucid - I Can't Help Myself (4A) | Sasha & Emerson - Scorchio (4A) | Quivver - She Does [Quivver Mix] (4A) | Skip Raiders - Another Day [Perfecto Remix] (3A) | Pulp Victim - The World 99 [Lange Remix] (12A/9A) | Warrior - Don't you want me (11A) | William Orbit - Ravels Pavane [FC Remix] (11A)
4x - 11025Hz | 4A / Fm | 4A / Fm | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 11A / F#m
2x - 22050Hz | 4A / Fm | 4B / Ab | 4A / Fm | 3A / Bbm | 12B / E | 11A / F#m | 3B / Db
1x - 44100Hz | 4A / Fm | 4A / Fm | 4A / Fm | 3A / Bbm | 12A / Dbm | 11A / F#m | 3B / Db

For the Sasha and Pulp Victim tracks, downsampling the signal to 22050Hz causes the relative major of the actual key to be detected, whereas for the William Orbit track, a Db major is detected instead of an F# minor.

Leaving the signal unaltered at 44100Hz, the only track whose key was detected differently was the William Orbit track, again detected as Db major.

Although the downsampling is used to speed up the computation, I did not notice much difference in the
time taken to perform the analysis at varying levels of sample rate. Even so, 11025Hz is used in the final
implementation because this seems to be the standard for key detection algorithms.

9.2 Parameter Evaluation – Beat Detection Algorithm

9.2.1 Size of Instant Energy

Varying the size of the instant energy buffer affects the BPM result. A range of instant energy sizes was tested with six songs.

Table 7: Effect of instant energy size on beat detection

Size of Instant Energy (samples) | Agnelli and Nelson - Everyday [Lange Mix] (138.15) | Chakra - Home [Above & Beyond Mix] (138) | Faithless - Why Go (134.73) | Sasha & Emerson - Scorchio (134.91) | William Orbit - Ravels Pavane [FC Remix] (137.85) | Xstasia - Sweetness [Michael Woods Remix] (136)
128 | 137.84 | 137.82 | 163.09 | 145.48 | 155.92 | 138.32
256 | 137.82 | 137.82 | 140.81 | 133.72 | 141.71 | 136.01
512 | 138.51 | 134.34 | 136.73 | 134.73 | 138.85 | 136.01
1024 | 136.45 | 136.45 | 137.45 | 136.01 | 139.68 | 136.01
2048 | 143.56 | 127.22 | 118.34 | 107.65 | 113.43 | 67.96
4096 | 75.08 | 46.11 | 71.74 | 118.94 | 46.11 | 129.21

An instant energy size of 512 samples consistently gives the closest BPM result, so this size is used in the final implementation. However, for the Chakra song, an instant energy size of 256 samples gave a result closer to the actual BPM value.

9.2.2 Size of Average Energy

The size of the average energy buffer corresponds to how much of the song we compare the instant
energy to.
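The instant/average comparison can be sketched as follows (an illustrative helper, not the project's exact code; the buffer sizes follow the final implementation's 512 and 44032 samples, while the sensitivity factor C = 1.3 is an assumed value):

```python
# Flag a beat whenever the energy of the newest 512-sample block exceeds
# the average energy of the preceding ~1 second by a factor of C.
def detect_beats(samples, instant=512, average=44032, C=1.3):
    beats = []
    for i in range(average, len(samples) - instant, instant):
        e_inst = sum(s * s for s in samples[i:i + instant]) / instant
        e_avg = sum(s * s for s in samples[i - average:i]) / average
        if e_inst > C * e_avg:
            beats.append(i)  # sample index where the beat block starts
    return beats
```

A loud 512-sample burst after a quiet second of audio is flagged, while a uniformly quiet signal produces no beats at all.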

Table 8: Effect of average energy size on beat detection

Size of Average Energy (samples) | Agnelli and Nelson - Everyday [Lange Mix] (138.15) | Chakra - Home [Above & Beyond Mix] (138) | Faithless - Why Go (134.73) | Sasha & Emerson - Scorchio (134.91) | William Orbit - Ravels Pavane [FC Remix] (137.85) | Xstasia - Sweetness [Michael Woods Remix] (136)
11008 | 139.68 | 134.68 | 128.48 | 135.84 | 127.56 | 136.01
22016 | 138.53 | 136.3 | 139.68 | 135.06 | 138.43 | 136.01
44032 | 138.51 | 134.34 | 136.73 | 134.73 | 138.85 | 136.01
88064 | 138.74 | 166.72 | 141.42 | 135.18 | 138.93 | 136.01
176128 | 142.57 | 166.72 | 153.82 | 135.52 | 140.31 | 136.01

44032 is chosen as the value for this parameter in the final implementation because it performed well in day-to-day use of the program. This corresponds to just under one second of audio data. A size of 22016 samples does identify the BPM of the Chakra, Sasha and William Orbit songs more closely, but not by much.

9.2.3 Beat Interval

The beat interval is the minimum time that must elapse between adjacent beats. Without this interval, beats are detected far too frequently. This is because a beat may consist of many peaks in the waveform, all of which are counted as separate beats if we do not separate them using a minimum gap value.

The following diagrams illustrate the problem; both are waveform images of the start of Xstasia – Sweetness. The vertical green lines show where the beats are detected. The first is with the beat interval set at 50ms: you can see that beats are being detected too close to each other, especially at the beginning. In the second waveform, with the beat interval set at 350ms, the beats are detected at the correct onset of each beat and at regular intervals.
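The minimum-gap rule itself is simple (an illustrative helper of my own, assuming beat times in milliseconds):

```python
# Discard any detected beat that falls within `interval_ms` of the
# previously kept beat, so that the extra peaks around one drum hit
# cannot each count as a separate beat.
def enforce_beat_interval(beat_times_ms, interval_ms=350):
    kept = []
    for t in beat_times_ms:
        if not kept or t - kept[-1] >= interval_ms:
            kept.append(t)
    return kept
```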

Figure 33: Too many beats detected with 50ms beat interval

Figure 34: Beats being detected correctly with beat interval of 350ms

Table 9: Effect of beat interval size on beat detection

Beat Interval | Agnelli and Nelson - Everyday [Lange Mix] (138.15) | Chakra - Home [Above & Beyond Mix] (138) | Faithless - Why Go (134.73) | Sasha & Emerson - Scorchio (134.91) | William Orbit - Ravels Pavane [FC Remix] (137.85) | Xstasia - Sweetness [Michael Woods Remix] (136)
50ms | 1,033.61 | 1,033.61 | 1,033.61 | 923.76 | 492.2 | 685.57
100ms | 177.34 | 574.24 | 574.24 | 228.18 | 369.16 | 136.01
200ms | 138.51 | 283.19 | 184.99 | 179.22 | 250.2 | 136.01
300ms | 138.51 | 195.77 | 136.73 | 137.25 | 136.01 | 136.01
350ms | 138.51 | 134.34 | 136.73 | 134.73 | 138.85 | 136.01
400ms | 139.26 | 131.95 | 136.01 | 134.88 | 137.32 | 136.01
500ms | 68.87 | 68.87 | 85.96 | 77.71 | 100.72 | 93.46

350ms is chosen as the most suitable beat interval for separating adjacent beats. Using a value that is too small leads to a very high BPM, whereas using a value that is too long begins to cancel out actual beats rather than just the extra peaks surrounding the onset of a beat.

9.2.4 Low Pass Filtering

The idea behind low pass filtering the signal is to remove the high frequencies and focus on the low frequencies, which we assume dictate the beat of a typical dance music track. This should also help to smooth the waveform so that the extra peaks surrounding a beat are not detected as separate beats, as illustrated above.
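A minimal smoothing step can be sketched with a single-pole IIR filter (illustrative only; this is not necessarily the filter design used in the project):

```python
# Single-pole low-pass filter: each output sample moves a fraction
# `alpha` of the way towards the input. Small alpha smooths heavily;
# alpha = 1 leaves the signal unchanged.
def low_pass(samples, alpha=0.1):
    out = []
    prev = 0.0
    for s in samples:
        prev = prev + alpha * (s - prev)
        out.append(prev)
    return out
```

Fed a step input, the output rises smoothly towards the step level instead of jumping, which is the peak-smoothing effect described above.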

Table 10: Effect of applying a low pass filter on beat detection

Low Pass Filter Cut-off Frequency | Agnelli and Nelson - Everyday [Lange Mix] (138.15) | Chakra - Home [Above & Beyond Mix] (138) | Faithless - Why Go (134.73) | Sasha & Emerson - Scorchio (134.91) | William Orbit - Ravels Pavane [FC Remix] (137.85) | Xstasia - Sweetness [Michael Woods Remix] (136)
No Low Pass Filter | 138.51 | 134.34 | 136.73 | 134.73 | 138.85 | 136.01
10000Hz | 138.51 | 134.94 | 137.21 | 134.65 | 139.21 | 136.01
5000Hz | 138.19 | 142.76 | 137.21 | 134.85 | 139.21 | 136.01
2000Hz | 137.12 | 146.26 | 136.52 | 134.88 | 139.21 | 136.01
1000Hz | 137.34 | 146.26 | 136.52 | 134.76 | 140.31 | 136.01

Applying the low pass filter did not affect the results much; if anything, it reduced the quality of the BPM estimate. In the final implementation I decided not to use a low pass filter.

10. Evaluation
This section gives a critical evaluation of the finished project. It is divided into two parts, the first gives a
quantitative evaluation of the key and beat detection algorithms, the second part gives a qualitative
evaluation of all sections of the project.

10.1 Quantitative Evaluation


10.1.1 Key Detection Accuracy Test with Dance Music

The accuracy of the algorithm was tested using a test set of 30 dance music tracks. The key was found for
each of the tracks using three separate programs, MixMeister, Mixed in Key and Rapid Evolution 2. The
keys found using these programs are compared with the key detected using the key detection algorithm
developed in this project. The algorithm is judged to have detected the correct key if it is compatible with
one of the keys detected by the other three programs, according to the Camelot sound easymix system.
The keys are displayed below with their corresponding Camelot keycode and formal notation where C =
C Major, C m = C Minor, # = Sharp, b = Flat.
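The compatibility judgement can be sketched as follows (an illustrative helper of my own, encoding the easymix rule that two Camelot codes mix harmonically if they are identical, relative major/minor of each other, or one step apart on the wheel with the same letter):

```python
# Camelot 'easymix' compatibility check for codes like '4A' or '12B'.
def camelot_compatible(a, b):
    na, la = int(a[:-1]), a[-1]  # e.g. '12A' -> 12, 'A'
    nb, lb = int(b[:-1]), b[-1]
    if la == lb:
        # Same letter: identical or adjacent on the 12-position wheel.
        return na == nb or (na % 12) + 1 == nb or (nb % 12) + 1 == na
    # Different letter: only the relative major/minor pairing matches.
    return na == nb
```

Note that the wheel wraps around, so 12A and 1A count as adjacent.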

Table 11

Artist / Title | MixMeister | Mixed in Key | Rapid Evolution 2 | Key Detected | Correct Detection
Chris Lake & Rowan Blades - Filth | 12A / C#m | 12A / C#m | 12A / C#m | 5A / Cm | No*
BT - Mercury and solace [Quivver Mix] | 3A / A#m | 3A / A#m | 3A / A#m | 3A / A#m | Yes
Lostep - Burma [Sasha Remix] | 9A / Em | 10A / Bm | 9A / Em | 9A / Em | Yes
Energy 52 - Cafe Del Mar [Three N One Mix] | 8A / Am | 3A / A#m | 8A / Am | 7B / F | No
Dirty Bass - Emotional Soundscape | 10A / Bm | 11A / F#m | 10A / Bm | 11A / F#m | Yes
Dogzilla - Dogzilla | 5A / Cm | 6A / Gm | 5A / Cm | 5A / Cm | Yes
Agnelli and Nelson - Everyday [Lange Mix] | 8A / Am | 7A / Dm | 8A / Am | 11B / A | No**
Evoke - Arms of Loren 2001 [Ferry Corsten Remix] | 11A / F#m | 11A / F#m | 11A / F#m | 11A / F#m | Yes
Eye To Eye - Just Can't Get Enough [Lange Mix] | 4A / Fm | 4A / Fm | 4A / Fm | 4A / Fm | Yes
Floyd - Come Together [Vocal Club Mix] | 8A / Am | 7A / Dm | 8A / Am | 8B / C | Yes
Lucid - I Can't Help Myself | 4A / Fm | 4A / Fm | 4A / Fm | 4A / Fm | Yes
Chicane - Saltwater [Original mix] | 4A / Fm | 4A / Fm | 4A / Fm | 5A / Cm | Yes
Sasha & Emerson - Scorchio | 4A / Fm | 4A / Fm | 4A / Fm | 4A / Fm | Yes
Sasha - Arkham Asylum | 10A / Bm | 10A / Bm | 10A / Bm | 10A / Bm | Yes
Sasha - Magnetic North [Subsky's In your face remix] | 5A / Cm | 3A / A#m | 5A / Cm | 8B / C | No**
Quivver - She Does [Quivver Mix] | 4A / Fm | 4A / Fm | 4A / Fm | 4A / Fm | Yes
Signum - First Strike | 1A / Abm | 2A / Ebm | 1A / Abm | 4B / Ab | No**
Skip Raiders - Another Day [Perfecto Remix] | 2A / D#m | 3A / A#m | 2A / D#m | 3A / A#m | Yes
Freefall - Skydive [John Johnson Vocal Mix] | 8A / Am | 8A / Am | 8A / Am | 8A / Am | Yes
Solid Globe - North Pole | 5A / Cm | 5A / Cm | 5A / Cm | 5A / Cm | Yes
Space Manoeuvres - Stage One [Seperation Mix] | 5A / Cm | 8A / Am | 5A / Cm | 4A / Fm | Yes†
Quivver - Space Manoeuvres part 3 | 9A / Em | 10A / Bm | 9A / Em | 11B / A | No
Starparty - I'm in Love [Fc & Rs Remix] | 7A / Dm | 7A / Dm | 7A / Dm | 5A / Cm | No
Pulp Victim - The World 99 [Lange Remix] | 9A / Em | 12B / E | 9A / Em | 12A / Dbm | Yes†
Travel - Bulgarian | 8A / Am | 8A / Am | 8A / Am | 8A / Am | Yes
Warrior - If you want me [vocal club mix] | 11A / F#m | 11A / F#m | 11A / F#m | 11A / F#m | Yes
Faithless - Why Go | 10A / Bm | 10A / Bm | 10A / Bm | 2B / F# | No
William Orbit - Ravels Pavane [FC Remix] | 11A / F#m | 2B / F# | 11A / F#m | 11A / F#m | Yes
X-Cabs - Neuro 99 [X-Cabs mix] | 11A / F#m | 11A / F#m | 11A / F#m | 11A / F#m | Yes
Xstasia - Sweetness [Michael Woods Remix] | 10A / Bm | 10A / Bm | 8A / Am | 10A / Bm | Yes

The key detection identified a key compatible with at least one of the other programs in 22 out of the 30 tracks, corresponding to a 73.3% success rate. Of these 22 tracks, 20 were identified as exactly the same key as at least one of the other programs. The two others, marked † (Space Manoeuvres and Pulp Victim), were one step away from the key identified by the other programs on the Camelot easymix wheel.

Of the 8 tracks whose key was not judged to have been detected correctly, the track marked * (Chris Lake) was detected as C minor instead of C# minor, a difference of one semitone. In the 3 tracks marked ** (Agnelli and Nelson, Sasha, Signum), the detected key had the correct pitch class but the wrong mode (i.e. C major instead of C minor).

These can be considered near misses, because the detected key shares many of its notes with the key found by the other programs. Also, the other programs are biased towards minor keys, as most dance music is written in minor keys. If we include these ‘near misses’ as correct, the success rate increases to 26/30 = 87%.

10.1.2 Key Detection Accuracy Test with Classical Music

The key detection algorithm was also tested with 20 pieces of classical music, where the key of the piece is
known in advance. The results are shown in the table.

Table 12

Composer / Title | Actual Key | Key Detected | Correct
Wolfgang Amadeus Mozart - Requiem (K. 626) - Lacrimosa | D Minor | D Minor | Yes
Ludwig van Beethoven - Piano Concerto No. 5 (Op. 73) - Adagio Un Poco Mosso | B Major | B Major | Yes
Wolfgang Amadeus Mozart - Klarinettenkonzert (K. 622) - Adagio | D Major | D Major | Yes
Antonio Lucio Vivaldi - Le Quattro Stagioni (Op. 8, RV 269) - La Primavera | E Major | C# Minor (Db) | Yes†
Johann Pachelbel - Kanon In D | D Major | G Major | Yes†
Antonin Dvorak - New World Symphony (Op. 95) - Largo | Db Major | C# Major (Db) | Yes
Sergej Vassiljevitsj Rachmaninoff - Piano Concerto No. 2 (Op. 18) - Adagio Sostenuto | Begins E Major, Ends C Major | E Major | Yes
Tomaso Giovanni Albinoni - Adagio In Sol Minore | G Minor | G Minor | Yes
Ludwig van Beethoven - Symphony No. 7 (Op. 92) | A Major | C# Minor (Db) | No
Edvard Hagerup Grieg - Peer Gynt Suite No. 1 (Op. 46) - Morgenstemning | E Major | E Major | Yes
Gustav Mahler - Symphony No. 5 | F Major | F Major | Yes
Samuel Osborne Barber - Adagio For Strings | F Major | F Major | Yes
Johann Sebastian Bach - Orchestersuite Nr. 3 (BWV 1068) - Air | D Major | D# Major | No*
Johann Sebastian Bach - Toccata E Fuga (BWV 565) | D Minor | D Minor | Yes
Georg Friederich Händel - Wassermusik (HWV 348) | F Major | F Major | Yes
Wolfgang Amadeus Mozart - Requiem (K. 626) - Introitus | D Minor | D Minor | Yes
Ludwig van Beethoven - Symphony No. 6 (Op. 68) | F Major | F Major | Yes
Johann Sebastian Bach - Orchestersuite Nr. 2 (BWV 1067) - Badinerie | D Major | B Minor | Yes†

The key detection algorithm detected a compatible key in 18 out of the 20 tracks tested, giving a success rate of 90%. In the tracks marked †, the key was detected as either the relative minor of the actual key or a key a fifth away. The Camelot keycode represents this as the same number with a different letter: for Le Quattro Stagioni the actual key is 12B whereas the key detected was 12A, and for Orchestersuite Nr. 2 the actual key is 10B whereas the key detected was 10A. For Kanon in D, the key is detected as G Major (9B), a fifth below D Major (10B).

For Orchestersuite Nr. 3 by Bach, the key detected was a semitone different from the actual key.

In the case where a piece of music changes key often, as in Piano Concerto No. 2, which begins in E
Major and ends in C Major, the key detection will pick up on the key which the piece remains in for the
longest duration.

The key detection performs considerably well on classical music. This is because the acoustic instruments used give off clearer harmonic overtones, making it more obvious which chords are being played. Also, unlike in the dance music tracks, there is no dominant bass line or bass drum to affect the algorithm's accuracy.

10.1.3 Beat Detection Accuracy Test

The accuracy of the beat detection algorithm was tested by finding the BPM using four other programs
(MixMeister, Mixed in Key, Rapid Evolution 2 and Traktor DJ Studio). The average of these four BPM
estimates is compared against the BPM detected by the algorithm.
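The comparison can be expressed as a small helper (illustrative; the row values below are taken from the Sasha & Emerson - Scorchio entry of the table):

```python
# Average the four reference BPMs, then report the signed difference of
# the detected value and the difference as a percentage of the average.
def bpm_difference(references, detected):
    avg = sum(references) / len(references)
    diff = detected - avg
    return round(avg, 2), round(diff, 2), round(abs(diff) / avg * 100, 2)

print(bpm_difference([135.00, 135.02, 134.80, 134.80], 134.73))
```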

Table 13

Artist / Title | MixMeister | Mixed in Key | Rapid Evolution 2 | Traktor DJ Studio | Average | BPM Detected | Difference
Chris Lake & Rowan Blades - Filth | 133.60 | 133.62 | 133.80 | 133.68 | 133.67 | 132.52 | -1.16 (0.87%)
BT - Mercury and solace [Quivver Mix] | 131.90 | 132.02 | 131.80 | 131.95 | 131.92 | 129.21 | -2.71 (2.06%)
Lostep - Burma [Sasha Remix] | 130.00 | 129.94 | 130.00 | 130.10 | 130.01 | 166.72 | 36.71 (28.23%)
Energy 52 - Cafe Del Mar [Three N One Mix] | 133.00 | 133.00 | 133.20 | 132.96 | 133.04 | 132.52 | -0.52 (0.39%)
Dirty Bass - Emotional Soundscape | 141.10 | 140.90 | 141.00 | 141.16 | 141.04 | 139.68 | -1.36 (0.96%)
Dogzilla - Dogzilla | 135.00 | 135.20 | 135.20 | 134.83 | 135.06 | 135.92 | 0.87 (0.64%)
Agnelli and Nelson - Everyday [Lange Mix] | 138.10 | 138.66 | 137.90 | 137.95 | 138.15 | 138.51 | 0.36 (0.26%)
Evoke - Arms of Loren 2001 [Ferry Corsten Remix] | 137.80 | 137.91 | 137.90 | 137.67 | 137.82 | 137.21 | -0.61 (0.44%)
Eye To Eye - Just Can't Get Enough [Lange Mix] | 137.80 | 137.80 | 137.60 | 137.81 | 137.75 | 139.68 | 1.93 (1.40%)
Floyd - Come Together [Vocal Club Mix] | 140.00 | 140.03 | 140.10 | 139.97 | 140.02 | 139.68 | -0.34 (0.25%)
Free Radical - Surreal [En Motion Remix] | 138.80 | 139.00 | 138.50 | 138.62 | 138.73 | 140.56 | 1.83 (1.32%)
Chakra - Home [Above & Beyond Mix] | 138.00 | 137.82 | 138.30 | 137.88 | 138.00 | 136.30 | -1.70 (1.23%)
Lucid - I Can't Help Myself | 131.00 | 130.94 | 131.10 | 130.85 | 130.97 | 138.02 | 7.05 (5.38%)
Chicane - Saltwater [Original mix] | 131.00 | 130.84 | 131.20 | 131.15 | 131.05 | 134.37 | 3.33 (2.54%)
Sasha & Emerson - Scorchio | 135.00 | 135.02 | 134.80 | 134.80 | 134.91 | 134.73 | -0.17 (0.13%)
Sasha - Arkham Asylum | 126.00 | 125.80 | 126.00 | 126.07 | 125.97 | 126.05 | 0.09 (0.07%)
Sasha - Magnetic North [Subsky's In your face remix] | 130.00 | 129.99 | 130.20 | 129.99 | 130.04 | 129.31 | -0.73 (0.56%)
Quivver - She Does [Quivver Mix] | 140.80 | 70.38* | 140.70 | 140.52 | 140.67 | 139.35 | -1.32 (0.94%)
Signum - First Strike | 139.70 | 139.60 | 139.80 | 139.43 | 139.63 | 139.68 | 0.05 (0.03%)
Skip Raiders - Another Day [Perfecto Remix] | 138.00 | 137.96 | 137.90 | 138.04 | 137.97 | 139.58 | 1.60 (1.16%)
Freefall - Skydive [John Johnson Vocal Mix] | 135.00 | 135.02 | 134.90 | 135.07 | 135.00 | 139.20 | 4.20 (3.11%)
Solid Globe - North Pole | 134.80 | 134.84 | 135.00 | 134.65 | 134.82 | 135.74 | 0.92 (0.68%)
Space Manoeuvres - Stage One [Seperation Mix] | 131.90 | 131.89 | 131.80 | 132.12 | 131.93 | 132.52 | 0.59 (0.45%)
Quivver - Space Manoeuvres part 3 | 128.00 | 127.97 | 127.80 | 127.98 | 127.94 | 135.43 | 7.49 (5.86%)
Starparty - I'm in Love [Fc & Rs Remix] | 136.00 | 135.86 | 135.90 | 136.08 | 135.96 | 136.01 | 0.05 (0.03%)
Pulp Victim - The World 99 [Lange Remix] | 136.00 | 135.70 | 136.00 | 136.15 | 135.96 | 136.01 | 0.04 (0.03%)
Travel - Bulgarian | 138.20 | 138.21 | 138.10 | 138.32 | 138.21 | 139.68 | 1.47 (1.07%)
Warrior - If you want me [vocal club mix] | 134.80 | 134.79 | 134.80 | 134.98 | 134.84 | 136.73 | 1.88 (1.40%)
Faithless - Why Go | 135.00 | 134.80 | 134.80 | 134.34 | 134.73 | 136.73 | 1.99 (1.48%)
William Orbit - Ravels Pavane [FC Remix] | 137.30 | 137.44 | 137.30 | 137.75 | 137.45 | 138.85 | 1.40 (1.02%)
X-Cabs - Neuro 99 [X-Cabs mix] | 139.90 | 139.85 | 140.10 | 139.87 | 139.93 | 139.68 | -0.25 (0.18%)
Xstasia - Sweetness [Michael Woods Remix] | 136.30 | 136.14 | 136.00 | 135.78 | 136.05 | 136.01 | -0.05 (0.04%)

*Value not used in calculation of average

The beat detection algorithm has an 80% success rate of detecting a BPM within ±1.5% of the average BPM of the other four programs.

The largest difference between the detected BPM and the actual BPM was +36.71 for Lostep – Burma. The BPM is not accurate here because this track does not have any sections of consistent beats. The track is actually a break-beat track, meaning the main beat falls at irregular positions, unlike the consistent four-beats-to-the-bar tempo of most of the other tracks in the test set. However, the other programs all seem to agree closely on the BPM for this track, suggesting that the beat detection algorithm could be improved to detect beats in tracks without a consistent tempo.

The other big anomaly is Quivver – Space Manoeuvres part 3, where the BPM has been calculated as 7.49 BPM too high. Closer analysis shows that the beats are being detected correctly; however, beats are also being detected in silent areas. This can be seen towards the left hand side of Figure 35 below.

Figure 35: Sound energy variations detected as beats in silent areas of Quivver – Space Manoeuvres

Figure 36 shows another part of the same track; here beats are again being detected during a relatively silent period. The interval between the beats detected here is much shorter than in the diagram above, which could be why the BPM is mistaken for being higher than it actually is.

Figure 36: The spacing between these detected beats is closer, leading to higher BPM calculation

The problem is that the BPM calculation relies on consistent beats. If there are many silent areas in a track where the algorithm detects beats, and these detected beats are equally spaced, then this will skew the BPM in favour of a different tempo.

It is interesting that for Quivver – She Does, the BPM given by Mixed in Key is 70.38, approximately half the rate given by every other program, including my algorithm. Detecting the BPM as half or twice the true rate is a common problem, one that plagues humans as much as computers. This also shows that the algorithm can outperform some of the other programs on certain tracks.

Overall, the beat detection algorithm is very good at detecting and tracking the actual beats of a song. The calculation of the BPM estimate from these beats is not as accurate as that of the other programs tested. Sometimes the algorithm detects very slight changes in sound energy as a beat when there is no real beat; this can lead to an inaccurate BPM, usually one that is too high.

10.1.4 Performance Evaluation

The performance of the algorithms was evaluated with three tracks of short, medium and long duration. The test computer is a mobile Intel Pentium 4 processor running at 3.06GHz with 2GB of RAM.

Table 14

Artist / Title | Length | BPM Time (s) | Key Time (s)
Sasha - Arkham Asylum | 13:20 | 24.527 | 69.489
Travel - Bulgarian | 06:32 | 10.324 | 26.788
Nikkfurie - The a la menthe | 02:24 | 3.949 | 9.233

The beat detection is quite fast, even for the Sasha track, which is more than 13 minutes long. The key detection algorithm is slower than the beat detection; this is inevitable, as the analysis is much more in-depth. Calling MATLAB from C# also reduces performance slightly for the key detection algorithm.

10.2 Qualitative Evaluation


10.2.1 Automatic Beat Matching

The automatic beat matching function works as expected. However, it relies heavily on accurate BPM estimates for both tracks loaded in the decks. Synchronising a track to the tempo of another adjusts its tempo by a suitable amount, so that it equals the detected BPM of the other track. Unfortunately, not all tracks are guaranteed to have a BPM estimate accurate enough for the beat matching function to work as intended.
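The synchronisation step reduces to a playback-rate ratio (a sketch of the idea, not the program's actual code; the example BPMs are taken from Table 13):

```python
# Playback-rate multiplier that makes deck B's detected BPM match deck A's.
def sync_rate(bpm_a, bpm_b):
    return bpm_a / bpm_b

# e.g. a 134.73 BPM track synced to a 137.45 BPM track must play ~2% faster.
rate = sync_rate(137.45, 134.73)
```

The sensitivity described below follows directly from this ratio: an error of even a fraction of a BPM in either estimate compounds on every beat until the tracks audibly drift apart.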

If the BPM estimate for either track is slightly inaccurate, it does not take long for the two tracks to drift out of sync with each other and for a train-wreck mix to take over. If the two tracks do fall out of sync, it is then up to the user to alter the speed of one of the tracks to match the other. I attempted to use accurate timing functions to measure the timing of the beats of each playing track, and then to calculate which track needed speeding up or slowing down, so that the speed could be adjusted automatically. However, with the interval between beats being sensitive to even a few milliseconds of error, the timing was never accurate enough; any increase in CPU usage affected the interval being reported, so instead of timing the interval between beats I was also timing the speed at which the code was being processed.

Once the tempos of the tracks are manually adjusted, the snap-to-beat functionality comes into use and repositions both tracks at their next beat, so that it is easy to tell whether or not the two tracks are at the same tempo. The ability to skip through a track in terms of its beats allows the user to reposition a track to its next or previous beat when it is a beat behind or in front of the other track.

10.2.2 Graphical User Interface

The user interface of the program is designed for users who are experienced with either physical DJ
equipment or other existing software. These users should find the design and layout easy to use and much
less cluttered than existing DJ software. Anyone not used to this environment may find the various
buttons, sliders and visual displays overwhelming at first sight; even so, universally recognisable functions
such as play, pause, eject, mute and volume controls should be fairly intuitive.

The user interface was evaluated with users split into two categories, those that are familiar with DJ
concepts and have had previous experience of DJ software or equipment, and those users who are
familiar with computer software but unfamiliar with similar DJ software or equipment.

Both sets of users could easily carry out simple tasks such as loading and playing tracks in the decks. The
advanced users instantly recognised the layout of the software; the decks and the crossfader are fairly
standard in any DJ software. The users unfamiliar with DJ software did not understand the concept of
beat mixing and therefore did not understand the process of mixing two tracks together; even so, the
controls for adjusting tempo and pitch were intuitive and they instantly understood how to use them.
Both sets of users understood and could interact with the visual displays, and both could work out how
to detect the key of a track.

Nielsen developed ten heuristics by which a user interface can be evaluated. These are listed below with a
brief description of how well each heuristic is fulfilled by the system.

• Visibility of system status

The system always keeps users informed about what is going on, through the use of progress
indicators i.e. during beat detection/waveform generation/key detection. A separate progress
bar also shows the current CPU usage to the user.

• Match between system and the real world

The system does speak the user's language; terms such as 'Deck', 'Crossfader' and 'Mixer'
should be familiar to any budding DJ, at whom this program is primarily aimed.
The layout of the system matches that of a real-life DJ set-up, and the ability to drag the
waveform forwards and backwards mimics the ability to push a vinyl record forwards and
backwards on a real turntable.

• User control and freedom

The system does not have support for undo or redo functions. Allowing the user to be able
to cancel loading a song, or cancel detecting the key of a song which they clicked on by
accident would be a welcome addition.

• Consistency and standards

The system uses a consistent design and layout throughout. The user should be familiar with
the words and terms used to describe functions or to display information.

• Error prevention

The system provides helpful error messages upon error-prone conditions, such as pressing
the play button with no track loaded, or trying to load a track into a deck which is currently
loading another track. The system exits gracefully if it cannot process a certain file.

• Recognition rather than recall

Status bars display the track which is currently being loaded into the deck, or which is
currently being key detected, so the user does not forget which track they selected.

• Flexibility and efficiency of use

There are a few built-in accelerators which are a side effect of using Windows Forms to
produce the GUI. These allow the user to control the sliders such as tempo/pitch/volume
with the cursor keys rather than the mouse. Apart from this, the support for shortcuts to
speed up common functionality is limited.

• Aesthetic and minimalist design

The system has a simplistic yet informative design. Only relevant information is ever
displayed, and it is displayed in a clear concise manner.

• Help users recognize, diagnose, and recover from errors

The error messages are expressed in a plain language for the most part, and will explain what
the problem is to the user and how to prevent the error happening again. The only error
messages that may confuse the user are those which come from the FMOD system, but
these should only occur when a corrupt file is loaded.

• Help and documentation

The system can be used without documentation for the most part, as most of the functions it provides
are straightforward. However, when it comes to mixing two tracks together, some help explaining the
concepts will be useful to the novice user. The user guide is a brief summary of how to use the system
and its various functions, and should provide enough help to enable even novice users to use the system
effectively.

10.2.3 Pitch Shift and Time Stretching Functions

The pitch shift and time stretching functions work as they should. Adjusting the pitch slider does
adjust the pitch of the track in the way expected, without altering the tempo of the song. Adjusting the
tempo of the track does alter the tempo of the song and, with the key lock function disabled, it alters the
pitch at the same time too. The change in the key of the track is displayed to the user as the tempo
increases or decreases in steps of +/- 6%, because this change in tempo corresponds to a semitone
change in the key. With the key lock enabled, adjusting the tempo of the track keeps the pitch constant,
and so the key stays constant.
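
The ±6% figure follows from equal temperament, where a semitone corresponds to a frequency ratio of 2^(1/12) ≈ 1.0595. The relationship can be sketched as follows (illustrative Python, not the project's C# code):

```python
import math

SEMITONE_RATIO = 2 ** (1 / 12)  # ≈ 1.0595 in equal temperament

def semitone_shift(tempo_change_percent: float) -> float:
    """Key shift (in semitones) caused by a tempo change when the key
    lock is disabled, i.e. when pitch scales with playback speed."""
    ratio = 1 + tempo_change_percent / 100.0
    return 12 * math.log2(ratio)

print(semitone_shift(6.0))   # ≈ +1 semitone
print(semitone_shift(-6.0))  # ≈ -1 semitone
```

A +6% tempo change works out at just over one semitone up, and -6% at just over one semitone down, which is why the displayed key is stepped at 6% intervals.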

10.2.4 Overall Evaluation

This project was implemented primarily in C# using the FMOD library. The motivation behind using
C# was based on familiarity with the language and the speed and ease with which development of a user
interface is made possible by Windows Forms. Windows Forms provides a WYSIWYG designer in which
the developer can drag and drop various built-in controls, such as buttons and sliders, onto forms. This
aids rapid application development.

The disadvantage of using C# for such a project is that, compared to C++ or C code, it is not as fast.
This is because C# compiles to intermediate code that runs on the .NET runtime, sometimes referred to
as 'managed' code. The main libraries used in this project, MATLAB and FMOD, are developed in
C/C++, or 'unmanaged' code, which runs natively on the processor. The C# interface to these libraries
therefore suffers a slight loss of performance in the transitions between managed and unmanaged code;
had the application been developed in C++ or C, this would not have been a problem. Even so, the
application still has good performance.

FMOD was chosen because of its ease of use and the wide functionality it offers. No other freely
available library offers the freedom, performance and breadth of functions that FMOD does. Without
FMOD, the project would not have been possible.

MATLAB was used in the key detection algorithm primarily because of its support for the short-time
Fourier transform. It offers support for many other digital signal processing techniques, and if the project
were to be started again, MATLAB would probably be used to develop the beat detection algorithm as
well.

11. Conclusion
This chapter gives an appraisal of the system and discusses further work which can be undertaken to
extend the project.

11.1 Appraisal
The project's primary aim was to create a tool to help DJs perform harmonic mixing. This aim has been
fulfilled to some extent: the project provides accurate key detection and reasonably accurate beat
detection, which help the DJ select suitable tracks, and provides the functionality to mix two selected
tracks together. No other DJ software on the market today combines key detection, beat detection and
the ability to mix tracks in real time into one package.

The key detection algorithm is the main algorithm which brings together many different ideas from
current music analysis research. The short time Fourier transform used to transform the signal to the
frequency domain has proven to be a very worthy alternative to other transforms used in key detection
algorithms; the key detection has a 73% success rate on dance music, which could potentially reach 87%
with minor improvements, and a 90% success rate used with classical music. The method of using a
chroma vector with pattern matching techniques to select a key is the basis of many key detection
algorithms described in the background section, and one which performs well in this algorithm. The
weighting system used to reward the most suitable key has been shown to improve the accuracy of the
key detection result on certain tracks where there are many key candidates. Finally, I experimented
with using the harmonic product spectrum to try to remove non-harmonic overtones and improve the
accuracy of the computed chroma vector. Unfortunately, by removing some relevant harmonic
frequencies, this optimisation proved too aggressive and does not improve the results on dance
music.

The beat detection algorithm shows an 80% success rate at detecting the tempo to within +/- 1.5% of
the actual BPM. This is good enough to enable a DJ to select tracks knowing that they are within a certain
tempo range of one another; however, it is not accurate enough to perform automatic beat matching
successfully.

It gives very accurate results for songs which have areas of consistent beats; however, it does not perform
well on tracks which lack these features. The Parameter Evaluation showed that the algorithm is sensitive
to small changes in its parameters. The parameters chosen in the final implementation do not give
optimum results for every track tested, so tuning the parameters further to suit a wider variety of tracks
would increase the accuracy.

The automatic beat matching function works well when the two tracks used have accurate BPM
estimates: the function sets the tempos of the tracks equal to one another and starts the incoming track
as a downbeat occurs in the currently playing track. The only problem is that, most of the time, the BPM
estimates on which the tempo changes are based are slightly inaccurate, which causes the mix to drift out
of sync.

The usability of the tool ranked highly with users who were familiar with the concept of DJ mixing and
also users who were not so familiar. The interface fulfils most of the ten usability heuristics set out by
Nielsen as a way of analysing user interfaces.

11.2 Further Work


There are various extensions to the work outlined in the Extended Specification in the Appendix. Some
of these extensions have already been carried out in the final implementation. The following discussion
covers opportunities for further work that have arisen as a result of implementing the system.

The key detection algorithm could be improved in a number of ways. Different methods of transforming
the signal into the frequency domain, such as the discrete wavelet transform or the Constant Q
transform (26), could be investigated instead of the STFT. These may lead to more accurate chroma
vectors being produced.

Using a finely grained 36-bin chroma vector and applying a tuning algorithm as described in Harte(26),
could lead to extra improvements in the quality of the algorithm, as could extending the weighting
function; possibly to reward key candidates that came a close second or third in the correlation stage of
the algorithm.

Using a statistical approach to matching the chroma vector with key templates, such as training a Hidden
Markov Model to detect the key could also increase the predictive accuracy of the algorithm.

The actual function of the key detection algorithm could be extended beyond identifying the overall
key of the track. By splitting the song up into smaller sections, the key of each section of the song could
be found and stored. This could then be used to transcribe or 'reverse engineer' a song into a basic
score or even a symbolic audio format such as a MIDI file. Chroma vectors are currently being
used in music analysis research to identify repetitive sections of a song; for example, the chorus and verse
sections of a song could be extracted based on how the key changes throughout the song.

The beat detection algorithm could be improved by using an algorithm which converts the signal from
the time domain to the frequency domain. The song could be split up into certain frequency bands
ranging from low to high frequencies. The detection of beats could then be more selective, for example, if
we assumed that beats only occur in low frequency bands, we can filter out beats detected in the higher
frequency bands.
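
This idea can be sketched as follows (illustrative Python, not the project's C# implementation; the 0-200 Hz bass band and the doubling threshold are assumed values):

```python
import cmath

def band_energy(frame, fs, f_lo, f_hi):
    """Energy of one frame restricted to a frequency band, using a
    naive DFT (a real implementation would use an FFT)."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2):
        freq = k * fs / n
        if f_lo <= freq < f_hi:
            c = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
            energy += abs(c) ** 2
    return energy

def low_band_onsets(frames, fs, threshold=2.0):
    """Flag frames whose bass-band (0-200 Hz, an assumed cutoff) energy
    jumps to more than `threshold` times the previous frame's energy."""
    energies = [band_energy(f, fs, 0.0, 200.0) for f in frames]
    return [i for i in range(1, len(energies))
            if energies[i] > threshold * energies[i - 1] > 0]
```

Beats detected this way ignore energy spikes in the treble bands (hi-hats, vocals), which is the selectivity the paragraph above describes.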

A highly accurate beat detection algorithm, capable of detecting the tempo to within +/- 0.1% of the
actual BPM, would improve the automatic beat matching function, meaning that the beats of each track
would be guaranteed to stay in sync.

12. Bibliography
1. Camelot Sound. Harmonic-mixing.com - The History of DJ Mixing. [Online] www.harmonic-mixing.com.

2. T. Beamish. A Taxonomy of DJs - Beatmatching. [Online] August 2001. http://www.cs.ubc.ca/~tbeamish/djtaxonomy/beatmatching.html.

3. Technics. Technics Europe. [Online] 2007. http://www.panasonic-europe.com/technics/.

4. A. Cosper. Art and History of DJ Mixing. [Online] 2007. http://www.tangentsunset.com/djmixing.htm.

5. Number A Productions. Scales and Key Signatures - The Method behind the Music. [Online] 2007. http://numbera.com/musictheory/theory/scalesandkeys.aspx.

6. Camelot Sound. Harmonic-Mixing.com - The Camelot Sound Easymix System. [Online] 2007. http://www.harmonic-mixing.com/overview/easymix.mv.

7. S. Pauws. Musical key extraction from audio. Proceedings of the 5th ISMIR. 2004, pp. 96-99.

8. C. Krumhansl. Cognitive Foundations of Musical Pitch. 1990.

9. A. Sheh and D.P.W. Ellis. Chord Segmentation and Recognition using EM-Trained Hidden Markov Models. Proceedings of the 4th ISMIR. 2003, pp. 183-189.

10. T. Fujishima. Realtime chord recognition of musical sound: A system using Common Lisp Music. 1999.

11. K. Lee. Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile. International Computer Music Conference. 2006.

12. M. Goto. A Robust Predominant-F0 Estimation Method for Real-Time Detection of Melody and Bass Lines in CD Recordings. June 2000, pp. 757-760.

13. R. Walsh and D. O'Maidin. A computational model of harmonic chord recognition.

14. E.D. Scheirer. Tempo and beat analysis of acoustic musical signals. January 1998, Vol. 103, pp. 588-601.

15. A.P. Klapuri, A.J. Eronen and J.T. Astola. Analysis of the meter of acoustic musical signals. January 2006, Vol. 14, pp. 342-355.

16. G. Tzanetakis, G. Essl and P. Cook. Audio Analysis using the Discrete Wavelet Transform. September 2001.

17. F. Patin. Beat Detection Algorithms. [Online] 2007. http://www.gamedev.net/reference/programming/features/beatdetection/.

18. Native Instruments. Traktor DJ Studio. [Online] www.native-instruments.com.

19. Mix Share. Rapid Evolution. [Online] www.mixshare.com.

20. Y. Vorobyev. Mixed in Key. [Online] www.mixedinkey.com.

21. Z. Plane Development. tONaRT Key detection algorithm. [Online] www.zplane.de.

22. MixMeister Technology, LLC. MixMeister DJ Mixing Software.

23. M. Frigo and S.G. Johnson. The Design and Implementation of FFTW3. Proceedings of the IEEE. 2005, Vol. 93, No. 2, pp. 216-231.

24. Firelight Technologies. FMOD SoundSystem. [Online] www.fmod.org.

25. Hundred Miles Software. UltraID3Lib. [Online] 2007. http://home.fuse.net/honnert/hundred/?UltraID3Lib.

26. C.A. Harte and M.B. Sandler. Automatic Chord Identification Using a Quantised Chromagram. Proceedings of the 118th Convention of the Audio Engineering Society. 2005.

27. C. Bores. Introduction to DSP. [Online] www.bores.com.

28. S.M. Bernsee. DSP Dimension. [Online] www.DSPDimension.com.

29. Camelot Sound. Harmonic-Mixing. [Online] www.harmonic-mixing.com.

30. Mazurka Project. Harmonic Spectrum. [Online] 2007. http://sv.mazurka.org.uk/MzHarmonicSpectrum/.

31. B. Hollis. The Method behind the Music. [Online] www.numbera.com/musictheory/theory/.
13. Appendix
Appendix A: Introduction to Digital Signal Processing

Digital Signal Processing (DSP) is the process of manipulating a signal digitally, either to analyse it or to
create various possible effects at the output.

Music can be thought of as a real-world analogue signal. For music to be processed digitally by a
computer, it must be converted from a continuous analogue signal to a digital signal made up of a
sequence of discrete samples. This conversion is called sampling.

The analogue signal that we want to convert to a digital representation is sampled many times per second.
Each sample is simply a binary encoding of the amplitude at that sampling instant. The sampling
frequency is the number of samples obtained per second, measured in hertz (Hz). If we sample at too low
a rate, we could miss changes in amplitude that occur between the taking of samples, and we could also
mistake higher frequency signals for lower ones. This is called aliasing.

To prevent aliasing and to reconstruct the original signal completely and exactly, we must sample at a rate
that is more than double the highest frequency present in the original signal, as proven by the
Nyquist-Shannon sampling theorem. Half the sampling rate is known as the Nyquist frequency; only
frequencies below it can be represented without aliasing. Most digital audio is recorded at 44,100 Hz
(44.1 kHz), so the Nyquist frequency for this sample rate is 22,050 Hz, which is approximately the
highest frequency perceivable by the human ear.
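
Aliasing can be demonstrated numerically (an illustrative sketch with hypothetical figures): a sine above the Nyquist frequency produces exactly the same samples as a lower-frequency sine.

```python
import math

fs = 1000.0                 # sampling rate in Hz (hypothetical)
f_high = 900.0              # above the Nyquist frequency of fs / 2 = 500 Hz
f_alias = abs(f_high - fs)  # 900 Hz aliases to 100 Hz

# At the sample instants the two sines are indistinguishable:
for n in range(8):
    t = n / fs
    a = math.sin(2 * math.pi * f_high * t)
    b = math.sin(-2 * math.pi * f_alias * t)
    assert abs(a - b) < 1e-9
```

Once the samples are taken, no amount of processing can tell the 900 Hz tone from the (inverted) 100 Hz one, which is why anti-aliasing filtering must happen before sampling.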

PCM, or pulse-code modulation, is the most common method of encoding analogue audio signals in
digital form. The diagram below shows the sampling of a signal for 4-bit PCM.

Figure 37: Sampling of a signal for 4-bit PCM
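
As a rough sketch of what 4-bit PCM encoding does (illustrative Python; real audio normally uses 16 bits or more):

```python
import math

BITS = 4
LEVELS = 2 ** BITS  # 16 quantisation levels for 4-bit PCM

def quantise(x: float) -> int:
    """Map an amplitude in [-1, 1] to the nearest 4-bit code in [0, 15]."""
    code = int(round((x + 1) / 2 * (LEVELS - 1)))
    return max(0, min(LEVELS - 1, code))

# Sample one cycle of a sine wave at 8 points and encode each sample:
codes = [quantise(math.sin(2 * math.pi * n / 8)) for n in range(8)]
print(codes)
```

Each sample becomes one of only 16 integer codes, so the fewer bits per sample, the coarser the staircase approximation of the waveform.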

The following diagram displays how the FMOD library (24) stores raw PCM audio data in memory buffers
and differentiates between samples, bytes and milliseconds. In this format, it can be seen that a left-right
pair is called a sample.

Figure 38: How FMOD stores audio data

Once we have access to the audio data in digital format, we can apply many functions to manipulate it.
Filtering is a common process used to either select or suppress certain frequency ranges of the signal. A
low-pass filter removes all high frequency components from a signal above a certain bound, allowing the
low frequencies to pass through it normally. In music, high frequencies equate to the treble and low
frequencies equate to the bass. Therefore a low pass filter would return only the bass elements of a song.

A high pass filter returns a signal containing only treble frequencies above a certain bound. A band-pass
filter returns a signal containing frequencies between a specified lower and upper bound.

Filtering is made possible through a technique called convolution. Convolution takes the original signal
and slides a shifted, reversed version of a second signal (the filter kernel) across it, finding the amount of
overlap between the two signals at each shift. It can be thought of as a very general moving average. This
process is computationally expensive, as every point in the original signal has to be multiplied by every
overlapping point of the other signal.

Correlation is another function which shows how similar two signals are, and for how long they remain
similar when one is shifted with respect to the other. The idea is the same as convolution, except that the
second signal is not reversed, just shifted by a certain amount. Correlating a signal with itself is called
auto-correlation, and can be used to extract a signal from noise.
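
Both operations can be sketched directly (illustrative Python rather than the project's code); correlation simply slides one signal across the other and sums the overlapping products at each lag:

```python
def correlate(x, y):
    """Cross-correlation of two equal-length signals for each
    non-negative lag: the sum of overlapping products at that shift."""
    n = len(x)
    return [sum(x[i + lag] * y[i] for i in range(n - lag))
            for lag in range(n)]

def autocorrelate(x):
    """Correlating a signal with itself; peaks reveal periodicity,
    which is the idea behind lag-based tempo estimation."""
    return correlate(x, x)

# A signal that repeats every 4 samples shows secondary peaks at lags 4 and 8:
signal = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(autocorrelate(signal))
```

The spacing between autocorrelation peaks gives the period of the signal, which for a beat track corresponds directly to the beat interval.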

Full wave rectification takes the absolute values of each sample, so that a signal can be treated on a purely
positive amplitude scale. Half wave rectification is an example of a clipper, in which negative valued
samples of a signal are blocked, whilst positive valued samples are untouched.

One of the most important techniques used in DSP (and in almost every algorithm described in this
report) is the Fourier transform. Jean Baptiste Joseph Fourier showed that any signal can be reconstructed
by summing together many different sine waves with different frequencies, amplitudes and phases. The
discrete Fourier transform (DFT) is the form of the Fourier transform used in digital signal processing.
A computationally efficient algorithm to calculate the DFT is called the fast Fourier transform (FFT).
There are many different algorithms to calculate the FFT, the most popular being the Cooley-Tukey
algorithm.

The FFT takes a signal in the time domain as input and returns a spectrum of the frequency components
which make up the signal. Many techniques in DSP operate in the frequency domain, so the FFT is a
good way of converting a signal from the time domain to the frequency domain to enable certain
operations to be carried out. An inverse FFT is used to convert the data back into the time domain.

The short-time Fourier transform (STFT) is a specialised version of the FFT more applicable to
polyphonic, non-stationary signals such as music. Essentially, it applies the FFT to small sections of the
signal at a time, in a process called windowing.
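
The idea can be sketched in a few lines (a naive O(N²) DFT for illustration; real implementations such as FFTW (23) use the fast Cooley-Tukey algorithm). The STFT then applies the transform to successive windowed frames:

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame (O(N^2))."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def stft(signal, frame_size, hop):
    """Short-time Fourier transform sketch: slide a window along the
    signal and take the magnitude spectrum of each frame."""
    spectra = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # A Hann window reduces spectral leakage at the frame edges
        windowed = [x * 0.5 * (1 - math.cos(2 * math.pi * i / (frame_size - 1)))
                    for i, x in enumerate(frame)]
        spectra.append([abs(c) for c in dft(windowed)])
    return spectra
```

Each row of the result is the spectrum of one short slice of the signal, which is exactly the time-frequency picture the key detection algorithm builds its chroma vectors from.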

Wavelets and the discrete wavelet transform (DWT) offer an even better alternative to the STFT.
More can be read about the DWT in Audio Analysis using the Discrete Wavelet Transform (16).

More on DSP techniques in general can be found in the online introduction to DSP by Bores (27).

Appendix B: Specification
Aims of the project

The aim of this project is to analyse pieces of music to detect their tempo (beat detection) and their key
(harmonic detection). The system will work by accepting up to two music files at a time from the user,
analysing them, and returning a visualisation of the audio content with beat markers, together with an
indication of the key of the song. There should be intuitive controls to enable the user to play, pause and
stop the song, alter its volume, and pitch-shift and time-stretch it. There should also be a function which
can automatically mix together two songs based on the detected beats and keys of the songs.

The big idea of the project is to aid a DJ to perform a perfect beat and harmonic mix; to devise a program
that will enable a DJ to mix together two songs which have a similar tempo, and a similar key, so that they
‘sound good’ together.

Core Specification

The system must be able to analyse music in the following common file formats: .wav (Microsoft Wave
files), .mp3 (MPEG I/II Layer 3), .wma (Windows Media Audio format), .ogg (Ogg Vorbis format).

The system is only required to deal with music containing a prominent distinguishable beat. It is only
expected to deal with music of a similar style to that which a DJ would be playing at a night club.

The system must be able to detect the beats from the music files accurately, and provide visualisations of
those beats to the user as the music file is played. The system must be able to detect the key of the track
accurately, and display the detected key to the user.

The system must be able to play two tracks at the same time, enabling the user to alter certain properties
of each track independent of the other. The system should treat each track as a separate unit, rather like a
physical deck or turntable in a DJ set-up. There should also be a mixer unit, which stands between the
two tracks, and enables the user to mute or un-mute each track or to cross-fade one track into the other,
so that they are both audible at the same time.

The properties of each track that the user should be able to adjust are: its volume, its tempo (independent
of pitch: time-stretching), its pitch (independent of tempo: pitch-shifting), and both its tempo and pitch
together. By varying the tempo, the user is also varying the track's key; for example, a +/- 6% adjustment
in the BPM rate would cause a change of one semitone in key. This adjustment in key should also be taken
into account and updated based on the changes in tempo.

The system should be able to return some sort of visualisation of the currently playing tracks to the user.
A plot of the waveform of the currently playing track along with visual beat-markers that mark out the
start of a new beat would be very useful to a user looking to beat-match together two songs.

An oscilloscope or VU meter outlining the amplitude or volume of each track would aid the user to
detect which track needs to be turned up or down, so that when the tracks are mixed together, both
tracks can be heard, and one does not drown out the other.

The BPM and key detected by the program should be stored in a tag within the music file, such as ID3v2
format (for MP3 files). This then makes it possible for the program to indicate which other tracks would
be suitable to be mixed with a selected track. The system can indicate tracks with a compatible key to one
that is selected based on Camelot’s easymix system as described in the background section. Additionally
the system should be able to read and display common tags or metadata from files loaded such as the
artist and title of a song.

The system should be able to automatically beat-match together two tracks. First it detects the beats and
key of each track. Then it time stretches or adjusts the speeds of each track to the same speed and a
compatible key. Finally by overlaying one track on top of another at the beginning of a beat, the two
tracks should be beat-matched (the beats of each track should fall at precisely the same time), making a
seamless harmonic mix.
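
The steps above can be sketched as follows (an illustrative outline in Python; the function and parameter names are hypothetical, not the project's API):

```python
def beat_match(bpm_a: float, bpm_b: float,
               next_downbeat_a: float) -> tuple[float, float]:
    """Return (stretch ratio for track B, start time for track B).

    Track B is time-stretched so its tempo equals track A's, then
    started exactly when track A's next downbeat occurs.
    """
    stretch_ratio = bpm_a / bpm_b   # e.g. 139.68 / 136.01 ≈ 1.027
    start_time = next_downbeat_a    # align the first beats
    return stretch_ratio, start_time

ratio, start = beat_match(139.68, 136.01, next_downbeat_a=12.5)
print(f"stretch B by {(ratio - 1) * 100:.1f}% and start it at t={start}s")
```

With matching tempos and aligned downbeats, every subsequent beat of the two tracks falls at the same instant, which is the definition of a beat-matched mix.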

The project is to be written in C#.NET, making use of the power of Windows Forms for the graphical
user interface. The FMOD sound system (24) will be the main library used. This library contains various
functions that will enable the raw sound data to be extracted from the audio files and used in the
algorithms. It includes a pitch-shifting algorithm which can also be used to do time stretching. This
algorithm is based on code developed by Bernsee (28).

The graphical user interface should be clear and intuitive. Using sliders, the user should easily be able to
adjust each property of the track. Buttons with intuitive icons should make it clear to the user what their
function is.

Extended Specification

If time permits, there are various extensions to the project that could be implemented to improve the
overall functionality of the system.

Improvements to the beat detection algorithm could be investigated in order to enable the system to
detect beats in a wider range of musical styles, such as rock. The addition of cue- and loop-points gives
the user more control over the mix. Cue-points enable the user to start playback of a file from a defined
point, such as the time of the first beat. Loop-points enable the user to repeat certain regions of the song,
possibly to extend the length of a mix.

Various effects could be added to the mixer unit, such as reverb, echo and flange. Low and high-pass
filters enable the user to filter out the sounds of one track which may not fit well with that of another, e.g.
a low pass filter could be used to eliminate an irregular high-pitched hi-hat sound.

The addition of real-time recording of mixes, including applied effects, would enable the user to look
back over the mix to decide if two songs sound good together and to make pre-recorded mixing a
possibility. Finally, extending the application with an integrated file browser displaying information about
the users song library, would make it easier for the user to load new songs and generate playlists.

Appendix C: User Guide

Upon loading the program, you are presented with the main screen. This consists of two decks, Deck A
and Deck B, a Mixer containing volume controls and crossfader, and the music browser, which will
display information on the tracks in your music library.


Figure 39: The Main Screen

Loading a track into a deck

A track can be loaded in one of three different ways. You can load a track into a deck by dragging and
dropping an appropriate file from Windows Explorer onto one of the decks. You can also click the eject
button on the appropriate deck; this will open a file browser dialog where you can navigate to a music
folder and load one of your songs.

The alternative method is to use the built in music browser to navigate your computer for music folders.
Once a folder is selected, the music browser will display information about the songs in the folder such as
the artist/title/duration and BPM and Key if they have previously been detected. You can sort the tracks
in ascending or descending order of any of these categories by clicking the column header of the
appropriate category.

Figure 40: Loading Sasha - Magnetic North into Deck A

Select a track from the music browser and click the 'Load in Deck A' button to load the track into Deck
A; the same applies for Deck B. You can also right-click on the track and select the same option from the
drop-down menu.

Once the song is loaded, the deck will display the waveforms of the track: one which covers the whole
track, and a zoomed-in view which displays 6 seconds of the audio at a time. The deck displays
information about the track, such as its BPM and key, and now allows you to play and pause the track
and to adjust its pitch and tempo.

Figure 41: The Deck Control

Detecting the Key of a track

Follow the same process as described above for loading a track, but instead of selecting 'Load into
Deck A/B', choose the 'Detect Key' option.

The program will then begin to detect the key of the selected track. Once again, a progress bar indicates
the progress of the key detection process.

Once the key detection process has finished, a table showing the results will pop up, and the status bar
will display the key and keycode which will be written to the ID3 tag of the audio file.

In Figure 42 the key detected is clearly a C major.

Figure 42: Key Detection progress/results

Mixing two tracks

First load tracks into Deck A and Deck B. These tracks should be selected based on their detected BPM
and key. A track with keycode 4A can be harmonically mixed with any track with keycode 3A, 4A, 5A or
4B. The BPMs of the two tracks should not differ by much (+/- 10 BPM at most); trying to mix tracks
whose BPMs differ greatly will not usually sound good. In Figure 43, the two tracks are X-Cabs - Neuro
99 in deck A, with a keycode of 10A and BPM of 139.68, which is going to be mixed into Xstasia -
Sweetness, which has a compatible keycode of 11A and BPM of 136.01.
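
The compatibility rule can be sketched as a small helper (illustrative Python; the parsing of keycodes is an assumption based on the Camelot notation described in the background):

```python
def compatible_keycodes(keycode: str) -> set[str]:
    """Harmonically compatible Camelot keycodes: the same code, one
    step either way around the wheel (wrapping 12 -> 1), or the A/B swap."""
    number, letter = int(keycode[:-1]), keycode[-1]
    up = number % 12 + 1
    down = (number - 2) % 12 + 1
    other = "B" if letter == "A" else "A"
    return {f"{number}{letter}", f"{up}{letter}",
            f"{down}{letter}", f"{number}{other}"}

print(sorted(compatible_keycodes("4A")))  # ['3A', '4A', '4B', '5A']
```

This reproduces the rule quoted above: a 4A track mixes with 3A, 4A, 5A or 4B, and the wheel wraps so that 12A is compatible with 1A.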

1. Make sure the crossfader is all the way to the left hand side and start the track in deck A.
2. Enable the sync button on the track in deck B. The sync button will sync the non-playing track to
the tempo of the playing track. In Figure 41, you can see the tempo of Xstasia in deck B has been
automatically increased by 3.67 BPM (2.7%) to match the BPM of 139.68 of X-Cabs in deck A.
3. Now cue up the track in deck B to its first downbeat, as described in the ‘Illustration of beat
mixing’ section of the background. This can be achieved by dragging the waveform until the first beat
marker is found, and aligning the beat marker with the centre of the waveform.

Figure 43: Crossfader in left hand position

4. When the playing track reaches a downbeat press play on deck B. The track will only start playing
when the next beat marker of song A passes the centre of the waveform.
5. Now both tracks are playing, but because the crossfader is in the left-hand position, only deck A's
output is audible. Drag the crossfader to the central position so that both tracks can be heard, as in
Figure 44. If the beats are in sync, the mixed output will still sound like one song. If the beats are not
matched, the beats of the two songs will clash at irregular intervals and the overall output will not make
any musical sense!


Figure 44: Crossfader in central position

6. To correct an out of sync mix, click the ‘Sync to Beat’ button on either of the decks. This will
shift the tracks to their next beat.
7. If the tracks still drift out of sync, this means they are not at exactly the same tempo. To correct
this, make minor adjustments to the tempo of the song in deck B by clicking the + / - buttons on the
tempo control, depending on whether it is slower or faster than the other playing song.
8. Move the crossfader all the way over to the right once you are finished mixing the two songs, as
in Figure 45. Now the song in deck B is playing and you can decide to load another song into deck A.

Figure 45: Crossfader in right hand position
