
Deep Layered Learning in MIR

Anders Elowsson
KTH Royal Institute of Technology

ABSTRACT

Deep learning has boosted the performance of many music information retrieval (MIR) systems in recent years. Yet, the complex hierarchical arrangement of music makes end-to-end learning hard for some MIR tasks: a very deep and structurally flexible processing chain is necessary to extract high-level features from a spectrogram representation. Mid-level representations such as tones, pitched onsets, chords, and beats are fundamental building blocks of music. This paper discusses how these can be used as intermediate representations in MIR to facilitate deep processing that generalizes well: each music concept is predicted individually in learning modules that are connected through latent representations in a directed acyclic graph. It is suggested that this strategy for inference, defined as deep layered learning (DLL), can help generalization by (1) enforcing the validity of intermediate representations during processing, and by (2) letting the inferred representations establish disentangled structures that support high-level invariant processing. A background to DLL and modular music processing is provided, and relevant concepts such as pruning, skip connections, and layered performance supervision are reviewed.

1. INTRODUCTION

Many of the skills necessary for successfully navigating modern human life are interrelated, but still so divergent that it is practically unfeasible to learn them all from one overall objective function. Although it may be possible to formulate an overarching goal for learning, progress generally stems from achieving smaller intermediate goals, gradually expanding a toolbox of concepts and procedures in a supervised and structured manner.

Music is a fitting environment for studying learning, an intricate arrangement of overlapping sounds across many dimensions. Complex sinusoids in the time domain with frequencies at integer multiples give rise to the pitch percept. Pitches are layered across frequency to form harmony, and the pitched sounds are combined with percussion to form complicated rhythmical patterns. How can we teach machines to excel at music? Consider for example the complex task of writing and performing a song with certain perceptual qualities with guidance from only a few music audio examples. Such a task requires the machine to be able to represent and understand music at a high level, something that necessitates many intermediate layers of non-linear transformations and representations. Many such relevant intermediate representations are not perceptually well-defined, and to design these, human intuition only goes so far: decades of traditional research in MIR gave steady but rather slow incremental improvements. Deep learning has become an essential tool for bridging the unknown territory between input data and targets [26, 48], well suited for MIR [38]. At the same time, some (often intermediate) representations of music involving pitch, harmony, and rhythm, are well-established and fundamental to musical comprehension. Music intelligence systems may rely on such representations to partition complex MIR tasks into several supervised learning problems, thus breaking down complexity. The learning steps necessary for untangling the musical structure can be specified in a directed acyclic graph (DAG), using, for example, deep learning with musically appropriate shared weights at each individual step. This strategy for inference, defined here as deep layered learning (DLL), is starting to become prevalent in MIR systems with state-of-the-art performance.

Figure 1 provides a very high-level illustration of the rich machine learning repertoire that can be used as building blocks in DLL for MIR. An overview of these learning strategies, with examples, is provided in Sections 2.1-3, while 2.4-5 reviews modular processing and layered learning in MIR. Section 3 explores relevant concepts of DLL in MIR, and Section 4 offers closing remarks.

[Figure 1 panel labels: 2.1 Representation Learning; Deep Learning; 2.2 Transfer; 2.3 Layered Classifiers; Decision Trees; Cascading Classifiers; Classifier Chains; 3. Deep Layered; Disentangled Invariance]

Figure 1. Simplified flowgraphs for learning methodologies which are building blocks of DLL (Sections 2.1-3). The input is marked with a black line, processes with arrows (red = processes learned and later applied), representations with circles (red = learned for other processes), and annotated targets marked with y or z (red = used as output; subscripts = different aspects of the same problem). The depth of each learning step may vary (not specified).

2. BACKGROUND

2.1 Representation Learning

A common challenge in machine learning is that available data is not organized in a way that facilitates inference. Classification will become much easier after the data is transformed into a new representation. For MIR, some suitable representations can be defined beforehand from knowledge about music and auditory perception. Researchers may compute a time-frequency representation (TFR) of the audio signal, and, as will be expanded upon in this paper, extract tonal or temporal information as intermediate representations. Commonly, however, it is not known how to optimally represent the data at intermediate levels, so suitable representations (and transformations) must be discovered automatically with representation learning [3]. Oftentimes, the best such latent representations (i.e., inferred and not observed) are derived after several layers of transformations. Depth generally leads to richer representations, yielding better results for MIR [38]. As suggested by Bengio et al. [3] (p. 1800), the “hierarchical organization of explanatory factors” exploited in deep learning promotes “progressively more abstract features.”

Learning can be performed in an unsupervised way, by identifying compact representations from which the training set can be recreated [37, 81]. In MIR, this has been done for genre recognition [33] and piano transcription [54] with a deep belief network (DBN). Supervised deep learning systems can also be conceived of as performing representation learning during processing in hidden layers [26]. The ability to automatically form such intermediate representations is one of the powers of deep learning [48]. In MIR, some of the earliest deep learning systems recognized genres [49], instruments [40], and chords [39].

2.2 Transfer Learning

To successfully tackle new problems, previous knowledge from similar problems can be used [62]. In machine learning, this induction from knowledge is called transfer learning [57], which can be used to manage a task when annotated data is scarce. A typical example is to recognize facial emotions [55] from latent representations extracted with a deep learning system trained on a big set of images [46].

In MIR, several successful systems have first been trained on a large dataset with annotated tags, such as the Million Song Dataset, and then applied to smaller datasets for, e.g., genre classification [11, 32, 33, 79]. Transfer learning has also been used for predicting musical preferences [50] or playlist generation [10].

2.3 Architectures with Multiple Layered Classifiers

Many machine learning architectures have been proposed over the years involving multiple, layered classifiers, i.e., ensemble learning with a dynamic structure [26]. Early examples focused on dividing the input into nested sub-regions. Such algorithms, based on decision trees, include CART [9], ID3 [65], and MARS [24]. The strategy was extended with neural networks (NNs) to perform the classification split [30], and separate gating networks for hierarchical mixtures of experts [41] to smoothly divide the input space. The structure of such architectures can be determined with genetic algorithms [2, 34]. Although these systems have been referred to as modular NNs, they do not use modularity in the same way as DLL for MIR, described in Section 3.1. With DLL, a task is solved by subjecting all examples to the same (deep) learning modules, but the modules can also be used for other tasks.

A way to increase computation speed for classification is to use cascading classifiers [1]. With this methodology, weak classifiers are applied in a linear sequence to successively reject a larger and larger portion of the input space until (ideally) only true examples of the input space remain. A cascade of classifiers was used for face recognition back in 2001 [82], training small decision tree classifiers with AdaBoost. In visual media, they can be used for pedestrian detection [13, 83]. Another example is Google's system for address number transcription [27], where one classifier locates addresses and another transcribes them (as clarified in [26]). Cascades have also been applied in MIR, to utilize the sparse distribution of pitched onsets for polyphonic transcription [80].

A related methodology, classifier chains [68], represents an automated method for transforming a multi-label classification task into a chain of binary ones. Labels are classified in consecutive stages, with the classification result at one stage supplied together with the original features for the next stage (with a new set of labels). The benefits can be understood from the effect of depth on the structure of the learning. The labels predicted at intermediate steps can be conceived of as hidden layers in a multilayer perceptron, thereby functioning as a universal approximator [66]. Although they emulate these end-to-end networks, intermediate layers are trained with supervised learning. As will be discussed, DLL can be understood as a way to reframe complex tasks as multi-label classification problems, where labels in early layers preimpose the structural representation of the data. Classifier chains have been used in several systems for multi-label classification in MIR [31, 67, 68].

Image segmentation and object classification have recently seen many models with intricate learning architectures [90], including: the part-based R-CNN [89], convolutional neural networks (CNNs) combining bottom-up and top-down [87] or locally shifting (through a recursive neural network, RNN) [72] attention mechanisms, and the CNN tree [84] using several CNNs in a classifier chain. Notably, end-to-end DAG networks can use intermediate targets and average their gradients with those of the overall target during backpropagation, e.g., to guide the structural segmentation of tissue images during cancer classification [21]. In robotics, neuro-controllers have been trained for several subtasks to perform a target task better; described, among other things, as modular decomposition [12, 14, 77].

2.4 Modular Music Processing

Mid-level representations are very useful in MIR [5]. There is an abundance of such relevant representations covering different dimensions of music. Across the temporal dimension, researchers may predict or utilize, e.g., meter, tempo, beats, and downbeats; across frequency, e.g., fundamental frequencies (f0s), etc.
Many MIR systems have been designed in a modular fashion, utilizing musical representations to express structural properties of the data at intermediate levels. Framewise f0 activations are commonly used for pitched onset detection, but they can also be used for beat tracking [17] and for computing a refined chromagram for chord detection [52]. Various perceptual features of music such as the “speed,” rhythmic clarity, and harmonic complexity, can be predictive of music mood [23]. The computed speed can be used together with periodicity estimates and tempo estimates for beat tracking [17]. Beat estimates can be useful when trying to detect chords and chord changes [52] (together with a prior key estimation [88]), or downbeat positions [15, 45]. The opposite is also true; chord change information can be used when computing beat and downbeat positions [28, 58-60, 86]. Many systems have used harmonic-percussive source separation (HPSS) as an early processing step [20, 25, 56, 69, 70]. This interdependence of musical concepts is probably why several MIR toolboxes [16, 47, 53] have a modular design. At a higher level of processing, Schenkerian analysis [22], extended with probabilistic computational models [43], can be used to understand the hierarchical structure of compositions.

Modular music processing can be found when studying music impairments in brain-damaged patients [61], leading researchers to suggest various modules for the tonal and temporal organization. A biologically-inspired hierarchical unsupervised learning model has also been proposed [63].

2.5 MIR implementations using layered learning

Modular music processing facilitated by suitable mid-level representations, as described in Section 2.4, can easily be extended to modular learning systems by introducing several supervised learning steps. Often in MIR, one type of annotation can be restructured to facilitate learning of several perceptually relevant aspects for a task. Two such tasks are shown in Figure 2 and reviewed in this section.

For polyphonic transcription, onset and offset annotations can be used to generate annotations for framewise predictions. For beat tracking, the annotated beat positions can be used for predicting a beat activation, the tempo, and to find the most likely sequence of beats. As indicated in Figure 2, polyphonic transcription (1-3) can be used as a preprocessing step for beat tracking (4-6). For example, detected notes can be supplied to a beat activation network.

[Figure 2 panel labels: polyphonic transcription: (1) Frames, (2) Onset, (3) Offset; beat tracking: (4) Beat activation, (5) Tempo, (6) Beat sequence modeling]

Figure 2. Tasks that can be learned in sequence from one set of annotations. For polyphonic transcription (blue), systems can first learn to perform framewise predictions, then onset detection and offset detection. For beat tracking (green), systems can compute activations at beats, use these activations to infer tempo, and finally model the sequence of beats from the previous two representations.

2.5.1 Polyphonic Transcription

In polyphonic transcription, framewise f0 estimation (or partial estimation) and note transcription have been separated into more than one learning algorithm in many systems. Marolt [51] used networks of oscillators for partial tracking, and separate networks for detecting note pitches and repeated notes. Another early implementation [64] used a support vector machine (SVM) for the frame level classification, and a hidden Markov model with state priors determined from the training set, to identify onsets and offsets across time. A similar methodology was applied by Nam et al. [54], using a DBN to extract features for classification. A musical model was then applied on top of the framewise predictions in a separate study [7], using an RNN combined with a restricted Boltzmann machine.

A recent system tracks pitch onsets and uses these for framewise estimates with CNNs [35]. Another recent system uses six interconnected networks, each trained separately with supervised learning [18]. Cascading classifiers are used for framewise f0 estimation. Regions of connected f0s (corresponding to tones) are identified with tone-shift-invariant onset and offset detection. Finally, tentative notes are extracted and classified as true or false with the last network.

In some implementations, the first step has been performed with unsupervised learning of latent variable models (e.g., non-negative matrix factorization). Supervised classifiers have then been applied to refine the frame-level classification [71] or for the note detection step [78, 85].

2.5.2 Beat Tracking

Another MIR task where separate supervised learning steps can be useful is rhythm tracking (e.g., tempo estimation, beat tracking, and downbeat tracking). Some systems use one network for computing a time-varying activation curve, and then, to find the best sequence of beats, tune a few parameters for Viterbi decoding [44] or for a dynamic Bayesian network [6]. For downbeat tracking, the learning step for detecting downbeats has used beat synchronous input features, with the beats derived from a previous RNN learning layer [45]. In [20], linear regression is used to estimate the speed of the music and logistic regression is used to pick the tempo between several tentative tempi.

Some of these methods are combined in [17], a system described as using “layered learning.” The most salient periodicity is estimated by the first network and then used to subsample periodicity invariant input features for a second network computing a beat activation. The speed is estimated with a third network (regression) from global features. Finally, the tempo is computed in a fourth network, using previously computed representations as input.
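The restructuring of one set of annotations into several learning targets (Section 2.5, Figure 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited system: the frame rate, the function names `beat_targets` and `note_targets`, the median-based tempo estimate, and the piano-roll layout are all hypothetical choices made for the example.

```python
import numpy as np

FRAME_RATE = 100.0  # frames per second; an assumed value, not from the paper


def beat_targets(beat_times, n_frames, frame_rate=FRAME_RATE):
    """Turn annotated beat times (seconds) into two targets:
    a framewise beat activation (step 4) and a global tempo in BPM (step 5)."""
    beat_times = np.asarray(beat_times, dtype=float)
    activation = np.zeros(n_frames)
    idx = np.round(beat_times * frame_rate).astype(int)
    activation[idx[idx < n_frames]] = 1.0
    # Assumption: tempo is summarized by the median inter-beat interval.
    tempo_bpm = 60.0 / np.median(np.diff(beat_times))
    return activation, tempo_bpm


def note_targets(notes, n_frames, n_pitches=128, frame_rate=FRAME_RATE):
    """Turn (onset_s, offset_s, midi_pitch) annotations into three piano-roll
    targets: framewise activity (1), onsets (2), and offsets (3)."""
    frames = np.zeros((n_frames, n_pitches))
    onsets = np.zeros_like(frames)
    offsets = np.zeros_like(frames)
    for onset_s, offset_s, pitch in notes:
        first = min(int(round(onset_s * frame_rate)), n_frames - 1)
        last = min(int(round(offset_s * frame_rate)), n_frames - 1)
        frames[first:last + 1, pitch] = 1.0
        onsets[first, pitch] = 1.0
        offsets[last, pitch] = 1.0
    return frames, onsets, offsets


activation, tempo = beat_targets([0.5, 1.0, 1.5, 2.0], n_frames=250)
frames, onsets, offsets = note_targets([(0.10, 0.30, 60)], n_frames=250)
```

Each returned array could serve as the supervised target of a separate learning step, so that a single annotation effort supports several modules trained in sequence.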
3. DEEP LAYERED LEARNING IN MIR

As presented in Section 2.5, many recent systems for MIR use multiple learning steps, a few with a considerable depth in at least one learning step. The architectures and flow of representations can be described by a DAG, with learning steps trained in a sequence, starting with the most low-level. These designs are not generally motivated by the same circumstances as when multiple steps are used for transfer learning. The purpose is not to transfer procedures acquired from one dataset to another, but to use several subtasks to distill knowledge of structural properties within intermediate representations. This is a method for approaching an overarching task with multiple annotations in a way that promotes generalization (see Sections 3.2 and 4) and, also, modularity in terms of the overall architecture (Section 3.1). As stated in the introduction, the approach will be defined as deep layered learning (DLL), referencing a desirable depth of each learning step (i.e., deep learning), but also a depth of the layering of the learning steps (i.e., both “deep-layered learning” and “deep, layered learning”). The term was originally used in a headline for a shorter abstract on polyphonic transcription [19], and, as mentioned in Section 2.5, a system for beat tracking has been described as using “layered learning” [17].

There are several concepts, evident in [17, 18] and applicable to MIR, that are important to understand to build successful DLL systems (each is given here with the corresponding subsection in parentheses). The modularity of music (see Section 2.4) enables the architecture to be divided into learning modules (3.1), which are the building blocks of DLL. Each module performs a mapping between latent representations (3.4) (together with the output) of previous learning modules and an annotated target. The validity (3.2) of the target (with respect to the overarching task) for these intermediate learning modules is important; learning modules should only be employed when the module will transform the input in a direction that facilitates prediction of the overall target. If this is not the case, a (deeper) learning architecture must instead be devised within-module, reaching an even higher-level target. It may often be unfeasible to devise intermediate targets comprising all relevant information for the final prediction. Skip connections (3.5) from earlier points can therefore be employed. Such an overall strategy is especially useful if the purpose of the earlier learning module was to infer “disentangled structures” (3.3), i.e., structural components of the music facilitating invariant processing. Such invariances are hard to encode with end-to-end learning. Computation time can be reduced by pruning (3.6) the search space before the next module. The system can also benefit from applying layered performance supervision (3.7), identifying weak links.

3.1 Learning Modules

The cornerstone of DLL is the learning module, in which the relationship between input data and a target is inferred. The modules are arranged in a DAG, producing higher-level representations at each hierarchical layer. The input data at lower modules for MIR can be a TFR, whereas the subsequent modules use representations from lower-level learning modules. The predictive power of deep learning is best utilized if modules perform well-defined tasks that require several layers of transformation within each module. Learning should only be broken down into modules if the module meaningfully untangles the musical structure.

Figure 3 outlines important representations for DLL in MIR, drawing on the examples in Sections 2.4-5. The arrows of the model indicate how learning modules can be connected, starting from a TFR of the music audio. All representations have a rather clear perceptual and music-theoretical interpretation (the importance of which is discussed in Section 3.2) and have been computed with machine learning in the past. Skip connections can (and should) be used, but these are not drawn. The model focuses on source perception at lower layers, extracting harmonic and percussive audio, as well as framewise f0s (the tonal “source” of harmonics across the spectrum). Onsets and pitched notes can be estimated from these sources, or directly from the TFR. Estimated pitches facilitate key and chord detection, rhythmical processing (e.g., beats and tempo), and can be used as a precursor to pitched instrument recognition. Percussive audio and onsets can instead be used to guide the recognition of percussive instruments. As indicated, there is a strong interaction between representations covering the rhythmical organization of music. Higher-level music analysis can use a transcription of the music, which could also be used together with perceptual features for estimating mood. Genre is depicted as a rather high-level concept, contrary to the shallow architectures of many successful genre recognition systems [4, 76]. This is inspired by research suggesting that such systems rely on “irrelevant confounding factors” for classification [73-75].

[Figure 3 node labels recoverable from the extraction: Music audio; TFR; Harmonic audio; Percussive audio; Framewise f0s; Pitched notes; Onsets; Key det.; Beats; Tempo; Downbeats; Meter det.; Periodicities; Percept. features; Mood; Music Analysis; Transcription; Instr. rec. (perc.); Instr. rec. (pitched); Genre]

Figure 3. Important representations of music audio, and their hierarchical organization. There is a strong interaction between all rhythm representations (green rounded rectangle, RR), pitch-related representations (blue RR), and the vertical arrangement of pitch (purple RR). Estimates that generally are global are marked with yellow, and the output representations of HPSS are marked with red. Structural analysis performed on musical scores is marked in black (a deep multi-layered task).
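The arrangement of learning modules in a DAG, with modules handled in sequence starting from the most low-level, can be sketched with a topological sort. The module graph below is a hypothetical example loosely following a few of the representations named in Section 3.1; the names `MODULE_INPUTS`, `training_order`, and `run_modules`, and the specific connections, are assumptions for illustration only. The sketch relies on `graphlib` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical module graph: each key is a learning module and each value
# lists the representations it consumes. "tfr" is the raw input, not a module.
MODULE_INPUTS = {
    "framewise_f0": ["tfr"],
    "onsets": ["tfr", "framewise_f0"],
    "pitched_notes": ["framewise_f0", "onsets"],
    "beats": ["onsets", "pitched_notes"],
    "downbeats": ["beats", "pitched_notes"],
}


def training_order(module_inputs):
    """Return the modules in an order where every module comes after the
    modules it depends on. Raises graphlib.CycleError if the graph is not
    acyclic."""
    order = TopologicalSorter(module_inputs).static_order()
    return [name for name in order if name in module_inputs]


def run_modules(modules, module_inputs, tfr):
    """Apply the modules in dependency order, storing every representation
    so that later modules can read any earlier one."""
    reps = {"tfr": tfr}
    for name in training_order(module_inputs):
        inputs = [reps[key] for key in module_inputs[name]]
        reps[name] = modules[name](*inputs)
    return reps


# Stand-in "modules" that only add their inputs, to show the data flow.
toy_modules = {name: (lambda *xs: sum(xs) + 1) for name in MODULE_INPUTS}
representations = run_modules(toy_modules, MODULE_INPUTS, tfr=0)
```

In a real system each stand-in function would be a separately trained network, and the same ordering would determine the sequence in which the modules are trained.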
3.1.1 Parallels to Modular Programming

In computer programming, code modularity is an important design principle, allowing well-defined tasks to be abstracted into higher-level modules. This gives developers benefits with clear parallels to those offered by DLL and learning modules.

1) Modular code is reusable since many modules can be used for other projects. In DLL, each learning module can be used by all succeeding tasks in the chain.

2) When well-defined subtasks are divided into modules, it becomes easier to perform bug fixing, performance improvements, and other under-the-hood changes for each module separately. With DLL, it becomes possible to identify modules that are underperforming, through layered performance supervision (Section 3.7). Performance can be maximized for each module, for instance, by experimenting with different weight-sharing mechanisms.

3) Modular code can be developed by large teams, as team members can focus on separate code parts that interact only through pre-defined interfaces. Modular DLL systems can be developed by several research groups, as modules can interact through pre-defined representations.

3.2 Enforced Validity

Deep layered learning can be used to force the system into using certain intermediate representations of the data that the researcher knows are important, i.e., to enforce validity. This can be useful for tasks composed of rather independent subtasks [29], and also prevent overfitting, as the system cannot to the same extent make predictions by inferring complex and irrelevant relationships between the input data and the target. Irrelevant in this context refers to those relationships that will not generalize outside of the training set, as discussed in [73, 75]. For example, downbeats may be inferred from pitched onsets, beats [45], and chord changes [58-60, 86]. Such representations were presented in Figure 3. If these representations are computed with supervised learning, and then supplied to the downbeat tracker in a relevant form, it may generalize better than if it computes downbeats directly from a TFR with end-to-end deep learning. The system will not to the same extent rely on assumptions about how, e.g., the rhythmical structure covaries with vocal characteristics, instrumentation, or timbre. Such relationships may exist only in the training set and are therefore less “valid” for the task. For tasks that to a larger extent are defined from perception, DLL could be construed as enforcing perceptual validity.

There is a close connection between using layered learning modules with high validity and using weight sharing with convolutional neural networks (CNNs). In MIR, weight sharing can generally be used to encode invariant properties of music. Both approaches thereby enforce restrictions on the computations to improve generalization and allow for greater depth. Deep layered learning systems can use weight sharing within modules, but also use intermediate targets to enforce restrictions on the representations and structures that are used.

3.3 Disentangled Structures

It is important to account for invariances in music when developing an MIR system. Sometimes it is easier to encode invariant properties within musical structures when not relying on an end-to-end learning architecture. The previous learning module can infer structural components of the music as shown in Figure 4, which then are incorporated into the processing chain of subsequent modules.

For example, listeners can track tones across both pitch and time during vibrato, so that fluctuating f0s are associated with the same tone stream. A tone and its features can therefore be traced across both pitch and time [18]. Two tones with the same envelope (blue in Figure 4) can thereby be represented by the same envelope values in the next step of processing, regardless of their variation in pitch across time (red in Figure 4). By collecting information from the note attack (onsets), the sustained part (ridge across the length of the note), and decay (offsets), it also becomes possible to represent the note as a vector of fixed size, independent of note length. Another example is downbeat tracking (bottom pane in Figure 4), where initial periodicity trackers on a finer (faster) level can make the downbeat tracker's input features invariant with regard to phase, tempo, and tempo variations, as in [15, 45].

[Figure 4 panel labels: f0-ridge to note envelope (axis: Time); Beats to downbeats]

Figure 4. Disentangled structures for pitched note tracking and downbeat estimation. The extracted representation for future processing is invariant with regard to variations that are of less perceptual relevance.

3.4 Latent Representations

The latent representations in the last hidden layer of each learning module consist of various features useful for predicting the target of the module. When computing a beat activation (see Figure 2) with a feedforward network for a single time frame using input features from surrounding time frames, one neuron in the hidden layer may activate if a kick drum is present close to the processed frame, another neuron for the snare drum, and some neurons may activate if periodical musical accents intersect the present frame. Many neurons will also activate to attributes for which no clear musical terminology exists. It is reasonable to assume that these hidden layer activations can be useful, in addition to the actual output prediction, for predicting the target of the next learning module. After all, a major point of DLL is to create high-level representations for prediction. Therefore, for most steps of a DLL system, it should be beneficial to use representations from the last hidden layer of the earlier learning module when predicting a new higher-level target. In this respect, DLL is related to the layerwise training of DBNs.
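Passing the last hidden layer of one learning module, together with its output prediction, to the next module (Section 3.4) can be sketched as below. This is only an illustration under stated assumptions: the `Module` class is a one-hidden-layer stand-in with random, untrained weights, the layer sizes are arbitrary, and the helper `next_module_input` (including its optional `skip` argument for raw input features) is a hypothetical name, not part of any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)


class Module:
    """One-hidden-layer stand-in for a (deep) learning module.
    Weights are random and untrained; only the data flow is of interest."""

    def __init__(self, n_in, n_hidden, n_out):
        self.w1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.w2 = rng.standard_normal((n_hidden, n_out)) * 0.1

    def forward(self, x):
        hidden = np.tanh(x @ self.w1)                       # latent representation
        output = 1.0 / (1.0 + np.exp(-(hidden @ self.w2)))  # predicted target
        return hidden, output


def next_module_input(hidden, output, skip=None):
    """Concatenate the earlier module's hidden activations and output
    (and, optionally, earlier features) into the next module's input."""
    parts = [hidden, output]
    if skip is not None:
        parts.append(skip)
    return np.concatenate(parts, axis=-1)


# Assumed sizes: 8 frames, 40 TFR bins, 16 hidden units, 1 beat activation.
tfr = np.zeros((8, 40))
beat_module = Module(n_in=40, n_hidden=16, n_out=1)
hidden, beat_activation = beat_module.forward(tfr)

features = next_module_input(hidden, beat_activation)
features_with_skip = next_module_input(hidden, beat_activation, skip=tfr)
tempo_module = Module(n_in=features.shape[1], n_hidden=16, n_out=1)
_, tempo_estimate = tempo_module.forward(features)
```

The second module sees both the prediction and the features that led to it, which is the point made in the text: hidden activations may carry information useful for the next target even when no musical term describes them.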
3.5 Skip Connections 3.7 Layered Performance Supervision
Skip connections were introduced more than 20 years ago It becomes easier to identify strengths and weaknesses in
for feedforward networks [42]. They allow lower-level a system when various subtasks can be evaluated sepa-
representations to skip layers of processing in the network. rately. A system with three supervised learning steps may,
Recently, skip layers or “residual layers” have also been for example, first compute framewise f0s, use these to com-
proposed for deep learning (in DAGs) with CNNs [36], pute pitched notes, and finally perform chord tracking
with a gated variant used in deep “highway networks.” from note features. As annotations have already been cre-
In DLL systems, learning modules can receive input ated for each supervised learning step, the performance of
from previous modules or from the TFR via “skip connec- each subtask can be evaluated individually, assessing
tions” to cover for aspects of the data not predicted for by whether the performance is satisfactory, given the com-
later modules. As each module is trained separately, there plexity of the subtask. Such layered performance supervi-
will be no gradient flowing through such connections. The sion makes it possible to identify parts of the overall sys-
reason for using skip connections from earlier learning tem design that need to be improved. As outlined in Sec-
modules or raw input is rather obvious. As discussed in tion 2.3, when applying cascaded classifiers, performance
Section 3.2, the layered learning can be used to enforce the can be tuned to keep recall high during initial steps. The
validity of mid-level representations. Depending on the same is generally true for detection tasks in DLL. For the
mid-level target, these representations may not have fully above example of chord recognition, recall of the first step
captured all aspects relevant to the overall target. There- could be defined as the proportion of annotated onsets that
fore, it can be useful to provide representations from ear- are close to any activations of framewise f0s, and in the
lier layers or the original input when making predictions at second step depend on whether the played chord tones are
the next layer. This structure can be especially motivated detected during an annotated chord. Relevant evaluation
if the mid-level processing is not merely aimed at extract- measures may take a bit of engineering to get right.
ing relevant representations, but also to extract disentan-
gled structures as outlined in Section 3.3.
4. CLOSING REMARKS

3.6 Pruning In MIR, there are many tasks for which it is very hard to
produce a detailed formula for the solution, while it still is
Deep layered learning can be used to prune the search possible to surmise predictive mid-level features. Deep
space during run-time with a functionality inspired by cas- layered learning has been discussed as a design principle
cading classifiers (Section 2.3). Many tasks consist of de- in MIR for balancing the power of deep learning to infer
tecting local elements from a large input space. For MIR, complex relationships with the foresight of a researcher
researchers may be interested in finding music concepts acting like a teacher. In this regard, there is a parallel to the
such as tones, beats, instrumental parts, or onsets, that are role of a researcher for using prior knowledge when de-
sparsely distributed across the whole input space. In the signing the weight sharing used in CNNs.
case of polyphonic transcription, pruning can occur in sev- When the system is trained with input from the earlier
eral stages [18, 80]. For example, if the spectrogram mag- learning module and skip connections from the input spec-
nitude is low at harmonically related frequency bins and trogram, it may become over-reliant on deeper representa-
the spectrum lacks the ridge typically associated with par- tions, as these have already been fitted to the training set.
tials, it is unlikely that the bin corresponds to an f0. With a The system can then be regulated by distorting (e.g., equal-
very basic processing scheme, it is therefore possible to ization, compression, added noise) the musical excerpts
prune the search space significantly. More advanced (and randomly before training each learning module. However,
computationally expensive) models can then be applied to it should be noted that some over-reliance on deeper rep-
the bins where there is a higher probability that the bin fre- resentations may be desirable, under the assumption that
quency corresponds to an f0. Given high-resolution the enforced validity of these representations promotes
processing of 5 cents/pitch-bin across a range of 80 generalization to test sets with different characteristics.
pitches, and with 10 evaluated f0s per frame, the search As described in Section 2.3, end-to-end DAG networks
space can be reduced to around 6 ‰ of its original size, can use intermediate targets, guiding the learning (see, e.g.,
10 × 0.05⁄80 = 0.00625. After pruning across pitch, an [21, 35]). If it is not necessary to process disentangled mu-
onset activation can be computed at activated f0s across sical structures such as notes and beats (see Section 3.3),
time. Finally, only the most likely onset positions may be intermediate targets in an end-to-end DAG network can in-
processed further to evaluate if they are a correct onset of stead be used to enforce validity. In this context, it could
a note or not. Given a hop size of 0.01 seconds (s) and 20 be interesting to explore weighting schemes when combin-
evaluated notes/s, the final search space for classification ing gradients at hidden layers during backpropagation
can be reduced to almost 1/10000th of its initial size, (other than taking the mean). With high weights for the
0.05×20×0.01/80 = 0.000125. gradients of intermediate targets, the DAG network will
The described method for pruning differs from that of function more like the DLL systems described in this paper
cascading classifiers in that the input is processed to prune (or as the DBN used for chord recognition in [8]), with low
across several dimensions (and modulations of annotations weights it will perform more conventional end-to-end
as shown in Figure 2) iteratively. The resolution can be up- learning. For example, the desired contribution from chord
sampled at each pruning stage with parabolic interpol- changes to downbeat tracking could be tuned by the re-
ation; e.g., for f0 estimation [18] or for subsampling of in- searcher, discovered by parameter sweeps, or even learned
put vectors in beat tracking [17]. under gradient descent in more advanced training setups.
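The idea of weighting gradients from intermediate and final targets at a hidden layer can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from any system described in this paper: the function name combined_hidden_gradient, the squared-error losses, the linear readout v, and the single mixing weight w_mid are all hypothetical. A value of w_mid = 0.5 corresponds to taking the mean of the two gradients, w_mid = 1 uses only the intermediate target (closer to layered learning), and w_mid = 0 uses only the final target (conventional end-to-end learning).

```python
def combined_hidden_gradient(h, t_mid, v, t_final, w_mid):
    """Weighted gradient with respect to the hidden activations h.

    Hypothetical setup (assumption, for illustration only):
      - intermediate loss: sum((h - t_mid)^2), h compared with a mid-level target
      - final loss: (y - t_final)^2 with a linear readout y = sum(v * h)
      - w_mid in [0, 1] weights the intermediate gradient, 1 - w_mid the final one
    """
    y = sum(vi * hi for vi, hi in zip(v, h))
    grad_mid = [2.0 * (hi - ti) for hi, ti in zip(h, t_mid)]
    grad_final = [2.0 * (y - t_final) * vi for vi in v]
    return [w_mid * gm + (1.0 - w_mid) * gf
            for gm, gf in zip(grad_mid, grad_final)]


# Toy values: three hidden units, e.g. mid-level evidence for a music concept.
h = [0.2, 0.8, 0.1]
t_mid = [0.0, 1.0, 0.0]
v = [0.5, -0.3, 0.9]
t_final = 1.0

g_layered = combined_hidden_gradient(h, t_mid, v, t_final, w_mid=1.0)
g_end_to_end = combined_hidden_gradient(h, t_mid, v, t_final, w_mid=0.0)
g_mean = combined_hidden_gradient(h, t_mid, v, t_final, w_mid=0.5)
```

In such a setup, w_mid could be fixed by the researcher or explored in a parameter sweep; learning it under gradient descent would require additional machinery that is not sketched here.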
5. ACKNOWLEDGMENT

Thanks to Pawel Herman for many relevant comments.

6. REFERENCES

[1] E. Alpaydin and C. Kaynak, "Cascading classifiers," Kybernetika, vol. 34, no. 4, pp. 369-374, 1998.
[2] F. Azam, "Biologically inspired modular neural networks," Ph.D dissertation, Virginia Tech, 2000.
[3] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798-1828, 2013.
[4] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and adaboost for music classification," Machine learning, vol. 65, no. 2-3, pp. 473-484, 2006.
[5] S. Böck, "Event Detection in Musical Audio," Ph.D dissertation, Johannes Kepler University Linz, Linz Austria, 2016.
[6] S. Böck, F. Krebs, and G. Widmer, "Joint Beat and Downbeat Tracking with Recurrent Neural Networks," in ISMIR, 2016, pp. 255-261.
[7] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription," in Proc. of the 29th Int. Conf. on Machine Learning, 2012, pp. 1881-1888: Omnipress.
[8] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Audio Chord Recognition with Recurrent Neural Networks," in ISMIR, 2013, pp. 335-340.
[9] L. Breiman, "Classification and regression trees," 1984.
[10] K. Choi, G. Fazekas, and M. Sandler, "Towards playlist generation algorithms using rnns trained on within-track transitions," arXiv preprint arXiv:1606.02096, 2016.
[11] K. Choi, G. Fazekas, M. Sandler, and K. Cho, "Transfer learning for music classification and regression tasks," presented at the ISMIR, 2017.
[12] S. Doncieux and J.-B. Mouret, "Beyond black-box optimization: a review of selective pressures for evolutionary robotics," Evolutionary Intelligence, vol. 7, no. 2, pp. 71-93, 2014.
[13] J. Dong, J. Ge, and Y. Luo, "Nighttime pedestrian detection with near infrared using cascaded classifiers," in IEEE Int. Conf. on Image Processing, 2007, vol. 6, pp. VI-185-VI-188: IEEE.
[14] M. Duarte, S. Oliveira, and A. L. Christensen, "Hierarchical evolution of robotic controllers for complex tasks," in IEEE Int. Conf. on Development and Learning and Epigenetic Robotics (ICDL), 2012, pp. 1-6: IEEE.
[15] S. Durand, J. P. Bello, B. David, and G. Richard, "Downbeat tracking with multiple features and deep neural networks," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 409-413: IEEE.
[16] T. Eerola and P. Toiviainen, "MIR In Matlab: The MIDI Toolbox," in ISMIR, 2004.
[17] A. Elowsson, "Beat tracking with a cepstroid invariant neural network," in ISMIR, 2016, pp. 351-357.
[18] A. Elowsson, "Polyphonic Pitch Tracking with Deep Layered Learning," arXiv preprint arXiv:1804.02918, 2018.
[19] A. Elowsson and A. Friberg, "Polyphonic transcription with deep layered learning," presented at the MIREX Multiple Fundamental Frequency Estimation & Tracking, 2014.
[20] A. Elowsson and A. Friberg, "Modeling the perception of tempo," The Journal of the Acoustical Society of America, vol. 137, no. 6, pp. 3163-3177, 2015.
[21] G. Flood, "Deep learning with a dag structure for segmentation and classification of prostate cancer," M. theses, Lund University, 2016.
[22] A. Forte, "Schenker's conception of musical structure," Journal of Music Theory, vol. 3, no. 1, pp. 1-30, 1959.
[23] A. Friberg, E. Schoonderwaldt, A. Hedblad, M. Fabiani, and A. Elowsson, "Using listener-based perceptual features as intermediate representations in music information retrieval," Journal of the Ac. Soc. of Am. (JASA), vol. 136, no. 4, pp. 1951-1963, 2014.
[24] J. H. Friedman, "Multivariate adaptive regression splines," The annals of statistics, pp. 1-67, 1991.
[25] A. Gkiokas, V. Katsouros, G. Carayannis, and T. Stafylakis, "Music tempo estimation and beat tracking by applying source separation and metrical relations," in IEEE Int. Conf. on Ac., Speech and Signal Proc. (ICASSP), 2012, pp. 421-424: IEEE.
[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press Cambridge, 2016.
[27] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, "Multi-digit number recognition from street view imagery using deep convolutional neural networks," presented at the Int. Conf. on Learning Representations, 2014.
[28] M. Goto and Y. Muraoka, "Real-time rhythm tracking for drumless audio signals–chord change detection for musical decisions," in Working Notes of the IJCAI-97 Workshop on Computational Auditory Scene Analysis, 1997, pp. 135-144.
[29] Ç. Gülçehre and Y. Bengio, "Knowledge matters: Importance of prior information for optimization," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 226-257, 2016.
[30] H. Guo and S. B. Gelfand, "Classification trees with neural network feature extraction," IEEE Transactions on Neural Networks, vol. 3, no. 6, pp. 923-933, 1992.
[31] M. Haggblade, Y. Hong, and K. Kao, "Music genre
[32] P. Hamel, M. Davies, K. Yoshii, and M. Goto, "Transfer learning in MIR: Sharing learned latent representations for music audio classification and similarity," presented at the ISMIR, 2013.
[33] P. Hamel and D. Eck, "Learning Features from Music Audio with Deep Belief Networks," in ISMIR, 2010, vol. 10, pp. 339-344: Utrecht, The
[34] B. L. Happel and J. M. Murre, "Design and evolution of modular neural network architectures," Neural networks, vol. 7, no. 6-7, pp. 985-1004, 1994.
[35] C. Hawthorne et al., "Onsets and Frames: Dual-Objective Piano Transcription," arXiv preprint arXiv:1710.11153, 2017.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.
[37] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[38] E. J. Humphrey, J. P. Bello, and Y. LeCun, "Moving Beyond Feature Design: Deep Architectures and Automatic Feature Learning in Music Informatics," in ISMIR, 2012, pp. 403-408.
[39] E. J. Humphrey, T. Cho, and J. P. Bello, "Learning a robust tonnetz-space transform for automatic chord recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, 2012, pp. 453-456: IEEE.
[40] E. J. Humphrey, A. P. Glennon, and J. P. Bello, "Non-linear semantic embedding for organizing large instrument sample libraries," in Int. Conf. on Machine Learning and Applications (ICMLA), 2011, vol. 2, pp. 142-147: IEEE.
[41] M. I. Jordan and R. A. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," Neural computation, vol. 6, no. 2, pp. 181-214, 1994.
[42] B. L. Kalman and S. C. Kwasny, "High performance training of feedforward and simple recurrent networks," Neurocomputing, vol. 14, no. 1, pp. 63-83, 1997.
[43] P. B. Kirlin, "A probabilistic model of hierarchical music analysis," Ph.D dissertation, University of Massachusetts Amherst, 2014.
[44] F. Korzeniowski, S. Böck, and G. Widmer, "Probabilistic Extraction of Beat Positions from a Beat Activation Function," in ISMIR, 2014, pp. 513-
[45] F. Krebs, S. Böck, M. Dorfer, and G. Widmer, "Downbeat Tracking Using Beat Synchronous Features with Recurrent Neural Networks," in ISMIR, 2016, pp. 129-135.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097-1105.
[47] O. Lartillot, P. Toiviainen, and T. Eerola, "A matlab toolbox for music information retrieval," in Data analysis, machine learning and applications: Springer, 2008, pp. 261-268.
[48] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," nature, vol. 521, no. 7553, p. 436, 2015.
[49] T. L. Li, A. B. Chan, and A. Chun, "Automatic musical pattern feature extraction using convolutional neural network," in Proc. Int. Conf. Data Mining and Applications, 2010.
[50] D. Liang, M. Zhan, and D. P. Ellis, "Content-Aware Collaborative Music Recommendation Using Pre-trained Neural Networks," in ISMIR, 2015, pp. 295-
[51] M. Marolt, "A connectionist approach to automatic transcription of polyphonic piano music," IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 439-449, 2004.
[52] M. Mauch, "Automatic Chord Transcription from Audio Using Computational Models of Musical Context," Ph.D dissertation, 2010.
[53] B. McFee et al., "librosa: Audio and music signal analysis in python," in Proceedings of the 14th python in science conference, 2015, pp. 18-25.
[54] J. Nam, J. Ngiam, H. Lee, and M. Slaney, "A Classification-Based Polyphonic Piano Transcription Approach Using Learned Feature Representations," in ISMIR, 2011, pp. 175-180.
[55] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, "Deep learning for emotion recognition on small datasets using transfer learning," in Proc. of the 2015 ACM int. conf. on multimodal interaction, 2015, pp. 443-449.
[56] N. Ono et al., "Harmonic and percussive sound separation and its application to MIR-related tasks," in Advances in music information retrieval: Springer, 2010, pp. 213-236.
[57] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345-1359,
[58] H. Papadopoulos and G. Peeters, "Simultaneous estimation of chord progression and downbeats from an audio file," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, (ICASSP), 2008, pp. 121-124: IEEE.
[59] H. Papadopoulos and G. Peeters, "Joint estimation of chords and downbeats from an audio signal," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 138-152,
[60] G. Peeters and H. Papadopoulos, "Simultaneous beat and downbeat-tracking using a probabilistic framework: Theory and large-scale evaluation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1754-1769,
[61] I. Peretz and M. Coltheart, "Modularity of music processing," Nature neuroscience, vol. 6, no. 7, p. 688, 2003.
[62] D. N. Perkins and G. Salomon, "Transfer of learning," International encyclopedia of education, vol. 2, pp. 6452-6457, 1992.
[63] M. Pesek, A. Leonardis, and M. Marolt, "A Compositional Hierarchical Model for Music Information Retrieval," in ISMIR, 2014, pp. 131-
[64] G. E. Poliner and D. P. Ellis, "A discriminative model for polyphonic piano transcription," EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, 2006.
[65] J. R. Quinlan, "Induction of decision trees," Machine learning, vol. 1, no. 1, pp. 81-106, 1986.
[66] J. Read and J. Hollmén, "A deep interpretation of classifier chains," in Int. Symp. on Intelligent Data Analysis, 2014, pp. 251-262: Springer.
[67] J. Read, L. Martino, and D. Luengo, "Efficient Monte Carlo optimization for multi-label classifier chains," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 3457-3461:
[68] J. Read, B. Pfahringer, G. Holmes, and E. Frank, "Classifier chains for multi-label classification," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2009, pp. 254-269: Springer.
[69] H. Rump, S. Miyabe, E. Tsunoo, N. Ono, and S. Sagayama, "Autoregressive MFCC Models for Genre Classification Improved by Harmonic-percussion Separation," in ISMIR, 2010, pp. 87-92.
[70] E. M. Schmidt and Y. Kim, "Learning Rhythm And Melody Features With Deep Belief Networks," in ISMIR, 2013, pp. 21-26.
[71] R. Schramm and E. Benetos, "Automatic transcription of a cappella recordings from multiple singers," in AES International Conference on Semantic Audio, 2017.
[72] P. Sermanet, A. Frome, and E. Real, "Attention for fine-grained categorization," arXiv preprint arXiv:1412.7054, 2014.
[73] B. L. Sturm, "Classification accuracy is not enough," Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 371-406, 2013.
[74] B. L. Sturm, "The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use," arXiv preprint arXiv:1306.1461, 2013.
[75] B. L. Sturm, "A simple method to determine if a music information retrieval system is a “horse”," IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1636-1644, 2014.
[76] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on speech and audio processing, vol. 10, no. 5, pp. 293-302, 2002.
[77] J. Urzelai, D. Floreano, M. Dorigo, and M. Colombetti, "Incremental robot shaping," Connection Science, vol. 10, no. 3-4, pp. 341-360, 1998.
[78] J. J. Valero-Mas, E. Benetos, and J. M. Inesta, "Classification-based Note Tracking for Automatic Music Transcription," 2016.
[79] A. Van Den Oord, S. Dieleman, and B. Schrauwen, "Transfer learning by supervised pre-training for audio-based music classification," in ISMIR, 2014.
[80] C. G. vd Boogaart and R. Lienhart, "Note onset detection for the transcription of polyphonic piano music," in Int. Conf. on Multimedia and Expo (ICME), 2009, pp. 446-449: IEEE.
[81] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096-1103.
[82] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Computer Vision and Pattern Recognition (CVPR), 2001, vol. 1, pp. I-I: IEEE.
[83] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in Computer vision, 2003, p. 734: IEEE.
[84] Z. Wang, X. Wang, and G. Wang, "Learning fine-grained features via a CNN tree for large-scale classification," Neurocomputing, vol. 275, pp. 1231-1240, 2018.
[85] F. Weninger, C. Kirst, B. Schuller, and H.-J. Bungartz, "A discriminative approach to polyphonic piano note transcription using supervised non-negative matrix factorization," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6-10.
[86] C. White and C. Pandiscio, "Do Chord Changes Make Us Hear Downbeats?," 2015.
[87] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, "The application of two-level attention models in deep convolutional neural network for fine-grained image classification," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 842-850.
[88] V. Zenz and A. Rauber, "Automatic chord detection incorporating beat and key detection," in IEEE Int. Conf. on Signal Processing and Communications (ICSPC), 2007, pp. 1175-1178.
[89] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in European Conf. on computer vision, 2014, pp. 834-849: Springer.
[90] B. Zhao, J. Feng, X. Wu, and S. Yan, "A survey on deep learning-based fine-grained object classification and semantic segmentation," Int. Journal of Automation and Computing, vol. 14, no. 2, pp. 119-135, 2017.