Ziyi Zhao1 , Hanwei Liu1 , Song Li1 , Junwei Pang1 , Maoqing Zhang1 , Yi Qin2 , Lei
arXiv:2211.09124v1 [cs.SD] 16 Nov 2022
Abstract
Intelligent music generation, one of the most popular subfields of computational creativity, can lower the creative threshold for non-specialists and increase the efficiency of music creation. In the last five years, the quality of algorithm-based automatic music generation has increased significantly, motivated by the use of modern generative algorithms to learn the patterns implicit within a piece of music based on rule constraints or a musical corpus, thus generating music samples in various styles. Some of the available literature reviews lack a systematic benchmark of generative models and are traditional and conservative in their perspective, resulting in a vision of the future development of the field that is not deeply integrated with current rapid scientific progress. In this paper, we conduct a comprehensive survey and analysis of recent intelligent music generation techniques, provide a critical discussion, explicitly identify their respective characteristics, and present them in a general table. We first introduce how music is encoded as a stream of information along with the relevant datasets, then compare different types of generation algorithms, summarize their strengths and weaknesses, and discuss existing methods of evaluation. Finally, we study the development of artificial intelligence in composition, especially by comparing the different characteristics of music generation techniques in the East and West and analyzing the development prospects of this field.
Keywords: music generation, music coding methods, automatic composition, evaluation of generation effects
Fig. 1: The overall framework of a music generation system: symbolic or audio input representation; generating models (constraint-based and evolutionary algorithms, HMM, LSTM, GAN, VAE); output tasks (melody generation, sequence generation, audio generation, style shift); and subjective and objective evaluation.
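Read as a pipeline, the framework in Fig. 1 moves from an input representation through a generating model to an output and its evaluation. A minimal, purely illustrative sketch (every name here is a placeholder of our own, not the API of any surveyed system):

```python
from typing import Callable, Sequence

def encode(notes: Sequence[int]) -> list[int]:
    """Symbolic input representation: clamp values to MIDI note numbers 0-127."""
    return [max(0, min(127, n)) for n in notes]

def generate(seed: list[int], model: Callable[[list[int]], int],
             length: int) -> list[int]:
    """Autoregressive generation: the model predicts one next note at a time."""
    out = list(seed)
    for _ in range(length):
        out.append(model(out))
    return out

def evaluate(piece: list[int]) -> float:
    """Objective-evaluation stand-in: fraction of melodic steps within an octave."""
    ok = sum(abs(a - b) <= 12 for a, b in zip(piece, piece[1:]))
    return ok / max(1, len(piece) - 1)

seed = encode([60, 62, 64])                  # C4, D4, E4
toy_model = lambda ctx: (ctx[-1] + 2) % 128  # stand-in for an HMM/LSTM/GAN/VAE
piece = generate(seed, toy_model, length=5)
score = evaluate(piece)
```

Swapping `toy_model` for a trained sampler is precisely where the systems surveyed in this paper differ.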
naturally coherent music with long-term dependencies, including different levels of repetition and variation[44]; interpretive ability translates complex computational models into controllable interfaces for interactive performance[46]. Finally, this field can serve a variety of purposes: composers can use the technology to develop existing material and to inspire and explore new ways of making and working with music, and listeners can customise or personalise music.

The artificial intelligence composition system is usually called a Music Generation System (MGS). A few review papers on music generation systems can currently be found. Herremans et al.[32] introduced a classification of music generation systems and pointed out problems in music evaluation. Briot et al.[4] introduced music generation systems based on deep learning and put forward corresponding countermeasures for some limitations in this field. Kaliakatsos et al.[47] analyzed and predicted the research fields that promote the development of intelligent music generation within artificial intelligence; Loughran et al.[56] analyzed the categories and potential problems of music generation systems based on evolutionary computation; Plug et al.[71] introduced the latest music generation technology in the game field and analyzed its prospects and challenges. However, among these previous papers, some do not provide sufficient analysis of algorithms and models[32, 47, 56]; there is a great deal of redundancy in the text of [4]; others have a more traditional and conservative outlook on the future[32, 47]; and [71] is more focused on the application level, giving less information on general methods of music generation.

Considering the shortcomings of the previous literature, this paper provides a comprehensive analysis of the latest articles on the various research directions of music generation, pointing out their respective strengths and weaknesses and summarizing them in a clear classification based on generative methods. In order to better understand the digital character of music, we also present the two main representations of music as an information stream and list the commonly used datasets. For the evaluation of generative effects, the most representative subjective and objective evaluation methods are selected for discussion. Our work concludes with a comparative analysis of the similarities and differences between Eastern and Western research in the field of AI composition based on their different cultural characteristics, and a look at the further development of the field of music generation as it matures.

2 Overview

The most basic elements of music are the pitch of a sound, its length, its strength, and its timbre. These basic elements combine with each other to form the melody, harmony, rhythm and tone of the music. In the field of artificial intelligence composition, melody constitutes the primary target of an automatic music generation system. In most melody generation algorithms, the goal is to create a melody similar to a selected style, such as generating western folk music[63] or free jazz[49].
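The four basic elements above map naturally onto a small data structure; a hypothetical sketch (the field names are our own, not from any cited system):

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int        # pitch of the sound (MIDI note number, 0-127)
    duration: float   # length of the sound, in beats
    velocity: int     # strength of the sound (0-127)
    timbre: str       # instrument producing the sound

# A melody is an ordered sequence of such notes; harmony stacks
# simultaneous notes; rhythm emerges from the duration pattern.
melody = [Note(60, 1.0, 80, "piano"),
          Note(64, 1.0, 80, "piano"),
          Note(67, 2.0, 96, "piano")]
total_beats = sum(n.duration for n in melody)
```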
In addition to melody, harmony is another popular aspect of automatic music generation. When generating a harmony sequence, the quality of the output mainly depends on its similarity to the target style. For example, in choral harmony, this similarity is achieved by observing clearly defined voice-leading rules[28]. In popular music, the chord progression is mainly used as an accompaniment to the melody[91]. In addition, audio generation and style conversion are also popular directions in the field of intelligent composition. The overall framework of a music generation system is shown in Figure 1. It is worth noting that many historical music generation systems do not include deep neural networks or evolutionary algorithms, but generate outputs based solely on musical rule constraints. In the process of generating music, the musical content needs to be encoded in digital form as input to the algorithm, after which the intelligent algorithm is trained to eventually output different types of musical segments, such as melodies, chords, audio, style transitions, etc.

Fig. 2: The Hierarchy of Music (a segment is made up of phrases, which are made up of motives)

Music has a clear hierarchical structure in time, from motives to phrases and then to segments, as shown in Figure 2. The overall structure of the music, together with local repetitions and variants, forms the theme, rhythm and emotion of the piece. Furthermore, music usually consists of multiple voices (played by the same or different instruments); each individual voice part has a complete structure, and together they must be in harmony with each other. Given these constraints, music generation is still a very challenging problem.

3 Representation

Music can be seen as a stream of information that conveys emotions at different levels of abstraction. The input to a generative system is usually some representation of music, which has an important influence on the number of input and output nodes (variables) of the system, the correctness of training, and the quality of the generated content[4].

3.1 Specific formats

The representation of music is mainly divided into two categories: audio and symbolic. The main difference between the two is that audio is a continuous signal, while the symbolic form is a discrete signal.

3.1.1 Audio representation

The main ways in which music can be represented as audio are:
a. Waveform[68]
b. Spectrogram[58]

Among them, the spectrogram[58] can be obtained by the short-time Fourier transform of the original audio signal. The advantage of using an audio signal representation is that the original audio waveform retains the original properties of the music and can be used to produce expressive music[62]. However, this form of representation has certain limitations: music fragments in the raw audio domain are usually represented as a continuous waveform x ∈ [−1, 1]^T, where the number of samples T is the product of the audio duration and the sampling frequency (usually 16 kHz to 48 kHz), so even short music fragments generate a lot of data. Modelling the raw audio therefore leads to a strongly range-dependent system and makes it challenging for intelligent algorithms to learn the high-level semantics of music; for example, the Jukebox[14] system takes approximately nine hours to render one minute of music.

3.1.2 Symbolic representation

The main symbolic representations of music fragments are as follows:

a. MIDI events

MIDI (Musical Instrument Digital Interface) is a digital interface for musical instruments that ensures interoperability between various electronic musical instruments, software and equipment. Figure 3a shows the type of MIDI event.
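For illustration, the Note-on/Note-off idea behind Figure 3a can be sketched with plain tuples (this is a simplification of our own, not the binary MIDI message format):

```python
# Simplified Note-on/Note-off stream in the spirit of Figure 3a: each event
# is (delta_ticks, kind, note_number). Illustrative only, not real MIDI bytes.
NOTE_ON, NOTE_OFF = "on", "off"

def melody_to_events(notes, ticks_per_beat=96):
    """Turn (pitch, beats) pairs into delta-timed Note-on/Note-off events."""
    events = []
    for pitch, beats in notes:
        assert 0 <= pitch <= 127               # pitch is an integer in 0-127
        events.append((0, NOTE_ON, pitch))     # note starts immediately
        events.append((beats * ticks_per_beat, NOTE_OFF, pitch))  # ends later
    return events

# Opening of "Twinkle Twinkle Little Star": C C G G A A G(held)
twinkle = [(60, 1), (60, 1), (67, 1), (67, 1), (69, 1), (69, 1), (67, 2)]
events = melody_to_events(twinkle)
```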
This method uses two symbols to indicate the appearance of notes, Note on and Note off, marking respectively the beginning and end of the played note. In addition, the note number is used to indicate the pitch of the note, which is specified by an integer between 0 and 127. The data structure contains an incremental time value to specify the relative interval between notes. For example, Music Transformer[38], MusicVAE[74], etc. convert melody, drums, piano, etc. into MIDI events and use them as input to the training network to generate music fragments. However, this representation cannot effectively retain the concept of multiple tracks playing multiple notes at once[35].

b. Piano-roll

The conversion of a music clip to Piano-roll format begins with a one-hot encoding of the MIDI file before converting it to image form. As shown in Figure 3b, the x-axis represents the continuous time step, the y-axis represents the pitch, and the z-axis represents the audio track, where n tracks have n one-hot matrices. The Piano-roll format quantizes the notes into a matrix according to the beats. This representation is popular in music generation systems because it makes the structure of the melody easy to observe and does not require serialization of polyphonic music. Piano-roll is one of the most commonly used representations for polyphonic music; for example, Midi-VAE[5], MuseGAN[18], etc. first convert polyphonic music to a Piano-roll representation and then carry out follow-up processing. However, the biggest limitations of Piano-roll are that it cannot capture vocal and other sound details such as timbre, velocity, and expressiveness[4]; that it is difficult to distinguish long notes from repeated short notes (e.g. a whole note versus multiple eighth notes); and that the matrices contain many zero entries, resulting in low training efficiency.

c. ABC notation

ABC notation1 encodes melodies into a textual representation, as shown in Figure 3c. It was initially designed specifically for folk and traditional music originating in Western Europe, but was later widely used for other types of music. A typical example of this encoding method, which has the advantage of being concise and easy to read, is Folk-rnn[77], which takes the ABC notation of a particular folk music transcription and uses it as input to an LSTM network to generate tunes in the folk music genre. Similarly, GGA-MG[23] uses ABC notation to generate melodies. However, being able to encode only monophonic melodies limits the application of this representation.

1 http://abcnotation.com/wiki/abc:standard

Fig. 3: Representation of Music — (a) MIDI events (extract from Twinkle Twinkle Little Star); (b) Piano-roll; (c) ABC notation (extract from Scotland The Brave)

3.2 Datasets

For the currently very prevalent deep neural network-based generative systems, in addition to identifying a proper musical representation, a matching musical dataset needs to be selected as the input to the system. Popular public datasets contain music types such as multi-track, choral, piano and folk music, and representations such
as audio, MIDI events and ABC notation. Table 1 lists the types and encoding forms of several typical publicly available music datasets.

Table 1: Examples of music datasets
Dataset              Type                Format
Nsynth2              Multitrack          WAV
Lakh MIDI Dataset3   Multitrack          MIDI
JSB-Chorales4        Harmonized chorale  MIDI
Groove MIDI5         Drum                MIDI
Nottingham6          Folk tunes          ABC

Nsynth contains 305,979 notes, each with a unique pitch, timbre and envelope. For the 1,006 instruments taken from a commercial sample library, four-second single-instrument notes were generated over the full pitch range (21-108) at five different velocities (25, 50, 75, 100, 127) of a standard MIDI piano.

Lakh MIDI is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned with entries in the Million Song Dataset. It aims to facilitate large-scale music information retrieval, both symbolic (using MIDI files only) and audio content-based (using information extracted from MIDI files as annotations for matching audio files).

JSB-Chorales offers three temporal resolutions: quarter, eighth and sixteenth notes. These 'quantizations' are created by retaining the tones heard on a specified time grid. Boulanger-Lewandowski (2012) uses a temporal resolution of quarter notes. This dataset does not currently encode fermatas and is unable to distinguish between held and repeated notes.

The Groove MIDI Dataset (GMD) consists of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, rhythmically expressive drumming. The dataset contains 1,150 MIDI files and over 22,000 drum sounds.

The Nottingham Music Dataset, maintained by Eric Foxley, contains over 1,000 folk tunes stored in a special text format. Using NMD2ABC, a program written by Jay Glanville, and some Perl scripts, much of this database has been converted to ABC notation.

The above datasets consist mainly of Western music, with less attention paid to Eastern musical works. There are in fact a small number of datasets that have been collected for oriental music, but most of them are not yet publicly available on the Internet. For example, MG-VAE[59] digitised the folk songs from different regions of China recorded in the Chinese Folk Music Collection into 2,000 tracks with different regional styles in MIDI format. XiaoIce Band[91] collected Chinese popular music for generating multi-track pop music. Jiang et al.[43] collected over 1,000 Chinese karaoke songs. In addition, CH818[34] collected 818 Chinese pop songs, extracting the most emotive 30-second segment from each song and adding emotional tags to it.

4 Music Generation Systems

The quest for computers to create music has been pursued by researchers since the dawn of computing. In 1957, Lejaren Hiller produced the Illiac Suite[94], the first musical composition in human history composed entirely by a computer. David Cope (1991)[95] introduced Experiments in Musical Intelligence (EMI), which used the basic processes of pattern recognition and recombinancy to create music that successfully imitated the styles of hundreds of composers.

In the new century, from early rule-based[78] to the currently widely used deep learning-based[4] music generation systems, the generated music has gradually evolved from simple, short monophonic fragments to long polyphonic music with chords. This paper surveys the recent state of the art in music generation and classifies the systems by algorithm, genre, dataset, representation, etc. The detailed classification results are presented in Table 2; they can be simply categorized according to the purpose of generation:
• Melody generation
• Arrangement generation
• Audio generation
• Style transfer

2 https://magenta.tensorflow.org/datasets/nsynth
3 https://colinraffel.com/projects/lmd/
4 http://www.jsbchorales.net/index.shtml
5 https://magenta.tensorflow.org/datasets/groove
6 http://abc.sourceforge.net/NMD/
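Before training, pieces from datasets such as those in Table 1 are converted into the chosen input representation. A piano-roll conversion in the sense of Section 3.1.2 might be sketched as follows (a hypothetical sketch of our own, not any surveyed system's code); note how it reproduces the long-note versus repeated-note ambiguity discussed there:

```python
# Hypothetical piano-roll conversion: quantize (pitch, start_beat, end_beat)
# notes onto a binary time-step x pitch matrix (1 = note sounding).
def to_piano_roll(notes, steps_per_beat=4, n_beats=2, n_pitches=128):
    n_steps = n_beats * steps_per_beat
    roll = [[0] * n_pitches for _ in range(n_steps)]
    for pitch, start, end in notes:
        for step in range(int(start * steps_per_beat),
                          min(n_steps, int(end * steps_per_beat))):
            roll[step][pitch] = 1
    return roll

# Two eighth notes on C4 vs one quarter note on C4 yield identical rolls:
# exactly the long-vs-repeated-note ambiguity noted in Section 3.1.2.
two_eighths = to_piano_roll([(60, 0.0, 0.5), (60, 0.5, 1.0)])
one_quarter = to_piano_roll([(60, 0.0, 1.0)])
assert two_eighths == one_quarter
```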
Table 2: List of Music Generation Systems
Name Year Algorithm Application
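Pre-deep-learning systems of the kind listed in Table 2 are often Markov-based: they estimate the conditional probability of the next note given the current one from a corpus and then sample from it. A first-order sketch over a toy corpus (illustrative only, not any specific surveyed system):

```python
import random
from collections import defaultdict

def train_markov(corpus):
    """Count note-to-note transitions: P(next | current) up to normalization."""
    table = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(corpus, corpus[1:]):
        table[cur][nxt] += 1
    return table

def sample_melody(table, start, length, rng):
    """Walk the chain, sampling each next note in proportion to its count."""
    melody = [start]
    for _ in range(length - 1):
        choices = list(table[melody[-1]].items())
        if not choices:                 # dead end: repeat the last note
            melody.append(melody[-1])
            continue
        notes, counts = zip(*choices)
        melody.append(rng.choices(notes, weights=counts)[0])
    return melody

corpus = [60, 62, 64, 62, 60, 62, 64, 65, 64, 62, 60]   # toy training tune
table = train_markov(corpus)
tune = sample_melody(table, start=60, length=8, rng=random.Random(0))
```

Because the transition table is explicit, constraints are easy to attach to it, which is the controllability advantage of Markov models discussed in this section.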
of notes or chords, and the conditional probability of generating chords on a given note.

Fig. 5: Programmable Markov Music Generation System[72]

Due to the time-sequence characteristics of music fragments, the Markov model is well suited to the generation of musical elements such as melodies (note sequences)[9]. Chi-Fang Huang et al.[36] established a Markov chain transition table for the Chinese pentatonic scale by analyzing the modes of Chinese folk music. They generate the next note of the current input melody according to this table, so that the music generation system can adapt to different moods of the input. The model of the procedural music generation algorithm used in that work is shown in Figure 5. D. Williams et al.[86] also generated melodies based on emotions, with the difference that their approach used a Hidden Markov Model (HMM) for modelling, showing that emotionally relevant music can be composed with an HMM.

Compared with current deep learning architectures, the advantage of the Markov model is that it is easier to control, allowing constraints to be attached to the internal structure of the algorithm[4]. However, from a compositional perspective, Markov models may yield novel combinatorial forms for shorter sequences, but may reuse much of the corpus non-innovatively for sequences with longer structures, resulting in an excessive number of repetitive fragments.

4.3 Deep Learning Methods for Music Generation

Content generation is an extended area of deep learning. With the achievements of Google's Magenta and CTRL (Creator Technology Research Lab) in music generation, deep learning is receiving more and more attention as a music generation method. Unlike grammar-based or rule-based music generation systems, deep learning-based music generation systems can learn the distribution and relevance of samples from an arbitrary corpus of music and generate music that represents the musical style of the corpus through prediction (e.g. predicting the pitch of the next note of a melody) or classification (e.g. identifying the chords corresponding to a melody).

4.3.1 Convolutional Neural Networks

Fig. 6: WaveNet: Overview of the residual block and the entire architecture

Traditional Convolutional Neural Networks mainly process images, which lack time-sequence characteristics, and have difficulty processing temporal music; at this stage, there are therefore fewer music generation models based on convolutional neural networks. With suitable modifications to the network, however, music can be generated.

The audio generation model WaveNet[68], introduced by Google DeepMind, and its variants TimbreTron[61] (timbre conversion) and Manzelli et al.[21] (audio generation), provide effective improvements to convolutional neural networks that enable them to generate music clips; the structure of WaveNet is shown in Figure 6. WaveNet generates new music clips with high fidelity by introducing dilated causal convolutions, which predict the probability distribution of the current audio sample based on the already generated audio sequence. Dilated convolution gives the deep model a very large receptive field, thus dealing with the long-span time-dependency problem of audio generation. However, there is no significant repetitive structure in the generated music fragments, and the generated audio is less musical. Manzelli et al. improved this method by combining an LSTM (Long Short-Term Memory) network with the WaveNet architecture to generate cello-style audio clips from a given short melody: the LSTM network learns the melodic structure of different musical styles, and its output is fed to the WaveNet audio generator to provide the corresponding melodic information for the generation of the audio clip.

The WaveNet model can also be used for other types of music generation tasks. Engel et al. constructed a WaveNet-like encoder to infer hidden temporal distribution information and feed it into a WaveNet decoder, which effectively reconstructs the original audio and enables conversion between the timbres of different instruments. Similarly, TimbreTron is a musical timbre conversion system that applies 'image'-domain style transfer to a time-frequency representation of the audio signal. The timbre transfer is performed in the log-CQT domain using CycleGAN (Cycle Generative Adversarial Networks): the Constant-Q Transform of the audio signal is first calculated and its logarithmic amplitude treated as an image, and the WaveNet synthesiser is finally used to generate high-quality waveforms, enabling the transformation of instrument timbres such as cello, violin and electric guitar. In addition, Coconet used a convolutional alignment network to automate the filling of incomplete scores by modelling the tenor, baritone, bass and soprano of a Bach chorale as an eight-channel feature map, using the feature map with randomly erased notes as input to the convolutional neural network.

4.3.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a class of neural networks for processing time series, which makes them well suited to processing music clips. However, plain RNNs cannot solve the long-time-dependency problem, due to the gradient problem that occurs when processing long time-sequence data, which is prone to gradient explosion (or vanishing). LSTM, a variant of the RNN, effectively alleviates the long-time-dependency problem by introducing cell states and using three types of gates, namely input gates, forget gates and output gates, to hold and control information. LSTM networks are therefore mainly used in music generation systems in place of plain RNNs.

The combination of LSTM with other algorithms can effectively generate melodies, polyphonic music, etc. MelodyRNN (Figure 7), proposed by the Google Brain team[84], built a lookback RNN and an attention RNN in order to improve the ability of RNNs to learn long sequence structures, where the attention RNN introduced an attention mechanism to model higher-level structures in music and generate melodies. Keerti et al.[49] also introduced an attention mechanism and used dropout to reduce overfitting when generating jazz music. RL Tuner[40] used an RNN network to provide part of the reward values for a Reinforcement Learning (RL) model, with the reward consisting of two components: (1) the similarity of the next generated note to the note predicted by the reward RNN; (2) the degree of satisfaction of user-defined constraints. However, this method is based on overly simple melodic composition rules and can only generate simple monophonic melodies.

Fig. 7: A Note RNN is trained on MIDI files and supplies the initial weights for the Q-network and Target-Q-network, and final weights for the Reward RNN.[84]

Other music generation systems for melody generation include Hadjeres et al.[28], who propose an Anticipation-RNN neural network structure for generating melodies for interactive Bach-style chorales; StructureNet[63], which generates simple monophonic accompanying music based on LSTM
networks; and Makris et al.[60], who build a sep- network with a symbolic representation of the
arate LSTM network for each type of drum An music (Piano-roll) as input and an audio repre-
LSTM network was constructed with a feed for- sentation (sound spectrum map) as output. The
ward (FF) network as the condition layer, and model consists of two sub-networks: a convolu-
its input was a One-hot matrix of 26 features tional encoder/decoder structure is used to learn
including rhythm and beat number, which could the correspondence between Piano-roll and the
generate sequences of drums with untrained beat acoustic spectrogram; a multi-segment residual
numbers. network is further used to optimise the results by
For arrangement generation, Folk-rnn first adding spectral textures for overtones and tim-
used LSTM networks to model musical sequences bres, ultimately generating music clips for violin,
transcribed into ABC notation for generating folk cello and flute.
music. DeepBach[29] used a bi-directional LSTM
(Bi-LSTM) to generate Bach-style choral music 4.3.3 Generative Adversarial Nets
considering two directions in time: one aggre-
gating past information and the other aggregat-
ing future information, demonstrating that the
sequences generated by the bi-directional LSTM
were more musical. XiaoIce Band proposes an
end-to-end Chinese multi-track pop music gen-
eration framework that uses a GRU network
to handle the low-dimensional representation of
chords and obtains hidden states through encoders
Fig. 8: System diagram of the MidiNet model for
and decoders, ultimately generating music for
symbolic-domain music generation.
multi-instrument tracks. Amadeus[50] uses an
explicit duration encoding approach: multiple
audio streams represent note durations and uses
RL’s reward mechanism to improve the structure Generative Adversarial Nets (GAN)[26] is
of the generated music. Jambot[6] uses a chord a state-of-the-art method for generating high-
LSTM network to predict chord sequences, and a quality images. The idea behind the method is
polyphonic LSTM network to generate polyphonic to train two networks simultaneously: a generator
music based on the predicted chord sequences. and a discriminator, with the two lattices playing
Song From PI[12] uses a layered RNN network against each other, with the discriminator eventu-
model, where the two bottom LSTM networks ally being unable to distinguish between the truth
generate the melody and the two top LSTM and falsity of the real data and the generated data
networks generate the drums and chords. as the optimal result. Researchers have been work-
In addition, audio generation models such ing on applying this to more sequential data, such
as Performance RNN[69] and PerformanceNet[85] as audio and music. For music generation, the
are considered to be the possible future develop- problems are that:
ment directions of music generation systems[20]. (1) music is of a certain time series
Music performance is a necessary condition for (2) music is multi-track/multi-instrumental
giving meaning and feeling to music. In addition, (3) monophonic music generation cannot be used
direct synthesis of audio with a library of sound directly for polyphonic music
samples often leads to mechanical and dull results.
Performance RNN converts a MIDI file of a live Numerous researchers have improved GAN
piano piece into a musical representation of mul- networks to generate melodies, arrangements,
tiple one-hot vectors with 413 dimensions, and style transformations, etc.
the method specifies a time ”step” to a fixed size For melody generation, C-RNN-GAN[64] uses
(10ms) rather than a note time value, thus cap- LSTM networks as a generator and discrimina-
turing more expressiveness in the note timing. tor model, but the method lacks a conditional
PerformanceNet is a first attempt at score-to- generation mechanism to generate music based
audio conversion using a fully convolutional neural on a given antecedent. MidiNet[88] improves the
method by inserting the melody and chord infor- lyrics, and recursive layers to improve the accu-
mation generated in the previous stage as a condi- racy of pitch to achieve male and female voice
tional mechanism in the middle convolution layer conversion. In addition, Jin et al.[46] used an
of the generator to restrict the types of generated notes, thus improving the innovativeness of the generated monophonic jazz pieces; the diagram of this model is shown in Figure 8. JazzGAN[80] developed a GAN-based model for jazz generation, using LSTM networks to improvise monophonic jazz melodies over chord progressions, and proposed several metrics for evaluating the musical features associated with jazz. SSMGAN[42] used a GAN to generate a self-similarity matrix (SSM) representing musical self-repetition, and subsequently fed the SSM into an LSTM network to generate melodies. Yu et al.[89] first proposed a conditional LSTM-GAN for melody generation from lyrics, containing an LSTM generator and an LSTM discriminator, both with lyrics as conditional inputs.

For arrangement generation, MuseGAN proposed a GAN model with three types of generators to construct correlations between multiple tracks: independent generation within tracks, global generation between tracks, and composite generation. However, the method produced many over-fragmented notes. BinaryMuseGAN[19] improved on it by introducing binary neurons (BN) at the input of the generator, a representation that makes it easier for the discriminator to learn decision boundaries, while introducing chromatic features to detect chords and thereby reduce over-fragmented notes. LeadSheetGAN[54] uses a functional score (lead sheet) as conditional input to generate a piano-roll; because the melodic direction of a lead sheet is clearer, this approach produces results that contain more information at the musical level.

For stylistic music generation, another line of work used an LSTM network as a generator, adding music rules as a reward function in reinforcement learning to a control network based on the Actor-Critic (AC) reinforcement learning algorithm, which selects for diversity of elements and avoids the drawback of GANs generating overly similar samples, thus improving the creativity of stylistic music generation. Play as You Like[58] likewise implements style conversion between classical, jazz and pop music. For audio generation, WaveGAN[16] made the first attempt to apply GANs to the synthesis of raw waveform audio, and GANSynth[22] improved on it by generating the entire sequence in parallel rather than sequentially; it appends a one-hot pitch vector and achieves independent control of pitch and timbre based on a PGGAN network[30]. The final model is about 50,000 times faster than the standard WaveNet[68], and its generated music clips are rated better than those of both WaveNet and WaveGAN[16].

Despite this outstanding performance in music generation, GANs still have shortcomings:
(1) They are difficult to train.
(2) They are poorly interpretable, and there is still a lack of high-quality research on how GANs can model text-like data such as musical scores.

4.3.4 Variational Auto-Encoder
Variational Auto-Encoder (VAE) is essentially a compression algorithm built from an encoder and a decoder, and it has been able to analyse and generate information such as pitch, dynamics and instrumentation in polyphonic music. The use of VAE aims to solve two problems[73]: music recombination and music prediction. When the data is multimodal, however, VAE does not provide a clear mechanism for generating music, so hybrid models based on VAE are usually used, such as combinations of VAE and RNN networks.

Some typical examples are MIDI-VAE[5] (Figure 9) and MusicVAE[74]. MIDI-VAE constructs three pairs of encoders/decoders sharing a latent space to automatically reconstruct the pitch, intensity and instrumentation of a musical composition for musical style conversion; this work represents the first successful attempt to apply style conversion to a complete musical composition, but the small amount of data used for style-feature training results in short lengths of the generated music. MusicVAE uses a hierarchical decoder to improve the modelling of sequences with long-term structure, with a bidirectional RNN as encoder, which allows the model to interpolate and reconstruct well. GrooVAE[25], a variant of MusicVAE, generates drum melodies by training on a corpus of electronic drum performances recorded live.

For the generation of Eastern popular and folk music, MG-VAE[59] was the first study to apply deep generative models and adversarial training to oriental music generation, converting music into folk music with a specific regional style. Jiang et al.[43] developed a music generation model with style control based on MusicVAE: three generation models were constructed using global and local encoders in an attempt to find a suitable way to represent musical structure at the bar, phrase, and song levels, eventually generating melodies with style control.

In addition, MahlerNet[57] constructed a conditional VAE to model the distribution of latent states. Two bidirectional RNN networks (one considering music reconstruction and one considering context) constitute the encoder, while the decoder outputs duration, pitch, and instrument, realizing the use of any number of instruments to simulate polyphonic music of any length. MIDI-Sandwich2[52] proposed an RNN-based hierarchical multi-modal fusion generation VAE (MFG-VAE) network: first, multiple independent, non-weighted Binary VAEs (BVAE) extract the feature information of the instruments in different tracks; MFG-VAE is then used as the top-level model to fuse the features of the different modalities; finally, multi-track music is generated through the BVAE decoder. This method improves the refinement method of BinaryMuseGAN, transforming the binary-input problem into a multi-label classification problem and resolving two difficulties of the original scheme: distinguishing the refinement steps and descending the gradient. MuseAE[81] applied Adversarial Autoencoders to music generation for the first time. Its advantage over VAE is that, by using adversarial regularization instead of KL divergence, the latent variables can in principle be forced to follow any probability distribution, which provides greater flexibility in choosing their prior distribution. Jukebox built a system that can generate a variety of high-fidelity music in the raw audio domain, but it still requires about 9 hours to render one minute of music.

4.3.5 Transformer

A piece of music often has multiple themes, phrases or repetitions on different time scales; generating long-term music with complex structure has therefore always been a huge challenge. Music generation systems are usually based on sequential RNN (or LSTM, GRU, etc.) networks, but this mechanism brings two problems: calculating from left to right (or from right to left) limits the parallelism of the model, and part of the information is lost during the sequential calculation. This led Google to propose the classical Transformer architecture[82], which uses an attention mechanism instead of a traditional encoder-decoder framework that must build in the inherent patterns of CNNs or RNNs. The core idea of this architecture is to use attention to indicate correlations between inputs, with the following advantages:
(1) Data-parallel computation;
(2) Visualization of self-reference;
(3) Handling long-term dependence better than RNNs, with great success in tasks that require maintaining long temporal continuity.
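The attention mechanism at the core of the Transformer can be illustrated with a minimal NumPy sketch (our own toy example, not code from any surveyed system): each output position is a weighted mixture of all value vectors, with weights derived from query-key similarity, so the whole sequence can be processed in parallel. Note that the weight matrix has shape L x L, which is why memory grows quadratically with sequence length.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as in the Transformer[82].

    Q, K, V: arrays of shape (L, d). Returns the attended output (L, d)
    and the (L, L) attention-weight matrix, whose rows sum to 1.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (L, L) pairwise correlations
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy "note embeddings": a sequence of 6 positions, model dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
out, w = scaled_dot_product_attention(X, X, X)  # self-attention
print(out.shape, w.shape)
```

Inspecting `w` directly is what advantage (2) refers to: the matrix shows which earlier notes each position attends to.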
However, this model is not easy to apply directly to music generation tasks because of the space complexity of the intermediate vectors characterizing relative positions in the algorithm, which is quadratic in the sequence length. Music Transformer successfully applied this architecture to the generation of piano pieces by reducing the space complexity of these intermediate vectors to the order of the sequence length, which greatly reduced memory use, although longer generated pieces still suffer from poor musicality and confusing structure. LakhNES[15] used transfer learning to generate chiptune music, pre-training on a large-scale dataset with Transformer-XL, the latest extension of the Transformer, and then fine-tuning on a small-scale chiptune dataset. MuseNet[70] used the same network as GPT-2 (a large Transformer-based language model[8]), trained with the Transformer's recomputed and optimised kernels, to generate four-minute musical compositions with ten different instruments; however, the model does not necessarily arrange the music according to the input instruments, generating each note by calculating the probability of all possible notes and instruments. Figure 10 shows the flow chart of this Transformer self-encoder.

In addition, some scholars have combined the Transformer with other models to address problems such as the poor musicality of long generated sequences. Guan et al.[27] proposed a combination of Transformer and GAN, first using a GAN-based self-attention mechanism to extract spatial and temporal features of music to generate consistent and natural monophonic music, and then using a generative adversarial structure with multiple branches to generate multi-instrumental music with harmonic structure across all tracks. Zhang et al.[90] also used the Transformer in combination with GAN, guiding the generation of long sequences by the adversarial goal, which provides powerful regularisation that can force the Transformer to focus on learning the global and local structure of the music. In addition, Transformer VAE combines the Transformer with VAE, effectively remedying the shortcomings of both: the VAE does not handle time-series structure well, and the latent states of the Transformer are not easily interpreted. Choi et al.[11] similarly used the Transformer encoder (decoder) to obtain a global representation of the music, giving better control over aspects such as performance style and melody.

Fig. 10: The flow chart of the Transformer self-encoder: the .wav data file is first transcribed into a MIDI file using the Onsets and Frames framework, which is then encoded into a performance representation to be used as input. The output of the performance encoder is then aggregated across time and (optionally) combined with a melody embedding to produce a representation of the entire performance, which is then used by the Transformer decoder during inference.[70]

4.3.6 Evolutionary Algorithms

The field of evolutionary computing encompasses a variety of techniques and methods inspired by natural evolution. At its heart are Darwinian search algorithms based on highly abstract biological models. The advantage of applying them to music generation systems is that they can be optimized in the direction of high relative quality or adaptability, thus compressing or extending the data and processes required for the search; they provide an innovative and natural means of generating musical ideas from a set of specifiable original components and processes.[96] An EA requires three basic criteria to be considered in advance: 1. Problem domain: defining the type of music to be created; 2. Individual representation: the representation of the music, such as melody, harmony, pitch and duration; 3. Fitness measure: the fitness measure is a very important part of genetic algorithms to evaluate the goodness of a solution,
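The three design criteria above can be made concrete with a minimal genetic algorithm. The following sketch is purely illustrative and uses our own assumptions (a short diatonic melody as the problem domain, a list of MIDI pitches as the individual representation, and a hand-written fitness measure rewarding in-scale notes and small melodic steps); it is not a system from the surveyed literature.

```python
import random

# Pitch classes of the C-major scale (an illustrative fitness target).
SCALE = {0, 2, 4, 5, 7, 9, 11}

def fitness(melody):
    """Fitness measure: reward in-scale notes and stepwise motion."""
    in_scale = sum(p % 12 in SCALE for p in melody) / len(melody)
    smooth = sum(abs(a - b) <= 4 for a, b in zip(melody, melody[1:])) / (len(melody) - 1)
    return 0.7 * in_scale + 0.3 * smooth

def evolve(pop_size=30, length=16, generations=60, seed=1):
    rng = random.Random(seed)
    # Individual representation: a list of MIDI pitches (G3..G5 range).
    pop = [[rng.randint(55, 79) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)      # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:              # point mutation
                child[rng.randrange(length)] = rng.randint(55, 79)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(round(fitness(best), 3))
```

As the text notes, the fitness measure is the critical (and most subjective) component: replacing this toy scale-membership score with a learned or rule-based musical criterion changes what the search converges to.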
With the development of artificial intelligence, certain difficulties have also emerged in the rapid development of AI composition. Current music generation systems can generate higher-quality polyphonic music containing different styles and instruments in a short period of time[46]. However, constructing long time-series music fragments with a complete structure[44] and evaluating the generated music[1] have become common problems for AI composition researchers. Although the latest systems can generate longer music fragments, such as MuseNet, which can use 10 different instruments to generate 4-minute works, they are still not able to deal well with the time-series problem: as the duration of the generated fragments increases, their musicality deteriorates and their structure becomes more chaotic or more repetitive. Researchers are currently working on optimizing the generation algorithms or constraining the generated fragments with music-theory knowledge to alleviate the time-dependence problem of music generation systems.

6.2 Different characteristics of Eastern and Western music generation

From the perspective of the technical characteristics of music generation itself, unlike the mature and active artificial intelligence music generation in the West, Eastern researchers are still in their infancy in the field of intelligent composition. Although individual companies and research units keep trying, there are still relatively few studies on the generation of oriental local music. The main limitations are the differences in structure and in the development of musical thinking between Eastern and Western music[45], the lack of corpora for the various types of Eastern music, the lack of effective evaluation methods and systems, and the lack of professionals who combine basic music knowledge with engineering technology.

On the one hand, due to the different cultural backgrounds of the East and the West, there are obvious differences between Eastern traditional music and Western classical music (especially the symphony)[33]. Specifically, Eastern music has a different understanding of structural power and usually constructs single-line monophonic music that unfolds linearly, mainly in pursuit of melodic beauty; Eastern classical music (especially Chinese classical music) mainly takes the pentatonic (five-tone) scale as its main body. On the contrary, Western music is mostly multi-voice music, which mainly focuses on major and minor tonality, pays attention to the harmonic logic of the music, and pays great attention to musical form; in the melody, techniques such as repeated derivation, contrast, and expansion are used to render the atmosphere. In addition, the East and the West also have their own distinctive features in the forms of musical expression. Western music performance highlights the opposition between the subjective and the objective and is known for its profundity and seriousness. Different from the strict adherence to the score pursued by Western performers[83], traditional Eastern music emphasizes the unification of subjectivity and objectivity, and its rhythm is freer and more varied: depending on performance needs, performers are more subjective and enjoy greater freedom of performance, so that secondary creation makes up a larger part of the music. Viewed as an information stream, this increases the difficulty of labeling and analyzing traditional Eastern music data.

Considering the differences in the structure, content and performance methods of Eastern and Western music, many adjustments and optimizations of a music generation system are needed to apply it to one of the two, which reduces the transferability between Eastern and Western music generation systems.

On the other hand, there is at present little research on the generation of oriental music, which leads to a lack of music corpora for the different types of oriental music, and in particular a lack of classic, high-quality corpora. MG-VAE[59] collects folk music of different regional styles in China; XiaoIce Band[91] and CH818[34] collect Chinese pop music; and so on. Due to the lack of a high-quality oriental music corpus, a music generation system cannot effectively learn the latent characteristics of the sample music types and cannot generate
high-quality music fragments. In addition, the collection of different types of music is itself a tedious and difficult task, and the copyright issues of songs further restrict the development of Eastern scientific research in the field of music generation.

In addition, the lack of an effective method for evaluating music generation systems and the absence of a unified objective evaluation method mean that the automatically generated music clips usually require manual selection of those that are better structured and sound better. In contrast to the more extensive research on Western music genres and structures, there is less structural analysis of the different music genres of the East, and thus a lack of objective information that can be applied to evaluating the effectiveness of Eastern music generation, which limits the development of Eastern music generation techniques.

Moreover, the majority of researchers in the East focus on a single area of research, such as music or engineering technology, and there is a severe lack of cross-disciplinary expertise combining basic music knowledge with engineering technology, which greatly limits the development of intelligent music engineering in the East. Researchers need to relearn music theory, but because musical knowledge is so rich, it is extremely difficult to move from an introductory level to mastery. This lack of a foundation in music theory has led most researchers to focus mainly on optimising and improving generative algorithms, neglecting the effective role that music-theory rules can play in improving generative systems.

6.3 Trends in Artificial Intelligence Music

Both the limitations of oriental music and the commonalities between Eastern and Western music generation can serve as directions for the subsequent development and improvement of music generation systems, towards producing beautiful songs that meet individual needs and have a complete structure. By combing through the development of AI composition and the current state of research, the following trends in music generation systems emerge.

6.3.1 Music Generation Technology Moves towards Maturity

Current music generation systems mainly focus on generating a single modality, either lyrics or music. Cross-modal generation, such as the simultaneous generation of music and lyrics or of music and video, will be an important direction for the future development of artificial intelligence in music; it requires extracting and merging features across different modalities, which poses certain technical difficulties. In future development, generation systems can be standardized and improved through better music data structures, including piano-roll, MIDI events, scores and audio, as well as through more standardized test datasets and evaluation indicators. In addition, since deep learning architectures are the mainstream method of music generation systems, enhancing the interpretability and human controllability of deep networks has become one of the future directions of music generation technology. At the same time, in order to solve the time-dependence problem of music generation, generation at the segment level rather than at the note level is also a research trend of future systems.

6.3.2 Emotional expression of music generation and its coordinated development with robot performance

Music is one of the important forms of human emotional expression. Musical emotion is conceptually regarded as an expression of human emotion that is difficult to quantify, and it undergoes rich changes as the music progresses. The current degree of intelligence in music generation is at a low level: it is based more on the perspective of signal analysis, without introducing a human-like recognition system for musical emotions, without anthropomorphic compositional thinking, and with human-computer interaction limited to shallow information exchange. Therefore, integrating multi-channel information such as computer vision and computer hearing to recognise the human spectrum of musical emotions and audio expression, and then the
optimisation of the emotional dimension of music generation using techniques such as deep learning, is the main goal of the construction of future interactive intelligent composition systems.

In addition, studies such as music visualisation based on robotic systems[53] and music therapy robots with emotional interaction capabilities[79] use robots as the main body to integrate robotics and musical expression in a benign way. The future realisation of intelligent composition and collaborative performance by music robots under affective computing is a highlight of the development of the integration of intelligent music generation and performance.

7 Conclusion

The involvement of artificial intelligence in artistic practice has become a common phenomenon, and the use of AI algorithms to generate music is a very active area of research.

This paper systematically introduces the recent state of the art in music generation and gives a brief overview of the forms of music coding and the evaluation criteria. In addition, it analyses the current state of artificially intelligent composition worldwide (especially in the traditional music of the East and West) and compares the different development characteristics of the two. In particular, the opportunities and challenges of future music generation systems are analysed. It is hoped that this survey will contribute to a better understanding of the current state of research and the development trends of AI-based music generation systems.

• Availability of data and materials: Not applicable
• Code availability: Not applicable
• Authors' contributions: Methodology: Lei Wang; Formal analysis and investigation: Ziyi Zhao; Writing - original draft preparation: Ziyi Zhao, Hanwei Liu; Data curation: Junwei Pang; Writing - review and editing: Yi Qin, Song Li, Ziyi Zhao; Supervision: Qidi Wu, Lei Wang; Funding acquisition: Lei Wang

References

1. Agres, K., Forth, J., and Wiggins, G. A. (2016). Evaluation of musical creativity and musical metacreation systems. Computers in Entertainment (CIE), 14(3):1–33.

2. Avdeeff, M. (2019). Artificial intelligence & popular music: SKYGGE, flow machines, and the audio uncanny valley. In Arts, volume 8, page 130. Multidisciplinary Digital Publishing Institute.

3. Berthelot, D., Schumm, T., and Metz, L. (2017). BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.

4. Briot, J.-P., Hadjeres, G., and Pachet, F.-D. (2020). Deep Learning Techniques for Music Generation. Computational Synthesis and Creative Systems. Springer International Publishing, Cham.
International Conference on Machine Learning (pp. 2269-2279). PMLR.

26. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.

27. Guan, F., Yu, C., and Yang, S. (2019). A GAN Model With Self-attention Mechanism To Generate Multi-instruments Symbolic Music. In 2019 International Joint Conference on Neural Networks (IJCNN).

28. Hadjeres, G. and Nielsen, F. (2017). Interactive Music Generation with Positional Constraints using Anticipation-RNNs. arXiv preprint arXiv:1709.06404.

29. Hadjeres, G., Pachet, F., and Nielsen, F. (2017). DeepBach: a Steerable Model for Bach Chorales Generation. In International Conference on Machine Learning, pages 119–127. PMLR.

30. Han, C., Murao, K., Noguchi, T., Kawata, Y., Uchiyama, F., Rundo, L., Nakayama, H., and Satoh, S. (2019). Learning more with less: Conditional PGGAN-based data augmentation for brain metastases detection using highly-rough annotation on MR images. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 119–127.

31. Herremans, D. and Chew, E. (2019). MorpheuS: Generating Structured Music with Constrained Patterns and Tension. IEEE Transactions on Affective Computing, 10(4):510–523.

32. Herremans, D., Chuan, C.-H., and Chew, E. (2017). A Functional Taxonomy of Music Generation Systems. ACM Computing Surveys, 50(5):1–30.

33. Hu, X. and Lee, J. H. (2012). A Cross-cultural Study of Music Mood Perception between American and Chinese Listeners. In ISMIR (pp. 535-540).

34. Hu, X. and Yang, Y.-H. (2017). The mood of Chinese Pop music: Representation and recognition. Journal of the Association for Information Science & Technology.

35. Huang, A. and Wu, R. (2016). Deep learning for music. arXiv preprint arXiv:1606.04930.

36. Huang, C.-F., Lian, Y.-S., Nien, W.-P., and Chieng, W.-H. (2016). Analyzing the perception of Chinese melodic imagery and its application to automated composition. Multimedia Tools and Applications, 75(13):7631–7654.

37. Huang, C.-Z. A., Cooijmans, T., Roberts, A., Courville, A., and Eck, D. (2019). Counterpoint by convolution. arXiv preprint arXiv:1903.07227.

38. Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. (2018). Music transformer. arXiv preprint arXiv:1809.04281.

39. Huang, S., Li, Q., Anil, C., Oore, S., and Grosse, R. B. (2019). TimbreTron: A WaveNet (CycleGAN (CQT (audio))) pipeline for musical timbre transfer. arXiv preprint arXiv:1811.09620.

40. Jaques, N., Gu, S., Turner, R. E., and Eck, D. (2017). Tuning recurrent neural networks with reinforcement learning.

41. Jeong, J., Kim, Y., and Ahn, C. W. (2017). A multi-objective evolutionary approach to automatic melody generation. Expert Systems with Applications, 90:50–61.

42. Jhamtani, H. and Berg-Kirkpatrick, T. (2019). Modeling Self-Repetition in Music Generation using Generative Adversarial Networks. In Machine Learning for Music Discovery Workshop, ICML.

43. Jiang, J. (2019). Stylistic Melody Generation with Conditional Variational Auto-Encoder.

44. Jiang, J., Xia, G. G., Carlton, D. B., Anderson, C. N., and Miyakawa, R. H.
(2020). Transformer VAE: A Hierarchical Model for Structure-Aware and Interpretable Music Representation Learning. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 516–520.

45. Chen, J. (2015). Comparative study between Chinese and Western music aesthetics and culture.

46. Jin, C., Tie, Y., Bai, Y., Lv, X., and Liu, S. (2020). A Style-Specific Music Composition Neural Network. Neural Processing Letters, 52(3):1893–1912.

47. Kaliakatsos-Papakostas, M., Floros, A., and Vrahatis, M. N. (2020). Artificial intelligence methods for music generation: a review and future perspectives. Nature-Inspired Computation and Swarm Intelligence, pages 217–245. Elsevier.

48. Kaliakatsos-Papakostas, M. A., Floros, A., and Vrahatis, M. N. (2016). Interactive music composition driven by feature evolution. SpringerPlus, 5(1):1–38.

49. Keerti, G., Vaishnavi, A., Mukherjee, P., Vidya, A. S., Sreenithya, G. S., and Nayab, D. (2020). Attentional networks for music generation. arXiv preprint arXiv:2002.03854.

50. Kumar, H. and Ravindran, B. (2019). Polyphonic Music Composition with LSTM Neural Networks and Reinforcement Learning. arXiv preprint arXiv:1902.01973.

51. Leach, J. and Fitch, J. (1995). Nature, music, and algorithmic composition. Computer Music Journal, 19(2):23–33.

52. Liang, X., Wu, J., and Cao, J. (2019). MIDI-Sandwich2: RNN-based Hierarchical Multi-modal Fusion Generation VAE networks for multi-track symbolic music generation. arXiv preprint arXiv:1909.03522.

53. Lin, P.-C., Mettrick, D., Hung, P. C., and Iqbal, F. (2018). Towards a music visualization on robot (MVR) prototype. In 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), pages 256–257. IEEE.

54. Liu, H.-M. and Yang, Y.-H. (2018). Lead Sheet Generation and Arrangement by Conditional Generative Adversarial Network. arXiv preprint arXiv:1807.11161.

55. Lopes, H. B., Martins, F. V. C., Cardoso, R. T., and dos Santos, V. F. (2017). Combining rules and proportions: A multiobjective approach to algorithmic composition. In 2017 IEEE Congress on Evolutionary Computation (CEC), pages 2282–2289. IEEE.

56. Loughran, R. and O'Neill, M. (2020). Evolutionary music: applying evolutionary computation to the art of creating music. Genetic Programming and Evolvable Machines, 21(1):55–85.

57. Lousseief, E. and Sturm, B. L. T. (2019). MahlerNet: Unbounded Orchestral Music with Neural Networks. In the Nordic Sound and Music Computing Conference 2019 and the Interactive Sonification Workshop 2019 (pp. 57-63).

58. Lu, C.-Y., Xue, M.-X., Chang, C.-C., Lee, C.-R., and Su, L. (2019). Play as You Like: Timbre-Enhanced Multi-Modal Music Style Transfer. Proceedings of the AAAI Conference on Artificial Intelligence, 33:1061–1068.

59. Luo, J., Yang, X., Ji, S., and Li, J. (2019). MG-VAE: Deep Chinese Folk Songs Generation with Specific Regional Style. arXiv preprint arXiv:1909.13287.

60. Makris, D., Kaliakatsos-Papakostas, M., Karydis, I., and Kermanidis, K. L. (2019). Conditional neural sequence learners for generating drums' rhythms. Neural Computing and Applications, 31(6):1793–1804.
61. Manzelli, R., Thakkar, V., Siahkamari, A., and Kulis, B. (2018). Conditioning Deep Generative Raw Audio Models for Structured Automatic Music. arXiv preprint arXiv:1806.09905.

62. Manzelli, R., Thakkar, V., Siahkamari, A., and Kulis, B. (2018). An end to end model for automatic music generation: Combining deep raw and symbolic audio networks. In Proceedings of the Musical Metacreation Workshop at 9th International Conference on Computational Creativity, Salamanca, Spain.

63. Medeot, G., Cherla, S., Kosta, K., McVicar, M., Abdallah, S., Selvi, M., Newton-Rex, E., and Webster, K. (2018). StructureNet: Inducing Structure in Generated Melodies. In ISMIR, pages 725–731.

64. Mogren, O. (2016). C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904.

65. Mura, D., Barbarossa, M., Dinuzzi, G., Grioli, G., Caiti, A., and Catalano, M. G. (2018). A Soft Modular End Effector for Underwater Manipulation: A Gentle, Adaptable Grasp for the Ocean Depths. IEEE Robotics & Automation Magazine, PP(4):1–1.

66. Muñoz, E., Cadenas, J. M., Ong, Y. S., and Acampora, G. (2014). Memetic music composition. IEEE Transactions on Evolutionary Computation, 20(1):1–15.

67. Olseng, O. and Gambäck, B. (2018). Co-evolving melodies and harmonization in evolutionary music composition. In International Conference on Computational Intelligence in Music, Sound, Art and Design, pages 239–255. Springer.

68. Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.

69. Oore, S., Simon, I., Dieleman, S., Eck, D., and Simonyan, K. (2020). This time with feeling: learning expressive musical performance. Neural Computing and Applications, 32(4):955–967.

70. Payne, C. (2019). MuseNet. OpenAI Blog. https://openai.com/blog/musenet/. Accessed 11 January 2022.

71. Plut, C. and Pasquier, P. (2020). Generative music in video games: State of the art, challenges, and prospects. Entertainment Computing, 33:100337.

72. Ramanto, A. S., No, J. G., and Maulidevi, D. N. U. Markov Chain Based Procedural Music Generator with User Chosen Mood Compatibility. In International Journal of Asia Digital Art and Design Association, 21(1), 19-24.

73. Rivero, D., Ramírez-Morales, I., Fernandez-Blanco, E., Ezquerra, N., and Pazos, A. (2020). Classical Music Prediction and Composition by Means of Variational Autoencoders. Applied Sciences, 10(9):3053.

74. Roberts, A., Engel, J., Raffel, C., Hawthorne, C., and Eck, D. (2018). A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. In International Conference on Machine Learning (pp. 4364-4373). PMLR.

75. Scirea, M., Togelius, J., Eklund, P., and Risi, S. (2016). MetaCompose: A compositional evolutionary music composer. In International Conference on Computational Intelligence in Music, Sound, Art and Design, pages 202–217. Springer.

76. Sturm, B. L., Ben-Tal, O., Monaghan, Ú., Collins, N., Herremans, D., Chew, E., Hadjeres, G., Deruty, E., and Pachet, F. (2019). Machine learning research that matters for music creation: A case study. Journal of New Music Research, 48(1):36–55.

77. Sturm, B. L., Santos, J. F., Ben-Tal, O., and Korshunova, I. (2016). Music transcription modelling and composition using deep
learning. arXiv preprint arXiv:1604.08723.

78. Supper, M. (2001). A Few Remarks on Algorithmic Composition. Computer Music Journal, 25(1):48–53.

79. Tapus, A. (2009). The role of the physical embodiment of a music therapist robot for individuals with cognitive impairments: longitudinal study. In 2009 Virtual Rehabilitation International Conference, pages 203–203. IEEE.

80. Trieu, N. and Keller, R. M. (2018). JazzGAN: Improvising with Generative Adversarial Networks. In MUME 2018: 6th International Workshop on Musical Metacreation.

81. Valenti, A., Carta, A., and Bacciu, D. (2020). Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders. arXiv preprint arXiv:2001.05494.

82. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

83. Veblen, K. and Olsson, B. (2002). Community music: Toward an international overview. The New Handbook of Research on Music Teaching and Learning, 730-753.

84. Waite, E. and others (2016). Generating long-

Transfer Using Cycle-Consistent Boundary Equilibrium Generative Adversarial Networks. arXiv preprint arXiv:1807.02254.

88. Yang, L.-C., Chou, S.-Y., and Yang, Y.-H. (2017). MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847.

89. Yu, Y., Srivastava, A., and Canales, S. (2021). Conditional LSTM-GAN for Melody Generation from Lyrics. ACM Transactions on Multimedia Computing, Communications, and Applications, 17(1):1–20.

90. Zhang, N. (2020). Learning adversarial transformer for symbolic music generation. IEEE Transactions on Neural Networks and Learning Systems.

91. Zhu, H., Liu, Q., Yuan, N. J., Qin, C., Li, J., Zhang, K., Zhou, G., Wei, F., Xu, Y., and Chen, E. (2018). XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2837–2846, London, United Kingdom. ACM.

92. Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages
term structure in songs and stories. Web blog 2223–2232.
post. Magenta, 15(4).
93. Zipf, G. K. (2016). Human behavior and the
85. Wang, B. and Yang, Y.-H. (2019). Per- principle of least effort: An introduction to
formanceNet: Score-to-Audio Music Genera- human ecology. Ravenio Books.
tion with Multi-Band Convolutional Residual
Network. Proceedings of the AAAI Confer- 94. Westergaard, P. (1959). Experimental music.
ence on Artificial Intelligence, 33:1174–1181. composition with an electronic computer.
86. Williams, D., Hodge, V. J., Gega, L., Mur- 95. Cope, D. (1991). Recombinant music: using
phy, D., Cowling, P. I., and Drachen, A. the computer to explore musical style. Com-
(2019). AI and Automatic Music Generation puter, 24(7):22–28.
for Mindfulness. page 11.
96. Miranda, E. R. and Al Biles, J. (2007).
87. Wu, C.-W., Liu, J.-Y., Yang, Y.-H., and Evolutionary computer music. Springer.
Jang, J.-S. R. (2018). Singing Style