You are on page 1of 24

A Review of Intelligent Music Generation Systems

Ziyi Zhao1 , Hanwei Liu1 , Song Li1 , Junwei Pang1 , Maoqing Zhang1 , Yi Qin2 , Lei
arXiv:2211.09124v1 [cs.SD] 16 Nov 2022

Wang1* and Qidi Wu1


1* College
of Electric and Information Engineering, Tongji University, Cao’an Street,
Shanghai, 201804, China.
2 Department of Music Engineering, Shanghai Conservatory of Music, Fenyang street,

Shanghai, 200031, China.

*Corresponding author(s). E-mail(s): wanglei tj@126.com;;


Contributing authors: Zhaozi1@tongji.edu.cn; liuhw1@tongji.edu.cn;
tobeace@tongji.edu.cn; pangjw@tongji.edu.cn; maoqing zhang@163.com;
qinyi@shcmusic.edu.cn;

Abstract
Intelligent music generation, one of the most popular subfields of computer creativity, can lower the
creative threshold for non-specialists and increase the efficiency of music creation. In the last five
years, the quality of algorithm-based automatic music generation has increased significantly, moti-
vated by the use of modern generative algorithms to learn the patterns implicit within a piece of
music based on rule constraints or a musical corpus, thus generating music samples in various styles.
Some of the available literature reviews lack a systematic benchmark of generative models and are
traditional and conservative in their perspective, resulting in a vision of the future development
of the field that is not deeply integrated with the current rapid scientific progress. In this paper,
we conduct a comprehensive survey and analysis of recent intelligent music generation techniques,
provide a critical discussion, explicitly identify their respective characteristics, and present them in
a general table. We first introduce how music as a stream of information is encoded and the rel-
evant datasets, then compare different types of generation algorithms, summarize their strengths
and weaknesses, and discuss existing methods for evaluation. Finally, the development of artificial
intelligence in composition is studied, especially by comparing the different characteristics of music
generation techniques in the East and West and analyzing the development prospects in this field.

Keywords: music generation, music coding methods, automatic composition, evaluation of generation effects

1 Introduction creation. Intelligent music generation is dedicated


to solving the problem of automatic music compo-
Computer creativity as a field exploring auto- sition and is currently one of the most productive
matic creation has led to high-quality research in sub-fields in the field of computational creativity.
the direction of intelligent innovation, including Two essential properties of automatic music com-
storytelling[2] and art[71]. With the development position are structural awareness and interpre-
of artificial intelligence, music has proven to have tive ability. Structure-aware models can produce
a very high capacity and potential for automatic

1
&RQVWUDLQW (YROXWLRQDU\$OJRULWKPV

0HORG\ 6XEMHFWLYH
6\PERO JHQHUDWLRQ HYDOXDWLRQ
6HTXHQFH
+00 /670 JHQHUDWLRQ 2EMHFWLYH
HYDOXDWLRQ
$XGLR
JHQHUDWLRQ
6W\OH6KLIW (YDOXDWLRQ

$XGLR
,QSXW *$1 9$(
2XWSXW
5HSUHVHQWDWLRQ *HQHUDWLQJPRGHOV

Fig. 1: Overview of music generation system

naturally coherent music with long-term depen- less information on the general method of music
dencies, including different levels of repetition generation.
and variation[44]; interpretive ability translates Considering the shortcomings of the previ-
complex computational models into controllable ous literature, this paper provides a comprehen-
interfaces for interactive performance[46]. Finally, sive analysis of the latest articles on the various
this field can serve a variety of purposes, such as research directions of music generation, pointing
composers using the technology to develop exist- out their respective strengths and weaknesses, and
ing material, to inspire and explore new ways of summarizing them in a very clear classification
making and working with music, and listeners can based on generative methods. In order to better
customise or personalise music. understand the digital character of the music, we
The artificial intelligence composition system also present the two main representations of music
is usually called the Music Generation System as an information stream and list the commonly
(MGS). Currently, a few review papers on music used datasets For the evaluation of generative
generation systems can currently be found. Her- effects, the most representative evaluation meth-
remans et al.[32] introduced the classification ods based on both subjective and objective aspects
method of music generation systems and pointed are selected for discussion. Our work concludes
out problems in music evaluation. Briot et al.[4] with a comparative analysis of the similarities and
introduced algorithms based on deep learning differences between Eastern and Western research
based music generation system, and put forward in the field of AI composition based on different
corresponding countermeasures for some limita- cultural characteristics and a look at the further
tions in this field. Kaliakatsos et al.[47] analyzed development of the field of music generation as it
and predicted the research fields that promote the matures.
development of intelligent music generation in the
field of artificial intelligence; Loughran et al.[56] 2 Overview
aimed the music generation system based on evo-
lutionary calculation has analyzed the categories The most basic elements of music include the
and potential problems; Plug et al.[71] introduced pitch of the sound, the length of the sound, the
the latest research technology of music genera- strength of the sound and the timbre. These
tion systems in the game field, and analyzed the basic elements combine with each other to form
prospects and challenges of the field. However, in the melody, harmony, rhythm and tone of the
these previous papers, some do not provide suffi- music. In the field of artificial intelligence compo-
cient analysis of algorithms and models[32, 47, 56]. sition, melody constitutes the primary purpose of
There is a great deal of redundancy in the text an automatic music generation system. In most
in[4]. Others have a more traditional and con- melody generation algorithms, the goal is to cre-
servative outlook on the future[32, 47], and[71] ate a melody similar to the selected style, such as
is more focused on the application level, giving generating western folk music[63] or free jazz[49].
3

In addition to melody, harmony is another pop- some representation of music, which has an impor-
ular aspect of automatic music generation. When tant influence on the number of input and output
generating a harmony sequence, the quality of nodes (variables) to the system, the correctness
its output mainly depends on the similarity with of training, and the quality of the generated
the target style. For example, in chorus harmony, content[4].
this similarity is achieved by observing Sound
guidance rules[28] are clearly defined. In popu- 3.1 Specific formats
lar music, the chord progression is mainly used
as an accompaniment to the melody[91]. In addi- The representation of music is mainly divided into
tion, audio generation and style conversion are two categories: audio and symbolic. The main
also popular directions in the field of intelligent difference between the two is that audio is a
composition. The overall framework of the specific continuous signal, while the symbol is a discrete
music generation system is shown in Figure 1. It is signal.
worth noting that many historical music genera-
tion systems do not include deep neural networks 3.1.1 Audio representation
or evolutionary algorithms, but generate outputs The main ways in which music can be represented
based solely on musical rule constraints. In the in audio are:
a. Waveform[68]
b. Spectrogram[58]
6HJPHQW$
Among them, the spectrogram[58] can be
3KUDVH 3KUDVH 3KUDVH 3KUDVH obtained by the short-time Fourier transform of
the original audio signal. The advantage of using
0RWLYH 0RWLYH 0RWLYH ĂĂ 0RWLYHQ
audio signal representation is that the original
Fig. 2: The Hierarchy of Music audio waveform retains the original properties of
music and can be used to produce expressive
music[62]. However, this form of representation
process of generating music, the musical content has certain limitations, because music fragments
needs to be encoded in digital form as input to the in the original audio domain are usually repre-
algorithm, after which the intelligent algorithm sented as continuous waveform x ∈ [−1, 1]T , where
is learnt and trained to eventually output differ- the number of samples T is the audio duration
ent types of musical segments, such as melodies, and the sampling frequency (usually 16 kHz To
chords, audio, style transitions, etc... Music has a 48 kHz), short-term music fragments will generate
clear hierarchical structure in time, from motives a lot of data. Therefore, modelling the raw audio
to phrases and then to segments. As shown in can lead to a somewhat range-dependent system,
Figure 2, where the overall structure of the music making it challenging for intelligent algorithms to
together with local repetitions and variants form learn the high-level semantics of music, such as the
the theme, rhythm and emotion of the piece; sec- Jukebox[14] automated record jukebox that takes
ondly, music usually consists of multiple voices approximately nine hours to render one minute of
(may be the same or different instruments), each music.
individual voice part has a complete structure, and
together they must be in harmony with each other. 3.1.2 Symbolic representation
Combining these limitations, music generation is The symbolic representation of music fragments is
still a very challenging problem. mainly as follows:
a. MIDI events
3 Representation MIDI (Musical Instrument Digital Interface)
is a digital interface of musical instruments to
Music can be seen as a stream of information that ensure the interoperability between various elec-
conveys emotions at different levels of abstrac- tronic musical instruments, software and equip-
tion. The input to a generative system is usually ment. Figure 3a shows the type of MIDI event.
This method uses two symbols to indicate the
appearance of notes: Note on and Note off, indi- file before converting it to image form. As shown
cating respectively the beginning and end of the in Figure 3b, the x-axis represents the continuous
played notes. In addition, the note number is used time step, the y-axis represents the pitch, and the
to indicate the pitch of the note, which is specified z-axis represents the audio track, where n tracks
ㅖਧਾ֌Ѫ 㖁㔌Ⲵ䗃‫⭘ˈޕ‬ԕ⭏ᡀ≁؇丣Ҁ
by an integer between 0 and 127. This data struc- have n one-hot matrices.
ture contains an incremental time value to specify The Piano-roll format quantizes the notes into
㊫රⲴᴢᆀDŽ਼ṧⲴˈ
the relative interval time between notes. For exam-
䟷⭘ ㅖਧ
a matrix form according to the beats. This rep-
䘋㹼㺘⽪⭘Ҿ⭏ᡀ᯻ᖻDŽնᱟਚ㜭Ѫঅ丣᯻ᖻ䘋㹼
ple, Music transformer[38], MusicVAE[74], etc. resentation form is popular in music generation
convert melody, drums, piano, etc. into MIDI systems because it is easy to observe the structure
བྷ㕆⸱Ⲵተ䲀ᙗ䲀ࡦҶ䈕㺘⽪ᖒᔿⲴᓄ⭘DŽ
ㅖਧਾ֌Ѫ
events and use them 㖁㔌Ⲵ䗃‫⭘ˈޕ‬ԕ⭏ᡀ≁؇丣Ҁ
as input to the training net- of the melody and does not require serializa-
work to generate music fragments. However, this tion of polyphonic music. Piano-roll is one of the
Ⲵ丣Ҁབྷ㊫රⲴᴢᆀDŽ਼ṧⲴˈ
ㅖਧਾ֌Ѫ cannot
representation 㖁㔌Ⲵ䗃‫⭘ˈޕ‬ԕ⭏ᡀ≁؇丣Ҁ 䟷⭘
effectively retain the con-
ㅖਧ
most commonly used representations for mono-
PLGL1RWH2Q(YHQW WLFN
cept of multiple tracks FKDQQHO
䘋㹼㺘⽪⭘Ҿ⭏ᡀ᯻ᖻDŽնᱟਚ㜭Ѫঅ丣᯻ᖻ䘋㹼
㊫රⲴᴢᆀDŽ਼ṧⲴˈ playing multiple 䟷⭘ GDWD
notes ㅖਧ >@
at tone music (polyphonic music) melody, such as:
PLGL1RWH2Q(YHQW WLFN
once[35]. FKDQQHO GDWD >@ Midi-VAE[5], MuseGAN[18], etc. First convert
Ѫ 䘋㹼㺘⽪⭘Ҿ⭏ᡀ᯻ᖻDŽնᱟਚ㜭Ѫঅ丣᯻ᖻ䘋㹼
㕆⸱Ⲵተ䲀ᙗ䲀ࡦҶ䈕㺘⽪ᖒᔿⲴᓄ⭘DŽ
PLGL1RWH2II(YHQW WLFN FKDQQHO GDWD >@ polyphonic music to Piano-roll representation,
ᔿѫ㾱Ѫ PLGL1RWH2II(YHQW WLFN FKDQQHO GDWD >@ and then proceed to follow-up processing. However
㕆⸱Ⲵተ䲀ᙗ䲀ࡦҶ䈕㺘⽪ᖒᔿⲴᓄ⭘DŽ
PLGL1RWH2Q(YHQW WLFN
PLGL1RWH2Q(YHQW WLFN FKDQQHOFKDQQHO GDWD
GDWD >@
>@
the biggest limitation of using Piano-roll is that it
PLGL1RWH2Q(YHQW WLFN
PLGL1RWH2Q(YHQW WLFN FKDQQHO
FKDQQHO
PLGL1RWH2Q(YHQW WLFN FKDQQHO GDWD
GDWD
GDWD >@
>@
>@
PLGL1RWH2Q(YHQW WLFN FKDQQHO
PLGL1RWH2II(YHQW WLFN FKDQQHO GDWD >@
GDWD >@ cannot capture the vocal and other sound details
PLGL1RWH2II(YHQW WLFN FKDQQHO GDWD >@
а ᱟа PLGL1RWH2II(YHQW WLFN FKDQQHO GDWD >@ such as timbre, velocity, and expressiveness[4],
PLGL1RWH2II(YHQW WLFN FKDQQHO GDWD >@
PLGL1RWH2Q(YHQW WLFN
PLGL1RWH2Q(YHQW WLFN FKDQQHO GDWD
FKDQQHO GDWD >@
>@
PLGL1RWH2Q(YHQW WLFN
and it is difficult to distinguish long notes from
઼ PLGL1RWH2Q(YHQW WLFN FKDQQHO GDWD
FKDQQHO GDWD >@
>@
ǃ䖟Ԧ઼ repeated short notes (e.g. whole notes and mul-
һ һ (a) MIDI
MIDI events(Extract
events(Extract from
tiple eighth notes) and has multiple zero integers
(a) fromTwinkle
TwinkleTwinkle
TwinkleLittle
LittleStar)
Star)
resulting in low training efficiency.
Ⲵࠪ⧠˖
⧠˖ c.ABC notation
Ⲵᔰ࿻઼ ABC notation1 encodes melodies into a tex-

ˈ⭡ ࡠ tual representation, as shown in Figure 3c, initially

њ໎䟿ᰦ designed specifically for folk and traditional music




宷 originated in Western Europe, but later widely
ྲᰦ 宷

宓 宫





used for other types of music. A typical example

ǃ䫒⩤ㅹ of this encoding method, which has the advantage
ˈӾ㘼ӗ 宷宬宰宨 of being concise and easy to read, is Folk-rnn[77],
ㅹ which takes the ABC notation of a particular folk
‫⮉؍‬ཊњ 宷宬宰宨
宷宬宰宨
ӗ music transcription and uses it as input to an
宛孽季孴 LSTM network to generate tunes in the folk music
њ (b) Piano-roll
宗孽季宖宦宲宷宯室宱宧季宗宫宨季宅宵室容宨 genre. Similarly, GGA-MG[23] uses ABC symbols
宝孽季宊宲宱宨宄宺室宼
宕孽季宰室宵宦宫 for representation used to generate melodies. How-
‫ݸ‬䴰㾱ሩ 宛孽季孴
宐孽季孷孲孷
宛孽季孴 宏孽季孴孲孻
宗孽季宖宦宲宷宯室宱宧季宗宫宨季宅宵室容宨 ever, the limitation of only being able to encode
޽䖜ᦒѪ 宎孽季宇宰室宭
宗孽季宖宦宲宷宯室宱宧季宗宫宨季宅宵室容宨
宝孽季宊宲宱宨宄宺室宼
宄孯宇宿宇孵季宇孶孲孵守孲孵宿安宇季安宄宿宧孵季宧孶孲孵宦孲孵宿 monophonic melodies limits the application of this
宕孽季宰室宵宦宫
↕䮯ˈ 宝孽季宊宲宱宨宄宺室宼 representation.
ሩ 宐孽季孷孲孷
宕孽季宰室宵宦宫
宏孽季孴孲孻
ࡉᴹ
Ѫ њ 宐孽季孷孲孷
宎孽季宇宰室宭
ഴ 丩҆㺞⽰ᖘᕅ⽰ׁ
宏孽季孴孲孻
宄孯宇宿宇孵季宇孶孲孵守孲孵宿安宇季安宄宿宧孵季宧孶孲孵宦孲孵宿
3.2 Datasets
ˈ 宎孽季宇宰室宭
ᮦᦤ䳼 For the currently very prevalent deep neural
Ѫ⸙䱥ᖒ 宄孯宇宿宇孵季宇孶孲孵守孲孵宿安宇季安宄宿宧孵季宧孶孲孵宦孲孵宿
њ ൘丣Ҁ⭏ᡀ㌫㔏ѝˈ⺞ᇊྭ丣ҀⲴ㺘⽪ᖒᔿཆˈ network-based generative systems, in addition to
н䴰㾱ᒿ ഴ 丩҆㺞⽰ᖘᕅ⽰ׁ identifying a proper musical representation, a
(c) 䘈䴰䘹ਆоѻ⴨३䝽Ⲵ丣Ҁᮠᦞ䳶֌Ѫ⭏ᡀ㇇⌅Ⲵ
ABC(c)notation(Extract
ABC notation(Extract from Scotland
from Scotland The
The Blave) Blave)
࠶⍱㹼DŽ matching musical dataset needs to be selected as
ᮦᦤ䳼 Fig. 3: Representation of Music
䗃‫ޕ‬DŽՐ㔏‫ޡޜ‬ᮠᦞ䳶Ⲵ丣Ҁ㊫රवਜ਼ཊ丣䖘丣Ҁǃ the input to the system. Popular public datasets

ᖻ䘋㹼㺘 ഴ 丩҆㺞⽰ᖘᕅ⽰ׁ contain music types such as multi-track, choral,
ਸୡ丣Ҁǃ䫒⩤丣Ҁǃ≁䰤丣Ҁㅹˈ㺘⽪ᯩᔿवਜ਼
൘丣Ҁ⭏ᡀ㌫㔏ѝˈ⺞ᇊྭ丣ҀⲴ㺘⽪ᖒᔿཆˈ
ᒿ ǃ piano and folk music, and representations such
ᮦᦤ䳼丣仁ǃ һԦǃ
b.Piano-roll
㺘䘈䴰䘹ਆоѻ⴨३䝽Ⲵ丣Ҁᮠᦞ䳶֌Ѫ⭏ᡀ㇇⌅Ⲵ
ㅖਧㅹDŽ൘丣Ҁ⭏ᡀ䗷〻
DŽ The conversion of a music clip to Piano-roll for-
ѝ䴰㾱‫⭏ᦞ׍‬ᡀ丣Ҁ㊫ර䘹ਆਸ䘲Ⲵᮠᦞ䳶ˈ㺘
൘丣Ҁ⭏ᡀ㌫㔏ѝˈ
㺘䗃‫ޕ‬DŽ matՐ㔏‫ޡޜ‬ᮠᦞ䳶Ⲵ丣Ҁ㊫රवਜ਼ཊ丣䖘丣Ҁǃ
begins with a one-hot
ࡇѮҶࠐњިරᐢ‫ޜ‬ᔰⲴ丣Ҁᮠᦞ䳶Ⲵ㊫රǃ㕆⸱
⺞ᇊྭ丣ҀⲴ㺘⽪ᖒᔿཆˈ
encoding of the MIDI 1
http://abcnotation.com/wiki/abc:standard

৺ަԆ丣ਸୡ丣Ҁǃ䫒⩤丣Ҁǃ≁䰤丣Ҁㅹˈ㺘⽪ᯩᔿवਜ਼
ǃ 䘈䴰䘹ਆоѻ⴨३䝽Ⲵ丣Ҁᮠᦞ䳶֌Ѫ⭏ᡀ㇇⌅Ⲵ
ᖒᔿㅹ䘋㹼㔏䇑DŽ
ˈфᖸ䳮
丣仁ǃ һԦǃ ㅖਧㅹDŽ൘丣Ҁ⭏ᡀ䗷〻
㺞  丩҆ᮦᦤ䳼⽰ׁ
㺘䗃‫ޕ‬DŽ
ཊњ‫࠶ޛ‬ Ր㔏‫ޡޜ‬ᮠᦞ䳶Ⲵ丣Ҁ㊫රवਜ਼ཊ丣䖘丣Ҁǃ
ѝ䴰㾱‫⭏ᦞ׍‬ᡀ丣Ҁ㊫ර䘹ਆਸ䘲Ⲵᮠᦞ䳶ˈ㺘
䰞仈DŽ
㺘 ਸୡ丣Ҁǃ䫒⩤丣Ҁǃ≁䰤丣Ҁㅹˈ㺘⽪ᯩᔿवਜ਼
5

as audio, MIDI events and ABC symbols. Table1 written by Jay Glanville and some perl scripts,
lists the types and encoding forms of several typi- much of this database has been converted to ABC
cal publicly available music datasets for statistical notation.
purposes. The above dataset consists mainly of West-
ern music, with less attention paid to Eastern
musical works. Actually, there are a small number
Table 1: Examples of music datasets of datasets that have been collected for oriental
Dataset Type Format music, but most of them are not yet publicly avail-
able on the Internet. For example, MG-VAE[59]
Nsynth2 Multitrack WAV digitised the folk songs from different regions of
Lakh MIDI Dataset3 Multitrack MIDI
JSB-Chorales4 Harmonized chorale MIDI China recorded in the Chinese Folk Music Collec-
Groove MIDI5 Drum MIDI tion into 2,000 tracks with different regional styles
Nottingham6 Folk tunes ABC in MIDI format. XiaoIce Band[91] collected Chi-
nese popular music for generating multi-tracked
pop music. Jiang et al.[43] collected over 1,000
Nsynth contains 305,979 notes, each with a
Chinese karaoke songs. In addition, CH818[34] col-
unique pitch, timbre and envelope. For the 1,006
lected 818 Chinese pop songs, extracting the most
instruments from the commercial sample library,
emotive 30-second segments from each song and
the four-second, single-single instrument is gener-
adding emotional tags to them.
ated in the range of each pitch (21-108) and five
different speeds (25, 50, 75, 100, 127) of a standard
MIDI piano. 4 Music Generation Systems
Lakh MIDI is a collection of 176,581 unique
MIDI files, 45,129 of which have been matched The quest for computers to create music has
and aligned with entries in the One Million Songs been pursued by researchers since the dawn of
Dataset. It aims to facilitate large-scale retrieval computing. In 1957, Lejaren Hiller produced the
of music information, both symbolic (using MIDI IlliacSuite [94], the first musical composition in
files only) and audio content-based (using infor- human history composed entirely by a computer.
mation extracted from MIDI files as annotations David Cope(1991)[95] introduced Experiments in
for matching audio files). Musical Intelligence (EMI), which used the most
JSB-Chorales offers three temporal resolu- basic process of pattern recognition and recombi-
tions: quarter, eighth and sixteenth. These ’quan- nancy to create music that successfully imitated
tizations’ are created by retaining the tones heard the styles of hundreds of composers.
on a specified time grid. Boulanger-Lewandowski In the new, century, from early rule-based[78]
(2012) uses a temporal resolution of quarter notes. to the currently widely used deep learning-
This dataset does not currently encode fermatas based[4] music generation systems, the generated
and is unable to distinguish between held and music has gradually evolved from simple, short-
repeated notes. duration monotonic music fragments to poly-
Groove MIDI Dataset (GMD) consists of 13.6 phonic music with chords and long duration. This
hours of unified MIDI and (synthetic) audio of paper surveys the recent state-of-the-art in music
human-performed, rhythmically expressive drum generation and classifies them based on algo-
sounds. The dataset contains 1150 MIDI files and rithm, genre, dataset, and representation, etc.
over 22,000 drum sounds. The detailed classification results are presented in
Nottingham Music Dataset maintained by Eric Table 2, which can be simply categorized into the
Foxley contains over 1000 folk tunes stored in a following categories according to the purpose of
special text format. Using NMD2ABC, a program generation:
• Melody generation
• Arrangement generation
2
3
https://magenta.tensorflow.org/datasets/nsynth • Audio generation
https://colinraffel.com/projects/lmd/
4
http://www.jsbchorales.net/index.shtml
• Style transfer
5
https://magenta.tensorflow.org/datasets/groove
6
http://abc.sourceforge.net/NMD/
Table 2: List of Music Generation Systems
Name Year Algorithm Application

Rule T. Tanaka et al. 2015 Constraint based System melody generation


MorpheuS 2017 Constraints(tonal and pattern) Arrangement generation

Chih-Fang Huang et al. 2016 Markov chain melody generation


Markov Model AS Ramanto et al. 2017 Markov chain melody generation
D Williams et al. 2019 Hidden Markov Model(HMM) melody generation

Wavenet[13] 2016 Dilated Casual Convolutions Audio generation


Engel et al. 2017 Wavenet/ AE timbre style transfer
Manzelli et al. 2018 WaveNet/biaxial LSTM Audio generation
CNN
6

TimbreTron 2019 WaveNet/CycleGAN composition style transfer


Coconet 2019 Counterpoint by Convolution Reconstruct partial scores
PerformanceNet 2019 ContourNet/TextureNet Audio generation

MelodyRNN 2016 lookback RNN/attention RNN melody generation


Hadjeres et al. 2017 Constraint-RNN/Token-RNN(LSTM) melody generation
RL Tuner 2017 LSTM/RL melody generation
Article Title

StructureNet 2018 LSTM melody generation


Makris et al. 2019 LSTM/FF melody generation
Keerti et al. 2020 bi-LSTM/attention melody generation
Folk-rnn 2016 LSTM Arrangement generation
RNN
Song From PI 2016 LSTM Arrangement generation
JamBot 2017 LSTM Arrangement generation
DeepBach 2017 bi-LSTM/pseudo-Gibbs sampling Arrangement generation
XiaoIce Band 2018 GRU/MLP/multi-task learning Arrangement generation
Amadeus 2019 LSTM/RL Arrangement generation
Performance RNN 2018 LSTM Audio generation
Jin et al. 2020 LSTM/AC RL Composition style transfer

C-RNN-GAN 2016 GAN/LSTM melody generation


MidiNet 2017 GAN/CNN melody generation
JazzGAN 2018 GAN/RA melody generation
SSMGAN 2019 GAN/ SSM melody generation
Yu et al. 2019 GAN/conditional LSTM melody generation
MuseGAN 2018 GAN Arrangement generation
GAN

BinaryMuseGAN 2018 GAN/ BN Arrangement generation

LeadSheetGAN 2018 GAN/CNN Arrangement generation


CycleBEGAN 2018 CycleBEGAN-CNN/skip/recurrent Timbre style transfer
CycleGAN 2018 Cycle-GAN Composition style transfer
GAN
Play as You Like 2019 RaGAN/MUNIT composition style transfer
WaveGAN 2019 DCGAN audio generation
GANSynth 2019 PGGAN audio generation

MusicVAE 2019 VAE/RNN melody generation


GrooVAE 2019 VAE/LSTM melody generation
Jiang et al. 2019 attention-VAE melody generation
MuseAE 2020 AAE/bi-LSTM melody generation
MIDI-Sandwich2 2019 BVAE/MFG-VAE Arrangement generation
VAE
Rivero et al. 2020 VAE Arrangement generation
Jukebox 2020 VQ-VAE Arrangement generation
MIDI-VAE 2018 parallel VAE/GRU Composition style transfer
MahlerNet 2019 conditional VAE/bi-RNN Composition style transfer
MG-VAE 2019 VAE/bi-GRU Composition style transfer

Music Transformer 2019 transformer Arrangement generation(piano)


LakhNES 2019 transformer XL Arrangement generation
MuseNet 2019 Transformer Arrangement generation
Transformer Guan et al. 2019 self-attention/GAN Arrangement generation
Zhang et al. 2020 transformer/GAN Arrangement generation
Transformer VAE 2020 transformer/VAE melody generation
Choi et al. 2020 Transformer/VAE melody generation

Muńoz et al. 2016 MA/fitness:Harmonic rules melody generation


Jeong et al. 2017 EA/fitting function: Multi-objective (stability, tension) melody generation
GGA-MG 2020 GA/fitting function: LSTM network melody generation
EA
S Hickinbotham et al. 2016 GA/fitness function: Interactive Live-coding
Kaliakatsos et al. 2016 EA/PSO Arrangement generation(interactive)
Olseng O et al. 2018 EA/ fitness function:multi-objective Arrangement generation
EA/ fitness function:multi-objective Arrangement generation
EvoComposer 2019
Harmony rules, corpus statistical analysis Arrangement generation
Melody generation generates melodies for based on the theoretical knowledge of music.
drums, bass, etc.; arrangement generation gen- Including rule based on grammar[78] and its
erates harmonies, dominant music, polyphony, variant L-system, rule-based constraints: such
etc.; audio generation models music directly at as similarity[51], rhythm rules [31], etc. These
the audio level, generating sound fragments with constraints help guide the process of intelligent
different instruments, and styles, which is a chal- composition The direction of music generation.
lenging approach: including computational com- The performance of a rule based generation sys-
plexity and the difficulty of capturing semantic tem largely depends on the creativity of the
structure from the original audio; style trans- developer, the musical concept of the inventor,
fer encompasses timbre conversion, arrangement and how this abstraction is expressed in terms
style conversion, etc., using automatic orchestra- of the corresponding variables[47].T.Tanaka et
tion techniques attempt to apply a certain style al.[78] proposed a formula for generating music
(a particular composer or musical genre, etc.) rhythm based on grammatical rules, focusing on
or a certain instrument (drums, bass, etc.) to the structure and redundancy of music, and con-
the original musical composition. Detailed clas- trolling some global structure of the note rhythm
sification information for each music generation sequence, but the project is still a theoretical
system can be found in Tab.2. Although many expression, has not yet been practiced. Miguel
music generation algorithms feature combinations et.al.[17] introduced the user’s emotional input
and nesting, such as the use of Variational Auto- based on the use of rules derived from music the-
Encoder (VAE) combined with Recursive Neural ory knowledge and proposed a bottom-up approxi-
Networks (RNN)[74], music generation algorithms mation approach to design a two-level multi-agent
can be generally categorised as follows: structure for music generation in a modular way.
• Although the problem of musical expressiveness
Rule Based Systems
• is well solved, the system is still at a rather
Markov Model
• early stage of development, handling user input
Deep Learning: LSTM, VAE, GAN et al.
• very rigidly and performing poorly when deal-
Evolutionary Computation, EC
ing with fuzzy concepts. Besides, as figure 4,
MorpheuS[31] uses Variable Neighborhood Search
4.1 Rule Based Music Generation (VNS) to assign pitches by detecting repeated
rhythms in music to generate polyphonic music
with specific tension and repeating rhythms.
Mod Input Rule based generation systems can effectively
limit the area explored, but music is organised
Markove Chain Batasan awal
at many levels of abstraction and even in Bach
Node
Value Key chorales[37], which have highly organised and
strict rules, complex rules may not be sufficient to
Chord capture deeper structure. Rule based music gen-
Pitch
Range
eration systems somewhat limit the diversity of
Octave output, while the ambiguity of the system is higher
when there are more analyses and rules.
Note Tempo

4.2 Markov Model


Chord Melody Markov model[36, 72, 86] is a probability-based
generation model for processing time series data,
i.e. data where there is a time series relationship
Fig. 4: Overview of MorhpeuS’ architecture between samples. The probability of occurrence of
specific elements in the data set and the probabil-
ity of well-defined features can be captured, such
The rule based music generation system is as the occurrence of notes or chords, the transition
a non-adaptive model, which generates music
9

of notes or chords, and the conditional probability 4.3 Deep Learning Methods for
of generating chords on a given note. Music Generation
Content generation is an extended area of deep
(optional) learning. With the achievements of Google’s
Onsets and
.wav
Frames
MIDI Melody
Magenta and CTRL (Creator Technology
Research Lab) in music generation, as a music
Performance Performance
(MIDI) Encoder
[...|...|...|...|...] [embedding] Decoder generation method, deep learning is receiving
(optional)
more and more attention. Unlike grammar-based
Melody
(MIDI)
Melody
Encoder
[...|...|...|...] or rule-based music generation systems, deep
learning based music generation systems can learn
Fig. 5: Programmable Markov Music Generation the distribution and relevance of samples from an
System[72] arbitrary corpus of music and generate music that
represents the musical style of the corpus through
prediction (e.g. predicting the pitch of the next
Due to the time sequence characteristics of note of a melody) or classification (e.g. identifying
music fragments, Markov model is well suited the chords corresponding to the melody).
for the generation of musical elements such as
melodies (note sequences)[9]. Chi-Fang Huang et 4.3.1 Convolutional Neural Networks
al.[36] established a Markov chain switching table
of the Chinese pentatonic scale by analyzing the
modes of Chinese folk music. They generate the
Residual
next note of the current input melody accord- +

ing to the Markov chain switching table and the 1×1


+ ReLU 1×1 ReLU 1×1 Softmax Output
×
Skip-connections
rhythm complexity algorithm, and re-present a tanh σ

specific style of music. A S Ramanto et al.[72] used Dilated


Conv

four Markov chains to model the components of k Layers

a musical fragment including tempo, pitch, note Causal

value and chord type, and assigned parameter val- Conv

ues for different moods to each part, thus ensuring Input

that the music generation system could adapt to Fig. 6: WaveNet: Overview of the residual block
different moods of the input. The model of the and the entire architecture
procedural music generation algorithm used in
the literature is shown in Figure5. D Williams et
al.[86] also generated melodies based on emotions,
with the difference that the approach used a Hid- Traditional Convolutional Neural Networks
den Markov Model (HMM) for modelling, showing mainly process images with non-time sequence
that emotionally relevant music can be composed characteristics. It is difficult to process music with
using a Hidden Markov Model. time sequence. Therefore, at this stage, there are
Compared with the current deep learning fewer music generation models based on convo-
architecture, the advantage of the Markov model lutional neural networks. But by improving the
is that it is easier to control, allowing constraints neural network, music can be generated.
to be attached to the internal structure of the The audio generation model WaveNet[68],
algorithm[4]. However, from a compositional per- introduced by Google Deepmind, including its
spective, Markov models may yield more novel variants TimbreTron[61] (timbre conversion),
combinatorial forms for shorter sequences, but Manzelli et al.[21] (audio generation) provide
may reuse much of the corpus non-innovatively for effective improvements to convolutional neural
sequences with longer structures, resulting in an networks to enable them to generate music clips,
excessive number of repetitive fragments. see the structure of WaveNet in Figure 6. Wavenet
generates new music clips with high fidelity by
introducing extended causal convolution (Dilated
Casual Convolutions), which predicts the prob-
ability distribution of the current occurrence of
different audio data based on the already gen-
erated audio sequences. By introducing extended
convolution, the deep model has a very large per-
ceptual field, thus dealing with the long-span time-
dependent problem of audio generation.However,
there is no significant repetitive structure in the
generated music fragments and the generated Fig. 7: A Note RNN is trained on MIDI files
audio is less musical. Manzell et al. improved this and supplies the initial weights for the Q-network
method by using an LSTM (Long Short-Term and Target-Q-network, and final weights for the
Memory) network combined with the Wavenet Reward RNN. [84]
architecture to generate cello-style audio clips
from a given small melody. Where the LSTM net-
due to the gradient problem that occurs when pro-
work is used to learn the melodic structure of
cessing long time sequence data, which is prone
different musical styles, the output of this model
to gradient explosion (disappearance). LSTM, as
is used as input to the Wavenet audio generator
a variant of RNN, effectively alleviates the long
to provide the corresponding melodic information
time dependency problem of RNN by introducing
for the generation of the audio clip.
cell states and using three types of gates, namely
The Wavenet model can also be used for other
input gates, forgetting gates and output gates,
types of music generation tasks, with Engel et al.
to hold and control information. LSTM networks
constructing a Wavenet-like encoder to infer hid-
are mainly used in music generation systems to
den temporal distribution information and feeding
generate music clips instead of RNN networks.
it into a Wavenet decoder, which effectively recon-
The combination of LSTM and other algo-
structs the original audio and enables conversion
rithms can effectively generate melodies, poly-
between the timbres of different instruments. Sim-
phonic music, etc. The MelodyRNN (Figure7)
ilarly, TimbreTron is a musical timbre conversion
proposed by the Google Brain team[84] built a
system that applies ’image’ domain style trans-
lookback RNN and an attention RNN in order
fer to the time-frequency representation of audio
to improve the ability of RNNs to learn long
signals. The timbre transfer is performed in the
sequence structures, where the attention RNN
log-CQT domain using CycleGAN (Cycle Gener-
introduced an attention mechanism (attention)
ative Adversarial Networks) by first calculating
to model higher-level structures in music to gen-
the Constant Q Transform of the audio signal and
erate melodies. Keerti et al.[49] also introduced
using its logarithmic amplitude as an image, and
an attention mechanism and used dropout to
finally using the Wavenet synthesiser to generate
reduce overfitting to generate jazz music. Rl
high-quality waveforms, enabling the transforma-
Tuner[40] used an RNN network to provide par-
tion of instrument timbres such as cello, violin
tial reward values for a Reinforcement Learning
and electric guitar. In addition, Coconet used a
(RL) model, with the reward values consisting of
convolutional alignment network to automate the
two components: (1) the similarity of the next
filling of incomplete scores by modelling the tenor,
note generated to the note predicted by the reward
baritone, bass and soprano in a Bach chorus as
RNN; (2) the degree of satisfaction with the user-
an eight-channel feature map, using the feature
defined constraints. However, this method is based
map of randomly erased notes as input to the
on overly simple melodic composition rules and
convolutional neural network.
can only generate simple monophonic melodies.
Other music generation systems for melody gen-
4.3.2 Recursive Neural Networks eration include Hadjeres et al.[28], who propose
Recursive Neural Networks are a class of neural an Anticipation-RNN neural network structure for
networks for processing time series, which are very generating melodies for interactive Bach-style cho-
suitable for processing music clips. However, RNN ruses; StructureNet[63], which generates simple
cannot solve the long time dependency problem monophonic accompanying music based on LSTM
11

networks; and Makris et al.[60], who build a sep- network with a symbolic representation of the
arate LSTM network for each type of drum An music (Piano-roll) as input and an audio repre-
LSTM network was constructed with a feed for- sentation (sound spectrum map) as output. The
ward (FF) network as the condition layer, and model consists of two sub-networks: a convolu-
its input was a One-hot matrix of 26 features tional encoder/decoder structure is used to learn
including rhythm and beat number, which could the correspondence between Piano-roll and the
generate sequences of drums with untrained beat acoustic spectrogram; a multi-segment residual
numbers. network is further used to optimise the results by
For arrangement generation, Folk-rnn first adding spectral textures for overtones and tim-
used LSTM networks to model musical sequences bres, ultimately generating music clips for violin,
transcribed into ABC notation for generating folk cello and flute.
music. DeepBach[29] used a bi-directional LSTM
(Bi-LSTM) to generate Bach-style choral music 4.3.3 Generative Adversarial Nets
considering two directions in time: one aggre-
gating past information and the other aggregat-
ing future information, demonstrating that the
sequences generated by the bi-directional LSTM
were more musical. XiaoIce Band proposes an
end-to-end Chinese multi-track pop music gen-
eration framework that uses a GRU network
to handle the low-dimensional representation of
chords and obtains hidden states through encoders
Fig. 8: System diagram of the MidiNet model for
and decoders, ultimately generating music for
symbolic-domain music generation.
multi-instrument tracks. Amadeus[50] uses an
explicit duration encoding approach: multiple
audio streams represent note durations and uses
RL’s reward mechanism to improve the structure Generative Adversarial Nets (GAN)[26] is
of the generated music. Jambot[6] uses a chord a state-of-the-art method for generating high-
LSTM network to predict chord sequences, and a quality images. The idea behind the method is
polyphonic LSTM network to generate polyphonic to train two networks simultaneously: a generator
music based on the predicted chord sequences. and a discriminator, with the two lattices playing
Song From PI[12] uses a layered RNN network against each other, with the discriminator eventu-
model, where the two bottom LSTM networks ally being unable to distinguish between the truth
generate the melody and the two top LSTM and falsity of the real data and the generated data
networks generate the drums and chords. as the optimal result. Researchers have been work-
In addition, audio generation models such ing on applying this to more sequential data, such
as Performance RNN[69] and PerformanceNet[85] as audio and music. For music generation, the
are considered to be the possible future develop- problems are that:
ment directions of music generation systems[20]. (1) music is of a certain time series
Music performance is a necessary condition for (2) music is multi-track/multi-instrumental
giving meaning and feeling to music. In addition, (3) monophonic music generation cannot be used
direct synthesis of audio with a library of sound directly for polyphonic music
samples often leads to mechanical and dull results.
Performance RNN converts a MIDI file of a live Numerous researchers have improved GAN
piano piece into a musical representation of mul- networks to generate melodies, arrangements,
tiple one-hot vectors with 413 dimensions, and style transformations, etc.
the method specifies a time ”step” to a fixed size For melody generation, C-RNN-GAN[64] uses
(10ms) rather than a note time value, thus cap- LSTM networks as a generator and discrimina-
turing more expressiveness in the note timing. tor model, but the method lacks a conditional
PerformanceNet is a first attempt at score-to- generation mechanism to generate music based
audio conversion using a fully convolutional neural on a given antecedent. MidiNet[88] improves the
method by inserting the melody and chord infor- lyrics, and recursive layers to improve the accu-
mation generated in the previous stage as a condi- racy of pitch to achieve male and female voice
tional mechanism in the middle convolution layer conversion. In addition, Jin et al.[46] used an
of the generator to restrict the generation of note LSTM network as a generator, adding music rules
types, thus improving the innovation of generat- as a reward function in reinforcement learning to
ing single jazz pieces, the diagram of this model a control network that takes advantage of the AC
is shown in Figure8. JazzGAN[80] developed a (Actor Critic) reinforcement learning algorithm to
GAN-based model for single jazz generation, using select for diversity of elements and avoid the draw-
LSTM networks to improvise monophonic jazz back of GAN networks generating samples that
music over chord progressions, and proposed sev- are too similar, thus improving the creativity of
eral metrics for evaluating the musical features stylistic music generation. The same goes for Play
associated with jazz music. SSMGAN[42] used as You Like[58] which implements the conversion
the GAN model to generate a self-similar matrix of classical, jazz and pop music. For audio gen-
(SSM) to represent musical self-repetition, and eration, WaveGAN[16] achieved the first attempt
subsequently fed the SSM matrix into an LSTM to apply GAN to the synthesis of raw waveform
network to generate melodies. Yu et al.[89] first audio signals, and GANSynth[22] improved the
proposed a conditional LSTM-GAN for melody method by generating the entire sequence in par-
generation from lyrics, containing an LSTM gen- allel rather than sequentially, which appended a
erator and LSTM discriminator, both with lyrics one-hot vector representing pitch and achieved
as conditional inputs. independent control of pitch and timbre based on
For arrangement generation, MuseGAN pro- a PGGAN networ[30], with the final results 50,000
posed a GAN model with three types of generators times faster than the standard Wavenet[68], and
to construct correlations between multiple tracks, the generated music clips are both better than
containing independent generation within tracks, Wavenet and WaveGAN[16].
global generation between tracks, and composite Despite the outstanding performance in music
generation, but the method produced many over- generation, GAN still have shortcomings:
fragmented notes. BinaryMuseGAN[19] improved
(1) They are difficult to train
the method by introducing binary neurons (
(2) They are poorly interpretable, and there is
Binary Neuron, BN) as input to the genera-
still a lack of high-quality research results on
tor, a representation that makes it easier for
how GANs can model text-like data or musical
the discriminator to learn decision boundaries,
scores.
while introducing chromatic features to detect
chordal features to reduce over-fragmented notes.
LeadSheetGAN[54] uses a functional score (Lead
4.3.4 Variational Auto-Encoder
Sheet) as a conditional input to generate a Piano-
roll, and this approach produces results that con-
Input
tain more information at the musical level due
to the clearer melodic direction of the functional Encoder

score compared to MIDI data. Z Latent


Code

In terms of style transfer and audio genera-


Conductor
tion, CycleGAN[7] introduces a style loss function
for style conversion and a content loss function
Decoder
to maintain content consistency based on Cycle-
GAN[92] in order to achieve conversion of jazz, ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
16 16 16

classical and pop music, while using two discrimi- Output

nators to determine whether the generated music


belongs to the merged set of two music domains. Fig. 9: Schematic of the hierarchical recurrent
CycleBEGAN[87] uses BEGAN network[3] to sta- Variational Autoencoder model, MusicVAE.
bilize the training process, while introducing jump
connections to improve the clarity of melody and
13

Variational Auto-Encoder (VEA) is essen- simulate polyphonic music of any length. MIDI-
tially a compression algorithm for encoders and Sandwich2 [52] proposed a RNN-based hierar-
decoders, which has been able to analyse and chical multi-modal fusion to generate a VAE
generate information such as pitch dynamics and (MFG-VAE) network. First, multiple independent
instrumentation in polyphonic music. The use of and non-weighted Binary VAEs (BVAE) are used
VAE aims to solve two problems[73]: music recom- to extract the information of instruments in dif-
bination and music prediction, but when the data ferent tracks. Feature information, MGF-VEA is
is multimodal, VAE does not provide a clear used as the top-level model to fuse the features
mechanism for generating music, usually using a of different modalities, and finally multi-track
hybrid coding model based on VAE, such as a music is generated through the BVAE decoder.
combination of VAE networks and RNN networks. This method improves the refinement method
Some typical examples are MIDI-VAE[5], of BinaryMuseGAN, transforms the binary input
Figure9 and MusicVAE[74]. Among them, MIDI- problem into a multi-label classification problem,
VAE constructs three pairs of coders/decoders and solves the problem of the difficulty of distin-
sharing a potential space to automatically recon- guishing the refining steps of the original scheme
struct the pitch, intensity and instrumentation of and the difficulty of descending the gradient.
a musical composition for musical style conversion; MuseAE[81] applied Adversarial Autoencoders to
this work represents the first successful attempt music generation for the first time. The advan-
to apply style conversion to a complete musical tage of this method over VAE: Using adversarial
composition, but the small amount of data used regularization instead of KL divergence, in prin-
for style feature training results in short lengths ciple, it can be forced to follow any probability
of the generated music. MusicVAE uses hierarchi- distribution, thus provides greater flexibility in
cal decoder to improve the modelling of sequences choosing the prior distribution of latent variables.
with long-term structure, using a bidirectional The Jukebox built a system that can generate a
RNN as an encoder, allowing the model to interpo- variety of high-fidelity music in the original audio
late and reconstruct well. Then, GrooVAE[25] as a domain, but this method still requires 9 hours to
variant of MusicVAE, generates drum melodies by render one minute of music.
training on a corpus of electronic drums recorded
from live performances. 4.3.5 Transformer
For the generation of Eastern popular and folk
A piece of music often has multiple themes,
music, MG-VAE[59] was the first study to apply
phrases or repetitions on different time scales.
deep generative models and adversarial training to
Therefore, generating long-term music with com-
oriental music generation, which was able to con-
plex structures has always been a huge challenge.
vert music into folk music with a specific regional
The music generation system is usually based
style. Jiang et al.[43] developed a music genera-
on a sequential RNN (or LSTM, GRU, etc.)
tion model with style control, which was based on
network, but this mechanism brings about two
MusicVAE and used global and Three music gen-
problems: Calculating from left to right or calcu-
eration models were constructed using global and
lating from right to left limits the parallel ability
local encoders in an attempt to find a suitable way
of the model; Part of the information will be
to represent musical structure at the bar, phrase,
lost during the sequence calculation. This led
and song levels, which could eventually generate
Google to propose the classical network archi-
melodies with style control.
tecture Transformer[82], which uses an Attention
In addition, MahlerNet[57] constructed a con-
mechanism instead of the traditional Encoder-
ditional VAE to model the distribution of latent
Decoder framework that must combine the inher-
states. Two bidirectional RNN networks (one con-
ent patterns of CNNs or RNNs. The core idea of
sidering music reconstruction and one considering
this architecture is to use the Attention mecha-
context) constitute an encoder, and the decoder
nism to indicate correlations between inputs, with
outputs duration, pitch, and instrument, real-
the following advantages:
ized the use of any number of instruments to
(1) Data parallel computing
(2) Visualization of self-reference
(3) Solve the problem of long-term dependence (optional)
better than RNN, and has achieved great suc- .wav
Onsets and
MIDI Melody
Frames
cess in the task of maintaining long continuity
in timing. Performance Performance
[...|...|...|...|...] [embedding] Decoder
(MIDI) Encoder
However, this model is not very implementable
(optional)
in music generation tasks because of the complex- Melody Melody
[...|...|...|...]
(MIDI) Encoder
ity of the space complexity of the intermediate
vectors characterizing the relative positions in the
Fig. 10: The flow chart of Transformer self-
algorithm (the square of the sequence). Music
encode: First transcribes the .wav data file into
Transformer successfully applied this structure
a MIDI file using the Onsets and Frames frame-
to the generation of piano pieces by reducing
work, which is then encoded into a performance
the space complexity of the intermediate vec-
representation to be used as input. The output of
tors characterizing the relative positions to the
the performance encoder is then aggregated across
sequence length order of magnitude, which greatly
time and (selectively) combined with melodic
reduced the space complexity , but with poor
embedding to produce a representation of the
musicality and confusing structure while generat-
entire performance, which is then used by the
ing musical pieces of longer duration. LakhNES
Transformer decoder when reasoning.[70]
[15] used migration learning to generate chip
music by pre-training in a large-scale dataset
using Transformer’s latest extension Transformer-
addition, Transformer VAE combines the Trans-
XL, and then fine-tuning in a small-scale chip
former with VAE, effectively improving on the
music dataset. MuseNet[70] used the same net-
shortcomings of VAE in that it does not handle
work as GPT-2 (a large Transformer-based lan-
time series structure well and the potential states
guage model[8]), which is trained using the Trans-
of the Transformer are not easily interpreted. Choi
former’s recomputed and optimised kernel to
et al.[11] similarly used the Transformer coder
generate a four-minute musical composition con-
(decoder) to obtain a global representation of the
sisting of ten different instruments, but the model
music, thus better control over aspects such as
does not necessarily arrange the music according
performance style and melody.
to the input instruments, generating each note by
calculating the probability of all possible notes
4.3.6 Evolutionary Algorithms
and instruments. Figure10 shows the flow chart of
this Transformer self-encoder. The field of evolutionary computing encompasses
In addition, some scholars have used Trans- a variety of techniques and methods inspired by
former in combination with other models to natural evolution. At its heart are Darwinian
improve problems such as the poor musicality search algorithms based on highly abstract bio-
of long sequences of music generated by this logical models. The advantage of applying them
model. Guan et al.[27] proposed a combination of to music generation systems is that they can be
Transformer and GAN, first using a self-attentive optimized in the direction of high relative qual-
mechanism based on GAN to extract spatial and ity or adaptability, thus compressing or extending
temporal features of music to generate consistent the data and processes required for the search.
and natural monophonic music, and then using It provides an innovative and natural means of
a generative adversarial structure with multiple generating musical ideas from a set of specifi-
branches to generate multi-instrumental music able original components and processes.[96] EA
with harmonic structure between all tracks. Zhang requires three basic criteria to be considered in
et al.[90] also used the Transformer in combina- advance: 1. Problem domain: defining the type of
tion with GAN to make the generation of long music to be created; 2. Individual Representation:
sequences guided by the adversarial goal, which the representation of the music, melody, harmony,
provides powerful regularisation conditions that pitch, duration, etc.; 3.Fitness Measure. Adapta-
can force the Transformer to focus on learning tion measure is a very important part of genetic
the global and local structure of the music. In algorithms to evaluate the goodness of a solution,
15

generator, melody generator, and accompaniment


generator, but the method lacked the genera-
tion of harmonies. EvoComposer[13] improved the
method by generating harmonies for four voices
(male high, male low, female high, female low) by
giving one of the four voices, while an adaptation
function is defined: on the one hand following clas-
sical music rules and on the other hand extracting
statistical information from a corpus of existing
Fig. 11: Steps for generating a composition using music in order to assimilate the style of a par-
evolutionary algorithms. ticular composer. Olseng[67] used four adaptation
measures: local, global melodic features and local,
global harmonic features with the aim of making
where the adaptation statistics of different music the melodies generated by the algorithm sound
generation systems are shown in Tab 2. In music harmonious and pleasant. Kaliakatsos[48] uses
generation systems, adaptation measure includes Particle swarm optimization (PSO) to converge
interactive adaptation, similarity to a music cor- on user-preferred rhythmic and pitch features, and
pus, rule-based metrics, etc. Early evolutionary then uses a genetic algorithm to compose music
algorithms commonly used interactive adaptation based on these features in a ’top-level’ evolution.
measure, which requires significant human inter-
vention and can suffer from adaptation bottle-
necks; poor robustness and wider applicability in
5 Evaluation
music corpus or rule-based adaptation measures, In the field of music generation, evaluating the
as well as limitations on the creative potential of effectiveness of generation is a major challenge.
the system. Existing evaluation methods mainly combine sub-
For melody generation, Munoz et al.[66] used jective and objective evaluation, where subjective
a non-digital bass (Unfigured Bass) arrangement evaluation includes Turing tests, questionnaires,
technique to create bass melodies, harmonic rules etc., and objective evaluation includes quantita-
as an adaptive metric, the bass as input to tive indicators based on musical rules, similarity
find subspaces of suitable melodies in a parallel to corpus or style, etc. Agres et al.[62] provide an
manner, and used local composer intelligence to overview of current evaluation methods for music
adaptively output high-quality four-voice musical generation systems, providing provides motiva-
compositions. Jeong et al.[41] proposed a multi- tion and tools. These metrics are interrelated and
objective evolutionary algorithm for automatic together influence the construction of the music
melodic synthesis that produces multiple melodies generation system and the quality of the resulting
at once, which measures the psychoacoustic effects pieces.
of the resulting music based on the stability and
tension of the sound as an adaptive metric. Lopes 5.1 Subjective evaluation
et al. [55] proposed an automatic melodic synthe-
sis method based on two adaptive functions, one Subjective evaluation of music generation systems
defined using Zipf’s law[93] and the other using is usually done by performing human listening
Fux’s law[24]. GGA-MG uses LSTM as an adap- tests to evaluate the performance of the system
tive function, first providing a sufficient dataset in terms of melodic output. This method can be
of random and artificial samples using an initial used in a similar way to the Turing test or ques-
Generative Genetic Algorithm (GGA), and then tionnaires. For example, in MG-VAE, listeners are
using a Bi-LSTM neural network for training as an used to rate the musicality (i.e. whether the song
adaptive function for the main GGA to generate has a clear musical style or structure) and stylis-
melodies similar to those of human compositions. tic significance (i.e. whether the style of the song
In arranging music, MetaCompose[75], 11 was matches a given regional label) of the generated
used to generate real-time music, using a multi- music, combined with several quantitative metrics
objective adaptive function to construct a chord based on musical rules to evaluate the merits of
The evaluation of music generation systems is generally divided into subjective and objective evaluation, where subjective evaluation includes Turing tests, questionnaires, etc., and objective evaluation includes quantitative indicators based on musical rules, similarity to a corpus or style, etc. Agres et al.[1] provide an overview of current evaluation methods for music generation systems, together with their motivation and tools. These metrics are interrelated and together influence the construction of the music generation system and the quality of the resulting pieces.

5.1 Subjective evaluation

Subjective evaluation of music generation systems is usually carried out through human listening tests that assess the performance of the system in terms of its melodic output. The tests can take a form similar to the Turing test, or be administered as questionnaires. For example, in MG-VAE, listeners rate the musicality (i.e. whether the song has a clear musical style or structure) and the stylistic significance (i.e. whether the style of the song matches a given regional label) of the generated music, and these ratings are combined with several quantitative, rule-based metrics to judge the merits of the generation system. Chen et al.[10] define a manual evaluation metric that covers interactivity, i.e. whether the interaction between chords and melody is good; complexity, i.e. whether the piece is complex enough to express its theme; and structure, i.e. whether the sample shows some repetition, forward movement and variation. Performance RNN collects people's ratings of the generated music, and Wen et al.[40, 54, 85] likewise combine human ratings to judge the performance of a music generation system.
While human feedback may be the most reasonable way to rate generated music, it has limitations: human users cannot perform a large number of listens; listener fatigue sets in within the first few minutes of rating or selecting music; and differences in personal taste increase the uncertainty of ratings or selections, leading to inconsistent results. Sturm et al.[76] describe the benefits of evaluating music in a concert setting, or by extending a concert into a music competition: a concert is one of the most natural ways to experience music and, to some extent, fatigues the evaluator less than a laboratory setting. Even so, the method inevitably suffers from differences between performers, venues and environments.
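As an illustration of how such listening-test data are commonly summarised (a generic sketch, not a procedure prescribed by the surveyed systems), the snippet below aggregates listener ratings into a mean opinion score and attaches a bootstrap confidence interval over listeners, which makes the between-listener disagreement discussed above explicit; the 1-5 rating scale and the sample data are hypothetical.

```python
import numpy as np

def mean_opinion_score(ratings, n_boot=2000, seed=0):
    """ratings: array of shape (n_listeners, n_clips) with e.g. 1-5 scores.

    Returns the MOS and a 95% bootstrap confidence interval obtained by
    resampling listeners, which reflects between-listener disagreement.
    """
    ratings = np.asarray(ratings, dtype=float)
    rng = np.random.default_rng(seed)
    per_listener = ratings.mean(axis=1)      # each listener's average rating
    mos = per_listener.mean()
    boot = [rng.choice(per_listener, size=len(per_listener), replace=True).mean()
            for _ in range(n_boot)]
    low, high = np.percentile(boot, [2.5, 97.5])
    return mos, (low, high)

# Hypothetical ratings: 6 listeners x 4 generated clips, scale 1-5.
scores = [[4, 3, 5, 4], [3, 3, 4, 3], [5, 4, 4, 5],
          [2, 3, 3, 2], [4, 4, 5, 4], [3, 2, 4, 3]]
print(mean_opinion_score(scores))
```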

5.2 Objective evaluation

Objective evaluation of music generation systems covers similarity to a corpus or style, quantitative indicators based on musical rules, and so on. Similarity is central to the success metrics of music generation systems, so an important challenge is finding the right balance between similarity and novelty or creativity. Evaluating the creativity (sometimes equated with novelty) of generated music is a complex topic and is discussed at length by Agres et al.[1].
Different objective evaluation criteria can be chosen depending on the type of generated music. Zhang[90] devised seven objective metrics to compare generated compositions with real ones, including the number of different note repetitions, the count of different pitches, the number of triplets, and so on. GANSynth compares generated music with real recordings using human evaluation together with the number of statistically different bins (a measure of the diversity of the generated fragments), the Inception Score (to evaluate the GAN), pitch accuracy, pitch entropy and similar statistics. JazzGAN uses three kinds of metrics: mode-collapse metrics, which observe the intervals and durations of repeated and discontinuous notes; creativity metrics, which measure how many sequences the generated fragments copy from the corpus and how many variants they introduce; and chord-harmony metrics, which rate the generated sequences against a given chord sequence. MuseGAN evaluates the differences between generated and real compositions in two respects, within a single track and between tracks: the intra-track metrics include the proportion of empty bars, the number of pitches used per bar, the proportion of passing notes and the ratio of notes falling in 8- or 16-beat patterns, while the inter-track metric is the pitch distance between tracks.
A major challenge in the objective evaluation of music generation systems is that there is no single objective criterion for judging the 'goodness' of a system. These criteria are limited in that they apply only to the particular type of generation task at hand and cannot be transferred to all music generation systems, and their reasonableness has yet to be fully examined.
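To make such rule-based statistics concrete, the sketch below computes three MuseGAN-style intra-track quantities on a binary piano-roll of shape (bars, steps, pitches): the proportion of empty bars, the average number of distinct pitches per bar, and the fraction of time steps that are polyphonic. The tensor layout and the exact definitions are illustrative assumptions and may differ in detail from those used by the cited systems.

```python
import numpy as np

def intra_track_metrics(roll):
    """roll: binary array of shape (n_bars, steps_per_bar, 128)."""
    roll = np.asarray(roll, dtype=bool)
    notes_per_bar = roll.sum(axis=(1, 2))            # note activations per bar
    empty_bar_ratio = float((notes_per_bar == 0).mean())
    pitches_per_bar = roll.any(axis=1).sum(axis=1)   # distinct pitches per bar
    active_steps = roll.sum(axis=2)                  # notes sounding at each step
    polyphonic_rate = float((active_steps > 1).mean())
    return {
        "empty_bar_ratio": empty_bar_ratio,
        "mean_pitches_per_bar": float(pitches_per_bar.mean()),
        "polyphonic_rate": polyphonic_rate,
    }

# Toy example: 4 bars of 16 sixteenth-note steps over 128 MIDI pitches.
roll = np.zeros((4, 16, 128), dtype=bool)
roll[0, ::2, 60] = True          # bar 0: a repeated C4
roll[1, :, [64, 67]] = True      # bar 1: a sustained dyad (polyphony)
print(intra_track_metrics(roll))
```

Computing the same statistics on a reference corpus and on the generated material, and comparing the two, is the usual way such metrics are turned into an evaluation of a system rather than of a single clip.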
6 Analysis & Perspectives

6.1 Current state of intelligent music generation

The amount of content that music generation systems can provide far exceeds what a composer can provide, and their cost per minute of music is much lower than a composer's. In addition, they can customize and personalize music to a user's settings, inspire composers, and support music education, so the field has very broad development prospects and commercial value. Companies already active in this space include Melodrive, which focuses on instant music generation for game scenarios; Amper Music, which generates personalized, customized original music in a few seconds; and Ambient Generative Music, which automatically generates personalized music for different environments.

With the development of artificial intelligence, certain difficulties have also become apparent in the rapid progress of AI composition. Current music generation systems can generate, within a short time, polyphonic music of relatively high quality covering different styles and instruments[46]. However, constructing long music fragments with a complete structure[44] and evaluating the generated music[1] have become common problems for researchers in artificial intelligence composition. The latest systems can generate longer fragments; MuseNet, for example, can use 10 different instruments to generate 4-minute works. Even so, current generation algorithms do not yet handle the time-series problem well: as the duration of the generated fragments increases, their musical quality degrades and their structure becomes increasingly chaotic or repetitive. Researchers are therefore working on optimizing the generation algorithms, or on constraining the generated fragments with music-theory knowledge, to alleviate the time-dependence problem of music generation systems.

6.2 Different characteristics of Eastern and Western music generation

From the perspective of the technical characteristics of music generation itself, in contrast to the mature and active artificial intelligence music generation in the West, Eastern researchers are still in their infancy in the field of intelligent composition. Although individual companies and research units keep trying, there are still relatively few studies on the generation of local Eastern music. The main limitations are the differences in structure and in the development of musical thinking between Eastern and Western music[45], the lack of corpora for various types of Eastern music, the lack of effective evaluation methods and systems, and the shortage of professionals who combine basic music knowledge with engineering technology.
On the one hand, owing to the different cultural backgrounds of the East and the West, there are obvious differences between Eastern traditional music and Western classical music (especially the symphony)[33]. Specifically, Eastern music has a different understanding of structural forces: it usually constructs single-line monophonic music that develops in a linear fashion, pursuing above all the beauty of the melodic line, and Eastern classical music (especially Chinese classical music) is based mainly on pentatonic scales. Western music, on the contrary, is mostly multi-voice music centred on the major and minor keys; it emphasizes the harmonic logic of the music, pays great attention to musical form, and uses melodic techniques such as repetition and derivation, contrast, and expansion to build the atmosphere. In addition, the East and the West each have distinctive features in how musical art is expressed. Western music performance highlights the opposition between the subjective and the objective and is known for its profundity and seriousness. In contrast to the strict adherence to the score pursued by Western performers[83], traditional Eastern music emphasizes the unification of the subjective and the objective, and its rhythm is freer and more varied: depending on the needs of the performance, performers act more subjectively and enjoy greater freedom, so the secondary creation of the music in performance plays a much larger role. Viewed as a stream of information, this makes traditional Eastern music considerably harder to label and analyze.
Considering the differences in structure, content and performance practice between Eastern and Western music, a music generation system needs substantial adjustment and optimization to suit either of the two, which reduces the transferability between Eastern and Western music generation systems.
On the other hand, there is currently little research on the generation of Eastern music, which leads to a lack of corpora for the different types of Eastern music and, in particular, a lack of classic, high-quality corpora. MG-VAE[59] collects Chinese folk songs of different regional styles, and XiaoIce Band[91] and CH818[34] collect Chinese pop music, among a few other efforts. Without a high-quality corpus of Eastern music, a generation system cannot effectively learn the latent characteristics of the sampled music types and cannot generate high-quality music fragments. In addition, collecting different types of music is a tedious and difficult task, and song copyright issues further restrict the development of Eastern scientific research in music generation.
Furthermore, the lack of an effective method for evaluating music generation systems and the absence of a unified objective evaluation method mean that automatically generated music clips usually require manual selection of those that are better structured and sound better. In contrast to the extensive research on Western music genres and structures, there is less structural analysis of the different music genres of the East, and thus a lack of objective information that can be applied to evaluating the effectiveness of Eastern music generation, which limits the development of Eastern music generation techniques.
The majority of researchers in the East focus on a single area, such as music or engineering technology, and there is a severe shortage of cross-disciplinary expertise combining basic music knowledge with engineering technology, which strongly limits the development of intelligent music engineering in the East. Researchers need to relearn music theory, but because musical knowledge is so rich, moving from an introductory level to mastery is extremely difficult. The lack of a foundation in music theory has led most researchers to focus mainly on optimising and improving generative algorithms, neglecting the effective role that music-theory rules can play in improving generative systems.

6.3 Trends in Artificial Intelligence Music

Both the limitations of Eastern music generation and the commonalities between Eastern and Western music generation can serve as directions for the subsequent development and improvement of music generation systems, so that they produce pleasing songs that meet individual needs and have a complete structure. Combing through the development of AI composition and the current state of research, the following trends in music generation systems emerge.

6.3.1 Music Generation Technology Moves towards Maturity

Current music generation systems mainly focus on generating a single modality, either lyrics or music. Cross-modal generation, such as the simultaneous generation of music and lyrics or of music and video, will be an important direction for the future development of artificial intelligence in music; it requires extracting and merging features across modalities, which poses certain technical difficulties. In the future, generation systems can be standardized and improved through better music data structures, including piano-roll, MIDI events, scores and audio (two of these representations are sketched below), as well as through more standardized test datasets and evaluation indicators. In addition, since deep learning architectures are the mainstream generation method of music generation systems, enhancing the interpretability and human controllability of deep networks has become one of the future development directions of music generation technology. At the same time, to address the time-dependence problem of music generation, generation at the segment level rather than the note level is also a research trend for future music generation systems.
6.3.2 Emotional expression of music generation and its coordinated development with robot performance

Music is one of the important forms of human emotional expression. Musical emotion is conceptually regarded as an expression of human emotion that is difficult to quantify, and it undergoes rich changes as the music progresses. The current degree of intelligence in music generation is still low: it is grounded mainly in signal analysis, without a human-like recognition system for musical emotion, without anthropomorphic compositional thinking, and with human-computer interaction limited to shallow information exchange.
Therefore, integrating multi-channel information such as computer vision and computer hearing to recognise human musical emotion and its expression, and then optimising the emotional dimension of music generation with techniques such as deep learning, is the main goal in building future interactive intelligent composition systems.
In addition, studies such as music visualisation on robotic systems[53] and music-therapy robots with emotional interaction capabilities[79] use robots as the main body through which robotics and musical expression are integrated in a benign way. The future realisation of intelligent composition and the collaborative performance of music robots under affective computing is a highlight of the development of integrated intelligent music generation and performance.

7 Conclusion

The involvement of artificial intelligence in artistic practice has become a common phenomenon, and the use of AI algorithms to generate music is a very active area of research.
This paper systematically introduces the recent state of the art in music generation and gives a brief overview of the forms of music coding and the evaluation criteria. In addition, it analyses the current state of artificially intelligent composition worldwide (especially with respect to the traditional music of the East and the West) and compares the different development characteristics of the two. In particular, the opportunities and challenges of future music generation systems are analysed. It is hoped that this survey will contribute to a better understanding of the current state of research and the development trends of AI-based music generation systems.

Declarations

• Funding This work was supported by the National Natural Science Foundation of China (No. 71771176, 61503287 and 51775385), the Natural Science Foundation of Shanghai (No. 19ZR1479000) and the Science and Technology Winter Olympic Project (No. 2018YFF0300505).
• Competing Interests The authors declare that they have no conflict of interest.
• Ethics approval This article does not contain any studies with human participants or animals performed by any of the authors.
• Availability of data and materials Not applicable.
• Code availability Not applicable.
• Authors' contributions Methodology: Lei Wang; Formal analysis and investigation: Ziyi Zhao; Writing - original draft preparation: Ziyi Zhao, Hanwei Liu; Data curation: Junwei Pang; Writing - review and editing: Yi Qin, Song Li, Ziyi Zhao; Supervision: Qidi Wu, Lei Wang; Funding acquisition: Lei Wang.

References

1. Agres, K., Forth, J., and Wiggins, G. A. (2016). Evaluation of musical creativity and musical metacreation systems. Computers in Entertainment (CIE), 14(3):1–33. ACM.
2. Avdeeff, M. (2019). Artificial intelligence & popular music: SKYGGE, flow machines, and the audio uncanny valley. In Arts, volume 8, page 130. Multidisciplinary Digital Publishing Institute. Issue: 4.
3. Berthelot, D., Schumm, T., and Metz, L. (2017). BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
4. Briot, J.-P., Hadjeres, G., and Pachet, F.-D. (2020). Deep Learning Techniques for Music Generation. Computational Synthesis and Creative Systems. Springer International Publishing, Cham.
5. Brunner, G., Konrad, A., Wang, Y., and Wattenhofer, R. (2018). MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. arXiv preprint arXiv:1809.07600.
6. Brunner, G., Wang, Y., Wattenhofer, R., and Wiesendanger, J. (2017). JamBot: Music Theory Aware Chord Based Generation of Polyphonic Music with LSTMs. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), pages 519–526, Boston, MA. IEEE.
7. Brunner, G., Wang, Y., Wattenhofer, R., and Zhao, S. (2018). Symbolic Music Genre Transfer with CycleGAN. arXiv:1809.07575 [cs, eess, stat].
8. Budzianowski, P. and Vulić, I. (2019). Hello, It's GPT-2 – How Can I Help You? Towards the Use of Pretrained Language Models for Task-Oriented Dialogue Systems.
9. Carnovalini, F. and Rodà, A. (2020). Computational creativity and music generation systems: An introduction to the state of the art. Frontiers in Artificial Intelligence, 3:14.
10. Chen, K., Zhang, W., Dubnov, S., Xia, G., and Li, W. (2019). The effect of explicit structure encoding of deep neural networks for symbolic music generation. In 2019 International Workshop on Multilayer Music Representation and Processing (MMRP), pages 77–84. IEEE.
11. Choi, K., Hawthorne, C., Simon, I., Dinculescu, M., and Engel, J. (2020). Encoding musical style with transformer autoencoders. In International Conference on Machine Learning, pages 1899–1908. PMLR.
12. Chu, H., Urtasun, R., and Fidler, S. (2016). Song From PI: A Musically Plausible Network for Pop Music Generation. arXiv:1611.03477 [cs].
13. De Prisco, R., Zaccagnino, G., and Zaccagnino, R. (2020). EvoComposer: an evolutionary algorithm for 4-voice music compositions. Evolutionary Computation, 28(3):489–530. MIT Press.
14. Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., and Sutskever, I. (2020). Jukebox: A Generative Model for Music. arXiv preprint arXiv:2005.00341.
15. Donahue, C., Mao, H. H., Li, Y. E., Cottrell, G. W., and McAuley, J. (2019a). LakhNES: Improving multi-instrumental music generation with cross-domain pre-training. arXiv preprint arXiv:1907.04868.
16. Donahue, C., McAuley, J., and Puckette, M. (2019b). Adversarial Audio Synthesis. arXiv:1802.04208 [cs].
17. Delgado, M., Fajardo, W., and Molina-Solana, M. (2009). Inmamusys: Intelligent multiagent music system. Expert Systems with Applications, 36(3):4574–4580.
18. Dong, H.-W., Hsiao, W.-Y., Yang, L.-C., and Yang, Y.-H. (2018). MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Thirty-Second AAAI Conference on Artificial Intelligence.
19. Dong, H.-W. and Yang, Y.-H. (2018). Convolutional Generative Adversarial Networks with Binary Neurons for Polyphonic Music Generation. arXiv:1804.09399 [cs, eess, stat].
20. Dong, H.-W. and Yang, Y.-H. (2019). Generating Music with GANs. https://salu133445.github.io/ismir2019tutorial/pdf/ismir2019-tutorial-slides.pdf. Accessed 11 January 2022.
21. Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. In International Conference on Machine Learning, pages 1068–1077. PMLR.
22. Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., and Roberts, A. (2019). GANSynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710.
23. Farzaneh, M. and Toroghi, R. M. (2020). GGA-MG: Generative Genetic Algorithm for Music Generation. arXiv preprint arXiv:2004.04687.
24. Fux, J. J. and Edmunds, J. (1965). The study of counterpoint from Johann Joseph Fux's Gradus ad Parnassum. Number 277. W. W. Norton & Company.
25. Gillick, J., Roberts, A., Engel, J., Eck, D., and Bamman, D. (2019). Learning to Groove with Inverse Sequence Transformations. In International Conference on Machine Learning, pages 2269–2279. PMLR.
26. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
27. Guan, F., Yu, C., and Yang, S. (2019). A GAN Model With Self-attention Mechanism To Generate Multi-instruments Symbolic Music. In 2019 International Joint Conference on Neural Networks (IJCNN).
28. Hadjeres, G. and Nielsen, F. (2017). Interactive Music Generation with Positional Constraints using Anticipation-RNNs. arXiv preprint arXiv:1709.06404.
29. Hadjeres, G., Pachet, F., and Nielsen, F. (2017). DeepBach: a Steerable Model for Bach Chorales Generation. In International Conference on Machine Learning, pages 119–127. PMLR.
30. Han, C., Murao, K., Noguchi, T., Kawata, Y., Uchiyama, F., Rundo, L., Nakayama, H., and Satoh, S. (2019). Learning more with less: Conditional PGGAN-based data augmentation for brain metastases detection using highly-rough annotation on MR images. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 119–127.
31. Herremans, D. and Chew, E. (2019). MorpheuS: Generating Structured Music with Constrained Patterns and Tension. IEEE Transactions on Affective Computing, 10(4):510–523.
32. Herremans, D., Chuan, C.-H., and Chew, E. (2017). A Functional Taxonomy of Music Generation Systems. ACM Computing Surveys, 50(5):1–30.
33. Hu, X. and Lee, J. H. (2012). A Cross-cultural Study of Music Mood Perception between American and Chinese Listeners. In ISMIR, pages 535–540.
34. Hu, X. and Yang, Y.-H. (2017). The mood of Chinese Pop music: Representation and recognition. Journal of the Association for Information Science & Technology.
35. Huang, A. and Wu, R. (2016). Deep learning for music. arXiv preprint arXiv:1606.04930.
36. Huang, C.-F., Lian, Y.-S., Nien, W.-P., and Chieng, W.-H. (2016). Analyzing the perception of Chinese melodic imagery and its application to automated composition. Multimedia Tools and Applications, 75(13):7631–7654.
37. Huang, C.-Z. A., Cooijmans, T., Roberts, A., Courville, A., and Eck, D. (2019). Counterpoint by convolution. arXiv preprint arXiv:1903.07227.
38. Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. (2018). Music transformer. arXiv preprint arXiv:1809.04281.
39. Huang, S., Li, Q., Anil, C., Oore, S., and Grosse, R. B. (2019). TimbreTron: A wavenet (cyclegan (cqt (audio))) pipeline for musical timbre transfer. arXiv preprint arXiv:1811.09620.
40. Jaques, N., Gu, S., Turner, R. E., and Eck, D. (2017). Tuning recurrent neural networks with reinforcement learning.
41. Jeong, J., Kim, Y., and Ahn, C. W. (2017). A multi-objective evolutionary approach to automatic melody generation. Expert Systems with Applications, 90:50–61. Elsevier.
42. Jhamtani, H. and Berg-Kirkpatrick, T. (2019). Modeling Self-Repetition in Music Generation using Generative Adversarial Networks. In Machine Learning for Music Discovery Workshop, ICML.
43. Jiang, J. (2019). Stylistic Melody Generation with Conditional Variational Auto-Encoder.
44. Jiang, J., Xia, G. G., Carlton, D. B., Anderson, C. N., and Miyakawa, R. H. (2020). Transformer VAE: A Hierarchical Model for Structure-Aware and Interpretable Music Representation Learning. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 516–520.
45. Chen, J. (2015). Comparative study between Chinese and western music aesthetics and culture.
46. Jin, C., Tie, Y., Bai, Y., Lv, X., and Liu, S. (2020). A Style-Specific Music Composition Neural Network. Neural Processing Letters, 52(3):1893–1912.
47. Kaliakatsos-Papakostas, M., Floros, A., and Vrahatis, M. N. (2020). Artificial intelligence methods for music generation: a review and future perspectives. Nature-Inspired Computation and Swarm Intelligence, pages 217–245. Elsevier.
48. Kaliakatsos-Papakostas, M. A., Floros, A., and Vrahatis, M. N. (2016). Interactive music composition driven by feature evolution. SpringerPlus, 5(1):1–38. Springer.
49. Keerti, G., Vaishnavi, A., Mukherjee, P., Vidya, A. S., Sreenithya, G. S., and Nayab, D. (2020). Attentional networks for music generation. arXiv preprint arXiv:2002.03854.
50. Kumar, H. and Ravindran, B. (2019). Polyphonic Music Composition with LSTM Neural Networks and Reinforcement Learning. arXiv preprint arXiv:1902.01973.
51. Leach, J. and Fitch, J. (1995). Nature, music, and algorithmic composition. Computer Music Journal, 19(2):23–33. JSTOR.
52. Liang, X., Wu, J., and Cao, J. (2019). MIDI-Sandwich2: RNN-based Hierarchical Multi-modal Fusion Generation VAE networks for multi-track symbolic music generation. arXiv:1909.03522 [cs, eess].
53. Lin, P.-C., Mettrick, D., Hung, P. C., and Iqbal, F. (2018). Towards a music visualization on robot (MVR) prototype. In 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), pages 256–257. IEEE.
54. Liu, H.-M. and Yang, Y.-H. (2018). Lead Sheet Generation and Arrangement by Conditional Generative Adversarial Network. arXiv:1807.11161 [cs, eess].
55. Lopes, H. B., Martins, F. V. C., Cardoso, R. T., and dos Santos, V. F. (2017). Combining rules and proportions: A multiobjective approach to algorithmic composition. In 2017 IEEE Congress on Evolutionary Computation (CEC), pages 2282–2289. IEEE.
56. Loughran, R. and O'Neill, M. (2020). Evolutionary music: applying evolutionary computation to the art of creating music. Genetic Programming and Evolvable Machines, 21(1):55–85. Springer.
57. Lousseief, E. and Sturm, B. L. T. (2019). MahlerNet: Unbounded Orchestral Music with Neural Networks. In the Nordic Sound and Music Computing Conference 2019 and the Interactive Sonification Workshop 2019, pages 57–63.
58. Lu, C.-Y., Xue, M.-X., Chang, C.-C., Lee, C.-R., and Su, L. (2019). Play as You Like: Timbre-Enhanced Multi-Modal Music Style Transfer. Proceedings of the AAAI Conference on Artificial Intelligence, 33:1061–1068.
59. Luo, J., Yang, X., Ji, S., and Li, J. (2019). MG-VAE: Deep Chinese Folk Songs Generation with Specific Regional Style. arXiv:1909.13287 [cs, eess].
60. Makris, D., Kaliakatsos-Papakostas, M., Karydis, I., and Kermanidis, K. L. (2019). Conditional neural sequence learners for generating drums' rhythms. Neural Computing and Applications, 31(6):1793–1804.
61. Manzelli, R., Thakkar, V., Siahkamari, A., and Kulis, B. (2018). Conditioning Deep Generative Raw Audio Models for Structured Automatic Music. arXiv preprint arXiv:1806.09905.
62. Manzelli, R., Thakkar, V., Siahkamari, A., and Kulis, B. (2018). An end to end model for automatic music generation: Combining deep raw and symbolic audio networks. In Proceedings of the Musical Metacreation Workshop at the 9th International Conference on Computational Creativity, Salamanca, Spain.
63. Medeot, G., Cherla, S., Kosta, K., McVicar, M., Abdallah, S., Selvi, M., Newton-Rex, E., and Webster, K. (2018). StructureNet: Inducing Structure in Generated Melodies. In ISMIR, pages 725–731.
64. Mogren, O. (2016). C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv:1611.09904 [cs].
65. Mura, D., Barbarossa, M., Dinuzzi, G., Grioli, G., Caiti, A., and Catalano, M. G. (2018). A Soft Modular End Effector for Underwater Manipulation: A Gentle, Adaptable Grasp for the Ocean Depths. IEEE Robotics & Automation Magazine, PP(4):1–1.
66. Muñoz, E., Cadenas, J. M., Ong, Y. S., and Acampora, G. (2014). Memetic music composition. IEEE Transactions on Evolutionary Computation, 20(1):1–15. IEEE.
67. Olseng, O. and Gambäck, B. (2018). Co-evolving melodies and harmonization in evolutionary music composition. In International Conference on Computational Intelligence in Music, Sound, Art and Design, pages 239–255. Springer.
68. Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499 [cs].
69. Oore, S., Simon, I., Dieleman, S., Eck, D., and Simonyan, K. (2020). This time with feeling: learning expressive musical performance. Neural Computing and Applications, 32(4):955–967.
70. Payne, C. (2019). MuseNet. OpenAI Blog. https://openai.com/blog/musenet/. Accessed 11 January 2022.
71. Plut, C. and Pasquier, P. (2020). Generative music in video games: State of the art, challenges, and prospects. Entertainment Computing, 33:100337. Elsevier.
72. Ramanto, A. S., No, J. G., and Maulidevi, D. N. U. Markov Chain Based Procedural Music Generator with User Chosen Mood Compatibility. International Journal of Asia Digital Art and Design Association, 21(1):19–24.
73. Rivero, D., Ramírez-Morales, I., Fernandez-Blanco, E., Ezquerra, N., and Pazos, A. (2020). Classical Music Prediction and Composition by Means of Variational Autoencoders. Applied Sciences, 10(9):3053.
74. Roberts, A., Engel, J., Raffel, C., Hawthorne, C., and Eck, D. (2018). A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. In International Conference on Machine Learning, pages 4364–4373. PMLR.
75. Scirea, M., Togelius, J., Eklund, P., and Risi, S. (2016). MetaCompose: A compositional evolutionary music composer. In International Conference on Computational Intelligence in Music, Sound, Art and Design, pages 202–217. Springer.
76. Sturm, B. L., Ben-Tal, O., Monaghan, Ú., Collins, N., Herremans, D., Chew, E., Hadjeres, G., Deruty, E., and Pachet, F. (2019). Machine learning research that matters for music creation: A case study. Journal of New Music Research, 48(1):36–55. Taylor & Francis.
77. Sturm, B. L., Santos, J. F., Ben-Tal, O., and Korshunova, I. (2016). Music transcription modelling and composition using deep learning. arXiv preprint arXiv:1604.08723.
78. Supper, M. (2001). A Few Remarks on Algorithmic Composition. Computer Music Journal, 25(1):48–53.
79. Tapus, A. (2009). The role of the physical embodiment of a music therapist robot for individuals with cognitive impairments: longitudinal study. In 2009 Virtual Rehabilitation International Conference, pages 203–203. IEEE.
80. Trieu, N. and Keller, R. M. (2018). JazzGAN: Improvising with Generative Adversarial Networks. In MUME 2018: 6th International Workshop on Musical Metacreation.
81. Valenti, A., Carta, A., and Bacciu, D. (2020). Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders. arXiv:2001.05494 [cs, stat].
82. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
83. Veblen, K. and Olsson, B. (2002). Community music: Toward an international overview. The New Handbook of Research on Music Teaching and Learning, pages 730–753.
84. Waite, E. et al. (2016). Generating long-term structure in songs and stories. Web blog post. Magenta, 15(4).
85. Wang, B. and Yang, Y.-H. (2019). PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network. Proceedings of the AAAI Conference on Artificial Intelligence, 33:1174–1181.
86. Williams, D., Hodge, V. J., Gega, L., Murphy, D., Cowling, P. I., and Drachen, A. (2019). AI and Automatic Music Generation for Mindfulness. page 11.
87. Wu, C.-W., Liu, J.-Y., Yang, Y.-H., and Jang, J.-S. R. (2018). Singing Style Transfer Using Cycle-Consistent Boundary Equilibrium Generative Adversarial Networks. arXiv:1807.02254 [cs, eess].
88. Yang, L.-C., Chou, S.-Y., and Yang, Y.-H. (2017). MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847.
89. Yu, Y., Srivastava, A., and Canales, S. (2021). Conditional LSTM-GAN for Melody Generation from Lyrics. ACM Transactions on Multimedia Computing, Communications, and Applications, 17(1):1–20.
90. Zhang, N. (2020). Learning adversarial transformer for symbolic music generation. IEEE Transactions on Neural Networks and Learning Systems. IEEE.
91. Zhu, H., Liu, Q., Yuan, N. J., Qin, C., Li, J., Zhang, K., Zhou, G., Wei, F., Xu, Y., and Chen, E. (2018). XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2837–2846, London, United Kingdom. ACM.
92. Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232.
93. Zipf, G. K. (2016). Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books.
94. Westergaard, P. (1959). Experimental music. Composition with an electronic computer.
95. Cope, D. (1991). Recombinant music: using the computer to explore musical style. Computer, 24(7):22–28.
96. Miranda, E. R. and Al Biles, J. (2007). Evolutionary computer music. Springer.
