Proceedings of the Audio Mostly Conference
- A Conference on Interaction with Sound
Sound and Immersion in the First-Person Shooter: Mixed Measurement of the Player's Sonic Experience. 9
Mark Grimshaw, Craig A. Lindley & Lennart Nacke
Psychologically Motivated Techniques for Emotional Sound in Computer Games. Inger Ekman 20
Interactive Sonification of Grid-based Games. Louise Valgerður Nickerson & Thomas Hermann 27
Using audio aids to augment games to be playable for blind people. David C. Moffat & David Carr 35
Control of Sound Environment using Genetic Algorithms. Scott Beveridge & Don Knox 50
Saturday Night or Fever? Context Aware Music Playlists. Stuart Cunningham, Stephen Caulder & Vic Grout 64
A Musical Instrument based on 3D Data and Volume Sonification Techniques. Lars Stockmann, 72
Axel Berndt & Niklas Röber
Same but different - Composing for Interactivity. Anders-Petter Andersson & Birgitta Cappelen 80
The Harmony Pad - A new creative tool for analyzing, generating and teaching tonal music. 86
Gabriel Gatzsche, Markus Mehnert, David Gatzsche & K. Brandenburg
Sonic interactions with hand clap sounds. Antti Jylhä & Cumhur Erkut 93
Toward a Salience Model for Interactive Audiovisual Applications of Moderate Complexity. Ulrich Reiter 101
An Embedded Audio-Based Vehicle Classification Based on Two-level F-ratio. JiQing Han 108
dots: an Audio Entertainment Installation using Visual and Spatial-based Interaction. Andreas Floros, 112
Nikolaos Grigoriou, Nikolaos Moustakas & Nikolaos Kanellopoulos.
AllThatSounds: Associative Semantic Indexing of Audio Data. Hannes Raffaseder, Matthias Husinsky & 117
Julian Rubisch
IMPROVe: The Mobile Phone as a Medium for Heightened Sonic Perception. Richard Widerberg & 121
Zeenath Hasan
Automatic genre and artist classification by analyzing improvised solo parts from musical recordings. Jakob 127
Abesser, Christian Dittmar & Holger Grossmann
The Heart as an Ocean: Exploring meaningful interaction with biofeedback. Pieter Coussement, 132
Marc Leman, Nuno Diniz & Michiel Demey
EarMarkit: An Audio-Only Game for Mobile Platforms. David Black, Kristian Gohlke & Jörn Loviscach 135
Contributors' CVs
Mark Grimshaw
Mark Grimshaw was educated in Kenya, South Africa, England and New Zealand where he
gained his PhD. He is currently the Reader in Creative Technologies at the University of
Bolton, England, and his most recent work 'The Acoustic Ecology of the First-Person Shooter'
was published by VDM in May 2008.
Daniel Kromand
Daniel Kromand holds a bachelor's degree in Media Studies from the University of Copenhagen
and is currently attending the IT University of Copenhagen for a master's degree in Media
Technology and Games. He has previously presented research in avatar theory at the DiGRA
2007 conference. Daniel Kromand also works as a project manager in Copenhagen.
Inger Ekman
Inger Ekman researches sound design for interactive media and virtual environments. She
received an MSc in Computer Science from the University of Tampere in 2003 and is currently
working on her doctoral thesis on game sound design. She is particularly interested in non-
musical sound effects and ambience and the ways in which these influence player emotion.
Louise Nickerson
Louise Nickerson is a PhD candidate at Queen Mary, University of London (QMUL) in the
Department of Computer Science. She holds an MSc in Computer Science from QMUL
(2002) and a BA in French and Italian Literature from the University of Virginia (1998). She
is part of the Interaction, Media and Communication research group which focuses on human-
human interaction with a smattering of people working on audio. Her interest stems from
accessibility and the belief that mobile computing will soon be the norm. Her research focuses
on the development and definition of auditory overviews to make auditory interfaces more
approachable.
When not being a PhD student, Louise can be found nurturing her interest in foreign
languages or on the Thames with her crew-mates from the Sons of the Thames Rowing Club.
Thomas Hermann
Dr. Thomas Hermann studied physics at Bielefeld University. From 1998 to 2001 he was a
member of the interdisciplinary Graduate Program "Task-oriented Communication". He
started the research on sonification and auditory display in the Neuroinformatics Group and
received a Ph.D. in Computer Science in 2002 from Bielefeld University (thesis: Sonification
for Exploratory Data Analysis). After research stays at the Bell Labs (NJ, USA, 2000) and
GIST (Glasgow University, UK, 2004), he is now an assistant professor (German C1
position) in the Neuroinformatics Group where he coordinates and conducts research on
sonification, human-computer interaction and cognitive interaction technology.
In his research, Thomas Hermann is developing techniques for interactive multimodal data
representation and exploratory analysis of high-dimensional data with a particular focus on
sonification, novel interactive data mining techniques and human-computer interaction. His
further research topics include Tangible Computing, Ambient Information Systems,
Gestural Interactions and Augmented Reality.
David Moffat
David Moffat is a Lecturer in Computing at Glasgow Caledonian University.
His research interests are mainly in emotion and affective computing, often as applied to
video games, but he also has wider interests in other fields of AI and cognitive science.
David Carr
David Carr is a Master's student in Advanced Computing at Glasgow Caledonian University.
He is one of the first appointed "scholars" in the School of Engineering and Computing,
recognized for his particular interest in AI programming for games.
Mats Liljedahl
Since the mid-1980s, Mats Liljedahl has been involved with questions concerning ICT as a tool
for learning and artistic expression. Between 1988 and 1996, Mats worked as a music teacher
and took part in projects dealing with integrating ICT as an active tool for music teachers.
Between 1996 and 2000 he worked as a teacher educator at Ingesund University College of
Music in Arvika, where he also had the opportunity to work with several development
projects on ICT, music and learning. Since August 2000, Mats has worked at the Interactive
Institute and has taken part in several projects dealing with music, learning and ICT.
Nigel Papworth
Nigel Papworth trained in Graphic Design at the London College of Printing after some years
working in advertising in London, a career he started at 17 before completing his education.
After moving to Sweden in 1985, he co-founded one of Sweden's first major games companies,
Daydream, where he worked as lead game designer for nine years. Two of his designs,
'Safecracker' and 'Traitors Gate', enjoyed considerable success, especially in the USA. He has
worked and lectured on the integration of AI, dialogue, game states and behavioral simulation
systems utilizing chaos theory in
computer games. After joining the Interactive Institute, he has concentrated on the role audio
can play in driving gameplay and conveying game content and status to the player.
Nigel is married and has three children aged between 14 and 21.
Scott Beveridge
Scott Beveridge, B.Sc. (1st Class Honours), Audio Technology; PhD, Audio Technology.
Publications
D. Knox, G. Cassidy, S. Beveridge, R. MacDonald. Music Emotion Classification by Audio
Signal Analysis: Analysis of Self-selected Music During Gameplay. 10th ICMPC, Sapporo,
Japan, August 25-29, 2008.
D. Knox, S. Beveridge, R. MacDonald. Emotion Classification in Contemporary Music.
DMRN+2: Digital Music Research Network One-day Workshop 2007, Queen Mary,
University of London, December 2007.
Research Interests
Algorithmic composition and music generation using non-deterministic source data; in
particular, the sonification of socio-spatial behaviour in large sensate environments. Current
research investigates the intersection between Human Computer Interaction and Music
Information Retrieval in a framework which uses models of emotion to generate socially
reflexive audio material.
Daniel Hug
Daniel Hug has a background in music, sound design, interaction design and project
management in applied research. Since 1999 he has investigated sound and interaction design
related questions through installations, design works and theoretical publications. Since 2005,
he has taught sound studies and sound design for interactive media and games at the
Interaction Design department of the Zurich University of the Arts, Switzerland. Hug pursues
a PhD on sound design for interactive commodities at the University of the Arts and Industrial
Design of Linz, Austria, in close exchange with the European COST-initiative "Sonic
Interaction Design".
Stuart Cunningham
Stuart Cunningham was awarded the BSc degree in Computer Networks in 2001, and in 2003
was awarded the MSc Multimedia Communications degree with Distinction, both from the
University of Paisley (UK). He is a Chartered IT Professional (CITP), member of the British
Computer Society (BCS), the Institution of Engineering & Technology (IET) and the Institute
of Electrical and Electronics Engineers (IEEE). Stuart was also a member of the MPEG Music
Notation Standards (MPEG-SMR) working group.
Stuart is currently a Senior Lecturer in Computing and a PhD student at Glyndŵr University
in the UK, studying under the supervision of Professor Vic Grout. His research interests
include: measurement of audio similarity, audio compression, image sonification, and musical
content & context analysis.
Lars Stockmann
Lars Stockmann has just received his Diplom (equivalent to a Master's degree) in Computational Visualistics
from the Otto-von-Guericke University of Magdeburg. His current research interests are
interactive acoustic environments. This includes interactive sonification techniques as well as
computer-based instruments for live performances and computer games.
Previous studies include API design for audio applications on mobile devices, stereo vision
and rendering (raytracing and real-time CG). He is currently working as a developer and
programmer at a newly founded company.
Anders-Petter Andersson
Anders-Petter Andersson is a musician, composer and doctoral researcher in Musicology at
Göteborg University / Malmö University / Interactive Institute. The working title of his PhD
project is "Interactive Music Composition". He tries to answer the question of how one can
compose musically satisfying sound and music for games and interactive applications. In
interactive music, the listening role is complex, as the listener participates and alters the
composition. With knowledge from musical traditions such as improvisation he develops
composition methods for audio-tactile and physical environments such as Strainings, Do-Be-
DJ and Mufi.
Anders-Petter is coordinator of a new BA programme in Interactive Sound Design at
Kristianstad University, combining music and the computer within the mobile services,
game and music industries. Join the Interactive Sound Design community at the website:
www.interactivesound.org
Birgitta Cappelen
Birgitta Cappelen is an industrial designer (SID), interaction designer and associate professor
at the Oslo School of Architecture and Design (AHO). She is also working on her PhD at Malmö
University (K3). The working title of her PhD project is "Co-create and Re-create -
rethinking Industrial Design in the digital age." In her work she tries to answer the question of
what meaningful design can be in our time after postmodernism and with the computer as a
material requirement. Instead of designing beautiful and user-friendly objects, she suggests
designing fields of possibilities with a high degree of inscription and potential of circulation.
She calls this design quality "multivalence".
www.musicalfieldsforever.com
Gabriel Gatzsche
Gabriel Gatzsche studied Media Technology at the Technische Universität Ilmenau and
received his Diploma in 2003. After that he joined the Fraunhofer IDMT in Ilmenau, where he
works on MPEG-4 based storage and transmission of object-oriented WFS audio scenes,
the software development of spatial audio reproduction systems for large venues and the
development and standardization of data formats for Digital Cinema. Since 2005 he has
worked on a doctoral thesis dealing with the analysis and synthesis of musical audio signals.
In that context he developed the HarmonyPad, which is presented at Audio Mostly 2008.
Antti Jylhä
Antti Jylhä was born in Helsinki, Finland, in 1981. He received the M.Sc. degree in
telecommunications from Helsinki University of Technology (TKK), Espoo, Finland, in 2007.
He is currently working at the Department of Signal Processing and Acoustics at the same
institute as a researcher and pursuing a doctoral degree in acoustics and audio signal
processing. His current research interests include auditory and multi-modal interfaces in
human-computer interaction, and modeling and analysis of multiple interacting sound
sources. He is also instructing student projects related to automatic sports monitoring, and is
involved in the activities of the Helsinki Mobile Phone Orchestra.
Cumhur Erkut
Cumhur Erkut was born in Istanbul, Turkey, in 1969. He received the B.Sc.
and M.Sc. degrees in electronics and communication engineering from the Yildiz
Technical University, Istanbul, Turkey, in 1994 and 1997, respectively, and the Dr.Sc.(Tech.)
degree in electrical engineering from the Helsinki University of Technology (TKK), Espoo,
Finland, in 2002. Between 1998 and 2002, he worked as a researcher, and between 2002 and
2007 as a postdoctoral researcher at the Laboratory of Acoustics and Audio Signal Processing
of the TKK, and contributed to various national and international research projects. Since
2007 he has been an Academy Research Fellow, conducting his research project
Schema-SID [Academy of Finland, 120583] and contributing to the COST IC0601 Action
"Sonic Interaction Design" (SID). His primary research interests are sonic interaction design,
and physics-based sound synthesis and control.
Ulrich Reiter
Ulrich Reiter studied Electrical Engineering at RWTH Aachen, from where he received a
Diplom-Ingenieur degree (equivalent to an M.Sc.) in 1999. He worked at the Institute of Media
Technology (director Prof. Dr. Karlheinz Brandenburg) at Technische Universitaet Ilmenau as
a researcher from 1999 to 2007, where he also lectured on Virtual and Applied
Acoustics and on Recording Studio Technology. He did his Ph.D. on perceived quality in
interactive audiovisual application systems of moderate complexity. Since 2008 he has been
working with Prof. Peter Svensson at the Centre for Quantifiable Quality of Service in
Communication Systems (Q2S), a Norwegian Centre of Excellence, at the Norwegian
University of Science and Technology in Trondheim. His current research interests include multimodal perception,
virtual acoustics and interactive audiovisual application systems using the MPEG-4 standard.
JiQing Han
JiQing Han received the Master's and Doctoral degrees from Harbin Institute of
Technology, China, in 1990 and 1998, respectively. He is a Professor in the School of Computer
Science and Technology, Harbin Institute of Technology, China. His research interests
include speech recognition and synthesis, audio signal processing, and pattern recognition.
Andreas Floros
Andreas Floros was born in Drama, Greece in 1973. In 1996 he received his engineering
degree from the Department of Electrical and Computer Engineering, University of Patras,
and in 2001 his Ph.D. degree from the same department. His research was mainly focused on
digital audio signal processing and conversion techniques for all-digital power amplification
methods. He was also involved in research in the area of acoustics. In 2001, he joined
ATMEL Multimedia and Communications, working on projects related to digital audio
delivery over PANs and WLANs, Quality-of-Service, mesh networking, wireless VoIP
technologies and, lately, audio encoding and compression implementations in embedded
processors. Since 2005, he has been a visiting assistant professor at the Department of Audio
Visual Arts, Ionian University. Dr. Floros is a member of the Audio Engineering Society, the
Hellenic Institute of Acoustics and the Technical Chamber of Greece.
Nikos Moustakas
Nikos Moustakas was born in Athens, Greece. He graduated from high school in 2004 and is
currently an undergraduate student at the Ionian University in the Department of Audio-Visual
Arts in Corfu. He has participated in three photo exhibitions organized by his department, as
well as in Audio-Video festivals over the last two years. Nikos has attended the "Miden
Festival" in Kalamata, where he presented his own work (some videos). In addition, he has
some experience in the conservation and restoration of old photos, and he is an operator in the
computer laboratory of his department. His main focus is on audio-video interactive installations.
Nikolas Grigoriou
Nikolas Grigoriou was born in Heraklion, Crete, Greece. He graduated
from high school in 2004 and till now is an undergraduate student at
the Ionian university in the department of Audio-Visual Arts in Corfu.
He participated in 3 photo exhibitions that his department organized as
well as Audio-Video festivals in the last two years. Nikolas has attended
the “Miden Festival” in Kalamata where he presented his own work
(some videos). In addition, he has some working experience in terms
of practicum in a TV Studio in his hometown and this year he won the
first prize at a student oriented conference “Student Eureka 2008”. His
main focus is on audio-video interactive installations, surround effects,
and sound in space.
Matthias Husinsky
Matthias Husinsky was born in 1982. He studied Media Engineering and Media Design at the
University of Applied Sciences Hagenberg, Austria from 2000 to 2004 and received his
Master's degree with a work on a guitar tuner for mobile phones. Since 2005 he has
worked as a project assistant in several research projects at the University of Applied Sciences
St. Pölten, Austria, mainly focusing on modern audio technologies in a networked society. He
is now also a PhD candidate at the Johannes Kepler University in Linz, Austria, working in
the field of MIR (music information retrieval).
Julian Rubisch
Julian Rubisch was born in 1981. He studied Telecommunications and Media at the
University of Applied Sciences St. Pölten, Austria from 2004 to 2007 and received his
Bachelor of Science degree for his work on psychoacoustic models within audio information
retrieval. He is currently employed as a research assistant at the University of Applied Sciences
St. Pölten, where he is also working on his Master's thesis on generative music for media
applications.
Richard Widerberg
Richard Widerberg is a sound artist and new media designer living in Göteborg, Sweden. He
has a background working as a new media designer, organizing events, making radio, doing
sonic works, and playing music.
In recent years his main focus has been investigating the many dimensions of sound
and listening. Also related to his sonic works are pieces that deal with location, mobility,
interaction and social exchange. He has a strong interest in new forms of copyright and
distribution of both music and sound as well as open-source development. He is also an active
musician.
http://www.riwid.net
rwiderberg@gmail.com
Zeenath Hasan
Zeenath Hasan involves herself in the people-centred practice of design research to exercise
the potential of media technologies for socially appropriate intervention. The studio from
where she conducts her doctoral research on the role of media in a democracy is located in the
School of Arts and Communication, Malmö University. She was born in Kolkata and
currently resides in Malmö. Her career as a new media practitioner began with her training
in the MS Communications programme at Manipal. She then moved to Finland for her
training in MA New Media Studies at the University of Art and Design Helsinki. She has
worn the labels of Information Architect, Interface Designer, Design Researcher, Cultural
Producer, and Media Artist and Researcher. She also runs a one-person design research firm.
http://www.zeeniac.net
zeenath.hasan@gmail.com
Kazuhiro Jo
Kazuhiro Jo <http://jo.swo.jp/> is a Research Fellow at RCAST, University of Tokyo and
Visiting Research Fellow at Digital Media group, Culture Lab, Newcastle University. He is
also a member of The SINE WAVE ORCHESTRA <http://swo.jp/>, a member of the
Monalisa project <http://monalisa-au.org/plog/>, a member of AEO, and a co-organizer of
dorkbot Tokyo <http://dorkbot.org/dorkbottokyo/>.
Jakob Abesser
Dipl.-Ing. Jakob Abesser studied computer engineering with specialization in
telecommunication and measurement engineering at the Ilmenau University of Technology
from 2002 to 2008. In his diploma thesis he investigated the characterization of instrumental
solo parts in music pieces by means of high-level musical features, as well as their application
for genre and artist classification. After his graduation he joined the Fraunhofer IDMT
and is now working in the Metadata department as a Ph.D. student. His research interests are
the automatic transcription of stringed instruments like bass and guitar as well as performance
analysis and artist classification.
Christian Dittmar
Dipl.-Ing. (FH) Christian Dittmar studied electrical engineering with specialization in digital
media technology at the University of Applied Sciences Jena from 1998 to 2002. In his
diploma thesis, he investigated Independent Subspace Analysis as a means of audio signal
analysis. Subsequent to his successful graduation he joined IDMT in 2003 to work at the
Metadata department. He has contributed to a number of scientific papers in the field of music
information retrieval and automatic transcription. In 2005 he participated in the MIREX
evaluation category "automatic drum detection". Since late 2006 he has been Semantic
Metadata Systems group manager at Fraunhofer IDMT.
Holger Grossmann
Dipl.-Ing. Holger Grossmann studied electrical engineering at the Ilmenau University of
Technology from 1986 until 1993. Afterwards he worked as a software developer in the fields
of electronic musical instruments as well as client/server e-business systems. In 2001 he
joined the Fraunhofer-Arbeitsgruppe für Elektronische Medientechnologie AEMT, Ilmenau
where he worked as engineer and researcher in the Metadata department. During the
following 3 years his research focus was the development and realisation of the music
identification system AudioID which was standardized in MPEG-7 as the
AudioSignatureType Description Schema. The Fraunhofer-Arbeitsgruppe became an
independent institute in 2004. Since then Holger Grossmann has been head of the Metadata
department at IDMT, focused on research in the fields of automated semantic media
annotation and multimedia search. He is co-author of scientific papers and has been an invited
speaker at international conferences. In 2007 he chaired the Audio Mostly conference in
Ilmenau.
Pieter Coussement
Pieter Coussement is a new media artist, performer and composer based in Ghent,
Belgium. His main focus is on the function and position of the human body within interactive
art, with sound as the main mediator. After several years of teaching new media, he started his
PhD at the Institute for Psychoacoustics and Electronic Music (IPEM) at Ghent University,
where he furthers his artistic research on interactive art.
Education:
PhD 2008: Institute for Psychoacoustics and Electronic Music (IPEM), Ghent University, Belgium; drs. Audiovisual Arts, (re)experiencing interactive art
MA 2003: Royal Academie of Fine Arts, Ghent, Belgium; Master of Fine Arts, Mixed Media
BA 1999: Royal Academie of Fine Arts, Ghent, Belgium; Three Dimensional Art, Multi Media
Professional Experience:
2007: PHL Department of Visual Arts, Hasselt, Belgium; workshop teacher, Interaction with Audiovisual Media
2007: Howest PIH, Kortrijk, Belgium; workshop teacher, Creative Coding: Realtime User-responsive 3D Visuals in OpenGL
2004-2008: City Academy of Arts, Ostend, Belgium; teacher, Digital Visual Design
2003-2008: City Academy of Arts, Ostend, Belgium; teacher, Experiment Digital Design
David Black
David Black is a master's student in the Digital Media program at Hochschule Bremen. As
a graduate of the University of Southern California School of Music and the Royal
Conservatory of Den Haag's Institute of Sonology, he is involved with interactive music
pieces for electronics, dance, percussion, electroacoustic composition, guided walking audio
tours, and auditory display.
Kristian Gohlke
Kristian Gohlke is currently a master's student at Hochschule Bremen. His interests centre
around physical computing, electronics and programming, as well as human-computer
interaction. He also works as a student tutor at Bremen University and the University of the
Arts Bremen, where he teaches physical computing to children and artists and consults on
related art and design projects. He lives at: http://krx.at/
Jörn Loviscach
Jörn Loviscach is a professor at Hochschule Bremen. His major interests lie in computer
graphics, human-computer interaction, and audio and music computing. He is a regular
contributor to conferences such as SIGGRAPH, Eurographics and the AES Convention. In
addition, he has published numerous chapters in book series such as Game Programming
Gems and ShaderX Programming. He lives at: http://www.l7h.cn/
Sound and Immersion in the First-Person Shooter: Mixed Measurement
of the Player's Sonic Experience
Abstract. Player immersion is the holy grail of computer game designers particularly in environments such as those found in first-
person shooters. However, little is understood about the processes of immersion and much is assumed. This is certainly the case
with sound and its immersive potential. Some theoretical work explores this sonic relationship but little experimental data exist to
either confirm or invalidate existing theories and assumptions.
This paper summarizes and reports on the results of a preliminary psychophysiological experiment to measure human arousal and
valence in the context of sound and immersion in first-person shooter computer games. It is conducted in the context of a larger set of
psychophysiological investigations assessing the nature of the player experience and is the first in a series of systematic experiments
investigating the player's relationship to sound in the genre. In addition to answering questionnaires, participants were required to
play a bespoke Half-Life 2 level whilst being measured with electroencephalography, electrocardiography, electromyography,
galvanic skin response and eye tracking equipment. We hypothesize that subjective responses correlated with objective
measurements provide a more accurate assessment of the player's physical arousal and emotional valence and that changes in these
factors may be mapped to subjective states of immersion in first-person shooter computer games.
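The abstract's central methodological claim is that subjective responses (questionnaires) correlated with objective measurements (e.g. GSR) give a more accurate picture of arousal and valence than either alone. A minimal sketch of that correlation step is below; the variable names, the per-participant aggregation, and all numbers are illustrative assumptions, not data or code from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-participant values: self-reported immersion (1-7 Likert
# scale) and mean GSR level during play. Illustrative numbers only.
immersion_ratings = [3, 5, 4, 6, 2, 7, 5, 4]
mean_gsr = [2.1, 3.4, 2.9, 4.0, 1.8, 4.6, 3.1, 2.7]

r = pearson_r(immersion_ratings, mean_gsr)
print(f"r = {r:.2f}")  # a strong positive r would support the hypothesized link
```

In practice a study of this kind would also test the correlation for statistical significance and aggregate the physiological signal per experimental condition rather than per whole session; the sketch shows only the core subjective-objective pairing.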
concepts of immersion outlined above. Attempts to provide evidence include Jørgensen, who uses player surveys, [13] and Shilling, Zyda and Wardynski. [18] The latter paper is of particular interest to this work because it not only explores the use of sound in an FPS game/simulation (America’s Army [19]) but it also attempts to objectively measure the player’s emotional arousal through the use of temperature, electrodermal response and heart-rate measurements. However, although the authors state that “emotional arousal has a positive impact on [the] sense of immersion in virtual environments” and that the precise conjunction of a sound and an action seen on the screen is “crucial for immersing the player”, the paper is a description of their attempts to introduce, and amplify, emotion within the game environment through sound rather than an attempt to effect and measure immersion. The link between emotional arousal and immersion is assumed and so the relationship between sound and player immersion remains undefined in objective terms.

Emotions are a central part of the game experience, motivating the conscious cognitive judgments and decisions made during gameplay. Psychophysiological investigations suggest that at least some emotional states may be quantitatively characterized via physiological measurements. Specific types of measurement of different physiological responses (such as GSR, EMG, ECG and EEG, as described below) are not by themselves reliable indicators of well-characterized feelings; [20, 21] a de rigueur cross-correlation of all measurements is crucial to identify the emotional meaning of different patterns in the responses. Moreover, the often described many-to-one relation between psychological processing and physiological response [20] allows for psychophysiological measures to be linked to a number of psychological structures (for example, attention, emotion, information processing). Using a response profile for a set of physiological variables enables researchers to go into more detail with their analysis and allows a better correlation of response profile and psychological event. [21] The crucial issue here is the correlation of patterns of measurement characteristics for a set of different measures with subjective characterizations of experience such as emotion and feelings (for example, the feeling of immersion in gameplay).

Facial electromyography (EMG) is a direct measure of electrical activity involved in facial muscle contractions; EMG provides information on emotional expression via facial muscle activation (even though a facial expression may not be visually observable) and can be considered as a useful external measure for hedonic valence (that is, degree of pleasure/displeasure). [22] Positive emotions are indexed by high activity at the zygomaticus major (cheek muscle) and orbicularis oculi (periocular muscle) regions. In contrast to this, negative emotions are associated with high activity at the corrugator supercilii (brow muscle) regions. This makes facial EMG suitable for mapping emotions to the valence dimension in the two-dimensional space described in Lang’s dimensional theory of emotion. [22] This dimension reflects the degree of pleasantness of an affective experience. The other dimension, the arousal dimension, depicts the activation level linked to an emotionally affective experience ranging from calmness to extreme excitement. In this kind of dimensional theory of emotion, emotional categories found in everyday language (for example, happiness, joy, depression, anger) are interpreted as correlating with different ratios of

Arousal is commonly measured using galvanic skin response (GSR), also known as skin conductance. [23] The conductance of the skin is directly related to the production of sweat in the eccrine sweat glands, which is entirely controlled by the human sympathetic nervous system. Increased sweat gland activity is thus directly related to electrical skin conductance. Hence, measuring both GSR and EMG provides sufficient data to provide an interpretation of the emotional state of a game player in real time, according to a phasic emotional model.

This paper describes and analyzes the results of a preliminary experiment that investigates the role of sound in enabling player immersion in the FPS game Half-Life 2. [24] The investigation is designed to provide both subjective responses (through the use of questionnaires) and objective measurements (through the use of electromyography (EMG), galvanic skin response (GSR), electroencephalography (EEG), electrocardiography (ECG) and eye tracking equipment). The overall aim of the experiment is to find external (that is, objective) measures that may be reliably correlated with subjective experiences assessed via questionnaires in order to provide more detailed descriptions of the emotional experience of game players during gameplay, both in the degree of emotions experienced and in the timescale of emotional changes and modulations. It is further hoped that this method may lead to real-time measures for states of immersion of players playing first-person shooter computer games. Finally, correlating discriminations within psychophysiological data with the different categories of immersion can provide at least one method for validating those categorizations. The experiment is preliminary since the psychophysiological characterization of states of immersion is not yet well developed.

The study further aims to provide a psychophysiologically-based answer to the assumption that sound plays a role in enabling the immersion of the player in the FPS game world. If the results of the experiment provide a positive answer, that sound does indeed play this role, it is envisaged that future experiments, using a similar methodology, will be designed to investigate more specific questions about the relationship between the player and sound in FPS games.

The experiment was conducted in May 2008 in the Game and Media Arts Laboratory at Blekinge Institute of Technology (BTH) in Sweden. The investigation of sound formed part of a larger psychophysiological investigation into the nature of the player experience in computer games. This paper is also limited to the analysis of GSR, EMG and questionnaire data. Further analysis taking into account the other data types is ongoing.

Method

Subjects played a Half-Life 2 game mod especially designed for a short immersive playing time of maximum 10 minutes. The game mod was played four times with different sound modalities and physiological responses were measured together with questionnaires (assessing subjective responses) for each modality.

2.1 Design

The game sessions were played under four different conditions, corresponding to the permutations of the independent variable (sound modality): playing with diegetic game sounds (normal
valence and arousal, hence being mappable within a space sounds), playing with speakers completely turned off (no
defined by orthogonal axes representing degrees of valence and sounds, no music), playing with diegetic game sounds and an
arousal, respectively. For example, depression may be additional music loop (sounds and music), and playing with
represented by low valence and low arousal, while joy may be diegetic game sounds turned off and hearing only the music loop
represented by high valence and high arousal. (only music). Participants played under each condition in a
-2-
10
shifting order to eliminate repeated-measures effects (using a 2.3 Apparatus
Latin Squares design). Physiological responses (as indicators of Facial EMG. We recorded the activity from left orbicularis
valence and arousal) were recorded for each session as well as oculi, corrugator supercilii, and zygomaticus major muscle
questionnaire answers. Questionnaire item order was regions, as recommended by Fridlund and Cacioppo, [26] using
randomized for each participant using the open-source software BioSemi flat-type active electrodes (11mm width, 17mm length,
LimeSurvey. [25] 4.5mm height) electrodes with sintered Ag-AgCl (silver/silver
chloride) electrode pellets having a contact area 4 mm in
2.2 Participants diameter. The electrodes were filled with low impedance highly
Data were recorded from 36 students and employees, recruited conductive Signa electrode gel (Parker Laboratories, Inc.). The
from the three BTH University campuses and their age ranged raw EMG signal was recorded with the ActiveTwo AD-box at a
between 18 and 41 (M = 24, SD = 4.89). 19.4% of all sample rate of 2 kHz and using ActiView acquisition software.
participants were female. When asked how frequently they play
digital games, 50% answered that they play games every day, Galvanic skin response (GSR). The impedance of the skin was
22.2% play weekly, 22.2% play occasionally and only 5.6% measured using two passive Ag-AgCl (silver/silver chloride)
play rarely or never. However, it should be noted that 62.1% of Nihon Kohden electrodes (1 microamp, 512 Hz). The electrode
all the males play on a daily basis and 20.7% play weekly. In pellets were filled with TD-246 skin conductance electrode
contrast to that, most of the females enjoyed playing only on an paste (Med. Assoc. Inc.) and attached to the thenar and
occasional (57.1% of all females) or weekly (28.6%) basis. hypothenar eminences of the participant’s left hand.
Out of all participants, 47.2 % considered themselves casual Video recording. A Sony DCR-SR72E video camera
gamers, 38.9% said that they belong to the hardcore gamer (handycam) PAL was put on a tripod and positioned
demographic and 13.9% could not identify themselves with any approximately 50 cm behind and slightly over the right shoulder
of those. Nevertheless, no female participant considered herself of the player for observation of player movement and in-game
to be a hardcore gamer and 71.4% of all females said they were activity. In addition, the video recordings served as a validation
casual gamers. Male gamers were more evenly distributed tool when psychophysiological data were visually inspected for
among hardcore (48.3%) and casual (41.4%) gamers: the larger artifacts and recording errors.
percentage of males considering themselves hardcore players.
Game experience survey. Different components of game
91.7% of the participants were right-handed and 50% were experience were measured using the game experience
wearing glasses or contact lenses. 94.4% believed they had full questionnaire (GEQ). [27] As shown in a previous study by
hearing capacity (5.6% stated explicitly that they lack full Nacke and Lindley, [28] the GEQ components can assess
hearing capacity). 69.4% had a preference for playing with a experiential constructs of immersion, tension, competence, flow,
music track on. 44.4% preferred playing with surround sound negative affect, positive affect and challenge with apparently
speakers, while 33.3% opted for playing with stereo good reliability.
headphones. 11.1% liked playing with stereo speakers and the
final 11.1% preferred surround sound headphones. 33.3% Sound immersion. Subjective player experience of sound
played an instrument. 13.8% played the piano or keyboard and immersion was measured using our own additional
8.3% played the guitar. 41.7% saw themselves as hobby questionnaire items rated on a 5-point scale ranging from 1 (for
musicians – some people worked with sound recording and example, not immersive) to 5 (for example, extremely
programming but did not play instruments. immersive) for sessions where sound was audible. Specific
sound questions included the following:
66.7% of participants were enrolled as University students.
16.7% already had a Bachelor’s degree and 13.9% had a • How important are sounds in general for you in FPS
Master’s degree. 61.1% of the participants had already played games?
the digital game Half-Life 2 before, 30.6% played it between 10 • Diegetic Sounds:
and 40 hours and 58.3% played it on a PC, leaving only one How immersive were the following?
participant who played it on an Xbox 360. • Background sounds
• Sounds of opponents
To estimate preconceptions of sound immersion, participants • Sounds that you produced yourself (player-
were also asked how important they considered sounds, in produced sounds)
general, for first-person shooters (FPS). The results were rated How important was the sound for you in the level you
on a 5-point scale ranging from 1 (not important) to 5 (very just played?
important). 55.6% claimed that sound was very important and • No Sound, No Music:
36.1% said it to be important. The term “immersive”, which was How much did it bother you to play without sound?
also part of the questionnaire items assessing sound immersion,
• Nondiegetic Music Only:
was explained to participants beforehand as “the feeling of being
Did you miss the sound effects in this level? (Yes/No)
encapsulated inside the game world and not feeling in front of a
monitor anymore”. This was so phrased for reasons of lay
intelligibility and deemed to be a synthesis of previous
definitions of game immersion noted above, particularly those of
Ermi and Mäyrä, Garcia and Carr. This is suitable for
investigating whether immersion in a very general sense may be
distinguishable in psychophysiological measurement features; if
so, ongoing experiments may address the psychophysiological
detection of finer distinctions within the broad category of
immersion.
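Both the GEQ components and the sound-immersion items above are scored on 5-point scales. As an illustration only (the item names and the item-to-component key below are hypothetical examples, not the actual GEQ scoring key), per-component scores for one session might be aggregated like this:

```python
# Illustrative aggregation of 5-point Likert item responses into component
# scores. Item names and the component key are invented for this sketch;
# they are NOT the published GEQ scoring key.
from statistics import mean

responses = {                       # one participant, one sound modality
    "imm_1": 4, "imm_2": 5, "imm_3": 4,   # hypothetical immersion items
    "flow_1": 3, "flow_2": 4,             # hypothetical flow items
}

components = {
    "immersion": ["imm_1", "imm_2", "imm_3"],
    "flow": ["flow_1", "flow_2"],
}

# Mean (not sum) keeps each component on the original 1-5 scale.
scores = {name: mean(responses[item] for item in items)
          for name, items in components.items()}
print(scores)
```

Using component means rather than sums keeps every score on the original 1–5 scale, which makes the four sound modalities directly comparable in the repeated-measures analysis.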
all modalities, participants were thanked for their participation and paid a small participation fee before they were escorted out of the lab.
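The Latin Squares counterbalancing mentioned in the Design section can be sketched as follows. This is an illustrative reconstruction (a Williams-style balanced square; the study's actual assignment code and group labels are not reported), using the four sound modalities as conditions:

```python
# Illustrative Williams-style balanced Latin square for ordering the four
# sound modalities; a reconstruction, not the study's actual procedure code.
def balanced_latin_square_row(n, row):
    """Condition order for one participant group. Across the n groups, every
    condition occupies every serial position exactly once, and (for even n)
    every condition follows every other condition exactly once."""
    order, forward, backward = [], 0, 0
    for i in range(n):
        if i < 2 or i % 2 == 1:
            value = forward        # take the next condition from the front
            forward += 1
        else:
            value = n - 1 - backward  # interleave from the back
            backward += 1
        order.append((value + row) % n)
    return order

conditions = ["normal sounds", "no sounds, no music",
              "sounds and music", "only music"]

for group in range(len(conditions)):
    order = balanced_latin_square_row(len(conditions), group)
    print(group, [conditions[k] for k in order])
```

With 36 participants and four orders, each order could simply be assigned to nine participants, so that every condition appears equally often in every serial position across the sample.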
diegetic sound (whether combined with music or not) also seems to be an enabling factor of the subjective experience of challenge and flow; flow especially seems to be experienced more easily with diegetic sounds.

The complete absence of sound seems to negatively influence the subjective feeling of immersion to a significant degree, as it is the lowest rated item in this modality. With missing auditory feedback, there is also a decrease in the feeling of competence among all participants. The combined presence of sound and music seems also to have a soothing effect on play, as ratings for tension and negative affect are very low under this modality. It is also the modality that has the highest score for the immersion item. However, it should be noted that music also seems to be somewhat distracting from game flow, since flow ratings are higher when music is omitted and only diegetic sounds are presented.

For the GEQ components Immersion (χ²(5) = 3.49, p > .50), Competence (χ²(5) = 10.28, p > .05), Negative Affect (χ²(5) = 5.36, p > .30), and Flow (χ²(5) = 10.12, p > .05), Mauchly's test indicated that the assumption of sphericity had been met, but for the remaining items Tension (χ²(5) = 11.98, p < .05), Positive Affect (χ²(5) = 11.56, p < .05) and Challenge (χ²(5) = 23, p < .05) it was violated. Therefore, degrees of freedom were corrected for the latter three using Greenhouse-Geisser estimates of sphericity (ε = .80, ε = .84, and ε = .71).

Statistical significance was achieved for all components (Immersion: F(3, 105) = 8.20, p < .001; Competence: F(3, 105) = 4.49, p < .01; Negative Affect: F(3, 105) = 9.75, p < .001; Flow: F(3, 105) = 9.42, p < .001; Tension: F(2.39, 83.73) = 7.85, p < .001; Positive Affect: F(2.52, 88.21) = 6.18, p < .01; and Challenge: F(2.14, 74.78) = 5.17, p < .01). These results show that the subjective game experience measured with the GEQ was significantly affected by the different sound modalities.

Table 2 shows a comparison of the normalized physiological responses. Negatively valenced arousal would be indexed by increased GSR and corrugator supercilii activity (with decreased zygomaticus major and orbicularis oculi activity). [29] This is not the case for any of the accumulated measurements shown. The only notable decrease of orbicularis oculi and zygomaticus major activity is shown under the no sound condition. However, corrugator supercilii activity is also decreased there, and galvanic skin response is somewhat consistent across conditions.

Physiological response              Sound &      Diegetic     Nondiegetic  No Sound
                                    Music        Sound Only   Music Only   or Music
Orbicularis oculi (ln[µV])          1.85 (0.37)  1.85 (0.37)  1.86 (0.42)  1.79 (0.31)
Corrugator supercilii (ln[µV])      1.94 (0.25)  1.90 (0.27)  1.95 (0.33)  1.89 (0.26)
Zygomaticus major (ln[µV])          1.98 (0.40)  2.00 (0.38)  2.00 (0.43)  1.94 (0.35)
Galvanic skin response (log[µS])    0.72 (0.18)  0.73 (0.17)  0.70 (0.18)  0.72 (0.17)

Table 2: Means (and standard deviations) for the corrected physiological measurements (EMG and GSR) under the different modalities (N = 29 after data reduction).

Accordingly, Mauchly's test indicated that the assumption of sphericity had been violated for orbicularis oculi EMG means (χ²(5) = 25.16, p < .001), corrugator supercilii EMG means (χ²(5) = 57.65, p < .001), zygomaticus major EMG means (χ²(5) = 16.43, p = .006) and GSR means (χ²(5) = 52.41, p < .001). Hence, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity for the EMG means (ε = .63, ε = .45, and ε = .76) and the GSR means (ε = .47).

Nevertheless, neither the EMG responses (orbicularis oculi: F(1.90, 53.21) = 0.86, p > .40; corrugator supercilii: F(1.36, 38.02) = 0.66, p > .40; zygomaticus major: F(2.27, 63.58) = 0.61, p > .40) nor GSR (F(1.40, 39.05) = 0.68, p > .40) achieved statistical significance in the repeated-measures design. The results of the ANOVA show that tonic measurements of physiological response accumulated over a game session were not significantly affected by the different sound modalities.

Discussion and Future work

This paper has described and analyzed the results of a preliminary experiment to measure the effect of FPS sound and music on player valence and arousal and to detect any possible correlations between measurable valence and arousal features and self-reported subjective experience.

There are two important and related results. Firstly, the data gathered from the subjective questionnaires (see Table 1) show a significant statistical difference between the four modalities over the GEQ components. This is particularly the case with Flow and Immersion, the results of which show higher scores when diegetic sound is present than when it is not. Prima facie, this would indicate that diegetic sound does indeed have an immersive effect in the case of FPS games. Music also appears to increase immersion, while reducing tension and negative affect, at the expense of a reduction in the experience of flow within gameplay.

Secondly, the psychophysiological data do not support the subjective results, but are instead both inconclusive and lacking statistical significance (see Table 2). If we maintain the assumption that physiological evidence, in these circumstances, can be used to confirm the subjective evidence, then there are several potential explanations for the lack of correlation between the two result sets. Further analysis and experimentation will be required to explain this disparity. Some initial possible explanations (assuming a valid experiment design and implementation) include:

1. The GEQ incorporates distortions derived from the retrospective "storytelling" context of the questionnaire.

2. The physiological data, gathered over 10 minutes of play, contain too much noise to produce a significant result. It must be noted that the data analyzed here were accumulated over one game session, and even after inspection of histograms and logarithmic correction not all measurements were perfectly normally distributed. Even though a non-parametric statistical analysis or a range correction of physiological responses could be conducted, it is unlikely that this would show significant effects over the 10 minute timescale used. Connecting physiological response data to game events using more precise phasic measurements, as described in Nacke et al., [30] could yield more insight into the emotional effects of sound. This level of detail can be achieved, but it would need an additional method for recording subjective responses at the same event-level precision to be correlated with.

3. The subjectively reported experience is a function of the modulation of emotions within a smaller time scale than that used in the analysis of the psychophysiological data. This means that the emotional net effect may be the same, but the details of emotional dynamics produce different subjective experiences as reported by the GEQ. As an analogy: a flat sea and a sea with big waves may have the same mean level, but one makes for much better surfing than the other. This might be detectable by measures derived from the current data set.

4. The subjectively reported results are not measurable using our equipment and methods. In particular, the source of the GEQ components reported in Table 1 may have a different psychological explanation than that captured by the arousal/valence model of emotion. This consideration raises the need for more thorough ongoing conceptual investigations of terms such as immersion, presence, flow, challenge and fun (as started in [31]). Based upon a richer range of linguistic and conceptual distinctions, it may be possible to devise experiments having more discriminating power among the range of descriptive models thus created. In particular, these are complex concepts used in different ways by different authors, and it may not be the case that they have simple mappings to instantaneous emotions measured by psychophysiological techniques. Explanatory theories then need to move to higher levels in modeling the structuring of a series of measurable emotions, related to perceptions and player actions, to provide a more systemic account of the foundations of the quality of play experience, as suggested by Lindley and Sennersten. [32]

These questions must be addressed by ongoing research. To our surprise, our research contradicts the results presented by Shilling et al., [18] who indicated a strong correlation between sounds and physiologically elicited emotions. Unfortunately, Shilling et al. did not report direct values of their measures that would allow a direct comparison. It remains for more thorough future analysis to find greater scientific evidence for a relationship between sound and psychophysiological measures. Our future aim is to investigate this within our research.

Acknowledgements

The research reported in this paper has been partially funded by the FUGA (FUn of GAming) EU FP6 research project (NEST-PATH-028765). We thank our FUGA colleagues, especially Niklas Ravaja, Matias Kivikangas, and Simo Järvelä, for many stimulating inputs to this work. We would also like to thank laboratory assistant Dennis Sasse for helping in the execution of the experiment.

References

[1] id Software, Doom series, Activision, (1993-2004).
[2] id Software, Quake series, Activision, (1996-2005).
[3] Valve Software, Half-Life series, Electronic Arts, (1998-2004).
[4] Grimshaw, M., The Acoustic ecology of the first-person shooter: The player, sound and immersion in the first-person shooter computer game, Saarbrücken, VDM Verlag Dr. Mueller e.K., (2008).
[5] McMahan, A., Immersion, engagement, and presence: A new method for analyzing 3-D video games, in M. J. P. Wolf & B. Perron (Eds.), The Video Game Theory Reader, New York, Routledge, 67–87 (2003).
[6] Kearney, P. R., & Pivec, M., Immersed and how? That is the question, Game in' Action, Sweden, (June 13—15, 2007).
[7] Pine, B. J., & Gilmore, J. H., The experience economy: Work is theatre & every business a stage, Boston, Harvard Business School Press, (1999).
[8] Ermi, L., & Mäyrä, F., Fundamental components of the gameplay experience: Analysing immersion, Changing Views – Worlds in Play, Toronto, (June 16—20, 2005).
[9] Murray, J. H., Hamlet on the holodeck: The future of narrative in cyberspace, Cambridge, The MIT Press, (2000).
[10] Garcia, J. M., From heartland values to killing prostitutes: An overview of sound in the video game Grand Theft Auto Liberty City Stories, Audio Mostly 2006, Piteå, Sweden, (October 11—12, 2006).
[11] Carr, D., Space, navigation and affect, in Computer Games: Text, Narrative and Play, Cambridge, Polity, 59-71 (2006).
[12] Laurel, B., Computers as theatre, New York, Addison-Wesley, (1993).
[13] Jørgensen, K., On the functional aspects of computer game audio, Audio Mostly 2006, Piteå, Sweden, (October 11—12, 2006).
[14] Murphy, D., & Pitt, I., Spatial sound enhancing virtual story telling, Lecture Notes in Computer Science, 2197, 20–29 (2001).
[15] Grimshaw, M., & Schott, G., A Conceptual framework for the design and analysis of first-person shooter audio and its potential use for game engines, International Journal of Computer Games Technology, 2008, (2008).
[16] Stockburger, A., The rendered arena: Modalities of space in video and computer games, unpublished PhD thesis, London, University of the Arts, (2006).
[17] Grimshaw, M., The Resonating spaces of first-person shooter games, Proceedings of The 5th International Conference on Game Design and Technology, Liverpool, (November 14—15, 2007).
[18] Shilling, R., Zyda, M., & Wardynski, E. C., Introducing emotion into military simulation and videogame design: America's Army: Operations and VIRTE, GameOn, London, (2002).
[19] MOVES Institute, America's army, Monterey, Naval Postgraduate School, (2002).
[20] Cacioppo, J. T., Tassinary, L. G., & Berntson, G. G., Psychophysiological science, Handbook of Psychophysiology, 3-26 (2007).
[21] Cacioppo, J. T., Handbook of Psychophysiology, Cambridge University Press, (2007).
[22] Lang, P. J., The emotion probe: Studies of motivation and attention, American Psychologist, 50, 372–385 (1995).
[23] Lang, P. J., Greenwald, M. K., Bradley, M. M., & Hamm, A. O., Looking at pictures: Affective, facial, visceral, and behavioral reactions, Psychophysiology, 30, 261–273 (1993).
[24] Valve Software, Half-Life 2, Electronic Arts, (2004).
[25] Schmitz, C., LimeSurvey (v1.70) from http://www.limesurvey.org (2008).
[26] Fridlund, A. J., & Cacioppo, J. T., Guidelines for human electromyographic research, Psychophysiology, 23(5), 567-589 (1986).
[27] IJsselsteijn, W. A., Poels, K., & de Kort, Y. A. W., Measuring player experiences in digital games: Development of
the Game Experience Questionnaire (GEQ), Manuscript in preparation.
[28] Nacke, L., & Lindley, C., Boredom, Immersion, Flow: A pilot study investigating player experience, IADIS Gaming 2008: Design for Engaging Experience and Social Interaction, IADIS, Amsterdam, The Netherlands, (2008).
[29] Ravaja, N., Turpeinen, M., Saari, T., Puttonen, S., & Keltikangas-Jarvinen, L., The Psychophysiology of James Bond: Phasic emotional responses to violent video game events, Emotion, 8(1), 114-120 (2008).
[30] Nacke, L., Lindley, C., & Stellmach, S., Log who's playing: Psychophysiological game analysis made easy through event logging, International Conference on Fun and Games 2008, Springer, Eindhoven, The Netherlands, (2008).
[31] Lindley, C. A., Nacke, L., & Sennersten, C., What does it mean to understand gameplay? First Symposium on Ludic Engagement Designs for All, Aalborg University, Esbjerg, Denmark, (2007).
[32] Lindley, C. A., & Sennersten, C., A Cognitive framework for the analysis of game play: Tasks, schemas and attention theory, Workshop on the Cognitive Science of Games and Game Play, The 28th Annual Conference of the Cognitive Science Society, Vancouver, Canada, (July 26—29, 2006).
Sound and the diegesis in survival-horror games
Daniel Kromand
Brydes Allé 23, 431
2300 Copenhagen S
kromand@itu.dk
Abstract. The paper analyzes the affordances of the soundscape in survival-horror games by examining the barrier between diegetic and non-diegetic sounds and how the validity of a sound cue is challenged. The three game examples all veil their aural warning cues with a broken causality, causing the player to distrust the cues' tie to the game world. This results in uncertainty as to whether a sound comes from an in-game object or from non-diegetic ambience. The paper argues that the horror genre draws upon this discrepancy between soundscape and perceived reality to create an ambience of fear.
3. Trans-diegetic sound
The trans-diegetic effect of audio examined by Kristine Jørgensen (2007) is a transgression of the traditional barrier between diegesis and non-diegesis as explained by Bordwell/Thompson (2004). The traditional distinction between diegesis and non-diegesis divides the soundscape into sounds that are heard within the fictional world and those that are not. The film viewer's ability to understand this divide comes from repeated exposure to the language of films, where breakdowns of the barrier are rare and usually act as comic relief (such as the big band in Mel Brooks' Blazing Saddles that is thought to be non-diegetic, but then happens to be located out in the desert).

Jørgensen argues, however, that due to the interactivity of video games, sound can pass the barrier and in effect become trans-diegetic. Units in Warcraft III, for example, speak directly to the player as a way of conveying information, i.e. they speak from within the diegesis to the outside, while music in the game can function as a leitmotif for certain events, allowing the player to anticipate future events (Jørgensen 2007; 110). Trans-diegetic sound does not dissolve the barrier though: It merely causes a short transgression that still keeps the division between diegesis and non-diegesis intact. The trans-diegetic effect therefore typically transfers information from the game to the player, according to Jørgensen in two different versions: Either as a reactive sound affirming player input or as a proactive sound informing the player of an altered game state (Jørgensen 2007; 116).

Jørgensen's argument relies upon the fact that players can interpret the soundscape of a given game as triggered by specific events, i.e. as not being completely random. The ability to interpret can partly be learned through genre conventions and by keeping a consistency of sound within the individual game itself (Grimshaw 2007; 102-103).

4. Welcome to Rapture
When first checking into the grim world of BioShock (2K Games 2007) and immersing into the city of Rapture, the scene is one of confusion and paranoia: Monsters –the so-called splicers– are roaming the hallways, ammunition is limited, and the ambience is filled with running water and creaking metal. The soundscape of BioShock is densely populated with a general ambience, and sounds tied to the inhabitants of Rapture and the player.

The in-game objects of BioShock, such as the player, enemies, robots and vending machines, offer a quite varied selection of sound objects. Movement creates sound, which can help the experienced player identify the source (e.g. the heavy thud of a Big Daddy is easily recognized, while stealthier opponents, such as the spider splicer, create a minimum of noise). Speech is another way of identifying opponents, since several of them will replay sentences: Some are preset, for example a specific splicer early in the game –a deranged mother bending over a baby carriage– plays the same monologue when encountered (although it is interruptible according to the player's actions), while other enemies replay their sentences at random, which both adds to the atmosphere of Rapture and informs the player of possible danger.

The ambience in BioShock lingers between the diegetic and the non-diegetic: a typical engine hum is for example audible, but underneath there are sporadic, high-pitched sounds. The latter do not appear to have any connection to the diegetic world of Rapture, and likewise while in Neptune's Bounty (an area of Rapture) the player will hear the omni-present creaks of metal and running water, but might also notice a monotonous note played in crescendo. These two examples are arguably not background music, as they do not have a constant presence and only constitute a limited tune. It might however be argued that they are examples of Jørgensen's point that ambience should not be taken literally: Instead, ambience frames the general atmosphere in the player's current area (Jørgensen 2007; 110). A crescendo is traditionally used in horror movies to instill anticipatory fear, an effect that BioShock mimics in a slightly different form. In BioShock the crescendos are activated after a time lapse and not on the basis of narrative structures. I argue that this leads to an increase in tension, as most players will be familiar with the culturally implied meaning of such a crescendo (a leitmotif for shocks), and thus they will be prepared for encounters, which may or may not take place. The soundscape furthermore often has a deep throb present in the ambience, only clearly audible at high volume settings. This throb is not held at the same note, and thus dispels silence and helps fill the soundscape with unfamiliar noises. Not all ambience in BioShock is unmelodic though: At select fights throughout BioShock regular background music is played, for example at the first encounter with a Houdini splicer in Arcadia or the fight against Dr. Steinman in the Medical Pavilion, which fulfills the dramatic effect of a standard boss fight.

The part of the ambience that is closer to the diegesis (e.g. the creaks, announcements over the speakers, etc.) serves to create a general atmosphere and fills the sensory system. Grimshaw refers to 'sensory fillers' as music that is irrelevant for gameplay (2007; 4), but in BioShock the sensory filler feeds the player too much aural input to cope with, which in turn causes confusion regarding the diegetic ties to the game world. The creaking noise of Rapture fills the soundscape and potentially masks or muddles the sounds produced by real, in-game enemies. The player's understanding of affordances can help him perform better in the game of BioShock, as certain sounds pass information regarding nearby opponents, but at the same time these exact affordances are mimicked in the ambience. This creates a dual relationship to the soundscape of BioShock: On one hand the spatial location of opponents is revealed through hearing and helps to prepare the player for an encounter, but on the other hand the ambience causes a distrust in the player towards the division of diegetic and non-diegetic sounds.

5. Paced horror in F.E.A.R.
F.E.A.R. (Monolith Productions 2005, henceforth FEAR) is set in modern day, albeit with slightly futuristic technology, and revolves around a special unit's attempt to handle a paranormal crisis. The game draws heavily on Japanese horror with frequent submersions in water, movement through dark crawlspaces, and a The Ring-inspired ghost called Alma. The game resembles a generic shooter at times and the player will often find himself in firefights with enemy soldiers, but other elements of the game create an atmosphere of horror and uneasiness, for example crawlspaces, which –paired with a limited flashlight battery and a giggling ghost– build a very tense experience.

The ambience in FEAR does not adapt to fights as seen in other games (Whalen 2004), and is usually a deep bass or a slowly oscillating pitch. Similar to BioShock, the ambience lacks tonality and a clear melody, and is therefore likely to be perceived as part of the diegetic world. An analytical player might identify the ambience as non-diegetic, but the atonality breaks the usual expectations of background music. Arguably, atonal ambience can be experienced as closer to the diegesis as it is less cohesive, thereby provoking uncertainty about the
sound's non-diegetic nature. The ambience in FEAR thus assumes the role of sensory filler with uncertainty as an effect.

FEAR paces the player in a manner which is not unlike BioShock: The action sequences are designed as a wavelike curve with enemy resistance focused in smaller areas, in contrast to being spread out over the level. The player rarely encounters a lone soldier and does not experience a steady stream of enemies. Therefore the player will also traverse a relatively large amount of space with no enemy encounters. Ironically, it is in these unchallenged spaces that the player can be expected to fully experience the game's horror atmosphere. In areas where the player is left alone, both visual and aural cues (such as the flickering of static, which is often, but not always, a premonitory warning) allude that paranormal threats are nearby. In the second encounter with the ghost Alma –inside the aforementioned crawlspace– the avatar's pulse and breathing become audible while Alma can be heard shuffling around and giggling. Earlier, Alma was capable of killing the player, and thus her presence inside the crawlspace naturally causes distress. Before encountering Alma in the crawlspace, the player is aware that something is about to happen, primarily due to the set-up –a dark, claustrophobic space– and the aural cues of heavy breathing and pulse. Nothing paces the player to go through the crawlspace, but the space has to be traversed if the player wants to progress in the game. At his own pace, the player has no option but to move through an area where aural cues have indicated that danger might be lurking. This design resembles Hitchcock's

puzzles and monsters to reach the end. The games of the series have been examined academically in their own right, and especially the radio carried has received a lot of academic attention (Carr 2003, Whalen 2004, Perron 2004).

The radio carried by the avatar in Silent Hill 2 produces static whenever enemies are approaching, but similar to the static in FEAR it fails under certain conditions. At select locations the radio will produce static that sounds like a mixture of chirping birds and a squeaky wheel, but is not produced by any entities. This sound is not ambience, as it can still be heard when the background music is turned off in the options menu. Furthermore, the radio does not register the "crawlers" (the zombies in their cockroach-like state) and the player is likely to be surprised when one of them runs off with a high-pitched squeal from underneath a car (as happens on Martin St.). Therefore the radio, which is the primary tool for locating enemies –due to the omnipresent fog– has both too much and too little sensitivity to be a completely reliable warning system.

The ambience of Silent Hill 2 is in some sense similar to both BioShock and FEAR, as it is highly metallic and often a deep bass with an oscillating pitch, but with the significant difference that the ambience is situated, which means that topological locations have their own sound (e.g. one apartment has a different ambience than others). The dynamic and unsettling soundscape makes it harder for the player to get accustomed to the ambience. This distorts an efficient reading of the radio's
well-known example of a bomb hidden underneath a table where affordances, as some rooms upon entry trigger static from
two people are having a conversation, which will cause nearby enemies, while other rooms trigger a new ambience (for
prolonged suspense for the audience if they are aware of its example room 208 in Wood Side Apts.). The atonal quality of
existence (cited in Perron 2004; 2). In FEAR the player knows the ambience makes it harder to differentiate between static
that a threat is present, but the cues are slow to reveal exactly from the radio and from the ambience. Due to the fact that the
when and where it will happen. This threat warning system radio often displays different sounding warnings, new
combined with a minimum of pacing is created to make the background ambience can create uncertainty whether monsters
player move very slowly and to examine every minor aural or are nearby or not.
visual cue. The horror in FEAR thus relies on an uncertain
cueing of threats and slow pacing that allows the player to fully As in both BioShock and FEAR the soundscape in Silent Hill 2
immersive himself in the sinister atmosphere. purposefully works against the player’s effort to read the
affordances of aural warning cues. The game does supply them,
FEAR utilizes a variety of sounds that afford a threat warning, but with an irregular set of consequences and therefore a broken
such as the radio communication between enemy soldiers, which causality. The soundscape of Silent Hill 2 operates within a
can be heard far away and is the primary source for identifying frame of uncertainty that constantly holds the player between
their presence before visual confirmation can be made. Other knowledge and ignorance. Along with limited ammunition and
sounds afford a similar warning, but lack the consistency of the field of vision, the soundscape efficiently builds a setting of
radio signals, such as the static caught in the player’s earpiece. horror.
The static is usually an indicator for approaching paranormal
events, but is sometimes played without any following 7. Conclusion
consequences. This pseudo-causality is designed to put the Analysis of the three games shows they mainly have unmelodic
player on edge and make him carefully considering his moves ambience without a tonic (although music was included at select
even though no threat is imminent. The misuse of the static moments). The soundscape affords less (or in a sense too much)
reduces the player’s faith in it as a reliable tool, but accentuates to the player, as sounds can be hard to tell apart. This was
that something might happen. especially the case in FEAR and Silent Hill 2 where the player’s
in-game warning system proved to be unreliable. The player’s
Aural affordances are, as in BioShock, limited because they difficulties of dividing sounds into non-diegetic and diegetic can
have flawed causality. The unreliable affordances delivered by be described as an actual collapse of the diegetic barrier. This
the audio design keeps the player in a limbo between trust and effect goes beyond Jørgensen’s theory of trans-diegesis, since
apathy. Furthermore the slow pacing in the game’s horror the players do not prepare for future action based on non-
sequences adds another dimension, as the player will have to diegetic information. Instead they act because they do not know
approach a possible threat at his own pace. FEAR gives release if the sound is diegetic or not. The atonal ambience reduces the
to the built up tension through intense firefights with enemy perceived field of non-diegetic sound, with the exception of the
soldiers where the game resembles a generic shooter. boss music, and all sounds can be suspected to belong to the
diegesis. The diegetic collapse is the effect when player are no
6. Finding way through the fog in Silent Hill 2 longer fully capable to discern between diegetic and non-
Silent Hill 2 (Konami 2001) takes place in a town of the same diegetic elements. It is safe to say that the soundscape of the
name, which is immersed in constant fog. The player controls three survival-horror games produce an ‘un-knowledge’ and
the avatar in third person and has to traverse a number of while it does offer affordances, it also creates an uncertainty
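The unreliable warning mechanics analyzed above (static that sometimes fires without a following threat, a radio that fails to register certain enemy types) share a single structure: a cue with deliberate false positives and false negatives. The following is a minimal sketch of that structure; every name, probability, and enemy type is an illustrative assumption, not data from either game:

```python
import random

# Toy model of an intentionally unreliable aural warning system,
# loosely patterned on the mechanics analyzed above. All numbers
# and names are illustrative assumptions, not game data.

class UnreliableRadio:
    def __init__(self, false_positive_rate=0.2, undetected_types=("crawler",)):
        self.false_positive_rate = false_positive_rate  # static with no threat behind it
        self.undetected_types = set(undetected_types)   # enemy types the radio misses

    def emits_static(self, nearby_enemies, rng=random.random):
        """Return True if the radio plays static this tick."""
        detectable = [e for e in nearby_enemies if e not in self.undetected_types]
        if detectable:
            return True  # true positive: a registered threat is nearby
        # pseudo-causal static: occasionally fires with no threat at all
        return rng() < self.false_positive_rate

radio = UnreliableRadio()
print(radio.emits_static(["soldier"]))   # True: a detectable threat is nearby
print(radio.emits_static(["crawler"]))   # may print False even though a threat is near
```

Tuning the false-positive rate against the set of undetected enemy types is, in effect, the design lever both games use to keep the player between trust and apathy.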
18
Sound and the diegesis in survival-horror games
References
[1] Aarseth, Espen: Quest Games as Post Narrative Discourse. Narrative Media, pp. 361-376. University of Nebraska Press (2004)
[4] Carr, Diane: Play Dead – Genre and Affect in Silent Hill and Planescape Torment. Game Studies, vol. 3, issue 1 (2003)
Psychologically Motivated Techniques for Emotional Sound in
Computer Games
Inger Ekman, inger.ekman@tml.hut.fi
Department of Media Technology,
Helsinki University of Technology
P.O.Box 5400
FIN-02015 HUT
Abstract. One main function of sound in games is to create and enhance emotional impact. The expressive model for game sound has its tradition in sound design for linear audiovisual media: animation and cinema. Current theories on emotional responses to fiction are mainly concerned with linear media, and are only partly applicable to interactive systems like games. The interactivity inherent to games introduces new requirements for sound design, and suggests a break in perception compared with linear media. This work reviews research on emotional responses to fiction and applies it to the area of game sound. The synthesis is interdisciplinary, combining information and insights from a number of fields, including the psychology of emotion, film sound theory, experimental research on music perception, and philosophy. The paper identifies two competing frameworks for explaining fictional emotions, each with specific requirements and signature techniques for sound design. The role of sound is examined in both cases. The result is a psychologically motivated theory of sound perception capable of explaining the emotional impact of sound in film, as well as identifying the similarities and differences in emotional sound design for these two media.
masterful sound designs influence emotion are as much a celebration of the skills that went into designing them in the first place as tools for shaping new experiences in the future. I am also positive that a set of clearly enunciated arguments for why sound holds so much power can be of use, especially in defending sound when it has to compete for attention (or budget) with visual effects. Finally, this paper links into the academic discussion on sound design, and provides a synthesis of relevant literature from several vastly different fields. The topic of sound design is still advanced in many separate camps, partly because its interdisciplinary nature requires tapping into so many fields at once. The goal of this paper is not to provide an exhaustive review. Rather, it offers a selection of findings from different sources, suggesting both where future connections may be found and directions in which some of them may be explored further.

The structure of the paper is as follows: Chapter 2 discusses the nature of emotion in general, and looks at emotion theory applied to film and games. Chapter 3 considers the processes wherein film sounds influence emotional response and examines several sources of emotion in sound. Chapter 4 offers a summary of the emotional effect of sound and compares emotional sound in film with sound in the interactive context of games.

2 Emotions and Emotional Responses in Film and Games

In psychology literature, whole chapters have been written about defining emotions. Perhaps it is due to the subjective nature of emotional experiences, as well as the culturally negotiated nature of which emotions are suitable and for whom, that it is hard to pin down precisely what constitutes an emotion. Intuitively, however, the struggle to define emotion seems puzzling. Most of us would seem to know, by intuition, what sort of experience the word emotion refers to. For the purpose of this text, let us nevertheless briefly consider what it implies. Oatley and Jenkins [20, p. 96] mention the following three defining features:
- Emotion is usually caused by a person (consciously or unconsciously) evaluating an event of some significance in relation to a goal or situation.
- The core of an emotion is a readiness to act, and emotion influences actions by preparing for certain types of actions and causing a sense of urgency.
- An emotion is usually experienced as a distinctive type of mental state, sometimes accompanied or followed by bodily changes, expressions, or actions.

Emotions are thus evaluations of a specific (visceral and urgent) type, which signal events of critical importance and relevance to the perceiver. They provide the perceiver with an evaluation of the event, producing (typically) either a positive or negative sensation. This sensation is referred to as valence. They also produce a sense of urgency. Generally, the activation part of emotions is called arousal.

2.1 Fictional Emotions?

From a biological viewpoint the function of emotions seems clear. They provide the foundations for successful behaviour in an environment with promises and perils. The emotional relation to film and games, however, is an example of an emotional relation to fictional events. This has been a very controversial subject, and arguments go back and forth about whether all these emotional reactions, which have their cause in completely fictive events, are even real emotions at all. And if they are real, what is their cause? To shed some light on these questions, and as an inventory of the tools available for looking at the effects of sound, let us turn for a moment to psychology. What psychological support exists for emotional reactions to fiction?

The simplest objection to calling emotional responses to fiction real is related to the reality of the events underlying these responses. Most emotion theories require that evaluated events should be of real significance to the evaluating subject. This is precisely the concern underlying Tan's argument [27]. Investigating the emotional responses in film viewing, he bases the validity of his model on Frijda's [9] theory of emotions and discusses at length the compatibility of his theory with Frijda's basic tenets. In short, he argues that emotional responses can arise to fictive events as long as they are perceived as apparent reality. This is achieved mainly by the diegetic effect, a cognitive and perceptual illusion in the viewer's head. The illusion is maintained with several cinematic techniques, usually by dissolving and eradicating the medium. The witness position, Tan argues, is one of the very means by which the cinema excuses the viewer's passivity and explains, in keeping with the diegetic effect, the viewer's ability to see it all.

Another way to make room for fictional emotions is to allow emotional responses not only to real events, but to the fictions of our imagination. Damasio does just so: he calls the latter type as-if emotions, and describes in detail the processes in which the mind simulates affective responses [7]. According to Damasio, fundamentally affective processes are part of all rational decision-making, and function as a means of making decisions – we reason by simulating how the outcome might feel. From this point of view, fiction would be just another way of running simulations.

2.2 Emotional Responses to Film

There are two dominant approaches to understanding how film elicits emotional responses. One approach is based on Freudian psychoanalysis, invoking as a key concept the Lacanian mirror stage, which "typifies an essential libidinal relationship with the body-image" [14]. This stage is called forth by identifying in film, a process that links the self with the film experience through subconscious mental responses (the basis of which is rooted in early childhood and formations of the Ego). An alternative approach explains the emotional reaction to film through cognitive appraisal of portrayed fictional events. Instead of identifying subconsciously, the viewer responds emotionally because of their cognitive investment in the fictional events. The viewers' emotional responses are related to motivational processes [27][9][16].

According to the appraisal theory, the key to understanding emotions is to understand the role fictive events play for the viewer. Tan identifies two types of emotion at play in the viewing situation: fiction emotions (F emotions, empathetic emotions) and artefact emotions (A emotions, non-empathetic emotions). Empathetic (F) emotions are emotional reactions to the story; they require empathy with the main character. Non-empathetic artefact emotions are linked to sensory pleasures – the viewer enjoys the good looks of the protagonist or the beautiful scenery [27].

Non-empathetic emotions require no appreciation of events. The mechanism by which empathetic emotions are created is
more complicated. Tan suggests the film viewer connects to the fiction through a witness position. This position of agreed-upon inaction (even in the most anxious of moments, we are content to just watch the film) dictates the relations between viewer and film events. Our commitment to the protagonist's cause gives rise to empathetic relations. However, the inactivity dictated by the viewing position also creates tensions, such as worry or frustration when we know more than the protagonist but cannot change the situation. On the other hand, inactivity may allow us to empathize more deeply, since no action is asked of us.

2.3 Emotions in Games

The interaction in games usually has the player controlling a main character (or other objects in the game). Whereas cognitive appraisal of game events lends itself to an emotional impact, a key question in games is how the narrative reality is maintained. In particular, games remove the passive viewing position, previously suggested as a main contributor to the empathetic effect in film. Also, the typical structure of games – featuring repetitive action, often only loosely bound together by a story – is challenging to empathetic emotion.

On the surface, it would seem that the active nature of players in games breaks the passivity that Tan proposes as so significant in relation to the justification of emotions. However, it turns out the activity of games lends itself rather well to the purposes of explaining emotion, and the cognitive appraisal model has also been invoked in relation to computer games and emotions [21][15]. Activity, while challenging the passive ride of empathetic emotion, provides an alternative frame in which to evaluate actions.

Perron [21] explicitly works from Tan's theory to add gameplay emotions (he calls them G emotions), which arise from the cognitive appraisal of game situations. Perron's G emotions arise precisely because, in games, the player is invested in acting. Emotional evaluation is fuelled by care for the progress of the game. Lankoski [15] offers a detailed breakdown of emotions within the gaming context, demonstrating how different goal evaluations give rise to basic emotions and their combinations.

In games with a protagonist, gameplay emotions may equal care for the protagonist, but this care is essentially different from empathetic emotion: from the perspective of gameplay, the protagonist is a means, a tool, for playing the game (and achieving the personal goal of completing the task). Whereas the protagonist can also provide a vessel for empathetic emotions, to some extent the two frameworks for emotional evaluation - empathetic and gameplay - are competing. As Perron [21] notes, even in games, both the main story line as well as individual plot elements and narrative turning points are indeed furthered from time to time through filmic means, using stills, predetermined animation sequences/dialogue or cut scenes. During these moments, the player is stripped of control and, effectively, reduced to a witness position. Empathetic emotion comes with loss of activity. A similar point is made by Lankoski [15] when he suggests the empathetic capability of the player is inversely related to the cognitive challenge of action - in the heat of a battle, there is little time to ponder the protagonist's feelings.

3 Roles of Sound in Emotional Appraisal

The appraisal theory would seem to give little explicit explanation of how sound contributes to emotion. The question is to what extent sound influences the process of appraisal. It seems clear that for both film and games, sound is part of the process enforcing the inherent emotionality of events as appraised within each framework. However, the practical and functional approach to sound design is different. In film, sound makes actions seem real and consequential for the viewer, both factors mentioned as prerequisites to empathetic or F emotions [27]. Within this category of emotions, two specific cases can be identified for sound.

One goal of film (and game) production is to make things on screen seem real. A silent two-dimensional representation has limited apparent reality on its own, but adding sound helps the viewer perceive the pictures on screen as physical bodies. Especially important in this respect is the use of Foley, or synchronized sound effects.

As another case of supporting realism, an integral role for sound, at least in classical (Hollywood) film production, has been to hide the medium. An example is the common practice of continuing sounds over cuts, making the cuts less noticeable. Many other conventions of film sound also contribute to the invisible-medium effect, either directly or indirectly. Thus, a role of sound is to create a sense of immediacy.

In the context of games, the significant emotional investment goes into advancing goal-related progress. Most current games rely on both auditory and visual content to represent the game world. Crucially, in the context of interaction, sound comes to take on a new task: facilitator and confirmer of action. Historically, technical considerations have long dictated the tools for sound expression in the interactive context, which has forced (and allowed) game sound to deviate from some film sound conventions. The role of sound in games is at least partly dictated by a functional approach, and a sound's impact is defined by its capability of supporting and facilitating gameplay.

Finally, in discussing empathetic emotions we mentioned that some appraisals require no cognitive investment in the story, but are linked purely to sensory pleasures. Interestingly, there is no explicit mention of negative affect in the context of artefact emotions. The same artefact emotions are present in games as well, where part of the game can be enjoyed (or not) apart from aspects of gameplay-related progress. Soundtrack CD sales give strong evidence that artefact emotions function with regard to both film and game music. Beautiful pieces of music obviously lend themselves to being appreciated as such, regardless of whether the viewer is attending to the story. Further, it is motivated to broaden the spectrum of artefact emotions to include negative affects as well. Considering how visual material is used for shock effects (e.g. displaying blood and entrails), it is easy to imagine a similar process in which unpleasant sounds, regardless of story, could produce negative affect. Both in film and games, sound provokes sensory pleasure and displeasure.

3.1 The Realism Fallacy

The above categories identify the main roles of sound in creating and steering the emotional experience. However, as
we shall soon see, there are unresolved contradictions hiding within these categories. Namely, the effects within the two latter categories of sound (gameplay and non-empathetic) seem to contradict the traditional sound design goal of narrative realism.

Appraisal theory of emotion holds that empathetic emotional processing and value judgement is guided by conscious attention to a story, and this cognitive investment is heightened by the perceived realism of the portrayal. On the other hand, it allows emotional experiences that arise from the appreciation of the artefact: the pictures and sounds of the film/game as such. Similar appreciation is present in games, a medium where technological artistry is elaborately showcased, often even used in promotion.

Tan's view is that artefact emotions detach the viewer from the story, drawing attention away from the narrative towards the film as artefact, thus making the actions within the narrative less consequential for the viewer. This position raises a complicated question, namely how to interpret sounds that appear to transgress the borders of realism despite experientially supporting the narrative.

For example, while discussing the effect of sentiment, Tan and Frijda [28, p. 62] mention sound, especially orchestral music, as one possible source of the awe-inspiring. Awe requires a sense of overwhelming power, and this role is partly to be played by sound. The effect, according to Tan and Frijda, is the emotional function of total submission, a feeling underlying e.g. crying. Sound is thus a tool for portraying power and heightening sentiment. The question is by which channel this non-empathetic heightening of sentiment is capable of influencing the (empathetic) evaluation of narrative events.

The problem of border transgression is most apparent in two sound conventions that would seem to deviate from the purpose of realism. In what seems like a blatant contradiction, they invoke a sense of realism with highly unrealistic sounds. One is the use of musical scoring. The other is the use of sound effects that mismatch what is seen on screen. Similar breaches are omnipresent in games as well, where sound elements effortlessly transgress borders, allowing objects within the story world (diegesis) to refer to non-diegetic space, and vice versa [8][12].

The above concerns are about where to draw the line of realism and about how emotional effects communicate across different categories of judgement. Upon closer scrutiny, both questions generalize to the way non-empathetic affect influences other sources of emotion (empathetic or gameplay-related appraisal). To proceed further, we need an explanation of how non-empathetic emotions arise, and a way of predicting when and how artefact emotions lend emotional meaning to other evaluative processes.

3.2 Unconscious Affective Processes

The theoretical frameworks dealt with above have considered affect by means of perceived experience. Nevertheless, it seems many of the associations and effects of sound in both film and games work on an unconscious level.

Several findings point to the fact that at least some evaluations of stimuli are made precognitively. Zajonc was among the first to point this out in his essay, famously entitled "Feeling and Thinking: Preferences Need No Inferences" [31]. This view is further fortified by subsequent results from experimental psychology, leading researchers to suggest that there exists such a thing as unconscious emotion [30]. For example, Öhman [32] has demonstrated fear reactions in people who were presented with spider and snake pictures subconsciously; that is, people became frightened of pictures they never even realized they had seen.

In light of unconscious value judgements, it is easy to read Tan's non-empathetic emotions as but one example of these: the sounds of the film invoke emotion by nature of their perceptual properties, unrelated to the story at hand. This also opens up a way to understand and predict what would result in a positive (or negative, for that matter) value judgement. One especially interesting factor is the importance of familiarity and perceptual fluency for eliciting positive value judgements [23]. By this account, beauty is defined by the ease with which a stimulus can be processed.

3.3 Misattribution and Making Sense of Emotions

Affective responses include a paraphernalia of bodily responses (pounding heart, sweaty palms) and may also bring with them a certain action tendency (e.g. fight-or-flight). In fact, it has been suggested that one possible way in which we consciously attend to our affective processes is by appreciating abrupt changes in the felt background state, or what Russell [24] has called core affect. Such a change would lead us to seek a cause for our altered state, leading us via a cognitive process to attribute the jolt to the most plausible event in our environment.

Whether or not we are willing to accept unconscious affect as the source of emotions, we can agree that when consciously attended to, emotions tend to have an object. Emotions are evaluations of something. To be able to function properly, we must be able to determine an object for our emotions, something to be afraid of, or pleased by. This distinguishes emotions from moods, which are long-term affective states without an object. However, and here comes the catch, the events we cognitively allocate as objects do not necessarily have to be the true cause of our initial affective response. In fact, when it comes to reasoning about why we feel like we do, we are prone to make misattributions and erroneously appreciate our affects quite differently from their real causes, even in everyday life.

A classical example of misattribution is a study by Schachter and Singer [26]. They injected subjects with doses of adrenaline, a hormone associated with an excited body state. Depending on the situation that followed the injection, subjects judged their aroused state as either anger or elation. It can be argued that in most cases, our appreciation of our emotional state is at least partly determined by context.

3.4 Misattribution and Emotion in Film and Games

Misattribution is a process wherein the contextual appraisal of perceived emotional 'raw material' lends emotional meaning to an outside cause, irrelevant to that particular emotional stir. Now consider a similar process at work during film viewing or when playing a game, with music, sounds, pictures and actions all mingling to create emotional impacts. Is it not probable that at some point the true causes of our feelings might remain oblivious to us? Is it not possible that we, stirred by our passions, unwittingly, in deciphering the
cause of our emotions take them to be caused by whatever the film serves to us on a silver(screen) plate? Could it be that we just happen to attend to a game event, and assume our emotions are caused by that event, when in fact they are not?

This is, essentially, what Annabel Cohen proposes. Cohen [5][6] has dealt extensively with the difficult question of why, and how, something as obviously constructed as the film score does not completely destroy the sense of realism in a film. On the contrary, as many composers will confirm, a carefully chosen (or composed) piece of music will actually heighten the sense of reality in a film. Music also seems to lend a great deal of emotion to events, in ways other than those proposed by the cognitive appraisal theory. Cohen's answer lies in a congruence-associationist model of film viewing [18], whereby music focuses attention on those objects in the film that are congruent with the sound. At the same time, conscious attention is directed away from non-associated sounds, and attending is suppressed for stimuli irrelevant to ongoing cognitive processes.

The emotional impact comes from the fact that sounds, even when unattended to, will nevertheless affect the perception of objects in the film. Cohen [6] highlights the importance of temporal unity as a binding factor and a predictor of which parts of the sound will draw attention. She calls our attention to animation and the technique called mickey-mousing, whereby sound effects are replaced by short musical motifs. Their temporal matching allows these music snippets to replace the original sounds of the events, at the same time imbuing both the events and the objects taking part in the action with specific characteristics.

The account of musical meaning in film provides an equally useful tool for approaching the question of other film sounds as well. Applied to object sound, the theory suddenly appears much less mysterious: consider how we find out the properties of objects in real life. What we do is handle the object – tap it, stroke it, bang it against something. By perceiving synchronic sounds, we find out the normal sound of a chair, a balloon or a mandolin. Now, turn that process around and we have precisely what sounds do in a film (and, to a great extent, in games as well): now the temporal unity of event and sound defines the object through what sound it makes. Longer chains of events, if temporally matched, cause similar perceptions. When approached from this angle, it is not so odd that sounds in fiction may deviate somewhat from their real-life counterparts without seeming false or unrealistic. What may seem surprising is that a whole sum of temporally congruent sounds may become involved in the same process, from simple Foley through more elaborate layers of sound effects all the way to music.

3.5 Where Do the Emotions Come From?

So far we have established that unconscious emotional […] studies at least part of the functions of musical meaning appear to be universal [19].

The existence of a universal function of musical emotion suggests there may be other sources of emotion humans draw from in their interpretations of music. The obvious case is that music invokes memories and connotations awakened by that music. However, those would not be universal, even less so than culturally learned expectations. Van Leeuwen [17] turns to the human body, suggesting that the most primitive, and also a common, link between sound and emotion for all humans is the perception of our own bodies. Especially the vocal system sets a reference point through simultaneous experience of how it feels and what it requires to produce a certain sound.

Another suggestion is that the evaluation of some sounds has a biological motivation. This appears to be the case with the startle response (the phenomenon in which we jump at someone shouting 'boo' at us), which aside from providing for pranks also makes us more alert to dangers and automatically directs our attention toward potentially harmful events. However, there may well be other ways in which our perception of sound is evolutionarily determined. For example, Huron [11] suggests that the perceived cuteness of sounds may be an evolutionary adaptation that promotes parenting.

Value judgements (which are the raw stuff of emotion) seem to be going on even at the lowest level of perception. A classical example within music (and other perceptual) research is the mere exposure effect, wherein a stimulus is judged as likeable merely as a function of familiarity. Investigations into a phenomenon called perceptual fluency suggest that emotional processes are influenced by the very ease of processing [23]. These findings would imply that such things as differences in the perceptual clarity (think signal-to-noise ratio) of audio influence the emotional impact of sound, such as perceived beauty or likeability.

The critical requirement for musical expectations to arise is that the sound is attended to as music. Further, there appear to be boundaries in our listening schemes that separate different styles of listening. Notably, musical listening, in which sounds are perceived as sounds, is not the only form of attending to events. An illustrative deviation from this frame are listening styles provoked by compositional techniques invoking other listening styles, like musique concrète, where the use of real-world sounds provokes listening not at sounds and patterns, but for causes – Chion aptly refers to this as causal listening [4]. This is a special case of music, perhaps seldom used in film, but appearing more and more in games. In these cases, the framework of listening is perhaps more determined by evolutionary and low-level perceptual processes of meaning-making than by musical listening modes.
processes may ‘contaminate’ temporally congruent events 3.6 Realism Revisited
through misattribution and shown how this may influence We should now attempt a new understanding of realism.
perception of events in film. The big question remains, where Within fiction, realism is not an absolute, but a nominator for
do the emotions come from? a certain level of fit, an apparent realism or credibility. On
the narrative level, realism allows taking the story seriously
The most frequently researched category of emotional sound enough to allow emotions of empathetic quality. Good fit is
is music. Huron [10] describes musical emotions in terms of determined by whether the sound is credible (or illustrative)
fulfilling expectations: the interplay between anticipated and of a certain sound source [1], pp 190]. When the Foley artist
sounded music progression creates patterns of dynamic (the person responsible for creating sounds to on-screen
tension and relaxation. Musical expectations arise from events) smashes pumpkins in his studio, he does so in order
several sources, most of them cultural, but according to some to produce such sounds with good fit with on screen events.
Many times, the sounds produced have little to do with the
actual event seen on the film screen – indeed, non-realistic sounds are often purposefully used to make the action sound better. It is, for example, recognised that walking on cornstarch sounds much 'more real' on film than the actual sound of walking on snow.

A possible explanation underlying the perceived realism of some Foley sounds is the notion of prototypicality. A prototype is an object that embodies the central perceptual characteristics of a given category. Prototypes do not necessarily exist in reality; they are mental constructs of our perceptual system. The prototypical chair is the average of all the chair perceptions of your brain, and by definition it will be the 'chairest' chair of them all. Experimental psychology has established that people perceive prototypes as more easily recognised [27], and also as more beautiful and trustworthy [23], than other category members.

Similarly, narrative reality determines how sound behaves within the diegesis, and how source sounds should sound when listened to from different Points of Audition¹, such as listening from behind a wall or under water. At the core, then, immediacy too is but one way of creating a sense of realism. In film, it serves to reinforce the witness position: being present but out of control. Tan [27, pp. 25] considers this in his analysis of the camera point of view, mentioning how even in first-person view the camera is often a bit off, making space for someone to 'look over the shoulder'. In the case of sound, the heightened, focussed sound, including over-clear dialogue, can be considered realistic if we view it as a portrayal not of the scene, but of the experience of listening to the scene. Consider this: while our environment usually contains a multitude of sounds, we attend to only a select few at a time. We are also exemplary at picking out and following these sounds. A person with normal hearing has no difficulty following a single conversation in a room filled with people – the phenomenon so aptly named the 'cocktail-party' effect. Thus, what is presented as the film's sounds is not the scene itself from a given point in space, but the scene as heard by attentive ears.

The narrative realism of a sound thus lies neither in the faithful reproduction of sound sources nor of their environments. The apparent realism of a sound in the context of narrative is defined by how representative the sound is of a certain event. Sounds that are highly representative have good narrative fit. High narrative fit supports empathetic emotion.

Importantly, however, the evaluation of sounds spans several layers. Below the level of narrative meaning are (partly) unconscious processes whereby sounds are judged emotionally. For example, a sound can have good fit narratively, but poor legibility, because the signal-to-noise ratio is so low. Importantly, breaches at this level are disruptive to the perception of sound. We have seen that perceptual fluency is also capable of influencing affective evaluations [23]. Thus, as with narrative fit, unconscious processing of sound influences emotional judgements. These affects are unrelated to the narrative content of the sound, but tap into the notion of artefact emotions. Depending on their nature, they can cause pleasure or displeasure, which can then be attributed to other temporally congruent events.

Finally, in interactive systems the interpretation of sound takes on a new role: conveying functional information [29]. In this task, sounds are evaluated on a third level, in how well they serve a functional value; Jørgensen [12, pp. 49] refers to this as a sound's functional fidelity. At this stage, emotional evaluations are no longer determined only by the sound itself, but by the utility of a sound for the higher goal of performing goal-related actions. The value of functional sound depends on how its functional aspect supports game progress – the utility of the sound. The utility of sound is connected to goal-related cognitive evaluations, and high utility reinforces gameplay emotion.

4 Comparison of Emotional Sound in Film and Games

The cognitive appraisal framework provides two alternatives for an emotional relation to fictive events: the passive witness position allows empathetic emotion, while the active player draws emotional meaning from goal-related evaluation. These two frameworks ask the viewer/gamer to take on different attitudes towards the fiction and appear to be competing. They also rely on different strategies for sound design.

For film sound, the effort is usually on heightening narrative reality. This is achieved through detailed attention to narrative fit, often striving for high apparent reality. The focus is on advancing the narrative, heightening and clarifying those specific actions that are necessary for following the story's progress (usually the top priority is dialogue).

For games, the focus is different, as games have to support player action. In games, the task of many sounds is primarily to provide feedback about actions. Hence, narrative fit is often sacrificed for utility. To the extent that auditory cues are used to guide actions, they are treated with the utmost respect for legibility. For example, even in the case of instructions with a diegetic source (a non-player character, voice mail, etc.), it is common that auditory instructions remain audible even if the character runs away from their diegetic source.

An interesting avenue for sound design in games is to shift the focus from music to the emotional impact of Foley and sound effects. A possible alternative for emotionality in games lies in environmental sounds, already used in many games, where ambient sounds are beautifully merged with musically suggestive elements and event sounds into a sonic landscape in the spirit of musique concrète. However, for this approach to be systematically explored, a better understanding is needed of how everyday sounds influence emotions. In these investigations, theories of unconscious emotion may prove especially informative.

¹ Similar to Point of View in camera techniques, but using sound.

References

[1] Bordwell, D. and Thompson, K. 1985. Fundamental Aesthetics of Sound in the Cinema. In Weis, E. and Belton, J. (eds.) Film Sound: Theory and Practice. Columbia University Press. 181-199.

[2] Brandon, A. 2005. Audio for Games: Planning, Process and Production. New Riders.

[3] Bridgett, B. 2007. Audio Postmortem: Scarface: The World is Yours. Available at http://www.gamasutra.com/features/20070322/bridgett_pfv.htm [Accessed 29.5.2008]
[4] Chion, M. 1994. Audio-Vision: Sound on Screen. (Translated by Claudia Gorbman). Columbia University Press.

[5] Cohen, A. 1990. Understanding musical soundtracks. Empirical Studies of the Arts 8, 111-124.

[6] Cohen, A. 2001. Music as the Source of Emotion in Film. In Juslin, P. and Sloboda, J. (eds.) Music and Emotion. Oxford University Press. 249-272.

[7] Damasio, A. 2005. Descartes' Error: Emotion, Reason, and the Human Brain. Penguin, paperback reprint (1994).

[8] Ekman, I. 2005. Understanding Sound Effects in Computer Games. In Proc. Digital Arts and Cultures 2005, Copenhagen, Denmark.

[9] Frijda, N. H. 1986. The Emotions. Cambridge University Press.

[10] Huron, D. 2007. Sweet Anticipation: Music and the Psychology of Expectation. MIT Press. Paperback reprint (2006).

[11] Huron, D. 2005. The Plural Pleasures of Music. Proc. 2004 Music and Music Science Conference. Kungliga Musikhögskolan & KTH (Royal Institute of Technology), 1-13.

[12] Jørgensen, K. 2007. 'What are Those Grunts and Growls Over There?' Computer Game Audio and Player Action. Ph.D. dissertation, Copenhagen University.

[13] Kutay, S. 2006. Bigger Than Big: The Game Audio Explosion. A Guide to Great Game Sound. Available at http://www.gamedev.net/reference/articles/article2317.asp [Accessed 29.5.2008]

[14] Lacan, J. 1951. Some reflections on the ego. (Read by Lacan to the British Psycho-Analytical Society on 2 May 1951.) Available at http://aejcpp.free.fr/lacan/1951-05-02.htm [Accessed 29.5.2008]

[15] Lankoski, P. 2007. Goals, affects, and empathy in games. Paper presented at Philosophy of Computer Games, Reggio Emilia, Italy. Available at http://www.mlab.uiah.fi/~plankosk/blog/?p=53 [Accessed 29.5.2008]

[16] Lazarus, R. 1991. Emotion and Adaptation. Oxford University Press.

[17] Leeuwen, T. van. 1999. Speech, Music, Sound. Macmillan.

[18] Marshall, S. and Cohen, A. 1988. Effects of Musical Soundtracks on Attitudes toward Animated Geometric Figures. Music Perception 6, 95-112.

[19] Narmour, E. 1990. The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. University of Chicago Press.

[20] Oatley, K. and Jenkins, J. 1996. Understanding Emotions. Blackwell Publishing.

[21] Perron, B. 2005. A Cognitive Psychological Approach to Gameplay Emotions. Proc. DiGRA 2005 Conference: Changing Views – Worlds in Play.

[22] Prince, B. 1996. Tricks and Techniques for Sound Effect Design. Computer Game Developers Conference 1996. Available at http://www.gamasutra.com/features/sound_and_music/081997/sound_effect.htm [Accessed August 20, 2008]

[23] Reber, R., Schwarz, N. and Winkielman, P. 2004. Processing Fluency and Aesthetic Pleasure: Is Beauty in the Perceiver's Processing Experience? Personality and Social Psychology Review 8 (4), 364-382.

[24] Russell, J. 2003. Core Affect and the Psychological Construction of Emotion. Psychological Review 110 (1), 145-172.

[25] Sanger, G. 2003. The Fat Man on Game Audio: Tasty Morsels of Sonic Goodness. New Riders.

[26] Schachter, S. and Singer, J. 1962. Cognitive, Social, and Physiological Determinants of Emotional State. Psychological Review 69, 379-399.

[27] Tan, E. 1994. Film-induced affect as a witness emotion. Poetics 23, 7-32.

[28] Tan, E. and Frijda, N. 1999. Sentiment in Film Viewing. In Plantinga, C. and Smith, G. (eds.) Passionate Views: Film, Cognition, and Emotion. Johns Hopkins. 48-64.

[29] Tuuri, K., Mustonen, M.-S. and Pirhonen, A. 2007. Same sound – Different meanings: A novel scheme for modes of listening. Proc. Audio Mostly 2007, Ilmenau, Germany.

[30] Winkielman, P. and Berridge, K. 2004. Unconscious Emotion. Current Directions in Psychological Science 13 (3), 120-123.

[31] Zajonc, R. B. 1980. Feeling and Thinking: Preferences Need No Inferences. American Psychologist 35, 151-175.

[32] Öhman, A. 2005. The role of the amygdala in human fear: Automatic detection of threat. Psychoneuroendocrinology 30, 953-958.
Interactive Sonification of Grid-based Games

Louise Valgerður Nickerson¹ and Thomas Hermann²

¹ Interaction, Media and Communication, Department of Computer Science,
Queen Mary, University of London, London, U.K.
lou@dcs.qmul.ac.uk

² Ambient Intelligence Group, Cognitive Interaction Technology – Excellence Center (CITEC),
Bielefeld University, Bielefeld, Germany
thermann@techfak.uni-bielefeld.de

Abstract. This paper presents novel designs for the sonification (auditory representation) of data from grid-based games such as Connect Four, Sudoku and others, motivated by the search for effective auditory representations that are useful for visually-impaired users as well as to support overviews in cases where the visual sense is already otherwise allocated. Grid-based games are ideal for developing sonification strategies since they offer the advantage of providing an excellent test environment to evaluate the designs by measuring details of the interaction, learning, performance of the users, etc. We present in detail two new playable sonification-based audio games, and finally discuss how the approaches might generalise to grid-based interactive exploration in general, e.g. for spreadsheet data.
[...] sonification is to be selected and which parts of the grid are to be explored. Indeed, interaction plays an important role in inspecting grids, as can be seen in the visual exploration of grid games, where eye-movements, fixations and saccades are naturally used to serialise and access the information. In a similar fashion, we believe that manual interaction is an important (if not the key) ingredient in creating successful designs. We present, for instance, a graphics-tablet-based sonification approach where proprioceptive information serves the intuitive understanding of the position in the grid, whereas sound conveys the information about the grid content; we demonstrate this with the 4×4 Sudoku in section 3. Different exploration strategies emerge from such an approach.

To couple interaction to plausible acoustic responses, we use ideas from Model-Based Sonification (MBS) [2, 4]. MBS describes how to use excitatory systems in order to create informative sound as the result of processes in which the user's interaction puts energy into a data-driven, sound-capable system. Even without creating a coherent sonification model, MBS can help to create designs that are more intuitively understood.

A key problem in grid-based game sonification is the missing persistence of the grid, as opposed to the persistent visual game board. To create a close analogy to the visual task of adding visual elements on a board, an auditory version can use a stationary sound pattern which is permanently played, allowing players to add sound elements accordingly. Conditions to win a game translate to corresponding auditory conditions within the sound. This analogy might open a window to the design of very interesting new audio games; however, we here keep the focus on grid-based games, and thereby translate the analogy into a rhythmical sonification strategy where, instead of a stationary sound, a repetitive sound pattern is created, which can be regarded as one bar in a repeating sonic loop. We develop this idea into a playable version of Connect Four in section 4.

We discuss our ideas via qualitative experiments with a limited set of users, since we are still within the design phase towards stable sonifications, and we close the paper with an outlook on our ongoing work.

2 Background

In the visual realm, space is used to make salient the information of interest. In the case of grid-based games, it organises the items on the grid so that the players can easily make sense of the state of the game. This is also true of data that is visually displayed in tabular format: it makes correlations between two axes clear. We can use grid games to represent tasks that one might perform with grid-organised data, such as in a spreadsheet. Connect Four can represent looking for linear patterns in data, while Sudoku can represent cross-correlating subsets of data.

2.1 Traditional methods

A grid-based representation that often gets tackled is the auditory representation of images. This is traditionally done via scanlines: each pixel value is mapped to sound and played in order. More advanced techniques involve finding textures in the image to represent in sound. The difficulty with the scanline approach is the challenge of lining up the rows so that one can understand patterns that are orthogonal to the direction of the scanline. The pattern and auditory texture approach is much closer to what we try to accomplish here with our implementation of Connect Four.

Other pertinent work is research into the sonification of spreadsheet or tabular data. Stockman, Hind and Frauenberger [7] describe a system where visually-impaired users can navigate spreadsheet data by mapping numerical values to a range of pitches. The data is then played serially by row or column. This is meant for generic use; our approach is to look to the specific to inform the generic. Kildal and Brewster [5] describe a method of providing overviews of numerical data in tables by again mapping values to pitches. Here, rows and columns are presented concurrently, giving the user quick access to where the highest and lowest values are to be found. The idea of concurrency is one we apply to our implementation of Connect Four.

2.2 Connect Four versus Sudoku

One can generalise grids as M × N grids with a set of k potential token values. Connect Four is a 7 × 6 grid with three token types (one for each player, plus the 'empty cell' item). Sudoku is an n² × n² grid with n² + 1 tokens (n² tokens and the 'empty cell' item). The most common variant of Sudoku is where n = 3, i.e. the 9 × 9 grid.

There are several differences between the games and their grids. Connect Four is a two-player game while Sudoku is a single-player game. Another difference is that in order to win Connect Four, a pattern of four tokens in a line must be achieved, while in Sudoku the tokens must be uniformly distributed. In both games the grids get filled one move at a time. In Connect Four, at the end of each pair of turns there is an equal number of each token in the grid, while in Sudoku this condition is only properly achieved when the puzzle is completely solved.
Another key difference is that in Connect Four, when tokens are added to the grid, they are placed in the lowest unfilled cell in the selected column, while in Sudoku, cells can be filled in any order. The playing of Connect Four is what drives the dominant features of the sonification described in section 4. The columns are primary, as their state is what informs the players where tokens [...]

[...] interaction (see figure 2). We also employ MBS to provide contextual information to the player. Sonification examples are provided at [1].

[Figure: the Sudoku play area on the graphics tablet, with corner cells labelled (1,1) and (1,4).]

The energy E_ij of cell (i, j) evolves according to

    dE_ij/dt = −λ E_ij + q · Σ_{(k,l) ∈ N(i,j)} (E_kl − E_ij)        (1)
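Equation (1) can be simulated with a simple Euler integration. The following sketch is illustrative only (the paper does not give implementation details or parameter values); it assumes a von Neumann neighbourhood for N(i, j):

```python
def neighbours(i, j, size):
    """Von Neumann neighbourhood N(i, j) on a size x size grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < size and 0 <= j + dj < size:
            yield i + di, j + dj

def step(E, lam, q, dt):
    """One Euler step of dE_ij/dt = -lam*E_ij + q * sum(E_kl - E_ij)."""
    size = len(E)
    new = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            flow = sum(E[k][l] - E[i][j] for k, l in neighbours(i, j, size))
            new[i][j] = E[i][j] + dt * (-lam * E[i][j] + q * flow)
    return new

# Inject energy where the stylus enters a cell, then let it decay and spread.
E = [[0.0] * 4 for _ in range(4)]
E[1][1] = 1.0                       # stylus enters cell (1, 1)
for _ in range(10):
    E = step(E, lam=0.5, q=0.1, dt=0.1)
```

The decay term −λE lets an excited cell ring down, while the coupling term q·Σ(E_kl − E_ij) spreads energy to neighbouring cells, which is what lets a single stylus contact sound out a local region of the grid.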
standard techniques for panning and filtering of sound. More information will be provided on our website at [1].

3.2 Playing the game

Players use a stylus to explore the grid and enter values on the graphics tablet. There is also a graphical representation of the grid (see figure 4) which provides the limits of the cages. To probe the grid, players either tap or drag the stylus across the grid. When the stylus enters a cell, the cell is injected with energy (as described in equation 1) and the energy flows through the grid. To enter a value, players click a button on the stylus. Each click cycles the current value of the cell to the next value. If a cell holds a starting value, nothing happens.

Figure 4: The graphical interface for Sudoku. Players interact with the grid using the graphics tablet stylus.

3.3 Discussion

Five men and two women played the auditory version of 4×4 Sudoku, two of whom were musicians and one of whom was visually impaired. Their level of experience with playing Sudoku ranged from beginner to advanced.

3.3.1 Player feedback

Feedback for Model-Based Sudoku was varied. We presume that this is partially due to it being a single-player puzzle game (a two-player game, on the other hand, engages the players' competitiveness and allows them to learn from one another). The general consensus was that while 4 × 4 Sudoku is quite simple visually, the auditory version was quite challenging, and the smaller version was approximately the right level of difficulty. Here are some of the more specific findings:

First try: First attempts were often frustrating, sometimes resulting in starting over. This indicates that different initial solving techniques are needed: mapping out grid density rather than the location of similar items. Second games were much smoother.

Tapping vs dragging: We assumed that the majority of interaction would be by dragging. However, the majority of players (Players 2, 4, 5 and 7) preferred to tap the cells to excite the grid. Player 2 explained this by saying that the sounds made by the model made this tapping interaction more intuitive. Another explanation is that players were tapping in order to compare only two values at a time.

Draggers: For the players who dragged more than they tapped, the stylus and tablet interaction allowed them to quickly scan a row, column or cage by drawing lines or circles in the grid. These players appeared to be the fastest at completing puzzles. We anticipate that this is because a quick scan allowed players to quickly determine which tone was missing or whether there were two tones of the same value in the row/column/cage.

Panning: One surprise was that both Players 2 and 5 (both tappers rather than draggers) found that the panning was not helpful and in fact was distracting, making it harder to compare values. This possibly indicates that the use of the graphics tablet sufficiently localises a player and the additional cues are unnecessary.

Based on the two different ways of interacting with the grid (dragging vs tapping), we expect that a better-fitting model will need to be devised to make it more natural for the dragging technique to be used. The faster interaction allows the patterns that occur to be absorbed more quickly. With a more intuitive model, players can more naturally take advantage of the way we process audio.

3.3.2 Informing grid sonifications

Much like how direct manipulation and the introduction of the mouse revolutionised graphical user interfaces, the use of the tablet enables the user to a greater degree than keyboard interaction. The tablet interaction contributed more to the success of the 4 × 4 Sudoku than the use of Model-Based Sonification. With a grid sufficiently small that the number of values is not overwhelming, stylus interaction gives the user the flexibility to explore the grid as they desire and provides speed that is difficult to mimic with traditional keyboard or 5-way navigation (such as on a mobile phone or a game controller). It also neutralises the problem of strongly localising a user in the grid through sound.

3.4 Scaling up to 9×9 Sudoku

The 4 × 4 implementation of Sudoku does not scale up well to 9 × 9 Sudoku. The main problem is that there
are simply too many values to remember. We require a new model that lends itself better to the larger grid. The main concepts here are generating models that can support the cross-hatching technique, where players cross-correlate values in rows, columns and cages to deduce values, and also categorise and order the values used in the grid. The problems that occur in the 9 × 9 Sudoku grid inform us how to alter the model that was used for sonifying the 4 × 4 grid.

It is clear that we need to develop specialised overviews and filters to allow users to focus on different parts of the grid. Key information is what is present or missing, and picking out items of similar values. This implies that players must be able to apply certain filters in combination as they interact with the grid, or prompt overviews to be played. However, it is also important not to lose the advantages of the direct interaction provided by the graphics tablet. For example, were the player interested in an overview of a row, tapping to the left or right of that row could play an overview of the row where tokens are played in a predefined order, using the graphical interface and panning to reinforce their positions. To query where a particular token is present, players could select the token from a list and use gestures in each cage to determine if it is present. Another filter could be used in combination with the token filter to show where the token is missing. Finally, a filter that only displays where the empty cells are could highlight where the grid is dense or sparse. To solve the problem of the large number of values to be entered, sounds that can be vocalised can be used. This enables the player to self-organise the tokens, and these can then also be used as input, removing the necessity to make mappings between tokens and their graphical representation. Vocal sonifications have been successfully used in the sonification of EEG data [3].

Given the complexity of the sonification and interaction needed, we have tabled our work on Sudoku for the time being and are focusing on the second game we implemented: Connect Four.

4.1 Design and implementation

The important features of Connect Four are the columns and the locations of tokens, especially where there are several of the same value in a line. Knowing what is around a token is therefore very important, as well as being able to focus on each token individually.

4.1.1 Representation of the grid

We represent the grid in a short looping sound so that players can think about the entire grid and understand where tokens are in relation to one another. The aim is to provide all the information quickly enough that the players can reason about it as a whole, with the distinct parts making up a pattern that they can work with. The end result is that the grid is like a short bar of music. We then punctuate this bar of music with two drum sounds to help players localise themselves within each loop. A stronger (or louder) drum plays at the start of the grid and the softer (or quieter) one occurs at the fifth column of the grid. Our initial design did not include the second drum; however, it quickly became apparent that when the grid is sparse, it did not have the energy or liveliness for which we were aiming, nor was the localisation strong enough. This aim also drove the rate of our auditory display. We experimented with grid lengths of 0.7 to 3.5 seconds. Less than a second was found to be quite manic, and over two seconds a bit too slow. Our preferred length was 1.4 seconds with a 0.2-second pause between loops, coming to a total of 1.6 seconds.
[...] is shown in figure 8. This allowed players more freedom in their interaction and also pushed them to rely more on the auditory feedback rather than looking at where their opponent placed a token throughout their move.

Figure 7: Two people playing Connect Four. The players trade off the stylus and use areas on the tablet to play in a column, as shown in figure 8.

Looping through the grid: The technique of making the grid into an auditory loop – be it column-wise, as we have done, or not – shows promise for providing a grid overview. This is similar to other work in auditory overviews [5], but instead of the column being sonified at the request of the user, it is repeated to continuously remind the user of the state of the grid. We believe this to be a technique that can help overcome the lack of persistence in the auditory channel. Here, we have implemented this technique, and players of the Connect Four game found it useful and engaging.

What remains to be tested is the limitations of this technique. Our sonification was limited to seven columns with a maximum of six values to represent, while most data sets encompass many more than that. It remains to be seen whether the technique is dependent on the number of columns presented or on the duration of the sonification. We envisage this technique being extended to comparing data sets as well, provided an overview of each data set could be presented as we have presented columns here.
[...] at this issue further as we complete our full analysis.

5 Conclusion

In this paper, we have presented some new approaches for the sonification of grid-based games. These grid games represent use cases of data displayed on a grid, allowing us to develop techniques that can be transferred to related applications, such as real-time video stream sonification, spreadsheet sonification for visually-impaired users, or the generalisation to 3D grids. These are attractive follow-up steps on our research agenda towards a better exploitation of sonic interactions for grid-structured data types.

We introduced an interactive sonification of 4×4 Sudoku grids using direct interaction with a graphics tablet. The Sudoku grid can inform how we might sonify sets of data and how they cross-correlate. The sonification design was straightforward, following the Model-Based Sonification idea that data parametrises acoustic systems, and that movement on the grid excites these systems. Thereby the sounds indicate quite directly what state a grid cell is in. Interestingly, users quickly started to develop strategies for exploring the 4×4 grids which we had not anticipated beforehand, such as drawing circles in cages, doing quick line-scans, or tapping on cells. Due to the limited complexity of the grid, this direct interaction is well suited to allowing users to solve the Sudoku. However, scaling the problem to the 9×9 Sudoku fails for two reasons: the user's memory is exceeded by the many items, and proprioception is not accurate enough to understand exactly which cell is being inspected. To better solve the 9×9 Sudoku, more specific sonification designs will possibly need to be developed.

The Connect Four game represents grid data where linear patterns occur. A rhythmic sonification approach was developed for the game, which can now successfully be played with the visual display playing a very minor role. It exemplifies an auditory display in good analogy to visual games where the board is per- [...]

Our next steps in this work are to complete our analysis of the formal evaluation of Rhythmic Connect Four and to integrate some of the findings from Sudoku to strengthen its implementation. This will include a whole spectrum of grid inspection: direct cell-based interaction, localised region overviews and overall summaries, integrated into a coherent interactive sonification system. This demands that we structure the sonifications so that the information obtained via the different approaches can easily be fused into an increasingly accurate mental image of the grid.

Acknowledgements

We would like to thank the COST IC0601 Action on Sonic Interaction Design (SID) for sponsoring this work and allowing collaboration between the Ambient Intelligence Group at Bielefeld University and the Interaction, Media and Communication Group at Queen Mary, University of London. Thanks are also extended to all the members of those groups, as well as those of the Centre for Digital Music (QMUL), who helped evaluate this work.

References

[1] Thomas Hermann. Online sonification examples. http://sonification.de/publications.

[2] Thomas Hermann. Sonification for Exploratory Data Analysis. PhD thesis, Bielefeld University, Bielefeld, Germany, 2002.

[3] Thomas Hermann, Gerold Baier, Ulrich Stephani, and Helge Ritter. Vocal sonification of pathologic EEG features. In Proceedings of the International Conference on Auditory Display (ICAD), 2006.

[4] Thomas Hermann and Helge Ritter. Listen to your data: Model-based sonification for data analysis. In Advances in Intelligent Computing and Multimedia Systems, pages 189-194, August 1999.

[5] Johan Kildal and Stephen A. Brewster. Providing a size-independent overview of non-visual tables.
sistent for both players here the persistence is cre- In Proceedings of the 12th International Conference
ated by a looped sonic pattern which serialises the grid on Auditory Display (ICAD), June 2006.
column-wise. First comments from players are promis- [6] Gregory Kramer. An introduction to auditory dis-
ing, however, we need to conduct user studies in order play. In Auditory Display. Addison-Wesley, 1994.
to investigate the potential of learning to better under-
stand the grid set-up. [7] Tony Stockman. Interactive sonication of spread-
sheets. In Proceedings of the. International Confer-
We are condent that by focusing on grid-based
ence on Auditory Display (ICAD), 2005.
games we will be in a very good position to evaluate
sonication designs and to compare the eectiveness
of dierent designs. These games thus represent an
ideal platform to examine sonic interactions. We hope
to make the games attractive so that players enjoy to
play and generate valuable data for us voluntarily.
-8-
34
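The Model-Based Sonification idea summarised in the conclusion, namely that data parametrise simple acoustic systems and interaction with the grid excites them, can be illustrated with a toy sketch. The state-to-frequency mapping, sample rate and decay constant below are illustrative assumptions, not the authors' design:

```python
import math

# Toy Model-Based Sonification sketch: each grid cell parametrises a
# simple acoustic system (a damped sine), and touching the cell excites
# it. Mapping and constants are hypothetical, not the paper's design.
SAMPLE_RATE = 8000

def excite_cell(state, duration=0.2):
    """Return mono samples of a damped sine whose pitch encodes `state`."""
    freq = 220.0 * (2.0 ** (state / 4.0))  # higher cell state -> higher pitch
    n = int(duration * SAMPLE_RATE)
    return [math.exp(-8.0 * i / n) * math.sin(2.0 * math.pi * freq * i / SAMPLE_RATE)
            for i in range(n)]
```

Because the data only set the parameters and the user's touch supplies the excitation, the sound directly reflects the state of the cell being inspected, as described above.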
Using audio aids to augment games to be playable for blind people
email: David.C.Moffat@gmail.com
Abstract. One potentially important use of audio technology is to make video games for blind people. Visual impairment makes
almost all current video games totally inaccessible. We augmented a simple shooting game with audio aids to provide information
about the direction and distance of moving targets, and tested it to see whether people could play the game without seeing the screen.
Players managed to play the game quite well in the blind condition, and in some cases the audio aids improved their performance
when they could see the screen. The conclusion is drawn that audio aids can make mainstream games accessible for the blind.
In the following listed examples of games for the visually impaired, synthetic speech output often occurs as an audio aid. There are other forms of audio aid that have been tried as well, with some degree of success. By the phrase "audio aid" is meant any type of audio output that is intended to help visually impaired players to play the game.

1.2.1 Special games
Some developers make games especially for the visually impaired. These are not intended to be played by normal (sighted) people. They are audio games, without a graphical aspect, designed so that blind people can play them well.
One developer that makes such games is GMA Games [2]. They have made the game Lone Wolf, for example, which is a submarine warfare game, in which the player captains a submarine on battle missions. The submarine uses speech output to report from its onboard sonar system, to tell the player when enemy submarines are within firing range. The technicalities of aiming torpedoes can be assumed to be left to the crew and to automation. It is a clever idea to choose a game world in which visual impairment would not be a disability. The technique of using in-game automatic devices for tasks like aiming is used in other accessible games, too.
In another game of theirs, Shades of Doom, which is based on the original Doom FPS (first-person shooter) game, the player has a special "night scope" device that signals with earcons (special sounds) that an enemy is within range [3]. Firing the weapon then automatically hits the target. While this game was based on a game for normal players, the game-play has been changed by this manner of shooting, to accommodate the blind. To change the game-play for an FPS game in this way is to make it a special game that is quite different from the original.

In these special audio games, it can be seen how good game-play can be brought to the visually impaired. It is inclusive to bring them into the world of games; but it is still somewhat exclusive in that they are then playing specially made games that normal people would not generally choose to play.
For the visually impaired sector of the population to depend on special games means that they would never have the same range of games to choose from, either. The market is smaller than the market for normal games; and the game developers will often be normally sighted people, who would need special training to understand the kinds of problems that visually impaired gamers face. From earlier points made, it is apparent that normally sighted developers suffer from blind spots of their own, in so often assuming that all players can see as well as they can.
To leave special game development to visually impaired developers, on the other hand, would make for much more accessible games; but even fewer of them. It would mean a minority of the game-playing population being served by a minority of the game-making population, which would make the games doubly expensive as well as relatively few in number.

For those reasons, but especially for social inclusion, to let the visually impaired play with normally sighted people, it is good to attempt to modify normal games with extra audio output so that the visually impaired can play them, too.

1.2.2 Games augmented with symbolic audio (earcons)
An example of a normal FPS game that has been modified to be playable by blind people is AudioQuake, which was developed by the AGRIP project [4], and is free, being based on the code-base for the original Quake FPS game by id Software [5]. The intention of the AGRIP project is to modify mainstream games for visually impaired players, and to allow them to modify the games further if they wish.

The player in AudioQuake has a RADAR device that beeps to signify danger such as enemy monsters. The beeps sound different for friends, for enemies, and for other objects. They are doubled to enable them to carry more information. The double beeps are faster when the target is closer, and the second beep is a higher pitch when the target is higher than the player.
Compared with the speech output in the games mentioned above, these beeping earcons allow information to be conveyed to the player more quickly, which is important for an FPS game. By giving the player more detailed information about the objects in the 3D world, AudioQuake gives players the opportunity to use their own spatial reasoning to make decisions. It still provides some automatic aiming, however, in that weapons "lock onto" targets that are within a certain angle from straight ahead. The experience it provides with the RADAR device is reduced in another way, too, in that only the nearest friend or enemy is detected. To have too many earcons representing several other objects at the same time was judged to be too confusing.
The game AudioQuake is quite close to providing an authentic experience for blind players, but again the aiming part of the game's play is different from that of the original Quake. The earcons used are also symbolic sounds, rather than naturalistic ones. Real enemies and friends do not beep faster when they approach you; nor beep higher when they are above you. A more naturalistic way to convey that kind of location information would be to use the human capacity for stereo hearing.

Another free game that makes good use of audio aids for blind players is Top Speed 2, by Playing in the Dark [6]. It is a car racing game, in which the sounds made by other cars passing on the right or left play louder in one or other speaker. The stereo perception of audio is also used to guide the player around the track, by providing audio feedback for position. The centre of the track emits a sound, so that the player hears it louder on one side when the car moves off the central "racing" line. Rather than use sounds for walls to warn the player away, as in AudioQuake above, the road-centre sound is used to keep the player in the middle of the track, even as it turns.
Another clever trick is to provide the driver with a co-driver, as in rally races. This means that the game can give a lot of information to the driver about the road ahead, as well as timing performance, in quite a natural way, using speech synthesis. By this means the game provides a fairly authentic experience without needing the blind player to play a special game. The modifications to the driving game are small enough to allow normal players and blind players to play together.

Another computer game for normal players which has been modified to be accessible to people with various disabilities, including visual, is AccessInvaders [7], which is a variant of the classic arcade game Space Invaders.
When set to be played by a visually impaired player, the game is simplified to have only one column of enemies; and they cannot kill the player. Instead, the player loses only if an enemy reaches the ground. The aliens emit a "sonar-like" sound, and it is spatial so that its position can be inferred. In this last point, AccessInvaders uses spatial sound like the game Top Speed 2 does, allowing the player to use both ears in a natural way.
In simplifying the game for the blind, however, it becomes a special game again, quite different in game-play to the normal version of the game. Since the designers want to encourage blind people to play games with sighted people, they have another way to make this possible. They introduce a handicapping system, in which some players get a simplified game with reduced experience, to make the game accessible to them; and the scoring and teamplay still allow them to play alongside normally sighted people.
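The AudioQuake RADAR mapping described above (double beeps that repeat faster for nearer targets, with the second beep pitched higher for targets above the player) can be sketched as a simple parameter mapping. The function, units and constants below are illustrative assumptions, not AGRIP code:

```python
# Hypothetical earcon mapping in the spirit of AudioQuake's RADAR device:
# nearer target -> faster double beep; target above the player -> higher
# pitch for the second beep. All constants here are illustrative.
def radar_earcon(distance_m, height_diff_m):
    """Return (beep_interval_s, second_beep_hz) for one detected target."""
    # nearer target -> shorter repeat interval, clamped to [0.1 s, 1.0 s]
    interval = min(1.0, max(0.1, distance_m / 100.0))
    base_hz = 440.0
    # an octave up when the target is above the player's height
    second_hz = base_hz * 2.0 if height_diff_m > 0 else base_hz
    return interval, second_hz
```

Doubling the beep lets one earcon carry two independent cues at once (distance in the repeat rate, relative height in the second beep's pitch), which is why it conveys information faster than speech output.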
1.2.3 Games augmented with naturalistic audio
The above games all make different compromises to give blind
players something like an authentic game-play experience,
without simplifying the game too much, and allowing some
degree of joint play with normally sighted players. They all use
audio aids to make this possible, and may use weapons that can
aim automatically or do other things to assist the player in ways
that are not always realistic.
The games use earcons to signify events in the world, but they
tend to be symbolic rather than naturalistic. They use stereo
hearing in some cases, but typically in a one-dimensional
setting, like Space Invaders or driving, where the player only
has to go left or right. When the game requires more complex
maneuvering in 2D or 3D space, like the shooting games, the
audio aids tend to revert to carrying information symbolically,
or with speech.
It is a challenge to support a blind player in a shooting game,
with naturalistic audio aids that convey information in the way
that sounds do in real life. Not with synthetic, coded beeps, that is; but with natural properties of sounds, like volume and stereo-detected direction.

The study reported below attempted to meet that challenge. A traditional shooter game is augmented with audio to give the player location information for objects (targets) in a naturalistic way, to see how well blind people could play it. The game is not simplified for them in any way.

2 Audio aided Asteroids
Asteroids is a classic arcade game in which a spaceship moves over a 2D area, rotating left and right (clockwise or anti-clockwise) and using its jets to go forwards (thrust). The ship must shoot at and destroy the asteroids moving through space around it (see Figure 1), and if the asteroids break into pieces, then they must be destroyed also. When an asteroid goes off the screen, it instantly "warps" to appear at the opposite edge. If any asteroid collides with the ship, the player loses a life and starts with a new ship in the centre of the screen. The aim is to destroy the asteroids in the shortest time.
AudioAsteroids is the version of the game developed for this study. It is set to be fairly easy to play, even for beginners, so that the effects of audio aids can be seen on a range of players. However, it does have two levels. Players must complete both levels in order to finish the game. In the first level (L1) the asteroids wait to be shot at; and in the second level (L2) they are moving across the screen, and so harder to hit.

2.1 Symbolic sounds for different objects
The main feature of AudioAsteroids is that each asteroid emits a characteristic sound. The sounds used were a cello string being plucked about twice per second, and the rattle of a small card box of paperclips. The manipulations of these sounds were intended to convey information about the location of the asteroids to the player.
These sounds are clearly not naturalistic, because real asteroids do not sound like cello strings. They are symbolic sounds ("earcons") whose meaning can only be understood once it has been explained. This is a necessary compromise, because objects in space in fact make no noise at all. Other space games and science fiction films make the same compromise when they have loud explosions and other sound effects in space. Sound in space is physically impossible, because there is no air for sound waves to travel through; but it is shown in films because, like background music, it gives an "emotional atmosphere."

Figure 1: The Asteroids game

To convey location information, however, AudioAsteroids does use more naturalistic properties of sound. As shown below, the game tries to solve the main problems that blind people have in playing normal games.

2.2 Speech for the menu system
The first problem that visually impaired players encounter is starting the game and setting options, because they cannot use the visual menu system.
The audio aid to solve this part of the problem simply uses a synthetic speech output tool that speaks the options available at each stage. This aid was tested more thoroughly in another game, which described scenes by reading out the screen text, and gave quite complicated series of changing options. It was an adventure game, and the menu audio aid did enable people to play the game without looking at the screen; but we focus attention here rather on the other audio aids, for location.
Because the ship in AudioAsteroids can rotate and move through space, players need to know the relative direction of asteroids, and how far away they are. The two audio aids for these are the directional aid and the distance aid.

2.3 The directional aid
Because we have two ears, we can estimate the horizontal angle of a sound source quite accurately, depending on circumstances. When the source is almost directly ahead, we can tell if it moves through an angle of as little as one degree of arc ([8] cited in [9]). This is mainly because of the difference in time of arrival of the sound at the two ears, and the consequent phase difference. For higher frequency sounds, there is also a difference in volume at the two ears, because one ear will be partly shielded from the sound source by the head.
The auditory cortex of the brain uses these clues to inform us of the location of the sound source. There are other clues available, also. For example, the relative height of frontal sources can be estimated if their sounds are high in pitch, because of the way that the shell of the ear reflects certain high frequencies back into the ear canal.
The shape of the ears also helps us to determine whether the sound source is in front or not, and sources from the rear will sound a little "muffled."

The way the ears register sounds, and the brain decodes the signal, is highly complex and not fully understood. However, using some of the main properties as outlined above, it is
possible to manipulate sounds in ways that make it appear to a listener with stereo headphones that the source is moving from left to right (or "panning"), and even going behind the head.

2.3.1 Sound from behind the head
Each sound that is to be used to represent an asteroid is processed to make a second version that sounds like it comes from the rear. This was done with a software tool called Maven3D [10], which includes algorithms to simulate the "muffling" referred to above, making sounds appear to source from behind the head.
There are therefore two sound files prepared for each asteroid: the original one for when the asteroid is towards the front of the ship (ahead of the ship's left-right axis), and the muffled version for when the asteroid is somewhere behind the ship.

difference could be allowed to grow. Clearly the proportion should be as large as feasible, to make the player's distance estimation more accurate. If there were no difference then all asteroids would sound the same distance away. But if the proportion is too large, then only the nearest objects are heard. The minimum volume is set at 60% of the maximum volume, which was found to be a good level to allow the asteroid to still be heard when it is far away, and not get drowned out by much nearer asteroids.
No tests were done to confirm that players could accurately estimate the distance of asteroids, because we did not expect that they would be able to. The purpose of the distance audio aid in the game-play is to tell the player which asteroid is closest, and when it is so close that it presents a danger to the spaceship.
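The two location cues described above, stereo direction and distance-scaled volume with a 60% floor, can be sketched as follows. This is a minimal sketch, not the paper's Maven3D processing; the azimuth convention, constant-power panning law and linear attenuation are illustrative assumptions:

```python
import math

# Sketch of the two AudioAsteroids location cues (not the study's code):
# constant-power stereo panning for direction, and a distance-to-volume
# mapping whose floor is 60% of maximum, so far asteroids stay audible.
def stereo_gains(azimuth_deg):
    """Left/right gains; -90 = hard left, 0 = straight ahead, +90 = hard right."""
    theta = math.radians((azimuth_deg + 90.0) / 2.0)  # map [-90, 90] to [0, 90] degrees
    return math.cos(theta), math.sin(theta)           # constant power: L^2 + R^2 = 1

def distance_volume(distance, max_distance):
    """Linear fade from 1.0 (touching the ship) down to a 0.6 floor (far away)."""
    frac = min(distance / max_distance, 1.0)
    return 1.0 - 0.4 * frac
```

With this mapping a source straight ahead plays equally in both ears, and an asteroid at or beyond maximum range still plays at 60% volume, so it is never silent, matching the design rationale given above.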
experience of the audio aids used in AudioAsteroids, having played a part in its development. The relative skill levels of the players are shown in Table 1, in which players are shown to have experience by a plus (+) sign.

P1  P2  P3  P4
+   +   +   -    Experience of games
+   +   -   -    Experience of Asteroids
+   -   -   -    Experience of Audio Aids

Table 1: The participants' levels of experience

3.2 Procedure
The game was played on a laptop computer, with headphones attached for the sound output. The spaceship is controlled with an Xbox 360 wireless joypad.
The game was started for the players, who first played it five times with only the visual output on screen. They then played it five times with audio on, but not looking at the screen (no visuals); and finally they played five times again, with both visuals and audio.
The difficulty of the game was set to be easy enough to allow most players to complete the game without losing a life. They would differ mainly in time taken. The hope was that the game would spread the players, not defeat them.
To play the game once, it was necessary to play through two levels, in which L1 starts with stationary asteroids, but in L2 the asteroids are moving targets. In either case, a minimum of 7 shots is needed to destroy all the asteroids and their fragments. Performance was noted by recording two figures for each play of the game: the number of bullets fired, and the time taken.

3.3 Results
All players completed all levels, destroying all asteroids, and any fragments resulting from asteroids the first time they are shot.
Every player found it difficult to reach 100% accuracy, and they nearly always wasted some shots, firing more than 7 times in a level. Only one player lost a life (this was P3 in the Audio-only condition, which is the hardest); so we ignore lives lost.
The players differed in their completion times, both because of skill level, and because of the difficulties of playing "blind." Those results are the aim of the experiment, but first we validate the experimental setup with the results of all players in the normal condition, with Visuals only, as in the traditional version of the game, which is normal Asteroids.

3.3.1 Confirmation of player skill levels (Visuals)
The game was fairly easy for all players to complete without losing lives; but there were significant differences in shots wasted and in time taken to complete.
We first verify that the players' reported skill levels are consistent with their results, and that individual players perform quite consistently within each condition.
Table 2 shows the results for all players and all games in the first condition, with only the visuals on. P1's third game is signified by the row "1.3", and so on. The mean scores for each player are shown, with their standard deviations. The scores are the number of shots (S) fired to complete both levels in the game, and the time (T) taken to complete them, in seconds.
By inspection it is clear that P1 and P2 are the best players of the game, followed by P3 and lastly P4. This is just what would be expected from the players' relative skills as suggested by their experiences shown in Table 1. Each player is also consistent in performance, as shown by the standard deviations. The least consistency is shown by the weakest player, again as would be expected.
To confirm these inspections with statistical analysis, t-tests were drawn between players to see if the scores were significantly different. They were. The (unpaired, two-tailed) t-tests between P1 and P2 yielded probabilities of 0.01 for shots, and 0.66 for time, indicating that their times were not significantly different, but that their shots surely were. It appears that P2 is the stronger player, wasting significantly fewer shots.

Visuals     S      T
1.1         26     64
1.2         24     54
1.3         25     49
1.4         26     63
1.5         24     54
P1 mean     25     56.8
   stdev    1      6.46
   t-test   0.01   0.66
2.1         23     61
2.2         19     58
2.3         14     67
2.4         22     64
2.5         18     45
P2 mean     19.2   59
   stdev    3.56   8.51
   t-test   0.00   0.01
3.1         30     80
3.2         25     61
3.3         32     78
3.4         32     91
3.5         30     96
P3 mean     29.8   81.2
   stdev    2.86   13.55
   t-test   0.08   0.03
4.1         47     121
4.2         59     133
4.3         27     65
4.4         37     133
4.5         27     104
P4 mean     39.4   111.2

Table 2: Confirmation of player skill levels

The other t-tests shown in Table 2 (comparing P2 against P3, and P3 against P4) were again unpaired, but they were one-tailed. This was because we expect that P2 should get higher scores than P3, and P3 higher than P4, due to the Ps' differences in experience.
The t-tests confirm that P2 (an Asteroids player) is more skilful than P3 (a general gamer, but new to Asteroids), based on shots and on time. P3 does not waste significantly fewer shots than P4 (the novice), but is significantly quicker.
This validation and consistency of the players' relative skill levels for the Visuals-only condition, which is comparable to classic Asteroids, means that the further results can be relied upon, and the significance of any variation in results can be assessed by comparison with the variation for a player within each condition.

3.3.2 Development of performance measure
To help in further performance comparisons between players, the S and T scores (for shots and time) are combined into a
summary performance measure, P, which measures the divergence from an optimal or extremely strong performance. The P measure is larger for weaker players, who waste more shots and waste time. P is smaller, close to zero, for stronger performances that waste few shots and complete in near record time. P is calculated by the following formula:

    P = ½ · 3 · (S − 14) + ½ · (T − 34)

The lowest value that S can possibly have is 14, over both levels of a game; so S − 14 represents the number of shots wasted in a game. The fastest times that any player achieved in any level of any game total 34 seconds for both levels, which is therefore a fair indication of the optimum time that can be reached by a player who completes both levels. By inspection, the average firing rate of all players was about 3 seconds per shot, meaning that the number of wasted shots should be tripled to compare their cost with wasted seconds. Finally these two "waste factors" are combined with equal weighting (of ½ each) to arrive at a performance score.
Using this notion of performance, players with different styles (who fire more accurately but take longer, for example) can be meaningfully compared.
The mean scores including performance (P) for each player in all conditions are shown in Table 3.

Visuals     S     T     P
P1          25    57    28
P2          19    59    20
P3          30    81    47
P4          39    111   77
mean        28    77

Sounds      S     T     P
P1          34    125   75
P2          36    168   101
P3          48    146   107
P4          67    384   255
mean        46    206

BOTH        S     T     P
P1          19    58    20
P2          20    50    17
P3          24    79    38
P4          38    140   89
mean        25    82

Table 3: Ps' mean scores in each condition

According to the summary P scores in the Visuals condition, it appears that P2 is indeed the strongest player of classic Asteroids. The unpaired, two-tailed t-test gives a barely significant probability of 0.08 for this, however, so there is not much in it. Unpaired one-tailed t-tests confirm that P2 is much stronger than P3 (with p = .0006); and P3 is stronger than P4 (p = .04).

3.3.3 Performance in the blind condition (Sounds)
Referring to Table 3, it is clear that all players perform much worse in the blind condition ("Sounds," with only the audio aids to guide them).
Player P1's performance falls to 75, which is comparable to the novice performance in the Visuals condition (P4 at the top of the table). Therefore, at least in this case, a skilful but blind player can match a normally sighted novice at a shooting game, when provided with helpful audio aids.
P2 does not do as well as P1 in the blind condition, despite being the better player, which can be put down to P1's prior experience of the audio aids. It is notable that P2 does not perform much better than P3 in the blind condition (one-tailed, unpaired t-test; p = .38), although the gap between them was the largest one in the Visuals condition. One hypothesis is that P2 can no longer use strategies learned in previous Asteroids experience, and is thus reduced to re-learning the game in blind mode. With that specialist knowledge of Asteroids neutralised, P2 is not much better than a general gamer like P3.
On the other hand, P4 now does much worse than P3 (one-tailed, unpaired t-test; p = .002), whereas she was closer before. It appears that there are two factors behind this result. The first is that P4 cannot cope well with the heavy task imposed on her. She is experiencing a cognitive overload, which the experienced gamers are not: she still has to think explicitly about the nature of the gaming context, including the joypad keys, while the others are familiar with the context, if not the game itself, and can therefore concentrate more on playing it.
The second factor is related: P4 is getting tired. This suggestion is consistent with her results in the final condition, as analysed below.
Despite the large drop in performance that all players suffer in the blind (Sounds only) condition, it is remarkable that all of them, including the possibly overloaded novice player P4, manage to complete the game. Only one life is lost in all these games, and that is by P3, not P4.
The neutralisation of the more skilful players' experience suggests that they are having to learn new strategies. The game in the blind condition is almost like a new game to them. It was probably too much to hope that blind players would be able to play against sighted ones on an equal footing. But even if the strategies or skills required for the blind version of the game are different, the fact remains that blind players can play with sighted ones, albeit at novice level.

3.3.4 The audio aids do not impair performance (Both)
In the final Visuals+AudioAids condition (BOTH in Table 3), performance is comparable to Visuals alone.
Three players even improve: P1 improves from 28 to 20 (with p = .06 for an unpaired, two-tailed t-test), by shooting more accurately; P2 and P3 appear to improve slightly, but not significantly.
The novice player P4 gets worse (no less accurate, but slower) in the final condition, compared with the Visuals-only one. As noted above, this is consistent with the hypothesis of mental tiredness due to the greater cognitive load the task has put her under.

3.3.5 Moving targets are much harder, when blind
The performance of players at each level can be calculated, and compared, to give a measure of how much performance suffers when the targets are moving.
Figure 1 shows the performance for each player and condition in L2 over performance in L1. When this measure is about 1, there is no difference. As the graph shows, however, having the asteroids move through space causes difficulties for the players. The most severe difficulties are caused for P1 and P2, who are the players most experienced in Asteroids. This suggests that they have learned strategies for shooting moving targets that are no longer applicable with only the audio aids for information. P3's performance cost for moving targets, on the other hand, does not suffer in the blind condition, suggesting that this experienced gamer, new to Asteroids, has no particular shooting strategies to lose.
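The summary statistics used above can be re-computed directly. The sketch below restates the paper's performance measure P from section 3.3.2, and a generic unpaired (pooled-variance) Student's t statistic; the pooled form is an assumption, since the paper says only "unpaired":

```python
import math
from statistics import mean

# Re-computation sketch of the paper's statistics: the waste-based
# performance measure P (section 3.3.2) and an unpaired two-sample t
# statistic. The pooled-variance form is assumed, not stated, in the paper.
def performance(shots, seconds):
    """P = 1/2 * 3 * (S - 14) + 1/2 * (T - 34); zero is optimal, larger is weaker."""
    return 0.5 * 3.0 * (shots - 14) + 0.5 * (seconds - 34)

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    sp2 = (ss_a + ss_b) / (na + nb - 2)            # pooled variance
    se = math.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    return (mean(a) - mean(b)) / se, na + nb - 2

# Shots data for P1 and P2, taken from Table 2
p1_shots = [26, 24, 25, 26, 24]
p2_shots = [23, 19, 14, 22, 18]
```

With these data, pooled_t gives t ≈ 3.5 on 8 degrees of freedom, consistent with the reported p = 0.01 for shots; and performance(25, 56.8) ≈ 27.9, matching P1's Visuals score of 28 in Table 3.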
[Figure: performance in L2 relative to L1 for each player, in the Visuals, Sounds and BOTH conditions]

because they have experienced the game in earlier blocks. If the plays had been interleaved, in order to control for any learning, then some of the conclusions suggested earlier could have been stated more strongly.

the study could not see the screen, yet were not actually blind. It is a further question whether really blind people, such as those blind from birth, could also play the game. It might be that they would lack the gaming history, and not well understand the context of shooting asteroids, for example.
seemed to play different strategies. Analysis of collected video
clips would be one way to pursue this question.
Acknowledgements
References
[1] Bierre, K., Chetwynd, J., Ellis, B., Hinn, D. M., Ludi, S. &
Westin, T. "Game Not Over: Accessibility Issues in Video
Games." In Proc. of the 3rd International Conference on
Universal Access in Human-Computer Interaction,
Lawrence Erlbaum (2005).
[2] GMA Games. http://www.gmagames.com/
Accessed on 4th June, 2008.
[3] Peterson, B. Guide to Shades of Doom by GMA Games.
http://www.audiogames.net/pics/upload/shadesofdoom.doc
Accessed on 4th June, 2008.
[4] The AGRIP Project. http://www.agrip.org.uk/about/
Accessed on 4th June, 2008.
[5] id Software. Quake.
http://www.idsoftware.com/games/quake/quake/
Accessed on 4th June, 2008.
[6] Playing in the Dark. Top Speed 2. A free computer game at
http://www.playinginthedark.net/
Accessed on 4th June, 2008.
[7] Grammenos, D., Savidis, A., Georgalis, Y., & Stephanidis,
C. (2006). "Access Invaders: Developing a Universally
Accessible Action Game." In K. Miesenberger, J. Klaus, W.
Zagler, & A. Karshmer (Eds.), Computers Helping People
with Special Needs, Proceedings of the 10th International
Conference, ICCHP 2006, Linz, Austria, 12 – 14 July (pp.
388–395). Berlin Heidelberg, Germany: Springer.
[8] Middlebrooks, J.C. & Green, D. M., "Sound localization by
human listeners," Annual Review of Psychology 42, pp.
135-159 (1991).
[9] Lu, Y.-C., Cooke, M. & Christensen, H. "Active binaural
distance estimation for dynamic sources." Proceedings of
InterSpeech-2007, Antwerp, Belgium (2007).
[10] Maven3D. Software package for 3D audio editing.
http://www.venturaes.com/emersys/index.html
Accessed on 4th June, 2008.
[11] DarkGDK. Software game development kit.
http://gdk.thegamecreators.com/ and
http://www.microsoft.com/express/samples/GameCreators/
Accessed on 4th June, 2008.
Beowulf field test paper
Abstract. This paper describes a practical field test covering some of the parameters governing audio-based games designed for mobile applications, utilizing new techniques intended to allow greater interpretive freedom for the player. The tests are realised through a simple audio-based game application: 'Beowulf'.
• Does the level of pre-game information colour the player's experience?
• Can an audio-based game give a satisfactory game experience?
• Does a non-visual game world present the player with navigational problems or other hindrances to a working game experience?
• Can a game emulate visual combat situations using only sound?
• Can a player correctly interpret visual sounds without the aid of visual reference?
• Does a sound-based environment encourage a player to contribute more to the game experience?

3. Game controls and navigation – six statements.
4. General computer game habits – two statements.

3.2.2. Questionnaire – qualitative part
For the second part of the questionnaire the subjects returned to the game application and were asked to describe six distinct places in the game environment. They were asked, for each place, to describe the physical environment plus the emotions it evoked in as great detail as possible. All subjects described the same six places and jumped from one place to the next by clicking buttons in a short-cut window presented by the test leader.
Each subject was given ca. ten minutes of playing time and was then asked to fill out the first part of the questionnaire. They were then instructed to return to the game environment and, by using the six short-cut buttons in the application, visit six specific locations. These they then described in the second part of the questionnaire.

Figure 3: response to statement: 'I often play computer games.'
4. Results
The subjects filled out the questionnaires in a consistent manner, which makes us confident that the test has worked and that the results are reliable. A number of check-points support this:
• There is a strong negative correlation (-0.658) between the two opposite-pole statements 'I do not like to play computer games' and 'I often play computer games'.
• There is a strong correlation (0.784) between the statements 'This is a game I want to improve on playing' and 'This is a game I want to play more times'.
• There are strong correlations between the statement 'It was fun to play' and all of the following statements: 'The game idea (game play) is good', 'This is a game I want to improve on playing' and 'This is a game I want to play more times'. At the same time there is a relatively strong negative correlation to the statement 'I was bored early on'. Generally there are strong correlations between the statements in the category 'General appreciation of the game concept'.
• The correlations between the statements in the category 'Game controls and navigation' indicate that the questionnaire has worked and contains valid data.
The only category that does not show strong correlations between statements is 'Experience of presence and immersion'. This category contains the most subjective and elusive of the 24 statements, and the subjects may have had problems relating to the statements given. On the other hand, the aspect of immersion and presence is covered in the qualitative second part of the questionnaire, which makes this a smaller problem.

4.1.2. General appreciation of the game concept
Despite this lack of gaming experience the general reaction to the application was favourable. 27 of 48 subjects responded 5 or more to the statement 'It was fun to play':

Figure 4: response to statement: 'It was fun to play.'

A high number unexpectedly thought the application would make an excellent game for home computers despite its having next to no graphic content.
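The consistency checks above are plain Pearson correlations between statement ratings; a minimal sketch of the computation (the ratings below are invented for illustration and are not the study's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 7-point ratings for the two opposite-pole statements:
dislike = [1, 2, 6, 7, 3, 5]   # 'I do not like to play computer games'
often   = [7, 6, 2, 1, 5, 2]   # 'I often play computer games'
```

For opposite-pole statements a coefficient near -1 indicates consistent answering, which is exactly the kind of check-point reported above.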
This is especially interesting when compared to the same question, but now focused on mobile phones (the intended platform for the application).

5. Descriptive (foggy, sunny, dark etc.)

There was no correlation between the quality and complexity of the descriptions and the information the groups had received at the start. Both those subjects who had an emotionally descriptive introduction and those that had a factual introduction showed mixed levels of creativity in their interpretation of the environments, dependent on the individual. Since the amount and nature of information prior to the game experience do not seem to have any greater impact on that experience, it can be interpreted that the answers and responses to the questionnaire emanate from the gaming experience itself and that very little coloration, if any, resulted from the pre-information.
A: The test results show no evidence that the information given prior to playing the game colours the game experience. One interpretation of this is that the game experience itself was strong enough to overshadow the experience from the, relatively short, pre-information phase of the test.

Q: Can an audio-based game give a satisfactory game experience?
A: Although not to everyone's taste, a significant percentage of the test group were sufficiently positive towards the application to confirm that it can.

Q: Does a non-visual game world present the player with navigational problems or other hindrances to a working game experience?
A: The test did show that there are issues to be concerned about regarding navigation; however, these presented no major problems for the test group and are not sufficient to sabotage an application.

Q: Can a game emulate visual combat situations using only sound?
A: The test showed that this was entirely feasible, and only a few of the test group expressed any problem in understanding and successfully utilizing the combat gameplay.

Q: Can a player correctly interpret visual sounds without the aid of visual reference?
A: The test showed that this was not an issue. The players might vary in their personal interpretation, but at no time did these variations interfere with the gameplay.

Q: Does a sound-based environment encourage a player to contribute more to the game experience?
A: The high gameplay marks suggest that the subjects were prepared to accept a much lower level of gameplay content than in a visually based environment. It should be noted that the gameplay in Beowulf is neither particularly sophisticated nor complex. Whether this impression was caused by the novelty of the game environment or through the addition of self-generated game content by the user is a discussion for another paper and needs further user testing.

5.2. Feasibility of audio based games
We can observe that there are only small technical limitations to creating audio-based games on personal computers. In this project we have used a large number of existing and off-the-shelf tools and technologies. The narrowest bottleneck for the area is not technical but more related to a lack of knowledge about sound and sound design.
There are no limitations when designing audio-based games, though there are added demands when solving problems where one would normally rely on a graphic pointer.
As these games are highly interpretive, great care must be shown in the design process, as it is only too easy to give unfocused and misleading signals to the player through audio content.
There are no real limitations to the (sighted) player's potential and ability to interpret and accurately play audio-based games, as long as the design process has been intelligently and correctly executed. The test subjects had no preconceived negativity to the idea of a non-visual game, and if there was any scepticism prior to the test, this was soon dispelled when the game was played.

5.3. Ability of subject to navigate play
Despite the game having minimal instructional information and a very limited graphic interface and feedback, the game was played successfully by all subjects. Although there were different levels of entry, with some subjects grasping the concept immediately and others needing some minutes to grasp the principles of navigation and concept, all subjects were able to control the character, use the sword and move through the environment. Most of the test subjects were even able to complete the set task and 'kill' the monster.

5.4. Problems illuminated by this test
Paradoxically, the principal problems with audio-exclusive games lie in their principal strengths. The ability of audio applications to trigger self-generated, complementary visual content in the user also means that the game designers have little control over what the user is 'seeing'. This places great demands on the game play and sound design. This, in turn, highlights the need for a sound design methodology in this area. An absolute majority of the work carried out in this field is still experimental. With an informed set of "do's and don'ts" the work could be lifted to higher levels.

6. Acknowledgments
Stefan Lindberg, Interactive Institute Sonic, for all the sounds.
Martin Nordlinder for dedicated work with software development.
Pupils and staff at Central Skolan, Arvika.
Students at Ingesund University College of Music.
Pupils and staff at Carlshöjd School, Umeå.
Students at Umeå School of Art.
Stuart Cunningham for advice, help and feedback.

References
[1] Gaver, W. W., Beaver, J., and Benford, S. 2003. Ambiguity as a resource for design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Ft. Lauderdale, Florida, USA, April 05-10, 2003). CHI '03. ACM, New York, NY, 233-240. DOI= http://doi.acm.org/10.1145/642611.642653
[2] Sengers, P. and Gaver, B. 2006. Staying open to interpretation: engaging multiple meanings in design and evaluation. In Proceedings of the 6th Conference on Designing Interactive Systems (University Park, PA, USA, June 26-28, 2006). DIS '06. ACM, New York, NY, 99-108. DOI= http://doi.acm.org/10.1145/1142405.1142422
[3] Liljedahl, M., Papworth, N., and Lindberg, S. 2007. Beowulf: an audio mostly game. In Proceedings of the International Conference on Advances in Computer Entertainment Technology (Salzburg, Austria, June 13-15, 2007). ACE '07, vol. 203. ACM, New York, NY, 200-203. DOI= http://doi.acm.org/10.1145/1255047.1255088
[4] Lumbreras, M. and Sánchez, J. 1999. Interactive 3D sound hyperstories for blind children. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: the CHI Is the Limit (Pittsburgh, Pennsylvania, United States, May 15-20, 1999). CHI '99. ACM, New York, NY, 318-325. DOI= http://doi.acm.org/10.1145/302979.303101
[5] Röber, N., Masuch, M. 2004. Auditory game authoring. DOI= http://games.cs.uni-magdeburg.de/audio/data/Roeber_2004_AGA.pdf
Control of Sound Environment using Genetic Algorithms
Scott Beveridge, Don Knox,
Glasgow Caledonian University, Glasgow, Scotland
scott.beveridge@gcal.ac.uk, d.knox@gcal.ac.uk.
Abstract. Sonification, the production of sound to represent some form of data or information, has been applied in various fields including analysis of financial, meteorological and physiological data. A system overview is presented which is based on analysis of sociospatial behaviour via video capture of a given environment. The example given is a busy commuter environment such as a train station. Activity in the environment is mapped to a set of musical performance parameters, and also forms the basis for controlling an optimisation function based on genetic algorithms (GA). The aim is to develop a socially reflexive audio environment, where those present unconsciously interact with the input, and output, of the sonification process. The output of the system is a series of musical chords, optimised as regards their musical fitness as defined by three consonance criteria.
Figure 1: Plan view from webcam

Video frames are captured using an inexpensive webcam, and manipulated using Matlab image processing tools to remove unnecessary information (floor detail and static objects). This type of approach has previously been used only in small-scale performance contexts, requiring specialised motion capture systems [9, 5]. Subsequent image processing (see Figure 2) and statistical analysis allow extraction of various parameters which describe the nature of movement in the environment (see section 2.2).

Figure 2: Figures identified by vision capture system

2.2 Mapping
In order to achieve sound output that is in some way indicative of events occurring within the environment be-

Table 1: Map of environment activity to sonification parameter
Activity/Trigger        Sonification
Number of people        Dynamic Intensity
Speed of movement       Tempo
Direction of movement   Timbre

This framework provides a comprehensive but potentially complex set of acoustical parameters which represent the activity within the environment. To simplify these variables and so make the data set more useful, a higher-level representation is required. The objective is to obtain a qualitative measure of these features in order to simplify audio generation. A recent study by Lu et al [11] has been adopted to provide this representation. The research involves a framework for the emotion classification of music which examines audio on the basis of the features extracted in Table 1. The system uses signal analysis techniques to extract intensity, timbre and rhythm (ITR) features which are used in a classification process based on Gaussian mixture models (GMM). The result is music which is labelled in terms of the two-dimensional stress-energy model proposed by Thayer [12] (see Figure 3). The model, which was adapted from the original proposed by Russell [13], forms quadrants equating to contentment, depression, exuberance and anxious/frantic. Classifying features in this manner provides a context for further decisions regarding music generated by the system, as the emotion labels are directly analogous to the activity within the environment. This representation will not be sonified directly; instead the system generates audio with opposing features to counteract the activity within the environment.
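The Table 1 mapping and the quadrant classification it feeds can be sketched as follows. This is a hypothetical illustration: the normalisation constants, the thresholds, and the use of direction spread as a stress proxy are all assumptions, and only the exuberance-to-contentment counteracting pair is stated in the text; the other pairings are assumed for symmetry.

```python
def activity_to_parameters(num_people, speed, direction_spread):
    """Map environment activity to musical parameters per Table 1.
    Inputs and outputs are illustrative quantities normalised to [0, 1]."""
    return {
        "intensity": min(1.0, num_people / 50.0),  # number of people -> dynamic intensity
        "tempo": min(1.0, speed),                  # speed of movement -> tempo
        "timbre": min(1.0, direction_spread),      # direction of movement -> timbre
    }

def classify_quadrant(params):
    """Place the parameters in Thayer's stress/energy model (Figure 3)."""
    energy = (params["intensity"] + params["tempo"]) / 2.0
    stress = params["timbre"]
    if energy >= 0.5:
        return "anxious/frantic" if stress >= 0.5 else "exuberance"
    return "depression" if stress >= 0.5 else "contentment"

# Counteracting quadrant targeted by the generated audio; only the first
# pairing is given in the text, the rest are guesses.
OPPOSING = {
    "exuberance": "contentment",
    "contentment": "exuberance",
    "anxious/frantic": "depression",
    "depression": "anxious/frantic",
}
```

Under this sketch a crowd of fast, uniformly moving commuters classifies as exuberance, and the system would then generate audio with contentment-quadrant features.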
Figure 3: Thayer's stress/energy model, showing the four mood clusters: contentment, depression, exuberance and anxious/frantic (axes: energy/arousal vs. stress/valence)

In our busy commuter environment we may observe a large number of individuals moving quickly in a uniform direction. This behaviour indicates high-intensity, fast-tempo parameters with distinct timbre characteristics dependent on direction. The classification framework proposed by Lu et al (figure 4) places audio with these features within the exuberance category. The system will use this classification outcome to construct audio with opposite characteristics, so generating audio with features which would place it in the contentment category and produce an affective balance to the original observed behaviour.

Considering musical features in terms of emotion is useful for a number of reasons. Studies conducted on group psychology and consumer behaviour suggest that individuals are very sensitive to changes in environment [14, 15] and that music is a critical factor in their construction. Spatial aesthetics, or 'atmospherics', is the term used to describe the conscious designing of space to create certain effects in buyers [16]. Based on atmospherics, individuals are likely to display one of two behaviours. Approach behaviour involves responses such as physically moving toward something, affiliating with others in the environment through verbal communication and eye contact, and performing a large number of tasks within the environment [17]. Avoidance or negative reinforcement behaviour [18] includes trying to get out of the environment, a tendency to remain inanimate in the environment, and a tendency to ignore communication attempts from others [19]. A study by Milliman [15] shows how manipulation of these responses can be achieved by altering musical parameters. In an experiment where music with varying tempo and loudness was played as background music within a supermarket environment, it was found that in-store traffic flow could be modified, with fast-tempo, high-intensity music proving the most effective in increasing the passage of consumers through the sales space. The system proposed in this paper can not only dynamically alter these musical variables in real time (and in response to user input) but also incorporate structural musical features, which results in novel audio with the potential of controlling group behaviour.

2.2.2 Structural Features
The mood classification scheme proposed by Lu et al is based solely on acoustical parameters and does not take into account more complex structural features which may contribute to emotional expression. Musical aspects such as mode, interval and melodic contour are necessary in expressing emotion and provide a basis for higher cognitive constructs associated with music listening.
It has long been established that expectation plays a pivotal role in a listener's experience of music [20]. This theory is based on the assumption that while listening to music an individual will form expectancies about its continuation. If these presuppositions are violated, it evokes a corresponding emotional reaction [21]. Narmour proposed a refinement of this concept in the implication-realization (I-R) model [22], which judges melodic expectancy based on the Gestalt-based principles of proximity, similarity and good continuation. The I-R model is based on a three-note archetype which focusses on the distance and direction relationships between the intervals. By measuring melodic contour on the basis of these principles the model has been found to predict melodic expectancy with reasonable accuracy.
This system copes with these structural aspects by implementing an algorithmic approach. Parameters within the genetic algorithm are modified in real time in a mechanism which seeks to optimize structure within specific parameters. The emotion classification process plays an essential part in this process by ensuring a context with which to violate expectancies and hence generate emotion. The initial implementation of this system uses this framework to optimize chord triads which form the basis for generated audio.

3 Sound Generation
The system is built upon an evolutionary design model inspired by the Survival of the Fittest concept proposed by Charles Darwin in 1859. In biological terms, this relates to the competition for predominance amongst
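The evolutionary chord-triad optimisation described above can be sketched as a simple genetic algorithm. This is a hypothetical illustration: the interval-consonance table below stands in for the paper's three consonance criteria (which are not reproduced in this excerpt), and the population size, selection and mutation settings are assumptions.

```python
import random

# Rough consonance score for each interval class (0-11 semitones);
# repeated pitches (interval 0) score 0.0 so the search prefers real triads.
CONSONANCE = {0: 0.0, 1: 0.1, 2: 0.3, 3: 0.8, 4: 0.8, 5: 0.7,
              6: 0.2, 7: 0.9, 8: 0.6, 9: 0.6, 10: 0.3, 11: 0.1}

def fitness(triad):
    """Average consonance of the three pairwise intervals of a triad
    (MIDI pitches), reduced to within an octave."""
    a, b, c = triad
    pairs = [(a, b), (a, c), (b, c)]
    return sum(CONSONANCE[abs(x - y) % 12] for x, y in pairs) / 3.0

def evolve(pop_size=30, generations=60, seed=1):
    """Truncation-selection GA: keep the fitter half, mutate each survivor
    by a few semitones, repeat, and return the fittest triad found."""
    rng = random.Random(seed)
    pop = [[rng.randint(48, 72) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            child[rng.randrange(3)] += rng.choice([-2, -1, 1, 2])
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()  # a triad whose pairwise intervals score as consonant
```

Because the fittest individuals always survive into the next generation, the best fitness is monotone non-decreasing across generations, mirroring the survival-of-the-fittest framing above.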
[9] M. Droumeva and R. Wakkary, "Sound intensity gradients in an ambient intelligence audio display," Conference on Human Factors in Computing Systems, pp. 724–729, 2006.
[24] H. Helmholtz, On the Sensations of Tone. Thoemmes Continuum, English ed., 1875.
[10] A. de Campo, C. Frauenberger, and R. Holdrich,
“Designing a generalized sonification environ-
ment,” Proceedings of the ICAD, 2004.
[11] L. Lu, D. Liu, and Z. Hong-Jiang, “Automatic
mood detection and tracking of music audio sig-
nals,” Audio, Speech and Language Processing,
IEEE Transactions on, vol. 14, no. 1, pp. 5–18,
2006.
[12] R. E. Thayer, The Biopsychology of Mood and
Arousal. Oxford University Press, 1989.
[13] J. A. Russell, “A circumplex model of emotions,”
The Journal of Personality and Social Psychology,
vol. 39, no. 6, pp. 1161 – 1178, 1980.
[14] R. E. Milliman, “Using background music to affect
the behavior of supermarket shoppers,” Journal of
Marketing, vol. 46, no. 3, pp. 86–91, 1982.
[15] R. E. Milliman, “The influence of background mu-
sic on the behavior of restaurant patrons,” Jour-
nal of Consumer Research, vol. 13, no. 2, p. 286,
1986.
[16] P. Kotler, “Atmospherics as a marketing tool,”
Journal of Retailing, vol. 49, no. 4, pp. 48–64,
1973.
[17] B. H. Booms and M. J. Bitner, “Marketing ser-
vices by managing the environment,” Cornell
Hotel and Restaurant Administration Quarterly,
vol. 23, no. 1, p. 35, 1982.
[18] M. M. Bradley, “Emotion and motivation,” Hand-
book of psychophysiology, vol. 2, pp. 602–642,
2000.
[19] R. J. Donovan and J. R. Rossiter, “Store atmo-
sphere: an environmental psychology approach,”
Retailing: Critical Concepts, 2002.
[20] L. B. Meyer, Emotion and
meaning in music. The University of Chicago
Press, 1956.
[21] D. Huron, Sweet anticipation: music and the psy-
chology of expectation. MIT Press, Cambridge,
Mass., 2006.
[22] E. Narmour, The Analysis and Cognition of Basic
Melodic Structures: The Implication-Realization
Model. University of Chicago Press, 1990.
[23] J. Manzolli, A. Moroni, F. Von Zuben, and
R. Gudwin, “An evolutionary approach ap-
plied to algorithmic composition,” Proceedings
of VI Brazilian Symposium on Computer Music,
pp. 201–210, 1999.
Genie in a Bottle: Object-Sound Reconfigurations for Interactive
Commodities
Daniel Hug
Interaction Design Department
Zurich University of the Arts
daniel.hug@zhdk.ch
Abstract. Everyday commodities are increasingly enhanced with information or communication technologies and become more
complex and interactive. Electroacoustic sound can be a powerful way of shaping the aesthetic and functional aspects of such artifacts
beyond the modification of the physical aspects. This is a challenge for sound design, which traditionally has prospered mainly in
linear audiovisual media. The nature of objects requires new approaches to sound design, both because physical objects and sound
have a close relationship and also because artifacts are embedded in complex sociocultural contexts. Thus sound design for interactive
commodities is one element that contributes to the hermeneutic affordances of complex, interactive commodities. This paper outlines
the sociocultural significance of objects, their relationship with sounds, and how sound design can re-configure this relationship.
1 See http://www.cost-sid.org
2 Closing the Loop on Sound Evaluation and Design, see http://closed.ircam.fr and [28].
3 Schizophonia is the term coined by R. M. Schafer to denote the separation of sounds from their - natural - sources by means of electroacoustics. For him this concept carries only negative connotations [31].
connotative, meanings [27]. In his semiotic writings, Barthes offers two broad connotational categories for objects. The first is the "existential" connotation, which understands objects as obstinate, inhuman or even antihuman. The other is the "technological" connotation, which is concerned with objects as consumed, reproduced and functional products [4].

A special aspect in the creation of meaning in objects is material. In his dialogue "Hippias Maior" Plato describes a discussion between Socrates and Hippias about what makes the beautiful beautiful. One proposition is that certain materials like gold make ordinary things beautiful; another proposition is that some materials are more functional and thus make an object appropriate (cited in [9]). Materials like marble and (fake) gold suggest a royal or noble atmosphere in shopping malls, wood can signify closeness to nature, and so forth ([9], [6]).

2.2 Sociocultural Objects

Artifacts play an important role on a sociocultural level. Based on interviews with over 300 people, Csikszentmihalyi and Rochberg-Halton investigated the meaning of everyday artifacts beyond mere functionality. They describe how objects become part of, and are the result of, the process of cultivation, that is "the process of investing psychic energy so that one becomes conscious of the goals operating within oneself, among and between other persons, and in the environment" ([14], p. 13). According to Csikszentmihalyi and Rochberg-Halton, things embody goals, make skills manifest and shape the identities of their users. Objects thus are embodiments of intentionality. Hence they alter patterns of life; they reflect and define the personality, status, social integration etc. of both producer and owner. In addition, things evoke emotions through interpretations in the context of past experiences, thus becoming signs or symbols of one's attitude. Moreover they can mediate conflicts within the self.

Through the "objectivity" and permanence of objects such identities can be shared. This is the precondition for the socializing effect of things and of their ability to provide role models. Csikszentmihalyi and Rochberg-Halton also shed light on the role of the functional aspect of objects for culture, describing how even the use of things for utilitarian purposes is inseparable from the symbolic context of culture. Artifacts socialize people to certain habits or ways of life and represent these as signs [14].

Socio-semiotic studies describe the practice of collecting, the accumulation of objects as dowry, the expression of self in relation to society in bricolage, the rhetoric of the displayed artifact, shopping centers as super-objects and stages for objects and social actions, the mystification and commodification of culture through souvenirs, and the complex culture around jewels. An excellent collection of essays related to these questions can be found in [29]. Baudrillard finally points out the relationship between objects, production systems and society, describing the development of artifacts from a static role in a traditional pre-modern society to their emancipation as flexible, functional entities, and how systems of objects and their consumption reflect socio-ideological circumstances [6]. All these aspects contribute to the meaning making related to artifacts.

2.3 Beyond Semiotics

It shines through some of these points that semiotics in a structuralistic sense is limited as an analytical tool when dealing with the world of concrete physical objects. Barthes states that the semiotic study of objects is still at an early stage, one reason being that no pure significant system of objects exists without the intervention of language. Thus, he resorts to language as a paradigm for understanding objects [4]. Objects challenge the semiotic idea of communication by signs and code through their immediate, complex, archetypical multi-sensorial presence. They are not delimited to relatively restricted semiotic frameworks like media such as film, photography or even architecture. It is not surprising that many accounts of semiotics of objects actually refer to the presentation of objects in media such as advertisements (see e.g. [4]).

Joan M. Vastokas calls for a "semiotics of visual phenomena" (and we shall add auditory phenomena), taking into account the full spatial, temporal, and gestural dimensionality of the artifact. She proposes a narrative concept of the artifact, understanding the procedural component of an artifact as being born through the intentionality of its creator(s), going through a life of use and abuse as "meaningful and expressive object in itself, and as a ritual performer in social and cultural life", finally "dying" and being disposed of, exhibited in a museum or recycled ([37], p. 341). She describes several essential points to consider in the study of artifacts from a sociocultural perspective:

"(1) The meaning of artifacts, including works of visual 'art', is constituted in the life of the objects themselves, not in words or texts about them; (2) the artifact is not an inert, passive object, but an interactive agent in sociocultural life and cognition; (3) the signification of the artifact resides in both the object as a self-enclosed material fact and in its performative, 'gestural' patterns of behavior in relation to space, time, and society; (4) the processes, materials, and products of technology, especially those of a society's dominant technology, function as cultural metaphors at many levels and in many sociocultural domains; and (5) theoretical insights derive, not from theorizing in the abstract, but from direct observation and experience of the phenomenal world of nature and culture." ([37], p. 337)

2.4 Active, Animated and Magic Objects

These considerations lead to an aspect of the study of artifacts which seems to be counterintuitive at first glance: the notion of the object as actor. Latour proposes in his actor-network theory (ANT) that human beings and non-human or even inanimate objects are interlinked in mutual interactions, both being actors in the process. In this view, action is not limited to what intentional, meaningful humans do. According to Latour, any thing that does modify a state of affairs by making a difference is an actor [26].

A theoretical approach which is somewhat related to ANT, but more focused on human agency, is activity theory, introduced by Leontiev in the late 1970s. Instead of departing from a relational approach (between symmetric nodes in networks, "actors", that can be people, machines, or other things), activity theory proposes a primacy of activity over both object and subject, originating in purpose, need and intentionality. As for the role of artifacts, they are described as the product of cultural needs, embodying our intentions and desires. Thus they mediate between people and the world, and in this sense things have agency [22].

What has been outlined in these reflections seems to come to a new level in interactive commodities. In her book "The Second Self" Sherry Turkle described the computer as an "evocative object for thinking about human identity". It serves as a projection surface for our desires and fears and is often considered animate in some way - not only by children - exhibiting behaviors indicating some kind of reasoning and agency. The main criterion for aliveness, autonomous motion, is being replaced or extended by the notion of psychological autonomy. Looking at the more recent developments in the 20th Anniversary Edition of "The Second Self", Turkle states that these characteristics have become commodified and are part of our everyday experiences. But although computational technology has lost much of its uncanniness, we still tend to personify it and project our self onto it. In the case of the increasing autonomy and complexity of computational artifacts this tendency becomes even stronger. And in some cases, like wearable computers or computational implants, the border between computer and human is increasingly blurring on a very concrete level [35].

According to Daniel Chandler the quality of purposiveness and autonomy in artifacts arises from the whole being more than the sum of the parts when technology becomes too complex to control. Technological artifacts seem to have a will of their own and we tend to anthropomorphize them. The resulting technological animism credits an inanimate entity with consciousness and will4 [11].

From here the step to a notion of a magical quality in complex computerized artifacts can be easily made. According to Arthur C. Clarke's third law, "any sufficiently advanced technology is indistinguishable from magic."5

3 The Relationship Between Sound and Objects

We have stated above that the study of objects in all their complexity matters for the sound design of interactive commodities. This is the case both because sounds and objects are often closely

listening we hear an approaching car rather than four wheels, an engine and the various vibrations of the car's body [17].

In terms of perceiving physical qualities, the intersensorial link between audio and haptics is quite strong. For example, Kayser et al. demonstrated somatosensory and auditory interaction and the conditions for its effectiveness, namely temporal coincidence and inverse effectiveness [23]. And Avanzini and Crosato have successfully demonstrated how sound can modulate the haptic perception of stiffness [1]. Some studies deal with the audio-haptic relationship in food consumption, for example the factors associated with judging apples to be mealy [3]. In the less scientific but nevertheless relevant domain of film and game design, sound is often used to substitute or denote haptic sensations of protagonists, and is essential for suggesting authenticity of objects on screen and suspending disbelief [15].

3.2 Critique of Sonic Causalism and Naturalism

But the relation between sounds and physical objects and processes is more complex. On closer investigation we can discover a dialectical relationship between objects and sounds: while objects are permanent and concretely graspable, sounds are temporary and evasive, yet still they often have an almost intimate relationship with the physical world, which is the basis for the naturalizing effect of sound in media. But the natural link between sound and causing physical event has to be questioned. Sounds also have an existence which is detached from their original source, ambivalent, sometimes carrying a rather vague notion of material in them, sometimes being totally abstract. Chion criticizes the sim-
related and because of the power of sound to modulate an objects plistic assumption of the (scientific) discourse about sound ”nat-
identity. Moreover sounds can evoke a certain object and thus its urally” representing a certain cause. This myth leads to a general
sociocultural significance described above. research focus on sounds that are empirically verified as ”well
In the following a closer look at this relationship is provided. identified” and thus supposedly meaningful and useful for design.
However, the list of ”well identified sounds” is relatively short,
often restricted to archetypes or clichés, and very case dependent.
3.1 Sound and Physical Properties
Their successful identification depends on a specific recording of
Sounds are directly connected to an artifact’s physical properties. a specific cause (e.g. a slamming door). And not a small number
In our everyday experience the acoustic properties of materials of the sounds in such a list are well identified because we have
and objects provide us with information on their quality. We learned a specific connotation through the consumption of me-
might see a transparent object, but we will only be able to tell dia like film. It seams evident that resorting to some empirically
whether it is glass or acrylic after tapping it with our finger. We ”well identified sounds” would mean to abandon the richness and
can distinguish metals in the same way or determine whether a diversity of the sonic world in favor of a statistical ”average”.
piece of wooden furniture is made of hardwood or plywood or Chion also points out the linguistic ambiguity when speaking
whether it is in good or bad condition. Many studies have been about sounds and their cause: A sound of a piano can be the tone
concerned with the ability of detecting material properties, shape emitted when pressing a key on a piano or it can be the sound
and size through sound, as well as processes of interacting mate- resulting from hitting the piano with a hammer. Or the sound
rials (see e.g. [18] for a comprehensive overview). Nonetheless, of a piano can actually result from a synthesizer. And although
many aspects of recognition are still not completely understood, we might identify a certain sound as coming from a wooden box
as most of these studies deal with simplified sound events, such we can not say that one particular sound of a wooden box exists.
as mallets hitting metal plates [38]. Depending on where and how we exert physical force on the box
Opposed to typical laboratory setups, the sounds we hear in ev- it will sound differently. In summary: it is usually not possible
eryday life are often composites of several sound sources. Each to claim that a sound is the sound of something specific, or, vice
single vibration in the human hearing range merges into one versa, that every thing has its one sound [12].
sound gestalt and is perceived as a complex entity. With increas- Steven Connor describes the dialectic relationship between
ing complexity, the actual elements that cause a sound are not sounds and objects as being an ”immaterial corporeality”: ”One
discerned at all anymore. In what is called everyday or ecological apparent paradox of hearing is that it strikes us as at once intensely
corporeal - sound literally moves, shakes, and touches us - and
4 In our everyday myths and narratives this topic reappears often under
mysteriously immaterial. (...) Perhaps the tactility of sound de-
a somewhat humorist veil, also referred to as ”resistentialism”. This term, pends in part on this immaterial corporeality, because of the fact
coined by Paul Jennings, stands for a humorous theory in which inanimate that all sound is disembodied, a residue of production rather than
objects display hostile desires towards humans, a ”fact” apparent in expe-
riences such as cars not starting when one’s in a hurry or the bread always
a property of objects.” ([13] p. 157)
falling on the side with butter on it. Last but not least, the relationship between objects and sounds
5 To be found in a 1973 revision of his compendium of essays, ”Profiles can be even viewed from a totally different angle. Chion states,
of the Future”. that we can find certain irregularities, frictions, traces of impacts
in abstract sounds that can give it a material, bodily quality. He calls these notions of materiality "indices sonores matérialisants" ([12], p. 102).

3.3 The Sonic Metaphysics of Objects

In the sound-related discourse the notion exists that sound can be the "voice" of objects in an actual, immediate manner and not merely metaphorically. This notion again is dialectic. Connor states: "When we hear something we do not have the (...) sensation of hearing the thing itself. This is because objects do not have a single, invariant sound, or voice." ([13], p. 157) At the same time, however, sound rarely comes completely apart from its source. Connor continues that "to think of a sound as the 'voice' of what sounds (...) is also to think of the sound as owned by and emanating essentially from its source, rather than being an accidental discharge from it." ([13], p. 157) The following dictum is ascribed to Oskar Fischinger: "Sound is the soul of an inanimate object." (cited in [15], p. 330, author's translation) John Cage is reported to have hit objects wherever he went in order to investigate their inner nature.

An old example of experiencing sound as the voice of things can be found in Homer's Odyssey. Homer describes Odysseus, having returned home and competing in an archery contest: "Then his right hand took the string and made it vibrate, the cord sung beautifully and clearly, like the call of a swallow." (cited in [12], p. 102, author's translation) This is more than just a metaphor: To some extent it is the string which is acting, singing, even if excited by the plucking.

"Through noise nature vibrates of sense": With this poetic formulation Barthes describes the ability of sound to give things a voice. According to him, listening is always connected to a hermeneutics that aims at understanding the dark, the blurred or mute, and at making the sense "behind" appear. Listening to these sounds is an essentially religious experience, connecting the listening subject with the hidden world of the gods [5]. This attribution of sound to an expressive quality of objects and to the soul and voice of things is an obvious connection to the topics of anthropomorphization and animism described above.

3.4 The Cultural Signification of the Sounds of Objects

Barry Truax states that sound mediates the relationship between listener and environment [34]. This also includes the sounds produced by the cultural-technological artifacts we create. For instance, Mark M. Smith describes how in travel accounts of antebellum America sounds became the signs of positively connotated, pre-industrial work, linking sound to an increase of wealth and population. He also reveals significant differences between the industrialized North and the slavery-dominated South. Both areas and cultures had a distinct soundscape with different keynote sounds and soundmarks6 [32]. The sounds of certain artifacts and machines thus contribute to the identification with a class, political orientation, etc.

6 The terms "keynote sound" (sounds that are heard by a particular society frequently enough to form a characteristic sonic background) and "soundmark" (a sound with a special meaning for a community) have been coined by R. Murray Schafer [31].

Through a comparative study of noise abatement campaigns from the early 20th century, Karin Bijsterveld points to class being an important element in definitions of noise and noise pollution. The cultural struggle about sounds is also a struggle between intellectual and working classes. The philosopher Theodor Lessing is reported to have been one of the first to organize an anti-noise campaign. Bijsterveld points out that "the sound of technology is a key aspect of technological culture, because sound has been highly controversial and deeply invested with symbolic significance." ([8], p. 165)

New technologies like the automobile or industrial machines bring with them new sounds that become symbols of progress for some and a primitive nuisance for others. A group that embraced noise and the loud, powerful sounds of technology with almost religious devotion were the Italian Futurists. In his manifesto "The Art of Noises" Luigi Russolo states: "We find far more enjoyment in the combination of the noises of trams, backfiring motors, carriages and bawling crowds than in rehearsing, for example, the 'Eroica' or the 'Pastoral'." (cited in [30], author's translation)

These examples show how the discourse about sound strongly reflected the societal structures and significant changes brought about by industrialization and technological change in general. Many comparable accounts could be found today, referring to ghetto blasters and mobile phones. Since the industrial age and the introduction of electroacoustic technology, sounds have become more pervasive than ever. And this trend will increase due to the technique of electroacoustic enhancement of commodities described above. The study of the sociocultural history of sound reveals that the importance of sound goes far beyond purely functionalist purposes, e.g. that of providing feedback in an interface.

4 Implications, Directions and Strategies for Sound Design

We have described how sounds relate to artifacts, how they become meaningful and can even give inanimate things expressive qualities. And we have also described how both artifacts and their sounds play an important sociocultural role. Thus, to say that sound has great potential for conveying information about the nature of artifacts and their hidden processes or properties is correct, but it falls short of grasping the full complexity of the role of sound in artifacts. Sound does not simply convey information; it offers the listener resources, affordances, clues for an interpretative act. The dialectic relation of sound with physical objects, combined with the dual nature of artifacts - being at the same time abstract signs and concrete, physical realities - provides endless possibilities for complex combinations. The result is a complex interleaving of levels of interpretative clues through sound. In the following we will elaborate on how sounds can be designed with these aspects in mind, focusing on the relationship and possible reconfigurations of objects and their sounds.

4.1 A Narrative Approach to Sound Design

In the introduction we have described the transformation of inanimate objects into procedural, interactive objects and the resulting narrative potential. A technologically complex object with the ability to sense, process, store and communicate can be seen as an actor in a narrative of interaction. However, there is hardly any theoretically grounded know-how and only a very small number of practical examples which employ a narrative notion expressed through sound in the interaction with computerized objects. In order to establish criteria and a methodological framework to investigate the possibilities of this new direction of sound design for interactive commodities, we propose to start with film and game sound design, because they provide a rich source of material describing how the narration of object interaction can be designed sonically. Maribeth Back suggests that it could be worthwhile to investigate the design practices of sound designers of narrative media. She describes how sound can help to create micro-narratives, using both cultural experience (codes, sound as sign) as well as physical experience [2].

In the following we will outline a few considerations and possible directions for design drawn from film and game sound design.

4.2 Strategies for Object-Sound Reconfigurations

We have mentioned the possibility of referring sonically to an object or to a specific action with an object. The sound produced when an object is used or operated (or just touched in some way) can become not only the signifier of that object but also of its function, use context and sociocultural significance, and turn into a signifier of metaphorical or associated qualities of the object. Let us consider the example of the sound of a hammer hitting a nail: It is an index of a hammer as well as of hammering, and it can symbolize strength, aggression, work, DIY-culture, communism and so forth, or it can even be an enunciation of headache.

These levels of meaning creation can become literally mixed in sound through the strategic layering and intertwining of various sounds into a new sound gestalt. This is a common practice in film sound: Through layering of sounds, combining concrete, identifiable sounds with each other or even with more abstract sounds, meaning potentials7 are transferred between them. Film sound designers, for example, use this method to build richer meaning potentials into seemingly simple sounds. For example, the waterdrops falling onto an ant colony in "A Bug's Life" (John Lasseter, 1998) are sounds of splashing water combined with sounds of rockets and explosions [7].

Another example of the ability of sounds to redefine the nature and meaning of objects displayed on screen are the typical cartoon sound effects created for producers like William Hanna and Joseph Barbera or Tex Avery. Here signifier and signified are subjected to extreme reconfigurations: Instead of sonifying literal physical processes, metaphorical or symbolic sounds are used. Emotional expression and interpretative clues are achieved by the associative nature of the sound and the specific cartoon aesthetic based on analogy, contrast and exaggeration. An example is the sound of an anvil being hit with a heavy hammer, used in a scene where a character gets hit by something like a baseball bat [7]. Many new sound icons, which stand for a very specific meaning that is entirely artificial and established only through repeated use, have emerged from animated film. Examples are many film sound clichés: falling objects whistle, strong fist blows in fights have a metallic impression, quick running produces the sound of a ricochet (which, again, is often the same, stereotyped sound, and not an arbitrary, realistic recording of a ricochet), and so forth.

These sounds are not questioned by the viewer-listener. Instead the naturalizing power of sound suppresses the notion that the audiovisual event could be impossible or unrealistic, at least during the actual experience of the film. This ability of sound to create credibility is often used for fantastic or incomprehensible things and processes. Ben Burtt, the sound designer of Star Wars, first established credibility for fantastic sound effects by finding "anchors" in familiar sounds like animals (in which case there is also the aspect of animism to be considered, see below) or familiar machinery. Using these sounds he could inject the fantastic objects and processes depicted on screen with the necessary amount of familiarity and credibility [39]. These sounds are then often manipulated (pitch shift, time stretch, filtering, amplitude envelopes etc.) in order to fit them into the "carrier" sound8. Particularly interesting to note here is that such sounds, detached from their original source through electroacoustic recording, can become familiar and strange at the same time. As mentioned above, and somewhat contrary to many accounts of everyday listening, the identification of a sound source can be subject to a great deal of interpretation and uncertainty, especially when the sound is represented in a recording9. However, this separation of sound and referent is a powerful device for design because it allows the creation of products which have a distinguished and novel sonic identity without being entirely alien.

4.3 The Use of Sound to Animate and Characterize Objects

Animation, fantasy and science-fiction film is certainly also a rich source for the study of objects as expressive "characters". In these genres we can find countless examples of objects that are given a personality and emotional expressivity through sound. Beauchamp states that "virtually any character or object can be personified by adding speech, movement, and the expression of emotions." ([7], p. 21) Even in "realistic" movies the sound design often goes beyond just naturalizing or physicalizing an object, adding a narrative component. According to Flückiger, objects in movies are often not only made credible but also animated through sound. The purpose of beeping computer displays, or of the extensive sonic endowment of supernatural occurrences or alien creatures, is to distract from their mocked nature, to suggest an actual existence, life and function [15].

An example of sonically anthropomorphized and animated objects is the sound design of Darth Vader's TIE Fighter in "Star Wars": The distorted scream of a human being was mixed with various jet engines, producing a nail-biting sound while containing an eerie humaneness: "His ship and its sounds are extensions of this merger (between flesh and machine, note from the author), and thus they scream in pain." ([39], p. 109) But fantastically animated objects, too, become more credible when they have the appropriate sounds. Think of the living dishes in "Beauty and the Beast" (Gary Trousdale, Kirk Wise, 1991). And sound designer Ben Burtt used morphological analogies to human language sentences and baby talk to shape the beeps of R2D2 from "Star Wars" (George Lucas, 1977) [33].

Sound also plays a role in marking ordinary objects as extraordinary or magic. This can be the case with simple, inanimate objects that not only become "alive" through use, but also reveal extraordinary qualities through sound: An example of a rather simple object being enriched and personified in this way is the jade sword in "Crouching Tiger, Hidden Dragon" (Ang Lee, 2000). The sword emits glass-like singing sounds, conveying a fine, precious identity, fragile and sharp at the same time. When used inappropriately it starts to oscillate, which is accompanied by a wobbling sound, as if it was responding in annoyance.

Sonnenschein suggests the use of "archetypal templates" to create nonhuman sounds that we can relate to. We might "find squeals, squeaks or hisses from compressed air hoses; groans from old wooden doors; and a laughter-like craziness from bending a saw." ([33], p. 61) Often sounds of animals are not only used to animate objects but also serve as sign carriers and connotational devices. Lion growls, yapping chimps or cat purrs can be

7 Van Leeuwen proposes this term instead of the static code to express the contextual dependency of meaning making. [36]
8 The carrier-modulator principle known from FM synthesis serves well here as a metaphor for how layered sounds often work.
9 Quite often students of our classes in sound design are not able to identify the sound sources of recordings their fellows made, unless the sounds are very typical or strongly contextualized.
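The layering practice described in section 4.2 - fitting a manipulated secondary sound into a "carrier" sound (cf. the carrier-modulator metaphor of footnote 8) - can be sketched in a few lines. This is a minimal illustration, not part of the original work: the synthesized sine bursts, the one-pole filter and the mix level are assumptions standing in for recorded layers and a designer's ear.

```python
import numpy as np

SR = 44100  # sample rate in Hz

def tone(freq, dur, sr=SR):
    """A plain sine burst standing in for a recorded sound layer."""
    t = np.arange(int(dur * sr)) / sr
    return np.sin(2 * np.pi * freq * t)

def lowpass(x, alpha=0.05):
    """One-pole low-pass filter: softens the added layer (the 'filtering' step)."""
    y = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc += alpha * (v - acc)
        y[i] = acc
    return y

def envelope(x, win=512):
    """Crude amplitude envelope of the carrier (moving average of |x|)."""
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

# Carrier: the identifiable "main" sound; layer: a more abstract texture.
carrier = tone(220.0, 0.5)
layer = lowpass(tone(1400.0, 0.5))

# The layer is shaped by the carrier's amplitude envelope and mixed at a low
# level, so both fuse into one sound gestalt instead of two audible sources.
mix = carrier + 0.3 * envelope(carrier) * layer
mix /= np.max(np.abs(mix))  # normalize for output
```

In an actual design the bursts would be recorded material and the settings tuned by ear; the point of the sketch is only that the added layer is subordinated to the carrier's dynamics, which is what makes the combination read as a single sound.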
extend, which implies a certain ethical dimension in the discourse about sound design for artifacts. The possibility to integrate an even wider range of sounds by means of miniaturized electroacoustic devices, and to control them through computer technology, makes the relationship between a physical object and its sound arbitrary11. Never before was there so much control over the sonic appearance of an artifact.

R. Murray Schafer was deeply concerned about the increasing schizophonia resulting from the separation of sounds from their "natural" sources. But schizophonia is already the normal cultural condition we live in and have learned to deal with (most of the time, at least). While we might still have some resistance towards the schizophonic reality, the generation of teenagers walking the streets listening to distorted and filtered pop songs from their mobile phones, obviously enjoying it, will most likely have no problem at all with this notion.

As designers we have to embrace this new aesthetics of the sonically extended artifact as a new field where practices and vocabularies are still to be developed. To pick just one example consequence indicated above: It might be a good idea not to stick too tightly to sounds that are "well identified" in laboratory settings, thinking of them as the only way of conveying a clear meaning. Designers will have to deal with the fact that most decontextualized (recorded!) sounds are "badly identified" or at least reinterpreted. They are the sound designer's raw material, and the question is: What creative and interpretative potential lies in them?

6 Future Work

In terms of design, many methodological questions remain to be answered. One central issue is the conceptual and practical integration of the macro and micro levels of the design. The macro level deals with the overall experience of the interaction with the object and how it is embedded into its sociocultural context. This requires a suitable method of design and evaluation that takes into account cultural factors and interpretative processes.

On the micro level the challenge for the sound designer is the integration of sounds into each other, into the artifacts and into the interaction dynamics. For example, an open question is how "hybrid sound gestalts" can be designed that merge with a device. How far can we stretch an object's (sonic) identity? What sounds are acceptable on a metal, plastic or wooden object? The methods and tools for fine-tuning the fit of the electroacoustic enhancements and the object's own sounds to processes and dramaturgies exist and are widely used in film and game sound design: manipulation of pitch, amplitude and spectrum over time with envelopes, adjusting the mix, crossfading, masking, layering, or using semantically complex effects like reverb, delay, filtering, and so on. And designing sounds for narrative purposes means that we have to design transition points, find linking strategies between sounds to create continuity, and model the dynamic time-space relationship of interactive applications. This requirement has already led to advanced game sound middleware such as FMOD12 or Wwise13. Finally, there are several techniques for "material transfer": convolution, manual filtering and manipulation of resonance "by ear", or physical modeling. Technically very simple and often very successful is the subtle layering of filtered recordings of physical processes onto other sounds to suggest a certain material quality.

But the relevant questions are: What happens if such "transnatural" materials are present in an actual, physical object which already has a material identity? How can design methods which are mostly used in immersive media be adapted to the context of actual physical objects with their non-immersive qualities? And of course this sound design is not "free"; in many ways it is much more restricted by external factors than the sound design for traditional audiovisual media. These factors are technical limitations as to what sounds can be chosen (e.g. the resonance body of the device, loudspeaker specifications), but also how the object is handled or used (hitting, shaking, heating, etc.). In short: Further investigation is also needed into the synchresis and diegetics of sound for interactive commodities. We propose to draw on material from game sound research as a good starting point, as it is able to link the narrative audio-visual world of film with interactive media and physical interfaces.

Finally, we will have to investigate strategies for defining and modulating the "threshold of attention" by shaping how sounds integrate into a soundscape and how they may emerge from it in a meaningful way. Traditionally, much research has been conducted to evaluate the perceived urgency of alarm sounds. In reverse, we suggest that it should also be possible to shape sounds in such a way that they easily blend into the background of our attention, becoming a keynote sound and being available to listening-in-search or listening-in-readiness [31]. In addition, designing meaningful, rich, subtle sounds with a certain degree of individuality, freshness and unpredictability in terms of what they are supposed to mean will be a good foundation to actually prevent them from being annoying, helps to create "hi-fi" soundscapes (in the Schaferian sense), and supports the human ability to interpret and appropriate whatever we encounter and make sense out of the alien.

To make progress in these directions, a collaboration between the still quite separate communities of sound and music computing, auditory display, and traditional sound design is required. The design of rich sounds for interactive commodities will have to integrate the two design cultures and their strengths: dynamic and interactive control on the one hand and rich, detailed sonic semantics on the other.

11 Every recording of a sound in fact already is a distortion of the authentic link between a physical process and the resulting sound.
12 http://www.fmod.org
13 http://www.audiokinetic.com

References

[1] Federico Avanzini and Paolo Crosato. Haptic-Auditory Rendering and Perception of Contact Stiffness, volume 4129/2006 of Lecture Notes in Computer Science. Springer, 2006.
[2] Maribeth Back. Micro-Narratives in Sound Design: Context, Character, and Caricature in Waveform Manipulation. In Proceedings of the 3rd International Conference on Auditory Display, Palo Alto, California, 1996.
[3] P. Barreiro, C. Ortiz, M. Ruiz-Altisent, V. De Smedt, S. Schotte, Z. Andani, L. Wakeling, and P.K. Beyts. Comparison Between Sensory and Instrumental Measurements for Mealiness Assessment in Apples. A Collaborative Test. Journal of Texture Studies, 29:509–525, 1998.
[4] Roland Barthes. Semantik des Objektes. In Das semiologische Abenteuer. Suhrkamp, 1988.
[5] Roland Barthes. Der entgegenkommende und der stumpfe Sinn. Suhrkamp, 1990.
[6] Jean Baudrillard. The System of Objects. Verso, 1996.
[7] Robin Beauchamp. Designing Sound for Animation. Elsevier, Burlington, MA, 2005.
[8] Karin Bijsterveld. The diabolical symphony of the mechanical age. In Michael Bull and Les Back, editors, The Auditory Culture Reader, pages 165–189. Berg, 2003.
[9] Gernot Böhme. Der Glanz des Materials - Zur Kritik der ästhetischen Ökonomie. In Atmosphäre. Suhrkamp, 1995.
[10] Stephen Brewster. Overcoming the lack of screen space on mobile computers. Personal and Ubiquitous Computing, 6(3):188–205, 2002.
[11] Daniel Chandler. Technological or media determinism, 1995. Available from World Wide Web: http://www.aber.ac.uk/media/Documents/tecdet/tecdet.html [cited 24.1.2008].
[12] Michel Chion. Le Son. Editions Nathan, Paris, 1998.
[13] Steven Connor. Edison's teeth: Touching hearing. In Veit Erlmann, editor, Hearing Cultures - Essays on Sound, Listening and Modernity. Berg, 2004.
[14] Mihaly Csikszentmihalyi and Eugene Rochberg-Halton. The Meaning of Things - Domestic Symbols and the Self. Cambridge University Press, Cambridge, 1981.
[15] Barbara Flückiger. Sounddesign: Die virtuelle Klangwelt des Films. Schüren Verlag, Marburg, 2001.
[16] Karmen Franinovic, Daniel Hug, and Yon Visell. Sound embodied: Explorations of sonic interaction design for everyday objects in a workshop setting. In Proceedings of the 13th International Conference on Auditory Display, 2007.
[17] W. W. Gaver. What in the world do we hear? An ecological approach to auditory event perception. Ecological Psychology, (5):1–29, 1993.
[18] B. L. Giordano. Everyday listening, an annotated bibliography. In D. Rocchesso and F. Fontana, editors, The Sounding Object, pages 1–16. Edizioni di Mondo Estremo, 2003.
[19] Adam Greenfield. Everyware: The Dawning Age of Ubiquitous Computing. New Riders, 2006.
[20] Daniel Hug. Towards a hermeneutics and typology of sound for interactive commodities. In Proceedings of the CHI 2008 Workshop on Sonic Interaction Design, Firenze, 2008.
[21] Ute Jekosch. Assigning Meaning to Sounds - Semiotics in the Context of Product-Sound Design. In Jens Blauert, editor, Communication Acoustics. Springer, 2005.
[22] Victor Kaptelinin and Bonnie A. Nardi. Acting with Technology. MIT Press, Cambridge, Massachusetts, 2006.
[23] C. Kayser, C. I. Petkov, M. Augath, and N. K. Logothetis. Integration of touch and sound in auditory cortex. Neuron, 48:373–384, 2005.
[24] Gregory Kramer, Bruce Walker, Terri Bonebright, Perry Cook, John Flowers, Nadine Miner, and John Neuhoff. Sonification report: Status of the field and research agenda. 1999.
[25] Klaus Krippendorff. The Semantic Turn - A New Foundation for Design. Taylor and Francis, 2006.
[26] Bruno Latour. Reassembling the Social - An Introduction to Actor-Network-Theory. Oxford University Press, 2005.
[27] Winfried Nöth. Handbuch der Semiotik. J. B. Metzler, 2nd, completely revised and expanded edition, 2000.
[28] P. Susini, D. Rocchesso, K. Franinovic, Y. Visell, K. Obermayer, et al. Closing the Loop of Sound Evaluation and Design. In ISCA Workshop on Perceptual Quality of Systems, 2006.
[29] Stephen Harold Riggins, editor. The Socialness of Things - Essays on the Socio-Semiotics of Objects. Mouton de Gruyter, 1994.
[30] Luigi Russolo. L'art des Bruits, textes établis par Giovanni Lista. L'Age d'homme, 2001.
[31] R. Murray Schafer. The Soundscape: Our Sonic Environment and the Tuning of the World. Destiny Books, New York, 2nd edition, 1994 (first published 1977).
[32] Mark M. Smith. Listening to the Heard Worlds of Antebellum America. In Michael Bull and Les Back, editors, The Auditory Culture Reader, pages 137–163. Berg, 2003.
[33] David Sonnenschein. Sound Design - The Expressive Power of Music, Voice, and Sound Effects in Cinema. Michael Wiese Productions, 2001.
[34] Barry Truax. Acoustic Communication. Ablex, 2nd edition, 2000.
[35] Sherry Turkle. The Second Self: Computers and the Human Spirit. MIT Press, 20th anniversary edition, 2004.
[36] Theo van Leeuwen. Speech, Music, Sound. Palgrave Macmillan, 1999.
[37] Joan M. Vastokas. Are artifacts texts? Lithuanian woven sashes as social and cosmic transactions. In Stephen Harold Riggins, editor, The Socialness of Things - Essays on the Socio-Semiotics of Objects, pages 337–362. Mouton de Gruyter, 1994.
[38] G. B. Vicario. Prolegomena to the perceptual study of sounds. In D. Rocchesso and F. Fontana, editors, The Sounding Object, pages 17–31. Edizioni di Mondo Estremo, 2003.
[39] William Whittington. Sound Design & Science Fiction. University of Texas Press, Austin, 2007.
Saturday Night or Fever?
Context Aware Music Playlists
Abstract. Context awareness provides opportunities for enhanced user experience, interaction and customisation of electronic
devices, particularly those which hold large data sets of information which may often only be relevant to a user in certain scenarios.
In this work, we examine how context awareness can be applied to the automatic generation of music playlists on mobile music
devices, such as MP3 players and mobile phones. We hypothesise that the type of music which a person might wish to listen to will
often be influenced by external factors such as the time of day, the ambient temperature, amount of ambient or background noise,
their current amount of physical activity, and their emotive state, to name a few.
We detail the results and data sets of preliminary investigation into several human movement scenarios, emotional status and external
factors. These results are obtained by employing the cost-effective Wiimote controller to record acceleration profiles. The Wiimote is
assessed against a professional level, high-cost, motion capture device to identify if such portable devices are useful in everyday
scenarios. Base values for subject locomotion were investigated for the Wiimote device and verified and analysed using the Qualisys
3D-motion capture system. This was done to set a baseline for the subject's forward velocity, and also to enable further, more complex locomotion studies for this project. A model of the playlist generation system is provided, which can be used to
simulate responses to various types of context-informing input.
It is noted that the system has been implemented using a fuzzy rule based system (FRBS). This allows the initial construction to be based on a knowledge base relating a suggested emotional state (E-state) to various inputs. The longer-term concept is to
investigate the adaptable nature of the initial knowledge base and allow it to adapt to an individual's actual emotional state
preferences. It is suggested that further research focus on the implementation of a Self-Learning Fuzzy Rule Based System (SL-FRBS).
A detailed review of automatic playlist generation is beyond the scope of this work; however, we briefly present an overview of the field and refer the reader to the references made in this section should he or she wish to gain a deeper knowledge of playlist generation techniques [1, 2, 3, 4, 5, 6, 7].

Initially, recommendation and playlist generation systems relied on abstract or meta-data level information and user preference in order to order the music tracks. These systems are not hugely different from Automated Collaborative Filters (ACFs) [8] in that they track and correlate user preference and build up tables of similarity based on musical information such as artist and genre [1, 2, 3]. Although these can take simplistic forms, by counting the number of plays, favourite artists, etc., the process of learning user preferences purely based on these factors can also take more complex forms [3].

More recent work has been focussed on extracting and analysing content present within the user's music collection and making decisions based upon similarity metrics or correlations, available as a result of content analysis. A more notable example of this type of analysis is in the field of audio thumbnails [4, 5, 6]. Such content information can then be coupled with the meta-data and user preferences mentioned previously in order to provide, what is generally agreed to be, a more suitable and effective system for playlist generation and music recommendation [4, 5].

In similar work to this paper, Reynolds et al. propose systems more advanced than the conventional approaches to playlist generation mentioned earlier. Their work supports the theory that contextual information is also highly valuable and appropriate when suggesting or ordering musical tracks for the listener. In fact, they consider many of the factors which we propose to be crucial in our own work and explain in more detail later in this paper. Reynolds et al. consider variables such as temperature, activity and location to be incorporated as meta-data, and also indicate that the mood of the listener is another key variable which must be considered in automatic playlist generation. Their work also presents an excellent overview of the history of automatic playlist generation and the links between music and mood or emotion [7]. A detailed exploration of emotional states, measurement and music goes beyond the intended scope and context of this paper; however, the reader is referred to the work of Meyers, who provides an in-depth exploration of the links between emotion and music [9].

The contextual information initially of interest comes from two main sources: the user or listener and the environment in which the listener exists. This is further ratified by Reynolds et al., who also consider contextual input parameters from these two domains [7].

3.1. The Listener

Information which can be extracted from the user is arguably the most useful data which can be acquired if one wishes to determine contextual information regarding the listener's current emotional state and level of activity. This is illustrated in more detail in Figure 2, which is presented as a subset of the previous diagram in Figure 1.

The emotional state of the user is highly likely to influence the type of music which they wish to listen to. Listeners who are happy or contented are likely to desire their favourite music tracks and music from genres known to have positive effects on happiness, reflecting and stimulating their current emotional state. Equally, a listener who is unhappy might wish to listen to slower, calmer music that fits with their current mood. However, this is not to negate the fact that a sad or unhappy listener might listen to upbeat, happy music in order to change their mood. Therefore, the listening requirements cannot be based purely upon determination of emotion or, at least, multiple inputs are needed to determine if a sad listener wishes to remain sad or wants to be cheered up. We propose that measures such as skin conductivity and heart rate might be acquired directly from the user, before being sent to a decision-making tool which also takes into account other parameters and semantic knowledge of the music database.
The Wiimote connects to Bluetooth-enabled devices, such as computers, and provides valuable motion information via its accelerometers [11].

Where high temperatures are present, we suggest that two scenarios are possible. The first may be that the user will be tired and at rest (determined via motion detection), in which case music of lower tempo, which provides a more relaxing experience, may be required. However, an alternative may be that the user wants music to stimulate them, perhaps because they are exercising or deriving a positive, happy feeling from the high temperatures (such as when on the beach during a holiday). This can be further refined by measuring the amount of light. If light levels are low then it is most likely night time and, again, in combination with movement and ambient temperature, the user might wish to either dance or relax and chill out. The amount of ambient noise is useful firstly to help ensure that the listener is provided with a constant, desired volume level proportional to the amount of noise in the environment (within reason). It might also be used to determine if the user is inside or outside.
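The ambient-noise point above, playback volume proportional to environmental noise "within reason", can be sketched as a simple clamped rule. This is an illustration only; the decibel threshold, gain and cap below are hypothetical, not values from the paper:

```python
# Sketch of the ambient-noise volume rule: playback volume rises with
# environmental noise above a quiet threshold, clamped to a ceiling.
# The threshold (40 dB), gain (0.01 per dB) and cap (0.9) are hypothetical.

def playback_volume(noise_db, base=0.4, gain=0.01, cap=0.9):
    """Return a playback volume in [base, cap] tracking ambient noise."""
    return min(cap, base + gain * max(0.0, noise_db - 40.0))
```

A quiet room leaves playback at the base level, while very loud surroundings saturate at the cap rather than growing without bound.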
Only environmental factors are considered at this stage, rather than factors directly read from the listener or user. Therefore, we define states and parameters for a number of environmental factors, which are presented in Tables 3 to 5.

Table 3: Temperature States
  Cold       Warm       Hot
  15-18 °C   20-23 °C   27-30 °C

Table 4: Lighting States
  Dark   Grey   Day Light   Light   Sunny
  0-3    2-4    3-5         4-7     6-10

Figure 5: Schematic of FRBS (Input -> Fuzzification -> Decision-making logic, informed by the Knowledgebase -> Defuzzification -> Output)

The Wiimote is used to provide hands-free and low-cost motion feedback to the FRBS. The Wii controller provides acceleration feedback for the three axes, x, y and z. Initial usage of the device is to obtain an approximate value for the forward linear velocity V from the acceleration Δa across a period of time Δt as

    V = Δa · Δt .    (1)

We experimented with extracting locomotion data from the Wiimote controller and compared this to the data which can be extracted from a full-blown motion capture system, the Qualisys. Figure 6 shows how placement of the Wiimote on subjects was achieved and Figure 7 provides an image of a subject being tracked by the Qualisys system.
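Equation (1) amounts to integrating the sampled forward acceleration over time. A minimal Python sketch of the idea (the sample values and sampling period below are hypothetical, not Wiimote calibration data):

```python
# Sketch of Equation (1): approximate forward velocity by accumulating
# acceleration samples over time, V = sum of a * dt.
# Sample values here are hypothetical, not real Wiimote readings.

def forward_velocity(samples, dt):
    """Integrate forward-axis acceleration (m/s^2) sampled every dt seconds."""
    v = 0.0
    for a in samples:
        v += a * dt  # each step contributes delta-a times delta-t
    return v

# e.g. a constant 0.5 m/s^2 held for 2 s yields approximately 1.0 m/s
```

In practice accelerometer drift and gravity would have to be compensated before such an integration is usable, which is one reason the paper validates the Wiimote against the Qualisys system.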
As can be seen, the Qualisys system is more cumbersome than
the Wiimote and actually requires a number of cameras to track
motion across a very fine range. However, the Wiimote is
much less intrusive and can be attached to a belt or put in a
pocket. Furthermore, provided the Wiimote subject remains
within the transmission radius of the Bluetooth transceiver
(either 10 meters or 100 meters in laboratory conditions), the
user will have complete freedom of movement. This makes the
Wiimote not only better for recording natural movement, but
much more practical for deployment into real-world scenarios,
such as playlist generation.
The current rule base has been derived from the expert or suggested emotional state of a subject for various states of inputs. Input ranges for locomotion and temperature have been determined experimentally, while lighting and weather are based on an incremental scheme. An initial triangular or trapezium membership function was chosen; further research on the rule base, the fuzzy set distribution and the membership functions will provide refinement of the system performance.
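To make the fuzzy machinery concrete, the following is a minimal Mamdani-style sketch in Python. The published system was built in MATLAB (evalfis); the membership breakpoints and rules below are hypothetical illustrations, not the authors' tuned knowledge base:

```python
# Minimal Mamdani-style FRBS sketch: triangular membership functions,
# min for AND, max aggregation, centroid defuzzification.
# All breakpoints and rules are hypothetical, for illustration only.

def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Input fuzzy sets: locomotion speed (m/s) and temperature (deg C).
SPEED = {"stationary": (-0.5, 0.0, 1.0), "walking": (0.5, 1.5, 2.5), "running": (2.0, 3.5, 5.0)}
TEMP = {"cold": (10.0, 16.5, 20.0), "warm": (18.0, 21.5, 25.0), "hot": (24.0, 28.5, 33.0)}
# Output fuzzy sets: suggested E-state on a 0-10 scale.
ESTATE = {"low": (0.0, 2.0, 4.0), "mid": (3.0, 5.0, 7.0), "high": (6.0, 8.0, 10.0)}

RULES = [  # IF speed is X AND temperature is Y THEN E-state is Z
    ("stationary", "cold", "low"),
    ("walking", "warm", "mid"),
    ("running", "hot", "high"),
]

def evaluate(speed, temp, resolution=101):
    """Infer a crisp E-state via Mamdani inference and centroid defuzzification."""
    num = den = 0.0
    for k in range(resolution):
        u = 10.0 * k / (resolution - 1)  # sample the output universe [0, 10]
        agg = 0.0
        for s_lbl, t_lbl, e_lbl in RULES:
            firing = min(tri(speed, *SPEED[s_lbl]), tri(temp, *TEMP[t_lbl]))
            agg = max(agg, min(firing, tri(u, *ESTATE[e_lbl])))
        num += u * agg
        den += agg
    return num / den if den else 0.0
```

For example, a subject walking at 1.5 m/s in 21.5 °C fires only the "walking and warm" rule, so the defuzzified E-state sits at the centroid of the "mid" set, i.e. near 5 on the 0-10 scale.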
5. Initial Results
5.1. The Wiimote as a Motion Device
The Qualisys motion capture system allowed comparative data
capture of the three primary directions of subject motion. For this paper we are primarily interested in forward velocity. It is noted from the Qualisys results that a potential coupling is possible due to the placement of the Qualisys sensor. The data recordings
from the Wiimote and the Qualisys system were conducted
under the same conditions in order to fully investigate the
comparative effectiveness of each device.
factors. We carried these tests out over ten subjects to begin with, who were experienced at using portable digital music players in a variety of scenarios, and established the knowledge in the fuzzy system using the average of these tests. The scenario configurations used are given below in code, along with a textual description of each scenario; the average set of results relating to emotional state is presented in Figure 16. Although an average response is employed, there were strong correlations between the majority of the subjects and their perceived emotional state indicators, which indicates that the data recorded is reliable for those circumstances.

%1 Walking, temperature is hot, lighting is dark/grey and weather is light rain.
ip1=[1.5;28;2;2.5]
E(1)=evalfis(ip1, a)

%2 Stationary, temperature is cold, lighting is dark and weather is raining.
ip2=[0.1;14;1;0.5]
E(2)=evalfis(ip2, a)

%3 Stationary, temperature is warmish, lighting is brightening and weather is dry.
ip3=[0.1;21;9;9]
E(3)=evalfis(ip3, a)

%4 Running, temperature is hot, lighting is daylight/getting brighter and dry.
ip4=[3.5;30;5;9.5]
E(4)=evalfis(ip4, a)

%5 Walking, temperature is getting hot, lighting is dark and weather is drizzling.
ip5=[1.5;16;1;6]
E(5)=evalfis(ip5, a)

%6 Stationary, temperature is hot, lighting is grey and weather is dry.
ip6=[1.4;32;2.5;9.5]
E(6)=evalfis(ip6, a)

%7 Walking/Jogging, temperature is mild, daylight and it's dry.
ip7=[2;17;4.5;9]
E(7)=evalfis(ip7, a)

Emotions=floor(E*100)

Additionally, we attached an emotional state range to each of the songs in our small database, to which the output emotional state can be correlated. Table 7 shows the E-state range which, through pilot testing, we attached to each song.

Table 7: Song Database with E-States
ID  Song                        E-state  E-state Median
0   One More Time (Radio Edit)  5-8      6.5
1   Love Unlimited              4-6      5
2   Over and Over               7-9      8
3   Harvester of Sorrow         0-3      2
4   Comfortably Numb            3-4      3.5
5   Push The Button             7-9      8
6   Breathe                     0-3      1.5
7   Gimme All Your Lovin'       5-8      6.5

Table 8 shows the results of each of the seven experimental scenarios presented, along with the resulting playlist to be generated. The grade G in the playlist is determined by taking a simple Euclidean distance measurement of the form

    G(p, q) = √((p − q)²)    (2)

from the song E-state median and the current E-state of the listener, based on the scenario. The playlist is shown as a ranked set of song ID numbers from the database for each E-state.

Table 8: Playlist Order for Test Scenarios
Scenario  E-state  Playlist order
1         4.3      1; 4; 0; 7; 3; 6; 2; 5
2         0        6; 3; 4; 1; 0; 7; 2; 5
3         6.8      0; 7; 2; 5; 1; 4; 3; 6
4         7.7      2; 5; 0; 7; 1; 4; 3; 6
5         3        4; 3; 6; 1; 0; 7; 2; 5
6         3.8      4; 1; 3; 6; 0; 7; 2; 5
7         6.5      0; 7; 1; 2; 5; 4; 3; 6

6. Conclusions & Future Work

We have demonstrated that the Wiimote can function as a highly useful instrument for measuring forms of human motion and in future plan to assess the functionality of other low-cost motion devices, such as the accelerometers which have been integrated into mobile music players such as the iPhone and iPod Touch. These devices, and the Wiimote functionality, can be further ratified by more comparisons with data extracted using high-end motion capture hardware, such as the Qualisys, mentioned earlier.
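As a cross-check on the playlist grading of Equation (2) with the data of Tables 7 and 8, here is a minimal Python sketch (not the authors' implementation); breaking grade ties by ascending song ID is an assumption, but it reproduces the published orderings:

```python
# Sketch of Equation (2): G(p, q) = sqrt((p - q)^2), i.e. the absolute
# distance between a song's E-state median (Table 7) and the listener's
# current E-state. Tie-breaking by song ID is an assumption.

from math import sqrt

SONG_MEDIANS = {0: 6.5, 1: 5.0, 2: 8.0, 3: 2.0, 4: 3.5, 5: 8.0, 6: 1.5, 7: 6.5}

def grade(p, q):
    """Grade of a song: Euclidean distance between two E-state values."""
    return sqrt((p - q) ** 2)

def playlist(e_state):
    """Rank song IDs by increasing grade from the current E-state."""
    return sorted(SONG_MEDIANS, key=lambda sid: (grade(SONG_MEDIANS[sid], e_state), sid))

# Scenario 1 (E-state 4.3) yields Table 8's ordering: [1, 4, 0, 7, 3, 6, 2, 5]
```

Note that since G(p, q) reduces to |p − q| in one dimension, the ranking is identical whether or not the square root is taken.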
Multiple Output (MIMO) would allow a multiple meta-data selection process based on the beat or tempo of the music file, for example.

The initial FRBS has been implemented and provides a baseline model for the development of a more refined model. Fuzzy sets and membership functions need to be refined to allow further optimisation of the system, extending the range of inputs to allow a far more flexible system. Of particular interest is the modification of the suggested rule base, namely the ability to modify the initial rule base with rules more selective of the subjects' individual emotional states. The approaches being investigated are based on a self-learning strategy:

1. Modifying the fuzzy set definitions. Modification of the fuzzy sets implies potential issues with the fundamental nature of the linguistic meaning. Mamdani suggested that a change in the fuzzy set definitions should be avoided, although minor or small modifications may be possible [13].

2. Modifying the set of rules in the rule base. A fuzzy relation is used to describe each rule in the rule base of the FRBS. Given the general rule expression

   Rn: IF m is Mn AND t is Tn AND l is Ln AND w is Wn THEN u is Un

   where Rn is the fuzzy relation of rule n, and M, T, L, W and U are linguistic labels assigned to each variable of rule n, the general relation R is constructed as the union of the individual relations

       R = ∪_{i=1..n} Ri .    (3)

   Using Zadeh's compositional rule of inference, the output fuzzy set (assuming in this case a Mamdani model) is then

       uo = (mo × to × lo × wo) ∘ R .    (4)

It is therefore feasible to modify the rule base based on factors such as the individual's emotional preference, or additional inputs such as heart rate. The system would then potentially provide an emotional state ranking that is more reflective of that individual.

Another point for development is to attach a number of E-states to each song in the database, since certain types of music may be suitable for two or more E-states. Through further investigation, we propose to develop a primary emotional rating and then a secondary and possibly a third E-state. Furthermore, the E-state suitable for each song can be investigated until a much deeper, more accurate estimation of the range of suitable E-states can be determined.

…the listener's emotional state. However, deriving measurements and indicators directly from the listener suggests a much more robust and reliable evaluation of the emotional state.

Acknowledgments

The authors wish to thank Sue Taylor and Janet Hayes of Sports and Exercise Sciences at Glyndŵr University.

References

[1] Aucouturier, J.-J. & Pachet, F., Scaling up music playlist generation, Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland (2002).

[2] French, J.C. & Hauver, D.B., Flycasting: On the Fly Broadcasting, Joint DELOS-NSF International Workshop on Personalization and Recommender Systems in Digital Libraries, Dublin, Ireland, 18th-20th June (2001).

[3] Platt, J.C., Burges, C.J.C., Swenson, S., Weare, C. & Zheng, A., Learning a Gaussian process prior for automatically generating music playlists, Advances in Neural Information Processing Systems 14, pp. 1425-1432 (2002).

[4] Gasser, M., Pampalk, E. & Tomitsch, M., A Content-Based User-Feedback Driven Playlist Generator and its Evaluation in a Real-World Scenario, Audio Mostly 2007, Ilmenau, Germany, 27th-28th September (2007).

[5] Logan, B., Content-Based Playlist Generation: Exploratory Experiments, 3rd International Conference on Music Information Retrieval (ISMIR 2002), (2002).

[6] Kukharchik, P., Martynov, D. & Kheidorov, I., Indexing and retrieval scheme for content-based search in audio databases, Audio Mostly 2007, Ilmenau, Germany, 27th-28th September (2007).

[7] Reynolds, G., Barry, D., Burke, T. & Coyle, E., Towards a Personal Automatic Music Playlist Generation Algorithm: The Need for Contextual Information, Audio Mostly 2007, Ilmenau, Germany, 27th-28th September (2007).

[8] Cunningham, S., Bergen, H. & Grout, V., A Note on Content-Based Collaborative Filtering of Music, Proceedings of the IADIS International Conference on WWW/Internet, Murcia, Spain, 5th-8th October (2006).

[9] Meyers, O.C., A Mood-Based Music Classification and Exploration System, MS Thesis, Massachusetts Institute of Technology (MIT), USA (2007).

[10] Tamminen, S., Oulasvirta, A., Toiskallio, K. & Kankainen, A., Understanding mobile contexts, Personal and Ubiquitous Computing, 8(2), 135-143 (2004).

[11] Vannoni, M. & Straulino, S., Low-cost accelerometers for physics experiments, European Journal of Physics, 28, pp. 781-787 (2007).
A Musical Instrument based on
3D Data and Volume Sonification Techniques
Lars Stockmann, Axel Berndt, Niklas Röber
Department of Simulation and Graphics
Otto-von-Guericke University of Magdeburg
Lars.Stockmann@Email.de
{aberndt|niklas}@isg.cs.uni-magdeburg.de
Abstract. Musical expressions are often associated with physical gestures and movements, which represent
the traditional approach to playing musical instruments. Varying the strength of a keystroke on the piano results
in a corresponding change in loudness. Computer-based music instruments often miss this important aspect,
which results in a certain distance between the player, his instrument and the performance.
In our approach for a computer-based musical instrument, we use a system that provides methods for an inter-
active auditory exploration of 3D volumetric data sets, and discuss how such an instrument can take advantage
of this music-based data exploration. This includes the development of two interaction metaphors for musical
events and structures, which allows the mapping of human gestures onto live performances of music.
1 Bars & Pipes www.alfred-j-faust.de/bp/MAIN.html
2 Native Instruments www.nativeinstruments.de
3 An overview of some musical controllers can be found at www-ccrma.stanford.edu/~serafin/NBF/Newport.htm
wise has to be absorbed by the visual channel alone. The main challenge for sonification research is to find an expressive, intuitive, and comprehensible mapping from the data domain towards sound.

In our sonification system, we employ spatial interactions to facilitate an intuitive method for an auditory exploration of 3D volumetric data sets. It uses a strictly functional mapping of data to complex sounds, based on differences in pitch and volume. This system is the basis for a novel computer-based instrument that can be used without musical experience. The instrument is designed around two metaphors: the Tone Wall metaphor allows a performer to directly generate a melody, while the Harmonic Field is used for a computer-aided accompaniment. Both techniques can be used at the same time. The instrument produces diverse sounds and allows for a highly interactive performance. It can be shown that spatial interactions hold great potential for use in computer-based instruments.

The paper is organized as follows: After an introduction to the sonification of volumetric data sets in Section 2, we advance by presenting our sonification system in Section 2.1. This includes some technical details regarding our implementation. We then elaborate in Section 2.2 how sonification and computer-based instruments connect, and how live music performances can benefit from an instrument that uses our sonification system. In Section 3 we describe how musical data can be derived from spatial gestures in volumetric data sets. The Tone Wall metaphor (Section 3.1) specifies the pitch, loudness, and timbre space for melodic purposes. The Harmonic Field (Section 3.2) describes how volume data can be used to represent harmonies, broken chord play, and musical textures. Section 3.3 is concerned with a combination of both concepts for the presentation of a one-man polyphonic performance. Finally, the results are discussed in Section 3.4, which also includes possible improvements for further research.

2 Volume Data Sonification

Data sonification is an underdeveloped, but growing field of research. In this section we describe how sonification can be applied to acoustically describe 3D volume data sets. Before we describe our method, we discuss several advantages that make sonification techniques at times superior to a more classic visual examination and presentation of scientific data sets. Examples are monitoring applications, or any type of unfocused operations and processes. The generated acoustic stimuli can be heard without paying direct attention. This yields an improved mobility. Furthermore, Kristine Jørgensen states that the presence of sound increases attention, and eases the perception by intentionally utilizing channel redundancy [11, 8]. A simple example is a flash light that is augmented with a sound while flashing. The proper use of acoustic stimuli in combination with the visual representation also generates a deeper sense of immersion, especially in interactive 3D environments [17]. Gregory Kramer stated that 'spatialized sound can, with limitations, be used to [...] represent three-dimensional volumetric data' [13]. One reason is that spatialized sound provides a direct mapping to the physical 3D space.

3D volume data occurs in countless fields of research and is used to represent the inner and outer structure of objects or materials in a voxel representation. To find an expressive mapping of these voxels to sound is one of the main challenges when designing a sonification system.

Since the development of powerful graphics accelerators, there has been much research on finding a good mapping in the visualization domain, but only a few attempts exist to exploit the possibilities of sonification to convey 3D volume data. Minghim and Forrest have suggested methods like the "Volume Scan Process", in which the density inside a volume probe is mapped to the pitch of a generated tone [16]. David Rossiter and Wai-Yin Ng traverse the voxels of a 3D volume and map their values to different instrument timbres, amplitudes and pitches [18]. Both systems are controlled through a quite simple mouse/keyboard interface. However, for the sonification of 3D volume data, interaction must not be seen as a requirement, but as a key aspect. In fact, it is the second most important aspect after the mapping. A direct exploration of the data by, e.g., moving the hand through an interactive 3D environment can provide the user with a better understanding of extent or local anomalies. Both examples of related work lack this ability of a responsive user interface for 3D input, like a realtime tracking system, or need to compile the audio data before one can listen to it. The next passage outlines our sonification system, which focuses on direct interactions and an expressive mapping of the inner structure of 3D volume data.

2.1 Spatial Exploration of Volume Data

As mentioned before, a sonification system can greatly benefit from tracking devices that allow a direct exploration of the volume data. In the visualization domain, this is generally done using a certain viewpoint metaphor, such as the ones presented by Colin Ware and Steven Osborne [23]. With respect to data sonification, the eye-in-hand metaphor can be easily transformed into the above described volume probe. Instead of a spherical or cubical shape, our approach uses the metaphor of a chime rod, as illustrated in Figure 1.
A computer that is asked to improvise could, of course, not use mood or emotion as the basis for its performance, but arbitrary, or specially arranged, data. Using music to convey data can have some advantages. Often sonification suffers from annoyance. Paul Vickers and Bennett Hogg state that 'Sonification designers concentrated more on building systems and less on those systems' æsthetic qualities' [22]. Acoustic stimuli that abide by the rules of music are generally more appealing for the listener than sounds that use arbitrary pitch and timbre. It may even stimulate the interactive exploration of data, as the listener self-evidently becomes a music performer by interacting with the dataset. She or he will try to achieve the most pleasant musical result. A distinct variation in the data means a distinct variation in music. Its location can be memorized more easily when the performer 'explores it intentionally' because she or he feels that this particular variation fits best in the current music progression.

However, finding a meaningful mapping of arbitrary multi-dimensional data to music must be considered highly challenging. Some approaches can be found in projects like the Cluster Data Sonification or the Solar Songs by Marty Quinn. In his Image Music4 sonification, the user can interactively explore a 2D image through music. However, nothing has been done yet in the domain of 3D volume data. Furthermore, the said examples are not intended for live music performances. The interaction is limited to mouse input that does not meet the high responsiveness demanded by a music performer.

4 Design Rhythmics Sonification Research Lab www.drsrl.com/

Besides the mapping, the method for interacting with the system is crucial for its efficiency. Like the aforementioned sonification systems, computer-based instruments mostly use either mouse/keyboard interaction, or are designed to be played with MIDI keyboards. These demand a certain skill in order to be adequately handled. Systems using the elements of direct interaction as a means for acoustic excitation are scarce. Instruments like the Fractal Composer introduced by Chapel, for example, provide a mouse-driven graphical user interface [5]. The system composes music using the MIDI protocol in realtime, depending on parameters which are set by the user. She or he has no direct control over the melody or harmony that is generated. This induces a big distance between the performer and the instrument. She or he can only influence the composition on a fairly high level. These systems are referred to as interactive instruments [4] or active instruments [5]. In contrast, the reacTable and the Sound Rose mentioned earlier are collaborative instruments that use direct interaction. Indeed, the tangible interface is very intuitive, though these attempts are currently limited to two-dimensional space. Besides the aforementioned reacTable and Sound Rose, the "Morph Table" system, which uses the morphing techniques presented in [25], is a good example of how this interface can be used for music generation [2]. However, the music is also controlled on a rather high level. The system generates transitions between a source and a target pattern, which is applied to precomposed melodies and rhythms. It is not possible to create a melody directly. Furthermore, it is limited to two dimensions.

Chadabe describes a system called Solo that uses modified theremins (see [21]) as 3D input devices to guide the system [3]. Again, the melody is generated algorithmically. The performer controls variables like tempo and timbre. The computer is used for sound synthesis. Thus, this approach is similar to those described in [5] and [2], as the performer has only a global influence on the generated music. However, we think that 3D input devices can be used to intuitively control both melody and accompaniment, where the former is generated through a direct mapping of the position to pitch, while the latter could benefit from semi-automatic composition or precomposed elements. This not only opens the path for diverse improvisations but can also be considered more immersive than just influencing certain aspects of music that is otherwise algorithmically generated.

Our system for interactive exploration of 3D volume data is applicable in that it provides the necessary degrees of freedom to have both aspects in one instrument, as well as the responsiveness demanded for a live performance. This makes it possible to develop metaphors for music and sound generation. Two are described in the next section.

3 Volumetric Music

Along the lines of traditional musical instruments, computer-based musical instruments have to find intuitive performative metaphors for musical events. A typical example: to strike one key on the piano means playing its corresponding pitch. The keystroke velocity regulates its loudness. The following sections will describe and discuss this mapping of spatial gestures to musical events and structures, in analogy to the previously discussed image and volume data sonification techniques. The volumetric data thereby represents the medium of interaction and defines the basis for the music processing.
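The piano analogy above (position selects a pitch, keystroke velocity regulates loudness) can be sketched as a gesture-to-note mapping. The normalised ranges and MIDI-style scaling below are assumptions for illustration, not the paper's implementation:

```python
# Sketch of a gesture-to-event mapping: a normalised vertical position
# selects the pitch and the gesture velocity sets the loudness, analogous
# to a piano keystroke. Ranges and scaling are hypothetical.

def gesture_to_note(y, velocity, y_min=0.0, y_max=1.0, low=48, high=84):
    """Map a height y in [y_min, y_max] and a velocity in [0, 1] to
    a (MIDI pitch, MIDI loudness) pair."""
    y = min(max(y, y_min), y_max)  # clamp the gesture into the active range
    pitch = round(low + (y - y_min) / (y_max - y_min) * (high - low))
    loudness = min(127, max(1, round(velocity * 127)))
    return pitch, loudness
```

The Tone Wall described next extends this idea: two axes carry the keyboard-like pitch/dynamics mapping, while the remaining axis and the punch velocity add timbre and attack control that a piano key cannot offer.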
Figure 4: Tone Wall

affects the attack and onset behavior of the tone. A fast punch causes a short attack (a very direct beginning of the tone) and a more percussive onset; a punch performed at a slow velocity results in a softer tone at the beginning, independent of its dynamic level.
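The velocity-to-onset behavior just described can be sketched as a simple interpolation (all constants and names are assumed for illustration; the paper does not specify values):

```python
# Sketch: punch velocity shapes the tone's attack. A fast punch gives
# a short, direct attack; a slow punch gives a soft tone beginning,
# independent of the dynamic level. All constants are assumptions.

MIN_ATTACK = 0.005  # seconds: shortest attack, for the fastest punch
MAX_ATTACK = 0.250  # seconds: softest onset, for a very slow punch

def attack_time(punch_velocity: float, max_velocity: float = 2.0) -> float:
    """Interpolate the envelope attack time from the punch velocity."""
    v = max(0.0, min(punch_velocity, max_velocity)) / max_velocity
    return MAX_ATTACK - v * (MAX_ATTACK - MIN_ATTACK)

fast = attack_time(2.0)  # short attack: a very direct tone beginning
slow = attack_time(0.2)  # longer attack: a softer tone beginning
```

The attack time only shapes the onset; the loudness of the tone is still controlled separately by the punch depth along the dynamics axis.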
Thus, the y- and z-axes open up the complete bandwidth of expressive tone forming known from keyboard instruments like the piano, and the punch velocity is a new means to specify details of the tone beginning. However, it would be unwise not to additionally exploit the potential lying in the x-axis. Many instruments allow the player to vary their timbre to a certain extent, and the x-axis is predestined for this. Different timbres can be blended from left to right, e.g. from a very dark sinusoidal waveform over relaxed, clear sound characteristics up to brilliant and very shrill sounds. There are no limitations in sound design in comparison to traditional musical instruments. The complete Tone Wall concept is illustrated in figure 4.

For more timbral variance and freedom, it is possible to fill the Tone Wall with volumetric data of varying density. It can be employed as static behavior or react on interactions, e.g. like particles that are charged with kinetic energy when they are hit by the interactor device. Due to the freedom to apply any sound synthesis method, the Tone Wall interface is not restricted to pitch-based melodic structures, but is also suited to more complex sound structures and noises for contemporary music styles.

3.2 Harmonic Field

In contrast to the Tone Wall concept, which specifies an interface to create basic musical events, the Harmonic Field is already a pre-composed musical environment, which can be freely explored by the performer. It defines a number of regions (as illustrated in figure 5) with their own harmonic content, e.g. a C major harmony in the grey area (harmony 1), a minor harmony in the yellow (harmony 2), a cluster chord in the area of harmony 5, and so on. The performer can move his focus via head-tracking interaction over the regions to change the harmony that is currently played; he literally looks to the harmonies to play them.

Each harmonic area defines a density gain towards the peak in its center. The density allocation can, of course, also feature more complex shapes, defining multiple peaks, holes and hard surfaces. The values can be used for fading techniques, such as those described in [1]; high density can be implemented with a louder volume than low density. But the Harmonic Field is not restricted to static tones only. Chords can be ornamented by arpeggiated figures, and compositional textures can be defined. Instead of a simple in/out fading, the texture density can be adapted: very simple, transparent textures in lower-density areas and figures rich in detail at higher densities.

Since harmonic areas can overlap, we applied a number of transition techniques other than fading, which does not satisfy in every situation. Held chords are transitioned part by part. Each part moves stepwise towards its targeted pitch, where the steps are chosen according to the underlying scale of the harmony (e.g., major, minor, or chromatic scale). Instead of a stepwise movement, the transition can also be done by linear glissando. The transitional pitch is an interpolation of the pitches of each harmonic area according to their density weightings. The goal pitch is reached when the old harmonic area is left, or a hole with zero density is found. With complex cluster-like harmonies, the resulting metrumless clouds wake associations with György Ligeti's Clocks and Clouds for women's choir and orchestra.

Figure 5: Harmonic Field

Compositional textures, in any respect, are not metrumless. They are well-defined sequences of pitches/events in a certain tempo and rhythm. In the case of different tempi, the transitional tempo is an interpolation depending on the density weighting. Since the textures are repetitive, the morphing techniques of Wooller and Brown [25] and the interpolation technique of Mathews and Rosler [15] can be applied to combine the figural material.

However, generative textures are not included at the current state. Therefore, transition techniques for generative algorithms have to be developed and are classified as future work.

3.3 Poly Field

When performing music, it is always desirable to be able to handle both melodic and harmonic data simultaneously. Thus, both interfaces, the Tone Wall and the Harmonic Field, have to be accessible and controllable by one person at the same time. This is achieved by employing two input devices

3.4 Discussion

As with all musical instruments, it is necessary to invest a certain amount of practice to learn the intuition and motoric sensitiveness for a confident, expressive play. The intuitive correspondence between gestural and musical events, especially in the case of the Tone Wall interface, turned out to be very supportive for a steep training curve. Nonetheless, a few practical issues have to be discussed.

The interaction with the Tone Wall is subject to a motoric limitation; it is quite exhausting to create fast-paced melodies with a proper play over a long period of time. Tracking latencies (ranging between 8–10 ms) and sampling artifacts (the interaction sample rate is 60 Hz with two interactors) also slightly interfere with the play and the possible speed of interaction.

Because of the absence of any visual reference points, it is at times difficult to meet the intended pitches. A calibration according to the size of the performer can lower this problem; his body can provide several reference points.

For playing melodic intervals, the interactor has to leave the wall, jump over the unwanted pitches, and punch back into it. Moving the interactor within the wall would trigger the pitches in-between. Thus, melodic intervals always come with short pauses; a legato articulation is not possible within this approach. Therefore, an interactor speed dependency has to be incorporated: a pitch is only played if the interactor's velocity is below a certain threshold. Pitches can then be skipped by faster movements even within the wall. Since this raises the problem of creating fast-paced melodies, this mode has to be detachable, e.g. by a button on the hand interactor.

The same approach could be useful to reduce the wah-effect when playing a pitch. The punch always hits the low-dynamics area at the wall surface first, and the loud dynamics afterward. Hence, each tone fades in, even with fast punches that should only effect a more direct tone attack.
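The proposed speed dependency amounts to a gating rule on the interactor velocity. A few lines sketch it (the threshold value and function names are assumptions for illustration, not from the paper):

```python
# Sketch of the proposed interactor speed dependency: a pitch only
# sounds when the interactor moves slower than a threshold, so fast
# sweeps through the wall skip unwanted pitches; a button on the hand
# interactor detaches the mode. Threshold value is an assumption.

SPEED_THRESHOLD = 0.5  # m/s, illustrative value

def should_trigger(interactor_speed: float, mode_detached: bool = False) -> bool:
    """Return True if the pitch currently touched should sound."""
    if mode_detached:                        # e.g. button held on the interactor
        return True                          # every touched pitch sounds again
    return interactor_speed < SPEED_THRESHOLD

# A slow movement inside the wall plays the pitch; a fast sweep skips it;
# detaching the mode restores fast-paced melody playing.
```

The same gate could drive a velocity-dependent sampling of the interactor to reduce the wah-effect described above.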
Although the interaction sampling rates used lower this effect, a velocity-dependent sampling of the interactor would make the dynamic level more accessible.

However, all performative means of expression are available and easy to perform: dynamics and emphasis, articulation, (de-)tuning, and timbral and articulational effects (glissando, trill, etc.).

For the Harmonic Field, the composer is free to define any chords, assign them to any timbral instrumentation and figurative ornamentation, and combine them by overlapping. He can actually define any compositional and timbral texture, and it can be explored freely by the player. The player, however, is fixed to this predefined set, unable to create new chords and textures interactively during the performance. Furthermore, the three-dimensional space cannot be explored adequately using head-orientation alone, i.e. looking at a harmonic area from a relatively fixed position, which allows only an exploration in 2D. The player should be able to move freely in 3D space. This raises conflicts with the Tone Wall metaphor. A possible solution is to position the Tone Wall always in front of the player and reposition it when the player moves through the Harmonic Field.

However, the combination of the Harmonic Field with the Tone Wall interface opens up a very large musical bandwidth with more timbral freedom than any traditional musical instrument can offer. The three-dimensional setup of harmonic structures and their density-dependent ornamentation textures is also unique and provides an inspiring platform, especially for performing contemporary music.

4 Conclusion, Future Work

In this paper we presented a gesture-based approach towards virtual musical instruments. We introduced the conceptual basis, which is a novel interaction mechanism developed for the interactive auditory exploration of volumetric data sets. For their sonification we devised the musical metaphors of the Tone Wall and the Harmonic Field, and conceived their sonic behavior in a way that the interaction with them produces musical events and aesthetic structures, like tones, melodies, timbre effects, chords, and textures. We discussed assets and drawbacks of these metaphors and outlined advancements.

3D interaction devices open up a multitude of new possibilities for the design of computer-based instruments. Their big potential lies in the intuitive association of physical human gestures with musical events, for which the interaction with virtual volume data turned out to be the medium of choice. Future work includes the development of further metaphors and the integration of serial and generative concepts. The volumetric interaction interface also opens up a promising possibility for the conducting of music.

The musical volume representation concept is also a novel view on musical structure and elements, enabling new compositional forms and means of expression. Here lies the biggest potential of new computer-based instruments. It is unnecessary to imitate traditional instruments to create music that is performed better with the real ones. If one wants to play a piano, violin, trombone, etc., the real ones always perform better. New instruments should not imitate them, but stand with a confident self-reliance, opening up possibilities for new music to constitute their right to exist.

References

[1] A. Berndt, K. Hartmann, N. Röber, and M. Masuch. Composition and Arrangement Techniques for Music in Interactive Immersive Environments. In Audio Mostly 2006: A Conf. on Sound in Games, pages 53–59, Piteå, Sweden, Oct. 2006. Interactive Institute, Sonic Studio Piteå.

[2] A. R. Brown, R. W. Wooller, and T. Kate. The Morphing Table: A collaborative interface for musical interaction. In A. Riddel and A. Thorogood, editors, Proceedings of the Australasian Computer Music Conference, pages 34–39, Canberra, Australia, July 2007. Australian National University Canberra.

[3] J. Chadabe. Interactive Music Composition and Performance System. United States Patent Nr. 4,526,078, July 1985. Filed Sep. 1982.

[4] J. Chadabe. The Limitations of Mapping as a Structural Descriptive in Electronic Instruments. In Proceedings of the Conference on New Instruments for Musical Expression (NIME-02), Dublin, Ireland, May 2002.

[5] R. H. Chapel. Realtime Algorithmic Music Systems From Fractals and Chaotic Functions: Towards an Active Musical Instrument. PhD thesis, University Pompeu Fabra, Department of Technology, Barcelona, Spain, Sept. 2003.

[6] A. Crevoisier, C. Bornand, A. Guichard, S. Matsumura, and C. Arakawa. Sound Rose: Creating Music and Images with a Touch Table. In NIME '06: Sixth International Conference on New Interfaces for Musical Expression, pages 212–215, Paris, France, 2006. IRCAM, Centre Pompidou.

[7] W. T. Fitch and G. Kramer. Sonifying the body electric: Superiority of an auditory over a visual display in a complex multivariate system. In G. Kramer, editor, Auditory Display: Sonification, Audification, and Auditory Interfaces, Boston, MA, USA, 1994. Addison-Wesley.

[8] C. Heeter and P. Gomes. It's Time for Hypermedia to Move to Talking Pictures. Journal of Educational Multimedia and Hypermedia, Winter 1992.

[9] A. Hunt and T. Hermann. The Importance of Interaction in Sonification. In ICAD 04: Tenth Meeting of the International Conference on Auditory Display, Sydney, Australia, July 2004.

[10] S. Jordà, M. Kaltenbrunner, G. Geiger, and R. Bencina. The reacTable. In Proceedings of the International Computer Music Conference, Barcelona, Spain, 2005. International Computer Music Association.

[11] K. Jørgensen. On the Functional Aspects of Computer Game Audio. In Audio Mostly 2006: A Conf. on Sound in Games, pages 48–52, Piteå, Sweden, Oct. 2006. Interactive Institute, Sonic Studio Piteå.

[12] M. Kaltenbrunner, S. Jordà, G. Geiger, and M. Alonso. The reacTable: A Collaborative Musical Instrument. In Proceedings of the Workshop on "Tangible Interaction in Collaborative Environments" (TICE), at the 15th International IEEE Workshops on Enabling Technologies, Manchester, U.K., 2006.

[13] G. Kramer, editor. Auditory Display: Sonification, Audification, and Auditory Interfaces. Addison-Wesley, Boston, MA, USA, 1994.

[14] M. V. Mathews. The Digital Computer as a Musical Instrument. Science, 142:553–557, Nov. 1963.

[15] M. V. Mathews and L. Rosler. Graphical Language for the Scores of Computer-Generated Sounds. Perspectives of New Music, 6(2):92–118, Spring–Summer 1968.

[16] R. Minghim and A. R. Forrest. An Illustrated Analysis of Sonification for Scientific Visualisation. In IEEE Conference on Visualization, Atlanta, USA, Oct. 1995.

[17] N. Röber and M. Masuch. Playing Audio-only Games: A compendium of interacting with virtual, auditory Worlds. In Proceedings of the 2nd DIGRA Games Conference, Vancouver, Canada, 2005.

[18] D. Rossiter and W.-Y. Ng. A system for the complementary visualization of 3D volume images using 2D and 3D binaurally processed sonification representations. In Proceedings of the 7th Conference on Visualization, pages 351–354, San Francisco, USA, 1996. IEEE Computer Society Press.

[19] L. Spiegel. Music Mouse. http://retiary.org/ls/programs.html, 2004.

[20] L. Stockmann. Designing an Audio API for Mobile Platforms. Internship report, 2007.

[21] L. S. Theremin. Method of and Apparatus for the Generation of Sounds. United States Patent Nr. 73,529, Dec. 1924.

[22] P. Vickers and B. Hogg. Sonification abstraite/sonification concrète: An 'Æsthetic perspective space' for classifying auditory displays in the ars musica domain. In ICAD 06: 12th International Conference on Auditory Display, June 2006.

[23] C. Ware and S. Osborne. Exploration and virtual camera control in virtual three dimensional environments. SIGGRAPH Comput. Graph., 24(2):175–183, 1990.

[24] J. Williamson, R. Murray-Smith, and S. Hughes. Shoogle: Excitatory Multimodal Interaction on Mobile Devices. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 121–124, New York, USA, 2007. ACM.

[25] R. W. Wooller and A. R. Brown. Investigating morphing algorithms for generative music. In Third Iteration: Third International Conference on Generative Systems in the Electronic Arts, Melbourne, Australia, Dec. 2005.
Same but Different – Composing for Interactivity
Abstract. Based on experiences from practical design work, we try to show what we believe are the similarities and differences between composing music for interactive media and composing linear music. In our view, much is the same, built on traditions that have been around for centuries within music and composition. An essential difference is that the composer writes programming code. Instead of writing one linear work, he creates an infinite number of potential musics that reveal themselves as answers to user interactions in many situations. Therefore, we have to broaden our perspectives. We have to put forward factors that were earlier implicit in the musical and music-making situations, no matter if it was the concert hall, the church, or the club. When composing interactive music we have to consider the genre, the potential roles the listener might take, and the user experience in different situations.
living tapestry, which they create together. Or they can just chill out and feel the vibrations from the music, sitting in the largest modules.

The installation has a 4-channel sound system that makes listening a distributed experience. ORFI consists, for the time being, of 8 genres, or collections of rules, which the user can change between. Our use of the term "genre" has references to popular culture, such as music, and everyday activities made when consuming the music, such as dancing [3, 4]. In ORFI we explore 8 different musical genres:
JAZZ (bebop jazz band, dancing, ambient),
FUNK (groove, dancing),
RYTM (techno, club),
TATI (speech, onomatopoeic, movie),
GLCH (noise, club),
ARVO (ambient, relaxation),
MINI (minimalist instruments, playing with toys),
VOXX (voice recordings generated dynamically by the user).

In this paper we have chosen to describe the compositions in the JAZZ and MINI genres, because they represent opposites among the genres and therefore serve as explanatory examples. The many possibilities, such as the many distributed wireless modules and the many genres to choose between, reflect our goal to facilitate collaboration and communication on equal terms between different users in different use situations.

New Situations and Roles
One of the aspects we have put a lot of effort into when creating ORFI is related to the use or consumption situation. We do not know, and cannot control, in what situation and for how long ORFI will be played and listened to. This differs in a fundamental way from composing music for a stage performance. In a stage performance one knows implicitly that the audience will sit in the dark, facing the stage, quietly listening for one or two hours. Radio listening is more like our situation, but here one usually knows, by knowing the time, what everyday ritual the radio program is part of [5]. Music as an ambient sound tapestry in a home is also like our situation, but here the user is limited to turning the music on or off, or changing to the next tune. These actions represent a break in the continuity of the listening experience.

In ORFI the interaction shall be a seamless [6] part of the music experience. The music must therefore invite and motivate dynamically to interaction, to the co-creation of the music experience. In this sense, ORFI is more like the improvisational musician, but in ORFI we cannot count on the user having professional musical know-how. It must be satisfactory to both musical professionals and people with little music competence, if we are to reach our ambitious goals.

In ORFI the audience changes continuously between roles in different situations, from being a passive listener, to a musician and a composer. Through long use the user gets deeper knowledge about ORFI's complexity, and the user becomes more like the improvisational musician, who with his competence creates music on an instrument in real time. But the user also comes nearer to, and becomes more like, a composer over time. The user becomes like a co-composer who, based on the potential the composer and writer of the software has formulated, "composes" music by choosing and mixing music together. In this way the real composer, who has written the software, is present in the installation by continuously giving musical answers, offering new musical possibilities to the user, the co-composer. This is also a major difference between linear and interactive music composition.

The Interactive Challenge
The composer of interactive music does not write notes on paper or mix sound samples together into a linear track. He creates music and software which totally or partly are the same, depending on whether the music elements are programs or sound samples. This means that the composer composes potential music, and software that controls the potential relations between music elements: music elements that will follow each other, or lie as layers on top of each other, and be distributed in the 4-channel system, all depending on what the user does, which can never be predicted exactly. Writing software represents a totally different potential than writing for an instrument, because the computer can wait, remember and learn in a more or less intelligent manner. Therefore one can write software so that the installation or the interactive medium behaves more or less like an active actor [7] instead of as an instrument. On a traditional instrument, a musical gesture will produce an immediate mechanical sound response [8]. Writing software for a computer, one can decide that a gesture or interaction from the user will, after a while, create a more complex musical answer. This is more like the improvisational musician, who after some time comes with his answer to your solo play.

In ORFI we use both strategies in order to offer multiple possibilities in all situations [2]. This means that the user interacting with ORFI gets a direct, immediate answer in light and sound, as when playing on an acoustic instrument. But after a little while he gets a complex musical answer, to motivate further co-creation with ORFI. Examples of how this is composed will be presented under "Interactive Music Composition".

ORFI is created so it continuously invites collaboration in different ways and through different media and forms [9]. Since we have the very ambitious goal that ORFI shall work satisfactorily for most users, in many situations, over a long time, an open concept of collaboration is necessary. Nothing is the right way to do something. Nothing is wrong. It is as right to listen and sleep in the interactive landscape as it is to throw the modules between each other while playing. It is equally right to build and shape one's own interactive landscape as it is to concentrate and move the wings rhythmically for some minutes. To offer this amount of openness we have put a lot of effort into the design of ORFI, in order to offer qualities like robustness, sensitivity, ambiguity, musicality and consistency.

ORFI has to be very robust physically, to handle being thrown, stepped on and bent intensely. But it also has to be robust, tolerant and sensitive in the software and hardware, to register a weak movement from a child's hand and attempt to follow the rhythm.

ORFI offers many visual, physical and musical possibilities in many situations. It tries to answer and encourage the users in different ways to musical interaction, in spite of their competence, or lack of competence, in music. This means that for some users, in some genres, the lights' rhythmical blinking motivates rhythmic interaction. But in other situations, for other users, it is the complex dynamic graphics that give the user a visual image and motivate the user to create the musical narrative.

All these possibilities open up various experiences for different users, depending on the individual's competences and experiences with ORFI. It has been very important for us to design ORFI to offer many possibilities in every situation: ambiguity [10]. But it has also been very important to give ORFI a clear and unique identity, so that ORFI might act as a convincing actor in a collaboration or improvisation. The continuous change of roles the user can make, the many possibilities the users are offered, the potentially infinite uses and
the many consumption situations make the interactive composition challenge much more complex than in linear music. The fact that the nature of interactive music composition is software, and not notes or samples, also makes it necessary to structure the interactive composition in another way than linear music.

Interactive Music Composition
So how have we met the interactive challenge of composing potential music? How have we created algorithms and rules in programming code that regulate the relations between musical elements, the conditions for potential music? And how did we compose music that motivates both professional musicians and laymen to interact? With concrete examples we will try to show how we have composed the music for ORFI, exemplified by the two most diverse genres.

We have chosen to structure ORFI's software and music composition into the following layers: sound node, composition rule, and narrative structure (see Figure 3). Sound nodes are the smallest musically defined elements, such as tones, chords, or rhythmic patterns. The sound nodes can be joined into sequences or parallel events by composition rules (algorithms), forming phrases, cadences, or rhythmic patterns. The user experiences these phrases as narrative structures based on a genre.

Figure 3: Structure of the interactive music composition software and interface in ORFI.

Figure 3 shows two users interacting (bottom) with input sensors A and B. The composition rules, written in programming code, select the saxophone sound nodes (sax 1, 5, 7, 2) based on the users' interaction. Another composition rule creates the switch from "ground 1" in high tempo to "ground 2" in slow tempo, so that it synchronises smoothly with the pulse without creating a break. Over time the user creates the narrative structure of an 8-bar jazz blues that motivates further interaction.

Sound Node
definition, mediation and qualities
We call the smallest musically defined elements sound nodes. They are categorized by sound qualities like length, instrument, pitch, harmony, tempo, meter, etc. Based on the sound qualities and the composition rules, the program chooses and creates the narrative, e.g. a melodic motif, where the expressive qualities depend on user interaction. A sound node can be a linear sound file or programming code. We have chosen to present the JAZZ and MINI genres to show our solutions in the two cases, sound samples vs. code. The accompaniments (ground), horn riffs and saxophone sound nodes in the JAZZ genre are sound files. The melodic patterns in the MINI genre are programming code. The difference between a sound file and programming code is that the auditive result formulated in programming code varies dynamically with each interaction, while the sound file is essentially the same each time it is played. This makes the programming code potentially more flexible, since it can vary with user interaction.

creation of a node
Similar to traditional jazz, the blues in our JAZZ genre is composed and recorded by a jazz ensemble [11]. Each musician has recorded his instrument until the result is mixed down to a jazz song. After recording the music, we have cut and grouped the recorded instruments into separate sound files. Then we have arranged the files interactively by writing rules for ORFI [2]. Our arrangement builds on the style's traditional "improvisation on a theme".

nodes and node structure
In traditional jazz the musicians play instruments with direct response only. In ORFI the user might instead play on 20 physical soft modules. When interacted with, each module plays three different saxophone sound samples depending on the situation. The reason for this solution is that we wanted to be able to vary the expression from soft, Ben Webster-like sound nodes within the Dorian scale, to hard, growling and dissonant sound nodes outside the Dorian scale, and percussive saxophone pad sound nodes. Which sound nodes the program combines, and how, depends on whether the users are active or passive, interact on their own or collaborate, and synchronise to the musical beat or not.

roles and experience
Similar to jazz, our interpretation has separate roles tied to the different instruments. We use the tenor saxophone in the soloist role, blues drums and walking bass as accompaniment (ground), and horn riffs played on saxophone, trombone and trumpet. An important difference is that the interacting users can continuously choose to change between the roles of improviser, soloist and accompanist, by choosing the module to play on, etc. Therefore the roles are potential and open for interpretation rather than definite.

An example: when improvising, a saxophonist creates music from pre-composed short motifs. He also creates phrases from two contradictory curves of tension, amplitude and vibrato. The amplitude curve goes from strong to weak (>), and the curve for vibrato goes in the opposite direction, from little to much vibrato (<). We use the same strategy when composing interactive music. This results in a sound node that potentially has two gestures at the same time, a decreasing and an increasing gesture. These contradictory curves of tension can potentially function as a start tone, building up tension in a phrase, and as an end tone, finishing the phrase and creating a release. The user, both layman and musician, can choose to hear it as tension or release depending on the situation. In our JAZZ genre, laymen can use the saxophone's tension-creating curve for other purposes than the professional musician, for instance speeding up a movement, rolling the body over the soft modules spread out on the floor, while communicating with a friend.

response and experiences
The melodic pattern generated by the programming code changes dynamically with the user interactions and the other melodic patterns playing at the same time. So when the user
interacts, the software realises one out of many possible melodies. Similar to playing on traditional instruments, the user gets a direct response when interacting, and at the same time the user contributes to a musical whole. The MINI genre gives an immediate response in simple 2–6 note melodic motifs. The sound nodes also contribute to a complex and musically satisfying response that motivates users to interact with others over a longer time. Similar to traditional music, our JAZZ genre uses different instruments to create complex variations and contrasts between the instruments. This is also the case for groups of sound nodes within an instrument.

code vs. tune
The sound nodes in the MINI genre are inspired by minimalist music in the style of Steve Reich [12]. Similar to minimalist music, our genre is characterised by repetitions and small variations of short rhythmical and melodic motifs, rather than large-scale development such as phrasing or sonata form. With less happening on a macro level, the focus is directed towards the surface and the micro level of small changes in melody, rhythm and timbre. What makes our MINI genre different is that every sound node is a program (see Figure 4).

SynthDef(\pattSynth, {|out= 0, freq= 440, amp= 0.1, atk= 0, rel= 0.5, max= 40|
    var e= EnvGen.kr(Env.perc(atk, rel), doneAction:2);
    var f= EnvGen.kr(Env.perc(0, 0.01), 1, 1, 0, Rand(0.95, 1.1));
    var z= SinOsc.ar(freq*[1, IRand(0, 3).round(1).max(0.5)+Rand(1, 1.02)], f*Rand(10, 40), amp*0.1);
    Out.ar(out, e*z);
}).store;

Figure 4: The sound node programming code for a synthesised marimba, written by Fredrik Olofsson in SuperCollider [13, 14].

Composition Rule
definition and mediation
Similar to traditional music, the composition rule is composition knowledge, which the composer uses to create the traditional musical work. It can for instance be knowledge about how to create relations between tones, rhythm, melody, timbre and harmony in music. Different from traditional music is that the composition rules are

moment to play back motifs, and selects sound nodes that add variation to the harmony and musical phrasing. Unlike traditional music, laymen can communicate with each other directly through the music. In ORFI laymen interact actively, and the program responds to individual as well as collective interactions. An obstacle in traditional music is that it is hard to make music: it is hard for a layman to keep the rhythm, pick the right notes and create musically satisfying phrases. In ORFI it is different: the program and its composition rules are tolerant, making it easier for laymen to synchronise to the pulse. The composition rules tolerate deviations from what is rhythmically correct and synchronise motifs to the harmony and the pulse in other sound nodes. The result is the avoidance of technically difficult situations, and the laymen can instead focus on communication and collaboration with others.

composition techniques in JAZZ
We have been inspired by traditional cool jazz, with its modal harmony and the rhythmically improvised, laid-back performing style of such artists as Miles Davis and Ben Webster [15, 16]. In cool jazz a saxophonist can make it sound great by letting the instrument wander casually along a modal scale. He searches his way along the background of drums and the falling fifths of a walking bass. In cool jazz it is customary to use themes in modal scales with fewer chords, in order to make it easier to improvise freely, with focus on rhythm and musical expression. Similarly, we use Dorian modality in ORFI to motivate improvisation and interaction. As in cool jazz, our interactive music composition also uses effects like a growling and dissonant saxophone with harsh timbre as a musical rhetoric technique. Unlike traditional jazz, we use the growling and dissonant saxophone to express, stage and dramatize the conflict when many users interact simultaneously, for instance when many people play and tease each other by interacting with many ORFI modules at the same time. The result is that the program creates many growling noises outside the Dorian scale, in addition to the user-created soft and consonant tones. As in traditional jazz, we use soft consonant leading notes for making musical ornaments. These motivate improvisation, such as call-and-response communication and duets between musicians. Different is that the soft and consonant leading notes are used to
programming code realised through use. For instance the sound express pauses, motivating turn-taking between laymen. When
files in the JAZZ genre are controlled by the composition rules one or many users make pauses while in a sequence interact-
in a program that consider both the musical and the interactive stop-interact-stop-interact, etc., the composition rules create soft
development over time. Another example is the MINI genre leading notes in Dorian scale. This in addition to the ones that
melodic pattern, where both sound nodes and composition rules the user has chosen. The result is that the user becomes aware of
are formulated as programming code (figure 4) and where the the silent pause between the interactions, and the relations
difference between sounds and rules is lacking. The result is that between his own actions and actions of others. This motivates
the music can change dynamically, and that it sounds differently dialogue, imitations and play in call-and-response manner.
over time, with different users and situations. composition techniques in MINI
competence and experience Similar to traditional minimalist music, the ORFI motifs borrow
Similar to traditional music, a musician can use his polyrhythmic techniques from Gamelan, African and Middle
improvisational competences for making music by joining pre- Ages music. Here, polyrhythmic and harmonic gaps in the
composed elements together. rhythmic patterns, make them fit into each other, creating
Differently from traditional music is that a layman with less “hocket” patterns. This motivates improvisation and interaction.
musical competence can interact with the program and its A difference is that minimalist motifs are used to express
composition rules. The program interprets the interaction, delays contrasting and varying responses that motivate laymen. The
and changes the response in order to make it musically composition rules then vary the pattern so that the hocket effects
satisfying, according to the composition rules. disappear, in order to reappear when the rule for variation is
The composition rules regulate the synchronisation of motifs to active again. A difference is that the music varies with the
the pulse after every user input. It also regulates tonal and number of users interacting, giving dub-delay effects to one user
harmonic development so that they don’t contradict the genre and reverb to another. The result is a blurred and distorted
rules. Instead, the program waits for a rhythmically suitable effect. The effect is used to separate between two individual
laymen motivating them to collaborate.
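The tolerant pulse synchronisation described above can be sketched as a forgiving onset quantiser: imprecise lay input is snapped to the beat only when it is close enough, and left alone otherwise. This is our own illustrative reconstruction, not ORFI's actual rule code; the function name, the tolerance value and the example onsets are assumptions.

```python
def quantise_onset(onset, beat_period, tolerance=0.25):
    """Snap a user onset (seconds) to the nearest pulse position if it
    falls within `tolerance` (a fraction of the beat period); otherwise
    keep the onset exactly as played."""
    nearest_beat = round(onset / beat_period) * beat_period
    if abs(onset - nearest_beat) <= tolerance * beat_period:
        return nearest_beat
    return onset

# A layman clapping slightly off a 0.5 s pulse (120 BPM): small
# deviations are absorbed, a clearly off-beat onset is left as played.
onsets = [0.48, 1.07, 1.51, 2.30]
snapped = [quantise_onset(t, 0.5) for t in onsets]
```

A real system would additionally align the quantised motifs to the harmony of the other sound nodes; this sketch shows only the rhythmic half of such a rule.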
Narrative Structure

definition
We use the term narrative structure to describe structures for connecting series of events in ORFI, creating experience and expectations about future musical output.

role, action and expectations
The difference between linear music and ORFI's music is that the composer has to negotiate the narrative structure with interacting users as well as passive listeners. Similar to linear music, in ORFI there are often opposing or contradictory expectations about what the narrative structures mean. For example, a melodic structure driving the music forward creates expectations of tension and a crescendo. In the same piece, a rhythmic pulse in the ground might create expectations of bodily movement and dance to the pulse. Similar to linear music, it is possible for the users in ORFI to negotiate meaning, following or denying expectations about the narrative structure.
Similar to traditional jazz, the narrative structure of our JAZZ genre follows the development and tension in an 8-bar jazz blues structure. Traditionally the blues structure is the ground for the soloist's improvisation over a repeated series of chords and a pulse. Often the soloist creates expectations that follow the convention, playing as many rounds as he thinks suit him. When he feels ready to hand over to the next soloist he gives a sign, making cadences or finishing riffs. And the next person, eager to make his interpretation and show off to the audience, takes over. Building up to the moment just before the start of a new round in bars 7-8, there exists a short period of 6-8 beats where the tension is at its top and the negotiation is strongest.

interpretation and negotiation
A difference in our JAZZ genre is that the system can analyse whether the user is synchronising his actions to the pulse in the accompaniment. If he succeeds, ORFI answers with rewarding horn riffs, stressing the harmonic, periodic and rhythmic development in the blues.
Another difference is that the blues accompaniment with drums and bass in our JAZZ genre can be used to negotiate the narrative structure. This is often done by users playing and craving more musical variations of a certain riff. We have found that the accompaniment in addition can motivate two users playing a game, dancing, or a person lying down resting without focus on the music.
Another difference is that the accompaniment is divided into three ground beats in different tempi, creating possibilities for the user to start, stop, change tempo, play together with horns, etc. This increases the possibilities for the user to negotiate what, and how strong, the narrative structure should be. When interacting, the active users' actions and the references to activities like dancing, playing, creating music etc. produce, uphold and nurture a narrative structure that potentially invites other users.

genres and experience
The traditional minimalist narrative structure follows the development on a micro level, with fewer expectations on large-scale form development. It is almost a contradiction that anything relevant should happen on the macro level in minimalist music. Instead the expectations should be directed towards the micro level and the tiny variations we can hear if we sharpen our senses.
A difference in our MINI genre is that the system organises the synchronisation of the melodic patterns to the pulse. The synchronisation frees the user from the responsibility to keep track of the beat. Instead, we have found, it creates possibilities for the user to focus on the communication and improvisation with others.
The biggest difference, however, is that the user can choose to negotiate what role to play, and whether he wants ORFI to be a tolerant minimalist sound carpet to sink down into, or a melodic toy to throw and play with, or an improvising partner and active actor, continuously inviting and motivating communication.

same and different
We have tried to show how we have composed music for the MINI and JAZZ genres in ORFI. We have found the comparison between traditional popular music and its application to interactive music to be very successful. Much of music's expressive qualities and its variation and repetition techniques are the same in interactive and linear music. A great deal of traditional knowledge about analysis and composition of music can be transferred to interactive music.
The differences we found in the design of ORFI are primarily tied to user expectations, and to structures in composition rules and narrative structures that support those expectations. Our experience is that interacting users need immediate response to be able to orient themselves and find their way, as well as more complex response in order to stay motivated to continue interacting over time. We often found that musically complex structures or processes found in traditional music could strengthen other situations, with laymen interacting and playing alone and in collaboration with other people.

Add Perspectives
In our paper we have tried to show what we believe are the similarities and differences between composing music for interactive media compared to linear music. In our view, much is the same, built on traditions that have been around for centuries within music and composition. However, our main conclusion about the new auditive medium is that we have to broaden our perspectives. We have to put forward factors that earlier were implicit in the musical and music-making situations, no matter if it was the concert hall, the church, or the club. When composing interactive music we have to consider the genre, the potential roles the listener might take, and the user experience in different situations.
The consumer situation in interactive media is dynamically changeable. Interactive music consumption can take place at home, in the street, at school. It doesn't need to be static, predestined and hierarchical, with the professional and recognised musician on stage and an anonymous audience in darkness. In the concert hall or the club the sound comes from a centrally placed sound system. In interactive media, however, the sound can be distributed and mobile, so that it moves and follows the persons interacting.
The persons consuming the sound are not passive listeners anymore, but active users, able to dynamically shift between roles by choosing their position in space and their relations and roles to other people and the music. The user can take part in changing the sound experience in real time, based on the rules the composer has created as a potentiality in the software. This differs in a significant way from the jazz improviser or the professional musician. The fact that the composer writes programming code is an essential difference. Instead of writing one linear work, he creates an infinite number of potential pieces of music that reveal themselves as answers to user interactions in many situations. This might be like an instrument responding to a musical gesture, or a competent and intelligent actor answering musically in an improvisation session. But everything has to be formulated in advance as rules in the software. The challenge is to create music, through user interaction, that motivates further co-creation of the music and the moving-image narrative. Everything has to be formulated in advance, based on genre and music knowledge and competence in social behaviour. It's all about broadening the perspective to look wider, further and deeper.
Same but Different – Composing for Interactivity
Acknowledgements
Without Fredrik Olofsson's unique artistic and technological competence and his knowledge in developing music, hardware and software, ORFI could not have been created. We
also thank Jens Lindgård, Petter Lindgård and Sven Andersson
for their work with music. We thank the Swedish
Inheritance Fund and Borgstena Textile AB for their
contributions. We thank Interactive Institute and K3 Malmö
University for being a source of inspiration to our work in the
group MusicalFieldsForever.
References
[1] Andersson, Anders-Petter, Cappelen, Birgitta, Olofsson,
Fredrik, ORFI, interactive installation, MusicalFieldsForever,
Art’s Birthday Party, Museum of Modern Art, Stockholm,
(2008)
[2] Cappelen, Birgitta & Andersson, Anders-Petter, From
Designing Objects to Designing Fields - From Control to
Freedom, Digital Creativity 14(2): 74-90, (2003)
[3] Fabbri, Franco, A theory of Popular Music Genres: Two Applications, Popular Music Perspectives, Horn, D. & Tagg, P. (ed.), Göteborg and Exeter: A. Wheaton, 52-81, (1982)
[4] Holt, Fabian, Genre in Popular Music, University of
Chicago, (2007)
[5] Tacchi, Jo, Radio Texture: between self and others, Material
Cultures, Why some things matter, (ed.) Miller D., London,
(1998)
[6] Weiser, Mark, The Computer for the Twenty-First Century, Scientific American, 265(3): 94-104, (1991)
[7] Latour, Bruno, Pandora's Hope, Essays on the Reality of
Science Studies, Cambridge, MA; London, UK: Harvard
University Press, (1999)
[8] Godøy, Rolf Inge, Haga, Egil, Refsum Jensenius, Alexander,
Exploring Music-Related Gestures by Sound-Tracing. A
Preliminary Study, Congas, Leeds, (2006)
[9] Crawford, Chris, On Game Design, US, (2003)
[10] Andersson, Anders-Petter & Cappelen, Birgitta, Ambiguity
- a User Quality, Collaborative Narrative in a Multimodal User
Interface, Proceedings AAAI, Smart Graphics, Stanford, (2000)
[11] Lindgård, Petter, Lindgård, Jens, Andersson, Sven (music),
Andersson, Anders-Petter (arr. & composition rules for
interactive installation), JAZZ genre, Do-Be-DJ/Mufi,
MusicalFieldsForever, (2000)
[12] Reich, Steve, Music for 18 Musicians, Recording, ECM,
(1978)
[13] SuperCollider, http://www.audiosynth.com, (2008)
[14] Olofsson, Fredrik (music and composition rules for
interactive installation), MINI genre, ORFI/Mufi,
MusicalFieldsForever, (2007)
[15] Davis, Miles, Birth of the Cool, Recording, Capitol, (1950)
[16] Webster, Ben (arr.), Arlen, H., Koehler, T., Ill Wind, Recording, Soulville, Verve, (1957)
The HarmonyPad
- A new creative tool for analyzing, generating and teaching tonal
music
G. Gatzsche, Fraunhofer IDMT, Ilmenau, Germany, gabriel.gatzsche@tu-ilmenau.de
M. Mehnert, Technische Universität Ilmenau, Germany, markus.mehnert@tu-ilmenau.de
D. Gatzsche, Staatl. Berufsbildende Schule "Janusz Korczak", Weimar, Germany, david.g@tzsche.de
K. Brandenburg, Technische Universität Ilmenau, Germany, Karlheinz.brandenburg@tu-ilmenau.de
Abstract. Learning a classical musical instrument is a challenging task that requires long-term practice, demanding motor skills and intensive training. In that context the following challenges exist: 1.) Students often perceive pure score reading and reinterpretation as boring. Teaching musical improvisation in addition would solve this problem. 2.) But to be able to improvise, the student first has to reach a certain technical level. 3.) Pure score reading and reinterpretation does not automatically train the ability to improvise. Often very good score players have difficulty accompanying a given melody. To overcome these problems a new musical instrument is proposed that has the following properties: The instrument is very easy to play. Its interface is designed so that important structural properties of tonal music become geometrically apparent: the relationships between tones, intervals, chords and keys, functional aspects, aspects of consonance and dissonance, and aspects of tension and resolution. The instrument can be played without extensive motor skills or prior music-theoretical knowledge. Through the usage of the proposed device the student implicitly acquires knowledge about musical structure, which in turn helps to compose, improvise, analyze musical pieces or accompany a given melody. Teachers can use the instrument to teach music theory.
Figure 1: The HarmonyPad

1.1 Pitch space based musical instruments
The proposed creative tool is a so-called pitch space based musical instrument. Pitch space based musical instruments consist of three main components: the pitch space, the pitch selection and the user interface.

The pitch space defines a geometric arrangement of pitches. Similar to color spaces, the pitch space arranges tones in a way such that semantic relationships between the tones become geometrically apparent. Such aspects are for example consonance or dissonance, chordal or melodic grouping, cognitive similarities or simply the pitch height. The better the tones are organized within a pitch space, the easier it is to generate a wanted sound, i.e. tone combination.

The selected area is a subregion of a given pitch space which contains the tones to be played. If tones that belong musically together are located in neighborhood, then it is possible to create meaningful tone combinations through the definition of simply shaped selected areas. By transforming the selected area (e.g. translation, scaling, rotation, inversion, ...) it is possible to transform a selected chord into another one. Furthermore it is possible to assign alpha values to every point of the selected area, which additionally assigns an individual weight to every selected pitch.

The user interface is the means by which the musician controls the parameters of the pitch space (e.g. the geometric arrangement of the pitches within the pitch […]

2 The user interface
The user interface of the HarmonyPad has to be some kind of touch-sensitive surface. This can be a normal touch screen, a multi-touch surface, a button matrix, a pen display or also an innovative controller like the Reactable¹. The examples of the following explanations have been implemented on the JazzMutant Lemur as a multi-touch controller and on an Elo touch screen. The advantage of the Elo touch screen is the possibility to use the standard graphics driver to visualize the pitch space, the selection function and also musical data. This in turn allows a direct interaction with the visualized pitch space. The player can touch and play the shown tones or can transform the selected area. The drawback of the Elo touch screen is that it is not able to process multiple touches simultaneously. This problem is solved with the JazzMutant Lemur, which provides the information of up to ten touch points. Another big advantage of the Lemur is the possibility to define different user interface configurations very easily. The drawback of the Lemur is that it is not possible to program and show complex data visualizations and geometric models.

3 The pitch space
The HarmonyPad consists of two pitch spaces, a first space to set up the key and a second pitch space to generate the actual chords. The pitch space that is used to set up the key is the circle of fifths, which is shortly denoted […]

¹ Within this paper we focus on touch-based techniques, but it is also possible to use other controllers to control the pitch space and the selected area (see Section 2) to get the music out of the pitch space. It would for example be possible to use a joystick-like controller like the 3Dconnexion SpaceNavigator, a Theremin-based controller or an innovative game controller like the Nintendo WiiMote.
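The three components described above can be made concrete in a small sketch. This is our own simplification under assumed names (a one-dimensional pitch height space instead of the HarmonyPad's circular spaces), not the actual implementation:

```python
# Pitch space: here the simplest possible geometric arrangement, mapping
# each pitch to a one-dimensional "pitch height" coordinate.
pitch_space = {"c'": 0, "e'": 1, "g'": 2, "c''": 3, "e''": 4}

def select(space, low, high, alpha=lambda pos: 1.0):
    """Selected area: return (pitch, weight) pairs for every pitch whose
    coordinate lies in the subregion [low, high]; `alpha` assigns an
    individual weight to every selected point."""
    return [(p, alpha(pos))
            for p, pos in sorted(space.items(), key=lambda kv: kv[1])
            if low <= pos <= high]

# Selecting positions 0..2 yields a C major triad; translating the
# selected area by one step transforms the chord into its inversion.
triad = select(pitch_space, 0, 2)
inversion = select(pitch_space, 1, 3)
```

Transforming the selected area (here a simple translation) is what turns one chord into another, exactly as described for the pitch space based instrument above.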
shown in Figure 3 the chord transition e′, a′, c′′ → a′, c′′, e′′ is generated.

Figure 3: The pitch class/pitch height space: The first dimension of a two-dimensional space represents the pitch height and the second dimension the pitch class. The pitch space allows to generate important chords and chord transitions by defining and moving rectangular areas.

4 The selected area
Principally it is possible to define arbitrarily shaped selected areas to select and play tones. But the goal of this section is to propose an effective set of parameters to describe the selected area. Through this parameterization it is possible to reduce the number of tasks that have to be performed by the player to generate a […]

[…] pressed button sound completely different than a whole-tone shift? Through the geometric arrangement of tones within the pitch space, meaningful or often used tone combinations are in geometric neighborhood and stand in a simple geometric relationship. This in turn makes it possible to define a simply formed shape which covers neighbored pitch classes. Through a translation of the shape or a controlled change of the shape's dimensions the desired sound can be formed. To make this possible we have to think about a simple parameter set to control the shape of the selected area. Such a set of parameters is presented now. The proposed parameter set is explained using the TR, but it can be applied to every circular pitch space⁵. In Section 3 we proposed to represent pitches in a two-dimensional coordinate system. In the example of the TR this coordinate system is a polar coordinate system, where the first dimension is the angular dimension and the second dimension is the radial dimension. Therefore the parameters start angle, apex angle, start radius and apex radius are proposed to control the shape of the selected area. This is illustrated in Figure 4: The grey shape represents the selected area, which is described by the four parameters named before. We will discuss these parameters in more detail now.

[Figure 4: the circular pitch space (pitch classes C, e, G, ..., F) with a grey selected area described by start angle a, apex angle b, start radius r1 and apex radius r2.]
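The angular half of the four-parameter selection can be illustrated with a minimal model of a circular pitch space. The code below is a sketch under our own assumptions (a diatonic circle of thirds for C major, angles in degrees with 0° at the top increasing clockwise); start and apex radius, which control octave content, are omitted, and none of the names come from the HarmonyPad itself.

```python
# Diatonic circle of thirds for C major: any three angular neighbours
# form a triad (C-e-G is C major, a-C-e is a minor, F-a-C is F major).
CIRCLE = ["C", "e", "G", "b", "d", "F", "a"]   # clockwise from the top
STEP = 360 / len(CIRCLE)                       # angle between neighbours

def selected_pitch_classes(start_angle, apex_angle):
    """Return the pitch classes whose angle lies within the angular span
    [start_angle, start_angle + apex_angle], measured clockwise from the
    circle's top (the musical convention used in the paper)."""
    selected = []
    for i, pc in enumerate(CIRCLE):
        angle = (i * STEP) % 360
        offset = (angle - start_angle) % 360   # clockwise distance past start
        if offset <= apex_angle:
            selected.append(pc)
    return selected

# An apex angle wide enough for three neighbours selects a full triad.
chord = selected_pitch_classes(-10, 120)       # covers C, e and G
```

Translating the start angle slides the selection to the neighbouring triad, which is the "simple geometric relationship" between often-used tone combinations that the text describes.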
[…] apex angle of the selected area has been set in the way that three neighbored tones are covered. Therefore it is possible to play complete chord cadences by touching […]

[Figure: three copies of the circular pitch space (pitch classes C, e, G, a, ...) with selected areas.]
[Figure 7: two circular pitch spaces (pitch classes C, e, G, h, d, F, a) with differently sized selected areas.]

Figure 7: a) Increasing the start radius crossfades a chord through different inversions. b) Increasing the apex radius adds tones from other octaves to the chord.

[…] is the generation of fifths parallels, which are forbidden in classical composing. But the problems shown before can be solved with the following two steps: 1.) Instead of using a semitone-based transposition we propose to perform the key changes based on the circle of fifths. The circle of fifths represents all diatonic keys in an order such that keys with more common tones are located close together and keys with fewer common tones are located further away. For example the keys C major and G major have six of seven tones in common, namely the tones c, d, e, g, a and b. In the circle of fifths C major and G major are in neighborhood. Another advantage of using the circle of fifths for key changes is that the relative minor key and the parallel major key of a given key are localized in a simple geometric relation to the current key. If the current key is C major, then the parallel key c minor can be found by selecting the key −90° from the current one⁶. The last advantage of using the circle of fifths is that it reveals symmetries between different key changes: Finding the parallel major key of a minor key is done in nearly the same way as finding the parallel minor key of a major key. The difference is that not the key −90° but the key +90° from the current one has to be selected. Revealing these symmetries helps music students to recognize structural redundancies and to internalize music-theoretical knowledge much more effectively.
The realization of the proposed way to change keys can be seen in Figure 1. The outer ring consists of twelve grey points which each represent a key. The assignment of keys to the grey points is not absolute: The point at the circle's top represents the currently selected key. To the left and to the right of the current key the other keys follow in an order according to the circle of fifths. If the musician wants to change into the parallel minor key, for example, he/she has to touch the point located at −90°. All key changes become immediately visible and audible. Therefore it is possible to play the cadence C major, F major, G major, C major by selecting the chord C major (like done in Figure 1) and touching the appropriate key points on the circle of fifths.
With the help of the circle of fifths we could reach a more intuitive way to change the key and to play chords from other keys. But the problem of fifths parallels still exists: If we select the key C major and change to the key G major, then the played chord C major is transposed by seven semitones and a forbidden fifths parallel occurs. To solve this problem we have to go back to the pitch class/pitch height space shown in Figure 3: The space is not simply transposed but shifted along the pitch class axis. As described in Section 3 this leads to the automatic generation of a well-sounding chord transition.

5.2 Raise or lower single pitch classes
The second way to play chords from other keys is to raise or lower single pitch classes. For example, to transform the chord C major into a chord C minor, the pitch class e can be lowered by a half tone to become the pitch class e♭. As denoted above, this alternative also allows to generate non-diatonic or other dissonant chords. A possibility to implement such a feature is to assign an appropriate user interface element to every one of the pitch classes. With such an interface element it is possible to raise the appropriate pitch class by a certain number of semitones.

6 Pedagogical application
Combined with a substantiated pedagogical concept the HarmonyPad becomes a helpful tool in early music education, e.g. in kindergartens and primary schools, but also in schools and music schools. Music students can use the tool to improve their music-theoretical knowledge and their ability to compose. Through the simplicity of play, older people can use the HarmonyPad to learn a musical instrument even in advanced years. The subsequent paragraphs summarize the music-theoretical resp. tonal relationships that can be taught using the HarmonyPad:

1. The student learns the tones that build the most often used major and minor chords. By selecting a narrow region at the surface of the HarmonyPad the student can listen to single tones. By enlarging the selected region the single tones can be crossfaded into major and minor chords.

2. The student learns functional relationships: By selecting an area on the left side of the HarmonyPad

⁶ We do not use the standard mathematical system where the angle 0° is located at the x-axis and angles increase counterclockwise. We use the musical coordinate system: Here the angle 0° is located at the positive y-axis, i.e. the circle's top. Furthermore the angle increases clockwise.
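The key-change scheme (one key per 30° step on the circle of fifths, measured clockwise from the current key at the top) can be sketched as follows. The list layout, key spellings and function name are our own assumptions, not the HarmonyPad's code.

```python
# Twelve major-key signatures on the circle of fifths, one per 30
# degree step, with the current key at the top (0 degrees) and angles
# increasing clockwise (the musical coordinate system of footnote 6).
MAJOR_KEYS = ["C", "G", "D", "A", "E", "B", "F#", "Db", "Ab", "Eb", "Bb", "F"]

def key_at(current, angle):
    """Return the key reached from `current` by touching the grey point
    `angle` degrees clockwise from the circle's top."""
    steps = round(angle / 30) % 12      # one fifth per 30 degrees
    i = MAJOR_KEYS.index(current)
    return MAJOR_KEYS[(i + steps) % 12]
```

Here `key_at("C", 30)` gives "G" (six of seven tones shared), and `key_at("C", -90)` gives "Eb", the three-flat signature of c minor, mirroring the −90°/+90° symmetry described in the text.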
the subdominant chords S, Sp; by selecting an area on the right side the dominant chords D, Dp; and by selecting an area in the upper center the tonic chords T, Tp can be played⁷.

3. The HarmonyPad links visual geometric positions, musical gestures, musimathematical structures and of course the resulting sound. This in turn trains the student to think in harmonies and to remember musical elements.

4. The circular arrangement of pitch classes in the HarmonyPad allows to visualize chord progressions as geometric tracks. Chord progressions can be linked with a geometric shape. For example the chord progression T, S, D, T can be visualized as a triangle at the HarmonyPad's surface.

5. By playing the HarmonyPad students learn to assign major and minor chords to their relative minor and major chords. This is possible because major chords and their relative minor chords occupy neighbored regions⁸ on the HarmonyPad.

6. The student automatically learns the common tones of third-related chords [3, p. 142]: For example the chords a minor and C major have an overlapping region which contains the common tones c and e.

7. With the HarmonyPad it is very easy to learn which tones can be used to accompany a given tone in a given key. To find the right tone the selected area has to be set up such that it contains three (or four) tones. After that, all chords that can be covered by the previously defined selected area and that contain the tone to be accompanied can be used to accompany the denoted tone⁹.

8. Last but not least the student learns that there are different inversions of every chord. These inversions can be easily created by moving the selected area towards the pitch class axis.

7 Summary and conclusion
The state of the HarmonyPad described to this point is a base around which many improvements have been implemented: a pitch-height-dependent change of the apex angle to prevent dissonant tone combinations in the lower frequency regions, the integration of multi-touch controllers like the JazzMutant Lemur, and other geometric representations like a cartesian one. The outcome of these further developments is that it is now required to optimize the HarmonyPad for a given target group; this can be infants, children, music pupils, music students but also older people. All in all it can be said that the HarmonyPad complements the piano keyboard in a very good way: While the piano keyboard organizes the tones along melodic relationships, the HarmonyPad does the same for harmonic relationships. Therefore, beside the piano, the HarmonyPad should become a central part of every school and private music education.

References
[1] Gatzsche, Gabriel; Mehnert, Markus; Brandenburg, Karlheinz; A., Daniel: Circular Pitch Space based Musical Tonality Analysis. In: 124th AES Convention (2008)
[2] Gatzsche, Gabriel; Mehnert, Markus; S., Christian: Interaction with tonal pitch spaces. In: Proceedings of the 8th International Conference on New Interfaces for Musical Expression NIME08, 2008
[3] Krämer, Thomas: Harmonielehre im Selbststudium. Neuausg. Wiesbaden u.a.: Breitkopf und Härtel, 2006. – ISBN 9783765102615
[4] Mehnert, Markus; Gatzsche, Gabriel; Brandenburg, Karlheinz; A., Daniel: Circular Pitch Space based Harmonic Change Detection. In: 124th AES Convention (2008)
[5] Mehnert, Markus; Gatzsche, Gabriel; Gatzsche, David; Brandenburg, Karlheinz: The analysis of tonal symmetries in musical audio signals. In: International Symposium on Musical Acoustics ISMA 2007, 2007
[6] Shepard, Roger: Geometrical approximations to the structure of musical pitch. In: Psychological Review 89(4) (1982), Jul, pp. 305-333
[7] Warren, J. D.: Separating pitch chroma and pitch height in the human brain. In: Proceedings of the National Academy of Sciences of the United States of America, 2003, pp. 10038-10042
Sonic interactions with hand clap sounds
Antti Jylhä and Cumhur Erkut
Dept. Signal Processing and Acoustics
Helsinki University of Technology
antti.jylha@tkk.fi
Abstract. In this paper, we present a control interface, which applies the hand claps of the user as control input.
With this system, we aim at providing more realistic and engaging control over the output. We present three
exemplary use cases: controlling a synthetic audience, controlling the tempo of a musical piece, and controlling
a simple sampler. Qualitative evaluation shows that the system performs well in the use cases. The control
interface has potential in other types of human-computer interaction as well.
mentation has been built on sndpeek6, a package for real-time audio visualization.

In this study, we aim at providing a prototype of a control interface that applies hand claps as control input. In the current implementation, the control interface can be applied to control a synthetic crowd of clappers, the tempo of a musical piece, or a simple sampler, by hand clap gestures in a continuous fashion. This implementation should be considered an early exploration of the possibilities of such a system, and we aim at extending the framework to other sound synthesis and human-computer interaction applications as well. Examples of intended future implementations include interactive games, novel HCI schemes, and a tool for sound and interaction designers for prototyping and evaluation.

possible for the user to perform some other music on the beat.

It is noteworthy that the three-agent system of the user, a piece of music, and the synthetic audience is not restricted to this form of interaction. In another possible scheme, it could be the audience controlling the user to clap, and the music would follow. Alternatively, the user could become the conductor of both the audience and the music simultaneously.
2 Applications
The current implementation of the system can perform
in three different functional modes. In the audience
mode, the user aims at synchronizing a synthetic au-
dience with her clapping tempo and possibly with a
musical piece by clapping to the beat of the music. In
the music tempo control mode, the user claps her hands
to control the tempo of music. In the sampler mode,
the user claps her hands to control a simple table-read
sampler. Technical details of the implementation will
be discussed in Sec. 3.
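The three functional modes share the same input chain and differ only in how a detected clap is consumed. This can be summarized as a single dispatch; a structural sketch only, where the handler objects and their method names are hypothetical and not part of the paper's PD implementation:

```python
def handle_clap(mode, clap_bpm, audience, music, sampler):
    """Route one detected clap to the active functional mode (Sec. 2).
    `audience`, `music`, and `sampler` are hypothetical handler objects,
    introduced here only to illustrate the structure."""
    if mode == "audience":
        # Audience mode: the synthetic audience entrains to the user's tempo.
        audience.entrain_to(clap_bpm)
    elif mode == "tempo":
        # Music tempo control mode: playback speed follows the clap tempo.
        music.set_tempo(clap_bpm)
    elif mode == "sampler":
        # Sampler mode: trigger the sample, or adjust the looping rate.
        sampler.on_clap(clap_bpm)
    else:
        raise ValueError(f"unknown mode: {mode}")
```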
2.3 Hand-clap-driven sampler

Another simple application for the interface in the musical domain is a sampler controlled by the clapping hands. The user can select an audio sample (.wav file) and then control the playback of the sample with the claps. This application can also function in two ways. The sample can be automatically looped, so that the tempo of the claps controls the looping rate and the rate of reading the sample from a table. Alternatively, the sample can be played every time a clap is detected.

A clap-driven sampler could be applied, for example, to control the tempo of a drum loop in a musical performance by mapping the clap tempo to the loop-playing rate. Another example could be the sound design of movies, when discrete auditory events are to be placed in the movie soundtrack in a rhythmic way or at instantaneous locations. The sound designer could for example mimic the rhythm of a monster's footsteps with her hand claps to glue the samples into the desired place.

3 System architecture and technique

The architecture of our system consists of a computer running PD, a microphone, and the user. The user is an integral part of the system, as the user's gestures are needed to control the system.

Capturing the hand clap sounds is easy to do with any conventional microphone. To extract control information from hand claps in our system, the microphone does not have to be of high quality. This makes the system widely applicable, because consumer-oriented computer microphones are not expensive, and many computers even have a sufficient built-in microphone.

PD is a graphical programming language that was originally developed for audio signal processing [9]. PD programs consist of graphical patches, which may contain different objects (functional elements), messages (parameters for the objects), arrays, and other programming elements. These elements are connected to each other from their inlets and outlets by drawing a line between them, i.e., routing the data through programming commands. For example, a summation element contains two inlets and one outlet, to accept two numbers to sum and to give their sum as output. A PD patch may contain many nested parts, which may be either subpatches (saved along with the main file), abstractions (graphically programmed individual .pd files), or externals (PD objects compiled from C code). Control data, e.g. numbers and strings, is processed without real-time requirements, while signals are processed in blocks of data at a different rate.

For the virtual audience application of Sec. 2.1, PD runs a modified version of the hand clap synthesis engine ClaPD [8], with a new control interface and additional functionality. In its previous versions, the control parameters of ClaPD have only been adjustable by conventional HCI techniques, i.e., the mouse and the keyboard. Here we propose a technique for using hand claps as input to extract control parameters for the system. The hand clap audio data is processed with PD to yield parametric control data for the synthesis engine. ClaPD and the control interface are described in detail in the remainder of this section, as are the techniques for implementing the music tempo control and sampler applications.

3.1 Extracting control parameters

From the user's hand claps, we extract three types of control parameters: onsets, tempo, and strength. For onset detection, we have experimented with different envelope-based and band-based methods. In this prototype, we chose to apply a readily implemented PD object known as bonk~, which is designed for detecting and classifying percussive sounds [10]. The algorithm is based on an analysis of the incoming signal in 11 frequency bands in overlapping time windows. The overall change in the power of the bands is applied for detecting an attack. The bonk~ object can also be trained to classify percussive sounds by template matching [10].

The output of bonk~ is the power of each frequency band, the tag of the class if the classification task is relevant, and the summed-up power of the subbands. We use the output to determine the onset-to-onset interval (OOI) between subsequent hand claps with a simple deterministic tempo tracker. Every time bonk~ gives an output, an onset has occurred, and we can use this onset information for further rhythm estimation. From bonk~, we also directly obtain the power of the attack as an estimate for the strength of the current clap. Although the clap strength remains unused in this prototype, its potential is acknowledged for future work.

The onset information obtained from bonk~ is further processed by the rhythm estimator object5, which is an online tempo-tracking object for onset information, i.e., bang messages in PD. The rhythm estimator algorithm was originally developed for tatum grid analysis of musical signals [11], but it serves well to provide an estimate of the intended clapping tempo in our research, too. The tatum is the smallest metrical level in musical rhythm, which in the case of quasi-periodic hand clapping translates to the average duration between successive claps. The algorithm is described in more detail in [11]. It is based on analyzing the OOIs by storing the OOI values into a time-varying OOI histogram (the inter-onset interval (IOI) histogram in [11]).
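The chain just described (onsets, onset-to-onset intervals, and a time-varying OOI histogram) can be sketched outside PD as follows. This is an illustrative plain-Python reimplementation, not the rhythm estimator of [11]; the history length and histogram bin width are assumptions made for the example:

```python
from collections import deque

class SimpleTempoTracker:
    """Deterministic tempo tracker: turns onset times (in seconds) into a
    BPM estimate via a time-varying onset-to-onset interval (OOI) histogram.
    Illustrative sketch only."""

    def __init__(self, history=8, bin_width=0.02):
        self.onsets = deque(maxlen=history)  # recent onset times (sliding window)
        self.bin_width = bin_width           # histogram resolution in seconds (assumed)

    def on_onset(self, t):
        """Register a detected clap onset at time t; return the BPM estimate."""
        self.onsets.append(t)
        if len(self.onsets) < 2:
            return None
        # OOIs between successive onsets.
        oois = [b - a for a, b in zip(self.onsets, list(self.onsets)[1:])]
        # Time-varying histogram: quantize the OOIs and pick the fullest bin.
        bins = {}
        for ooi in oois:
            k = round(ooi / self.bin_width)
            bins[k] = bins.get(k, 0) + 1
        k_best = max(bins, key=bins.get)
        ooi_est = k_best * self.bin_width   # tatum-like average clap period
        return 60.0 / ooi_est               # clapping tempo in BPM

tracker = SimpleTempoTracker()
bpm = None
for t in [0.0, 0.5, 1.0, 1.5, 2.0]:         # a clap every 0.5 s -> 120 BPM
    bpm = tracker.on_onset(t)
print(round(bpm))                            # 120
```

As in the paper's prototype, the histogram bin width trades robustness against BPM resolution, which is the resolution issue revisited in the conclusions.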
dience mode has been significantly extended [2]. Currently the audience can be dynamically generated, and multiple sound generators can be hosted in ClaPD. The last property allows us to implement a special Patron object, which, by listening to the tapping or clapping of the user, can drive the audience to synchronize with it within the limits of the virtual audience. These limits are imposed by the cosc (coupled oscillator) object that performs the entrainment of each virtual clapper.

In the synchronized mode, each clapper aims to clap at around the same rate (frequency-locking) and absolute time (phase-locking) as the Patron, and calculates its phase difference (measured in milliseconds) with the Patron. Since the Patron has a quasi-stationary OOI_P, the phase difference can be considered a uniform distribution U(0, OOI_P) with a mean of OOI_P/2 ms. If a clapper is trailing behind the Patron, then the phase difference is smaller than OOI_P/2 ms and its clapping rate is accelerated. Similarly, if the clapper is ahead of the Patron (phase difference greater than OOI_P/2 ms), its clapping rate is slowed down. The exact expressions for the acceleration and deceleration are given in [8] for the constant Patron OOI of 440 ms.

This scenario can be extended to generate synchronous virtual applause along an external rhythmic piece of music, as explicated in Sec. 2.1. In our experiments, we found synchronizing the ClaPD clappers to be easy when clapping around the built-in preferred clapping rate of the virtual clappers (440 ms), but clapping much faster or slower did not lead to synchronization. Therefore, we have made the Patron rate variable. Admittedly, this approach makes the original model behave in a less natural way, but it gives good results in synchronizing the audience.

a second. In order for the mapping to work realistically, that is, for the clapper's claps to match the beats of the music, the reference BPM of the original song needs to be known. If the reference is not known beforehand, it can be determined by the user by clapping to the beat of the song in the virtual audience mode. The mapping of the clap BPM to the phase vocoder precession speed is linear, defined as

    v = 100 * BPM_clap / BPM_ref,    (2)

where v is the precession speed, BPM_clap is the estimated clap BPM, and BPM_ref is the reference BPM. If a wrong reference BPM value is applied, the claps of the user will still control the tempo, but they will not match the beats of the music.

When the user claps to control the music, sonic feedback is provided to the user to indicate the detected claps. This feedback is a hand clap sound sample8, which has been written to a table and is played back every time a clap has been detected. While it would also be possible to route the clapper's own clapping sounds back to the audio output, this would result in feedback problems if loudspeakers are used to reproduce the sounds.

The major downside of this time-stretching technique is that the applied phase vocoder implementation does not readily work with streaming audio. There are also some audible artifacts in the processed sound, such as "phasiness" and "loss of presence", which are characteristic of phase vocoders [6]. However, for prototyping purposes, the simple phase vocoder proved to be quite sufficient.
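The entrainment rule and the tempo mapping of Eq. (2) can both be stated in a few lines. In this sketch the proportional gain of the clapper adjustment is an assumption made for illustration; the exact expressions are given in [8]:

```python
def clapper_rate_adjust_ms(phase_diff_ms, ooi_p_ms, gain=0.1):
    """Adjustment to a virtual clapper's own OOI, in ms. A clapper trailing
    the Patron (phase difference < OOI_P/2 ms) accelerates (negative value,
    i.e. a shorter OOI); one ahead of the Patron decelerates (positive value).
    The proportional `gain` is an illustrative assumption, not the rule of [8]."""
    half = ooi_p_ms / 2.0
    return gain * (phase_diff_ms - half)

def precession_speed(bpm_clap, bpm_ref):
    """Eq. (2): v = 100 * BPM_clap / BPM_ref, the linear mapping of the
    estimated clap BPM to the phase vocoder precession speed (100 = original)."""
    return 100.0 * bpm_clap / bpm_ref

print(clapper_rate_adjust_ms(100.0, 440.0))  # negative: the clapper speeds up
print(precession_speed(60.0, 120.0))         # 50.0: claps at half the reference BPM
```

Note how a wrong BPM_ref only rescales v linearly, which is why the claps still control the tempo but no longer land on the beats.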
3.5 Latency

In the applications, there is an amount of latency introduced by the system between the excitation and the response. The latency as a whole is a sum of many software- and hardware-related latencies, including the soundcard latency, PD audio buffering, and the computational latency of the PD program. The overall system latency is strongly dependent on the operating system and its audio drivers, and the way these are configured. While it is not straightforward to measure all the different latencies, the latency within the PD program can be estimated by PD's own realtime object, which calculates the elapsed time between two program events.

As a result of the latency, the user's claps and the claps of the virtual audience are not trivially simultaneous. Measuring the latency and taking into account the cyclic nature of clapping, a simple remedy for the problem is delaying the system response to match the time of the user's next predicted clap. This prediction can be calculated as the difference of the estimated time between claps and the measured latency.

Naturally, latency also appears in the other functionalities, where it cannot be compensated for with the simple trick used in the audience mode. On the other hand, informal evaluation of the system indicated that while the latency is noticeable, it is not necessarily disturbing.

3.6 Graphical control interface

The graphical control interface is presented in Fig. 3. Although the actual interaction with the system is performed with hand claps, a conventional interface for selecting the functionality and other relevant controls is required.

[Figure 3: The PD control interface of the example applications. The user can select the functional mode and adjust the other settings as required. The BPM estimated from the user's claps is presented in all modes.]

The control interface consists of a selector for the functional mode, a selector for the input type, a BPM visualizer, output gain and audio on/off settings, and GO! and STOP! buttons for performing mode-specific start and stop commands. In the virtual audience mode, the GO! button starts the synthetic applause, while the STOP! button ends it. In music tempo control mode, the user should first adjust the song BPM to the normal BPM of the piece. Then loading a song, hitting GO! and starting to clap will start playback of the music at the reference BPM, and STOP! will end the playback. In sampler mode, GO! and STOP! will start and stop a loop, if looping has been selected. There are also buttons for loading the music or the sample, i.e., opening up an "Open file" window, so that the user can select the music she wants to control or the sample to use in the sampler. In this prototype, these must be .wav files.

In the virtual audience mode, a visualization of the clapping crowd is presented as a blinkenlights canvas (http://ydegoyon.free.fr/software.html), i.e., a grid of flashing pixels indicating individual clappers. The middlemost pixel in the grid visualizes the claps of the user.

4 Informal evaluation of the system

To evaluate the system, we asked two persons to try out the user interface. The emphasis was on the virtual audience mode and the tempo control mode. The persons were first instructed to clap their hands and try to get the virtual audience synchronized with them. Both test subjects reported that the synchronization feels both realistic and engaging. However, an amount of latency is noticeable, especially during accelerandos and ritardandos.

The idea of controlling the tempo of music with hand claps was appealing to the test subjects. It turned out, however, that controlling the tempo of an unfamiliar musical piece required some practice. The original tempo of the music seemed to control the clapping tempo of the user at first, and it required conscious
effort to break the cycle. After a few tryouts, the test subject was able to control the tempo. Although the test subjects did not report it, according to the authors' own experiments a similar phenomenon seems to occur in the audience mode. The tempo of the clapping crowd affects the tempo of the user's clapping, and it requires concentration to start clapping in a different tempo than the crowd.

Latency was noticed by both test subjects in the virtual audience application, but it was not reported as too distracting. With real clapping the latency is more severe than with mouse clicks, due to buffering of the incoming audio and also in part due to the bonk~ object, which is naturally computationally intensive when compared to simply sending a bang message.

5 Conclusions

We introduced a prototype of a gestural control interface for sound synthesis driven by hand clap sounds, and simple applications to demonstrate the potential of such an interface. Hand claps, being a universally understood sonic gesture, make a natural control signal which is easy to learn. The most important steps in the future are to elaborate the system with more efficient and novel algorithms, and to come up with innovative applications for the control interface. Although the possibilities of using hand claps as control signals may seem limited, these possibilities are well worth exploring due to the potential of hand claps as an easily understood means of control.

A drawback of the current system is that it incorporates an amount of latency between the excitation and the response. An important objective in the future is to optimize the computational efficiency of the program in order to reduce this latency to the minimum. Also, the rhythm estimator histogram resolution issue discussed in Section 3.1 needs to be solved in a way that does not affect the resolution of the estimated BPM as much. Alternatively, another algorithm could be developed for rhythm estimation.

During this study, we did not formally compare the two control mechanisms, i.e., hand claps and mouse clicks. To make a final justification for applying a hand clap interface in human-computer interaction, such an experiment should be undertaken. The distractiveness of the latency could also be subject to subjective testing.

To provide another modality to the interface and more variety in the control signals, the trajectories of the hands can be extracted from the actions of the clapper. In the course of this research, we have already begun experimenting with optical hand tracking. The concept will be studied in more detail for the next prototype of the interface.

The current system makes no effort to distinguish between hand claps and other impulsive excitation signals, such as tapping the microphone. The system would be more robust against impulsive noises if such a distinction could be made. Another important addition to the system is to exploit the identification of different hand clap types. Using this discriminative information, it is possible to construct more diverse control mappings and sophisticated human-computer interaction with simple and natural gestures. Although bonk~ would serve as a testbed for such an extension, too, we aim at providing another classification scheme for the purpose. An algorithm and first results of offline identification of eight different hand clap types have already been presented [5], and a real-time implementation of the algorithm as well as its further development has been indicated as future work.

6 Acknowledgements

This work is supported by the Academy of Finland (project 120583 Schema-SID). Our thanks to Leevi Peltola for the initial release of the ClaPD software, and to Jussi Pekonen and Hannu Pulakka for their comments and for volunteering as test subjects in the evaluation of the system.

References

[1] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. IEEE Trans. Speech and Audio Processing, 13(5):1035–1047, 2005.

[2] C. Erkut and K. Tahiroglu. ClaPD: A testbed for control of multiple sound sources in interactive and participatory contexts. In PureData Convention 2007, Montreal, Canada, August 2007. Online proceedings at http://artengine.ca/~catalogue-pd.

[3] F. Gouyon and S. Dixon. A review of automatic rhythm description systems. Comp. Music J., 29(1):34–54, 2005.

[4] K. Hanahara, Y. Tada, and T. Muroi. Human-robot communication by means of hand-clapping (preliminary experiment with hand-clapping language). In Proc. IEEE Intl. Conf. Systems, Man and Cybernetics, pages 2995–3000, Montreal, Canada, October 2007.

[5] A. Jylhä and C. Erkut. Inferring the hand configuration from hand clapping sounds. In Proc. 11th Intl. Conf. Digital Audio Effects (DAFx-08), pages 301–304, Espoo, Finland, September 2008.
Toward a Salience Model for Interactive Audiovisual Applications of
Moderate Complexity
Ulrich Reiter, Q2S - NTNU, Trondheim, Norway, reiter@q2s.ntnu.no
Abstract. To provide users of interactive audiovisual application systems with subjectively high presentation
quality of the content (Quality of Experience, QoE), it is usually not effective to increase the simulation depth
of the rendering process alone. Instead, by focusing on salient parts of the content, perceived overall quality can
be increased without causing additional computational costs. This paper provides the basis for a novel salience
model for interactive audiovisual applications of moderate complexity that is based on influence factors which
have been identified in a coordinated series of experimental studies.
[Figure 2: The suggested salience model for interactive audiovisual applications of moderate complexity. Diagram labels: STIMULUS → SENSORY PERCEPTION → COGNITIVE PROCESSING → QUALITY IMPRESSION; REACTION; experimental evaluation.]
level of saliency of each perceived object. It is easily seen that a generalized salience model is too complex and the influence factors too manifold to cope with at the current state of knowledge. Therefore it is necessary to get away from a generalized salience model. Instead, it is reasonable to focus on a salience model valid for interactive audiovisual applications of moderate complexity.

Fig. 2 shows how such a salience model may be structured. The basis of human perception is the stimuli. For interactive applications these are generated by the application system itself, so they will depend on a number of factors: the influence factors of level 1. These factors comprise the audiovisual reproduction setup, e.g. (multichannel) loudspeaker setup, headphones, frequency range, panning laws and algorithms applied, size and resolution of the screen, brightness and color distribution, etc. Note that the actual weight of these factors may also depend on the audiovisual content itself: a static, acoustically "dry" sound source in a frontal position will not be critical to different panning laws or the number of loudspeaker channels in the back. Influence factors of level 1 also comprise technical input devices for user feedback to the system. As an example, navigation in a 3D scene can be controlled via computer mouse, keyboard, joystick, accelerometer or other advanced technical devices that offer differing degrees of freedom (DOF) and thus differing amount and precision of control. This in turn influences how precisely the system can react by appropriately producing / modifying the stimuli. To summarize, influence factors of level 1 are those related to the generation of stimuli.

The core elements of human perception have been identified to be sensory perception on the one hand and cognitive processing on the other hand. Sensory perception can be affected by a number of influence factors of level 2. These involve the physiology of the user (acuity of hearing and vision, masking effects caused by limited resolution of the human sensors, etc.) as well as all other factors directly related to the physical perception of stimuli.

Cognitive processing produces a response by the user. This response can be obvious, like an immediate reaction to a stimulus, or it can be an internal response like re-distributing attention, shifting focus or just entering another turn of the Perceptual Cycle (see [3]). Obviously, the response is governed by another set of influence factors of level 3. These span the widest range of factors, and also the most difficult to quantify: experience, expectations, and background of the user; difficulty and aim of the task (if any); degree of interactivity; type of application; etc. Influence factors of level 3 are related to the processing and interpretation of the perceived stimuli.

Cognitive processing will eventually lead to a certain quality impression (Quality of Experience, QoE) that is a function of all influence factors of levels 1-3. This quality impression cannot be directly quantified by humans. It needs additional processing to be uttered in the form of (quantitative) ratings on a quality scale, as (qualitative) semantic identifiers, and so on.

The common way of assessing the overall quality impression is to evaluate single or combined quality attributes. The scientific community has developed a number of attributes that are believed to be relevant for an overall audiovisual quality impression. Among these are audiovisual synchrony (both temporal and spatial), the localization of events, sound as well as video quality by themselves (which, nevertheless, influence each other), responsiveness to interaction (when applicable), and many more.

[Figure: loudspeaker setup with speakers at ±15°, ±45° and ±105°, and distances 2.72 m, 2.8 m and 3.4 m (caption not recovered).]

Woszczyk et al. have tried to arrange these into a […]volvement, Balance) within these dimensions [12]. But again, a quantification of their impact is hardly possible as of now. This is because their weight not only depends on the audiovisual content (the stimulus) under assessment, but also on the experimental evaluation (the test methodology) itself. An attribute that is
The assessment compared three versions of the Perceptual Approach algorithm to each other. The test results make clear that - for the audiovisual case - subjects were not able to identify the three versions of the algorithm under assessment. Increasing the density of the diffuse reverberation part remains without perceivable improvement of the quality in bimodal (audiovisual) perceptual situations. Therefore the Perceptual Approach algorithm as specified in MPEG-4 Scene Description can be simplified to use only four internal workchannels without degrading the overall perceived quality in the audiovisual context.

5.2 User Interaction

The next three assessments [18, 21, 22] focused on the effect that user interaction with the audiovisual application might have on the perceived overall quality. Here the general assumption was that by offering attractive interactive content or by assigning the user a challenging task, the user would become more involved and thus experience a subjectively higher overall quality.

The first experiment in this series compared the perceived overall quality of audiovisual scenes under different degrees of interaction [18]. The actual amount of interaction was determined by three different tasks that the test subjects had to fulfill during the assessment. These were:

1. Listen and watch task: Test subjects were presented with an automated movement through the virtual scene. No activity on their side was required. The automated movement lasted around 30 s, selected from two different predefined motion paths.

2. Listen and press a button task: Again, test subjects were asked to experience an automated movement through the virtual scene. This time, an object automatically appeared within the field of view. It was subsequently approached and (again automatically) collected. Then, a new object would appear, and so on. Test subjects were asked to immediately press a button whenever the object appeared.

3. Listen and collect an object task: Test subjects were using the computer mouse to navigate freely inside the virtual scene. Their task was to collect the object that was positioned somewhere on the floor. When they had approached the object closely enough, it was collected and re-appeared in another location. The new location was either within the field of view, or the subjects had to turn around to see it again. They were asked to collect as many objects as possible within a given time limit of 30 s.

Interestingly, and contrasting with results obtained by others (e.g. Zielinski, Kassier, Rumsey, Bech et al. in [19] and [20], both based on smaller sample sizes), the different tasks that test subjects had to perform did not have an effect on the quality evaluation (Friedman: χ² = 3.3, df = 2, p = 0.190 > 0.05, ns).

Two possible explanations exist:

• The subjects' task of navigating through the scene was not demanding enough, making the differences in quality too obvious.

• Whereas user interaction was related to visual and haptic modalities, the quality rating was based on audiovisual percepts. The distraction generated by interaction was not high enough to be significant across modalities.

These possible explanations were examined in the next two experiments [21, 22]. On the one hand, user interaction, rating process and tasks aimed at sharing the same modality. On the other hand, test subjects were confronted with a mentally complex, yet easily scalable task: the n-back working memory paradigm. In this, the subject typically is required to monitor a series of stimuli and to indicate whether or not the stimulus currently presented is the same as the one presented n steps before.

Here, a sequence of spoken numbers was presented and subjects had to compare the numbers. At the same time, the reverberation time was varied and subjects were subsequently asked to correctly rate the length of reverberation in comparison to a reference reverberation time1. Unlike in previously published experiments, both the attribute to be rated and the distracting task were located in the same modality. An analysis of the collected data indicates that the precision with which auditory parameters can be rated / discriminated by humans is dependent on the degree of distraction present in the same modality. A highly significant difference in rating accuracy was shown for the "navigation only" condition vs. the "navigation with 2-back task" condition using Wilcoxon's T test (matched pairs signed ranks, T = 20, p ≤ 0.01).

1 In [18], an additional semi-structured interview had revealed that reverberation time was regarded as one of the most important attributes for the given type of interactive audiovisual content by all test subjects.

This result further confirms and specifies the findings of [18, 19, 20]: Whereas cross-modal division of attention only renders a small significant effect and - apart from being listener-specific - depends on the experimental conditions, with inner-modal distraction test subjects would predictably commit errors in their
ratings. Apparently, inner-modal influence is significantly greater than cross-modal influence. This is also supported by some of the theories of capacity limits in human attention [23].

5.3 Cross-Modal Interaction

Finally, the last assessment [24] in this series investigated the possibility of cross-modal influence of interaction upon perceived quality. Whereas in the previous two assessments the influence of interaction within the same modality was investigated, here the influence of a visual (-motion) task upon the perceived audio quality was evaluated. This experiment borrows from what Zielinski et al. [19] and Kassier et al. [20] have described, but the test panel was significantly larger (31 test subjects as opposed to 6 and 7, respectively), thus allowing a profound statistical analysis.

For this experiment, a computer game was designed to assess the effect of divided attention on the evaluation of audio quality during involvement in a visual task. Subjects had to collect selected flying objects (donuts) by running into them and avoid collision with other objects (snowballs). For navigation, test subjects used the left and right arrow keys of a computer keyboard. Movement was only possible to the sides, at a fixed distance from the source of the flying objects.

A game score was recorded for each subject to verify the subjects' involvement in the game and to prod the subjects to actively play the game. Collecting the right object (donut) increased the score by one point, whereas a collision with a snowball decreased the score by one point.

For the experiment, each subject carried out a passive and an active session. The active session consisted in playing the computer game and evaluating the audio quality. This session was designed to cause a division of attention between the process of rating the audio quality and the involvement in the computer game. In the passive session, subjects were asked to evaluate the audio quality while only watching a game demo. The audio quality degradations were realized by modifying the tonal quality. The original music signal (16 kHz) was low-pass filtered using three different cut-off frequencies fc = 11 kHz, 12 kHz and 13 kHz. Additionally, an anchor with a low-pass filtering at the cut-off frequency fc = 4 kHz was created.

The Wilcoxon T test showed that the quality ratings of the active session varied significantly from the ratings of the passive session for cut-off frequencies up to 12 kHz. A significant decrease in rating correctness was shown for the Game condition in comparison to the No Game condition for the anchor item (T = 37, p ≤ 0.01), the cut-off frequency fc = 11 kHz (T = 452.50, p ≤ 0.01), and the cut-off frequency fc = 12 kHz (T = 812, p ≤ 0.01). The low-pass filtering in the active session (Game condition) was rated as being generally less perceptible.

This assessment showed that a cross-modal influence of interaction is possible when stimuli and interaction are carefully balanced. Interaction performed in one modality (e.g. visual-haptic) can dominate the perception of stimuli in another modality (here: auditive). Yet, at this time it is not possible to determine or quantify that balance a priori.

5.4 Conclusions

The experiments have clearly identified a number of factors that influence the perceived quality of audiovisual content. These are of a technical nature, i.e. depending on the reproduction setup and simulation algorithm used, but also of a contextual and subjective nature, i.e. depending on user task, on degree and modality of interaction offered, and on individual attention capacity limits.

6 Summary and Outlook

The model introduced here identifies and classifies the most important influence factors that determine the saliency of objects in a multimodal perceptual situation. It has been specifically developed to describe the perception of audiovisual content in interactive application systems of moderate complexity, yet it can be extended to include true multimodality. It is based on the experimental evaluation of perceived overall quality (Quality of Experience) tested in a coordinated series of subjective assessments.

The model needs further refinement to be put to use in real-world applications. One of the tasks that remain is the context-dependent quantification of the influence factors: in its current state of development, the model is a purely qualitative one that does not yet allow a priori statements (quantified estimations) on the weight of individual factors.

References

[1] Reiter, Ulrich. On the Need for a Salience Model for Bimodal Perception in Interactive Applications. IEEE/ISCE'03, International Symposium on Consumer Electronics. Sydney, Australia, December 3-5, 2003.

[2] Landragin, Frederic; Bellalem, Nadia; Romary, Laurent. Visual Salience and Perceptual Grouping in Multimodal Interactivity. Proc. International Workshop on Information Presentation and Natural Multimodal Dialogue IPNMD. Verona, Italy, December 14-15, 2001.
[3] Farris, J. Shawn. The Human Interaction Cycle: A Proposed and Tested Framework of Perception, Cognition, and Action on the Web. PhD Thesis. Kansas State University, USA, 2003.

[4] Wertheimer, Max. Untersuchungen zur Lehre von der Gestalt II. Psychologische Forschung. 4, 1923, pp 301-350.

[5] Zwicker, Eberhard; Fastl, Hugo. Psychoacoustics - Facts and Models. 2nd updt. ed., Springer Verlag. Berlin, 1999, ISBN 3-540-65063-6.

[6] Lee, Kwan Min; Jin, S. A.; Park, N.; Kang, S. Effects of narrative on feelings of presence in computer/video games. Annual Conference of the Internat. Communication Association (ICA), New York, NY, USA, May 2005.

[7] Lee, Kwan Min; Jeong, Eui Jun; Park, Namkee; Ryu, Seoungho. Effects of Networked Interactivity in Educational Games: Mediating Effects of Social Presence. PRESENCE2007, 10th Annual International Workshop on Presence, Barcelona, Spain, Oct. 25-27, 2007, pp 179-186.

[8] Steuer, Jonathan. Defining Virtual Reality: Dimensions Determining Telepresence. Journal of Communication. 42/4, 1992, pp 73-93.

[9] Larsson, Pontus; Västfjäll, Daniel; Kleiner, Mendel. On the Quality of Experience: A Multi-Modal Approach to Perceptual Ego-Motion and Sensed Presence in Virtual Environments. Proceedings First ISCA ITRW on Auditory Quality of Systems AQS-2003. Akademie Mont-Cenis, Germany, April 23-25, 2003, pp 97-100.

[10] Lombard, Matthew; Ditton, Theresa. At the Heart of it All: The Concept of Presence. Journal of Computer-Mediated Communication, 3, 1997.

[11] Sheridan, Thomas B. Further Musings on the Psychophysics of Presence. Presence, 5/1994, pp 241-246.

[12] Woszczyk, Wieslaw; Bech, Soren; Hansen, Villy. Interactions Between Audio-Visual Factors in a Home Theater System: Definition of Subjective Attributes. AES 99th Convention, New York, USA, 1995, Preprint 4133.

[13] Reiter, Ulrich. TANGA - an Interactive Object-Based Real Time Audio Engine. Audio Mostly 2007, 2nd Conference on Interaction with Sound, Ilmenau, Germany, September 27-28, 2007.

[14] Reiter, Ulrich. Subjective Assessment of the Optimum Number of Loudspeaker Channels in Audio-Visual Applications Using Large Screens. Proc. AES 28th Internat. Conf., Pitea, Sweden, June 30 - July 2, 2006, pp 102-109.

[15] Recommendation ITU-R BS.775-1. Multichannel stereophonic sound system with and without accompanying picture. International Telecommunication Union, Geneva, Switzerland, 1994.

[16] Reiter, Ulrich; Partzsch, Andreas; Weitzel, Mandy. Modifications of the MPEG-4 AABIFS Perceptual Approach: Assessed for the Use with Interactive Audio-Visual Application Systems. Proc. AES 28th Internat. Conf., Pitea, Sweden, June 30 - July 2, 2006, pp 110-117.

[17] Int. Std. (IS) ISO/IEC 14496-11:2004. Information technology - Coding of audio-visual objects - Part 11: Scene description and Application engine. Geneva, Switzerland, 2004.

[18] Reiter, Ulrich; Jumisko-Pyykkö, Satu. Watch, Press and Catch - Impact of Divided Attention on Requirements of Audiovisual Quality. 12th Internat. Conf. on Human-Computer Interaction, HCI2007, Beijing, PR China, July 22-27, 2007.

[19] Zielinski, Slawomir; Rumsey, Francis; Bech, Soren; de Bruyn, Bart; Kassier, Rafael. Computer Games and Multichannel Audio Quality - the Effect of Division of Attention Between Auditory and Visual Modalities. Proc. AES 24th International Conference on Multichannel Audio, Banff, Alberta, Canada, June 2003.

[20] Kassier, Rafael; Zielinski, Slawomir; Rumsey, Francis. Computer Games and Multichannel Audio Quality Part 2 - Evaluation of Time-Variant Audio Degradation under Divided and Undivided Attention. AES 115th Convention, New York, USA, October 2003, Preprint 5856.

[21] Reiter, Ulrich; Weitzel, Mandy; Cao, Shi. Influence of Interaction on Perceived Quality in Audio Visual Applications: Subjective Assessment with n-Back Working Memory Task. Proc. AES 30th International Conference, Saariselkä, Finland, March 15-17, 2007.

[22] Reiter, Ulrich; Weitzel, Mandy. Influence of Interaction on Perceived Quality in Audio Visual Applications: Subjective Assessment with n-Back Working Memory Task, II. AES 122nd Convention, Vienna, Austria, May 5-8, 2007.

[23] Pashler, Harold. The Psychology of Attention. 1st paperback edition, The MIT Press, Cambridge, MA, USA, 1999, ISBN 0-262-66156-X.

[24] Reiter, Ulrich; Weitzel, Mandy. Influence of Interaction on Perceived Quality in Audiovisual Applications: Evaluation of Cross-Modal Influence. Proc. 13th International Conference on Auditory Displays (ICAD), Montreal, Canada, June 26-29, 2007.
An Embedded Audio-Based Vehicle Classification
Based on Two-level F-ratio
Abstract. The human auditory system can be regarded as a complex signal and information processing machine. With the rapid progress of information technology, it is reasonable to use computers to simulate the abilities of the human auditory system. One such line of research is audio sense technology, which uses computers to analyze environmental sounds and identify the sound source by analysis of audio signals, thereby extending people's auditory sense. One application of audio sense technology is vehicle classification, in which the audio features of different vehicles are used to design a classifier that identifies them. For embedded applications, a linear weighted classifier is generally used for vehicle classification, and the F-ratio, which denotes the contribution of each audio feature dimension, is adopted as the weight. In this paper, we find that using the F-ratio alone as the weight leaves confusion errors in the classification, so an approach using a two-level F-ratio as the weights is proposed to overcome the problem. We first use the F-ratio over all patterns as the weight to select the candidate set of confusion patterns; then, within the candidate set, we use the F-ratio of the confusion patterns as the weight to give the final classification result. A real-time accuracy of 82.1% is obtained on an embedded platform based on an MSP430F149 microcontroller.
1. Introduction

The human auditory system can be regarded as a complex signal and information processing machine. With the rapid progress of information technology, it is reasonable to use computers to simulate the abilities of the human auditory system. One such line of research is audio sense technology, a main branch of the human-computer interface that is important for the automation and intelligence of computers. This technology uses computers to analyze environmental sounds automatically and to identify the sound source by analysis of audio signals, giving the computer a form of auditory intelligence. Audio sense technology will be widely used, because it can extend people's auditory sense by means of the computer, which will be a primary assistant to human beings in the information acquisition domain.

The motivation of this paper is to discuss approaches that use audio sense technology to identify different kinds of vehicles. This is an important signal processing task that has found widespread applications such as intelligent transportation systems and sensor networks [e.g. 1-3]. Much research has explored this kind of work. In [1], the classification of moving vehicles in a distributed, wireless sensor network was investigated, where a local pattern classifier at each sensor node first made a local decision on the type of the vehicle based on its own feature vector. The probability of correct classification can also be estimated. The local decision, together with the estimated probability of being a correct decision, can then be encoded and transmitted efficiently via the wireless channel to a local fusion center ready for decision fusion. It was found that data fusion and decision fusion enhanced the performance of vehicle classification. In [2], an adaptive threshold algorithm was proposed for real-time vehicle detection applications. It is a time-domain energy-distribution-based algorithm, which first computes the time-domain energy distribution curve and then slices the energy distribution curve using a threshold updated adaptively by some decision states. Finally, the decision results from threshold slicing are passed to a finite state machine, which makes the vehicle detection decision. In [3], a traffic monitoring detector for vehicle classification was developed to aid traffic management systems, where a Time Delay Neural Network (TDNN) was chosen to classify individual traveling vehicles based on their speed-independent acoustic signatures, and Linear Predictive Coding (LPC) preprocessing and feature extraction techniques were applied to the work. In [4], a fusion framework using both video and acoustic sensors for vehicle detection and tracking was proposed. In the detection phase, a rough estimate of the target direction-of-arrival was first obtained from acoustic data through beam-forming techniques. This initial estimate designates the approximate target location in the video. Given the initial target position, the method was refined by moving-target detection using the video data. Markov Chain Monte Carlo techniques were then used for joint audio-visual tracking.

Considering that many embedded applications are based on a processor of limited computational ability, approaches with simple computation and small storage need to be developed. Generally, simple feature extraction and a linear classification algorithm are adopted to reduce the computational complexity. In order to represent the contributions of different audio feature dimensions, a method of audio-feature-weighted Euclidean distance is used for pattern matching in the linear classifier, and the F-ratios of the different audio feature dimensions are selected as the weights. In this paper, we find that using the F-ratios of the total patterns as the weights still leaves errors of classification. We analyze the confusion patterns of the error classification and find that their features with larger values of F-ratio are nearly similar, while the features with smaller ones are very different. Consequently, in order to improve the performance, a two-level F-ratio weighted algorithm is proposed, in which we first use the F-ratio of the total patterns as the weight to select the candidate set of confusion patterns, and then, within the candidate set, use the F-ratio of the confusion patterns as the weight to give the final classification result. In the paper, audio signal gathering technology, audio signal feature extraction technology and embedded system technology are also carefully analyzed. The schematic diagram of our vehicle classification system is shown in Figure 1. Furthermore, an embedded real-time audio sense processing system for identifying different kinds of vehicles based on a Texas Instruments MSP430F149 microcontroller is implemented.

* This work is supported by the National Natural Science Foundation of China under Grant No. 60672163.

Figure 1. Schematic diagram for acoustic signature processing for vehicle classification (Sound of vehicle → Preprocessing → Feature extraction → Two-level F-ratio method → Vehicle classification).

The paper is organized as follows: in Section 2, the pre-processing and feature extraction of vehicle audio signals are first introduced, including preemphasis, windowing and feature computation, from which four traditional features and two new features are obtained; next, the proposed two-level F-ratio based method is discussed. In Section 3, the experiments and discussions are given, which include the corpus of our experiments, the audio signal preprocessing and feature analysis, the experimental results of vehicle classification based on audio signals, and some discussions. Finally, Section 4 gives the summary of our work and the conclusions.

2. Two-Level F-ratio Method

2.1. Pre-processing

Pre-processing is a most important part of acoustic signal processing; it converts the sound waveform to some type of parametric representation (generally at a considerably lower information rate) for further analysis and processing. It includes the following steps [5]:

(1) Preemphasis, in which the digitized acoustic signal is put through a low-order digital system to spectrally flatten the signal and to make it less susceptible to finite precision effects later in the signal processing. The digital system used in the preemphasizer is either fixed or slowly adaptive. The most widely used preemphasis network is the fixed first-order system,

H(z) = 1 − ã·z⁻¹,  0.9 ≤ ã ≤ 1.0   (1)

(2) Frame blocking: the acoustic signal is a slowly time-varying signal in the sense that, when examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary, so it is reasonable to cut the continuous acoustic signal into short-time parts, each of which is called a frame. In this step the preemphasized acoustic signal is blocked into frames of N samples, with adjacent frames being separated by R samples.

(3) Windowing: the next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N−1, then the result of windowing is the signal

ỹ_l(n) = y_l(n)·w(n),  0 ≤ n ≤ N−1   (2)

where y_l(n) and ỹ_l(n) are the l-th frames of the signal before and after windowing, respectively. A typical window is the Hamming window, which has the form

w(n) = 0.54 − 0.46·cos(2πn / (N−1)),  0 ≤ n ≤ N−1   (3)

2.2. Feature Extraction

Generally, the feature analysis of acoustic signals is based on three domains: the time, frequency, and time-frequency domains. Acoustic signal processing in the time domain is a natural approach, but not an optimal one due to the complexity of the environment, i.e., the time-domain signatures of acoustic signals can be hampered by noise from other moving vehicles. The features of the frequency and time-frequency domains are useful for good classification performance, but they are not suitable for real-time vehicle classification since they tend to require intensive computation and samples from a long period of time. Considering the requirement of real-time processing and the limited computational ability of the MSP430F149, time-domain features are adopted in this paper. Six features in the time domain are used, including four general ones: short time energy (STE), short time zero-cross-rate (ZCR), short time band-pass zero-cross-rate (BZCR) and the number of short time peaks. Two new time features are also used, i.e., the average value of short time peaks and the variance of short time peaks. The above features are defined as follows.

Let ỹ_l(n) (n = 0, 1, …, N−1) denote the acoustic signal samples in a frame; the short time energy is defined as

STE(l) = Σ_{n=0}^{N−1} ỹ_l(n)   (4)

ZCR is another basic acoustic feature that can be computed easily. It is equal to the number of zero-crossings of the waveform within a given frame,

ZCR(l) = (1/2)·Σ_{n=0}^{N−1} | sgn[ỹ_l(n)] − sgn[ỹ_l(n−1)] |   (5)

where sgn(x) is the sign function,

sgn(x) = 1 if x > 0;  0 if x = 0;  −1 if x < 0   (6)

ZCR has the following characteristics: (1) In general, the ZCR of both unvoiced sounds and environment noise is larger than that of voiced sounds. (2) It is hard to distinguish unvoiced sounds from environment noise by using ZCR alone since they have similar ZCR values. (3) ZCR is often used in conjunction with the volume for end-point detection; in particular, ZCR is used for detecting the start and end positions of unvoiced sounds. However, the ZCR parameter is easily affected by lower-frequency noise, thus the short time band-pass ZCR is used as follows,

BZCR(l) = (1/2)·Σ_{n=0}^{N−1} { | sgn[ỹ_l(n) − T] − sgn[ỹ_l(n−1) − T] | + | sgn[ỹ_l(n) + T] − sgn[ỹ_l(n−1) + T] | }   (7)

where T is a threshold.

Let p_l(n) represent the related peaks in the frame, defined as

p_l(n) = 1 if ỹ_l(n) > ỹ_l(n−1) and ỹ_l(n) > ỹ_l(n+1);  0 otherwise   (8)

Let N_l denote the number of peaks (PN) in the frame,

N_l = Σ_n p_l(n)   (9)

The average value of short time peaks (AP) is defined as

P_l = ( Σ_n | ỹ_l(n)·p_l(n) | ) / N_l   (10)
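The pre-processing chain and the time-domain features defined so far (eqs. (1)-(10)) can be sketched in pure Python as follows. This is an illustration only: the frame length, frame shift and signal values are placeholders, and the magnitude form of the energy sum is an assumption (the printed eq. (4) leaves the exponent ambiguous), not the authors' embedded implementation.

```python
import math

def preemphasize(signal, a=0.97):
    """First-order preemphasis H(z) = 1 - a*z^-1 (eq. (1))."""
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_blocks(signal, n=200, r=100):
    """Block the signal into frames of N samples, adjacent frames
    separated by R samples (here 50% overlap)."""
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, r)]

def hamming(frame):
    """Window a frame with w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)),
    as in eqs. (2)-(3)."""
    big_n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (big_n - 1)))
            for i, x in enumerate(frame)]

def sgn(x):
    """Sign function of eq. (6)."""
    return 1 if x > 0 else (0 if x == 0 else -1)

def short_time_energy(frame):
    """Short time energy of eq. (4), summed here over magnitudes
    (assumption; see lead-in)."""
    return sum(abs(x) for x in frame)

def zero_cross_rate(frame):
    """ZCR of eq. (5): half the summed sign changes between
    adjacent samples."""
    return 0.5 * sum(abs(sgn(frame[i]) - sgn(frame[i - 1]))
                     for i in range(1, len(frame)))

def peaks(frame):
    """p_l(n) of eq. (8): 1 where a sample exceeds both neighbours."""
    return [1 if 0 < i < len(frame) - 1
            and frame[i] > frame[i - 1] and frame[i] > frame[i + 1]
            else 0
            for i in range(len(frame))]

def average_peak(frame):
    """AP of eqs. (9)-(10): mean magnitude of the peak samples."""
    p = peaks(frame)
    n_l = sum(p)
    return sum(abs(y) for y, pi in zip(frame, p) if pi) / n_l if n_l else 0.0
```

At an assumed sampling rate of 8 kHz, the 25 ms frames with 12.5 ms overlap used later in Section 3 would correspond to N = 200 and R = 100.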
The variance of short time peaks (VP) is defined as

v_l = ( Σ_n [ ỹ_l(n)·p_l(n) − M_l ]² ) / N_l   (11)

where M_l is the average value of the amplitudes of the short-time signal in a frame.

The above six parameters are assembled into a feature vector for the frame, and a total of forty frames are used as an analysis window; the features of a window are then used to represent the information of a kind of vehicle.

2.3. Two-Level F-ratio Based Method

Vehicle classification is a pattern recognition task with four steps [5]:

(1) Feature measurement, in which a sequence of measurements is made on the input signal to define the "test pattern".

(2) Pattern training, in which one or more test patterns corresponding to sounds of the same class are used to create a pattern representation of the features of that class. The resulting pattern, generally called a reference pattern, can be a template, derived from some type of averaging technique, or it can be a model that characterizes the statistics of the features of the reference pattern. Because of the non-stationary nature of acoustic signals, the short-time feature extraction of Section 2.2 is performed sequentially over time, producing a sequence of feature vectors as the pattern.

(3) Pattern classification, in which the unknown test pattern is compared with each class reference pattern and a measure of similarity (distance) between the test pattern and each reference pattern is computed.

(4) Decision logic, in which the reference pattern similarity scores are used to decide which reference pattern (or possibly which sequence of reference patterns) best matches the unknown test pattern.

The factors that distinguish different pattern recognition approaches are the types of feature measurement, the choice of templates or models for reference patterns, and the method used to create reference patterns and classify unknown test patterns. In this paper, a linear classification algorithm is designed and a feature-weighted Euclidean distance metric is adopted, in which the relevant weight is the F-ratio of the different feature dimensions.

The F-ratio is a statistic often used in speaker recognition [6]; it is proportional to the ratio of the variance of the means of each speaker's feature distribution to the average value of the variance of each distribution. The farther apart the individual distributions are with respect to their average spread, the higher the F-ratio. Thus the F-ratio is an indication of a feature's effectiveness. In vehicle classification, the feature parameters are evaluated in terms of their ability to discriminate vehicles and their dependence on other parameters. For the former purpose the F-ratio of the analysis of variance is used. For a given parameter, the values obtained from the repetitions by each vehicle may be regarded as samples from a probability distribution associated with that vehicle. For vehicle classification, a good parameter is one for which these individual vehicle distributions are as narrow and as widely separated as possible.

The F-ratio is given by [6],

F = [ (1/m)·Σ_{j=1}^{m} (μ_j − μ)² ] / [ (1/(n·m))·Σ_{i=1}^{n} Σ_{j=1}^{m} (x_ij − μ_j)² ]   (12)

where μ_j is the estimated mean for the j-th vehicle,

μ_j = (1/n)·Σ_{i=1}^{n} x_ij   (13)

and μ is the estimated over-all mean,

μ = (1/m)·Σ_{j=1}^{m} μ_j   (14)

Using the F-ratio weight we obtained a basic performance for the classification of five kinds of vehicles. However, there are still errors of classification. We analyzed the confusion patterns of the error classification and found that their features with larger values of F-ratio are nearly similar, while the features with smaller ones are very different. Consequently, in order to improve the performance, a two-level F-ratio weighted algorithm is proposed. In the algorithm, we first use the F-ratio of the total patterns as the weight to select the candidate set of confusion patterns, and then, within the candidate set, use the F-ratio of the confusion patterns as the weight to give the final classification result. The algorithm is as follows:

Step 1. Calculate the F-ratio of the over-all patterns (Fall) and the F-ratio of the confusion patterns (Fconfu).
Step 2. For the input acoustic signal of the vehicle, get the feature vectors of the analysis window.
Step 3. Using Fall as the weight, calculate the weighted Euclidean distances between the input feature vectors and the different patterns.
Step 4. Select the two best patterns as the candidates.
Step 5. Using the Fconfu of the candidate patterns as the new weight, recalculate the weighted Euclidean distances between the input feature vectors and the candidate patterns.
Step 6. Take the best pattern as the final result.

3. Experiments

The audio data of the experimental corpus were selected from the Internet [7]. They include the acoustic signals of five kinds of vehicles, i.e., airplane, helicopter, truck, tank and jeep. The MSP430F149 is used to construct the embedded real-time processing platform; it is a Texas Instruments 16-bit ultra-low-power microcontroller with a powerful 16-bit RISC CPU, a fast 12-bit ADC, and a hardware multiplier. It is well suited to typical applications such as sensor systems that capture analog signals, convert them to digital values, and process and transmit the data to a host system.

In the experiments, the audio signal is first parameterized using a feature analysis. Each analysis frame is processed with a Hamming window; the frame length is 25 milliseconds, the frame overlap is 12.5 milliseconds, and the preemphasis factor ã is selected as 0.97. Each testing file is cut into pieces, and the length of each piece is about 2 seconds. There is only one kind of vehicle in a piece and the system gives one result per piece. The six features are calculated for each frame: STE, ZCR, BZCR, VP, AP and PN. A typical F-ratio of the six features for a vehicle is shown in Figure 2. It is noted that the F-ratios of the two new features, VP and AP, have larger values; thus it is reasonable to think that they contribute much to the classification results.

Using the F-ratio as the weight of the Euclidean distance between the test vehicle and the different patterns, a classification accuracy of 77.3% is obtained. This result is not very good; there
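To make the two-level weighted matching of Section 2.3 above concrete, the following sketch computes per-dimension F-ratio weights in the spirit of eqs. (12)-(14) and applies the two-stage weighted Euclidean decision of Steps 1-6. The class names, sample values and helper names are invented for illustration; this is not the authors' embedded code.

```python
def f_ratio(samples_by_class):
    """Per-dimension F-ratio: variance of the class means divided by
    the averaged within-class variance (in the spirit of eqs. (12)-(14))."""
    m = len(samples_by_class)
    dims = len(next(iter(samples_by_class.values()))[0])
    weights = []
    for d in range(dims):
        means, within = [], []
        for vecs in samples_by_class.values():
            vals = [v[d] for v in vecs]
            mu_j = sum(vals) / len(vals)
            means.append(mu_j)
            within.extend((x - mu_j) ** 2 for x in vals)
        mu = sum(means) / m
        between = sum((mu_j - mu) ** 2 for mu_j in means) / m
        avg_within = sum(within) / len(within)
        weights.append(between / avg_within if avg_within else 0.0)
    return weights

def weighted_dist(x, ref, w):
    """Squared F-ratio-weighted Euclidean distance (ranking-equivalent)."""
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, ref))

def classify(x, refs, f_all, f_confu):
    """Steps 3-6: rank all reference patterns with the over-all weights,
    keep the two best as candidates, then re-score the candidates with
    the confusion-pattern weights."""
    ranked = sorted(refs, key=lambda c: weighted_dist(x, refs[c], f_all))
    candidates = ranked[:2]
    return min(candidates, key=lambda c: weighted_dist(x, refs[c], f_confu))
```

In use, `refs` would hold one mean feature vector per vehicle class, `f_all` the F-ratio weights over all training patterns, and `f_confu` the weights recomputed over the frequently confused classes only, so that the dimensions that actually separate the confusable pair dominate the second-stage decision.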
dots: an Audio Entertainment Installation using Visual and Spatial-based Interaction
Abstract. “dots” is an interactive sound installation that takes into account the spatial position of an arbitrary number of participants
in order to algorithmically synthesize an audio stream in real-time. The installation core is a software application developed during
this work, which employs advanced video and audio processing techniques in order to detect the exact participants’ positions and to
weight-mix short audio granules. Audio mixing is performed using a virtual spatial gridding of the installation space in two
dimensions. The synthesized audio stream reproduction is combined with a number of appropriately designed visual effects, which
aim to enhance the participants’ comprehension and render the “dots” installation a high-quality interactive audiovisual platform.
techniques were employed to identify blobs from video real-time captures and extract all the spatial information needed. More specifically, the libraries that were used are the following: i) the video capture library (which is included in the basic Processing installation package), ii) the minim library [7] and iii) the "blob detection" library [8]. In the order listed above, the first code library is responsible for the input of the video image into the core application. The second employs the well-known JavaSound API to provide an easy-to-use audio library and is used here for the reproduction/playback of the synthesized audio component. Finally, the blob detection library is employed for detecting blobs in sequential video frames in real-time (i.e. for the recognition of the outlines from the footage of the video signal).

the grid areas noted above gets triggered by the presence of a participant. On the other hand, the dot colour is defined by the number (N) of the detected participants within the i-th specific grid area, using the following equation:

dot colour_i = { blue, N_i = 1;  yellow, N_i = 2;  red, N_i ≥ 3 }   (1)

that is, the blue dot appears when only one person is traced, the yellow when two people are traced and the red when there are three or more people in the specific active grid area.

Figure 2: The dots colour shapes designed and the "dots" logo
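The colour rule of eq. (1) can be sketched as a tiny function; the function name is invented, and the mapping follows the text (blue, yellow, red dots).

```python
def dot_colour(n_participants):
    """Colour of the dot for a grid area with N detected participants:
    blue for one, yellow for two, red for three or more."""
    if n_participants <= 0:
        return None  # no dot drawn for an empty grid area (assumption)
    if n_participants == 1:
        return "blue"
    if n_participants == 2:
        return "yellow"
    return "red"
```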
projector inside the room is made. This means, that one visitor
2.2. The “dots” installation space outside the installation room can see the exact position of each
person together with the number of them that are in a specific
The technical equipment that is needed for the realizing the part of the grid inside the room.
complete installation is a personal computer (typically with a
core2duo processor, 2GB of basic memory and ideally a high 3. “dots” installation demonstration
performance video processing expansion card), a High
Definition (HD) video camera for creating the input video The complete “dots” platform was realized for the purposes of
signal, a video projector with a minimum analysis of 1024x768 the second annual festival of the department of audiovisual arts,
pixels and two monitor active loudspeakers. As it shown in held in Corfu, Greece on May 2008. The installation room
Figure 3, the camera is located on the ceiling so that it can track dimensions were 4 m (width) x 4m (length) x 3.4m (height). As
and trace the movement of people that exist and move into the mentioned previously, the camera was placed on the ceiling.
installation room. In the same Figure, the 3x3 virtual grid mentioned in the previous Section is also illustrated.
The computer required for executing the core application is located inside the room, preferably towards its rear. Beyond the minimum system requirements the computer must meet, the technical equipment must be silent so that the audience is not disturbed by noise; hence the computer was placed inside a special sound-absorptive construction. The "dots" core application runs on this computer, receiving the data from the camera and projecting the final result on the opposite wall with the help of the projector. The projector must therefore be installed at the back of the room, projecting the final visual effect on the front wall.
The stereo loudspeakers employed are located next to the two front corners of the room and give feedback to the user with sounds that are produced each time in a different combination. In a future enhanced version of the installation, surround sound will be reproduced from four or more loudspeakers in the room.
During this work, a number of important aspects related to the effective realization of the installation were considered, such as the presence of (preferably ambient) light shed equally across the whole room, and the selection of a light-coloured floor or a white carpet so that the blob detection accuracy is significantly increased. Moreover, the colour used for painting the walls is an important installation parameter, as it should be the same for all room surfaces and preferably light (for example light blue or white).
Under these dimensions, the effective detection surface covered a satisfactory part of the installation room. The colour of the room was plain and bright so that the contrast between the human and the room would be high. This condition is very significant for the blob detection algorithm, as it clearly defines the human edges/borders. Accordingly, a white carpet was placed on the floor in order to additionally increase the final video signal contrast. Additionally, the illumination inside the room had to be uniform and ambient; otherwise it would cause problems for the blob detection process, while the sensitivity of the detection was adjusted according to the specific illumination during the initial installation setup.
The equipment described in Section 2.2 had to be hidden in order to be transparent to the people participating in the installation and not to attract their attention. The camera used was connected via a typical FireWire interface, allowing fast and robust digital video signal transmission to the core application.
Finally, the gain of the different sound sources was appropriately and independently adjusted, depending on the acoustic properties of the room and the disturbance it would cause to neighbouring installations.
An additional, important point of interest was to design an appropriate audiovisual sub-system to attract the attention of people passing outside the installation, as it was difficult to notice the installation's functionality from outside. An ambient sound, together with the small monitor installed outside the installation room (as mentioned previously) showing the dots' patterns created by the participants' interaction, finally (and efficiently) solved this problem (see Figure 4). The ambient sound excited the visitors' curiosity, so that most of them finally decided to enter the installation room.
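The contrast and illumination constraints above exist because blob detection typically finds dark connected regions against a bright background. A minimal sketch of this idea follows; it is not the installation's actual implementation, and the brightness threshold and the mapping to a 3x3 virtual grid are illustrative assumptions:

```python
# Minimal blob-detection sketch (not the installation's implementation):
# find connected regions of dark pixels in a grayscale frame, then map
# each blob's centroid to a cell of the 3x3 virtual grid.

def detect_blobs(frame, threshold=128):
    """Return blobs as sets of (row, col) pixels darker than `threshold`
    (participants appear dark against the bright room)."""
    rows, cols = len(frame), len(frame[0])
    seen, blobs = set(), []
    for r in range(rows):
        for c in range(cols):
            if frame[r][c] < threshold and (r, c) not in seen:
                # flood-fill one 4-connected component
                stack, blob = [(r, c)], set()
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen or not (0 <= y < rows and 0 <= x < cols):
                        continue
                    if frame[y][x] >= threshold:
                        continue
                    seen.add((y, x))
                    blob.add((y, x))
                    stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
                blobs.append(blob)
    return blobs

def grid_cell(blob, rows, cols, grid=3):
    """Map a blob's centroid to a (row, col) cell of the virtual grid."""
    cy = sum(p[0] for p in blob) / len(blob)
    cx = sum(p[1] for p in blob) / len(blob)
    return int(cy * grid / rows), int(cx * grid / cols)
```

A low-contrast floor or uneven lighting would blur the pixel-value gap that the threshold relies on, which is why the white carpet and uniform ambient light matter.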
dots: an Audio Entertainment Installation using Visual and Spatial-based Interaction
5. Acknowledgments
The authors wish to thank Mr. Loukas Ziaras for providing the
photographic material included in this work.
AllThatSounds: Associative Semantic Indexing of Audio
Data
Authors: Hannes Raffaseder, Matthias Husinsky, Julian Rubisch
University of Applied Sciences St.Pölten, Austria
Abstract
Motivation: The supply of and demand for digitally stored sound files have grown rapidly in recent years and have reached an unmanageable volume. For media producers, the search for suitable sounds is an essential and time-consuming part of their work.
The research project AllThatSounds aims to improve the search procedure by indexing the files in an associative, semantic way. A method for the systematic categorization of sounds is introduced to simplify the annotation of audio files with metadata. Furthermore, additional data is collected by evaluating user profiles and by analyzing the sounds with signal processing methods.
The project's result is a tool for structuring sound databases with an efficient search component, intended to guide users to suitable sounds for the soundtrack of their media productions.
Introduction
Supply of and demand for digitally stored audio have grown rapidly in recent years. The number, diversity and quality of available sound files have reached an unmanageable level. Efficient approaches to searching audio data therefore play an important role in the media production process. Most search tools available today require the user to know important features of a sound before the search can be carried out. A search request using semantic features, which are closer to human perception than technical parameters, is hardly possible. In a broader perspective, the search procedure is further complicated by the volatility of the medium: sound is fleeting, which makes acoustic events hard to grasp and, in turn, hard to describe verbally. Usually it is not the sounds themselves but the preceding events that caused them that are described.
The research project AllThatSounds aimed at simplifying the process of finding suitable sounds for media productions. For this purpose, many different ways to categorize and describe sonic events were analyzed, evaluated and linked. Apart from the practical use of the tool, the research questions raised by this work trigger a discussion about the perception, meaning, function and effect of the soundtrack of a media product.
fit optimally to the mood and image of the product. Since only few people have sufficient knowledge in this area, the search for audio is performed more or less arbitrarily, at best guided by intuition and personal taste. Many interesting pieces of music and audio files are therefore never taken into account.
Descriptive Analysis
The descriptive analysis enables a uniform description of acoustic events from the sound designer's perspective. The aim is that a sound can be sufficiently and distinctly registered already at the time it is uploaded into the database.
Based on the insights of Murray Schafer (3), David Sonnenschein (4), Barry Truax (5), Theo van Leeuwen (6) and earlier work of the author (2), a general classification of sounds was developed that enables a differentiated description in the following categories: type of excitation, sound source, timbre, pitch, dynamics, room, familiarity and possible use.
The difficulties of describing sonic events in an adequate and universal way become apparent very quickly. On the one hand, a detailed and satisfactory description of a sound requires high complexity and accuracy; on the other hand, the process of describing the sound must not take too long in practical use. To keep the annotation time at upload short, a full description cannot be achieved.
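The classification categories above could be represented, for instance, as a simple record type. The following is a hypothetical sketch: the category names follow the paper, but the field values and the completeness measure are our own illustration, not the project's schema:

```python
# Hypothetical record for the descriptive-analysis categories.
from dataclasses import dataclass, field

@dataclass
class SoundDescription:
    excitation: str = ""    # type of excitation, e.g. "impact"
    source: str = ""        # sound source, e.g. "glass bottle"
    timbre: str = ""        # e.g. "bright, metallic"
    pitch: str = ""         # e.g. "indefinite"
    dynamics: str = ""      # e.g. "short attack, fast decay"
    room: str = ""          # e.g. "small, dry"
    familiarity: str = ""   # e.g. "everyday"
    possible_use: list = field(default_factory=list)  # e.g. ["foley"]

    def completeness(self):
        """Fraction of the eight categories filled in -- a cheap proxy for
        the trade-off between description quality and annotation time."""
        values = [self.excitation, self.source, self.timbre, self.pitch,
                  self.dynamics, self.room, self.familiarity]
        filled = sum(1 for v in values if v) + (1 if self.possible_use else 0)
        return filled / 8
```

Such a measure makes the trade-off explicit: an annotator can stop early, and the database still knows how complete each description is.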
Listener’s Analysis
The context, the personal mood of the listener, the field of application and the intention of the production all strongly influence the description of an acoustic event. There is a semantic gap between human perception and objectively measurable signal parameters; sometimes two technically very similar signals have totally different effects (2)(6).
A single description of an acoustic event might therefore lead to an unwanted result in another context. To mitigate this problem, AllThatSounds uses collaborative interfaces and description methods that became popular under the buzzword Web 2.0.
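How such collaborative descriptions might be combined can be sketched minimally: several users contribute free-form tags for the same sound, and the aggregated counts give a description that is less dependent on any one user's context. The aggregation scheme below is our illustration, not the project's actual method:

```python
# Illustrative aggregation of collaborative tags for one sound.
from collections import Counter

def aggregate_tags(user_tags):
    """user_tags: list of tag lists, one per user.
    Returns (tag, votes) pairs ranked by vote count."""
    counts = Counter(tag for tags in user_tags for tag in tags)
    return counts.most_common()
```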
Machine Analysis
In recent years, many research institutions have started working on methods that give machines perceptual capabilities similar to those of humans. In the research field of Music Information Retrieval, fundamental work has been done on using the results of signal processing methods on audio data to infer its musical or semantic content. Classification is often done using artificial intelligence models; basic examples are automatic beat and tempo tracking, pitch estimation, and melody or instrument extraction.
For AllThatSounds, the application of sound similarity models promised to be particularly useful. Many audiovisual productions use sounds that have no relation at all to the objects shown on screen but still appear authentic. Therefore a similarity model based on Mel Frequency Cepstral Coefficients (MFCCs) is used to compare every sound in the database with every other. This model has proved very useful for comparing and classifying sounds and music (1). The 20 most related sounds are stored with each sound, enabling a new kind of browsing for viable sounds.
Furthermore, for each sound, descriptors corresponding to the MPEG-7 standard and further psychoacoustic parameters such as roughness and sharpness are calculated and can be used in search requests.
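The nearest-neighbour step can be sketched as follows, assuming each sound has already been reduced to a fixed-length feature vector (e.g. time-averaged MFCCs); the MFCC extraction itself is omitted, and cosine similarity here merely stands in for whatever distance the actual model uses:

```python
# Sketch: rank sounds by feature-vector similarity and keep the top k,
# mirroring the "20 most related sounds" stored with each sound.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_related(target_id, features, k=20):
    """features maps sound id -> feature vector; returns the ids of the
    k sounds most similar to `target_id`."""
    others = [(cosine(features[target_id], v), sid)
              for sid, v in features.items() if sid != target_id]
    others.sort(reverse=True)
    return [sid for _, sid in others[:k]]
```

Precomputing this list per sound turns similarity search into a simple lookup at browsing time.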
Semantic Analysis
In media production, the semantic content of acoustic events is crucially important for achieving a certain perception in the consumer. To date it is almost impossible to extract semantic meaning from audio signals automatically, since not only signal parameters but, to a wide extent, also the cultural and social experience of the listener play a role. How these denotations of specific aural events arise, how they manifest and how they are described is in many respects still uninvestigated. Theo van Leeuwen shows (6) that different intermodal interplays play an important role, as do the social, cultural and historic environment.
AllThatSounds aims at combining the metadata annotated by users with the calculated features in order to explore possibilities of automatically extracting semantic denotations. Conclusions can be drawn once the database has been used by numerous users over a longer period.
To investigate this topic further, a library of short film and video clips that stand out because of their interesting or typical sound design was also created. These clips were tagged and indexed in categories such as "event", "symbol" and "acoustic material". That way it is easy to search for clips with certain events like "murder" (event), "grief" (symbol) or "orchestra music" (acoustic material). A detailed evaluation of the material in the clip library has yet to be done.
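Searching such a tagged clip library can be sketched as a filter over category/value pairs. The clip entries below are invented examples; only the three category names come from the text:

```python
# Invented example entries for the clip library described above.
clips = [
    {"title": "clip_01", "event": "murder", "symbol": "grief",
     "acoustic_material": "orchestra music"},
    {"title": "clip_02", "event": "chase", "symbol": "tension",
     "acoustic_material": "percussion"},
]

def find_clips(library, **criteria):
    """Return clips whose tags match every given category=value pair."""
    return [c for c in library
            if all(c.get(cat) == value for cat, value in criteria.items())]
```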
Acknowledgement
Starting in October 2005, a research team at the University of Applied Sciences St.Pölten, led by Hannes Raffaseder, worked together with the Vienna-based companies Audite / Interactive Media Solutions and Team Teichenberg and, since April 2007, also with the University of Applied Sciences Vorarlberg. The work was supported by the Österreichische Forschungsförderungsgesellschaft (FFG) within its FHplus funding programme until May 2008.
References
[1] Feng, D., Siu, W.C., Zhang, H.J. (2003): Multimedia Information Retrieval and Management, Springer-Verlag, Berlin.
[2] Raffaseder, H. (2002): Audiodesign. Hanser-Fachbuchverlag, Leipzig.
[3] Schafer, Murray R. (1994): The Soundscape – Our Sonic Environment and the Tuning of the World. Destiny Books,
Rochester.
[4] Sonnenschein, D. (2001): Sound Design – The Expressive Power of Music, Voice, and Sound Effects in Cinema.
Michael Wiese Productions, Studio City.
[5] Truax, B. (2001): Acoustic Communication (2nd ed.), Ablex Publishing, Westport.
[6] van Leeuwen, T. (1999): Speech, Music, Sound. MacMillan Press Ltd., London.
IMPROVe: The Mobile Phone as a Medium for Heightened Sonic
Perception
Richard Widerberg
Dånk! Collective
Karl Johansgatan 47 G
SE-41455 Göteborg, Sweden
rwiderberg@gmail.com
+46708977452
Zeenath Hasan
School of Arts and Communication
Malmö University
SE-20506 Malmö, Sweden
zeenath.hasan@gmail.com
Abstract. In this paper, we describe the design and research phase of a project that aims to create conditions for heightened sonic
perception through a mobile phone based software application. The initial design concept is that of an aural architecture for sonic
socio-cultural exchange where sonic realities of the everyday are improvised live in a non-linear mode. The design approach adopted
is collaborative. The project is a work in progress.
1. Introduction

Our experience of our everyday lives is mediated through a 'multitude of mechanically produced sounds' [1]. The everyday sounds that we experience are produced outside of our own volition. The capability to capture sounds and play them back has made it possible to listen to sounds outside of their original context [2]. Sound recording technology can be used as an extension of listening and can enhance our aural awareness [3]. The mobile phone is also a medium through which sounds are heard outside of their original context. However, the normative definition of the mobile phone as a medium for communication has restricted its potential as a medium for sounds that exist outside of immediate tele-communication. This design and research project explores the potential of the mobile phone as a medium of communication beyond its currently dominant role as a transmitter of sounds. The design space for exploration is the mobile phone as a digital networked medium that is appropriated by social networks to communicate across boundaries of time, space and context [4]. The project thus proposes the design of the mobile phone as a medium for the exchange of everyday sounds within communities and across socio-cultural contexts by mobilizing the potential of the mobile phone as a tool for the production of everyday sounds.

2. Approach

The project adopts a collaborative design approach by gathering a community of interest [5] consisting of members from different communities of practice [6] who conduct their activities at various levels of involvement. The design process gathers participants around an initial design concept that is used as a boundary object [7], or a common point of reference. The design concept that emerges in the interactions with the gathered participants is regarded as an artifact [8], because perceiving it as such reveals the direct and incidental connections between the different aspects that come together in its creation.

3. Initial Design Concept

3.1. Background
A community of practice with a history and tradition of working closely with found sounds through electronic and digital tools is that of electro-acoustic musicians. When viewed on a scale of involvement from active to passive, the members of this community of practice include not only those who actively engage in the creation and reproduction of their own tools or instruments for mixing sounds, but also those who passively listen to the sounds that are produced. Lastly, one mode in which electro-acoustic musicians might work to compose sounds in a group is spontaneously, through live improvisation.

3.2. Improvisation
Professional musicians have long practiced improvisation to create compositions spontaneously. Melodies, harmonies and rhythms are combined within the traditional structures of music that the professional musician has been trained in. Musical instruments have been known to tear away from their established histories to accommodate and challenge each other. When the mobile device is used as a musical instrument in an improvisation, what musical structures, if any, emerge? Improvisation is a collective activity; professional musicians practice it to scope the boundaries of the musical form. What pursuits will the untrained improviser indulge in when involved in sonic improvisation?
4. Working Prototype

A working prototype was developed for use as a common point of reference for discussions on the design concept with gathered participants and other stakeholders.

4.1. Scenarios
Scenarios were constructed as a way to unfold the initial design concept at work and to build a shared understanding within the project team. Two scenarios are provided in brief below.

4.1.1. Scenario 1
A group of friends, untrained in music, record soundscapes from their daily life. The group meets at a local pub, where there is a sound system for playing the gathered sounds. They perform a live remix of the sounds on their mobile devices. The aural exchange affects individual and group understanding at the cultural and social level through a sharing of the everyday soundscape.

4.1.2. Scenario 2
Trained music practitioners, such as cellists, record sound objects through a mobile phone. The group meets in a concert hall. They perform a group improvisation with the collected sounds through their respective mobile devices. The exchange is an exploration of the formal aspects of aural composition that builds on traditional music structures and creates new forms of music.

4.2. System Functionality
Sounds are collected via a mobile phone and sent to a location where they can be played back into a sound system. The same mobile phone controls the playback of the collected sounds in the sound system. Playback control occurs in the physical location of the sound system. The sounds that are played back are processed live via interaction through the mobile phone. The output of the processed sound can be heard directly through the sound system.

4.3. Prototype Application
The Python programming language was used for rapid prototyping on Nokia Series 60 devices [9]. The phone microphone is used for the recording interaction. As current audio processing capabilities on the phone require some amount of work, mixing and playback functions are processed on an external computer using Pure Data [10]. The processing is then controlled live by the mobile phone via a Bluetooth connection.

4.4. Graphical User Interface
For the recording of sounds, a simple interface allows recording, listening to the recording and then uploading the recording to a server; the recording interface consists of three buttons for these three functions. For the improvisation interface, the four-way directional button was the only key activated for interaction with the GUI. The 'Play' command selects recorded sounds at random. Three options are provided for live mixing of the played-back sounds, allowing the participant to control the volume, speed and loop length. The 'Stop' command stops the playback.

5. Field Activity

… which also clarifies the concept, can be viewed on the project's web site [11]. The initial design concept was also introduced separately to eight participants, who were selected based on their active to passive involvement with found sounds. Participants were given a mobile phone to carry for a month, after which separate discussions were held with each of them.

6. Findings

When the mobile phone was used as an instrument for recording everyday sounds by participants who do not pursue musical performance as a profession, they reported emotional, nostalgic, anecdotal and politically analytical associations with the sounds they chose to record. Participants who pursued music production as a career mainly recorded aesthetically pleasing sounds for use in their next performance. The possibility of recording sounds and then processing and playing them back into a sound system makes the mobile device a musical instrument among professional musicians. Although the above two groups have been presented as a dichotomy, the division is not strict, because both groups reported associations with sounds other than the aesthetic.

7. Conclusion

The project began with the objective of conceiving the mobile phone as a medium for heightened sonic awareness. It has achieved proof of concept regarding people's reception of the existence and use of such an application on the mobile phone. The next threshold is to use this application for heightened sonic awareness in different pedagogical contexts.

References
[1] Bull, M., Back, L. (Eds.): The Auditory Culture Reader. Berg Publishers, Oxford, New York (2003)
[2] Kahn, D.: Noise, Water, Meat: A History of Sound in the Arts. The MIT Press, London (2001)
[3] Truax, B.: Acoustic Communication, 2nd edition. Ablex Publishing, Westport (2001), 219
[4] Rheingold, H.: Smart Mobs. Perseus Publishing, Cambridge, Massachusetts (2002)
[5] Fischer, G.: Beyond 'Couch Potatoes': From Consumers to Designers and Active Contributors. First Monday, volume 7, number 12 (2002)
[6] Wenger, E.: Communities of Practice: Learning, Meaning and Identity. Cambridge University Press, USA (1998)
[7] Marick, B.: Boundary Objects. http://www.visibleworkings.com
[8] Diaz-Kommonen, L.: Art, Fact and Artifact Production: Design Research and Multidisciplinary Collaboration. University of Art and Design Helsinki, Finland (2002)
[9] PyS60. http://sourceforge.net/projects/pys60/
[10] Pure Data: real-time graphical programming environment for audio, video, and graphical processing. http://www.puredata.org/
[11] IMPROVe project web site. http://www.riwid.net/improve
Audio Interface as a Device for Physical Computing
Kazuhiro Jo, RCAST, University of Tokyo / Culture Lab, Newcastle University, jo@jp.org
Abstract. In this paper, we describe the use of the audio interface as a device for physical computing. We compare the audio interface with other devices and describe its characteristics. We also present examples of its use in three different art works: Monalisa "shadow of the sound", The SINE WAVE ORCHESTRA stay amplified, and AEO. We explain the implementation of each work with different physical components. Finally, we discuss some of the potential of the audio interface for future implementations.
by binary codes [7]. The work was premiered at Open Space at the NTT InterCommunication Center from 9th June 2006 to 11th March 2007. The work consists of a set of computers running a custom version of the Monalisa applications, a projector, a camera, a microphone, a push button, and a speaker situated in a room (Figure 1). We briefly describe the procedure of the work below.

… of a set of a control computer, a sound synthesis computer, a light, a foot switch, a fader controller, a rotational controller, and 16 speakers circularly situated in a room (Figure 3). We briefly describe the procedure of the work below.
conduction in the same manner as the push button of Monalisa "shadow of sound". For the input channel of the fader controller, we set up a volume calculator. It reports the change of the resistance of the fader as the change of volume of the incoming audio signal. With the change of the volume, the control computer sends a message to the synthesis computer engine to change the frequency of a sine wave. The sampling resolution for the volume is 16-bit (i.e. 65536 levels); it is therefore sufficient to accomplish subtle changes of the frequency of the sine wave.

2.3. AEO
AEO is a sound performance project consisting of three members: Eye, Taeji Sawai and the author. AEO has performed at several international festivals (e.g. Dutch Electronic Art Festival 2004, Radar 7 at 24 Festival de MEXICO 2008). In the project, each member takes one of three roles: performance (Eye), sound design (Sawai) and instrument design (the author). During AEO performances, the performer holds an instrument in each hand and shakes, sways, or swings them. These movements by the performer produce patterns of sound and light through devices and a computer. The instrument has undergone a transition in function and form over six iterations [5].

Figure 5: AEO instruments

2.3.1. Accelerometer, distance sensor, small light bulb
We developed two sets of converter boxes for the accelerometer and the distance sensor, and an amplified box for the small light bulb. Each converter box has a connector to the sphere, one audio input, three audio outputs, one extra audio input/output, a power socket, and three ring modulation circuits (Figure 6). All audio inputs/outputs of the converter boxes are directly connected to the output/input channels of a 56-channel, 192 kHz, 24-bit external audio interface, the RME Fireface 800 (http://www.rme-audio.com/).

Figure 6: Converter boxes (black for the sphere with the transmitter, white for the sphere with the receiver)

The amplified box has two audio outputs, one volume control, two audio inputs, a power socket, and a two-channel amplifier (Figure 7).

We detected the inclination and acceleration of each axis of the accelerometer as a separate audio signal. For each signal, we set up a volume calculator. It reports the change in analog voltage from each axis as the change of volume of the incoming audio signal. We employ the reported value to control the parameters of the sound represented in the performance.
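The "volume calculator" idea (reporting a sensor's state as the level of an incoming audio signal) can be sketched as an RMS measurement over a block of samples, here mapped onto a sine-wave frequency as in The SINE WAVE ORCHESTRA stay amplified. The frequency range below is an assumption for illustration, not the work's actual mapping:

```python
# Sketch of a "volume calculator": the RMS level of an audio block stands
# for the sensor state (fader resistance, accelerometer voltage, ...).
import math

def rms(block):
    """Root-mean-square level of a block of float samples."""
    return math.sqrt(sum(s * s for s in block) / len(block))

def fader_to_frequency(block, f_min=100.0, f_max=1000.0):
    """Map the RMS level (0.0-1.0 for full-scale samples) to a sine-wave
    frequency in Hz. The range f_min..f_max is an illustrative assumption."""
    level = min(rms(block), 1.0)
    return f_min + level * (f_max - f_min)
```

Because the level is read from a 16-bit audio stream, even very small changes of the sensor state are resolved, which is what allows the subtle frequency changes described above.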
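Digitally, the ring modulation performed by the converter boxes' circuits corresponds to sample-wise multiplication of the input with a carrier. A minimal sketch follows; the carrier frequency is arbitrary and not taken from the work:

```python
# Ring modulation as sample-wise multiplication with a sine carrier.
import math

def ring_modulate(samples, carrier_hz, sample_rate=44100):
    """Multiply each input sample by a sine carrier of the given frequency."""
    return [s * math.sin(2 * math.pi * carrier_hz * n / sample_rate)
            for n, s in enumerate(samples)]
```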
Distance sensor:
For the distance sensor, we employ one extra audio input and one extra audio output of the converter boxes. The extra audio input is connected to the transmitter and the extra audio output is connected to the receiver. We assign an ultrasonic sine wave (40 kHz) to the extra audio input. The audio interface enables the use of such a high frequency thanks to its higher sampling rate compared with the built-in audio interface. The extra audio output produces the audio signal from the receiver. We detected the distance between the two spheres as an audio signal. For the signal, we set up a volume calculator. It reports the change of the distance as the change of volume of the audio signal. We employ the reported value to control the parameters of the sound represented in the performance.

Small light bulb:
For each small light bulb, we employ one audio output, one audio input, and one channel of the amplifier of the amplified box. By feeding audio signals into an appropriate amplifier, it is possible to bring the coil of the bulb into an excited state. We assign the represented sound of the performance to the audio input, amplify it, and connect it to the small light bulb. In the performance, the brightness of the bulb reflects the changes of the represented sound. The volume of the amplified box adjusts the threshold of the brightness.

3. Related works

There are some precursors who have also tried to employ audio signals as a means to communicate with the physical world. Allison and Place [1] developed the SensorBox. The device accepted six sensor inputs and two audio inputs. The data from each sensor was carried as the amplitude of a sine wave located in the 18 kHz to 20 kHz range and mixed back onto the two audio inputs. They did not provide much technical detail, but their approach was quite similar to ours with the converter box of AEO. The Canadian artist artificiel also explored the light bulb as a sound source in their electro-acoustic installation "beyond6281" [2]. They feed processed audio signals into powerful amplifiers in a similar way to the small light bulb of AEO. PingPongPlus [4] employed audio signals for position tracking. It detects the position of a ping-pong ball with a set of microphones mounted under the ping-pong table; the time of the hit at each microphone is detected, and the position is calculated from the time differences. Ms. Pinky [9] by Scott Wardle consists of a set of vinyl records and software running on Max/MSP. The vinyl contains special signals from which the software decodes the velocity, direction, and physical position of the needle on the surface of the vinyl in real time.

4. Discussions

We explored the characteristics of the audio interface as a device for physical computing. The audio interface accomplished higher sampling resolution and sampling rate than the other devices. In the installation Monalisa "shadow of the sound", we simply employ the push button with the built-in audio interface. In The SINE WAVE ORCHESTRA stay amplified, we accomplish subtle changes of the frequency of the sine wave with 16-bit sampling resolution. With the latest instrument of AEO, the sampling resolution and sampling rate of the audio interface enable it to represent subtle movements of the performer as patterns of sound and light better than the other devices. During the latest implementation of the AEO instruments, we conducted an informal comparison between the audio interface and the MAKE Controller Kit. We mapped the change of acceleration to the change of volume of a white noise sound. With the audio interface, when the performer shakes the sphere, the resulting sound could be heard as a kind of maracas percussion with fine time precision. However, with the MAKE Controller Kit, the resulting change of the sound did not reflect the subtle movement of the performer. As Wessel and Wright pointed out [10], low latency between gesture and gesture-controlled audio output is essential for live computer music performance. We believe that our approach suggests a sensitive way to communicate with the physical world, with precise measurement and timing.

Wessel and Wright also argued for the possibility of applying existing audio DSP (Digital Signal Processing) modules (e.g. filters, Fast Fourier Transforms, linear predictors) to process the signals from/to the physical world [10]. We have not investigated such possibilities in depth; however, we are considering handling multiple sensor signals with one channel of the audio interface by employing band-pass filters.

As future work, we are considering publishing our developments to the public. We plan to provide various instructions and examples of the use of audio signals to communicate with the physical world. We hope to encourage people to stimulate each other to discover the potential of the audio interface as a device for physical computing.

Acknowledgements

We would like to thank the other members of Monalisa, The SINE WAVE ORCHESTRA, and AEO. Monalisa "shadow of the sound" was developed with the support of the NTT InterCommunication Center and the FY2005 IPA (Information-Technology Promotion Agency) Exploratory Software Project (Project Manager: KITANO Hiroaki).

References
[1] Allison, J., Place, T.: SensorBox: practical audio interface for gestural performance. In Proceedings of the 2003 Conference on New Interfaces for Musical Expression, pp. 208-210 (2003)
[2] artificiel: beyond6281. Art + Communication 2006: WAVES, RIXC, the Center for New Media Culture, Riga, Latvia, pp. 38-39 (2006)
[3] Furudate, K., Jo, K., Ishida, D., Noguchi, M.: The SINE WAVE ORCHESTRA stay amplified. Art + Communication 2006: WAVES, RIXC, the Center for New Media Culture, Riga, Latvia, pp. 104-105 (2006)
[4] Ishii, H., Wisneski, C., Orbanes, J., Chun, B., Paradiso, J.: PingPongPlus: design of an athletic-tangible interface for computer-supported cooperative play. In Proceedings of CHI '99, ACM, New York, NY, pp. 394-401 (1999)
[5] Jo, K.: Transition of an Instrument: The AEO Sound Performance Project. Leonardo Music Journal No. 17, pp. 46-48 (2007)
[6] Jo, K., Furudate, K., Ishida, D., Noguchi, M.: Transition of instruments on The SINE WAVE ORCHESTRA. ACM Computers in Entertainment, October (2008) (to be published)
[7] Jo, K., Nagano, N.: Monalisa: "see the sound, hear the image". In Proceedings of the 8th International Conference on New Interfaces for Musical Expression, pp. 315-318 (2008)
[8] O'Sullivan, D., Igoe, T.: Physical Computing. Boston, USA: Thomson Course Technology (2004)
[9] Wardle, S.: Ms Pinky. http://www.mspinky.com/
[10] Wessel, D., Wright, M.: Problems and Prospects for Intimate Musical Control of Computers. Computer Music Journal 26(3), pp. 11-22 (2002)
Automatic genre and artist classification by analyzing improvised
solo parts from musical recordings
Jakob Abesser, Christian Dittmar, Holger Grossmann ({abesjb,dmr,grn}@idmt.fraunhofer.de)
Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany
Abstract. This paper introduces a set of high-level features to describe instrumental solo-parts. The set consists
of 148 single- and multidimensional features related to the melodic, harmonic, rhythmic and structural properties
of four instrumental domains. A simple yet common instrumentation model has been applied to describe both
the soloing and the accompanying instruments as well as rhythmic and melodic interaction between them. To
evaluate the features’ discriminative power related to different musical styles, an evaluation for content-based
genre and artist classification has been performed each with two different test sets consisting of symbolic and
real audio data. Two different classifier approaches have been utilized: a commonly used support vector machine (SVM) classifier with a preliminary linear discriminant analysis (LDA), and a novel approach based on the
Rhythmical Structure Profile which allows a tempo-adaptive representation of the rhythmic context provided by
the accompanying instruments. For both classification scenarios, ensemble decisions based on single instrument-
related classifiers led to the highest scores of 84.0% for genre and 58.8% for artist classification.
test sets (described later on in 2.3) fit into this model.

2.1 System overview

2.1.1 Transcription and pre-processing

The implemented system allows the processing of both symbolic and real audio data (MIDI and audio files). Our experiments are all based upon excerpts of 20 to 40 seconds length taken from the analyzed solo parts. To extract the score parameters from symbolic audio files, the MIDI Toolbox for MATLAB [6] has been used for data conversion. It allows deriving a list of all notes containing the parameters note onset and duration (both in seconds and bars), velocity, MIDI pitch and MIDI channel. To process real audio data, the Transcription Toolbox [4] (developed at the Fraunhofer Institute for Digital Media Technology) has been utilized. It is a software toolbox that encapsulates four different algorithms to perform a separate transcription of the melody, harmony, bass and drum track of a music piece. It furthermore offers the user manifold ways to correct the transcription results, e.g. by choosing a temporal quantization grid or a pitch correction causing all notes to fit to the manually selectable key of the analyzed excerpt of the song. The Transcription Toolbox also extracts the beat grid of the song, which enables a subsequent projection of all detected note onsets from their absolute values in seconds to certain multiples of the bar lengths and thus allows a tempo-independent onset representation.

2.1.2 Quantization and harmonic analysis

For some of the extracted rhythmic high-level features, the note onsets and durations have additionally been quantized to a 64th-note beat grid. Furthermore, a simplified harmony analysis has been applied to the harmony track. The goal was to determine the root note of each played chord. The system is able to detect the most common 2-, 3- and 4-note chords in all possible inversions by using chord interval templates. In case the chord was unknown, the lowest note was assumed to be the root note. For internal representation, all played chord notes are artificially elongated to allow a detection of the harmonic context for each note played by the soloist. By mapping the interval between each note of the solo melody and the detected root note of the simultaneously sounding chord, a representation called functional pitch was defined. Here, only the type of the interval (third, fifth, etc.) is projected to the corresponding integer value (3, 5, etc.); the size (e.g. major or minor third) is not taken into account, in order to increase the independence from the key type (major or minor).

2.2 Feature extraction

To describe the soloists' way of playing within a solo part, three main questions have been investigated. Which notes are played within the given harmonic and rhythmic context? How is the solo part structured? To what extent does the soloist interact with the accompanying instruments? Timbral characteristics of the instrument, the precise instrumentation of a solo (e.g. whether the soloist plays an electric guitar or a saxophone) as well as the playing styles applied by the soloist (like glissando or vibrato) have explicitly not been taken into account here. A total of 148 high-level features, both single- and multi-dimensional, have been implemented, and a certain sub-set of them can be extracted for each of the four instrumental tracks. In the following four sections, a selection of the implemented features will be explained in detail.

2.2.1 Melodic and harmonic features

Three different representations of the melodic progression have been examined to derive melodic and harmonic high-level features. Besides the absolute and relative pitch (intervals between adjacent notes in half-tone steps mapped to one octave), the functional pitch (see 2.1.2) of each note within the solo is determined based on the aforementioned harmony analysis. A wide range of different features characterizing the melody have been extracted. These are e.g. the pitch range in halftones, a measure of chord-tone ratio (derived by analyzing simultaneously played chord notes of the harmony instrument) as well as the temporal ratio of polyphonic parts, chromatic note sequences (with consecutive intervals of a half-tone step) and note sequences with constant pitch. Additionally, the progression of the relative pitch was also converted into the corresponding functional pitch values to derive a key- and scale-independent representation of the applied intervals. All single interval probabilities (e.g. of a fifth downwards or a third upwards) as well as some other basic statistical features like zero- and first-order entropy and the D'Agostino measure [2] have furthermore been computed as melodic features. The temporal ratio of fragments with a constant melodic direction is mapped to a measure of balanced direction; furthermore, the dominant direction (ascending or descending) is thereby determined as an additional feature.

2.2.2 Rhythmic features

For the computation of rhythmic high-level features, the note onsets, durations and inter-onset intervals have been analyzed. To characterize the perceived rhythmical precision of a track related to different beat grids (4th-, 8th-, 16th- and 32nd-note grid), the quantization cost was calculated as an inverse measure of rhythmic precision within the particular beat grid. Furthermore, a swing ratio was also calculated for the beat grids mentioned above, using an approach similar to the one described in [8]. To derive a rhythmical representation of all notes of an instrumental track that is independent from tempo and bar measure, we introduce the Rhythmical Structure Profile (RSP), which is derived from the un-quantized note onsets. The RSP is based on partitioning each bar length into k equidistant grid points, where different corresponding binary and ternary values of k (2-3, 4-6 etc.) have been investigated, each grid as an un-shifted and a shifted version related to down- and off-beat positions. Each note of the instrumental track is mapped onto those grids that contain a grid point around the note's onset time. By summing up the notes' normalized velocities mapped to all defined grid points, the RSP can be calculated and saved in form of a three-dimensional matrix. Afterwards, one can analyze the temporal distribution of notes both over all grids and within each grid. This allows the calculation of the features dominant rhythmical grid (containing the majority of all notes), dominant rhythmical feeling (down- or off-beat) and dominant rhythmical characteristic (binary or ternary). Furthermore, an algorithm to detect syncopations within different rhythmical grids based on the RSP was implemented.

2.2.3 Structure-related features

To describe the structure of a solo, both rhythmical and melodic repetitions within the instrumental tracks have been sought. For this purpose, an algorithm for detecting repeating patterns within character strings (Correlative Matrix approach [10]) has been utilized. These character strings are derived from the absolute pitches as well as from the quantized onset and duration values. All detected patterns were mapped into a three-dimensional representation consisting of the parameters length, incidence rate and mean distance. As a fourth parameter, intended to characterize the recall value of a detected pattern, the so-called relevance has been calculated from the normalized pattern parameter values as r_Pat = l_Pat,Norm + f_Pat,Norm + (1 − d_Pat,Norm)^2. It is based on the simple assumption that the recall value increases with ascending pattern length and frequency and decreases with ascending temporal distance, whereas the impact of the latter is furthermore reduced by the squaring operation. Basic statistical features like mean, median, standard deviation, minimum and maximum value are calculated for each of the four pattern parameters, as well as the number of patterns related to the overall number of notes of the current track. In total, 63 feature values contain manifold information on the distribution of both rhythmic and melodic patterns within the solo.

2.2.4 Interaction-related features

To describe the interaction between the soloist and the accompanying musicians, two approaches have been followed. By calculating the euclidean distance between bar-wise RSPs, one can determine whether two musicians play rhythmically in unison or use complementary rhythms. The aforementioned chord-tone ratio (see 2.2.1) is furthermore calculated bar-wise to characterize the progression of the harmony-relatedness of the solo melody. For both vectors, both mean and standard deviation are calculated as features.

2.3 Evaluation

The partitioning of the data sets into training and test data generally has been performed class-wise at a proportion of 50% - 50%, randomly for each iteration, whereas a total of 50 iterations were passed through for each evaluation scenario.

2.3.1 Genre classification

For the genre classification experiments, a 6-fold taxonomy has been utilized, consisting of the music genres Swing (SWI), Latin (LAT), Funk (FUN), Blues (BLU), Pop-Rock (POP) and Metal-Hardrock (MHR). Besides instrument-related single classifiers, the efficiency of ensemble classifiers (based on a probabilistic majority decision) was investigated. Two different approaches have been chosen: a common support vector machine (SVM) classifier with a preliminary linear discriminant analysis (LDA), and a nearest-neighbor classifier based on the aforementioned RSP.

LDA-SVM classifier Before the evaluation, all feature vectors are extracted from the solo excerpts of a particular data set. After a feature-wise variance normalization of the training data, LDA has been performed for dimensionality reduction of the feature space to 5 dimensions (since we are dealing with a six-class problem). Support vector machines have been chosen as classifier approach, more precisely C-Support Vector Classification (C-SVC) using the radial basis function (RBF) kernel as described in [1]. Subsequent to variance normalization and dimension reduction, the optimal classifier parameters C and γ are determined using a threefold grid search, and the classifier model is trained afterwards. To evaluate the trained classifier, all feature vectors from the test data passed the same two preliminary steps. Finally, the classifier output was compared with the ground-truth label vector.
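A minimal sketch of such an LDA-SVM pipeline, here built with scikit-learn on synthetic feature vectors (the library choice, the parameter grid and the data are our assumptions; the paper does not specify its implementation):

```python
# Sketch: variance normalization -> LDA to n_classes-1 = 5 dimensions
# -> C-SVC with RBF kernel, with a grid search over C and gamma.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 148))       # 120 excerpts x 148 high-level features
y = rng.integers(0, 6, size=120)      # 6 genre labels (synthetic)

pipe = Pipeline([
    ("scale", StandardScaler()),                       # feature-wise variance normalization
    ("lda", LinearDiscriminantAnalysis(n_components=5)),
    ("svm", SVC(kernel="rbf")),                        # C-SVC, RBF kernel
])
# Threefold grid search for the classifier parameters C and gamma
search = GridSearchCV(pipe, {"svm__C": [1, 10], "svm__gamma": ["scale", 0.1]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

On real data, the fitted `search` would then be applied to held-out test vectors, which pass through the same scaling and LDA steps before classification.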
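The Rhythmical Structure Profile computation of section 2.2.2 can be sketched roughly as follows; the grid handling and the velocity normalization are our reading of the text, not the authors' exact implementation:

```python
def rhythmical_structure_profile(onsets, velocities, bar_len, ks=(4, 6), shift=False):
    """Accumulate normalized note velocities on k equidistant grid points
    per bar (e.g. binary k=4 and ternary k=6 grids). shift=True offsets the
    grid by half a grid step to capture off-beat positions.
    Returns {k: list of k accumulated values}."""
    vmax = max(velocities, default=0) or 1.0
    profile = {k: [0.0] * k for k in ks}
    for t, v in zip(onsets, velocities):
        pos = (t % bar_len) / bar_len            # position within the bar, in [0, 1)
        for k in ks:
            g = pos * k - (0.5 if shift else 0.0)
            idx = int(round(g)) % k              # nearest grid point to the onset
            profile[k][idx] += v / vmax          # sum of normalized velocities
    return profile
```

Stacking the profiles for all investigated values of k, un-shifted and shifted, yields the three-dimensional matrix described in the text.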
RSP classifier The main idea behind this novel approach is to model the rhythmic context provided by the accompanying instruments during the solo part, which is usually specific for each music genre. Therefore, the RSPs of the bass track, the harmony track and the drum track, with separate investigation of the bass drum (B) and snare drum (S) track, are computed globally over the total length of the analyzed excerpt to extract the most frequent rhythms and to minimize the influence of rhythmical breaks and variations. After their computation, the instruments' RSP matrices of each song in the training data set are stored in genre- and instrument-related containers for later use. After applying the same computation step to the songs within the test data set, the euclidean distance between each extracted RSP matrix and all stored matrices related to the same instrument is calculated. Due to rhythmic similarity, the minimum distances to each container can be converted to assignment probabilities to the corresponding genres.

Listening test To compare the results of the two classifier approaches with the ability of human listeners to assign an excerpt from a solo part to a music genre, a listening test has been performed. 25 test persons between 20 and 42 years of age (µ = 26 years) with a relatively high average musical background of µ = 12 years participated. Each test person had to assign 15 excerpts from different solo parts (randomly selected from the symbolic-audio genre test set) to one of the given music genres. The instrumentation of the excerpts has been unified (melody, harmony and bass instrument assigned to a piano sound) to prevent a genre assignment based on commonly appearing instruments (e.g. Metal-Hardrock with electric guitar). Three different instrumentation scenarios have been investigated: the first five pieces consisted only of the melody instrument, the second five pieces of the melody and harmony instrument, and the last five pieces of the complete instrumentation (see 2). A simple metronome has furthermore been added within the first two scenarios to provide a rhythmical orientation to the test persons.

2.3.2 Artist classification

To evaluate the features' discriminative power to identify the artist who is playing a certain solo, two experiments have been performed. For each of them, four musicians playing the same instrument and allocated to related music genres have been chosen, and for each of them 30 excerpts of solos have been collected. In detail, the first set consists of four famous saxophone players (John Coltrane, Dexter Gordon, Charlie Parker and Joshua Redman) and the second one of four well-known electric guitar players (Eric Clapton, Rory Gallagher, Jimi Hendrix and Stevie Ray Vaughan). Since the accompanying musicians are not supposed to have an impact on artist classification, only the features derived from the melody track have been provided to the artist LDA-SVM classifier. Training and evaluation are performed as described for genre classification.

3 Results

The genre classification results are listed in Table 1. Besides the two classifier approaches described in 2.3.1, the results of the listening test related to the three investigated scenarios are presented in the fifth column. The single classifiers of both the LDA-SVM and the RSP approach achieved classification scores of up to 71.7% for MIDI and up to 51.8% for real audio input. Using ensemble-based classification, scores of up to 84.0% and 63.4%, respectively, within the aforementioned 6-fold genre taxonomy were achieved. We assume that partly incomplete or erroneous transcription results are the main reasons for the lower scores for real audio data. The achieved scores for artist classification are 58.8% (electric guitar) and 56.0% (saxophone).

            LDA-SVM           RSP        Human
Input       MIDI    Audio     MIDI       MIDI (re-synth.)
MEL         63.8    44.4      –          37.6
HAR         57.3    45.1      63.7       –
MEL + HAR   71.7    –         –          58.8
BAS         70.1    51.8      66.3       –
DRU         62.2    35.9      61.0 (B)   –
                              47.7 (S)
ALL         –       –         –          63.1
ENS         84.0    63.4      73.2       –

Table 1: Genre classification results in %

4 Summary and future work

In this paper, we presented different high-level features related to the melodic, rhythmic, structural and interaction-related description of improvised solo parts. A simple but common instrumentation model allows an application of these features for a wide range of different music genres. Using the extracted information of all four instrumental tracks by applying an ensemble classifier, classification rates of up to 84.0% within a 6-fold genre taxonomy were achieved. As the listening test's results show, a genre classification based solely on the solo part of a song is a difficult task. Apart from
the dominant solo instrument, the genre assignment is primarily based on the characteristics of the accompanying instruments. Considering that timbre- and instrumentation-related features have not been taken into account here and only the solo part has been analyzed, the results are encouraging for further research within this topic. As the results of the artist classification reveal, describing the way of playing by using high-level features basically allows a discrimination between different performing artists. On the other hand, there still exists a lack of semantic information. To overcome this, additional features describing playing styles in detail as well as specific instrumentation and timbre aspects need to be implemented to derive better results for artist classification. Regardless of the classification task, one has to emphasize the importance of a well-performing transcription system in order to analyze real audio data by the use of high-level features based on score parameters.

References

[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm (last accessed: 10.09.2008), 2001.

[2] P. J. Ponce de León and J. M. Iñesta. Pattern recognition approach for music style identification using shallow statistical descriptors. IEEE Transactions on Systems, Man and Cybernetics - Part C: Applications and Reviews, volume 37, pages 248-257, March 2007.

[3] R. López de Mántaras and J. L. Arcos. AI and music: From composition to expressive performances. AI Magazine, 23:43-57, 2002.

[4] C. Dittmar, K. Dressler, and K. Rosenbauer. A toolbox for automatic transcription of polyphonic music. In Proc. of the Audio Mostly Conference, 2007.

[5] T. Eerola and A. C. North. Expectancy-based model of melodic complexity. In Proc. of the 6th Int. Conf. on Music Perception and Cognition (ICMPC), 2000.

[6] T. Eerola and P. Toiviainen. MIDI Toolbox: MATLAB tools for music research. www.jyu.fi/musica/miditoolbox/ (last accessed: 10.09.2008), University of Jyväskylä, Jyväskylä, Finland, 2004.

[7] J. Erkkilä, O. Lartillot, G. Luck, K. Riikkilä, and P. Toiviainen. Intelligent music systems in music therapy. Music Therapy Today, volume 5, 2004.

[8] F. Gouyon, L. Fabig, and J. Bonada. Rhythmic expressiveness transformations of audio recordings - swing modifications. In Proc. of the 6th Int. Conf. on Digital Audio Effects (DAFX), September 2003.

[9] P. Herrera, V. Sandvold, and F. Gouyon. Percussion-related semantic descriptors of music audio files. In Proc. of the 25th Int. AES Conf., 2004.

[10] J.-L. Hsu, C.-C. Liu, and A. L. P. Chen. Discovering nontrivial repeating patterns in music data. IEEE Transactions on Multimedia, volume 3, pages 311-324, September 2001.

[11] T. Lidy, A. Rauber, A. Pertusa, and J. M. Iñesta. Improving genre classification by combination of audio and symbolic descriptors using a transcription system. In Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR), 2007.

[12] S. T. Madsen and G. Widmer. A complexity-based approach to melody track identification in MIDI files. In Proc. of the Int. Workshop on Artificial Intelligence and Music (MUSIC-AI), January 2007.

[13] C. McKay and I. Fujinaga. Automatic genre classification using large high-level musical feature sets. In Proc. of the Int. Conf. on Music Information Retrieval (ISMIR), pages 525-530, 2004.

[14] C. Saunders, D. R. Hardoon, J. Shawe-Taylor, and G. Widmer. Using string kernels to identify famous performers from their playing style. In Proc. of the 15th European Conference on Machine Learning (ECML), pages 384-395, 2004.

[15] G. Widmer and W. Goebl. Computational models of expressive music performance: The state of the art. Journal of New Music Research, volume 33, pages 203-216, 2004.
The Heart as an Ocean
exploring meaningful interaction with biofeedback
Pieter Coussement, Marc Leman, Nuno Diniz, and Michiel Demey
Abstract. This paper discusses the need to redefine the concept of ‘interaction’ within the context of interactive (audio) installations.
This discussion is based on the realization of ‘The Heart as an Ocean’, a media piece that explores the relationship between auditory
senses and biometric feedback.
of the technological mediator, and the public may have the feeling that it never experienced the artist's intentions. The question is whether it is possible to cope with this problem of technological mediation and learning curves. Are there ways to overcome the inherent limitations of hyperinstruments?

3. Basic concept

In 'The Heart as an Ocean', the goal was to achieve a natural flow of communication without the restrictions of an overly technical interface that could obstruct the intended interaction. The interaction had to work like an affordance. No sophisticated explanations should be necessary to interact, and user feedback should be based on a very strong homogeneity in 'experiencing'.

4. Technical realisation

4.1. Hardware

'The Heart as an Ocean' consisted of seven satellite speakers, one subwoofer and a heart rate sensor hooked up to an Arduino board connected to a MacBook. The seven satellite speakers were spread across a wall spanning eight meters. The subwoofer was discreetly placed in the room. An M-Audio FireWire Audiophile was used in conjunction with the computer's line output to create an aggregated device providing eight line-level outputs. An extra nineteen-inch screen showed the software GUI. The speakers were hidden in order to emphasize the atmosphere of the exhibition space, giving more room to the audio.

4.2. Software

The software is developed using Cycling '74's Max/MSP. On the top level there is a GUI running, which enables real-time HD recording of the interaction. This can be rendered to a DVD and is offered as a multiple.
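The signal chain of 4.1 (heart rate sensor → Arduino → laptop running Max/MSP) implies a stage that maps beats per minute onto sound parameters. A minimal Python sketch of such a mapping stage; the parameter names, the 40-180 BPM range and the linear scaling are illustrative assumptions, not values from the installation:

```python
def bpm_to_sound_params(bpm):
    """Map a heart-rate reading (beats per minute) to two hypothetical
    synthesis parameters: an amplitude in [0.2, 1.0] and a modulation
    rate in Hz (one cycle per heartbeat)."""
    bpm = max(40.0, min(180.0, bpm))   # clamp to a plausible physiological range
    norm = (bpm - 40.0) / 140.0        # 0.0 at the resting floor, 1.0 at the maximum
    amplitude = 0.2 + 0.8 * norm       # never fully silent
    mod_rate = bpm / 60.0              # heartbeats per second
    return amplitude, mod_rate
```

In an installation like this, the mapped values would typically be sent on to the audio engine (here Max/MSP) rather than computed inside it.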
7. Discussion
EarMarkIt: An Audio-Only Game for Mobile Platforms
Abstract. The fields of both mobile gaming and audio-only gaming have rapidly expanded in the last few years. In this work, we
attempt to combine these two fields and present an immersive audio-only game prototype meant for future application in mobile
gaming platforms. The game is based on the scenario of an open-air market, in which the player roams freely by actually walking
around, guided by the auralization played back via headphones. We created a wireless stride- and direction-sensing device with
attached manual input controls to simulate future mobile gaming devices that allow the user to explore an open space while playing a
game.
1. Introduction

Our objective for this project is to create an audio-only game in which a player can navigate a virtual audio space by walking around in the real world. Using the audio cues present in the game, the player, wearing a wireless sensor device, can detect his position relative to the objects and interact with them. Our project aims at developing an audio-only game that combines the accessible and innovative features of immersive audio with the freedom and ubiquity of mobile gaming.

Because they do not rely on any visual feedback, audio-only games are highly accessible for blind or visually impaired people [3, 4]. However, common audio-only games rely solely on buttons as the primary means of interacting with the game [1]. What differentiates our game is the integration of a wireless stride and direction sensor box that allows the player to interact with the game in a more pleasurable way by taking full advantage of his or her physical and perceptual capabilities.

2. Game Outline

The basic premise of the presented game is that of an open-air food market in which the player navigates through a corridor filled with different sounds of vendors, musicians, and other typical market events, see Fig. 1. In the game, the player needs to purchase a certain number of food items and then find a tram station before the tram leaves the station. The player is allotted a purse of money, which he can use to pay for the food items. In the course of the game, there is also a thief who tries to steal food that the player has collected. Points are awarded based on the speed with which the player completes the entire set of purchases and the number of food items remaining in his or her basket.

Figure 1: The user walks across the market, buying food and trying to catch the tram in the end.

2.1. Gameplay

The game is based on a number of auditory cues that help the player navigate the audio space of the market and provide information about the food items that need to be collected. "Environmental" sounds are original self-recorded or freely available recordings of typical markets that are spaced along the one-dimensional playing corridor to simulate a market. "Interaction" sounds are recordings that involve the gameplay elements, including the food vendors, the tram and the thief. Recordings of the vendors include those in which they are heard trying to sell one of four single food items: apples, sugar, donuts, and fish. The tram sound is that of a tram passing a station. The thief sound is made up of footsteps that gradually
become audible if the user does not move for a specified amount flexible off-the-shelf components in both hardware and
of time. software. The use of these well-known tools allows us to set the
focus of development on the game concept and evaluation of the
The primary difficulty in the game is for the player to game, rather than having to deal with platform-specific
distinguish between the environmental and interaction sounds, problems that might arise when developing for a wide range of
and to recognize when a vendor is selling a certain food product mobile devices (also see [5] for another example of rapid audio
for an acceptable price. When the player encounters a vendor prototyping using Max/MSP ). As has been shown in previous
selling food, he or she can make a purchase by pushing a audio-for-gaming implementations [6], this rapid prototyping
momentary switch button on the wireless sensor and input approach also makes it possible to evaluate the use of
device that correlates to that type of food. If the player presses technologies that are not yet implemented in the current mobile
the wrong type of food button, the vendor becomes “upset” that devices for the mass market.
the player requested food that is not being sold at that location,
and that food vendor disappears, making that instance of food no For the current prototype of the game, we chose sensors that are
longer available for the player to purchase at this place. If the already available in some modern mobile phones today (such as
player requests the correct type of food, the food item is added Nokia 6210 Navigator, Apple iPhone and similar). These phones
to his basket, the price of the food is deducted from his purse, could be a near-ubiquitous platform to control, or given enough
and that specific vendor becomes silent. Pressing a query button processing power, even run the entire game. As the gaming
in conjunction with one of the food buttons allows the player to hardware is thus already available to many people, the game
check the remaining food that he must still purchase. The player could be quickly deployed to any mobile phone that provides the
must seek out the best price for each food item so that the required technical capabilities. The vibration motor of the phone
money in the purse is sufficient for all of the items required. could also be used to augment the audible feedback by basic
haptic sensation.
The second difficulty in the game is that of limited time. The
player must find the tram station and arrive in time. Since the In our prototype, the game is controlled by a small, battery-
player is required to search the market for the best prices of powered, wireless device that the user wears on the belt, a
food, he cannot simply buy the first item of each type of food headband or similar. The device contains different sensors that
that he encounters. He or she must remember where certain allow us to determine the current heading of the player, detect
items were sold and possibly return to them. strides, and button presses. To achieve this, we use a three-axis
accelerometer, a magnetometer (“compass sensor”) a Bluetooth
The third difficulty in the game is the thief, who continually modem and pushbuttons tied together by an Arduino
follows the player through the corridor. If the player does not microcontroller board, see Fig. 2. The compass is used to
take a certain number of steps within a specified time, the thief determine the players heading and needs to be calibrated once at
advances towards the player and by random selection steals one the beginning of the game. As the player walks, the
of the food items from the basket. After a food item is stolen, the accelerometer data is used to detect footsteps [7]. Each step the
player must navigate again to try to find a remaining vendor - player makes in the real world also changes his position in the
which do not reappear after selling the player an item - in order game. By integrating the steps with the current heading of the
to meet the game requirements. The thief element is included in player, it is possible to move freely inside the game in two
the game to keep the player active and moving. dimensions. The soundscape is adapted to the player’s current
position by changing the stereo field and using some basic
The game ends when either the player arrives at the tram station psychoacoustic effects to simulate sounds coming from different
or the tram leaves the station before. A score is calculated based directions.
on how much money the user has left, how much time is left
before the tram leaves, and how many items from the required
list the user has collected in his basket.
2.2. Navigation

Similar to a real market layout, the auditory cues are placed along a corridor with a set beginning and end. The player starts at the beginning of the corridor and navigates through it using the wearable wireless sensor device. Except for the thief sound, each auditory cue occupies a certain portion of the corridor, and the user can hear the cues pass by as he navigates through. The cues are panned either to the right or to the left of the player’s position. As the player passes them, the localization of the auditory cues in the stereo field changes based on the user’s position relative to them. This, coupled with a high-shelf filter with a variable cutoff frequency, simulates the effect of the sounds traveling around the user. When the player changes his heading (as measured by the direction sensor), the stereo field adjusts accordingly so that the sounds rotate relative to the user’s orientation.

3. Prototyping Process and Game Hardware

Our audio-only game is ultimately meant to be played on mobile devices such as phones, PDAs, or even MP3 players. However, for the development of the current prototype we used …

Figure 2: To test the prototype software, we developed a wireless handheld controller and a sensor

To handle the incoming raw values from the sensors more flexibly on the PC that computes the audio, we developed a slim serial-to-OSC proxy. The proxy requests the raw sensor readings from the microcontroller. Inside the proxy, this data can be filtered and interpreted to generate OSC messages [8]. These messages can be streamed to any network address, so the different modules of the game could also run on different computers if necessary. This approach has proven to be a robust and easy way to exchange data between the different modules of the prototype, while keeping its architecture as flexible as possible.
EarMarkIt: An Audio-Only Game for Mobile Platforms
The OSC stream from the proxy is sent to the actual game software: a Max/MSP patch processes the OSC messages and handles the game logic and sound output. Based on the user’s current orientation and estimated position, the game logic generates the audio stream. To play the game, the player does not have to be aware of the computer: it remains in the background and merely processes the user input from the wireless device to generate the audio stream, which is then sent to a set of wireless headphones.

As the interaction with the game is not bound to a specific location, the player can use any open space to play, yielding a user experience very similar to that of a mobile device. While the prototype simulates the experience of using a mobile device, it remains open to rapid extensions and changes to both hardware and software during development. Once the development of the prototype is finished, the game can be ported to any sufficiently equipped mobile device.

In addition, the absence of a graphical user interface makes the game accessible to the visually impaired.

We will continue to develop our game to include improved inventory-checking controls and more cohesive sets of sounds. We will also refine our wireless sensor algorithms to give more accurate readings, leading to improved game navigation.

The corridor-based game stage lends itself well to being played on an actual train or tram platform, where the user might be able to synchronize the end of the game with the arrival of a real-life tram or train. The technology demonstrated here may also fit into interactive installations such as those found in museums or exhibits.

The sound output is optimized for stereo headphones, as they are a ubiquitous playback device for mobile hardware. When the game is played on a personal computer, surround output would deepen the immersion.