
audio mostly

A CONFERENCE ON INTERACTION WITH SOUND


PROCEEDINGS

Proceedings of the
Audio Mostly Conference
- A Conference on
Interaction with Sound

October 22–23, 2008


Piteå, Sweden

Contents
Contributors' CVs 1

Sound and Immersion in the First-Person Shooter: Mixed Measurement of the Player's Sonic Experience. Mark Grimshaw, Craig A. Lindley & Lennart Nacke 9

Sound and the diegesis in survival-horror games. Daniel Kromand 16

Psychologically Motivated Techniques for Emotional Sound in Computer Games. Inger Ekman 20

Interactive Sonification of Grid-based Games. Louise Valgerður Nickerson & Thomas Hermann 27

Using audio aids to augment games to be playable for blind people. David C. Moffat & David Carr 35

BeoWulf field test paper. Mats Liljedahl & Nigel Papworth 43

Control of Sound Environment using Genetic Algorithms. Scott Beveridge & Don Knox 50

Genie in a Bottle: Object-Sound Reconfigurations for Interactive Commodities. Daniel Hug 56

Saturday Night or Fever? Context Aware Music Playlists. Stuart Cunningham, Stephen Caulder & Vic Grout 64

A Musical Instrument based on 3D Data and Volume Sonification Techniques. Lars Stockmann, Axel Berndt & Niklas Röber 72

Same but different - Composing for Interactivity. Anders-Petter Andersson & Birgitta Cappelen 80

The Harmony Pad - A new creative tool for analyzing, generating and teaching tonal music. Gabriel Gatzsche, Markus Mehnert, David Gatzsche & K. Brandenburg 86

Sonic interactions with hand clap sounds. Antti Jylhä & Cumhur Erkut 93

Toward a Salience Model for Interactive Audiovisual Applications of Moderate Complexity. Ulrich Reiter 101

An Embedded Audio-Based Vehicle Classification Based on Two-level F-ratio. JiQing Han 108

dots: an Audio Entertainment Installation using Visual and Spatialbased Interaction. Andreas Floros, Nikolaos Grigoriou, Nikolaos Moustakas & Nikolaos Kanellopoulos 112

AllThatSounds: Associative Semantic Indexing of Audio Data. Hannes Raffaseder, Matthias Husinsky & Julian Rubisch 117

IMPROVe: The Mobile Phone as a Medium for Heightened Sonic Perception. Richard Widerberg & Zeenath Hasan 121

Audio Interface as a Device for Physical Computing. Kazuhiro Jo 123

Automatic genre and artist classification by analyzing improvised solo parts from musical recordings. Jakob Abesser, Christian Dittmar & Holger Grossmann 127

The Heart as an Ocean: Exploring meaningful interaction with biofeedback. Pieter Coussement, Marc Leman, Nuno Diniz & Michiel Demey 132

EarMarkit: An Audio-Only Game for Mobile Platforms. David Black, Kristian Gohlke & Jörn Loviscach 135
Contributors' CVs

Mark Grimshaw
Mark Grimshaw was educated in Kenya, South Africa, England and New Zealand where he
gained his PhD. He is currently the Reader in Creative Technologies at the University of
Bolton, England, and his most recent work 'The Acoustic Ecology of the First-Person Shooter'
was published by VDM in May 2008.

Daniel Kromand
Daniel Kromand holds a bachelor's degree in Media Studies from the University of Copenhagen
and is currently attending the IT University of Copenhagen for a master's degree in Media
Technology and Games. He has previously presented research in avatar theory at the DiGRA
2007 conference. Daniel Kromand also works as a project manager in Copenhagen.

Inger Ekman
Inger Ekman researches sound design for interactive media and virtual environments. She
received an MSc in Computer Science from the University of Tampere in 2003 and is currently
working on her doctoral thesis on game sound design. She is particularly interested in non-
musical sound effects and ambience and the ways in which these influence player emotion.

Louise Nickerson
Louise Nickerson is a PhD candidate at Queen Mary, University of London (QMUL) in the
Department of Computer Science. She holds an MSc in Computer Science from QMUL
(2002) and a BA in French and Italian Literature from the University of Virginia (1998). She
is part of the Interaction, Media and Communication research group which focuses on human-
human interaction with a smattering of people working on audio. Her interest stems from
accessibility and the belief that soon mobile computing will be the norm. Her research focuses
on the development and definition of auditory overviews to make auditory interfaces more
approachable.

When not being a PhD student, Louise can be found nurturing her interest in foreign
languages or on the Thames with her crew-mates from the Sons of the Thames Rowing Club.

Thomas Hermann
Dr. Thomas Hermann studied physics at Bielefeld University. From 1998 to 2001 he was a
member of the interdisciplinary Graduate Program "Task-oriented Communication". He
started the research on sonification and auditory display in the Neuroinformatics Group and
received a Ph.D. in Computer Science in 2002 from Bielefeld University (thesis: Sonification
for Exploratory Data Analysis). After research stays at Bell Labs (NJ, USA, 2000) and
GIST (Glasgow University, UK, 2004), he is currently an assistant professor (German C1
position) in the Neuroinformatics Group where he coordinates and conducts research on
sonification, human-computer interaction and cognitive interaction technology.

In his research, Thomas Hermann is developing techniques for interactive multimodal data
representation and exploratory analysis of high-dimensional data with a particular focus on
sonification, novel interactive data mining techniques and human-computer interaction. His

research topics furthermore include Tangible Computing, Ambient Information Systems,
Gestural Interactions and Augmented Reality.

David Moffat
David Moffat is a Lecturer in Computing at Glasgow Caledonian University.
His research interests are mainly in emotion and affective computing, often as applied to
video games, but he also has wider interests in other fields of AI and cognitive science.

David Carr
David Carr is a Master's student in Advanced Computing at Glasgow Caledonian University.
He is one of the first appointed "scholars" in the School of Engineering and Computing, for
his particular interests in AI programming for games.

Mats Liljedahl
Since the mid-1980s, Mats Liljedahl has been involved with questions concerning ICT as a tool
for learning and artistic expression. Between 1988 and 1996, Mats worked as a music teacher
and took part in projects dealing with integrating ICT as an active tool for music teachers.
Between 1996 and 2000 he worked as a teacher educator at Ingesund University College of
Music in Arvika, where he also had the opportunity to work with several development
projects on ICT, music and learning. Since August 2000, Mats has worked at the Interactive
Institute and has taken part in several projects dealing with music, learning and ICT.

Nigel Papworth
Nigel Papworth trained at the London College of Printing in Graphic Design after some years
working in advertising in London, a career he started at
17 before completing his education. After moving to Sweden in 1985, he co-founded one
of Sweden’s first major games companies, Daydream, where he worked as lead game
designer for 9 years. Two of his designs ‘Safecracker’ and ‘Traitors Gate’ enjoyed
considerable success, especially in the USA. He has worked and lectured on the integration of
AI, dialogue, game states and behavioral simulation systems utilizing chaos theory in
computer games. After joining the Interactive Institute, he has concentrated on the role audio
can play in driving gameplay and conveying game content and status to the player.
Nigel is married and has three children aged 21 to 14.

Scott Beveridge
Scott Beveridge, B.Sc. (1st Class Honours) in Audio Technology; PhD in Audio Technology.
Publications
D. Knox, G. Cassidy, S. Beveridge, R. MacDonald. Music Emotion Classification by Audio
Signal Analysis: Analysis of Self-selected Music During Gameplay. 10th ICMPC, Sapporo,
Japan, August 25–29, 2008.
D. Knox, S. Beveridge, R. MacDonald. Emotion Classification in Contemporary Music.
DMRN+2: Digital Music Research Network One-day Workshop 2007, Queen Mary,
University of London, December 2007.

Research Interests
Algorithmic composition and music generation using non-deterministic source data. In
particular the sonification of socio-spatial behaviour in large sensate environments. Current
research investigates the intersection between Human Computer Interaction and Music
Information Retrieval in a framework which uses models of emotion to generate socially
reflexive audio material.

Daniel Hug
Daniel Hug has a background in music, sound design, interaction design and project
management in applied research. Since 1999 he has investigated sound and interaction design
related questions through installations, design works and theoretical publications. Since 2005,
he has been teaching sound studies and sound design for interactive media and games at the
Interaction Design department of the Zurich University of the Arts, Switzerland. Hug pursues
a PhD on sound design for interactive commodities at the University of the Arts and Industrial
Design of Linz, Austria, in close exchange with the European COST-initiative "Sonic
Interaction Design".

Stuart Cunningham
Stuart Cunningham was awarded the BSc degree in Computer Networks in 2001, and in 2003
was awarded the MSc Multimedia Communications degree with Distinction, both from the
University of Paisley (UK). He is a Chartered IT Professional (CITP), member of the British
Computer Society (BCS), the Institution of Engineering & Technology (IET) and the Institute
of Electrical and Electronics Engineers (IEEE). Stuart was also a member of the MPEG Music
Notation Standards (MPEG-SMR) working group.
Stuart is currently a Senior Lecturer in Computing and a PhD student at Glyndŵr University
in the UK, studying under the supervision of Professor Vic Grout. His research interests
include: measurement of audio similarity, audio compression, image sonification, and musical
content & context analysis.

Lars Stockmann
Lars Stockmann has just received his Diplom (Master's Thesis) in Computational Visualistics
from the Otto-von-Guericke University of Magdeburg. His current research interests are
interactive acoustic environments. This includes interactive sonification techniques as well as
computer-based instruments for live performances and computer games.
Previous studies include API-design for audio applications on mobile devices, stereo vision
and rendering (raytracing and real-time CG). He is currently working as a developer and
programmer at a company that is in the process of being formed.

Anders-Petter Andersson
Anders-Petter Andersson is a musician, composer and doctoral researcher in Musicology at
Göteborg University / Malmö University / Interactive Institute. The working title of his PhD
project is "Interactive Music Composition". He tries to answer the question of how one can
compose musically satisfying sound and music for games and interactive applications. In
interactive music, the listening role is complex, as the listener participates and alters the
composition. With knowledge from musical traditions such as improvisation he develops
composition methods for audio-tactile and physical environments such as Strainings, Do-Be-
DJ and Mufi.

Anders-Petter is co-ordinator of a new education and BA programme in Interactive Sound
Design at Kristianstad University, combining music and the computer within mobile services,
game and music industry. Join the Interactive Sound Design community at the website:
www.interactivesound.org

Birgitta Cappelen
Birgitta Cappelen is an industrial designer (SID), interaction designer and associate professor
at the Oslo School of Architecture and Design (AHO). She is also working on her PhD at Malmö
University (K3). The working title of her PhD project is "Co-create and Re-create -
rethinking Industrial Design in the digital age." In her work she tries to answer the question of
what meaningful design can be in our time after postmodernism and with the computer as a
material requirement. Instead of designing beautiful and user-friendly objects, she suggests
designing fields of possibilities with a high degree of inscription and potential of circulation.
She calls this design quality "multivalence".
www.musicalfieldsforever.com

Gabriel Gatzsche
Gabriel Gatzsche studied Media Technology at the Technische Universität Ilmenau and
received his Diploma in 2003. After that he joined the Fraunhofer IDMT Ilmenau where he
works on the MPEG-4 based storage and transmission of object oriented WFS audio scenes,
the software development of spatial audio reproduction systems for large venues and the
development and standardization of data formats for the Digital Cinema. Since 2005 he has
worked on a doctoral thesis which deals with the analysis and synthesis of musical audio
signals. In this context he developed the HarmonyPad, which is presented at
Audio Mostly 2008.

Antti Jylhä
Antti Jylhä was born in Helsinki, Finland, in 1981. He received the M.Sc. degree in
telecommunications from Helsinki University of Technology (TKK), Espoo, Finland, in 2007.
He is currently working at the Department of Signal Processing and Acoustics at the same
institute as a researcher and pursuing a doctoral degree in acoustics and audio signal
processing. His current research interests include auditory and multi-modal interfaces in
human-computer interaction, and modeling and analysis of multiple interacting sound
sources. He is also instructing student projects related to automatic sports monitoring, and is
involved in the activities of the Helsinki Mobile Phone Orchestra.

Cumhur Erkut
Cumhur Erkut was born in Istanbul, Turkey, in 1969. He received the B.Sc.
and M.Sc. degrees in electronics and communication engineering from the Yildiz
Technical University, Istanbul, Turkey, in 1994 and 1997, respectively, and the Dr.Sc.(Tech.)
degree in electrical engineering from the Helsinki University of Technology (TKK), Espoo,
Finland, in 2002. Between 1998 and 2002, he worked as a researcher, and between 2002 and
2007 as a postdoctoral researcher at the Laboratory of Acoustics and Audio Signal Processing
of the TKK, and contributed to various national and international research projects. From 2007
onwards he has been an Academy Research Fellow; he conducts his research project
Schema-SID [Academy of Finland, 120583] and contributes to the COST IC0601 Action
"Sonic Interaction Design" (SID). His primary research interests are sonic interaction design,
and physics-based sound synthesis and control.

Ulrich Reiter
Ulrich Reiter studied Electrical Engineering at RWTH Aachen, from where he received a
Diplom-Ingenieur degree (equ. M.Sc.) in 1999. He has been working at the Institute of Media
Technology (director Prof. Dr. Karlheinz Brandenburg) at Technische Universitaet Ilmenau as
a researcher from 1999 to 2007, where he also gave lectures on Virtual and Applied
Acoustics and on Recording Studio Technology. He did his Ph.D. on perceived quality in
interactive audiovisual application systems of moderate complexity. Since 2008 he has been working
with Prof. Peter Svensson at the Centre for Quantifiable Quality of Service in Communication
Systems (Q2S), a Norwegian Centre of Excellence, at the Norwegian University of Science
and Technology in Trondheim. His current research interests include multimodal perception,
virtual acoustics and interactive audiovisual application systems using the MPEG-4 standard.

JiQing Han
JiQing Han received his Master's and Doctoral degrees from the Harbin Institute of
Technology, China, in 1990 and 1998, respectively. He is a Professor in the School of
Computer Science and Technology, Harbin Institute of Technology, China. His research interests
include speech recognition and synthesis, audio signal processing, and pattern recognition.

Andreas Floros
Andreas Floros was born in Drama, Greece in 1973. In 1996 he received
his engineering degree from the department of electrical and computer
engineering, University of Patras, and in 2001 his Ph.D. degree from the
same department. His research was mainly focused on digital audio
signal processing and conversion techniques for all-digital power
amplification methods. He was also involved in research in the area of
acoustics. In 2001, he joined ATMEL Multimedia and Communications,
working on projects related to digital audio delivery over PANs and WLANs, Quality-of-
Service, mesh networking, wireless VoIP technologies and lately with audio encoding and
compression implementations in embedded processors. Since 2005, he has been a visiting assistant
professor at the Department of Audio Visual Arts, Ionian University. Dr. Floros is a member
of the Audio Engineering Society, the Hellenic Institute of Acoustics and the Technical
Chamber of Greece.

Nikos Moustakas
Nikos Moustakas was born in Athens, Greece. He graduated from high
school in 2004 and is currently an undergraduate student at the Ionian
University in the Department of Audio-Visual Arts in Corfu. He
participated in 3 photo exhibitions that his department organized as well
as Audio-Video festivals in the last two years. Nikos has attended the
“Miden Festival” in Kalamata where he presented his own work (some
videos). In addition, he has some experience on the conservation and
restoration of old photos. He is also an operator in the computer
laboratory of his Department. His main focus is on audio-video interactive installations.

Nikolas Grigoriou
Nikolas Grigoriou was born in Heraklion, Crete, Greece. He graduated
from high school in 2004 and is currently an undergraduate student at
the Ionian University in the Department of Audio-Visual Arts in Corfu.
He participated in 3 photo exhibitions that his department organized as
well as Audio-Video festivals in the last two years. Nikolas has attended
the “Miden Festival” in Kalamata where he presented his own work
(some videos). In addition, he has some working experience from a
practicum at a TV studio in his hometown, and this year he won
first prize at the student-oriented conference “Student Eureka 2008”. His
main focus is on audio-video interactive installations, surround effects,
and sound in space.

Matthias Husinsky
Matthias Husinsky was born in 1982. He studied Media Engineering and Media Design at the
University of Applied Sciences Hagenberg, Austria from 2000 to 2004 and received his
Master's degree with a thesis on a guitar tuner for mobile phones. Since 2005 he has
worked as a project assistant in several research projects at the University of Applied Sciences
St. Pölten, Austria, mainly focusing on modern audio technologies in a networked society. He
is now also a PhD candidate at the Johannes Kepler University in Linz, Austria, working in
the field of MIR (music information retrieval).

Julian Rubisch
Julian Rubisch was born in 1981. He studied Telecommunications and Media at the
University of Applied Sciences St. Pölten, Austria from 2004 to 2007 and received his
Bachelor of Science degree for a thesis on psychoacoustic models within audio information
retrieval. He is currently employed as a research assistant at the University of Applied Sciences,
St. Pölten, where he is also working on his Master's thesis on generative music for media
applications.

Richard Widerberg
Richard Widerberg is a sound artist and new media designer living in Göteborg, Sweden. He
has a background working as a new media designer, organizing events, making radio, doing
sonic works, and playing music.
His main focus during the last years has been to investigate the many dimensions of sound
and listening. Also related to his sonic works are works that deal with location, mobility,
interaction and social exchange. He has a strong interest in new forms of copyright and
distribution of both music and sound as well as open-source development. He is also an active
musician.
http://www.riwid.net
rwiderberg@gmail.com

Zeenath Hasan
Zeenath Hasan involves herself in the people-centered practice of design research to exercise
the potential of media technologies for socially appropriate intervention. The studio from
where she conducts her doctoral research on the role of media in a democracy is located in the
School of Arts and Communication, Malmö University. She was born in Kolkata and
currently resides in Malmö. Her career as a new media practitioner was sparked by her training
in the MS Communications programme at Manipal. Her footprints crossed over to Finland
during her training in MA New Media Studies from the University of Art and Design
Helsinki. She has worn the labels of Information Architect, Interface Designer, Design
Researcher, Cultural Producer, and Media Artist and Researcher. She also runs a 1-person
design research firm.
http://www.zeeniac.net
zeenath.hasan@gmail.com

Kazuhiro Jo
Kazuhiro Jo <http://jo.swo.jp/> is a Research Fellow at RCAST, University of Tokyo and
Visiting Research Fellow at Digital Media group, Culture Lab, Newcastle University. He is
also a member of The SINE WAVE ORCHESTRA <http://swo.jp/>, a member of the
Monalisa project <http://monalisa-au.org/plog/>, a member of AEO, and a co-organizer of
dorkbot Tokyo <http://dorkbot.org/dorkbottokyo/>.

Jakob Abesser
Dipl.-Ing. Jakob Abesser studied computer engineering with specialization in
telecommunication and measurement engineering at the Ilmenau University of Technology
from 2002 to 2008. In his diploma thesis he investigated the characterization of instrumental
solo parts in music pieces by means of musical high-level features as well as their application
to genre and artist classification. After his graduation he joined the Fraunhofer IDMT
and is now working in the Metadata department as a Ph.D. student. His research interests are
the automatic transcription of stringed instruments like bass and guitar as well as performance
analysis and artist classification.

Christian Dittmar
Dipl.-Ing. (FH) Christian Dittmar studied electrical engineering with specialization in digital
media technology at the University of Applied Sciences Jena from 1998 to 2002. In his
diploma thesis, he investigated Independent Subspace Analysis as a means of audio signal
analysis. Subsequent to his successful graduation he joined IDMT in 2003 to work at the
Metadata department. He has contributed to a number of scientific papers in the field of music
information retrieval and automatic transcription. In 2005 he participated in the MIREX
evaluation category automatic drum detection. Since late 2006 he has been Semantic Metadata
Systems group manager at Fraunhofer IDMT.

Holger Grossmann
Dipl.-Ing. Holger Grossmann studied electrical engineering at the Ilmenau University of
Technology from 1986 until 1993. Afterwards he worked as a software developer in the fields
of electronic musical instruments as well as client-/server e-business systems. In 2001 he
joined the Fraunhofer-Arbeitsgruppe für Elektronische Medientechnologie AEMT, Ilmenau
where he worked as an engineer and researcher in the Metadata department. During the
following 3 years his research focus was the development and realisation of the music
identification system AudioID which was standardized in MPEG-7 as the
AudioSignatureType Description Schema. The Fraunhofer-Arbeitsgruppe became an
independent institute in 2004. Since then Holger Grossmann has been head of the Metadata
department at IDMT, focused on research in the fields of automated semantic media
annotation and multimedia search. He is a co-author of scientific papers and has been an invited
speaker at international conferences. In 2007 he chaired the Audio Mostly conference in
Ilmenau.

Pieter Coussement
Pieter Coussement is a new media artist, performer and composer based out of Ghent,
Belgium. His main focus is on the function and position of the human body within interactive
art with sound as the main mediator. After several years of teaching new media, he started his
PhD at the Institute for Psychoacoustics and Electronic Music (IPEM) at the University of
Ghent, where he furthers his artistic research on interactive art.
Education:
PhD 2008 Institute for Psychoacoustics and Electronic Music –
IPEM, University Ghent, Belgium
drs. Audiovisual arts, (re)experiencing interactive art
MA 2003 Royal Academie of Fine Arts, Ghent, Belgium
Master of Fine Arts, Mixed Media
BA 1999 Royal Academie of Fine Arts, Ghent, Belgium
Three Dimensional Art, Multi Media
Professional Experience:
2007 PHL department of visual arts, Hasselt, Belgium
Workshop teacher Interaction with audiovisual media
2007 Howest PIH, Kortrijk, Belgium
Workshop teacher Creative Coding: Realtime user responsive 3D Visuals in
openGL
2004-2008 City Academy of Arts, Ostend, Belgium
Teacher Digital Visual Design
2003-2008 City Academy of Arts, Ostend, Belgium
Teacher Experiment Digital Design

David Black
David Black is a master's student in the Digital Media program at the Hochschule Bremen. As
a graduate of the University of Southern California School of Music and the Royal
Conservatory of Den Haag's Institute of Sonology, he is involved with interactive music
pieces for electronics, dance, percussion, electroacoustic composition, guided walking audio
tours, and auditory display.

Kristian Gohlke
Kristian Gohlke is currently a master's student at the Hochschule Bremen. His interests circle
around physical-computing, electronics and programming as well as human-computer
interaction. He also works as a student tutor at Bremen University and the University for the
Arts in Bremen, where he teaches physical computing to children and artists, and does
consulting on related art and design projects. He lives at: http://krx.at/

Jörn Loviscach
Joern Loviscach is a professor at Hochschule Bremen. His major interests lie in computer
graphics, human-computer interaction, and audio and music computing. He is a regular
contributor to conferences such as SIGGRAPH, Eurographics and the AES Convention. In
addition, he has published numerous chapters in book series such as Game Programming
Gems and ShaderX Programming. He lives at: http://www.l7h.cn/

Sound and Immersion in the First-Person Shooter: Mixed Measurement
of the Player's Sonic Experience

Mark Grimshaw (mark@wikindx.com)


Craig A. Lindley (craig.lindley@bth.se)
Lennart Nacke (Lennart.Nacke@bth.se)

Abstract. Player immersion is the holy grail of computer game designers particularly in environments such as those found in first-
person shooters. However, little is understood about the processes of immersion and much is assumed. This is certainly the case
with sound and its immersive potential. Some theoretical work explores this sonic relationship but little experimental data exist to
either confirm or invalidate existing theories and assumptions.

This paper summarizes and reports on the results of a preliminary psychophysiological experiment to measure human arousal and
valence in the context of sound and immersion in first-person shooter computer games. It is conducted in the context of a larger set of
psychophysiological investigations assessing the nature of the player experience and is the first in a series of systematic experiments
investigating the player's relationship to sound in the genre. In addition to answering questionnaires, participants were required to
play a bespoke Half-Life 2 level whilst being measured with electroencephalography, electrocardiography, electromyography,
galvanic skin response and eye tracking equipment. We hypothesize that subjective responses correlated with objective
measurements provide a more accurate assessment of the player's physical arousal and emotional valence and that changes in these
factors may be mapped to subjective states of immersion in first-person shooter computer games.

Introduction

An increasing amount of research in games studies and games technology deals with presence in virtual environments and, of more interest for the purposes of our research, player immersion in digital game worlds. Player immersion may be said to be the holy grail of digital game design particularly in the types of game environment found in the First-Person Shooter (FPS) genre. This type of game is typically exemplified by the run 'n' gun sub-genre (games such as the Doom, [1] Quake [2] and Half-Life [3] series – even though the latter has an unusually strong focus on narrative) which is visually characterized by a hand or pair of hands holding a weapon on screen and conceptually characterized as 'the hunter and the hunted'. The intention is that the player identifies with the game character whose hands, the player's virtual prostheses, are seen receding into the game environment. [4] This identification with the character, and the use of hands only, provides a first-person perspective with which it is proposed that, visually, player immersion in the game world derives from the player 'becoming' the game character, in the sense of the player having the experience of acting 'within' the game world. This sense of immersion is strengthened further through the player's actions having a non-trivial effect on the environment and game play. [5] For example, operating the game interface (the computer mouse or keyboard, for instance) may cause the image on the screen to change: the weapon may recoil and flash or an on-screen animation indicates the weapon reloading. The player perceives movement through the 3-dimensional world of the game because visual artifacts rotate, magnify and diminish or appear and disappear on the screen.

There are a variety of definitions of immersion in computer games. Kearney and Pivec claim that immersion provides the motivation or flow required for the player to be repeatedly engaged with the game [6] while Ermi and Mäyrä, paraphrasing Pine and Gilmore, [7] state that "immersion means becoming physically or virtually a part of the experience itself". They also distinguish between different forms of immersion: sensory, imaginative and challenge-based immersion. [8] Murray suggests that immersion is a participatory activity [9] and McMahan provides three conditions for immersion: "[T]he user's expectations of the game or environment must match the environment's conventions fairly closely […] the user's actions must have a non-trivial impact on the environment [and] the conventions of the world must be consistent". [5] For McMahan, two factors influencing immersion are the level of social realism and the level of perceptual realism. Garcia claims that "[i]n the most immersing environments reminders of the structural level of the game are gone" [10] while Carr provides two categories of immersion: perceptual (when the participant's senses are monopolized by the experience) and psychological (an imaginative or mental absorption through which the participant becomes engrossed in the experience). [11]

In the context of this paper, other work mentions or focuses upon the role of sound in facilitating player immersion in the game world. Laurel makes the case that the "[t]ight linkage between visual, kinaesthetic, and auditory modalities" is key to the sense of immersion. [12] Jørgensen believes that players are immersed in an auditory world through the use of realistic audio samples [13] while Murphy and Pitt make similar claims for spatial sound. [14] Autopoiesis and acoustic ecologies have been used to model player immersion through sound in FPS games [4] and Grimshaw and Schott provide a range of conceptual tools with which to analyze the immersive functions of game sound. [15] Some authors make a distinction between modes of immersion, particularly where immersion is enabled through the spatial qualities of the sound: Stockburger implies that the player is physically immersed in the game sound [16] and this is amplified and explicitly stated by Grimshaw. [17] This physical sonic immersion has also been observed for film audiences and the concept transferred to the design of sound for FPS games and simulators. [18]

Most of the work cited above is theoretical and, where authors describe the immersive potential of sound in computer games, there is the assumption that sounds, more sounds, realistic sounds, spatial sounds all inexorably and incontrovertibly equate to greater player immersion. This may well be the case but the assumption lacks thorough evidence to support the various concepts of immersion outlined above. Attempts to provide evidence include Jørgensen, who uses player surveys, [13] and Shilling, Zyda and Wardynski. [18] The latter paper is of particular interest to this work because it not only explores the use of sound in an FPS game/simulation (America's Army [19]) but it also attempts to objectively measure the player's emotional arousal through the use of temperature, electrodermal response and heart-rate measurements. However, although the authors state that "emotional arousal has a positive impact on [the] sense of immersion in virtual environments" and that the precise conjunction of a sound and an action seen on the screen is "crucial for immersing the player", the paper is a description of their attempts to introduce, and amplify, emotion within the game environment through sound rather than an attempt to effect and measure immersion. The link between emotional arousal and immersion is assumed and so the relationship between sound and player immersion remains undefined in objective terms.

Emotions are a central part of the game experience, motivating the conscious cognitive judgments and decisions made during gameplay. Psychophysiological investigations suggest that at least some emotional states may be quantitatively characterized via physiological measurements. Specific types of measurement of different physiological responses (such as GSR, EMG, ECG and EEG, as described below) are not by themselves reliable indicators of well-characterized feelings; [20, 21] a de rigueur cross-correlation of all measurements is crucial to identify the emotional meaning of different patterns in the responses. Moreover, the often described many-to-one relation between psychological processing and physiological response [20] allows for psychophysiological measures to be linked to a number of psychological structures (for example, attention, emotion, information processing). Using a response profile for a set of physiological variables enables researchers to go into more detail with their analysis and allows a better correlation of response profile and psychological event. [21] The crucial issue here is the correlation of patterns of measurement characteristics for a set of different measures with subjective characterizations of experience such as emotion and feelings (for example, the feeling of immersion in gameplay).

Facial electromyography (EMG) is a direct measure of electrical activity involved in facial muscle contractions; EMG provides information on emotional expression via facial muscle activation (even though a facial expression may not be visually observable) and can be considered as a useful external measure for hedonic valence (that is, degree of pleasure/displeasure). [22] Positive emotions are indexed by high activity at the zygomaticus major (cheek muscle) and orbicularis oculi (periocular muscle) regions. In contrast to this, negative emotions are associated with high activity at the corrugator supercilii (brow muscle) regions. This makes facial EMG suitable for mapping emotions to the valence dimension in the two-dimensional space described in Lang's dimensional theory of emotion. [22] This dimension reflects the degree of pleasantness of an affective experience. The other dimension, the arousal dimension, depicts the activation level linked to an emotionally affective experience ranging from calmness to extreme excitement. In this kind of dimensional theory of emotion, emotional categories found in everyday language (for example, happiness, joy, depression, anger) are interpreted as correlating with different ratios of valence and arousal, hence being mappable within a space defined by orthogonal axes representing degrees of valence and arousal, respectively. For example, depression may be represented by low valence and low arousal, while joy may be represented by high valence and high arousal.
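
The dimensional model just described can be made concrete with a small sketch. The following Python snippet is an illustration only; the emotion labels echo the examples in the text, but the coordinate values are hypothetical and are not taken from the study. It places a few everyday emotion labels in a valence/arousal plane and reads off the nearest label for a measured pair of values.

```python
import math

# Illustrative placement of everyday emotion labels in a two-dimensional
# valence/arousal space. Coordinates are hypothetical values in [-1, 1],
# chosen only to mirror the examples in the text (depression: low valence,
# low arousal; joy: high valence, high arousal).
EMOTION_SPACE = {
    "joy":        ( 0.8,  0.7),
    "happiness":  ( 0.7,  0.4),
    "anger":      (-0.6,  0.8),
    "depression": (-0.7, -0.6),
    "calmness":   ( 0.3, -0.7),
}

def nearest_emotion(valence: float, arousal: float) -> str:
    """Return the label whose (valence, arousal) point lies closest to the
    measured pair, i.e. a crude categorical read-out of the dimensional model."""
    return min(
        EMOTION_SPACE,
        key=lambda label: math.dist((valence, arousal), EMOTION_SPACE[label]),
    )

if __name__ == "__main__":
    # A high-valence, high-arousal measurement maps to "joy".
    print(nearest_emotion(0.75, 0.65))
```
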
Arousal is commonly measured using galvanic skin response (GSR), also known as skin conductance. [23] The conductance of the skin is directly related to the production of sweat in the eccrine sweat glands, which is entirely controlled by the human sympathetic nervous system. Increased sweat gland activity is thus directly related to electrical skin conductance. Hence, measuring both GSR and EMG provides sufficient data to provide an interpretation of the emotional state of a game player in real time, according to a phasic emotional model.

This paper describes and analyzes the results of a preliminary experiment that investigates the role of sound in enabling player immersion in the FPS game Half-Life 2. [24] The investigation is designed to provide both subjective responses (through the use of questionnaires) and objective measurements (through the use of electromyography (EMG), galvanic skin response (GSR), electroencephalography (EEG), electrocardiography (ECG) and eye tracking equipment). The overall aim of the experiment is to find external (that is, objective) measures that may be reliably correlated with subjective experiences assessed via questionnaires in order to provide more detailed descriptions of the emotional experience of game players during gameplay, both in the degree of emotions experienced and in the timescale of emotional changes and modulations. It is further hoped that this method may lead to real-time measures for states of immersion of players playing first-person shooter computer games. Finally, correlating discriminations within psychophysiological data with different categories of immersion can provide at least one method for validating those categorizations. The experiment is preliminary since the psychophysiological characterization of states of immersion is not yet well developed.

The study further aims to provide a psychophysiologically-based answer to the assumption that sound plays a role in enabling the immersion of the player in the FPS game world. If the results of the experiment provide a positive answer, that sound does indeed play this role, it is envisaged that future experiments, using a similar methodology, will be designed to investigate more specific questions about the relationship between the player and sound in FPS games.

The experiment was conducted in May 2008 in the Game and Media Arts Laboratory at Blekinge Institute of Technology (BTH) in Sweden. The investigation of sound formed part of a larger psychophysiological investigation into the nature of the player experience in computer games. This paper is also limited to the analysis of GSR, EMG and questionnaire data. Further analysis taking into account the other data types is ongoing.

Method

Subjects played a Half-Life 2 game mod especially designed for a short immersive playing time of maximum 10 minutes. The game mod was played four times with different sound modalities and physiological responses were measured together with questionnaires (assessing subjective responses) for each modality.

2.1 Design
The game sessions were played under four different conditions, corresponding to the permutations of the independent variable (sound modality): playing with diegetic game sounds (normal sounds), playing with speakers completely turned off (no sounds, no music), playing with diegetic game sounds and an additional music loop (sounds and music), and playing with diegetic game sounds turned off and hearing only the music loop (only music). Participants played under each condition in a shifting order to eliminate repeated-measures effects (using a Latin Squares design). Physiological responses (as indicators of valence and arousal) were recorded for each session as well as questionnaire answers. Questionnaire item order was randomized for each participant using the open-source software LimeSurvey. [25]
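
The counter-balanced ordering referred to above can be produced with a balanced Latin square. The sketch below is a minimal illustration, assuming the four condition labels of this study; the authors' actual assignment of participants to orders is not reproduced here.

```python
def balanced_latin_square(items):
    """Williams-style balanced Latin square for an even number of items:
    each item appears once per row and once per column, and immediate
    carry-over effects are balanced across rows."""
    n = len(items)
    # First-row offset pattern: 0, 1, n-1, 2, n-2, 3, ...
    first_row = [0 if j == 0 else (j + 1) // 2 if j % 2 else n - j // 2
                 for j in range(n)]
    return [[items[(start + off) % n] for off in first_row] for start in range(n)]

CONDITIONS = [
    "diegetic sounds + nondiegetic music",
    "diegetic sounds only",
    "nondiegetic music only",
    "no sound or music",
]

if __name__ == "__main__":
    # Participants would be assigned to rows cyclically (participant % 4).
    for row, order in enumerate(balanced_latin_square(CONDITIONS)):
        print(row, order)
```
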
2.2 Participants
Data were recorded from 36 students and employees, recruited from the three BTH University campuses and their age ranged between 18 and 41 (M = 24, SD = 4.89). 19.4% of all participants were female. When asked how frequently they play digital games, 50% answered that they play games every day, 22.2% play weekly, 22.2% play occasionally and only 5.6% play rarely or never. However, it should be noted that 62.1% of all the males play on a daily basis and 20.7% play weekly. In contrast to that, most of the females enjoyed playing only on an occasional (57.1% of all females) or weekly (28.6%) basis.

Out of all participants, 47.2% considered themselves casual gamers, 38.9% said that they belong to the hardcore gamer demographic and 13.9% could not identify themselves with any of those. Nevertheless, no female participant considered herself to be a hardcore gamer and 71.4% of all females said they were casual gamers. Male gamers were more evenly distributed among hardcore (48.3%) and casual (41.4%) gamers: the larger percentage of males considering themselves hardcore players.

91.7% of the participants were right-handed and 50% were wearing glasses or contact lenses. 94.4% believed they had full hearing capacity (5.6% stated explicitly that they lack full hearing capacity). 69.4% had a preference for playing with a music track on. 44.4% preferred playing with surround sound speakers, while 33.3% opted for playing with stereo headphones. 11.1% liked playing with stereo speakers and the final 11.1% preferred surround sound headphones. 33.3% played an instrument. 13.8% played the piano or keyboard and 8.3% played the guitar. 41.7% saw themselves as hobby musicians – some people worked with sound recording and programming but did not play instruments.

66.7% of participants were enrolled as University students. 16.7% already had a Bachelor's degree and 13.9% had a Master's degree. 61.1% of the participants had already played the digital game Half-Life 2 before, 30.6% played it between 10 and 40 hours and 58.3% played it on a PC, leaving only one participant who played it on an Xbox 360.

To estimate preconceptions of sound immersion, participants were also asked how important they considered sounds, in general, for first-person shooters (FPS). The results were rated on a 5-point scale ranging from 1 (not important) to 5 (very important). 55.6% claimed that sound was very important and 36.1% said it to be important. The term "immersive", which was also part of the questionnaire items assessing sound immersion, was explained to participants beforehand as "the feeling of being encapsulated inside the game world and not feeling in front of a monitor anymore". This was so phrased for reasons of lay intelligibility and deemed to be a synthesis of previous definitions of game immersion noted above, particularly those of Ermi and Mäyrä, Garcia and Carr. This is suitable for investigating whether immersion in a very general sense may be distinguishable in psychophysiological measurement features; if so, ongoing experiments may address the psychophysiological detection of finer distinctions within the broad category of immersion.

2.3 Apparatus
Facial EMG. We recorded the activity from left orbicularis oculi, corrugator supercilii, and zygomaticus major muscle regions, as recommended by Fridlund and Cacioppo, [26] using BioSemi flat-type active electrodes (11 mm width, 17 mm length, 4.5 mm height) with sintered Ag-AgCl (silver/silver chloride) electrode pellets having a contact area 4 mm in diameter. The electrodes were filled with low impedance highly conductive Signa electrode gel (Parker Laboratories, Inc.). The raw EMG signal was recorded with the ActiveTwo AD-box at a sample rate of 2 kHz and using ActiView acquisition software.

Galvanic skin response (GSR). The impedance of the skin was measured using two passive Ag-AgCl (silver/silver chloride) Nihon Kohden electrodes (1 microamp, 512 Hz). The electrode pellets were filled with TD-246 skin conductance electrode paste (Med. Assoc. Inc.) and attached to the thenar and hypothenar eminences of the participant's left hand.

Video recording. A Sony DCR-SR72E video camera (handycam, PAL) was put on a tripod and positioned approximately 50 cm behind and slightly over the right shoulder of the player for observation of player movement and in-game activity. In addition, the video recordings served as a validation tool when psychophysiological data were visually inspected for artifacts and recording errors.

Game experience survey. Different components of game experience were measured using the game experience questionnaire (GEQ). [27] As shown in a previous study by Nacke and Lindley, [28] the GEQ components can assess experiential constructs of immersion, tension, competence, flow, negative affect, positive affect and challenge with apparently good reliability.

Sound immersion. Subjective player experience of sound immersion was measured using our own additional questionnaire items rated on a 5-point scale ranging from 1 (for example, not immersive) to 5 (for example, extremely immersive) for sessions where sound was audible. Specific sound questions included the following:

• How important are sounds in general for you in FPS games?
• Diegetic Sounds:
  How immersive were the following?
  • Background sounds
  • Sounds of opponents
  • Sounds that you produced yourself (player-produced sounds)
  How important was the sound for you in the level you just played?
• No Sound, No Music:
  How much did it bother you to play without sound?
• Nondiegetic Music Only:
  Did you miss the sound effects in this level? (Yes/No)
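
As a rough illustration of how per-component GEQ scores (means of item ratings on the 0-4 scale, cf. Table 1 later in the paper) might be computed, the following sketch uses a purely hypothetical item-to-component mapping; the real GEQ item sets are defined by the questionnaire's authors [27] and are not reproduced here.

```python
from statistics import mean

# Hypothetical mapping from questionnaire item IDs to GEQ components.
# The real GEQ defines its own item sets [27]; these IDs are placeholders.
GEQ_COMPONENTS = {
    "immersion":       ["q01", "q08"],
    "tension":         ["q02", "q09"],
    "competence":      ["q03", "q10"],
    "flow":            ["q04", "q11"],
    "negative_affect": ["q05", "q12"],
    "positive_affect": ["q06", "q13"],
    "challenge":       ["q07", "q14"],
}

def geq_component_scores(responses):
    """responses: dict item_id -> rating on the 0-4 GEQ scale.
    Returns the mean item rating per component."""
    return {
        component: mean(responses[item] for item in items)
        for component, items in GEQ_COMPONENTS.items()
    }

if __name__ == "__main__":
    example = {f"q{i:02d}": (i % 5) for i in range(1, 15)}
    print(geq_component_scores(example))
```
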
[Figure 1: The eye-tracking screen, EMG sensors and electroencephalography array]

Other apparatus used, but not included in this analysis (the data will form the subject of a future paper), were a Biosemi 32-channel EEG system and a Tobii 1750 eye tracker (cf. Figure 1).

2.4 Procedure
We conducted all experiments on weekdays in the time from 10:00 a.m. to 6:00 p.m. with each experimental session lasting approximately two hours. The experiments were advertized especially to graduate and undergraduate students. All participants were invited to the newly established Game and Media Arts Laboratory at Blekinge Institute of Technology, Sweden. After a brief description of the experimental procedure, each participant filled in two forms. The first one was a compulsory "informed consent" form (with a request not to take part in the experiment when suffering from epileptic seizures or game addiction). The next one was an optional photographic release form. Each participant had to complete an initial demographic and psychographic assessment questionnaire prior to the experiment, which was immediately checked for completeness. Participants were then seated in a comfortable office chair, which was adjusted according to their individual height, electrodes were attached and the participant was asked to rest and focus on a black cross on a grey background on the monitor. During this resting period of 3-5 minutes, physiological baseline recordings were taken.

Next, participants were seated in front of a high-end gaming computer, which used a 5.1 surround sound system for playback (Half-Life 2 sound quality settings on 'high'), and were encouraged to get acquainted with the game controls for two minutes (using a non-stimulus game level) if they did not indicate a priori FPS experience. The participants played the same Half-Life 2 game level four times for 10 minutes (or until completed) in a counter-balanced order to eliminate repeated-measures effects. As this was a preliminary experiment designed to produce broad subjective answers and objective measurements from which future, more refined experiments can be designed, the following broad sound on/off modalities were chosen:

1. the level with all diegetic sounds and nondiegetic music audible
2. the level with just diegetic sounds audible
3. the level with just nondiegetic music audible
4. the level with no sound or music audible

After each modality, participants were asked to report their subjective experiences using questionnaires. After completion of all modalities, participants were thanked for their participation and paid a small participation fee before they were escorted out of the lab.

2.5 Reduction and Analysis of Data
Recorded psychophysiological data were visually inspected using BESA (MEGIS Software GmbH, Germany) and EMG data were also filtered using a Low Cutoff Filter (30 Hz, Type: forward, Slope: 6 dB/oct) and a High Cutoff Filter (400 Hz, Type: zero phase, Slope: 48 dB/oct). If data remained noisy, they were excluded from further analysis. EMG data were rectified and exported together with GSR data at a sampling interval of 0.49 ms to SPSS for further analysis.

Mean values for physiological responses were calculated for epochs of complete session times (varying between five and 10 minutes). Psychophysiological data were corrected for errors using log and ln transformations. After histogram inspection, data was assumed to be close to a normal distribution (without elimination of single outliers). Means were calculated for items of each of the seven GEQ questionnaire components (immersion, tension, competence, flow, negative affect, positive affect, and challenge).
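
The filtering, rectification and epoch-averaging steps described above can be approximated as follows. This is a minimal Python/SciPy sketch, with a zero-phase Butterworth band-pass standing in for the BESA low/high cutoff filters (whose types and slopes differ); it is not the authors' BESA/SPSS pipeline.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 2000  # Hz, EMG sample rate reported in the study

def preprocess_emg(raw, low_hz=30.0, high_hz=400.0):
    """Band-pass filter (30-400 Hz) and full-wave rectify a raw EMG trace.
    A zero-phase Butterworth filter only approximates the BESA low/high
    cutoff filters described in the text."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=FS, output="sos")
    filtered = sosfiltfilt(sos, raw)
    return np.abs(filtered)            # rectification

def epoch_mean(rectified, log_transform=True):
    """Tonic (whole-session) mean of the rectified signal, optionally
    ln-transformed as in the reported analysis (values in ln[uV])."""
    m = rectified.mean()
    return np.log(m) if log_transform else m

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_session = rng.normal(0.0, 20.0, size=FS * 60)   # one minute of noise
    print(epoch_mean(preprocess_emg(fake_session)))
```
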
diegetic sound (whether combined with music or not) also seems (χ2(5) = 25.16, p < .001), corrugator supercilii EMG means
to be an enabling factor of the subjective experience of (χ2(5) = 57.65, p < .001), zygomaticus major EMG means
challenge and flow — flow especially seems be experienced (χ2(5)= 16.43, p = .006) and GSR means (χ2(5)= 52.41, p <
more easily with diegetic sounds. .001). Hence, degrees of freedom were corrected using
Greenhouse-Geisser estimates of sphericity for EMG means (ε =
The complete absence of sound seems to negatively influence .63, ε = .45, and ε = .76) and GSR means (ε = .47).
the subjective feeling of immersion to a significant degree as it
is the lowest rated item in this modality. With missing auditory Nevertheless, neither EMG responses (orbicularis oculi: F(1.90,
feedback, there is also a decrease in the feeling of competence 53.21) = 0.86, p > .40, corrugator supercilii: F(1.36, 38.02) =
among all participants. The combined presence of sound and 0.66, p > .40, zygomaticus major: F(2.27, 63.58) = 0.61, p >
music seems to also have a soothing effect on play as ratings for .40), nor GSR [F(1.40, 39.05) =0.68, p > .40] could achieve
tension and negative affect are very low under this modality. It statistical significance for the repeated measures design. The
is also the modality that has the highest score for the immersion results of the ANOVA show that tonic measurements of
item. However, it should be noted that music also seems to be physiological response from an accumulated game session were
somewhat distracting from game flow since flow ratings are not significantly affected by different sound modalities.
higher when music is omitted and only diegetic sounds are
presented. Discussion and Future work
For GEQ components Immersion (χ2(5) = 3.49, p > .50), This paper has described and analyzed the results of a
Competence (χ2(5) = 10.28, p > .05), Negative Affect (χ2(5) = preliminary experiment to measure the effect of FPS sound and
5.36, p > .30), and Flow (χ2(5) = 10.12, p > .05), Mauchly’s test music on player valence and arousal and to detect any possible
indicated that the assumption of sphericity had been met, but for correlations between measurable valence and arousal features
the remaining items Tension (χ2(5) = 11.98, p < .05), Positive and self- reported subjective experience.
Affect (χ2(5) = 11.56, p < .05) and Challenge (χ2(5) = 23, p <
.05) it was violated. Therefore, degrees of freedom were There are two important and related results. Firstly, the data
corrected for the latter three using Greenhouse-Geisser estimates gathered from the subjective questionnaires (see Table 1) shows
of sphericity (ε = .80, ε = .84, and ε = .71). a significant statistical difference between the four modalities
over the GEQ components. This is particularly the case with
Statistical significance was achieved for all components Flow and Immersion, the results of which show higher scores
(Immersion: F(3, 105) =8.20, p < .001), Competence: F(3, 105) when diegetic sound is present than when it is not. Prima facie,
= 4.49 , p < .01), Negative Affect: F(3, 105) = 9.75 , p < .001), this would indicate that diegetic sound does indeed have an
Flow: F(3, 105) = 9.42 , p < .001), Tension: F(2.39, 83.73) = immersive effect in the case of FPS games. Music also appears
7.85 , p < .001), Positive Affect: F(2.52, 88.21) = 6.18 , p < .01), to increase immersion, while reducing tension and negative
and Challenge: F(2.14, 74.78) = 5.17 , p < .01)). These results affect, at the expense of a reduction in the experience of flow
show that the subjective game experience measured with the within gameplay.
GEQ was significantly affected by the different sound
modalities. Secondly, the psychophysiological data do not support the
subjective results, but are instead both inconclusive and lacking
Table 2 shows a comparison of the normalized physiological statistical significance (see Table 2). If we maintain the
responses. Negatively valenced arousal would be indexed by assumption that physiological evidence, in these circumstances,
increased GSR and corrugator supercilii activity (with decreased can be used to confirm the subjective evidence, then there are
zygomaticus major and orbicularis oculi activity). [29] This is several potential explanations for the lack of correlation between
not the case for any of the accumulated measurements shown. the two result sets. Further analysis and experimentation will be
The only notable decrease of orbicularis oculi and zygomaticus required to explain this disparity. Some initial possible
major activity is shown under the no sound condition. However, explanations (assuming a valid experiment design and
corrugator supercilii activity is also decreased and galvanic skin implementation) include:
response is somewhat consistent across conditions.
Physiological response             Sound & Music   Diegetic Sound Only   Nondiegetic Music Only   No Sound or Music
Orbicularis oculi (ln[µV])         1.85 (0.37)     1.85 (0.37)           1.86 (0.42)              1.79 (0.31)
Corrugator supercilii (ln[µV])     1.94 (0.25)     1.90 (0.27)           1.95 (0.33)              1.89 (0.26)
Zygomaticus major (ln[µV])         1.98 (0.40)     2.00 (0.38)           2.00 (0.43)              1.94 (0.35)
Galvanic skin response (log[µS])   0.72 (0.18)     0.73 (0.17)           0.70 (0.18)              0.72 (0.17)

Table 2: Means (and standard deviations) for the corrected physiological measurements (EMG and GSR) under the different modalities³

³ N = 29 (after data reduction).
Accordingly, Mauchly's test indicated that the assumption of sphericity had been violated for orbicellaris oculi EMG means

This paper has described and analyzed the results of a preliminary experiment to measure the effect of FPS sound and music on player valence and arousal and to detect any possible correlations between measurable valence and arousal features and self-reported subjective experience.

There are two important and related results. Firstly, the data gathered from the subjective questionnaires (see Table 1) shows a significant statistical difference between the four modalities over the GEQ components. This is particularly the case with Flow and Immersion, the results of which show higher scores when diegetic sound is present than when it is not. Prima facie, this would indicate that diegetic sound does indeed have an immersive effect in the case of FPS games. Music also appears to increase immersion, while reducing tension and negative affect, at the expense of a reduction in the experience of flow within gameplay.

Secondly, the psychophysiological data do not support the subjective results, but are instead both inconclusive and lacking statistical significance (see Table 2). If we maintain the assumption that physiological evidence, in these circumstances, can be used to confirm the subjective evidence, then there are several potential explanations for the lack of correlation between the two result sets. Further analysis and experimentation will be required to explain this disparity. Some initial possible explanations (assuming a valid experiment design and implementation) include:

1. The GEQ incorporates distortions derived from the retrospective "storytelling" context of the questionnaire.

2. The physiological data, gathered over 10 minutes of play, contains too much noise to produce a significant result. It must be noted that the data analyzed here was accumulated over one game session and even after inspection of histograms and logarithmic correction not all measurements were perfectly normally distributed. Even though a non-parametric statistical analysis or a range correction of physiological responses could be conducted, it is unlikely that this will show significant effects over the 10 minute timescale used. Connecting physiological response data to game events using more precise phasic measurements, as described in Nacke et al. [30], could yield more insight into the emotional effects of sound. This level of detail can be achieved but it would need an additional method for recording subjective responses at the same event level precision to be correlated with.
3. The subjectively reported experience is a function of the modulation of emotions within a smaller time scale than that used in the analysis of psychophysiological data. This means that the emotional net effect may be the same, but the details of emotional dynamics produce different subjective experiences as reported by the GEQ. As an analogy: a flat sea and a sea with big waves may have the same mean level, but one makes for much better surfing than the other. This might be detectable by derived measures from the current data set.

4. The subjectively reported results are not measurable using our equipment and methods. In particular, the source of the GEQ components reported in Table 1 may have a different psychological explanation than that captured by the arousal/valence model of emotion. This consideration raises the need for more thorough ongoing conceptual investigations of terms such as immersion, presence, flow, challenge and fun (as started in [31]). Based upon a richer range of linguistic and conceptual distinctions, it may be possible to devise experiments having more discriminating power among the range of descriptive models thus created. In particular, these are complex concepts used in different ways by different authors, and it may not be the case that they have simple mappings to instantaneous emotions measured by psychophysiological techniques. Explanatory theories then need to move to higher levels in modeling the structuring of a series of measurable emotions, related to perceptions and player actions, to provide a more systemic account of the foundations of the quality of play experience, as suggested by Lindley and Sennersten [32].

These questions must be addressed by ongoing research. To our surprise, our research contradicts the results presented by Shilling et al. [18], who indicated a strong correlation between sounds and physiologically elicited emotions. Unfortunately, Shilling et al. did not report direct values of their measures that would allow a direct comparison. It remains for more thorough future analysis to find greater scientific evidence for a relationship between sound and psychophysiological measures. Our future aim is to investigate this within our research.

Acknowledgements

The research reported in this paper has been partially funded by the FUGA (FUn of GAming) EU FP6 research project (NEST-PATH-028765). We thank our FUGA colleagues, especially Niklas Ravaja, Matias Kivikangas, and Simo Järvelä, for many stimulating inputs to this work. We would also like to thank laboratory assistant Dennis Sasse for helping in the execution of the experiment.

References

[1] id Software, Doom series, Activision, (1993-2004).
[2] id Software, Quake series, Activision, (1996-2005).
[3] Valve Software, Half-Life series, Electronic Arts, (1998-2004).
[4] Grimshaw, M., The Acoustic ecology of the first-person shooter: The player, sound and immersion in the first-person shooter computer game, Saarbrücken, VDM Verlag Dr. Mueller e.K., (2008).
[5] McMahan, A., Immersion, engagement, and presence: A new method for analyzing 3-D video games, in M. J. P. Wolf & B. Perron (Eds.), The Video Game Theory Reader, New York, Routledge, 67–87 (2003).
[6] Kearney, P. R., & Pivec, M., Immersed and how? That is the question, Game in' Action, Sweden, (June 13–15, 2007).
[7] Pine, B. J., & Gilmore, J. H., The experience economy: Work is theatre & every business a stage, Boston, Harvard Business School Press, (1999).
[8] Ermi, L., & Mäyrä, F., Fundamental components of the gameplay experience: Analysing immersion, Changing Views – Worlds in Play, Toronto, (June 16–20, 2005).
[9] Murray, J. H., Hamlet on the holodeck: The future of narrative in cyberspace, Cambridge, The MIT Press, (2000).
[10] Garcia, J. M., From heartland values to killing prostitutes: An overview of sound in the video game Grand Theft Auto Liberty City Stories, Audio Mostly 2006, Piteå, Sweden, (October 11–12, 2006).
[11] Carr, D., Space, navigation and affect, in Computer Games: Text, Narrative and Play, Cambridge, Polity, 59-71 (2006).
[12] Laurel, B., Computers as theatre, New York, Addison-Wesley, (1993).
[13] Jørgensen, K., On the functional aspects of computer game audio, Audio Mostly 2006, Piteå, Sweden, (October 11–12, 2006).
[14] Murphy, D., & Pitt, I., Spatial sound enhancing virtual story telling, Lecture Notes in Computer Science, 2197, 20–29 (2001).
[15] Grimshaw, M. & Schott, G., A Conceptual framework for the design and analysis of first-person shooter audio and its potential use for game engines, International Journal of Computer Games Technology, 2008, (2008).
[16] Stockburger, A., The rendered arena: Modalities of space in video and computer games, unpublished PhD thesis, London, University of the Arts, (2006).
[17] Grimshaw, M., The Resonating spaces of first-person shooter games, Proceedings of The 5th International Conference on Game Design and Technology, Liverpool, (November 14–15, 2007).
[18] Shilling, R., Zyda, M., & Wardynski, E. C., Introducing emotion into military simulation and videogame design: America's Army: Operations and VIRTE, GameOn, London, (2002).
[19] MOVES Institute, America's army, Monterey, Naval Postgraduate School, (2002).
[20] Cacioppo, J.T., Tassinary, L.G. and Berntson, G.G., Psychophysiological science, Handbook of psychophysiology, 3-26 (2007).
[21] Cacioppo, J.T., Handbook of Psychophysiology, Cambridge University Press, (2007).
[22] Lang, P. J., The emotion probe. Studies of motivation and attention, American Psychologist, 50, 372–385 (1995).
[23] Lang, P. J., Greenwald, M. K., Bradley, M. M., & Hamm, A. O., Looking at pictures: Affective, facial, visceral, and behavioral reactions, Psychophysiology, 30, 261–273 (1993).
[24] Valve Software, Half-Life 2, Electronic Arts, (2004).
[25] Schmitz, C., LimeSurvey (v1.70) from http://www.limesurvey.org (2008).
[26] Fridlund & Cacioppo, Guidelines for human electromyographic research, Psychophysiology, 23(5), 567-589 (1986).
[27] IJsselsteijn, W.A., Poels, K. and de Kort, Y.A.W., Measuring player experiences in digital games. Development of the Game Experience Questionnaire (GEQ), Manuscript in preparation.
[28] Nacke, L. and Lindley, C., Boredom, Immersion, Flow - A
Pilot Study Investigating Player Experience, IADIS Gaming
2008: Design for Engaging Experience and Social Interaction,
IADIS, Amsterdam, The Netherlands, (2008).
[29] Ravaja, N., Turpeinen, M., Saari, T., Puttonen, S. and
Keltikangas-Jarvinen, L., The Psychophysiology of James Bond:
Phasic Emotional Responses to Violent Video Game Events,
Emotion, 8 (1), 114-120 (2008).
[30] Nacke, L., Lindley, C. and Stellmach, S., Log who’s
playing: psychophysiological game analysis made easy through
event logging, International conference on Fun and Games 2008,
Springer, Eindhoven, The Netherlands, (2008).
[31] Lindley, C.A., Nacke, L. and Sennersten, C., What does it
mean to understand gameplay? First Symposium on Ludic
Engagement Designs for All, Aalborg University, Esbjerg,
Denmark, (2007).
[32] Lindley, C. A. and Sennersten, C., A Cognitive Framework for the Analysis of Game Play: Tasks, Schemas and Attention Theory, Workshop on the Cognitive Science of Games and Game Play, The 28th Annual Conference of the Cognitive Science Society, Vancouver, Canada, (July 26–29, 2006).
Sound and the diegesis in survival-horror games

Daniel Kromand
Brydes Allé 23, 431
2300 Copenhagen S
kromand@itu.dk
Abstract. The paper analyzes the affordances of the soundscape in survival-horror games by examining the barrier between diegetic and non-diegetic sounds and how the validity of a sound cue is challenged. The three game examples all veil their aural warning cues with a broken causality and cause the player to distrust the cues' tie to the game world. This results in an uncertainty whether the sound comes from an in-game object or from non-diegetic ambience. The paper argues that the horror genre draws upon this discrepancy between soundscape and perceived reality to create an ambience of fear.
1. Introduction

The screeching tires in a racing game; the high pitch of incoming artillery fire; the birds chirping in a haunted forest; all of these are a natural and ingrained part of the virtual worlds we inhabit in video games. But in contrast to the real world, these sounds are not the causal effect of rubber moving across asphalt, a projectile moving through air at high speed, or a bird pressing air out of its lungs. Instead they are sound files played when certain parameters in the virtual world have been met. This difference might be subtle, but it justifies a thorough analysis of acoustic environments in games, as these have all been designed through conscious effort and are not results of physics. Games are virtual environments with content hard-coded into them. In a linear single-player game certain sounds can be set to play when the player moves through the module, creating a specific soundscape for each in-game area. Such a modular approach to audio design can create a very tightly tailored soundscape, where sound can provide a dramatic increase of tension.

This paper aims to analyze the dramatic use of audio design in survival-horror games, as this particular genre makes significant use of audio, which plays a central part in the main feature of the genre: fear and uneasiness. The genre of survival-horror rests upon a series of stylistic and design choices, the most notable being: a scary setting, a scarcity of gun ammunition, awkward camera angles, and puzzle solving (Whalen 2004). Defining the semantic elements of a genre is often difficult as none of them is singularly essential; rather they make up the parts of the holistic definition (Buscombe 2003). I have chosen three broadly recognized survival-horror games – BioShock, F.E.A.R., and Silent Hill 2 – that together form a generalizable reference to the genre as a whole. The goal is to prove how the audio design purposefully creates uncertainty in the division between the diegetic and the non-diegetic. This stresses the player into carefully considering his actions and puts him on alert even though no visual threats are apparent. The collapse of the barrier between the diegetic and non-diegetic soundscape is a strategy to build a horror atmosphere.

2. Sound in an acoustic ecology

To analyze the use of audio in survival-horror games it is imperative to structure a framework for the soundscape of video games. Mark Grimshaw argues for such a framework by utilizing theories of acoustic ecology, which combine a notion of spatiality and the aspect of time (Grimshaw 2007; 6). Grimshaw defines the acoustic ecology of multiplayer first-person shooters as "a set of relationships between players founded upon sound" (Grimshaw 2007; 6). This statement, apart from the expectation of several players, forms this paper's basic understanding of sound in an acoustic ecology, but it needs further elaboration in order to be operational within a broader theoretical framework. Grimshaw perceives an acoustic ecology as divided into sounds that relate to objects within the game ('players' in the aforementioned quote) and those that do not:

    The former form part of an understanding of the game's acoustic ecology whilst the latter do not. A conceptualization of an acoustic ecology helps to explain objects and actions in the game world because it comprises sounds produced by those objects and actions. (Grimshaw 2007; 221)

The understanding which Grimshaw refers to requires a player that is able to discern the sound object from the rest of the soundscape and to select the various affordances offered by it. The player relies on experience and contextualization to do so, which means that a sound object might produce different meanings for the same player depending on his familiarity with the causality inside the game and on the situation in which it arises (Grimshaw 2007; 136). The player thus continuously connects experienced sound objects to causal effects within the game world to build a personal framework for deciphering relevant information through the acoustic ecology.

The acoustic ecologies contain information about the relationships between objects in the game world and the actions that they perform. The players discern this information from the rest of the sensory filler through a continued connection of causal effects. Sound is a carrier of both immersion and information, the latter extracted through affordances as described earlier. Immersion through sound can come from a variety of aural objects; for example the beating heart that typically accompanies low health status seeks to give the player a sense of embodying the avatar.

Grimshaw's framework for an acoustic ecology and aural immersion in the game world will be the primary tool for analyzing the soundscape of single-player games, in the shape of the selected survival-horror games. The player's found affordances will be examined closely in order to reveal how sound is used for dramatic effect.
3. Trans-diegetic sound

The trans-diegetic effect of audio examined by Kristine Jørgensen (2007) is a transgression of the traditional barrier between diegesis and non-diegesis as explained by Bordwell/Thompson (2004). The traditional distinction between diegesis and non-diegesis divides the soundscape into sounds that are heard within the fictional world and those that are not. The film viewer's ability to understand this divide comes from repeated exposure to the language of films, where breakdowns of the barrier are rare and usually act as comic relief (such as the big band in Mel Brooks' Blazing Saddles that is thought to be non-diegetic, but then happens to be located out in the desert).

Jørgensen argues however that due to the interactivity of video games, sound can pass the barrier and in effect become trans-diegetic. Units in Warcraft III for example speak directly to the player as a way of conveying information, i.e. they speak from within the diegesis to the outside, while music in the game can function as a leitmotif for certain events, allowing the player to anticipate future events (Jørgensen 2007; 110). Trans-diegetic sound does not dissolve the barrier though: it merely causes a short transgression that still keeps the division between diegesis and non-diegesis intact. The trans-diegetic effect therefore typically transfers information from the game to the player, according to Jørgensen in two different versions: either as a reactive sound affirming player input or as a proactive sound informing the player of an altered game state (Jørgensen 2007; 116).

Jørgensen's argument relies upon the fact that players can interpret the soundscape of a given game as triggered by specific events, i.e. not being completely random. The ability to interpret can partly be learned through genre conventions and by keeping a consistency of sound within the individual game itself (Grimshaw 2007; 102-103).

4. Welcome to Rapture

When first checking into the grim world of BioShock (2K Games 2007) and immersing into the city of Rapture the scene is one of confusion and paranoia: monsters – the so-called splicers – are roaming the hallways, ammunition is limited, and the ambience is filled with running water and creaking metal. The soundscape of BioShock is densely populated with a general ambience, and sounds tied to the inhabitants of Rapture and the player.

The in-game objects of BioShock, such as the player, enemies, robots and vending machines, offer a quite varied selection of sound objects. Movement creates sound, which can help the experienced player identify the source (e.g. the heavy thud of a Big Daddy is easily recognized, while stealthier opponents, such as the spider splicer, create a minimum of noise). Speech is another form of identifying opponents, since several of them will replay sentences: some are preset, for example a specific splicer early in the game – a deranged mother bending over a baby carriage – plays the same monologue when encountered (although it is interruptible according to the player's actions), while other enemies replay their sentences at random, which both adds to the atmosphere of Rapture, but also informs the player of possible danger.

The ambience in BioShock lingers between the diegetic and the non-diegetic: a typical engine hum is for example audible, but underneath there are sporadic, high-pitched sounds. The latter do not appear to have any connection to the diegetic world of Rapture, and likewise while in Neptune's Bounty (an area of Rapture) the player will hear the omnipresent creaks of metal and running water, but might also notice a monotonous note played in crescendo. These two examples are arguably not background music as they do not have a constant presence and only constitute a limited tune. It might however be argued that they are examples of Jørgensen's point that ambience should not be taken literally: instead ambience frames the general atmosphere in the player's current area (Jørgensen 2007; 110). A crescendo is traditionally used in horror movies to instill anticipatory fear, an effect that BioShock mimics in a slightly different form. In BioShock the crescendos are activated after a time lapse and not on the basis of narrative structures. I argue that this leads to an increase in tension, as most players will be familiar with the culturally implied meaning of such a crescendo (a leitmotif for shocks), and thus they will be prepared for encounters, which may or may not take place. The soundscape furthermore often has a deep throb present in the ambience, only clearly audible at high volume settings. This throb is not held at the same note, and thus dispels silence and helps fill the soundscape with unfamiliar noises. Not all ambience in BioShock is unmelodic though: at select fights throughout BioShock regular background music is played, for example at the first encounter with a Houdini splicer in Arcadia or the fight against Dr. Steinman in the Medical Pavilion, which fulfills the dramatic effect of a standard boss fight.

The part of the ambience that is closer to the diegesis (e.g. the creaks, announcements over the speakers, etc.) serves to create a general atmosphere and fills the sensory system. Grimshaw refers to 'sensory fillers' as music that is irrelevant for gameplay (2007; 4), but in BioShock the sensory filler feeds the player too much aural input to cope with, which in turn causes confusion regarding the diegetic ties to the game world. The creaking noise of Rapture fills the soundscape and potentially masks or muddles the sounds produced by real, in-game enemies. The player's understanding of affordances can help him perform better in the game of BioShock, as certain sounds pass information regarding nearby opponents, but at the same time these exact affordances are mimicked in the ambience. This creates a dual relationship to the soundscape of BioShock: on one hand the spatial location of opponents is revealed through hearing and helps to prepare the player for an encounter, but on the other hand the ambience causes a distrust in the player towards the division of diegetic and non-diegetic sounds.

5. Paced horror in F.E.A.R.

F.E.A.R. (Monolith Productions 2005, henceforth FEAR) is set in modern day, albeit with slightly futuristic technology, and revolves around a special unit's attempt to handle a paranormal crisis. The game draws heavily on Japanese horror with frequent submersions in water, movement through dark crawlspaces, and a The Ring-inspired ghost called Alma. The game resembles a generic shooter at times and the player will often find himself in firefights with enemy soldiers, but other elements of the game create an atmosphere of horror and uneasiness, for example crawlspaces, which – paired with a limited flashlight battery and a giggling ghost – build a very tense experience.

The ambience in FEAR does not adapt to fights as seen in other games (Whalen 2004), and is usually deep bass or a slowly oscillating pitch. Similar to BioShock the ambience lacks tonality and a clear melody, and therefore is likely to be perceived as part of the diegetic world. An analytical player might identify the ambience as non-diegetic, but the atonality breaks the usual expectations of background music. Arguably atonal ambience can be experienced as closer to the diegesis as it is less cohesive, thereby provoking uncertainty about the sound's non-diegetic nature.
The ambience in FEAR thus assumes the role of sensory filler with uncertainty as an effect.

FEAR paces the player in a manner which is not unlike BioShock: the action sequences are designed as a wavelike curve with enemy resistance focused in smaller areas in contrast to being spread out over the level. The player rarely encounters a lone soldier and does not experience a steady stream of enemies. Therefore the player will also traverse a relatively large amount of space with no enemy encounters. Ironically it is in these unchallenged spaces that the player can be expected to fully experience the game's horror atmosphere. In areas where the player is left alone both visual and aural cues (such as the flickering of static, which is often, but not always, a premonitory warning) allude that paranormal threats are nearby. In the second encounter with the ghost Alma – inside the aforementioned crawlspace – the avatar's pulse and breathing become audible while Alma can be heard shuffling around and giggling. Earlier Alma was capable of killing the player, and thus her presence inside the crawlspace naturally causes distress. Before encountering Alma in the crawlspace, the player is aware that something is about to happen, primarily due to the set-up – a dark, claustrophobic space – and the aural cues of heavy breathing and pulse. Nothing paces the player to go through the crawlspace, but the space has to be traversed if the player wants to progress the game. At his own pace the player has no option but to move through an area where aural cues have indicated that danger might be lurking. This design resembles Hitchcock's well-known example of a bomb hidden underneath a table where two people are having a conversation, which will cause prolonged suspense for the audience if they are aware of its existence (cited in Perron 2004; 2). In FEAR the player knows that a threat is present, but the cues are slow to reveal exactly when and where it will happen. This threat warning system combined with a minimum of pacing is created to make the player move very slowly and to examine every minor aural or visual cue. The horror in FEAR thus relies on an uncertain cueing of threats and slow pacing that allows the player to fully immerse himself in the sinister atmosphere.

FEAR utilizes a variety of sounds that afford a threat warning, such as the radio communication between enemy soldiers, which can be heard far away and is the primary source for identifying their presence before visual confirmation can be made. Other sounds afford a similar warning, but lack the consistency of the radio signals, such as the static caught in the player's earpiece. The static is usually an indicator for approaching paranormal events, but is sometimes played without any following consequences. This pseudo-causality is designed to put the player on edge and make him carefully consider his moves even though no threat is imminent. The misuse of the static reduces the player's faith in it as a reliable tool, but accentuates that something might happen.
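As a concrete illustration of this design pattern (a sketch only, not code from FEAR or any real engine; the probabilities and the play_static callback are hypothetical), a warning cue with deliberately broken causality could be triggered along these lines:

```python
import random

# Illustrative sketch of a pseudo-causal warning cue: the static is usually tied
# to an actual approaching threat, but it occasionally fires with no threat
# behind it, so the player can never fully trust the cue. All names and
# probabilities here are hypothetical, not taken from any real game.
STATIC_ON_THREAT = 0.8        # chance the cue fires when a threat really is near
FALSE_STATIC_PER_CHECK = 0.02 # chance of a "false" cue on any quiet check

def update_warning_cue(threat_nearby: bool, play_static) -> bool:
    """Decide whether to play the earpiece static this check; returns True if played."""
    if threat_nearby and random.random() < STATIC_ON_THREAT:
        play_static()
        return True
    if not threat_nearby and random.random() < FALSE_STATIC_PER_CHECK:
        play_static()  # cue with no consequence: keeps the player on edge
        return True
    return False

if __name__ == "__main__":
    # Stand-in playback call that just reports the cue.
    played = update_warning_cue(threat_nearby=False, play_static=lambda: print("*static*"))
    print("cue played:", played)
```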
Aural affordances are, as in BioShock, limited because they have flawed causality. The unreliable affordances delivered by the audio design keep the player in a limbo between trust and apathy. Furthermore the slow pacing in the game's horror sequences adds another dimension, as the player will have to approach a possible threat at his own pace. FEAR gives release to the built up tension through intense firefights with enemy soldiers where the game resembles a generic shooter.

6. Finding the way through the fog in Silent Hill 2

Silent Hill 2 (Konami 2001) takes place in a town of the same name, which is immersed in constant fog. The player controls the avatar in third person and has to traverse a number of puzzles and monsters to reach the end. The games of the series have been examined academically in their own right, and especially the radio carried has received a lot of academic attention (Carr 2003, Whalen 2004, Perron 2004).

The radio carried by the avatar in Silent Hill 2 produces static whenever enemies are approaching, but similar to the static in FEAR it fails under certain conditions. At select locations the radio will produce static that sounds like a mixture of chirping birds and a squeaky wheel, but is not produced by any entities. This sound is not ambience, as it can still be heard when the background music is turned off in the options menu. Furthermore the radio does not register the "crawlers" (the zombies in their cockroach-like state) and the player is likely to be surprised when one of them runs off with a high-pitched squeal from underneath a car (as happens on Martin St.). Therefore the radio, which is the primary tool for locating enemies – due to the omnipresent fog – has both too much and too little sensitivity to be a completely reliable warning system.

The ambience of Silent Hill 2 is in some sense similar to both BioShock and FEAR, as it is highly metallic and often a deep bass with an oscillating pitch, but with a significant difference in the fact that the ambience is situated, which means that topological locations have their own sound (e.g. one apartment has another ambience than others). The dynamic and unsettling soundscape makes it harder for the player to get accustomed to the ambience. This distorts an efficient reading of the radio's affordances, as some rooms upon entry trigger static from nearby enemies, while other rooms trigger a new ambience (for example room 208 in Wood Side Apts.). The atonal quality of the ambience makes it harder to differentiate between static from the radio and from the ambience. Due to the fact that the radio often displays different sounding warnings, new background ambience can create uncertainty whether monsters are nearby or not.

As in both BioShock and FEAR the soundscape in Silent Hill 2 purposefully works against the player's effort to read the affordances of aural warning cues. The game does supply them, but with an irregular set of consequences and therefore a broken causality. The soundscape of Silent Hill 2 operates within a frame of uncertainty that constantly holds the player between knowledge and ignorance. Along with limited ammunition and field of vision, the soundscape efficiently builds a setting of horror.

7. Conclusion

Analysis of the three games shows they mainly have unmelodic ambience without a tonic (although music was included at select moments). The soundscape affords less (or in a sense too much) to the player, as sounds can be hard to tell apart. This was especially the case in FEAR and Silent Hill 2 where the player's in-game warning system proved to be unreliable. The player's difficulties of dividing sounds into non-diegetic and diegetic can be described as an actual collapse of the diegetic barrier. This effect goes beyond Jørgensen's theory of trans-diegesis, since the players do not prepare for future action based on non-diegetic information. Instead they act because they do not know if the sound is diegetic or not. The atonal ambience reduces the perceived field of non-diegetic sound, with the exception of the boss music, and all sounds can be suspected to belong to the diegesis. The diegetic collapse is the effect when players are no longer fully capable of discerning between diegetic and non-diegetic elements. It is safe to say that the soundscape of the three survival-horror games produces an 'un-knowledge' and while it does offer affordances, it also creates an uncertainty
about the validity of the said affordance. It is simply impossible to extract certain causality from some affordances because the ambience mimics the sounds flawlessly, and only by replaying a level can the player discern between the real warning sounds and those that are not. Even though the analysis accepts Grimshaw's thesis on the players' usage of soundscape affordances, I also claim that games like the ones analyzed purposefully hinder an efficient transfer of affordance, and that exact point might well be one of the reasons why some players enjoy these games.

The paper did not aim to dismantle either Jørgensen's or Grimshaw's theories, but rather to prove that certain genres have a varied use of soundscape and affordances. The idea of uncertainty as produced by the relation between ambience and in-game objects helps an understanding of both the formal design of survival-horror games and also the players' relationship to aural affordance. The constant guessing as to whether the sounds have a causal connection puts the players in an unusually insecure spot that might well build a more intense experience.
References

[1] Aarseth, Espen: Quest games as Post Narrative Discourse, Narrative Media, pp. 361-376, University of Nebraska Press 2004
[2] Bordwell, David & Kristin Thompson: Film Art: an introduction, 7th Ed., McGraw-Hill 2004
[3] Buscombe, Edward: "The Idea of Genre in the American Cinema" in Film Genre Reader III (Barry Keith Grant ed.), University of Texas Press 2003
[4] Carr, Diana: Play Dead – Genre and Affect in Silent Hill and Planescape Torment, Gamestudies vol. 3 issue 1, 2003
[5] Grimshaw, Mark Nicholas: The Acoustic Ecology of the First-Person Shooter, University of Waikato 2007
[6] Jørgensen, Kristine: "On Transdiegetic Sounds in Computer Games" in Digital Aesthetics and Communication, Vol. 5 (Arild Fetveit & Gitte Stald ed.): pp. 105-117, Intellect Ltd. 2007
[7] McMahan, Alison: "Immersion, Engagement, and Presence: A Method for Analyzing 3-D Video Games" in The Video Game Theory Reader (Mark Wolf and Bernard Perron ed.), Routledge 2003
[8] Perron, Bernard: Sign of a Threat: The Effects of Warning Systems in Survival Horror Games, University of Montreal 2004
[9] Whalen, Zack: Play Along – An Approach to Videogame Music, Gamestudies vol. 4 issue 1, 2004
Psychologically Motivated Techniques for Emotional Sound in Computer Games

Inger Ekman, inger.ekman@tml.hut.fi
Department of Media Technology,
Helsinki University of Technology
P.O. Box 5400
FIN-02015 HUT
Abstract. One main function of sound in games is to create and enhance emotional impact. The expressive model for game sound has its tradition in sound design for linear audiovisual media: animation and cinema. Current theories on emotional responses to fiction are mainly concerned with linear media, and only partly applicable to interactive systems like games. The interactivity inherent to games introduces new requirements for sound design, and suggests a break in perception compared with linear media. This work reviews work on emotional responses to fiction and applies it to the area of game sound. The synthesis is interdisciplinary, combining information and insights from a number of fields, including psychology of emotion, film sound theory, experimental research on music perception and philosophy. The paper identifies two competing frameworks for explaining fictional emotions, with specific requirements, and signature techniques for sound design. The role of sound is examined in both cases. The result is a psychologically motivated theory of sound perception capable of explaining the emotional impact of sound in film, as well as identifying the similarities and differences in emotional sound design for these two media.
1 Introduction

The importance of sound for establishing mood in audiovisual media is well recognized. It should come as no surprise that viewing a movie without sound can strip it of its emotional impact, making the events pictured on screen seemingly distant and of little relevance. The potency of sound is also rather well exploited at least in contemporary gaming. The model for sound design in these games is often that of film sound; game sounds should be "Cinematic" [3] or "Bigger than life" [22].

However, it is a well known fact that games differ from film. Also, sound design in games is different from film sound on several points. An obvious disparity resides in the technical realization of game sound compared to film sound. For one, the expected technical setups for game and film consumption are of a vastly different nature. Technical aspects are, however, outside the scope of this paper. Instead, the focus is on the special representational requirements that games, as interactive systems, put on the sound design as compared to those within static linear media. Interactivity creates a fundamental differentiation in what constitutes the sound design for film and games, respectively. This is to say that the final mix for a film differs by nature from the end result of game sound design, usually a set of programmatically expressed rules for combining/manipulating sound samples as a player interacts with the system. While technology dictates the alternatives on both sides, the distinction between the two is present by nature of film vs. game, and exists independently of technological issues. This aspect is the focus of this investigation.

These are the challenges game sound has to conform to: 1) Whereas film sound is written to a fixed set of actions, seen through predefined points of view, game sounds must support alternative paths and viewpoints through the story. 2) Film sound is written to sequences of actions with known durations. Game sounds must span action sequences of flexible temporal duration. 3) Games ask players to become active and play. Hence, sounds must support action, respond to player control and often survive high repetitiveness. The question we are concerned with here is how to tackle all these issues while maintaining the emotional functions of sound.
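As a purely hypothetical illustration of what such a programmatically expressed rule can look like (none of the names below refer to a real engine API; play_sample is a stand-in), the sketch shows one way a repeated action sound can respond to player input and survive repetition by recombining and slightly varying samples at play time:

```python
import random

# Hypothetical sketch of one "rule" from a game sound design: instead of a fixed
# mix, the sound for a repeated player action is assembled at play time, picking
# a sample variant and jittering pitch and volume so heavy repetition does not
# expose an identical sound every time. Nothing here refers to a real engine API.
FOOTSTEP_VARIANTS = ["step_01.wav", "step_02.wav", "step_03.wav", "step_04.wav"]

def trigger_footstep(play_sample, surface_gain: float = 1.0) -> dict:
    """Choose a variant and randomise pitch/volume slightly on every trigger."""
    params = {
        "sample": random.choice(FOOTSTEP_VARIANTS),
        "pitch": random.uniform(0.94, 1.06),           # small pitch jitter
        "volume": surface_gain * random.uniform(0.85, 1.0),
    }
    play_sample(**params)
    return params

if __name__ == "__main__":
    # Stand-in playback call that just reports what would be played.
    trigger_footstep(lambda **p: print("play", p))
```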
Surprisingly little academic investigation into sound has addressed this issue. Also, while there are many books on game sound design, the majority of the literature deals either with the technological tools for creating the sounds or with managing the overall process of sound design, e.g. [2][25]. The best foundations for game sound design decisions can be found in the fragmental accounts written by designers [22][3][13], sharing glimpses of their intuition and experience in the craft through audio post mortems and project descriptions.

Underlying many successful design decisions is an implicit understanding of what kind of sounds and actions create emotional impact. The aim of this study is to bring together various sources of information, to contribute to an explicated version of this knowledge. Such knowledge may ultimately aid the design process, and it should be of benefit to researchers working to understand the phenomenon of game sound. To this purpose the paper will jointly address two close questions, namely why and how sound works to create emotion. Asking why looks for the reason behind having emotional reactions to sound emanating from fictional worlds in the first place. Answering this question requires looking at the relationship between player and game, and the role of sound in shaping this relation. The second question, how, deals with the mechanisms underlying emotional reactions to sound, seeking cognitive and biological explanations for the emotional experiences of a certain sound used in a certain context. Finally, a preliminary synthesis of these findings sketches out some practical sound design implications for emotional impact in film and games, respectively.

This work serves several purposes. First, my hope is that an understanding of the underlying processes of sound will provide useful concepts for practical work, by bringing to light the alternative possibilities and approaches for impacting the player experience. This view is in no way attempting to challenge the value of intuition to design, nor to explain away or trivialize the talent that goes into quality sound design. On the contrary, explanations of why and how masterful sound designs influence emotion are as much a celebration of the skills that went into designing them in the first place as tools for shaping new experiences in the future. I am also positive that a bit of clearly enunciated argument for why sound holds so much power can be of use, especially in defending sound when it has to compete for attention (or budget) with visual effects. Finally, this paper links into the academic discussion on sound design, and provides a synthesis of relevant literature from several vastly different fields. The topic of sound design is still advanced in many separate camps, partly because it is interdisciplinary in nature and that requires tapping into so many fields all at once. The goal in this paper is not to provide an exhaustive review. Rather it offers a selection of findings from different sources, suggesting both where future connections may be found and directions in which some of them may be explored further.

The structure of the paper is as follows: Chapter 2 discusses the nature of emotion in general, and looks at emotion theory applied to film and games. Chapter 3 considers the processes wherein film sounds influence emotional response and considers several sources of emotion in sound. Chapter 4 offers a summary on the emotional effect of sound and compares emotional sound in film with sound in the interactive context of games.

2 Emotions and Emotional Responses in Film and Games

In the psychology literature, whole chapters have been written about defining emotions. Perhaps it is due to the subjective nature of emotional experiences, as well as the culturally negotiated nature of what emotions are suitable and for whom, that it is hard to pin down precisely what constitutes an emotion. Intuitively, however, the struggle to define emotion seems puzzling. Most of us would seem to know, by intuition, what sort of experience the word emotion refers to. For the purpose of this text, let us though consider briefly what it implies. Oatley and Jenkins [20] mention the following three defining features:
- Emotion is usually caused by a person (consciously or unconsciously) evaluating an event of some significance in relation to a goal or situation.
- The core of an emotion is a readiness to act, and emotion influences actions by preparing for certain types of actions and causing a sense of urgency.
- An emotion is usually experienced as a distinctive type of mental state, sometimes accompanied or followed by bodily changes, expressions, actions.

Emotions are thus evaluations of a specific (visceral and urgent) type, which signal events of critical importance and relevance to the perceiver. They provide the perceiver with an evaluation of the event, producing (typically) either a positive or negative sensation. This sensation is referred to as valence. They also produce a sense of urgency. Generally, the activation part of emotions is called arousal.

2.1 Fictional Emotions?

From a biological viewpoint the function of emotions seems clear. They provide the foundations for successful behaviour in an environment with promises and perils. Emotional relations to film and games, however, are an example of emotional relation to fictional events. This has been a very controversial subject, and arguments go back and forth about whether all these emotional reactions, which have their cause in completely fictive events, are even real emotions at all. And if they are real, what is their cause? To shed some light on these questions, and as an inventory of the tools available for looking at the effects of sound, let us turn for a moment to psychology. What psychological support exists for emotional reaction to fiction?

The simplest objection to calling emotional responses to fiction real is related to the reality of the events underlying these responses. Most emotion theories require that evaluated events should be of real significance to the evaluating subject. This is precisely the concern underlying Tan's argument [27]. Investigating the emotional responses in film viewing, he bases the validity of his model in Frijda's [9] theory of emotions and discusses at length the compatibility of his theory with Frijda's basic tenets. In short, he argues that emotional responses can arise to fictive events as long as they are perceived as apparent reality. This is achieved mainly by the diegetic effect, a cognitive and perceptual illusion in the viewer's head. The illusion is maintained with several cinematic techniques, usually by dissolving and eradicating the medium. The witness position, Tan argues, is one of the very means by which the cinema excuses the viewer's passivity and explains, in keeping with the diegetic effect, the viewer's ability to see it all.

Another way to make room for fictional emotions is to allow emotional responses not only to real events, but to the fictions of our imagination. Damasio does just so: he calls the latter type as-if emotions, and describes in detail the processes in which the mind simulates affective responses [7]. According to Damasio, fundamentally affective processes are part of all rational decision-making, and function as a means of making decisions – we reason by simulations on how the outcome might feel. From this point of view, fiction would be just another way of running simulations.

2.2 Emotional Responses to Film

There are two dominant approaches to understanding how film elicits emotional responses. One approach is based on Freudian psychoanalysis, invoking as a key concept the Lacanian mirror stage, which "typifies an essential libidinal relationship with the body-image" [14]. This stage is called forth by identifying in film, a process that links the self with the film experience through subconscious mental responses (the basis of which is rooted in early childhood and formations of the Ego). An alternative approach explains the emotional reaction to film through cognitive appraisal of portrayed fictional events. Instead of identifying subconsciously, the viewer responds emotionally because of their cognitive investment in the fictional events. The viewers' emotional responses are related to motivational processes [27][9][16].

According to the appraisal theory, the key to understanding emotions is to understand the role fictive events play to the viewer. Tan identifies two types of emotion at play in the viewing situation: fiction emotions (F emotions, empathic emotions) and artefact emotions (A emotions, non-empathic emotions). Empathetic (F) emotions are emotional reactions to the story; they require empathy with the main character. Non-empathetic artefact emotions are linked to sensory pleasures – the viewer enjoys the good looks of the protagonist or the beautiful scenery. [27]

Non-empathetic emotions require no appreciation of events. The mechanism by which empathic emotions are created is
more complicated. Tan suggest the film viewer connects to 3 Roles of Sound in Emotional Appraisal
the fiction is through a witness position. This position of The appraisal theory would seem to give little explicit
agreed-upon inaction (even in the most anxious of moments, explanation of how sound contributes to emotion. The
we are content to just watch the film) dictates the relations question is to what extent sound influences the process of
between viewer and film events. Our commitment to the appraisal. It seems clear that for both film and games, sound
protagonist's cause gives rise to emphatic relations. However, is part of the process enforcing the inherent emotionality of
the inactivity dictated by the viewing position also creates events as appraised within each framework. However, the
tensions, such as worry or frustration when we know more practical and functional approach to sound design is different.
than the protagonist, but cannot change the situation. On the In film, sound makes actions seem real and consequential for
other hand, inactivity may allow us to more deeply the viewer, both factors mentioned as prerequisite to
empathize, since no action is asked of us. empathetic or F emotions [27]. Within this category of
emotions, two specific cases can be identified for sound.
2.3 Emotions in Games
The interaction in games usually has the player controlling a One goal of film (and game) production is to make things on
main character (or other objects in the game). Whereas screen seem real. A silent two-dimensional representation
cognitive appraisal of game events lends itself to an has limited apparent reality on its own, but adding sound
emotional impact, a key question in games is how the helps perceiving the pictures on screen as physical bodies.
narrative reality is maintained. Especially, games remove the Especially important in this aspect is the use of Foley, or
passive viewing position, previously suggested a main synchronized sound effects.
contributor of empathic effect in film. Also, the typical
structure of games – featuring repetitive action, often only As another case of supporting realism, an integral role for
loosely bound together by a story – is challenging to sound at least in classical (Hollywood) film production has
empathetic emotion. been to hide the medium. An example of such is the common
practise of continuing sounds over cuts, making them less
On the surface, it would seem that the active nature of apparently noticeable. Many other conventions of film sound
players in games break the passivity that Tan proposes so also contribute to the invisible medium effect, either directly
significant in relation to justification of emotions. However, or indirectly. Thus, a role of sound is to create a sense of
it turns out the activity of games lends itself rather well to the immediacy.
purposes of explaining emotion and the cognitive appraisal
model has also been invoked in relation to computer games In the context of games, the significant emotional investment
and emotions [21][15]. Activity, while challenging the goes into advancing goal-related progress. Most current
passive ride of empathetic emotion, provides an alternative games rely on both auditory and visual content to represent
frame in which to evaluate actions. the game world. Crucially, in the context of interaction,
sound comes to take on a new task: facilitator and
Perron [21] explicitly works from Tan’s theory to add confirmatory of action. Historically, technical considerations
gameplay emotions (he calls them G emotions), that arise have long dictated tools for sound expression in interactive
from the cognitive appraisal of game situations. Perron’s G context, which has forced (and allowed) game sound to
emotions arise just because, in games, the player is invested deviate from some film sound conventions. The role for
in acting. Emotional evaluation is fuelled by care for the sound in games is at least partly dictated by a functional
progress of the game. Lankoski [15] offers a detailed approach and a sound's impact defined by its capability of
breakdown of emotions within the gaming context, supporting and facilitating gameplay.
demonstrating how different goal evaluations give rise to
basic emotions and their combinations. Finally, in discussing empathic emotions we mentioned that
some appraisals require no cognitive investment in the story,
In games with a protagonist, gameplay emotions may equal but are linked purely to sensory pleasures. Interestingly, there
care for the protagonist, but this care is essentially different is no explicit mention of negative affect in the context of
from empathetic emotion: From the perspective of gameplay, artefact emotions. The same artefact emotions are present in
the protagonist is a means, a tool, for playing the game (and games as well, where part of the game can be enjoyed (or
achieving the personal goal of completing the task). not) apart from aspects of gameplay-related progress.

Whereas the protagonist can also provide a vessel for Soundtrack CD sales gives strong evidence that artefact
empathetic emotions, to some extent, the two frameworks for emotions functions with regards to both film and game
emotional evaluation - empathetic and gameplay - are music. Beautiful pieces of music obviously give themselves
competing. As Perron [21] notes, even in games, both the to being appreciated as such, disregarding of whether the
main story line as well as individual plot elements and viewer is attending to the story. Further, it is motivated to
narrative turning points are indeed furthered from time to broaden the spectrum of artefact emotions to include also
time through filmic means, using stills, predetermined negative effects. With thought to how visual material is used
animation sequences/ dialogue or cut scenes. During these for shock effects (e.g. displaying blood and entrails), it is
moments, the player is stripped of control and, effectively, easy to imagine a similar process in which unpleasant
reduced into a witness position. Empathetic emotion comes sounds, regardless of story, could produce negative affect.
with loss of activity. A similar point is made by Lankoski Both in film and games, sound provokes sensory pleasure
[15] when he suggests the empathetic capability of the player and displeasure.
is inversely related to the cognitive challenge of action - in
the heat of a battle, there is little time to ponder the 3.1 The Realism Fallacy
protagonist's feelings. The above categories identify the main roles of sound in
creating and steering the emotional experience. However, as

22
we shall soon see, there are unresolved contradictions hiding Inferences” [31]. This view is further fortified by subsequent
within these categories. Namely, the effects within the two results from experimental psychology, leading researchers to
latter categories of sound (gameplay and non-empathetic) suggest that there exists such a thing as unconscious emotion
seem to contradict the traditional sound design goal of [30]. For example, Öhman [32] has demonstrated fear
narrative realism. reactions in people who were presented with spider and snake
pictures subconsciously; that is, people became frightened of
Appraisal theory of emotion holds that empathetic emotional pictures they never even realized they had seen.
processing and value judgement is guided by conscious
attention to a story, and this cognitive investment is In light of unconscious value judgements, it is easy to read
heightened in the perceived realism of portrayal. On the other Tan's non-empathic emotions are but one example of these:
hand, it allows emotional experiences that arise from the sounds of the film are invoking emotion by nature of their
appreciation of the artefact: the pictures and sounds of the perceptual properties, unrelated to story at hand. This also
film/game as such. Similar appreciation is present in games, a opens up a way to understand and predict what would result
medium where technological artistry is elaborately in a positive (or negative, for that matter) value judgement.
showcased, often even used in promotion. One especially interesting factor is the importance of
familiarity and perceptual fluency for eliciting positive value
Tan’s view is that artefact emotions detach the viewer from judgements [23]. By this account, beauty is defined by the
the story, drawing attention away from the narrative towards ease in which a stimulus can be processed.
the film as artefact, thus making the actions within the
narrative less consequential for the viewer. This position 3.3 Misattribution and Making Sense of
raises a complicated question, namely how to interpret such Emotions
sounds that appear to transgress the borders of realism,
Affective responses include paraphernalia of bodily
despite experientially supporting narrative.

For example, while discussing the effect of sentiment, Tan and Frijda [28, p. 62] mention sound, especially orchestral music, as one possible source of the awe-inspiring. Awe requires a sense of overwhelming power, and this role is partly to be played by sound. The effect, according to Tan and Frijda, is the emotional function of total submission, a feeling underlying e.g. crying. Sound is thus a tool for portraying power and heightening sentiment. The question is by which channel this non-empathetic heightening of sentiment is capable of influencing the (empathetic) evaluation of narrative events.

The problem of border transgression is most apparent in two sound conventions that would seem to deviate from the pursuit of realism. In what seems like a blatant contradiction, they invoke a sense of realism with highly unrealistic sounds. One is the use of musical scoring. The other is the use of sound effects that mismatch what is seen on screen. Similar breaches are omnipresent in games as well, where sound elements effortlessly transgress borders, allowing objects within the story world (diegesis) to refer to non-diegetic space, and vice versa [8][12].

The above concerns are about where to draw the line of realism and about how emotional effects communicate across different categories of judgement. Upon closer scrutiny, both questions generalize to the way non-empathetic affect influences other sources of emotion (empathetic or gameplay-related appraisal). To proceed further, we need an explanation of how non-empathetic emotions arise, and a way of predicting when and how artefact emotions lend emotional meaning to other evaluative processes.

3.2 Unconscious Affective Processes

The theoretical frameworks dealt with above have considered affect by means of perceived experience. Nevertheless, many of the associations and effects of sound in both film and games seem to work on an unconscious level.

Several findings point to the fact that at least some evaluations of stimuli are made precognitively. Zajonc was among the first to point this out in his essay, famously entitled "Feeling and Thinking: Preferences Need No Inferences" [31].

responses (pounding heart, sweaty palms), and may also bring with them a certain action tendency (e.g. fight-or-flight). In fact, it has been suggested that one possible way in which we consciously attend to our affective processes is by appreciating the abrupt changes in the felt background state, or what Russell [24] has called core affect. This change would lead us to seek a cause for our altered state, leading us via cognitive processes to attribute the jolt to the most plausible event in our environment.

Whether or not we are willing to accept unconscious affect as the source of emotions, we can agree that, when consciously attended to, emotions tend to have an object. Emotions are evaluations of something. To be able to function properly, we must be able to determine an object for our emotions: something to be afraid of, or pleased by. This distinguishes emotions from moods, which are long-term affective states without an object. However, and here comes the catch, the events we cognitively allocate as objects do not necessarily have to be the true cause of our initial affective response. In fact, when it comes to reasoning about why we feel the way we do, we are prone to misattributions, and erroneously appraise our affects quite differently from their real causes even in everyday life.

A classical example of misattribution is a study by Schachter and Singer [26]. They injected subjects with doses of adrenaline, a hormone associated with an excited body state. Depending on the situation that followed the injection, subjects judged their aroused state as either anger or elation. It can be argued that in most cases our appreciation of our emotional state is at least partly determined by context.

3.4 Misattribution and Emotion in Film and Games

Misattribution is a process wherein the contextual appraisal of perceived emotional 'raw material' lends emotional meaning to an outside cause, irrelevant to that particular emotional stir. Now consider a similar process at work during film viewing or when playing a game, with music, sounds, pictures and actions all mingling to create emotional impacts. Is it not probable that, at some point, the true causes of our feelings might remain hidden from us? Is it not possible that we, stirred by our passions, unwittingly, in deciphering the
cause of our emotions, take them to be caused by whatever the film serves to us on a silver(-screen) plate? Could it be that we just happen to attend to a game event, and assume our emotions are caused by that event, when in fact they are not?

This is, essentially, what Annabel Cohen proposes. Cohen [5][6] has dealt extensively with the difficult question of why, and how, something as obviously constructed as the film score does not completely destroy the sense of realism in a film. On the contrary, as many composers will confirm, a carefully chosen (or composed) piece of music will actually heighten the sense of reality in a film. Music also seems to lend a great deal of emotion to events, in ways other than those proposed by cognitive appraisal theory. Cohen's answer lies in a congruence-associationist model of film viewing [18], whereby music focuses attention on those objects in the film that are congruent with the sound. At the same time, conscious attention is directed away from non-associated sounds, and attention is suppressed for stimuli irrelevant to ongoing cognitive processes.

The emotional impact comes from the fact that sounds, even when unattended to, will nevertheless affect the perception of objects in the film. Cohen [6] highlights the importance of temporal unity as a binding factor and a predictor of which parts of the sound will draw attention. She calls our attention to animation and the technique called mickey-mousing, whereby sound effects are replaced by short musical motifs. Their temporal matching allows these music snippets to replace the original sounds of the events, at the same time imbuing both the events and the objects taking part in the action with specific characteristics.

This account of musical meaning in film provides an equally useful tool for approaching the question of other film sounds as well. Applied to object sound, the theory suddenly appears much less mysterious: consider how we find out the properties of objects in real life. What we do is handle the object – tap it, stroke it, bang it against something. By perceiving synchronous sounds, we find out the normal sound of a chair, a balloon or a mandolin. Now, turn that process around and we have precisely what sounds do in a film (and, to a great extent, in games as well): now the temporal unity of event and sound defines the object through what sound it makes. Longer chains of events, if temporally matched, cause similar perceptions. When approached from this angle, it is not so odd that sounds in fiction may deviate somewhat from their real-life counterparts without seeming false or unrealistic. What may seem surprising is that a whole sum of temporally congruent sounds may become involved in the same process, from simple Foley through more elaborate layers of sound effects all the way to music.

3.5 Where Do the Emotions Come From?

So far we have established that unconscious emotional processes may 'contaminate' temporally congruent events through misattribution, and shown how this may influence the perception of events in film. The big question remains: where do the emotions come from?

The most frequently researched category of emotional sound is music. Huron [10] describes musical emotions in terms of fulfilling expectations: the interplay between anticipated and sounded musical progression creates patterns of dynamic tension and relaxation. Musical expectations arise from several sources, most of them cultural, but according to some studies at least part of the functions of musical meaning appear to be universal [19].

The existence of universal functions of musical emotion suggests there may be other sources of emotion humans draw from in their interpretations of music. The obvious case is that music invokes memories and connotations awakened by that music. However, those would not be universal, even less so than culturally learned expectations. Van Leeuwen [17] turns to the human body, suggesting that the most primitive, and also a common, link between sound and emotion for all humans is the perception of our own bodies. Especially the vocal system sets a reference point through the simultaneous experience of how it feels and what it requires to produce a certain sound.

Another suggestion is that the evaluation of some sounds has a biological motivation. This appears to be the case with the startle response (the phenomenon in which we jump when someone shouts 'boo' at us), which aside from providing for pranks also makes us more alert to danger and automatically directs our attention toward potentially harmful events. However, there may well be other ways in which our perception of sound is evolutionarily determined. For example, Huron [11] suggests that the perceived cuteness of sounds may be an evolutionary adaptation that promotes parenting.

Value judgements (which are the raw stuff of emotion) seem to be going on even at the lowest level of perception. A classical example within music (and other perceptual) research is the mere exposure effect, wherein a stimulus is judged as likeable merely as a function of familiarity. Investigations into a phenomenon called perceptual fluency suggest that emotional processes are influenced by the very ease of processing [23]. These findings would imply that such things as differences in the perceptual clarity (think signal-to-noise ratio) of audio influence the emotional impact of sound, such as perceived beauty or likeability.

The critical requirement for musical expectations to arise is that the sound is attended to as music. Further, there appear to be boundaries in our listening schemes that separate different styles of listening. Notably, musical listening, in which sounds are perceived as sounds, is not the only form of attending to events. An illustrative deviation from this frame are the listening styles provoked by compositional techniques such as musique concrète, where the use of real-world sounds provokes listening not at sounds and patterns, but for causes – Chion aptly refers to this as causal listening [4]. This is a special case of music, perhaps seldom used in film, but appearing more and more in games. In these cases, the framework of listening is perhaps more determined by evolutionary and low-level perceptual processes of meaning-making than by musical listening modes.

3.6 Realism Revisited

We should now attempt a new understanding of realism. Within fiction, realism is not an absolute, but a nominator for a certain level of fit, an apparent realism or credibility. On the narrative level, realism allows taking the story seriously enough to allow emotions of empathetic quality. Good fit is determined by whether the sound is credible (or illustrative) of a certain sound source [1, p. 190]. When the Foley artist (the person responsible for creating sounds to accompany on-screen events) smashes pumpkins in his studio, he does so in order to produce sounds with a good fit to on-screen events. Many times, the sounds produced have little to do with the
actual event seen on the film screen – indeed, often non-realistic sounds are purposefully used to make the action sound better. It is, for example, recognized that walking on cornstarch sounds much 'more real' on film than the actual sound of walking on snow.

A possible explanation underlying the perceived realism of some Foley sounds is the notion of prototypicality. A prototype is an object that embodies the central perceptual characteristics of a given category. Prototypes do not necessarily exist in reality; they are mental constructs of our perceptual system. The prototypical chair is the average of all the chair perceptions of your brain, and by definition it will be the 'chairest' chair of them all. Experimental psychology has established that people perceive prototypes as more easily recognized [27], and also as more beautiful and trustworthy [23], than other category members.

Similarly, narrative reality determines how sound behaves within the diegesis, and how source sounds should sound when listened to from different Points of Audition¹, such as listening from behind a wall or under water. At the core, then, immediacy too is but one way of creating a sense of realism. In film, it serves to reinforce the witness position: being present but out of control. Tan [27, p. 25] considers this in his analysis of the camera point of view, mentioning how even in first-person view the camera is often a bit off, making space for someone to 'look over the shoulder'. In the case of sound, the heightened, focussed sound, including over-clear dialogue, can be considered realistic if we view it as a portrayal not of the scene, but of the experience of listening to the scene. Consider this: while our environment usually contains a multitude of sounds, we attend to only a select few at a time. We are also exemplary at picking out and following these sounds. A person with normal hearing has no difficulty in following a single conversation in a room filled with people, the phenomenon so aptly named the 'cocktail-party' effect. Thus, what is presented as the film's sounds is not the scene itself from a given point in space, but the scene as heard by attentive ears.

¹ Similar to Point of View in camera techniques, but using sound.

The narrative realism of a sound thus lies neither in the faithful reproduction of sound sources, nor of their environments. The apparent realism of a sound in the context of narrative is defined by how representative the sound is of a certain event. Sounds that are highly representative have good narrative fit. High narrative fit supports empathetic emotion.

Importantly, however, the evaluation of sounds spans several layers. Below the level of narrative meaning are (partly) unconscious processes whereby sounds are judged emotionally. For example, a sound can have good fit narratively, but poor legibility, because the signal-to-noise ratio is poor. Importantly, breaches at this level are disruptive to the perception of sound. We have seen that perceptual fluency is also capable of influencing affective evaluations [23]. Thus, as with narrative fit, unconscious processing of sound influences emotional judgements. These affects are unrelated to the narrative content of the sound, but tap into the notion of artefact emotions. Depending on their nature, they can cause pleasure or displeasure, which can then be attributed to other temporally congruent events.

Finally, in interactive systems the interpretation of sound takes on a new role, conveying functional information [29]. In this task, sounds are evaluated on a third level, in how well they serve a functional value; Jørgensen [12, p. 49] refers to this as a sound's functional fidelity. At this stage, emotional evaluations are no longer determined only by the sound itself, but by the utility of a sound for the higher goal of performing goal-related actions. The value of functional sound depends on how well the functional aspect supports game progress – the utility of the sound. The utility of sound is connected to goal-related cognitive evaluations. High utility reinforces gameplay emotion.

4 Comparison of Emotional Sound in Film and Games

The cognitive appraisal framework provides two alternatives for an emotional relation to fictive events: the passive witness position allows empathetic emotion, while the active player draws emotional meaning from goal-related evaluation. These two frameworks ask the viewer/gamer to take on different attitudes towards the fiction and appear to be competing. They also rely on different strategies for sound design.

For film sound, the effort is usually on heightening narrative reality. This is achieved through detailed attention to narrative fit, often striving for high apparent reality. Focus is on advancing the narrative, heightening and clarifying those specific actions that are necessary for following the story's progress (usually the top priority is dialogue).

For games, the focus is different, as games have to support player action. In games, the task of many sounds is primarily to provide feedback about actions. Hence, narrative fit is often sacrificed for utility. To the extent that auditory cues are used to guide actions, they are treated with utmost respect for legibility. For example, even in the case of instructions with a diegetic source (a non-player character, voice mail, etc.) it is common that auditory instructions remain audible even if the character runs away from their diegetic source.

An interesting avenue for sound design in games is to shift focus from music to the emotional impact of Foley and sound effects. A possible alternative for emotionality in games lies in environmental sounds, already used in many games, where ambient sounds are beautifully merged with musically suggestive elements and event sounds into a sonic landscape in the spirit of musique concrète. However, for this approach to be systematically explored, a better understanding is needed of how everyday sounds influence emotions. In these investigations, theories of unconscious emotion may prove especially informative.

References

[1] Bordwell, D. and Thompson, K. 1985. Fundamental Aesthetics of Sound in the Cinema. In Weis, E. and Belton, J. (eds.) Film Sound: Theory and Practice. Columbia University Press. 181-199.
[2] Brandon, A. 2005. Audio for Games: Planning, Process and Production. New Riders.
[3] Bridgett, R. 2007. Audio Postmortem: Scarface: The World is Yours. Available at http://www.gamasutra.com/features/20070322/bridgett_pfv.htm [Accessed 29.5.2008]
[4] Chion, M. 1994. Audio-Vision: Sound on Screen. (Translated by Claudia Gorbman.) Columbia University Press.
[5] Cohen, A. 1990. Understanding Musical Soundtracks. Empirical Studies of the Arts 8, 111-124.
[6] Cohen, A. 2001. Music as a Source of Emotion in Film. In Juslin, P. and Sloboda, J. (eds.) Music and Emotion. Oxford University Press. 249-272.
[7] Damasio, A. 2005. Descartes' Error: Emotion, Reason, and the Human Brain. Penguin, paperback reprint (1994).
[8] Ekman, I. 2005. Understanding Sound Effects in Computer Games. In Proc. Digital Arts and Cultures 2005, Copenhagen, Denmark.
[9] Frijda, N. H. 1986. The Emotions. Cambridge University Press.
[10] Huron, D. 2007. Sweet Anticipation: Music and the Psychology of Expectation. MIT Press, paperback reprint (2006).
[11] Huron, D. 2005. The Plural Pleasures of Music. Proc. 2004 Music and Music Science Conference. Kungliga Musikhögskolan & KTH (Royal Institute of Technology), 1-13.
[12] Jørgensen, K. 2007. 'What are Those Grunts and Growls Over There?' Computer Game Audio and Player Action. Ph.D. dissertation, Copenhagen University.
[13] Kutay, S. 2006. Bigger Than Big: The Game Audio Explosion. A Guide to Great Game Sound. Available at http://www.gamedev.net/reference/articles/article2317.asp [Accessed 29.5.2008]
[14] Lacan, J. 1951. Some Reflections on the Ego. (Read by Lacan to the British Psycho-Analytical Society on 2 May 1951.) Available at: http://aejcpp.free.fr/lacan/1951-05-02.htm [Accessed 29.5.2008]
[15] Lankoski, P. 2007. Goals, Affects, and Empathy in Games. Paper presented at Philosophy of Computer Games, Reggio Emilia, Italy. Available at: http://www.mlab.uiah.fi/~plankosk/blog/?p=53 [Accessed 29.5.2008]
[16] Lazarus, R. 1991. Emotion and Adaptation. Oxford University Press.
[17] Leeuwen, T. van. 1999. Speech, Music, Sound. Macmillan.
[18] Marshall, S. and Cohen, A. 1988. Effects of Musical Soundtracks on Attitudes toward Animated Geometric Figures. Music Perception 6, 95-112.
[19] Narmour, E. 1990. The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. University of Chicago Press.
[20] Oatley, K. and Jenkins, J. 1996. Understanding Emotions. Blackwell Publishing.
[21] Perron, B. 2005. A Cognitive Psychological Approach to Gameplay Emotions. Proc. DiGRA 2005 Conference: Changing Views – Worlds in Play.
[22] Prince, B. 1996. Tricks and Techniques for Sound Effect Design. Computer Game Developers Conference 1996. Available at http://www.gamasutra.com/features/sound_and_music/081997/sound_effect.htm [Accessed August 20, 2008]
[23] Reber, R., Schwarz, N. and Winkielman, P. 2004. Processing Fluency and Aesthetic Pleasure: Is Beauty in the Perceiver's Processing Experience? Personality and Social Psychology Review 8 (4), 364-382.
[24] Russell, J. 2003. Core Affect and the Psychological Construction of Emotion. Psychological Review 110 (1), 145-172.
[25] Sanger, G. 2003. The Fat Man on Game Audio: Tasty Morsels of Sonic Goodness. New Riders.
[26] Schachter, S. and Singer, J. 1962. Cognitive, Social, and Physiological Determinants of Emotional State. Psychological Review 69, 379-399.
[27] Tan, E. 1994. Film-induced Affect as a Witness Emotion. Poetics 23, 7-32.
[28] Tan, E. and Frijda, N. 1999. Sentiment in Film Viewing. In Plantinga, C. and Smith, G. (eds.) Passionate Views: Film, Cognition, and Emotion. Johns Hopkins University Press. 48-64.
[29] Tuuri, K., Mustonen, M.-S. and Pirhonen, A. 2007. Same Sound – Different Meanings: A Novel Scheme for Modes of Listening. Proc. Audio Mostly 2007, Ilmenau, Germany.
[30] Winkielman, P. and Berridge, K. 2004. Unconscious Emotion. Current Directions in Psychological Science 13 (3), 120-123.
[31] Zajonc, R. B. 1980. Feeling and Thinking: Preferences Need No Inferences. American Psychologist 35, 151-175.
[32] Öhman, A. 2005. The Role of the Amygdala in Human Fear: Automatic Detection of Threat. Psychoneuroendocrinology 30, 953-958.
Interactive Sonification of Grid-based Games

Louise Valgerður Nickerson¹ and Thomas Hermann²

¹ Interaction, Media and Communication, Department of Computer Science,
Queen Mary, University of London, London, U.K.
lou@dcs.qmul.ac.uk

² Ambient Intelligence Group, Cognitive Interaction Technology · Excellence Center (CITEC),
Bielefeld University, Bielefeld, Germany
thermann@techfak.uni-bielefeld.de

Abstract. This paper presents novel designs for the sonification (auditory representation) of data from grid-based games such as Connect Four, Sudoku and others, motivated by the search for effective auditory representations that are useful for visually-impaired users as well as to support overviews in cases where the visual sense is already otherwise allocated. Grid-based games are ideal for developing sonification strategies since they offer the advantage of providing an excellent test environment to evaluate the designs by measuring details of the interaction, learning, performance of the users, etc. We present in detail two new playable sonification-based audio games, and finally discuss how the approaches might generalise to grid-based interactive exploration in general, e.g. for spreadsheet data.
1 Introduction

Sonification, the auditory representation of data, has become an important sensory channel for rapid data scanning, real-time monitoring and exploratory data analysis [6]. Particularly if the data is structured in time (e.g. time series, process data), sonification is a good choice for communicating the patterns via the auditory modality. However, a very frequent data type consists of two-dimensional grids or matrices of data. In fact, most data sets which are subject to analysis in data mining can be re-organised to take this form, using columns for features and rows for measurement vectors. Images are also naturally represented by a 2D grid of measured intensity values. Spreadsheets are another frequent example of grid-based data. Thus it makes sense to investigate how to make such data more accessible by using sonification, or how sound can be used effectively to deliver a concise overview or summary of the data.

However, the sort of overview needed depends highly on the task, and often different task-specific overviews are needed, ranging from overviews that give a rough idea of how a grid is filled to very specific overviews such as 'what cells form groups with a particular pattern', row-wise scans, diagonally aligned patterns, symmetries within the grid, etc.

Grid-based games are a special case of grid data in the sense that usually the grid dimension is fixed and only a limited number of possible elements of a finite set fill a grid cell. Examples of grid-based games are chess, Chinese checkers, Connect Four, noughts and crosses (or tic-tac-toe) and Sudoku, to name a few.

We develop our sonification approaches using grid-based games for the following reasons: (a) there is a very clear task for the players, yet (b) there is a sufficient variety of required overviews so that the task is not trivial, (c) the limited complexity facilitates the designs, and (d) the game itself provides a very useful test environment to evaluate all aspects of the design, from performance and learning to the æsthetics (acceptance, qualitative evaluation).

1.1 Sonifying grid data

When designing grid-based game sonifications, a decision has to be made whether the sonification shall be generic, in the sense that it is applicable to a wide class of games, or specific to a particular game. Generic approaches generalise better towards more wide-spread use, maybe even beyond the scope of grid-based games into tasks such as video data sonification; however, they may not deliver exactly the information that the players need to play the game, or may allow these patterns to be extracted only after longer training.

A mix of sonification techniques that offer both specific and general inspection seems suitable, and puts into the fore that the users will need control over what
sonification is to be selected and what parts of the grid are to be explored. Indeed, interaction plays an important role in inspecting grids, as can also be seen in the visual exploration of grid games, where eye movements, fixations and saccades are naturally used to serialise and access the information. In a similar fashion we believe that manual interaction is an important (if not the key) ingredient in creating successful designs. We present, for instance, a graphics-tablet-based sonification approach where proprioceptive information serves the intuitive understanding of the position in the grid, whereas sound conveys the information about the grid content, using the example of the 4×4 Sudoku in section 3. Different exploration strategies emerge from such an approach.

To couple interaction to plausible acoustic responses, we use ideas from Model-Based Sonification (MBS) [2, 4]. MBS describes how to use excitatory systems in order to create informative sound as the result of processes where the user's interaction puts energy into a data-driven sound-capable system. Even without creating a coherent sonification model, MBS might be helpful for creating designs that are more intuitively understood.

A key problem in grid-based game sonification is the missing persistence of the grid, as opposed to the persistent visual game board. To create a close analogy to the visual task of adding visual elements on a board, an auditory version can use a stationary sound pattern which is permanently played, allowing players to add sound elements accordingly. Conditions to win a game translate to corresponding auditory conditions within the sound. This analogy might open a window to the design of very interesting new audio games; however, we here keep the focus on grid-based games, and thereby translate the analogy into a rhythmical sonification strategy where, instead of a stationary sound, a repetitive sound pattern is created, which can be regarded as one bar in a repeating sonic loop. We develop this idea into a playable version of Connect Four in section 4.

We discuss our ideas via qualitative experiments with a limited set of users, since we are still within the design phase towards stable sonifications, and we close the paper with an outlook on our ongoing work.

2 Background

In the visual realm, space is used to make salient information of interest. In the case of grid-based games, it organises the items on the grid so that the players can easily make sense of the state of the game. This is also true of data that is visually displayed in tabular format: it makes correlations between two axes clear. We can use grid games to represent tasks that one might perform with grid-organised data, such as in a spreadsheet. Connect Four can represent looking for linear patterns in data while Sudoku can represent cross-correlating subsets of data.

2.1 Traditional methods

A grid-based representation that often gets tackled is the auditory representation of images. This is traditionally done via scanlines. Examples of this can be seen in representations of images where each pixel value is mapped to sound and played in order. More advanced techniques involve finding textures in the image to represent in sound. The difficulty with the scanline approach is the challenge of lining up the rows so that one can understand patterns that are orthogonal to the direction of the scanline. The pattern and auditory texture approach is much closer to what we try to accomplish here with our implementation of Connect Four.

Other pertinent work is research into the sonification of spreadsheet or tabular data. Stockman, Hind and Frauenberger [7] describe a system where visually-impaired users can navigate spreadsheet data by mapping numerical values to a range of pitches. The data is then played serially by row or column. This is meant for generic use; our approach is to look to the specific to inform the generic. Kildal and Brewster [5] describe a method of providing overviews of numerical data in tables by again mapping values to pitches. Here, rows and columns are presented concurrently, giving the user quick access to where the highest and lowest values are to be found. The idea of concurrency is one we apply to our implementation of Connect Four.

2.2 Connect Four versus Sudoku

One can generalise grids as M × N grids with a set of k potential token values. Connect Four is a 7 × 6 grid with three token types (one for each player as well as the 'empty cell' item). Sudoku is an n² × n² grid with n² + 1 tokens (n² tokens and the 'empty cell' item). The most common variant of Sudoku is where n = 3, i.e. the 9 × 9 grid.

There are several differences between the games and their grids. Connect Four is a two-player game while Sudoku is a single-player game. Another difference is that in order to win Connect Four, a pattern of four tokens in a line must be achieved, while in Sudoku the tokens must be uniformly distributed. In both games the grids get filled one move at a time. In Connect Four, at the end of each pair of turns there is an equal number of each token in the grid, while in Sudoku this condition is only properly achieved when the puzzle is completely solved.
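To make the generalisation of section 2.2 concrete, the following minimal sketch (our own illustration, not code from the paper; the class, field names and string token values are assumptions) models a grid game as an M × N array over a finite token set, instantiated for Connect Four and 4×4 Sudoku:

```python
from dataclasses import dataclass, field
from typing import List, Optional

EMPTY = None  # the 'empty cell' item shared by both games

@dataclass
class GridGame:
    """A generic M x N grid whose cells hold one of k possible token values."""
    rows: int
    cols: int
    tokens: List[str]                                  # the non-empty token values
    cells: List[List[Optional[str]]] = field(default=None)

    def __post_init__(self):
        if self.cells is None:
            self.cells = [[EMPTY] * self.cols for _ in range(self.rows)]

    def place(self, row: int, col: int, token: str) -> None:
        assert token in self.tokens
        self.cells[row][col] = token

# Connect Four: 6 rows x 7 columns, two player tokens plus 'empty'.
connect_four = GridGame(rows=6, cols=7, tokens=["player1", "player2"])

# 4x4 Sudoku: n = 2, so an n^2 x n^2 grid with n^2 values plus 'empty'.
sudoku_4x4 = GridGame(rows=4, cols=4, tokens=["1", "2", "3", "4"])
```

Framing both games over the same structure is what allows the same sonification machinery (probing cells, looping over columns) to be reused across them.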
Another key difference is that in Connect Four, when tokens are added to the grid, they are placed in the lowest unfilled cell in the selected column, while in Sudoku cells can be filled in any order. The playing of Connect Four is what drives the dominant features of the sonification described in section 4. The columns are primary, as their state is what informs the players where tokens may be placed, and the rows are secondary, as they describe the end position of the token placed in a particular column. Sudoku is less straightforward: it is the structure of the grid and the rules of the game that are important. Neither the rows, columns nor cages (see figure 1) are dominant; rather, they must be inter-compared so that a player can deduce the value that belongs in each cell.

3 Model-Based Sonification for Sudoku

Sudoku is a single-player game where a player must fill all the cells on the grid so that the values in each row, column and cage are unique. The most common form of Sudoku is a 9×9 grid, shown in figure 1. The grid is further subdivided into nine 3×3 sub-grids, called cages. Here, we implement an easier version: 4×4 Sudoku.

Figure 1: The Sudoku grid. The 9 × 9 Sudoku grid is made up of nine 3 × 3 sub-grids, called cages. Cage rows/columns are horizontal/vertical sets of cages.

3.1 Design and implementation

In the 4×4 grid, there are five possible values for each cell: the four tokens and 'empty'. Each grid has a certain number of cells with pre-filled-in values. In order to play the game, players need to cross-reference rows, columns and cages in order to deduce what values go into the empty cells. Key information for solving includes where the grid is dense/sparse and where all the items of the same value are present/missing in a row/column/cage/cage row/cage column. To enable flexible self-directed exploration – much as you would get from glancing – we use a graphics tablet for interaction (see figure 2). We also employ MBS to provide contextual information to the player. Sonification examples are provided at [1].

Figure 2: Playing Sudoku on the Wacom graphics tablet. Post-it notes defined a square play area.

3.1.1 Representation of the grid

The use of the graphics tablet means that we do not need to provide strong location information. As a result, the grid is not explicitly represented in sound except in the MBS that we use when a player probes a cell of the grid. We use a standard energy flow model, as introduced in [2], to describe the effect of each cell upon its neighbours:

\[ \frac{dE_{ij}}{dt} = -\lambda E_{ij} + q \cdot \sum_{(k,l) \in N(i,j)} \left( E_{kl} - E_{ij} \right) \qquad (1) \]

where q represents the energy flow rate between neighbouring cells and λ is the energy loss, or decay, of the energy. ij denotes the co-ordinates of the cell and N(i,j) is the set of all cells that neighbour ij.

Figure 3: The Sudoku flow model. When a cell is excited, the energy flows into the neighbouring cells as described in equation 1.

3.1.2 Representation of the cell values

The four values in the grid are represented by pitch. The pitches range evenly from MIDI note 64 to MIDI note 80. When a cell is empty there is a white noise sound that is modulated to sound like the wind. Our initial design had empty cells represented by the lowest pitch; however, only players with musical training found this comprehensible. We used SuperCollider3 for all sonifications here, using PlayBuf as the sample player and
standard techniques for panning and filtering of sound. More information will be provided on our website at [1].

3.2 Playing the game

Players use a stylus to explore the grid and enter values on the graphics tablet. There is also a graphical representation of the grid (see figure 4) which provides the limits of the cages. To probe the grid, players either tap or drag the stylus across the grid. When the stylus enters a cell, the cell is injected with energy (as described in equation 1) and the energy flows through the grid. To enter a value, players click a button on the stylus. Each click cycles the current value of the cell to the next value. If a cell holds a starting value, nothing happens.

Figure 4: The graphical interface for Sudoku. Players interact with the grid using the graphics tablet stylus.

3.3 Discussion

Five men and two women played the auditory version of 4×4 Sudoku, two of whom were musicians and one of whom was visually impaired. Their level of experience with playing Sudoku ranged from beginner to advanced.

3.3.1 Player feedback

Feedback for Model-Based Sudoku was varied. We presume that this is partially due to it being a single-player puzzle game (a two-player game, on the other hand, engages the players' competitiveness and allows them to learn from one another). The general consensus was that while 4 × 4 Sudoku is quite simple visually, the auditory version was quite challenging and the smaller version was approximately the right level of difficulty. Here are some of the more specific findings:

First try: First attempts were often frustrating, sometimes resulting in starting over. This indicates that different initial solving techniques are needed: mapping out grid density rather than the location of similar items. Second games were much smoother.

Tapping vs dragging: We assumed that the majority of interaction would be by dragging. However, the majority of players (Players 2, 4, 5 and 7) preferred to tap the cells to excite the grid. Player 2 explained this by saying that the sounds made by the model made this tapping interaction more intuitive. Another explanation is that players were tapping in order to compare only two values at a time.

Draggers: For the players who dragged more than they tapped, the stylus and tablet interaction allowed them to quickly scan a row, column or cage by drawing lines or circles in the grid. These players appeared to be the fastest at completing puzzles. We anticipate that this is because a quick scan allowed players to quickly determine which tone was missing or whether there were two tones of the same value in the row/column/cage.

Panning: One surprise was that both Players 2 and 5 (both tappers rather than draggers) found that the panning was not helpful and in fact was distracting and made it harder to compare values. This possibly indicates that the use of the graphics tablet sufficiently localises a player and the additional cues are unnecessary.

Based on the two different ways of interacting with the grid (dragging vs tapping), we expect that a better-fitting model will need to be devised to make it more natural for the dragging technique to be used. The faster interaction allows the patterns that occur to be absorbed more quickly. With a more intuitive model, players can more naturally take advantage of the way we process audio.

3.3.2 Informing grid sonifications

Much as direct manipulation and the introduction of the mouse revolutionised graphical user interfaces, the use of the tablet enables the user to a greater degree than keyboard interaction. The tablet interaction contributed more to the success of the 4 × 4 Sudoku than the use of Model-Based Sonification. With a grid sufficiently small that the number of values is not overwhelming, stylus interaction gives the user the flexibility to explore the grid as they desire and provides a speed that is difficult to mimic with traditional keyboard or 5-way navigation (such as on a mobile phone or a game controller). It also neutralises the problem of strongly localising a user in the grid through sound.

3.4 Scaling up to 9×9 Sudoku

The 4 × 4 implementation of Sudoku does not scale up well to 9 × 9 Sudoku. The main problem is that there
are simply too many values to remember. We require a new model that lends itself better to the larger grid. The main concepts here are generating models that can support the cross-hatching technique – where players cross-correlate values in rows, columns and cages to deduce values – and also categorise and order the values used in the grid. The problems that occur in the 9 × 9 Sudoku grid inform us how to alter the model that was used for sonifying the 4 × 4 grid.

It is clear that we need to develop specialised overviews and filters to allow users to focus on different parts of the grid. The key information is about what is present or missing and about picking out items of similar values. What this implies is that players must be able to apply certain filters in combination as they interact with the grid, or prompt overviews to be played. However, it is also important not to lose the advantages of the direct interaction provided by the graphics tablet. For example, were the player interested in an overview of a row, tapping to the left or right of that row could play an overview of the row where tokens are played in a predefined order, using the graphical interface and panning to reinforce their positions. To query where a particular token is present, players could select the token from a list and use gestures in each cage to determine if it is present. Another filter could be used in combination with the token filter to show where the token is missing. Finally, a filter that only displays where the empty cells are could highlight where the grid is dense or sparse. To solve the problem of the large number of values to be entered, sounds that can be vocalised can be used. This enables the player to self-organise the tokens, and these can then also be used as input, removing the necessity to make mappings between tokens and their graphical representation. Vocal sonifications have been successfully used in the sonification of EEG data [3].

Given the complexity of the sonification and interaction needed, we have tabled our work on Sudoku for the time being and are focusing on the second game we implemented: Connect Four.

4 Rhythmic Connect Four

Connect Four, a Milton Bradley game, is a two-player game on a 6×7 grid where each player tries to line up four game tokens while blocking the other player from doing the same. The traditional Connect Four grid is shown in figure 5. The auditory version of this game is based on adding sound events to a rhythmic pattern. Sound examples are provided on-line at [1].

4.1 Design and implementation

The important features of Connect Four are the columns and the locations of tokens, especially where there are several of the same value in a line. Knowing what is around a token is therefore very important, as well as being able to focus on each token individually.

4.1.1 Representation of the grid

We represent the grid in a short looping sound so that players can think about the entire grid and understand where tokens are in relation to one another. The aim is to provide all the information quickly enough that the players can reason about it as a whole, with the distinct parts making up a pattern that they can work with. The end result is that the grid is like a short bar of music. We then punctuate this bar of music with two drum sounds to help players localise themselves within each loop. A stronger (or louder) drum plays at the start of the grid and the softer (or quieter) one occurs at the fifth column of the grid. Our initial design did not include the second drum; however, it was quickly apparent that when the grid is sparse, it did not have the energy or liveliness for which we were aiming, nor was the localisation strong enough. This aim also drove the rate of our auditory display. We experimented with grid lengths of 0.7 to 3.5 seconds. Less than a second was found to be quite manic and over two seconds a bit too slow. Our preferred length was 1.4 seconds with a 0.2 second pause between loops, coming to a total of 1.6 seconds.

Figure 5: The sonification of the Connect Four grid. The row determines the pitch of a token and the column drives when the token plays. Two drum sounds punctuate the auditory grid at columns 1 (louder drum) and 5 (softer drum). There is a short pause – the length of a column – at the end of the grid.
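As a rough illustration of the loop just described (a sketch only, under our own assumptions; the event-list format, helper names and pitch values are illustrative and not the authors' SuperCollider3 implementation), the grid state can be rendered as a schedule of sound events: one 1.4-second bar plus a 0.2-second gap, with columns mapped to onset times, rows mapped to pitch, and the two localisation drums at columns 1 and 5:

```python
# Sketch: render one Connect Four grid state as a list of timed sound events.
# grid[row][col] holds "P1", "P2" or None; row 0 is the bottom row.

LOOP_LEN = 1.4                            # seconds of sounding grid
GAP = 0.2                                 # silent pause, the length of one column
COLS, ROWS = 7, 6
COL_DUR = LOOP_LEN / COLS                 # 0.2 s per column
ROW_PITCH = [52, 54, 57, 59, 62, 67]      # illustrative low-to-high MIDI pitches

def grid_to_events(grid):
    """Return (onset_seconds, kind, midi_pitch_or_None) tuples for one loop."""
    events = [(0.0, "drum_loud", None),            # marks column 1 / loop start
              (4 * COL_DUR, "drum_soft", None)]    # marks column 5
    for col in range(COLS):
        onset = col * COL_DUR                      # the column drives *when* a token plays
        for row in range(ROWS):
            token = grid[row][col]
            if token is not None:
                events.append((onset, token, ROW_PITCH[row]))  # the row drives pitch
    return sorted(events)

# One loop iteration: play the events, wait out the 0.2 s gap, then repeat with
# the (possibly updated) grid, giving a total cycle of 1.6 seconds.
```

Repeating this schedule every 1.6 seconds is what stands in for the persistence of a visual board.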
4.1.2 Representation of the columns

The columns in the grid are evenly spaced in the auditory grid loop. The values in each column are presented concurrently. The pitch of each value is determined by the row, where the bottom row is mapped to a low pitch and the top row is mapped to a high pitch. The pitch intervals used can greatly affect the æsthetics of the auditory display. We experimented with several intervals, such as 'neutral' or 'jazzy'. While the neutral pitch interval allows for the greatest separation of the notes (MIDI pitches 52, 56, 59, 61, 64, 68)¹, the jazzy interval (MIDI pitches 52, 54, 57, 59, 62, 67) was the most engaging and least irritating after many repetitions.

¹ using the SuperCollider3 .midicps method

4.1.3 Representation of the tokens

Each token was represented by an instrument. We used a vibraphone and an electric bass in our implementation. The pitch interval for each instrument is modulated to fit the instrument better. These two instruments sound very different and their envelopes are diverse, making them easier to tell apart. Additionally, each token plays in a different speaker. These differences allow the players to pay attention to each token alone.

Additionally, we use brilliance to indicate when there are several tokens in a row. Minimum brilliance corresponds to a player's token all on its own, and maximum brilliance is applied to four of a player's tokens in a line (game over). If the game is won, the winning combination has maximum brilliance while the brilliance of all other tokens is set to the minimum. This use of brilliance highlights tokens that have the potential to win the game and gives a clear indication when the game is over.

4.2 Playing the game

The interface for playing Connect Four was graphical (see figure 6). Each player has a slider allowing them to drop their tokens into the grid. The value of the slider represents the columns in the grid. Each player also has a button; until this button is pressed, the player's move is not committed. This allows each player to move their tokens and hear the effect of their move before making a final decision for the turn.

4.3 Discussion

We performed an informal evaluation with five different players: four men and one woman, two of whom were musicians and one of whom was visually impaired. Their ages ranged from the late twenties to the mid fifties.

Figure 6: The graphical interface for the Connect Four game. Each player uses their slider to select columns and the button to enter values in the auditory grid.

4.3.1 Player feedback

The general feedback from the informal evaluation of the game, which used the graphical interface shown in figure 6, was quite positive. Here are some of the most frequently mentioned topics:

Playability: Most players felt that with a little practice, the game would be quite playable.

Engaging rhythms: Players found the sounds æsthetically pleasing. Players as well as bystanders would find themselves moving along with the beat.

Losing the beginning of the grid: Players would often get confused about where the grid began in the sonification. The first token placed in the grid seemed to take the players' focus away from the drum beat.

Graphical interface: Several players used the position of their opponent's slider after their move to figure out where the last token was placed. Player 2 commented that they felt this was a cheat.

Masking: The higher-pitched tokens overpower the lower ones, especially in the case of the vibraphone. We hypothesised that this may be partly because the players are not taking advantage of listening to a single player's tokens by listening to one speaker at a time.

Playing patterns: Some players tended to experiment with several moves before committing, while others selected a column right away. It is unclear what drives this behaviour and whether it correlates with successful game play.

Interaction: The slider did not always make it clear when the player moved from one column into the next. This was addressed by adding column delimiters to the interface, as shown in figure 8.

4.4 Informing grid sonifications

Our rhythmic Connect Four contains several pieces of design knowledge that can be applied to other sonifications of grid data.
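The brilliance mapping of section 4.1.3 is essentially a pattern filter over the grid, and it is the kind of simple filter that the following design-knowledge discussion generalises. The sketch below is our own illustration (function names and the 0–1 brilliance scale are assumptions, not the authors' code): it finds the longest line a token takes part in and maps that run length to a brilliance value.

```python
# Sketch: map each token's longest line (horizontal, vertical or diagonal)
# to a brilliance value in [0, 1]; lone token = minimum, four in a line = maximum.

ROWS, COLS = 6, 7
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]    # row/col steps to scan along

def longest_run(grid, row, col):
    """Length of the longest same-token line through grid[row][col] (cell must be occupied)."""
    token = grid[row][col]
    best = 1
    for dr, dc in DIRECTIONS:
        length = 1
        for sign in (1, -1):                      # walk both ways along the line
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < ROWS and 0 <= c < COLS and grid[r][c] == token:
                length += 1
                r, c = r + sign * dr, c + sign * dc
        best = max(best, length)
    return best

def brilliance(grid, row, col):
    """Minimum brilliance for a lone token, maximum for four in a line."""
    run = min(longest_run(grid, row, col), 4)
    return (run - 1) / 3.0                        # 1 -> 0.0, 4 -> 1.0

def is_won(grid, row, col):
    return longest_run(grid, row, col) >= 4
```

When a win is detected, the winning cells keep maximum brilliance while every other token is dropped to the minimum, as described above.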
Looping through the grid: The technique of making the grid into an auditory loop – be it column-wise as we have done or otherwise – shows promise for providing a grid overview. This is similar to other work in auditory overviews [5], but instead of a column being sonified at the request of the user, it is repeated to continuously remind the user of the state of the grid. We believe this to be a technique that can help overcome the lack of persistence in the auditory channel. Here, we have implemented this technique and players of the Connect Four game found it useful and engaging.

What remains to be tested are the limitations of this technique. Our sonification was limited to seven columns with a maximum of six values to represent, while most data sets encompass many more than that. It remains to be seen whether the technique is dependent on the number of columns presented or on the duration of the sonification. We envisage this technique being extended to comparing data sets as well, provided an overview of each data set could be presented as we have presented columns here.

Identification of a pattern: Connect Four has a clear pattern that determines if a player has won: four tokens of the same kind in a line, be it in rows, columns or diagonals. We use brilliance here to highlight where this pattern occurs in the data and where partial patterns occur. This technique allows us essentially to apply a filter to the data. In our case here, we had a very simple pattern to match. We envisage that this can be extended to many different patterns, with the potential for several patterns to be applied in turn to show different aspects of the data.

4.5 Formally evaluating Rhythmic Connect Four

Due to the positive user feedback on Rhythmic Connect Four, we are currently taking this work forward and have just completed a formal evaluation of the interface with some minor changes. In this study 7 pairs of players each played 3 games and were interviewed about their experience. We focused on how they used the audio to inform their playing strategies. The results of this evaluation will be reported in further publications; however, we describe preliminary results here.

To address the naturalness of the interaction and to focus more on the auditory aspects of the game, the interface was moved to the graphics tablet. This allows two players to sit opposite each other (see figure 7) and divorces the sonification from any visual representation. The two main differences that resulted from this change were that (1) players were not aware that the column selection area was a slider and (2) after selecting a column, the other player could not see where their opponent had played. The graphics tablet layout is shown in figure 8. This allowed players more freedom in their interaction and also pushed them to rely more on the auditory feedback rather than looking at where their opponent placed a token throughout their move.

Figure 7: Two people playing Connect Four. The players trade off the stylus and use areas on the tablet to play in a column, as shown in figure 8.

Figure 8: The interface for the Connect Four game on the graphics tablet. The interface is inverted, allowing players to sit opposite each other as shown in figure 7.

The training we performed as part of the formal evaluation addressed several of the problems we noted earlier. One such problem was losing the beginning of the grid, or sound loop. We trained players to listen for the louder drum beat and found no indication that this was a problem during the evaluation.

Another problem was the vibraphone instrument overpowering the electric bass instrument. We addressed this by boosting the volume of the electric bass so it was not so easily overpowered, and by training players to listen to a single player's tokens at a time (as each player has their own speaker). A problem reported in this second evaluation was that the higher-pitched tones overpower the lower ones. We will look
at this issue further as we complete our full analysis.

5 Conclusion

In this paper, we have presented some new approaches for the sonification of grid-based games. These grid games represent use cases of data displayed on a grid, allowing us to develop techniques that can be transferred to related applications, such as real-time video stream sonification, spreadsheet sonification for visually-impaired users, or the generalisation to 3D grids. These are attractive follow-up steps on our research agenda towards a better exploitation of sonic interactions for grid-structured data types.

We introduce an interactive sonification of 4×4 Sudoku grids using direct interaction with a graphics tablet. The Sudoku grid can inform how we might sonify sets of data and how they cross-correlate. The sonification design was straightforward, following the Model-Based Sonification idea that data parametrises acoustic systems, and that movement on the grid excites these systems. Thereby the sounds indicate quite directly what state a grid cell is in. Interestingly, users quickly start to develop strategies to explore the 4×4 grids which we had not anticipated beforehand, such as drawing circles in cages, doing quick line-scans, or tapping on cells. Due to the limited complexity of the grid, this direct interaction is suited to allowing users to solve the Sudoku. However, scaling the problem to the 9×9 Sudoku fails for two reasons: the user's memory is exceeded by the many items, and proprioception is not accurate enough to understand exactly what cell is being inspected. To better solve the 9×9 Sudoku, more specific sonification designs possibly need to be developed.

The Connect Four game represents grid data where linear patterns occur. A rhythmic sonification approach was developed for the game, which can now successfully be played with the visual display playing a very minor role. It exemplifies an auditory display in good analogy to visual games where the board is persistent for both players – here the persistence is created by a looped sonic pattern which serialises the grid column-wise. First comments from players are promising; however, we need to conduct user studies in order to investigate the potential of learning to better understand the grid set-up.

We are confident that by focusing on grid-based games we will be in a very good position to evaluate sonification designs and to compare the effectiveness of different designs. These games thus represent an ideal platform to examine sonic interactions. We hope to make the games attractive so that players enjoy playing them and generate valuable data for us voluntarily.

Our next steps in this work are to complete our analysis of the formal evaluation of Rhythmic Connect Four and to integrate some of the findings from Sudoku to strengthen its implementation. This will combine a whole spectrum of grid inspection – direct cell-based interaction, localised region overviews and overall summaries – into a coherent interactive sonification system. This demands that we structure the sonifications so that the information obtained via the different approaches can easily be fused into an increasingly accurate mental image of the grid.

Acknowledgements

We would like to thank the COST IC0601 Action on Sonic Interaction Design (SID) for sponsoring this work and allowing collaboration between the Ambient Intelligence Group at Bielefeld University and the Interaction, Media and Communication Group at Queen Mary, University of London. Thanks are also extended to all the members of those groups, as well as those of the Centre for Digital Music (QMUL), who helped evaluate this work.

References

[1] Thomas Hermann. Online sonification examples. http://sonification.de/publications
[2] Thomas Hermann. Sonification for Exploratory Data Analysis. PhD thesis, Bielefeld University, Bielefeld, Germany, 2002.
[3] Thomas Hermann, Gerold Baier, Ulrich Stephani, and Helge Ritter. Vocal sonification of pathologic EEG features. In Proceedings of the International Conference on Auditory Display (ICAD), 2006.
[4] Thomas Hermann and Helge Ritter. Listen to your data: Model-based sonification for data analysis. In Advances in Intelligent Computing and Multimedia Systems, pages 189-194, August 1999.
[5] Johan Kildal and Stephen A. Brewster. Providing a size-independent overview of non-visual tables. In Proceedings of the 12th International Conference on Auditory Display (ICAD), June 2006.
[6] Gregory Kramer. An introduction to auditory display. In Auditory Display. Addison-Wesley, 1994.
[7] Tony Stockman. Interactive sonification of spreadsheets. In Proceedings of the International Conference on Auditory Display (ICAD), 2005.
Using audio aids to augment games to be playable for blind people

David C. Moffat and David Carr


The eMotion-Lab
School of Engineering and Computing
Glasgow Caledonian University
Scotland, UK

email: David.C.Moffat@gmail.com

Abstract. One potentially important use of audio technology is to make video games for blind people. Visual impairment makes
almost all current video games totally inaccessible. We augmented a simple shooting game with audio aids to provide information
about the direction and distance of moving targets, and tested it to see whether people could play the game without seeing the screen.
Players managed to play the game quite well in the blind condition, and in some cases the audio aids improved their performance
when they could see the screen. The conclusion is drawn that audio aids can make mainstream games accessible for the blind.

1 Introduction

Blind and partially sighted people are excluded from playing nearly all computer games. This is not often noticed or thought remarkable, because until recently games have been strongly associated with a younger generation of enthusiasts. But games are becoming a mainstream means of entertainment and sociability, and it is no longer acceptable that any significant sector of the population be excluded.

Especially when one considers that the elderly form a large proportion of all people with visual impairments, it seems clear that some efforts should be made to cater to this and other disabilities.

Special games could be made for such people, but it would be preferable to augment normal games in such a way that the blind can play them as well. This is challenging, but if feasible it would give them access to many more games, and potentially allow them to play along with normally sighted people, too.

In the study reported here, we explore the possibility of making normal games playable for visually impaired people, by using stereo audio output from a game to supplement or replace the usual graphical output through a screen.

It is worth noting that making games playable for the blind in this way would have benefits for other people as well. For example, a game that does not need a visual display for output is a game that might be played while walking along, using only a PDA or mobile phone with stereo headphones. More serious mobile applications may also benefit, by using the same techniques that can be developed for audio games.

1.1 Problems for blind players

Visual impairments can make a game inaccessible in different ways. Bierre et al. [1] list the following categories of problem.

1.1.1 Only visual feedback to commands is given

When a player tells the game to do an action, the only output that indicates what the player did is visual. Without such feedback the game is unlearnable for the blind. The game menu system is also inaccessible without audio feedback to tell what the menu options are and which option is currently selected.

Using synthetic speech output, textual output can be spoken to the players, and this technology (of so-called "screen-readers") is obviously important to the visually impaired who use computers in any way. It can be used to augment the visual feedback with audio feedback in games.

1.1.2 The lack of any audio direction or location

The lack of any audio indication of direction and orientation, in a 3D game for example, results in blind players being unable to tell where or what they are facing in a game.

Objects may be announced with speech output or special sounds to represent them (so-called "earcons", by analogy with symbolic icons for a visual interface). For example, an enemy monster could emit a characteristic noise; but without an indication of direction, or distance information, the player cannot take aim with any normal weapon and shoot the monster.

1.1.3 Game hints or story elements are often only visual

Vital game or story information may be presented as objects on the screen, or as cut scenes and graphical animations. On-screen text might not even be readable by a screen-reader.

This problem is particularly severe for adventure games, where a lot of information is displayed using text only, making it impossible for blind players to follow the game's story.

1.1.4 A lack of a "no 3D graphics" mode

Visually impaired people do not buy graphics cards for general computing purposes, since they have no use for them. But new video games require the latest graphics cards to work, and typically do not provide an option to run the game without a graphics card. Therefore the games will usually not even execute on blind people's computers.

1.1.5 Game documentation in text form only

With the game manual only in text format, both as a booklet and as a text file on the game's DVD, visually impaired gamers are unable to learn how to operate and play some video games without the help of someone who can read the documentation out loud to them. This means that every time visually impaired people wish to consult the game's documentation, they will need someone with them. With the storage space available to developers in the form of DVDs and so on, game developers could easily provide the documentation in audio form; or they could provide it in a textual form that would be easily navigable with the aid of a screen-reader.

1.2 Games for the blind

Some of the problems identified above can be solved with speech synthesis output from the game, and for some problems a screen-reader might be enough.

In the examples of games for the visually impaired listed below, synthetic speech output often occurs as an audio aid. There are other forms of audio aid that have been tried as well, with some degree of success. By the phrase "audio aid" is meant any type of audio output that is intended to help visually impaired players to play the game.

1.2.1 Special games
Some developers make games especially for the visually impaired. These are not intended to be played by normal (sighted) people. They are audio games, without a graphical aspect, designed so that blind people can play them well.
One developer that makes such games is GMA Games [2]. They have made the game Lone Wolf, for example, which is a submarine warfare game in which the player captains a submarine on battle missions. The submarine uses speech output to report from its onboard sonar system, to tell the player when enemy submarines are within firing range. The technicalities of aiming torpedoes can be assumed to be left to the crew and to automation. It is a clever idea to choose a game world in which visual impairment would not be a disability. The technique of using in-game automatic devices for tasks like aiming is used in other accessible games, too.
In another game of theirs, Shades of Doom, which is based on the original Doom FPS (first-person shooter) game, the player has a special "night scope" device that signals with earcons (special sounds) that an enemy is within range [3]. Firing the weapon then automatically hits the target. While this game was based on a game for normal players, the game-play has been changed by this manner of shooting, to accommodate the blind. To change the game-play for an FPS game in this way is to make it a special game that is quite different from the original.

In these special audio games, it can be seen how good game-play can be brought to the visually impaired. It is inclusive to bring them into the world of games; but it is still somewhat exclusive in that they are then playing specially made games that normal people would not generally choose to play.
For the visually impaired sector of the population to depend on special games means that they would never have the same range of games to choose from, either. The market is smaller than the market for normal games; and the game developers will often be normally sighted people, who would need special training to understand the kinds of problems that visually impaired gamers face. From earlier points made, it is apparent that normally sighted developers suffer from blind spots of their own, in so often assuming that all players can see as well as they can.
To leave special game development to visually impaired developers, on the other hand, would make for much more accessible games, but even fewer of them. It would mean a minority of the game-playing population being served by a minority of the game-making population, which would make the games doubly expensive as well as relatively few in number.

For those reasons, but especially for social inclusion, to let the visually impaired play with normally sighted people, it is good to attempt to modify normal games with extra audio output so that the visually impaired can play them, too.

1.2.2 Games augmented with symbolic audio (earcons)
An example of a normal FPS game that has been modified to be playable by blind people is AudioQuake, which was developed by the AGRIP project [4]. It is free, being based on the code-base for the original Quake FPS game by id Software [5]. The intention of the AGRIP project is to modify mainstream games for visually impaired players, and to allow them to modify the games further if they wish.
The player in AudioQuake has a RADAR device that beeps to signify danger such as enemy monsters. The beeps sound different for friends, for enemies, and for other objects. They are doubled to enable them to carry more information: the double beeps are faster when the target is closer, and the second beep is higher in pitch when the target is higher than the player.
Compared with the speech output in the games mentioned above, these beeping earcons allow information to be conveyed to the player more quickly, which is important for an FPS game. By giving the player more detailed information about the objects in the 3D world, AudioQuake gives players the opportunity to use their own spatial reasoning to make decisions. It still provides some automatic aiming, however, in that weapons "lock onto" targets that are within a certain angle from straight ahead. The experience it provides with the RADAR device is reduced in another way, too, in that only the nearest friend or enemy is detected. To have too many earcons representing several other objects at the same time was judged to be too confusing.
The game AudioQuake is quite close to providing an authentic experience for blind players, but again the aiming part of the game's play is different from that of the original Quake. The earcons used are also symbolic sounds, rather than naturalistic ones. Real enemies and friends do not beep faster when they approach you, nor beep higher when they are above you. A more naturalistic way to convey that kind of location information would be to use the human capacity for stereo hearing.

Another free game that makes good use of audio aids for blind players is Top Speed 2, by Playing in the Dark [6]. It is a car racing game, in which the sounds made by other cars passing on the right or left play louder in one or other speaker. The stereo perception of audio is also used to guide the player around the track, by providing audio feedback for position. The centre of the track emits a sound, so that the player hears it louder on one side when the car moves off the central "racing" line. Rather than use sounds for walls to warn the player away, as in AudioQuake above, the road-centre sound is used to keep the player in the middle of the track, even as it turns.
Another clever trick is to provide the driver with a co-driver, as in rally races. This means that the game can give a lot of information to the driver about the road ahead, as well as timing performance, in quite a natural way, using speech synthesis. By this means the game provides a fairly authentic experience without needing the blind player to play a special game. The modifications to the driving game are small enough to allow normal players and blind players to play together.

Another computer game for normal players which has been modified to be accessible to people with various disabilities, including visual, is AccessInvaders [7], which is a variant of the classic arcade game Space Invaders.
When set to be played by a visually impaired player, the game is simplified to have only one column of enemies, and they cannot kill the player. Instead, the player loses only if an enemy reaches the ground. The aliens emit a "sonar-like" sound, and it is spatialised so that its position can be inferred. In this last point, AccessInvaders uses spatial sound as the game Top Speed 2 does, allowing the player to use both ears in a natural way.
In simplifying the game for the blind, however, it becomes a special game again, quite different in game-play to the normal version of the game. Since the designers want to encourage blind people to play games with sighted people, they have another way to make this possible. They introduce a handicapping system, in which some players get a simplified game with reduced experience, to make the game accessible to them; and the scoring and teamplay still allow them to play alongside normally sighted people.

1.2.3 Games augmented with naturalistic audio
The above games all make different compromises to give blind players something like an authentic game-play experience, without simplifying the game too much, and allowing some degree of joint play with normally sighted players. They all use audio aids to make this possible, and may use weapons that can aim automatically or do other things to assist the player in ways that are not always realistic.
The games use earcons to signify events in the world, but these tend to be symbolic rather than naturalistic. They use stereo hearing in some cases, but typically in a one-dimensional setting, like Space Invaders or driving, where the player only has to go left or right. When the game requires more complex manoeuvring in 2D or 3D space, as in the shooting games, the audio aids tend to revert to carrying information symbolically, or with speech.
It is a challenge to support a blind player in a shooting game with naturalistic audio aids that convey information in the way that sounds do in real life: not with synthetic, coded beeps, but with natural properties of sounds, like volume and stereo-detected direction.

The study reported below attempted to meet that challenge. A traditional shooter game is augmented with audio to give the player location information for objects (targets) in a naturalistic way, to see how well blind people could play it. The game is not simplified for them in any way.

2 Audio-aided Asteroids
Asteroids is a classic arcade game in which a spaceship moves over a 2D area, rotating left and right (clockwise or anti-clockwise) and using its jets to go forwards (thrust). The ship must shoot at and destroy the asteroids moving through space around it (see Figure 1), and if the asteroids break into pieces, then the pieces must be destroyed also. When an asteroid goes off the screen, it instantly "warps" to appear at the opposite edge. If any asteroid collides with the ship, the player loses a life and starts with a new ship in the centre of the screen. The aim is to destroy the asteroids in the shortest time.
AudioAsteroids is the version of the game developed for this study. It is set to be fairly easy to play, even for beginners, so that the effects of audio aids can be seen on a range of players. However, it does have two levels, and players must complete both levels in order to finish the game. In the first level (L1) the asteroids wait to be shot at; in the second level (L2) they are moving across the screen, and so are harder to hit.

2.1 Symbolic sounds for different objects
The main feature of AudioAsteroids is that each asteroid emits a characteristic sound. The sounds used were a cello string being plucked about twice per second, and the rattle of a small cardboard box of paperclips. The manipulations of these sounds were intended to convey information about the location of the asteroids to the player.
These sounds are clearly not naturalistic, because real asteroids do not sound like cello strings. They are symbolic sounds – "earcons" – whose meaning can only be understood once it has been explained. This is a necessary compromise, because objects in space in fact make no noise at all. Other space games and science fiction films make the same compromise when they have loud explosions and other sound effects in space. Sound in space is physically impossible, because there is no air for sound waves to travel through; but it is used in films because, like background music, it gives an "emotional atmosphere."

[Figure 1: The Asteroids game]

To convey location information, however, AudioAsteroids does use more naturalistic properties of sound. As shown below, the game tries to solve the main problems that blind people have in playing normal games.

2.2 Speech for the menu system
The first problem that visually impaired players encounter is starting the game and setting options, because they cannot use the visual menu system.
The audio aid that solves this part of the problem simply uses a synthetic speech output tool that speaks the options available at each stage. This aid was tested more thoroughly in another game, which described scenes by reading out the screen text and gave quite complicated series of changing options. It was an adventure game, and the menu audio aid did enable people to play the game without looking at the screen; but we focus attention here on the other audio aids, for location.
Because the ship in AudioAsteroids can rotate and move through space, players need to know the relative direction of asteroids, and how far away they are. The two audio aids for these are the directional aid and the distance aid.

2.3 The directional aid
Because we have two ears, we can estimate the horizontal angle of a sound source quite accurately, depending on circumstances. When the source is almost directly ahead, we can tell if it moves through an angle of as little as one degree of arc ([8] cited in [9]). This is mainly because of the difference in time of arrival of the sound at the two ears, and the consequent phase difference. For higher frequency sounds, there is also a difference in volume at the two ears, because one ear will be partly shielded from the sound source by the head.
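As a rough illustration of the scale of these interaural differences (our own estimate, not a figure from the study): using the common spherical-head approximation ITD ≈ (r/c)(θ + sin θ), with head radius r ≈ 0.09 m and speed of sound c ≈ 343 m/s, a source 90 degrees to one side arrives about 0.67 ms earlier at the nearer ear, while a one-degree shift near straight ahead changes the arrival-time difference by only about 10 microseconds – roughly the smallest difference the auditory system can exploit, which is consistent with the one-degree localisation acuity cited above.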
The auditory cortex of the brain uses these clues to inform us of the location of the sound source. There are other clues available, also. For example, the relative height of frontal sources can be estimated if their sounds are high in pitch, because of the way that the shell of the ear reflects certain high frequencies back into the ear canal. The shape of the ears also helps us to determine whether the sound source is in front or not, and sources from the rear will sound a little "muffled."
The way the ears register sounds, and the brain decodes the signal, is highly complex and not fully understood. However, using some of the main properties outlined above, it is possible to manipulate sounds in ways that make it appear to a listener with stereo headphones that the source is moving from left to right (or "panning"), and even going behind the head.

2.3.1 Sound from behind the head
Each sound that is to be used to represent an asteroid is processed to make a second version that sounds as if it comes from the rear. This was done with a software tool called Maven3D [10], which includes algorithms to simulate the "muffling" referred to above, making sounds appear to come from behind the head.
There are therefore two sound files prepared for each asteroid: the original one for when the asteroid is towards the front of the ship (ahead of the ship's left-right axis), and the muffled version for when the asteroid is somewhere behind the ship.

2.3.2 Sound panning
Whether in front of or behind the ship, further processing is needed to feed the sound differently to each ear through the headphones. The ear which is slightly further away from the source is given a slightly delayed signal, which is thus phase-shifted, and the signal's volume is lowered. This processing is called "panning," and can be done in real time by a software library. The library chosen for this study was the Dark GDK [11], which is available as a plug-in for the Visual C++ development environment that is popular with game developers, and is now also included in Visual Studio Express.
Using the library function dbSetSoundPan is simple: the first argument is the sound to be played, and the second represents the left-to-right angle from which it should appear to come.
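The sketch below illustrates how the two prepared sound files and the pan value might be updated each frame. It is only an illustration of the technique described above, not the actual AudioAsteroids code: the function and parameter names are invented for the example, a DirectSound-style pan range of -10000 (hard left) to +10000 (hard right) is assumed, and the Dark GDK calls other than dbSetSoundPan are given from memory.

#include "DarkGDK.h"
#include <cmath>

// Illustrative only: switch between the original (front) and muffled (rear)
// sample for one asteroid, and pan it left/right according to its bearing.
void UpdateDirectionalAid(int frontSound, int rearSound,
                          float shipX, float shipY, float shipHeading,
                          float asteroidX, float asteroidY)
{
    // Bearing of the asteroid relative to the ship's nose, in radians.
    float bearing = std::atan2(asteroidY - shipY, asteroidX - shipX) - shipHeading;

    // Ahead of the ship's left-right axis, or behind it?
    bool inFront = std::cos(bearing) >= 0.0f;
    int active   = inFront ? frontSound : rearSound;
    int silenced = inFront ? rearSound  : frontSound;

    if (dbSoundPlaying(silenced)) dbStopSound(silenced);
    if (!dbSoundPlaying(active))  dbLoopSound(active);

    // The left-right component of the bearing drives the pan.
    int pan = (int)(std::sin(bearing) * 10000.0f);
    dbSetSoundPan(active, pan);
}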
Combining the two techniques of muffling and panning, asteroids are given a direction using only sound. The player also needs to get an idea of how far away from the ship the asteroid is.

2.4 The distance aid
Where some games use earconic or symbolic methods to signify the distance of an object, including some mentioned above, our intention was to use naturalistic attributes of sounds to carry information where possible. There are many attributes that can play a role in humans' estimation of distance from a source, including the reverberation of the local environment and Doppler-shift effects coupled with knowledge of the source; but AudioAsteroids uses only loudness.

2.4.1 Naturalistic sound at a distance
The most important way we tell the distance of an object auditorily is from the volume of the sound. In fact we need to be familiar with the type of sound to do this, so that we can estimate how loud the sound is at source, and thereby estimate the distance from its apparent loudness at the ear. Because of the variability of sound loudness at source, and for other reasons too, humans do not estimate distance very reliably from sound alone.
The asteroids in AudioAsteroids are set to sound proportionately louder the nearer they are, by a simple linear function. The precompiled sound for the asteroid is unchanged, at maximum volume (100%), to signify its closest distance (of zero). As the asteroid moves away, its volume falls off linearly with distance until it reaches its minimum volume when the asteroid is the furthest away it can be, which is the width of the screen.

2.4.2 Modification of the sound-distance scale
To set the relative proportion of loudest to weakest sounds for the asteroids, a simple pilot test was done to see how big the difference could be allowed to grow. Clearly the proportion should be as large as feasible, to make the player's distance estimation more accurate. If there were no difference then all asteroids would sound the same distance away; but if the proportion is too large, then only the nearest objects are heard. The minimum volume is set at 60% of the maximum volume, which was found to be a good level to allow an asteroid to still be heard when it is far away, and not get drowned out by much nearer asteroids.
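Expressed as code, the loudness mapping just described amounts to a clamped linear interpolation between the 60% floor and the 100% maximum. The sketch below is a hypothetical illustration (the function and parameter names are ours), with the result intended for a volume call such as Dark GDK's dbSetSoundVolume, which takes a value from 0 to 100.

// Illustrative distance aid: 100% volume at distance zero, falling linearly
// to a 60% floor at the maximum possible distance (the screen width).
int DistanceToVolume(float distance, float screenWidth)
{
    const float MAX_VOLUME = 100.0f;
    const float MIN_VOLUME = 60.0f;   // floor chosen in the pilot test

    float d = distance / screenWidth; // 0.0 = closest, 1.0 = furthest away
    if (d < 0.0f) d = 0.0f;
    if (d > 1.0f) d = 1.0f;

    return (int)(MAX_VOLUME - d * (MAX_VOLUME - MIN_VOLUME));
}

// e.g. dbSetSoundVolume(asteroidSound, DistanceToVolume(dist, screenWidth));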
No tests were done to confirm that players could accurately estimate the distance of asteroids, because we did not expect that they would be able to. The purpose of the distance audio aid in the game-play is to tell the player which asteroid is closest, and when it is so close that it presents a danger to the spaceship.

2.5 Audio cross-hairs
During pilot tests, it became apparent that players would be almost unable to hit moving targets. They would generally aim to where the target was, but it would move on in the time it took the bullet to travel to the target's position. This trouble could have been alleviated by making the weapon a laser that would hit the target instantly, but that would be to change the game. Instead, some intelligence was built into the ship's weapon to assist blind players by predicting the future position of the target. The direction aid indicates this predicted location to the player, rather than where the target currently is.
A visual analogue to these "audio cross-hairs" is the auto-targeting function found in military fighter jets, where the pilot sees cross-hairs over the target when in fact the aim of the gun is ahead of the target, at the point it will have reached when the bullets arrive there. Including this audio cross-hairs feature for visually impaired players does change the game-play somewhat; but it is not unrealistic, and normally sighted players would have access to the same auto-predictor device if they listened to the audio aids as well.
The reason it is so difficult to aim at a moving target without auto-prediction is probably that distance estimation is so inaccurate in human hearing. That estimation is crucial to the calculation of where to aim, so that the moving bullets will collide with the moving target.
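One simple way to build such a predictor – a sketch of the general technique rather than the implementation actually used in AudioAsteroids – is to estimate the bullet's flight time to the target's current position and assume the target keeps its velocity for that long:

#include <cmath>

struct Vec2 { float x, y; };

// Illustrative "audio cross-hairs": return the point the directional aid
// should indicate, i.e. where the target is expected to be when a bullet
// fired now would reach it.
Vec2 PredictAimPoint(Vec2 shipPos, Vec2 targetPos, Vec2 targetVel, float bulletSpeed)
{
    float dx = targetPos.x - shipPos.x;
    float dy = targetPos.y - shipPos.y;
    float distance = std::sqrt(dx * dx + dy * dy);

    // Rough time for the bullet to cover the current distance.
    float t = distance / bulletSpeed;

    // Lead the target by its velocity over that time.
    Vec2 aim = { targetPos.x + targetVel.x * t,
                 targetPos.y + targetVel.y * t };
    return aim;
}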
With the audio aids in place, the game could be evaluated to see how well they help players to play it in a blind condition, when they cannot see the screen.

3 Blind playability of AudioAsteroids
The AudioAsteroids game was tested on four players of different skill levels. The players were normally sighted, so that their performance could be compared in the blind and normal conditions. The weakest player was a novice, and so the difficulty of the game was set to a low level to allow all the players to nearly complete it. Their relative performances would be measured by the time taken to complete it.

3.1 Participants – Players
The participants in the experiment (the players; the Ps) were of varying levels of skill. Three were aged between 20 and 23, and the oldest was 50 years old; she was the novice gamer, who had little experience of playing computer games of any type.
Three players had experience of computer games, two of them more particularly of Asteroids, and one player had some experience of the audio aids used in AudioAsteroids, having played a part in its development.

The relative skill levels of the players are shown in Table 1, in which a plus (+) sign indicates that a player has the relevant experience.

Table 1: The participants' levels of experience

                             P1   P2   P3   P4
  Experience of games         +    +    +    -
  Experience of Asteroids     +    +    -    -
  Experience of Audio Aids    +    -    -    -

3.2 Procedure
The game was played on a laptop computer, with headphones attached for the sound output. The spaceship is controlled with an Xbox 360 wireless joypad.
The game was started for the players, who first played it five times with only the visual output on screen. They then played it five times with audio on, but not looking at the screen (no visuals); and finally they played five times again, with both visuals and audio.
The difficulty of the game was set to be easy enough to allow most players to complete the game without losing a life. They would differ mainly in time taken. The hope was that the game would spread the players, not defeat them.
To play the game once, it was necessary to play through two levels, in which L1 starts with stationary asteroids, but in L2 the asteroids are moving targets. In either case, a minimum of 7 shots is needed to destroy all the asteroids and their fragments. Performance was noted by recording two figures for each play of the game: the number of bullets fired, and the time taken.

3.3 Results
All players completed all levels, destroying all asteroids, and any fragments resulting from asteroids the first time they were shot.
Every player found it difficult to reach 100% accuracy, and they nearly always wasted some shots, firing more than 7 times in a level. Only one player lost a life (this was P3 in the audio-only condition, which is the hardest), so we ignore lives lost.
The players differed in their completion times, both because of skill level and because of the difficulties of playing "blind."
Those results are the aim of the experiment, but first we validate the experimental setup with the results of all players in the normal condition, with Visuals only, as in the traditional version of the game, which is normal Asteroids.

3.3.1 Confirmation of player skill levels (Visuals)
The game was fairly easy for all players to complete without losing lives; but there were significant differences in shots wasted and in time taken to complete.
We first verify that the players' reported skill levels are consistent with their results, and that individual players perform quite consistently within each condition.
Table 2 shows the results for all players and all games in the first condition, with only the visuals on. P1's third game is signified by the row "1.3", and so on. The mean scores for each player are shown, with their standard deviations. The scores are the number of shots (S) fired to complete both levels in the game, and the time (T) taken to complete them, in seconds.
By inspection it is clear that P1 and P2 are the best players of the game, followed by P3 and lastly P4. This is just what would be expected from the players' relative skills as suggested by their experiences shown in Table 1. Each player is also consistent in performance, as shown by the standard deviations. The least consistency is shown by the weakest player, again as would be expected.
To confirm these inspections with statistical analysis, t-tests were drawn between players to see if the scores were significantly different. They were. The (unpaired, two-tailed) t-tests between P1 and P2 yielded probabilities of 0.01 for shots and 0.66 for time, indicating that their times were not significantly different, but that their shots surely were. It appears that P2 is the stronger player, wasting significantly fewer shots.

Table 2: Confirmation of player skill levels

  Visuals      S      T
  1.1         26     64
  1.2         24     54
  1.3         25     49
  1.4         26     63
  1.5         24     54
  P1 mean     25     56.8
  stdev        1     6.46
  t-test      0.01   0.66

  2.1         23     61
  2.2         19     58
  2.3         14     67
  2.4         22     64
  2.5         18     45
  P2 mean     19.2   59
  stdev       3.56   8.51
  t-test      0.00   0.01

  3.1         30     80
  3.2         25     61
  3.3         32     78
  3.4         32     91
  3.5         30     96
  P3 mean     29.8   81.2
  stdev       2.86   13.55
  t-test      0.08   0.03

  4.1         47    121
  4.2         59    133
  4.3         27     65
  4.4         37    133
  4.5         27    104
  P4 mean     39.4  111.2

The other t-tests shown in Table 2 (comparing P2 against P3, and P3 against P4) were again unpaired, but they were one-tailed. This is because we expect that P2 should get higher scores than P3, and P3 higher than P4, due to the Ps' differences in experience.
The t-tests confirm that P2 (an Asteroids player) is more skilful than P3 (a general gamer, but new to Asteroids), based on shots and on time. P3 does not waste significantly fewer shots than P4 (the novice) does, but is significantly quicker.
This validation, and the consistency of the players' relative skill levels in the Visuals-only condition, which is comparable to classic Asteroids, means that the further results can be relied upon, and the significance of any variation in results can be assessed by comparison with the variation for a player within each condition.

3.3.2 Development of performance measure
To help in further performance comparisons between players, the S and T scores (for shots and time) are combined into a summary performance measure, P, which measures the divergence from an optimal or extremely strong performance.

The P measure is larger for weaker players, who waste more shots and waste time. P is smaller, close to zero, for stronger performances that waste few shots and complete in near record time. P is calculated by the following formula:

  P = ½ · 3 · (S − 14) + ½ · (T − 34)

The lowest value that S can possibly have is 14, over both levels of a game, so S − 14 represents the number of shots wasted in a game. The fastest times that any player achieved in any level of any game total 34 seconds for both levels, which is therefore a fair indication of the optimum time that can be reached by a player who completes both levels. By inspection, the average firing rate of all players was about 3 seconds per shot, meaning that the number of wasted shots should be tripled to compare their cost with wasted seconds. Finally these two "waste factors" are combined with equal weighting (of ½ each) to arrive at a performance score.
Using this notion of performance, players with different styles – who fire more accurately but take longer, for example – can be meaningfully compared.
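As a worked check against Table 3: P1's mean Visuals scores of S = 25 and T = 57 give P = ½ · 3 · (25 − 14) + ½ · (57 − 34) = 16.5 + 11.5 = 28, which is the value shown in the table.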
the blind (Sounds only) condition, it is remarkable that all of
The mean scores including performance (P) for each player in them, including the possibly overloaded novice player P4,
all conditions are shown in Table 3. manage to complete the game. Only one life is lost in all these
games, and that is by P3, not P4.
The neutralisation of the more skilful players' experience
Visuals M S T P suggests that they are having to learn new strategies. The game
P1 25 57 28 in the blind condition is almost like a new game to them. It was
P2 19 59 20 probably too much to hope that blind players would be able to
P3 30 81 47
play against sighted ones on an equal footing. But even if the
P4 39 111 77
mean 28 77 strategies or skills required for the blind version of the game are
different, the fact remains that blind players can play with
Sounds M sighted ones, albeit at novice level.
P1 34 125 75
P2 36 168 101 3.3.4 The audio aids do not impair performance (Both)
P3 48 146 107 In the final Visuals+AudioAids condition (BOTH in Table 3),
P4 67 384 255 performance is comparable to Visuals alone.
mean 46 206 Three players even improve: P1 improves from 28 to 20 (with
p = .06 for unpaired, two-tailed t-test), by shooting more
BOTH M
P1 19 58 20
accurately, P2 and P3 appear to improve slightly, but not
P2 20 50 17 significantly.
P3 24 79 38 The novice player P4 gets worse (no less accurate, but slower)
P4 38 140 89 in the final condition, compared with the Visuals only one. As
mean 25 82 noted above, this is consistent with the hypothesis of mental
tiredness due to the greater cognitive load the task has put her
Table 3: Ps' mean scores in each condition under.

According to the summary P scores in the Visuals condition, it 3.3.5 Moving targets are much harder, when blind
appears the P2 is indeed the strongest player of classic The performance of players at each level can be calculated, and
Asteroids. The unpaired, two-tailed t-test gives a barely compared, to give a measure of how much performance suffers
significant probability of this of 0.08 however, so there is not when the targets are moving.
much in it. Unpaired one-tailed t-tests confirm that P2 is much Figure 1 shows the performance for each player and condition in
stronger than P3 (with p=.0006); and P3 is stronger than P4 L2 over performance in L1. When this measure is about 1, there
(p = .04). is no difference. As the graph shows, however, having the
asteroids move through space causes difficulties for the players.
3.3.3 Performance in the blind condition (Sounds) The most severe difficulties are caused for P1 and P2, who are
Referring to Table 3, it is clear that all players perform much the players most experienced in Asteroids. This suggests that
worse in the blind condition ("Sounds,"with only the audio aids they have learned strategies for shooting moving targets that are
to guide them). no longer applicable with only the audio aids for information.
Player P1's performance falls to 75, which is comparable to the P3's performance cost for moving targets, on the other hand,
novice performance in the Visuals condition (P4 in the top of the does not suffer in the blind condition, suggesting that this
table). Therefore, at least in this case, a skilful but blind player experienced gamer, new to Asteroids, has no particular shooting
can match a normally sighted novice at a shooting game, when strategies to lose.
provided with helpful audio aids.

[Figure 2: Performance in L2 compared to L1 – bigger numbers mean performance falls more. The bar chart plots the performance cost of moving targets (scale 0.0 to 4.0) for each player (P1–P4) in the Visuals, Sounds and BOTH conditions.]

Again, the suggestion is that the audio aids provide a gaming environment that permits blind people to play, but that they will have to develop strategies quite different from the ones that work well for normally sighted people. Some comments from the players were to the effect that playing AudioAsteroids was like learning the game from the beginning again.
The information presented auditorily is not exactly the same as the visually presented game state. There is one small advantage for the audio aids, in that they relate asteroid positions to the ship's orientation; whereas, when viewing the screen, players must continually look between the target and the ship in order to decide precisely when to shoot. But the audio aids suffer from bigger disadvantages than that.
One particular difficulty is when asteroids warp from one side of the screen to the opposite side, when they move over the edge. Graphically, these warp events can be seen and predicted easily; but there is no audio feedback for them for blind players.
The audio aids could be confusing in other ways, too. When asteroids break up, for example, the remaining pieces generate their own, different, sounds because they are new targets. There could be up to four different sounds playing simultaneously.
By observation during play, it was noticed that unsighted players tended to shoot at the nearest asteroids. This could be because they generate the loudest noises; but also because they present a greater danger; or because they are easier to hit.

3.4 Limitations of the study
One obvious limitation of the present study is that it only used four participants. They repeated their plays of the game several times each, but it still means that whole sectors of the gaming population, with different levels of experience, are represented by only one participant. As a result of the small number of participants, some of the conclusions drawn have had to be tentative.
A more serious limitation is the way the experiment was designed, with all plays of the conditions occurring in blocks. It allows the possibility of a learning effect, in which the players are more skilled in the last block (of both visuals and audio aids) because they have experienced the game in earlier blocks. If the plays had been interleaved, in order to control for any learning, then some of the conclusions suggested earlier could have been stated more strongly.
Another point is whether the results would extrapolate from unsighted to blind players. The participants were "unsighted" rather than "blind" because during the study they could not see the screen, yet were not actually blind. It is a further question whether really blind people, such as those blind from birth, could also play the game. It might be that they would lack the gaming history, and not well understand the context of shooting asteroids, for example.

4 Conclusions and further work
The purpose of the study was to see if unsighted people could play a shooting game, where the location of world objects is crucial, with only naturalistic audio aids to guide them.
With certain compromises, including the addition of audio cross-hairs to predict the right direction to aim, and non-naturalistic sounds for asteroids which would be silent in space, it was shown to be feasible for unsighted people to play the game. In other respects than the audio aids, the game was the same as the classic Asteroids game, and normally sighted people can play it with its game-play unchanged. This demonstrates that even quite dynamic games could be instrumented to enable blind people to play them, without much compromise.
More tentatively, it was possible to suggest the performance that blind players could achieve with the audio aids. It appears that they can learn to play well enough to play against normally sighted players at novice level. This is significant because it means that blind people are not socially excluded from the increasingly important world of gaming, nor sidelined into some small gaming world of their own, playing only against other blind people.
Consequently, game developers may feel encouraged to add audio aid technology to their games in order to make them authentically playable for the visually impaired. Just as websites are increasingly being designed to make them more accessible to the visually impaired, the same developments may occur with at least some genres of computer game. While it is more difficult to make games accessible, at least it seems feasible.
Note that the audio aids used in AudioAsteroids were a first attempt, and further work could find better versions of them. More attributes of sounds could be used, and we have little doubt that player performance could be improved from these initial levels – the only question is by how much.
As well as helping the visually impaired, the use of audio aids such as in this game offers interesting possibilities for sighted people who just happen to be temporarily without a visual display. The ever-increasing use of mobile phones and similar technology, while walking about town for example, will create a need for audio information rather than visual output.
It would be interesting to see whether sighted players could learn to use the audio aids while viewing the screen. Would they return to their normal strategy after having adapted to the blind condition; or would they be able to combine strategies and modalities, and become better players than they were before? The bandwidth for audio presentation of real-time information should also be investigated, and compared with the quantities of information that can be visually perceived in the same time.
Finally, there are possibilities for future work that analyses the changes in strategy encouraged or demanded by the audio version of the game. When unsighted, the players definitely
seemed to play different strategies. Analysis of collected video
clips would be one way to pursue this question.

Acknowledgements

We thank Brian McDonald, David Farrell, Scott Gannon, and members of the School's Games Interest Group for their helpful comments on this work.

References
[1] Bierre, K., Chetwynd, J., Ellis, B., Hinn, D. M., Ludi, S. &
Westin, T. "Game Not Over: Accessibility Issues in Video
Games." In Proc. of the 3rd International Conference on
Universal Access in Human-Computer Interaction,
Lawrence Erlbaum (2005).
[2] GMA Games. http://www.gmagames.com/
Accessed on 4th June, 2008.
[3] Peterson, B. Guide to Shades of Doom by GMA Games.
http://www.audiogames.net/pics/upload/shadesofdoom.doc
Accessed on 4th June, 2008.
[4] The AGRIP Project. http://www.agrip.org.uk/about/
Accessed on 4th June, 2008.
[5] id Software. Quake.
http://www.idsoftware.com/games/quake/quake/
Accessed on 4th June, 2008.
[6] Playing in the Dark. Top Speed 2. A free computer game at
http://www.playinginthedark.net/
Accessed on 4th June, 2008.
[7] Grammenos, D., Savidis, A., Georgalis, Y., & Stephanidis,
C. (2006). "Access Invaders: Developing a Universally
Accessible Action Game." In K. Miesenberger, J. Klaus, W.
Zagler, & A. Karshmer (Eds.), Computers Helping People
with Special Needs, Proceedings of the 10th International
Conference, ICCHP 2006, Linz, Austria, 12 – 14 July (pp.
388–395). Berlin Heidelberg, Germany: Springer.
[8] Middlebrooks, J.C. & Green, D. M., "Sound localization by
human listeners," Annual Review of Psychology 42, pp.
135-159 (1991).
[9] Lu, Y.-C., Cooke, M. & Christensen, H. "Active binaural
distance estimation for dynamic sources." Proceedings of
InterSpeech-2007, Antwerp, Belgium (2007).
[10] Maven3D. Software package for 3D audio editing.
http://www.venturaes.com/emersys/index.html
Accessed on 4th June, 2008.
[11] Dark GDK. Software game development kit.
http://gdk.thegamecreators.com/ and
http://www.microsoft.com/express/samples/GameCreators/
Accessed on 4th June, 2008.

Beowulf field test paper

Mats Liljedahl Nigel Papworth


mats.liljedahl@tii.se nigel.papworth@tii.se

The Interactive Institute, Sonic Studio, Acusticum 4, SE-941 28 Piteå, Sweden

Abstract. A practical field test covering some of the parameters governing audio-based games designed for mobile applications, utilizing new techniques with the intention of allowing greater interpretive freedom for the player. The tests are realised through a simple audio-based game application: 'Beowulf'.

1. Introduction
Audio may represent a substantial percentage of our total perception, but it has been chronically under-represented in the development of modern communication devices and applications, even when these devices logically have audio as their prime medium. Graphics and visual interface solutions continue to dominate both the media's attention and the lion's share of development resources. This seems especially strange in relation to the explosion of technical development in the mobile phone industry. Despite audio-based information being the principal drive and 'raison d'être' of all mobile phones, the vast majority of technical innovation to date is designed to emulate the personal computer and is focused on improving the display and the graphic quality.

Another perceived problem with this focus on graphics is the way in which it limits the user. Large resources are continually being put into the creation of more and more realistic game graphics. This is true across the board, whether we are talking about game consoles, home computers or more compact devices. Like so much in modern media, this trend has its origins in film; Hollywood, especially, has a tendency to saturate its more populist offerings (Marvel Comics based films, Indiana Jones and especially the Star Wars franchise, for example) with faster and more complex action, showing 'bigger, better, faster' content and graphics. This visual overkill, utilizing for the most part photorealistic 3D environments in real time, presents a real danger to the user experience.
In the early days of cinema, there just weren't the technical resources and budgets to create the vast array of fantasy creatures, monsters and overwhelming, catastrophic visions we see in modern films, so the early film makers were forced to be creative with other means: monsters were shown through shadows, sound suggested storms, light created infernos, etc. The advantage of this was that it allowed the audience great freedom to invest their own imagination in the experience and, as we have argued in an earlier paper [3], this actually improves audience satisfaction. Beowulf was designed to take a huge step away from these 'total-information' experiences and, once again, give the user the freedom to invest heavily in their own experience of the game.

So, we wished both to challenge and explore the under-use of audio on mobile platforms and to strive for a realistic, but open, experience for the user. Two hypotheses were formulated:
1) An audio-based game for mobile devices has potential as an alternative to the current graphics-based applications.
2) A computer gaming experience based on audio rather than graphics to convey the game's world will make it less dictatorial and allow freedom and space for interpretation.

To test our hypotheses the project described here was formulated. It draws on two trends in modern technology. The first is the growing sophistication of mobile, multimedia-capable devices. The second is the use of ambiguity and multiple possible interpretations as resources for design [1][2].
In addition to this, the project draws from earlier work in the field of audio-based games for the visually impaired [4][5][6]. It should be noted, however, that the project described here is, in contrast, geared towards a normally sighted audience.
In the project, the audio-based game application "Beowulf" serves as a tool for research into the relatively broad spectrum of questions arising from the hypotheses. Ultimately the goal of the project is to gather indications on these two questions:
• Are audio-based games, delivered on mobile platforms, interesting to a broad audience? Historically, these kinds of applications have been mainly targeted at the visually impaired.
• Can a game experience, interpreted and disambiguated through a virtual environment and communicated via a rich soundscape, present the player with a new and unique game experience?

In 2006 a first, smaller test was made with an early prototype of the 'Beowulf' game. This was in order to get an initial proof of concept for the application and some early feedback on our basic ideas. The ideas behind the Beowulf application and its design are described in [3]. The result of this first test was then used in forming the second step of the project and revising the design of the software and gameplay described here.

The focus of the project is firmly on the design process and the user experience. The questions it is designed to raise and address are intended to be broad and general rather than narrow and specific.
In this paper we describe the test, the design of the computer game application itself, and the paper-based questionnaire used in the field test. It is from these elements, and the experience of monitoring the tests in person, that we have drawn the lessons learned and the results and conclusions.

2. Lessons Learned From Test One
The first test, conducted in 2006, was used as a pilot study. Based on the experiences from this study, the design of the game application was revised.


2.1. Lessons Learned – Sound Design

2.1.1. Health status
Beowulf's current health status must somehow be communicated to the player by audible means. It was decided to use four different breathing sounds for this. A light, normal breathing means Beowulf has 100% health; a very heavy, wheezing breath means he is almost dying. We initially tested letting 100% health be represented by no breathing sound at all. This turned out to be confusing for the players. Initial tests gave a clear indication that suddenly presenting the breath sound, seemingly from nowhere, once Beowulf had lost enough health for the first breathing sound to be activated, was very confusing for the players. Several beta testers did not know how to interpret the breathing sound when it suddenly appeared: is it a creature breathing, or me myself? The solution to this confusion was to have the breathing sound present from the start and let it dynamically reflect Beowulf's health status.
The current health status is also dynamically reflected in the sound of the sword swing. The more health Beowulf loses, the slower the sword swing. When Beowulf is nearly dying, you even hear the sword clang against the floor and Beowulf groan from the effort of swinging the sword. In addition, the length of the sound file is used to determine how often Beowulf can swing his sword – he can only re-swing the sword if the previous swing sound has played to the end. Good health has a short sound file, bad health has a long sound file.
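A minimal sketch of this gating logic is given below. It is an illustration only, not the actual Beowulf implementation: the health thresholds, sound identifiers and audio helper functions are assumptions made for the example.

enum Breath { BREATH_LIGHT, BREATH_NORMAL, BREATH_HEAVY, BREATH_WHEEZING };

// Hypothetical audio helpers standing in for whatever engine Beowulf uses.
bool IsSoundPlaying(int soundId);
void PlaySound(int soundId);
int  SwingSoundFor(int healthPercent);   // assumed to return a longer sample for lower health

static int currentSwingSound = -1;

// Choose one of the four breathing loops from the current health
// (the 75/50/25 thresholds are illustrative, not from the paper).
Breath BreathSoundFor(int healthPercent)
{
    if (healthPercent > 75) return BREATH_LIGHT;      // light, normal breathing
    if (healthPercent > 50) return BREATH_NORMAL;
    if (healthPercent > 25) return BREATH_HEAVY;
    return BREATH_WHEEZING;                           // almost dying
}

// A swing is only allowed once the previous swing sample has finished;
// since low health uses a longer sample, a weakened Beowulf swings less often.
bool TrySwingSword(int healthPercent)
{
    if (currentSwingSound >= 0 && IsSoundPlaying(currentSwingSound))
        return false;

    currentSwingSound = SwingSoundFor(healthPercent);
    PlaySound(currentSwingSound);
    return true;
}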
2.1.2. Room ambiences
Several of the ambient sounds of the caves were modified in order to make each cave more unique. The reason for this was more to facilitate a sub-conscious memory of the various places in the game world than to increase the effectiveness of navigation.

2.1.3. Consistency and uniqueness
From the evaluation of the first test we learned the importance of two sound attributes, namely what can be called 'consistency' and 'uniqueness'.
For most of the sounds we use several slightly different sound files in order to create variation and avoid hearing fatigue due to repetition. We learned how extremely important it is to keep the variations as consistent as possible when it comes to volume, reverberation, equalisation, etc. The player must be able to perceive all the variations in the same way, and be able to experience them as coming from the same source. In particular, we had to adjust the volume of several of the variations of creature sounds.
The Beowulf soundscape is filled with sounds, and it becomes vital to make it possible for the player to discriminate between both different categories of sounds and individual sounds. The uniqueness of the various sounds becomes crucial. We had to rework several of the sounds of attacking creatures to make them intuitive and obvious. In some cases it was enough to increase the volume or decrease the amount of reverberation to make the sound appear to emanate from a position closer to Beowulf. In other cases we had to totally redesign the sound to achieve the desired quality.
Another aspect of this is the ability of the player to discriminate between what can be called "first person sounds" and "third person sounds". In the revised version used in the second test we added several "first person sounds", i.e. sounds produced by Beowulf. The breath sound is one example. In the initial beta tests it turned out that some players misinterpreted the breathing sound and thought it emanated from a creature or other non-player character. We had to rework the breathing sound in order to move it closer to Beowulf, to make it appear as if it came from the player's own body.

2.1.4. Music
Initially the idea was to use only sound effects in the Beowulf game. The game-play situations added in the revised version gave the game two possible endings: success or failure. Failure was signalled by the death sound of Beowulf; success was signalled by the death sound of Grendel. Early on it turned out that it was not obvious to the players which ending was which. There was an obvious need to signal more clearly to the player in what way the game had ended. In the end we opted for the addition of two pieces of music: an archetypal "funeral march" for failure and a likewise archetypal "hero's theme" for success. This gave implicit but clear enough information about the outcome.
These pieces of 'end' music gave a slightly unbalanced feel to the total game experience, so it was decided to also add an introductory piece of music to serve as a framing overture.

2.2. Lessons Learned – Graphic Design
There are inevitable limitations and problems with audio-based games. One of the biggest problems with the first version of the game was the difficulty of signalling and clarifying when the player walked into a wall. In the game, when 'Beowulf' hits a wall, a "bump" sound is played.
Several early testers had a clear problem with efficient navigation in the cave environments when the wall 'bump' sound was their only indicator of an obstacle directly in front of them. The repetitive attempts to circumvent this obstruction led to the 'bump' sound repeating a large number of times, which the players found annoying and detrimental to the overall game experience. It was therefore decided to give a visual clue as to where the walls in the caves are located.
In the revised version, when Beowulf enters a cave, the whole cave is highlighted in grey. In this way the player can easily determine where the walls are. The tunnels are only one unit wide and the audio signal is, therefore, far more easily interpreted.

[Figure 1. Enlightened cave]

Note that the tunnel openings in the caves are not visible. To determine where the openings are, the player must either try to walk beyond the edge of the cave, risking just bumping into the wall, or better still listen carefully to find the openings.

For the test described here, where the playing time was set to a maximum of 10 minutes, it was also decided to preview the whole map of the game world, including caves and tunnels, for five seconds during the music overture.

3. The Test
The over-all questions of the project, described in the introduction, were broken down into a set of more concrete sub-parts in order to achieve a level of measurability. These partially overlapping questions are:
• Does an audio-based game make sense to a first-time user?
• Is an audio-based game a suitable vehicle to deliver game content on a mobile platform?
• Does the level of pre-game information colour the player's experience?
• Can an audio-based game give a satisfactory game experience?
• Does a non-visual game world present the player with navigational problems or other hindrances to a working game experience?
• Can a game emulate visual combat situations using only sound?
• Can a player correctly interpret visual sounds without the aid of visual reference?
• Does a sound-based environment encourage a player to contribute more to the game experience?

Here follow more detailed descriptions of the Beowulf game application, the questionnaire and the implementation of the test.

3.1. The Game Application
'Beowulf' is an audio-based game designed to present an absolute minimum of graphical content and as much audio stimulation as present technology and the game definitions will allow.
The game itself is based on a heroic, epic poem probably written sometime between 700 and 1000 A.D. This is sometimes referred to as England's national epic, being the earliest recognised example of the English language in existence.
In the Beowulf game, we have lifted out one episode of the poem and translated this into a simple gameplay scenario; the game terrain is a series of caves in which the poem's main adversary, the monster Grendel, is to be found. These caves are described to the player through a pure audio landscape created from realistic sound sources.
The poem narrates how the Scandinavian hero Beowulf defeats Grendel in his lair. This is the player's task in the game.
The caves in the game contain other creatures including wolverines, snakes, bats and rats. These will attack Beowulf and injure or kill him if he fails to swing his sword.
In the game scenario, a gust of wind blows out our hero's torch as he enters the first cave, and darkness descends. The player must now trust to his/her hearing and, with only an extremely primitive top-view map to help, navigate a route to the monster. The player's task is to successfully survive all the hazards between the start point of the game and the monster's location. He/she must interpret the myriad sounds that fill the environment, both to enable this successful navigation and to define and execute the confrontational combat situations. Finally, when Grendel has been found, it must be dispatched by determined use of Beowulf's sword.
For a more detailed description of the game and its design, see [3].

3.2. The Questionnaire
Given the nature of the questions addressed by the project, and after testing several alternatives, a two-part questionnaire with one more quantitative and one more qualitative part was created.

3.2.1. Questionnaire – quantitative part
This part consists of 24 statements that can all be answered on a seven-step (1 – 7) scale ranging from 'do not agree' (1) to 'entirely agree' (7). Most of the statements were created with a double-jeopardy system, where each statement appears a second time in reverse (i.e. "The game was boring" / "The game was fun"). This was to ensure consistency within the answers and to flag any wild-card guessing.
The statements can be divided into four distinct groups:
1. General appreciation of the game concept – 13 statements.
2. Experience of presence and immersion – three statements.
3. Game controls and navigation – six statements.
4. General computer game habits – two statements.

3.2.2. Questionnaire – qualitative part
For the second part of the questionnaire the subjects returned to the game application and were asked to describe six distinct places in the game environment. They were asked, for each place, to describe the physical environment plus the emotions it evoked, in as great detail as possible. All subjects described the same six places and jumped from one place to the next by clicking buttons in a shortcut window presented by the test leader.

[Figure 2. Game application window (left) and shortcut menu used for the questionnaire (right).]

3.3. Test Implementation
In the test, we let four separate groups of subjects first play the Beowulf game and then fill out the questionnaire. The tests were conducted on identical laptops (Apple MacBook Pro) with all subjects using Koss Portapro headphones to ensure consistent audio parameters.
The four groups comprised 12 subjects each. Group 1 were music conservatory students aged 20-25. Group 2 were high school students from the south west of Sweden aged 13-15. Group 3 were Art and Design students aged 20-25. Group 4 were high school students from the north east of Sweden aged 13-15.
We wanted to have a mechanism that could give us some controlled indication of how much of the total experience came from the game itself and how much depended on other factors. One such mechanism could be the information given to the subjects prior to playing the game. Each group of 12 was divided into two: six subjects were given only the minimum of information necessary to navigate the game. The other six subjects were given a longer and more emotive description of the kind of terrain and fauna they were likely to meet in the game. (It was thought that this would colour the subject's own, later description of the game experience.)

Minimal instructions to play the game – English translation:
'You will be playing an adventure game where your task is to kill a monster hiding somewhere in the game world. You move by using the arrow keys. With the up key, you take a step forward in the same direction as the blue arrow. With the left and right keys you rotate to change the direction you are going. With the down arrow you swing your sword, the only weapon you have. By listening to your breathing you can judge your health status.'

More elaborate, emotive instructions, for comparison – English translation:


'You will be playing an adventure game where your task is to kill a monster hiding somewhere in the game world. You move by using the arrow keys. With the up key, you take a step forward in the same direction as the blue arrow. With the left and right keys you rotate to change the direction you are going. With the down arrow you swing your sword, the only weapon you have. By listening to your breathing you can judge your health status.
The game environment is a dark, blue grey system of caves and passages filled with mist and noxious fumes. The cold and damp caves are of different sizes and shape, some are slippery as ice, others are filled with sharp stalactites and stalagmites. You will meet dripping water; you will be walking on half-chewed bones, remains of fossils and other nefarious things. The cave system is peopled by animals and other creatures, both real and demonic: wolverines, bats, snakes and toothsome rats. The air is repulsive and there is a sour and unpleasant smell.'

Each subject was given ca. ten minutes of playing time and was then asked to fill out the first part of the questionnaire. They were then instructed to return to the game environment and, by using the six short-cut buttons in the application, visit six specific locations. These they then described in the second part of the questionnaire.

4. Results
The subjects have filled out the questionnaires in a consistent manner, which makes us confident that the test has worked and that the results are reliable. A number of check-points support this:
• There is a strong negative correlation (-0.658) between the two opposite-pole statements 'I do not like to play computer games' and 'I often play computer games'.
• There is a strong correlation (0.784) between the statements 'This is a game I want to improve on playing' and 'This is a game I want to play more times'.
• There are strong correlations between the statement 'It was fun to play' and all of the following statements: 'The game idea (game play) is good', 'This is a game I want to improve on playing' and 'This is a game I want to play more times'. At the same time there is a relatively strong negative correlation to the statement 'I was bored early on'. Generally there are strong correlations between the statements in the category 'General appreciation of the game concept'.
• The correlations between the statements in the category 'Game controls and navigation' indicate that the questionnaire has worked and contains valid data.
The only category that does not show strong correlations between statements is 'Experience of presence and immersion'. This category contains the most subjective and elusive of the 24 statements and the subjects may have had problems relating to the statements given. On the other hand, the aspect of immersion and presence is covered in the qualitative second part of the questionnaire, which makes this a smaller problem.
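As a concrete illustration of the reverse-statement consistency check described above, the short sketch below correlates a positively and a negatively worded statement across subjects and flags answers that do not mirror each other. The scores shown are invented placeholder values, not the actual questionnaire data.

```python
# Sketch of the reverse-statement consistency check (hypothetical scores,
# not the data collected in the Beowulf test).
import numpy as np

# 1-7 agreement scores for one opposite-pole statement pair, one value per subject.
i_often_play     = np.array([6, 2, 5, 7, 1, 4, 3, 6])
i_do_not_like_to = np.array([2, 6, 3, 1, 7, 4, 5, 2])

# Pearson correlation; a strongly negative value indicates consistent answering.
r = np.corrcoef(i_often_play, i_do_not_like_to)[0, 1]
print(f"correlation between the paired statements: {r:.3f}")

# On a 1-7 scale a perfectly mirrored pair sums to 8; large deviations
# point at possible wild-card guessing for that subject.
deviation = np.abs((i_often_play + i_do_not_like_to) - 8)
suspects = np.where(deviation >= 3)[0]
print("subjects to inspect:", suspects.tolist())
```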

4.1. Results from part one of the questionnaire
The sheer number of statements posed in the questionnaire (24) means that we will present only a selection of the results in this paper, and show only a few of the graphical representations of the answer frequency. In these diagrams, blue bars indicate males and pink bars indicate females. The figures at the bottom of each bar indicate the instruction sub-group, where sub-group 1 was given only the minimum instruction and sub-group 2 was given the more elaborated instruction.

4.1.1. General computer game habits
One surprise was that the majority of the test subjects did not consider themselves regular game players, even though they all fall within the normal demographic for regular gamers. Only 18 of 48 responded 5, 6 or 7 to the statement 'I often play computer games'.

Figure 3: response to statement: I often play computer games.

4.1.2. General appreciation of the game concept
Despite this lack of gaming experience the general reaction to the application was favourable. 27 of 48 subjects responded 5 or more to the statement 'It was fun to play':

Figure 4: response to statement: It was fun to play.

A high number unexpectedly thought the application would make an excellent game for home computers despite its having next to no graphic content.

Figure 5: response to statement: This is an ideal game for a home computer.


This is especially interesting when compared to the same question, but now focused on mobile phones (the intended platform for the application).

Figure 6: response to statement: This is an ideal game for a home computer.

It can also be seen that generally the female response to the game was more consistent and slightly more positive.

4.1.3. Experience of presence and immersion
15 out of 48 test subjects did not find the audio environment highly convincing, but only 6 subjects gave the statement 'I was totally immersed in the game and the game task' a score of 3 or less.

4.1.4. Game controls and navigation
41 out of 48 test subjects responded that they were able to understand and play the game immediately.
26 out of 48 were able to understand and react to the combat situations without problem.
14 out of 48 stated that they tired quickly of the game, with a spread of 4.5 to 7.
33 out of 48 expressed an interest in having the game on their mobile phone. The strongest reaction to this was from group 4, the high school students from the north east of Sweden aged 13-15, with all 12 scoring positive on this. It should be observed that this particular group has a highly developed mobile phone social culture, and a number of the subjects were simultaneously SMS-ing each other during the running of these tests.
Generally the response to the application was favourable. The negative responses the game received were focused around 5-6 individuals, or ca. 12.5% of those tested.

4.2. Results from part two of the questionnaire
There was a mixed response to the descriptive part of the questionnaire, with some subjects taking a long time to answer and writing complex descriptions, and some subjects hardly reacting at all.
Generally it can be said that nearly all of the subjects added elements of their own creativity to the descriptions of the environments. These elements fell into five distinct groups:
1. Temperature (warm, cold, icy etc.)
2. Smell (unpleasant, rotten, damp etc.)
3. Colour (grey, red etc.)
4. Emotional content (lost & alone, scary etc.)
5. Descriptive (foggy, sunny, dark etc.)

There was no correlation between the quality and complexity of the description and the information the groups had received at the start. Both those subjects who had an emotionally descriptive introduction and those that had a factual introduction showed mixed levels of creativity in their interpretation of the environments, depending on the individual. Since the amount and nature of information prior to the game experience does not seem to have any great impact on that experience, it can be interpreted as meaning that the answers and responses to the questionnaire emanate from the gaming experience itself and that very little colouring, if any, resulted from the pre-information.

Only 2 subjects out of 48 used no emotive descriptive element in the second part of the questionnaire. These two described only what they heard directly from the game application itself. All others described at least some elements of temperature, colour, smell etc. that emanated from their own interpretive powers.
Some subjects showed an almost poetic interpretation of the soundscape:
'…a deep ravine that you can't see the bottom of, beyond this unending pine forests and high mountains…'
'It's freezing & smells of sand.'
'A dove flies away from a birch tree, chased away by a large crow.'
'Smells like old-cut forest.'
'Smells like blood and I am really scared of what is going to happen.'
'…smells like damp stone, maybe even concrete.'
'…threateningly red, bloody, hard, someone is hiding from the dogs…'
'Wet ground, cold on my toes…tough!'
'…damp, humid, smells like wet forest, green-grey & cool, fresh air!'
'…brown, grey with mossy/slimy walls…'
'…weak red light…smells of bodies and death…'
'…and you think a wolf is howling, but it's only the wind!...'
The average answer was around 19 words, with the shortest being 3 words and the longest being 65 words.

5. Conclusions

5.1. Questions and answers
The test carried out has given indicative answers to the questions formulated at the start of this project:

Q: Does an audio-based game make sense to a first time user?
A: The test showed that although the thresholds did vary from subject to subject, there was no real problem with understanding the game concept and executing said gameplay in the application.

Q: Is an audio-based game a suitable vehicle to deliver game content on a mobile platform?
A: Despite the fact that in practice there are outstanding technical issues to be solved, the game presents no conceptual problems and seemed suitable for a mobile platform to the test group.

Q: Does the level of pre-game information colour the player's experience?


A: The test results show no evidence that the information given prior to playing the game coloured the game experience. One interpretation of this is that the game experience itself was strong enough to overshadow the experience from the relatively short pre-information phase of the test.

Q: Can an audio-based game give a satisfactory game experience?
A: Although not to everyone's taste, a significant percentage of the test group were sufficiently positive towards the application to confirm that it can.

Q: Does a non-visual game world present the player with navigational problems or other hindrances to a working game experience?
A: The test did show that there are issues to be concerned about regarding navigation; however, these presented no major problems for the test group and are not sufficient to sabotage an application.

Q: Can a game emulate visual combat situations using only sound?
A: The test showed that this was entirely feasible, and only a few of the test group expressed any problem in understanding and successfully utilizing the combat gameplay.

Q: Can a player correctly interpret visual sounds without the aid of visual reference?
A: The test showed that this was not an issue. The players might vary in their personal interpretation, but at no time did these variations interfere with the gameplay.

Q: Does a sound based environment encourage a player to contribute more to the game experience?
A: The high gameplay marks suggest that the subjects were prepared to accept a much lower level of gameplay content than in a visually based environment. It should be noted that the gameplay in Beowulf is neither particularly sophisticated nor complex. Whether this impression was caused by the novelty of the game environment or through the addition of self-generated game content by the user is a discussion for another paper and needs further user testing.

5.2. Feasibility of audio based games
We can observe that there are only small technical limitations to creating audio-based games on personal computers. In this project we have used a large number of existing, off-the-shelf tools and technologies. The narrowest bottleneck for the area is not technical but more related to a lack of knowledge about sound and sound design.
There are no limitations when designing audio based games, though there are added demands when solving problems where one would normally rely on a graphic pointer.
As these games are highly interpretive, great care must be shown in the design process, as it is only too easy to give unfocused and misleading signals to the player through audio content.
There are no real limitations to the (sighted) player's potential and ability to interpret and accurately play audio based games, as long as the design process has been intelligently and correctly executed. The test subjects had no preconceived negativity to the idea of a non-visual game. And if there was any scepticism prior to the test, this was soon dispelled when the game was played.

5.3. Ability of subjects to navigate play
Despite the game having minimal instructional information and a very limited graphic interface and feedback, the game was played successfully by all subjects. Although there were different levels of entry, with some subjects grasping the concept immediately and others needing some minutes to grasp the principles of navigation and concept, all subjects were able to control the character, use the sword and move through the environment. Most of the test subjects were even able to complete the set task and 'kill' the monster.

5.4. Problems illuminated by this test
Paradoxically, the principal problems with audio exclusive games lie in their principal strengths. The ability of audio applications to trigger self-generated, complementary visual content in the user also means that the game designers have little control over what the user is 'seeing'. This places great demand on the game play and sound design. This, in turn, highlights the need for a sound design methodology in this area. An absolute majority of the work carried out in this field is still experimental. With an informed set of "do's and don'ts" the work could be lifted to higher levels.

6. Acknowledgments
Stefan Lindberg, Interactive Institute Sonic for all the sounds.
Martin Nordlinder for dedicated work with software development.
Pupils and staff at Central Skolan, Arvika
Students at Ingesund University College of Music
Pupils and staff at Carlshöjd School Umeå
Students at Umeå School of Art
Stuart Cunningham for advice, help and feedback

References
[1] Gaver, W. W., Beaver, J., and Benford, S. 2003. Ambiguity as a resource for design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Ft. Lauderdale, Florida, USA, April 05 - 10, 2003). CHI '03. ACM, New York, NY, 233-240. DOI= http://doi.acm.org/10.1145/642611.642653
[2] Sengers, P. and Gaver, B. 2006. Staying open to interpretation: engaging multiple meanings in design and evaluation. In Proceedings of the 6th Conference on Designing Interactive Systems (University Park, PA, USA, June 26 - 28, 2006). DIS '06. ACM, New York, NY, 99-108. DOI= http://doi.acm.org/10.1145/1142405.1142422
[3] Liljedahl, M., Papworth, N., and Lindberg, S. 2007. Beowulf: an audio mostly game. In Proceedings of the International Conference on Advances in Computer Entertainment Technology (Salzburg, Austria, June 13 - 15, 2007). ACE '07, vol. 203. ACM, New York, NY, 200-203. DOI= http://doi.acm.org/10.1145/1255047.1255088
[4] Lumbreras, M. and Sánchez, J. 1999. Interactive 3D sound hyperstories for blind children. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: the CHI Is the Limit (Pittsburgh, Pennsylvania, United States, May 15 - 20, 1999). CHI '99. ACM, New York, NY, 318-325. DOI= http://doi.acm.org/10.1145/302979.303101
[5] Röber, N., Masuch, M. 2004. Auditory game authoring. DOI= http://games.cs.uni-magdeburg.de/audio/data/Roeber_2004_AGA.pdf


[6] Friberg, J. and Gärdenfors, D. 2004. Audio games: new perspectives on game audio. In Proceedings of the 2004 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology (Singapore, June 03 - 05, 2005). ACE '04, vol. 74. ACM, New York, NY, 148-154. DOI= http://doi.acm.org/10.1145/1067343.1067361

Control of Sound Environment using Genetic Algorithms
Scott Beveridge, Don Knox,
Glasgow Caledonian University, Glasgow, Scotland
scott.beveridge@gcal.ac.uk, d.knox@gcal.ac.uk.

Abstract. Sonification - the production of sound to represent some form of data or information - has been applied in various fields including analysis of financial, meteorological and physiological data. A system overview is presented which is based on analysis of socio-spatial behaviour via video capture of a given environment. The example given is a busy commuter environment such as a train station. Activity in the environment is mapped to a set of musical performance parameters, and also forms the basis for controlling an optimisation function based on genetic algorithms (GA). The aim is to develop a socially reflexive audio environment, where those present unconsciously interact with the input and output of the sonification process. The output of the system is a series of musical chords, optimised as regards their musical fitness as defined by three consonance criteria.

1 Introduction

Sonification can be used as a tool to enhance our understanding of abstract data. It is particularly useful in applications where the complexity of data is likely to cause visual overload, or is at a level of abstraction that requires some form of translation to make it useful. Examples of the data sets that have been subject to sonification are financial [1], meteorological [2, 3] and physiological [4, 5]. Data sets can be categorised as those which relate to remotely located data - internet traffic, financial indices, weather patterns - or 'localised data' - skin conductivity, brain patterns, gestural information [5]. To produce meaningful interpretation of input data it is necessary to implement a schema or mapping framework. In its simplest form the framework can be based on direct linear relationships between movements and musical themes [6], however more comprehensive and versatile mapping results when an algorithmic design approach is taken [7]. Algorithmic design encompasses many methodologies. These include stochastic and serial techniques - which manipulate pre-defined musical parameters using either random or structured processes - and generative processes such as those based upon genetic transformations and cellular automata. Generative processes employ organic techniques which produce unique, sonically rich material [5]. Past implementations of this approach have concentrated mainly on aesthetic uses. These involve the active interaction of participants with the system in the context of a performance environment. Situations of this type see individuals, influenced by aural feedback, becoming aware of the system and seeking to then modify the source information based on their subsequent actions [5].

This paper presents initial work on a system which concentrates upon sonification of gestural information - specifically socio-spatial behaviour. The aim is to develop a system that unconsciously influences passive users, hence creating a socially reflexive environment. It is hoped that by carefully mapping input parameters, audio information can be used to create an environment where human interaction with the system is self controlling. Ideal situations where this may be applied include large sensate environments like train stations, where during rush hour hundreds of commuters could be subject to sedate music in order to provide a more relaxed commuting environment. A system such as this raises interesting questions about the nature of performance. Traditionally, groups of musicians build upon relationships and communication to facilitate the creative process of making music. The system explores the concept of a large scale musical performance where the communication and interaction takes place between 'musicians' who are unaware of their role in the process. As well as being of interest from a performance perspective, it is possible that processing data in this way will reveal patterns of information which are not immediately evident using traditional statistical analysis methods [8].

2 The System

2.1 Data capture

Image processing is used as a means of extracting the necessary information from a particular environment. Input data is obtained with a visual capture system suspended above the target space (see Figure 1).

The plan view provided by this arrangement gives wide geographic coverage and sufficiently detailed movement information that allows observations to be made in a wide range of locations.

Figure 1: Plan view from webcam

Video frames are captured using an inexpensive webcam, and manipulated using Matlab image processing tools to remove unnecessary information (floor detail and static objects). This type of approach has previously been used only in small scale performance contexts, requiring specialised motion capture systems [9, 5]. Subsequent image processing (see Figure 2) and statistical analysis allow extraction of various parameters which describe the nature of movement in the environment (see section 2.2).

Figure 2: Figures identified by vision capture system
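As a rough illustration of the kind of processing involved, the sketch below extracts foreground activity from an overhead frame sequence by simple background subtraction and derives three crowd statistics of the sort used for the mapping in section 2.2. It is a minimal sketch using NumPy only; the original system uses Matlab image processing tools, and the thresholds and helper names here (for example `crowd_statistics`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of overhead-view activity extraction (illustrative only;
# the thresholds and statistics are assumptions, not the authors' Matlab code).
import numpy as np

def crowd_statistics(prev_frame, frame, background, px_per_person=400, thresh=30):
    """Estimate people count, mean speed and dominant direction from two
    consecutive greyscale frames and a static background image."""
    # Foreground mask: pixels that differ clearly from the empty-scene background.
    foreground = np.abs(frame.astype(int) - background.astype(int)) > thresh
    people_estimate = foreground.sum() / px_per_person  # crude blob-area heuristic

    # Frame difference as a stand-in for motion magnitude (speed of movement).
    motion = np.abs(frame.astype(int) - prev_frame.astype(int)) * foreground
    mean_speed = motion.mean()

    # Dominant direction: shift of the centre of mass of activity between frames.
    ys, xs = np.nonzero(foreground)
    prev_fg = np.abs(prev_frame.astype(int) - background.astype(int)) > thresh
    pys, pxs = np.nonzero(prev_fg)
    if len(xs) and len(pxs):
        direction = np.arctan2(ys.mean() - pys.mean(), xs.mean() - pxs.mean())
    else:
        direction = 0.0
    return people_estimate, mean_speed, direction
```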
2.2 Mapping

In order to achieve sound output that is in some way indicative of events occurring within the environment being studied, correspondences must be formed between measured behavioural activity and musical variables. Wherever possible, direct correspondences between the data domain and sound domain should be made. However, the data may not display such immediately obvious relationships, and in such cases metaphors may be used [10]. The system draws parallel data streams that derive direct relationships with dynamic intensity, tempo and timbre, and translates more complex relationships by statistical analysis and a generative algorithm.

2.2.1 Acoustical parameters

In line with work carried out in gestural interaction [5], a framework for direct sonification activities has been implemented - see Table 1.

Activity/Trigger          Sonification
Number of people          Dynamic intensity
Speed of movement         Tempo
Direction of movement     Timbre

Table 1: Map of environment activity to sonification parameter

This framework provides a comprehensive but potentially complex set of acoustical parameters which represent the activity within the environment. To simplify these variables and so make the data set more useful, a higher level representation is required. The objective is to obtain a qualitative measure of these features in order to simplify audio generation. A recent study by Lu et al [11] has been adopted to provide this representation. The research involves a framework for the emotion classification of music which examines audio on the basis of the features extracted in Table 1. The system uses signal analysis techniques to extract intensity, timbre and rhythm (ITR) features which are used in a classification process based on Gaussian mixture models (GMM). The result is music which is labelled in terms of the two-dimensional stress-energy model proposed by Thayer [12] (see Figure 3). The model, which was adapted from the original proposed by Russell [13], forms quadrants equating to contentment, depression, exuberance and anxious/frantic.
Classifying features in this manner provides a context for further decisions regarding music generated by the system, as the emotion labels are directly analogous to the activity within the environment. This representation will not be sonified directly; instead the system generates audio with opposing features to counteract the activity within the environment.
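A minimal sketch of the direct correspondences in Table 1 is given below. The value ranges, scaling factors and the use of eight timbre presets are illustrative assumptions rather than values taken from the paper.

```python
# Sketch of the Table 1 mapping from crowd statistics to performance parameters.
# Ranges and scalings are illustrative assumptions, not the authors' values.
import math

def map_activity(people_estimate, mean_speed, direction):
    """Map (number of people, speed of movement, direction of movement)
    to (dynamic intensity, tempo, timbre index) as in Table 1."""
    # Dynamic intensity: more people -> louder, expressed here as MIDI velocity 0-127.
    intensity = min(127, int(people_estimate * 8))

    # Tempo: faster average movement -> faster tempo, clamped to 40-180 BPM.
    tempo = max(40, min(180, int(40 + mean_speed * 2)))

    # Timbre: quantise the dominant direction of movement (radians) into one of
    # eight instrument/timbre presets.
    timbre_index = int(((direction + math.pi) / (2 * math.pi)) * 8) % 8

    return intensity, tempo, timbre_index

print(map_activity(people_estimate=20.0, mean_speed=35.0, direction=0.6))
```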

-2-

51
[Control of Audio Environment using Genetic Algorithms]

Figure 3: Thayer's stress/energy model, showing the four mood clusters: contentment, depression, exuberance and anxious/frantic (axes: energy/arousal vs. stress/valence).

In our busy commuter environment we may observe a large number of individuals moving quickly in a uniform direction. This behaviour indicates high intensity, fast tempo parameters with distinct timbre characteristics dependent on direction. The classification framework proposed by Lu et al (Figure 4) places audio with these features within the exuberance category. The system will use this classification outcome to construct audio with opposite characteristics, generating audio with features which would place it in the contentment category and produce an affective balance to the original observed behaviour.

Figure 4: The hierarchical mood detection framework of Lu et al
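To make the counteracting step concrete, the sketch below places a classified observation in one of Thayer's four quadrants and looks up the mood the generator should aim for instead. The simple thresholding rule stands in for the GMM-based classification of Lu et al, and only the exuberance-to-contentment pairing is taken from the text; the remaining pairings and the normalised axis values are assumptions.

```python
# Illustrative quadrant selection and counteraction. A threshold rule stands in
# for the GMM-based ITR classifier of Lu et al; the pairings other than
# exuberance -> contentment (the example given in the text) are assumptions.
OPPOSING_TARGET = {
    "exuberance": "contentment",
    "anxious/frantic": "depression",
    "contentment": "exuberance",
    "depression": "anxious/frantic",
}

def thayer_quadrant(energy, stress):
    """Place a normalised (energy/arousal, stress/valence) point in one of the
    four mood clusters of Thayer's two-dimensional model."""
    if energy >= 0.5:
        return "anxious/frantic" if stress >= 0.5 else "exuberance"
    return "depression" if stress >= 0.5 else "contentment"

# The rush-hour example from the text: high-energy, low-stress activity
# classifies as exuberance, so the generator targets the contentment quadrant.
observed = thayer_quadrant(energy=0.9, stress=0.2)
print(observed, "->", OPPOSING_TARGET[observed])
```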
Considering musical features in terms of emotion is useful for a number of reasons. Studies conducted on group psychology and consumer behaviour suggest that individuals are very sensitive to changes in environment [14, 15] and that music is a critical factor in their construction. Spatial aesthetics, or 'atmospherics', is the term used to describe the conscious designing of space to create certain effects in buyers [16]. Based on atmospherics, individuals are likely to display one of two behaviours. Approach behaviour involves responses such as physically moving toward something, affiliating with others in the environment through verbal communication and eye contact, and performing a large number of tasks within the environment [17]. Avoidance or negative reinforcement behaviour [18] includes trying to get out of the environment, a tendency to remain inanimate in the environment, and a tendency to ignore communication attempts from others [19]. A study by Milliman [15] shows how manipulation of these responses can be achieved by altering musical parameters. In an experiment where music with varying tempo and loudness was played as background music within a supermarket environment, it was found that in-store traffic flow could be modified, with fast tempo, high intensity music proving the most effective in increasing the passage of consumers through the sales space. The system proposed in this paper can not only dynamically alter these musical variables in real time (and in response to user input) but also incorporate structural musical features, which results in novel audio with the potential of controlling group behaviour.

2.2.2 Structural Features

The mood classification scheme proposed by Lu et al is based solely on acoustical parameters and does not take into account more complex structural features which may contribute to emotional expression. Musical aspects such as mode, interval and melodic contour are necessary in expressing emotion and provide a basis for higher cognitive constructs associated with music listening.
It has long been established that expectation plays a pivotal role in a listener's experience of music [20]. This theory is based on the assumption that while listening to music an individual will form expectancies about its continuation. If these presuppositions are violated, it evokes a corresponding emotional reaction [21]. Narmour proposed a refinement of this concept in the implication-realization (I-R) model [22], which judges melodic expectancy based on the Gestalt-based principles of proximity, similarity and good continuation. The I-R model is based on a three note archetype which focusses on the distance and direction relationships between the intervals. By measuring melodic contour on the basis of these principles the model has been found to predict melodic expectancy with reasonable accuracy.
This system copes with these structural aspects by implementing an algorithmic approach. Parameters within the genetic algorithm are modified in real-time in a mechanism which seeks to optimize structure within specific parameters. The emotion classification process plays an essential part in this process by ensuring a context with which to violate expectancies and hence generate emotion. The initial implementation of this system uses this framework to optimize chord triads which form the basis for generated audio.

3 Sound Generation


The system is built upon an evolutionary design model inspired by the Survival of the Fittest concept proposed by Charles Darwin in 1859. In biological terms, this relates to the competition for predominance amongst peers. In the field of mathematics it is more commonly known as an optimisation problem. Optimisation problems employ specialised heuristics to examine large, potentially infinite solution spaces with the aim of finding optimum results. The system uses a genetic algorithm methodology for optimisation proposed by Manzolli et al [23]. Genetic algorithms apply the biological processes of reproduction and selection to a population (or solution space) to achieve optimisation. In the system the initial population consists of randomly generated four note chord sequences. In keeping with the original implementation [23], these chord sequences are assigned a fitness value based on three specific consonance criteria: harmonic, melodic and voice. These consonance criteria are based on the idea of coincidence of harmonics [24] and relate to the Gestalt principles of proximity and similarity. This method examines the first 16 harmonic components of each individual note, including the fundamental, to give a measure of the quality of note intervals.

• In the case of harmonic consonance the four notes in the chord are compared with each other and a score is assigned based on the commonality of their spectral components.
• Melodic consonance is used to find the chord which best relates to a given input note. This is achieved by comparing the harmonic components of each note in the chord with the input note. After all four comparisons are made the chord is given a score which relates to the note with the maximum overlap value.
• Voice consonance describes the relationship between the elements of the chord with respect to a set of note ranges. These sets are defined by a fuzzy set formalism which assigns scores based on individual membership functions. Voice consonance ensures adequate spacing between the notes and hence a tendency towards consonant intervals of 3rds and 5ths.

These criteria are added together to determine the overall fitness of each individual. The fittest individuals are then chosen to progress to the reproduction stage where crossover and mutation take place. The resultant children are then subject to the same fitness mechanism, which identifies the fittest subset, and the process continues. The goal is to incrementally increase the overall fitness of the individuals with the aim of achieving the best possible result in a specified timeframe. The outcome is an optimum chord (as regards the consonance criteria) which is output as MIDI data. The optimisation process is influenced by parameters provided by the statistical analysis module, which have been mapped from data provided by the visual capture system.

4 Initial Results

The chord structures generated by the system reflect the current state of the environment being observed via vision capture. The consonance measures between the notes of each chord tend towards musically stable intervals (3rds and 5ths) in favourable conditions (for example very crowded conditions during rush hour), and dissonance when less favourable. The optimisation process can be further controlled by altering initialisation parameters which affect the initial population size and maximum number of generations. Figure 5 shows an initial population size of 30 undergoing 15 generations of evolution. Both the average population fitness and the best individual fitness tend towards a maximum value as the optimisation process progresses. The criterion upon which the optimisation process is based is completely dynamic and depends upon behavioural information as defined by human activity in the environment measured by the vision capture system.

Figure 5: Population evolution
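To make the evolutionary loop concrete, the following is a minimal sketch that evolves random four-note chords towards higher harmonic consonance, scored by counting near-coincident components among the first 16 harmonics of each note. It implements only the harmonic criterion with simple truncation selection, single-point crossover and random mutation; the melodic and voice criteria, the fuzzy note-range memberships and the crowd-driven parameter control described above are omitted, and all numeric settings are assumptions rather than the authors' values.

```python
# Minimal sketch of the chord optimisation loop (harmonic consonance only;
# selection, crossover and mutation settings are illustrative assumptions).
import itertools
import math
import random

def harmonics(midi_note, n=16):
    """Frequencies of the first n harmonic components, including the fundamental."""
    f0 = 440.0 * 2 ** ((midi_note - 69) / 12)
    return [f0 * k for k in range(1, n + 1)]

def harmonic_consonance(chord, tolerance_cents=25):
    """Score a four-note chord by counting near-coincident harmonics between note pairs."""
    score = 0
    for a, b in itertools.combinations(chord, 2):
        for fa in harmonics(a):
            for fb in harmonics(b):
                if abs(1200 * math.log2(fa / fb)) < tolerance_cents:
                    score += 1
    return score

def evolve(pop_size=30, generations=15, low=48, high=84, mutation_rate=0.1):
    """Evolve random four-note chords towards higher harmonic consonance."""
    population = [[random.randint(low, high) for _ in range(4)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=harmonic_consonance, reverse=True)
        parents = ranked[: pop_size // 2]                  # keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            cut = random.randint(1, 3)                     # single-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < mutation_rate:            # occasional random mutation
                child[random.randrange(4)] = random.randint(low, high)
            children.append(child)
        population = parents + children
    return max(population, key=harmonic_consonance)

best = evolve()
print("fittest chord (MIDI note numbers):", best)
```

The population size of 30 and the 15 generations mirror the initialisation used for Figure 5; the fittest chord would then be emitted as MIDI data as described in section 3.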


5 Future Work

The current implementation of the system creates a unique and distinct evolutionary cycle for every discrete set of parameters captured from the behavioural information observed in the environment. Future efforts will look toward the creation of an initial population which will evolve during the entire life of the procedure. Processing larger populations would allow additional evolutionary parameters to be altered; in effect the population would grow in parallel with (and in response to) the changing behavioural data at the input. So in addition to fitness criteria, further evolutionary criteria such as population bounds, mutation criteria, cross-over method and selection function could be modified. Population manipulation in this manner would result in smoother and more continuous musical gradients, resulting in musical output with smoother transitions between themes.

Figure 6: Proposed system for evolution of the initial population (an initial population feeds a genetic algorithm whose fitness criterion is influenced by statistical crowd analysis; the modified population is fed back until the final population yields the optimum or 'fittest' chord of the generative process).

A main area for future investigation is the need to ensure the musical output of the sonification process is meaningful. This is in keeping with the stated goal of creating a socially reflexive environment - where in this example, the aim is to create a more relaxed environment for the commuter. Toward this, new emotion classification strategies will be developed which incorporate both acoustical and structural features. Development at this level will lead to a more emotionally relevant representation and more accurate forecasting of crowd 'feelings', which will result in a more effective self controlling mechanism. The unification of these factors forms the basis for an EPSRC-funded project with which the authors are involved. In addition, the role of extra environmental parameters will be examined. These could be variables that are not specifically related to human behaviour, but which influence the environment. For example, environment temperature, humidity, light level and ambient noise can be measured and included in the generative process.

References

[1] F. C. Ciardi, "smax: A multimodal toolkit for stock market data sonification," Proceedings of ICAD 04, Tenth Meeting of the International Conference on Auditory Display, Sydney, Australia, 2004.
[2] N. Reeves, "The cloud harp project." Available at http://www.cloudharp.org/index.htm, Accessed April 10th, 2006.
[3] A. Polli, "Atmospherics/weather works: A multi-channel storm sonification project," Tenth Meeting of the International Conference on Auditory Display (ICAD), Sydney, Australia, 2004.
[4] E. Miranda and A. Brouse, "Toward direct brain-computer musical interfaces," Proceedings of the 2005 conference on New interfaces for musical expression, Vancouver, Canada, pp. 216–219, 2005.
[5] K. Beilharz, "Wireless gesture controllers to affect information sonification," ICAD, Limerick, Ireland, CD-ROM proceedings, July 2005.
[6] H. Lörstad and J. Eacott, "The intelligent street: responsive sound environments for social interaction," Proceedings of the 2004 ACM SIGCHI International Conference on Advances in computer entertainment technology, pp. 155–162, 2004.
[7] K. Beilharz, "Criteria and aesthetics for mapping social behaviour to real time generative structures for ambient and auditory display (interactive sonification)," INTERACTION - Systems, Practice and Theory: A Creativity and Cognition Symposium, Sydney, Australia, pp. 75–102, 2004.
[8] J. R. Minkel, "The computer minds the commuter." Available at http://focus.aps.org/story/v13/st26, Accessed October 13th, 2006, 2004.


[9] M. Droumeva and R. Wakkary, "Sound intensity gradients in an ambient intelligence audio display," Conference on Human Factors in Computing Systems, pp. 724–729, 2006.
[10] A. de Campo, C. Frauenberger, and R. Holdrich, "Designing a generalized sonification environment," Proceedings of the ICAD, 2004.
[11] L. Lu, D. Liu, and Z. Hong-Jiang, "Automatic mood detection and tracking of music audio signals," Audio, Speech and Language Processing, IEEE Transactions on, vol. 14, no. 1, pp. 5–18, 2006.
[12] R. E. Thayer, The Biopsychology of Mood and Arousal. Oxford University Press, 1989.
[13] J. A. Russell, "A circumplex model of emotions," The Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161–1178, 1980.
[14] R. E. Milliman, "Using background music to affect the behavior of supermarket shoppers," Journal of Marketing, vol. 46, no. 3, pp. 86–91, 1982.
[15] R. E. Milliman, "The influence of background music on the behavior of restaurant patrons," Journal of Consumer Research, vol. 13, no. 2, p. 286, 1986.
[16] P. Kotler, "Atmospherics as a marketing tool," Journal of Retailing, vol. 49, no. 4, pp. 48–64, 1973.
[17] B. H. Booms and M. J. Bitner, "Marketing services by managing the environment," Cornell Hotel and Restaurant Administration Quarterly, vol. 23, no. 1, p. 35, 1982.
[18] M. M. Bradley, "Emotion and motivation," Handbook of Psychophysiology, vol. 2, pp. 602–642, 2000.
[19] R. J. Donovan and J. R. Rossiter, "Store atmosphere: an environmental psychology approach," Retailing: Critical Concepts, 2002.
[20] L. B. Meyer, Emotion and Meaning in Music. The University of Chicago Press, 1956.
[21] D. Huron, Sweet Anticipation: Music and the Psychology of Expectation. MIT Press, Cambridge, Mass., 2006.
[22] E. Narmour, The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. University of Chicago Press, 1990.
[23] J. Manzolli, A. Moroni, F. Von Zuben, and R. Gudwin, "An evolutionary approach applied to algorithmic composition," Proceedings of VI Brazilian Symposium on Computer Music, pp. 201–210, 1999.
[24] H. Helmholtz, On the Sensations of Tone. Thoemmes Continuum, English 1875 edition.

Genie in a Bottle: Object-Sound Reconfigurations for Interactive
Commodities
Daniel Hug
Interaction Design Department
Zurich University of the Arts
daniel.hug@zhdk.ch

Abstract. Everyday commodities are increasingly enhanced with information or communication technologies and become more
complex and interactive. Electroacoustic sound can be a powerful way of shaping the aesthetic and functional aspects of such artifacts
beyond the modification of the physical aspects. This is a challenge for sound design, which traditionally has prospered mainly in
linear audiovisual media. The nature of objects requires new approaches to sound design, both because physical objects and sound
have a close relationship and also because artifacts are embedded in complex sociocultural contexts. Thus sound design for interactive
commodities is one element that contributes to the hermeneutic affordances of complex, interactive commodities. This paper outlines
the sociocultural significance of objects, their relationship with sounds, and how sound design can re-configure this relationship.

1 Introduction

Fueled by recent developments in information and communication technology (miniaturization, increase of processing power, reduced costs), cultural changes (information society, nomadic lifestyle, merging of work and leisure) as well as the ever-increasing need for new markets, the vision of the "ubiquitous computer", aptly termed "everyware" by Adam Greenfield, is becoming everyday reality [19]. Literally every thing is a possible resource for deploying new technologies and interactivity. Our everyday life is permeated by networked technologies and complex, seemingly autonomous, devices. Some of these devices operate entirely without human intervention, others require some degree of operation or manipulation. The latter we call "interactive commodities". The term commodity stands for the pragmatic, everyday nature of these artifacts [20].

1.1 The Role of Sound for Interactive Commodities

Studies dealing with the probable future "prime fathers" of interactive commodities, the mobile phone and the personal digital assistant (PDA), suggest that sound is going to play an important role in the design of small, disappearing, ubiquitous devices. For instance, Brewster describes strategies to overcome the lack of screen size limitations with sound [10]. In general, sound is a powerful means to communicate information about continuous or dynamic processes, to supply ambient information related to a place or activities, to improve an ongoing continuous control activity and to shape the functional and aesthetic experience of artifacts [16]. Therefore its application in interactive objects of everyday use suggests itself and is investigated in initiatives like the European COST IC0601 Action on Sonic Interaction Design¹ or the European research project CLOSED², which aims at developing suitable measurement tools and criteria for the design of sound of interactive artifacts.

1.2 A New Challenge for Sound Design

Thanks to miniaturization of electroacoustic components, sounds can be integrated relatively easily into interactive commodities. This creates possibilities for a new kind of "schizophonia" that even R. Murray Schafer could not have anticipated³. Traditionally, the work done in the context of auditory display research focuses on functional aspects and how they can be related to sound. The research and the resulting sound design criteria are geared towards efficient communication of system or data status or changes and are mostly concerned with desktop computers or highly specialized application domains such as medicine or geology [24].
Our hypothesis is that the nature of interactive commodities entails new, largely unstudied aspects, and that new criteria for sound design are required. What these aspects and criteria are, how they can be studied, and finally, how guidelines for design can be derived from this study, remains to be investigated. This paper outlines these aspects and shows possible directions for future research and design.

1 See http://www.cost-sid.org
2 Closing the Loop on Sound Evaluation and Design, see http://closed.ircam.fr and [28].
3 Schizophonia is the term coined by R. M. Schafer to denote the separation of sounds from their - natural - sources by means of electroacoustics. For him this concept carries only negative connotations [31].

2 Meaningful Objects

It suggests itself that any study on how sound can be designed to create meaningful interactive objects should include an in-depth understanding of the role objects play in the everyday life of individuals and society. This is of course a vast domain for study and difficult to frame. The following presents a few spotlights that demonstrate the importance of understanding how objects become meaningful and how this relates to sound design. The elaborations are not concerned too much with details like formal aspects of objects and focus on higher level cultural implications.

2.1 Semiotic Approach

From a semiotic point of view objects are either the correlative of signs or they are signs themselves. Nöth states that artifacts have emblematic (meaning oriented) as well as non-emblematic, pre-semiotic (practical, use-oriented) aspects. According to some semioticians also the latter is seen as a semiotic aspect. Nöth further explains that according to Heidegger artifacts carry references to the practical use, the determination to be used for something, imprinted in them. Practical things can also be understood as referring to their value of use as indices. Such primary and denotative meanings can be accompanied by secondary, e.g. aesthetical or


connotative, meanings [27]. In his semiotic writings, Barthes of- vention of language. Thus, he resorts to language as paradigm for
fers two broad connotational categories for objects. The first is understanding objects [4]. Objects challenge the semiotic idea of
the ”existential” connotation, which understands objects as obsti- communication by signs and code through their immediate, com-
nate, inhuman or even antihuman. The other is the ”technolog- plex, archetypical multi-sensorial presence. They are not delim-
ical” connotation which is concerned with objects as consumed, ited to relatively restricted semiotic frameworks like media such
reproduced and functional products [4]. as film, photography or even architecture. It is not surprising that
A special aspect in the creation of meaning in objects is ma- many accounts of semiotics of objects actually refer to the presen-
terial. In his dialogue ”Hippias Maior” Plato describes a discus- tation of objects in media such as advertisements (see e.g. [4]).
sion between Socrates and Hippias about what makes the beauti- Joan M. Vastokas calls for a ”semiotics of visual phenomena”
ful beautiful. One proposition is that certain materials like gold (and we shall add auditory phenomena), taking into account the
make ordinary things beautiful, another proposition is that some full spatial, temporal, and gestural dimensionality of the artifact.
materials are more functional and thus make an object appropri- She proposes a narrative concept of artifact, understanding the
ate (cited in [9]). Materials like marble and (fake) gold suggest procedural component of an artifact as being born through inten-
a royal or noble atmosphere in shopping malls, wood can signify tionality of its creator(s), going through a life of use and abuse
closeness to nature, and so forth ([9], [6]). as ”meaningful and expressive object in itself, and as a ritual per-
former in social and cultural life”, finally ”dying” and being dis-
2.2 Sociocultural Objects posed, exhibited in a museum or recycled ([37], p. 341). She
describes several essential points to consider in the study of arti-
Artifacts play an important role on a sociocultural level. Based on facts from a sociocultural perspective:
interviews with over 300 people Csikszentmihalyi and Rochberg-
Halton investigated the meaning of everyday artifacts beyond ”(1) The meaning of artifacts, including works of
mere functionality. They describe how objects become part and visual ”art”, is constituted in the life of the objects
are the result of the process of cultivation, that is ”the process themselves, not in words or texts about them; (2) the
of investing psychic energy so that one becomes conscious of the artifact is not an inert, passive object, but an interac-
goals operating within oneself, among and between other persons, tive agent in sociocultural life and cognition; (3) the
and in the environment” ([14], p. 13). According to Csikszent- signification of the artifact resides in both the object as
mihalyi and Rochberg-Halton things embody goals, make skills a self-enclosed material fact and in its performative,
manifest and shape the identities of their users. Objects thus are ”gestural” patterns of behavior in relation to space,
embodiments of intentionality. Hence they alter patterns of life, time, and society; (4) the processes, materials, and
they reflect and define the personality, status, social integration products of technology, especially those of a society’s
etc. of both, producer and owner. In addition, things evoke emo- dominant technology, function as cultural metaphors
tions through interpretations in the context of past experiences, at many levels and in many sociocultural domains; and
thus becoming signs or symbols of one’s attitude. Moreover they (5) theoretical insights derive, not from theorizing in
can mediate conflicts within the self. the abstract, but from direct observation and experi-
Through the ”objectivity” and permanence of objects such ence of the phenomenal world of nature and culture.”
identities can be shared. This is the precondition for the social- ([37], p. 337)
izing effect of things and of their ability to provide role mod-
els. Csikszentmihalyi and Rochberg-Halton also shed light on 2.4 Active, Animated and Magic Objects
the role of the functional aspect of objects for culture, describing
how even the use of things for utilitarian purposes is inseparable These considerations lead to an aspect of the study of artifacts
from the symbolic context of culture. Artifacts socialize people which seems to be counterintuitive at first glance: The notion of
to certain habits or ways of life and represent these as signs [14]. the object as actor. Latour proposes in his actor-network theory
(ANT) that human beings and non-human or even inanimate ob-
Socio-semiotic studies describe the practice of collecting, the
jects are interlinked in mutual interactions, both being actors in
accumulation of objects as dowry, the expression of self in rela-
the process. In this view, action is not limited to what intentional,
tion to society in bricolage, the rhetoric of the displayed artifact,
meaningful humans do. According to Latour any thing that does
shopping centers as super-objects and stages for objects and social
modify a state of affairs by making a difference is an actor [26].
actions, the mystification and commodification of culture through
souvenirs, and the complex culture around jewels. An excellent A theoretical approach which is somewhat related to ANT, but
collection of essays related to these questions can be found in more focused on human agency, is activity theory, introduced by
[29]. Baudrillard finally points out the relationship between ob- Leontiev in the late 1970’s. Instead of departing from a relational
jects, production systems and society, describing the development approach (between symmetric nodes in networks, ”actors”, that
of artifacts from a static role in a traditional pre-modern society can be people, machines, or other things) activity theory proposes
to their emancipation to flexible, functional entities, and how sys- a primacy of activity over both object and subject, originating in
tems of objects and their consumption reflect socio-ideological purpose, need and intentionality. As for the role of artifacts, they
circumstances [6]. All these aspects contribute to the meaning are described as the product of cultural needs, embodying our in-
making related to artifacts. tentions and desires. Thus they mediate between people and the
world and in this sense things have agency [22].
What has been outlined in these reflections seems to come to
2.3 Beyond Semiotics
a new level in interactive commodities. In her book ”The Sec-
It shines through some of these points that semiotics in a struc- ond Self” Sherry Turkle described the computer as an ”evocative
turalistic sense is limited as an analytical tool when dealing with object for thinking about human identity”. It serves as projection
the world of concrete physical objects. Barthes states that the surface for our desires and fears and is often considered animate in
semiotic study of objects is still at an early stage, one reason being some way - not only by children - exhibiting behaviors indicating
that no pure significant system of objects exists without the inter- some kind of reasoning and agency. The main criterion for alive-


ness, autonomous motion, is being replaced or extended by the listening we hear an approaching car rather than four wheels, an
notion of psychological autonomy. Looking at the more recent de- engine and the various vibrations of the car’s body [17].
velopments in the 20th Anniversary Edition of "The Second Self", Turkle states that these characteristics have become commodified, are part of our everyday experiences. But although computational technology has lost much of its uncanniness we still tend to personify and project our self onto it. In the case of the increasing autonomy and complexity of computational artifacts this tendency becomes even stronger. And in some cases, like wearable computers or computational implants, the border between computer and human is increasingly blurring on a very concrete level [35].

According to Daniel Chandler the quality of purposiveness and autonomy in artifacts arises from the whole being more than the sum of the parts when technology becomes too complex to control. Technological artifacts seem to have a will of their own and we tend to anthropomorphize them. The resulting technological animism credits an inanimate entity with consciousness and will4 [11].

4 In our everyday myths and narratives this topic often reappears under a somewhat humorist veil, also referred to as "resistentialism". This term, coined by Paul Jennings, stands for a humorous theory in which inanimate objects display hostile desires towards humans, a "fact" apparent in experiences such as cars not starting when one is in a hurry or the bread always falling on the side with the butter on it.

From here the step to a notion of a magical quality in complex computerized artifacts can easily be made. According to Arthur C. Clarke's third law, "any sufficiently advanced technology is indistinguishable from magic."5

5 To be found in a 1973 revision of his compendium of essays, "Profiles of the Future".

3 The Relationship Between Sound and Objects

We have stated above that the study of objects in all their complexity matters for the sound design of interactive commodities. This is the case both because sounds and objects are often closely related and because of the power of sound to modulate an object's identity. Moreover, sounds can evoke a certain object and thus its sociocultural significance described above. In the following a closer look at this relationship is provided.

3.1 Sound and Physical Properties

Sounds are directly connected to an artifact's physical properties. In our everyday experience the acoustic properties of materials and objects provide us with information on their quality. We might see a transparent object, but we will only be able to tell whether it is glass or acrylic after tapping it with our finger. We can distinguish metals in the same way, or determine whether a piece of wooden furniture is made of hardwood or plywood, or whether it is in good or bad condition. Many studies have been concerned with the ability to detect material properties, shape and size through sound, as well as processes of interacting materials (see e.g. [18] for a comprehensive overview). Nonetheless, many aspects of recognition are still not completely understood, as most of these studies deal with simplified sound events, such as mallets hitting metal plates [38].

As opposed to typical laboratory setups, the sounds we hear in everyday life are often composites of several sound sources. Each single vibration in the human hearing range merges into one sound gestalt and is perceived as a complex entity. With increasing complexity, the actual elements that cause a sound are not discerned at all anymore in what is called everyday or ecological listening.

In terms of perceiving physical qualities, the intersensorial link between audio and haptics is quite strong. For example, Kayser et al. demonstrated somatosensory and auditory interaction and the conditions for its effectiveness, namely temporal coincidence and inverse effectiveness [23]. And Avanzini and Crosato have successfully demonstrated how sound can modulate the haptic perception of stiffness [1]. Some studies deal with the audio-haptic relationship in food consumption, for example the factors associated with judging apples to be mealy [3]. In the less scientific but nevertheless relevant domain of film and game design, sound is often used to substitute or denote haptic sensations of protagonists, and is essential for suggesting authenticity of objects on screen and suspending disbelief [15].

3.2 Critique of Sonic Causalism and Naturalism

But the relation between sounds and physical objects and processes is more complex. On closer investigation we can discover a dialectical relationship between objects and sounds: While objects are permanent and concretely graspable, sounds are temporary and evasive, yet still they often have an almost intimate relationship with the physical world, which is the basis for the naturalizing effect of sound in media. But the natural link between sound and causing physical event has to be questioned. Sounds also have an existence which is detached from their original source, ambivalent, sometimes carrying a rather vague notion of material in them, sometimes being totally abstract. Chion criticizes the simplistic assumption of the (scientific) discourse about sound "naturally" representing a certain cause. This myth leads to a general research focus on sounds that are empirically verified as "well identified" and thus supposedly meaningful and useful for design. However, the list of "well identified sounds" is relatively short, often restricted to archetypes or clichés, and very case dependent. Their successful identification depends on a specific recording of a specific cause (e.g. a slamming door). And not a small number of the sounds in such a list are well identified because we have learned a specific connotation through the consumption of media like film. It seems evident that resorting to some empirically "well identified sounds" would mean to abandon the richness and diversity of the sonic world in favor of a statistical "average".

Chion also points out the linguistic ambiguity when speaking about sounds and their cause: A sound of a piano can be the tone emitted when pressing a key on a piano or it can be the sound resulting from hitting the piano with a hammer. Or the sound of a piano can actually result from a synthesizer. And although we might identify a certain sound as coming from a wooden box we can not say that one particular sound of a wooden box exists. Depending on where and how we exert physical force on the box it will sound differently. In summary: it is usually not possible to claim that a sound is the sound of something specific, or, vice versa, that every thing has its one sound [12].

Steven Connor describes the dialectic relationship between sounds and objects as being an "immaterial corporeality": "One apparent paradox of hearing is that it strikes us as at once intensely corporeal - sound literally moves, shakes, and touches us - and mysteriously immaterial. (...) Perhaps the tactility of sound depends in part on this immaterial corporeality, because of the fact that all sound is disembodied, a residue of production rather than a property of objects." ([13] p. 157)

Last but not least, the relationship between objects and sounds can even be viewed from a totally different angle. Chion states that we can find certain irregularities, frictions, traces of impacts in abstract sounds that can give them a material, bodily quality. He calls these notions of materiality "indices sonores matérialisants" ([12], p. 102).
3.3 The Sonic Metaphysics of Objects

In the sound related discourse the notion exists that sound can be the "voice" of objects in an actual, immediate manner and not merely metaphorically. This notion again is dialectic. Connor states: "When we hear something we do not have the (...) sensation of hearing the thing itself. This is because objects do not have a single, invariant sound, or voice." ([13], p. 157) At the same time, however, sound rarely comes completely apart from its source. Connor continues that "to think of a sound as the 'voice' of what sounds (...) is also to think of the sound as owned by and emanating essentially from its source, rather than being an accidental discharge from it." ([13], p. 157) The following dictum is ascribed to Oskar Fischinger: "Sound is the soul of an inanimate object." (cited in [15], p. 330, author's translation) John Cage is reported to have hit the objects wherever he went in order to investigate their inner nature.

An old example of experiencing sound as the voice of things can be found in Homer's Odyssey. Homer describes Odysseus, having returned home and competing in an archery contest: "Then his right hand took the string and made it vibrate, the cord sung beautifully and clearly, like the call of a swallow." (cited in [12], p. 102, author's translation) This is more than just a metaphor: To some extent it is the string which is acting, singing, even if excited by the plucking.

"Through noise nature vibrates of sense": With this poetic formulation Barthes describes the ability of sound to give things a voice. According to him, listening is always connected to a hermeneutics that aims at understanding the dark, the blurred or mute and to make the sense "behind" appear. Listening to these sounds is an essentially religious experience, connecting the listening subject with the hidden world of the gods [5]. This attribution of sound to an expressive quality of objects and to the soul and voice of things is an obvious connection to the topics of anthropomorphization and animism described above.

3.4 The Cultural Signification of the Sounds of Objects

Barry Truax states that sound mediates the relationship between listener and environment [34]. This also includes the sounds produced by the cultural-technological artifacts we create. For instance, Mark M. Smith describes how in travel accounts of antebellum America sounds became the signs of positively connotated, pre-industrial work, linking sound to an increase of wealth and population. He also reveals significant differences between the industrialized North and the slavery dominated South. Both areas and cultures had a distinct soundscape with different keynote sounds and soundmarks6 [32]. The sounds of certain artifacts and machines thus contribute to the identification with a class, political orientation, etc.

6 The terms "keynote sound" (sounds that are heard by a particular society frequently enough to form a characteristic sonic background) and "soundmark" (a sound with a special meaning for a community) have been coined by R. Murray Schafer [31].

Through a comparative study of noise abatement campaigns from the early 20th century, Karin Bijsterveld points to class being an important element involved in definitions of noise and noise pollution. The cultural struggle about sounds is also a struggle between intellectual and working classes. The philosopher Theodor Lessing is reported to have been one of the first to organize an anti-noise campaign. Bijsterveld points out that "the sound of technology is a key aspect of technological culture, because sound has been highly controversial and deeply invested with symbolic significance." ([8], p. 165)

New technologies like the automobile or industrial machines bring with them new sounds that become symbols of progress for some and a primitive nuisance for others. A group that embraced noise and the loud, powerful sounds of technology with almost religious devotion were the Italian Futurists. In his manifesto "The Art of Noises" Luigi Russolo states: "We find far more enjoyment in the combination of the noises of trams, backfiring motors, carriages and bawling crowds than in rehearsing, for example, the 'Eroica' or the 'Pastoral'." (cited in [30], author's translation)

These examples show how the discourse about sound strongly reflected the societal structures and significant changes brought about by industrialization and technological changes in general. Many comparable accounts could be found today, referring to ghetto blasters and mobile phones. Since the industrial age and the introduction of electroacoustic technology sounds have become more pervasive than ever. And this trend will increase due to the technique of electroacoustic enhancement of commodities described above. The study of the sociocultural history of sound reveals that the importance of sound goes far beyond purely functionalist purposes, e.g. of providing feedback in an interface.

4 Implications, Directions and Strategies for Sound Design

We have described how sounds relate to artifacts, how they become meaningful and can even give inanimate things expressive qualities. And we have also described how both artifacts and their sounds play an important sociocultural role. Thus, to say that sound has great potential for conveying information about the nature of artifacts and their hidden processes or properties is correct but falls short of grasping the full complexity of the role of sound in artifacts. Sound does not simply convey information; it offers the listener resources, affordances, clues for an interpretative act. The dialectic relation of sound with physical objects, combined with the dual nature of artifacts - being at the same time abstract signs and concrete, physical realities - provides endless possibilities for complex combinations. The result is a complex interleaving of levels of interpretative clues through sound. In the following we will elaborate on how sounds can be designed with these aspects in mind, focusing on the relationship and possible reconfigurations of objects and their sounds.

4.1 A Narrative Approach to Sound Design

In the introduction we have described the transformation of inanimate objects to procedural, interactive objects and the resulting narrative potential. A technologically complex object with the ability to sense, process, store and communicate can be seen as an actor in a narrative of interaction. However, there is hardly any theoretically grounded know-how and only a very small number of practical examples which employ a narrative notion expressed through sound in the interaction with computerized objects. In order to establish criteria and a methodological framework to investigate the possibilities of this new direction of sound design for interactive commodities, we propose to start with film and game sound design, because they provide a rich source of material describing how the narration of object interaction can be designed sonically. Maribeth Back suggests that it could be worthwhile to investigate design practices of sound designers of narrative media.
She describes how sound can help to create micro-narratives, using both cultural experience (codes, sound as sign) as well as physical experience [2].

In the following we will outline a few considerations and possible directions for design drawn from film and game sound design.

4.2 Strategies for Object-Sound Reconfigurations

We have mentioned the possibility of referring sonically to an object or a specific action with an object. The sound produced when an object is used or operated (or just touched in some way) can become not only the signifier of that object but also of its function, use context and sociocultural significance, and turn into a signifier of metaphorical or associated qualities of the object. Let us consider the example of the sound of a hammer hitting a nail: It is an index of a hammer as well as of hammering, and it can symbolize strength, aggression, work, DIY-culture, communism and so forth, or it can even be an enunciation of headache.

These levels of meaning creation can become literally mixed in sound through the strategic layering and intertwining of various sounds into a new sound gestalt. This is a common practice in film sound: Through layering of sounds, combining concrete, identifiable sounds with each other or even with more abstract sounds, meaning potentials7 are transferred between them. Film sound designers, for example, use this method to create richer meaning potentials in seemingly simple sounds. For example, the waterdrops falling onto an ant colony in "A Bug's Life" (John Lasseter, 1998) are sounds of splashing water, combined with sounds of rockets and explosions [7].

7 Van Leeuwen proposes this term instead of the static code to express the contextual dependency of meaning making. [36]
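To make the layering practice tangible, the following Python/NumPy sketch mixes a strongly attenuated, low-pass-filtered second layer into a "carrier" sound so that the added layer colours the resulting gestalt without being identifiable on its own. It only illustrates the general technique discussed here and is not a tool used in film production or by the author; the synthesised test signals, the filter and the gain value are invented for the example.

import numpy as np

SR = 44100  # sample rate in Hz

def lowpass(x, window=64):
    # Crude low-pass filter: smooth the signal with a short moving average.
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def layer(carrier, modulator, gain=0.25):
    # Mix an attenuated, low-pass-filtered layer into the carrier sound.
    n = min(len(carrier), len(modulator))
    mix = carrier[:n] + gain * lowpass(modulator[:n])
    return mix / np.max(np.abs(mix))  # normalise to avoid clipping

# Hypothetical stand-ins for recorded material: a decaying noise burst as the
# identifiable carrier and a low rumble as the added, colouring layer.
t = np.linspace(0.0, 1.0, SR, endpoint=False)
splash = np.random.randn(SR) * np.exp(-6.0 * t)
rumble = np.sin(2.0 * np.pi * 55.0 * t) * np.exp(-3.0 * t)

composite = layer(splash, rumble)
print(composite.shape, float(np.max(np.abs(composite))))

In practice the same operation would be applied to recorded material, with the filtering and gain of each layer balanced by ear.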
Another example of the ability of sounds to redefine the nature and meaning of objects displayed on screen are the typical cartoon sound effects that have been created for producers like William Hanna and Joseph Barbera or Tex Avery. Here signifier and signified are subjected to extreme reconfigurations: Instead of sonifying literal physical processes, metaphorical or symbolic sounds are used. Emotional expression and interpretative clues are achieved by the associative nature of the sound and the specific cartoon aesthetic based on analogy, contrast and exaggeration. An example is the sound of an anvil being hit with a heavy hammer used in a scene where a character gets hit by something like a baseball bat [7]. Many new sound icons, which stand for a very specific meaning which is entirely artificial and established only through repeated use, have emerged from animated film. Examples are many film sound clichés: falling objects whistle, strong fist blows in fights have a metallic impression, quick running produces a sound of a ricochet (which, again, is often the same, stereotyped sound, and not an arbitrary, realistic recording of a ricochet), and so forth.

These sounds are not questioned by the viewer-listener. Instead the naturalizing power of sound suppresses the notion that the audiovisual event could be impossible or unrealistic, at least during the actual experience of the film. This ability of sound to create credibility is often used for fantastic or incomprehensible things and processes. Ben Burtt, the sound designer of Star Wars, first established credibility for fantastic sound effects by finding "anchors" in familiar sounds like animals (in which case there is also the aspect of animism to be considered, see below) or familiar machinery. Using these sounds he could inject the fantastic objects and processes depicted on screen with the necessary amount of familiarity and credibility [39]. These sounds are then often manipulated (pitch shift, time stretch, filtering, amplitude envelopes etc.) in order to fit them into the "carrier" sound8. Particularly interesting to note here is that such sounds, detached from their original source through electroacoustic recording, can become familiar and strange at the same time. As mentioned above, somewhat contrary to many accounts of everyday listening, the identification of a sound source can be subject to a great deal of interpretation and uncertainty, especially when the sound is represented in a recording9. However, this separation of sound and referent is a powerful device for design because it allows the creation of products which have a distinguished and novel sonic identity without being entirely alien.

8 The carrier-modulator principle known from FM synthesis serves well here as a metaphor for how layered sounds often work.

9 Quite often students of our classes in sound design are not able to identify the sound sources of recordings their fellows made, unless the sounds are very typical or strongly contextualized.

4.3 The Use of Sound to Animate and Characterize Objects

Animation, Fantasy and Science-Fiction film is certainly also a rich source for the study of objects as expressive "characters". In these genres we can find countless examples of objects that are provided a personality and emotional expressivity through sound. Beauchamp states that "virtually any character or object can be personified by adding speech, movement, and the expression of emotions." ([7], p. 21) Even in "realistic" movies the sound design often goes beyond just naturalizing or physicalizing an object, adding a narrative component. According to Flückiger, objects in movies often are not only made credible but also animated through sound. The purpose of beeping computer displays, or the extensive sonic endowment of supernatural occurrences or alien creatures, is to distract from their mocked-up nature, to suggest an actual existence, life and function [15].

An example of sonically anthropomorphized and animated objects is the sound design of Darth Vader's Tie Fighter in "Star Wars": The distorted scream of a human being was mixed with various jet engines, producing a nail-biting sound while containing an eerie humaneness: "His ship and its sounds are extensions of this merger (between flesh and machine, note from the author), and thus they scream in pain." ([39], p. 109) But also fantastically animated objects become more credible when having the appropriate sounds. Think of the living dishes in "Beauty and the Beast" (Gary Trousdale, Kirk Wise, 1991). And sound designer Ben Burtt used morphological analogies to human language sentences and baby talk to shape the beeps of R2D2 from "Star Wars" (George Lucas, 1977) [33].

Sound also plays a role in marking ordinary objects as extraordinary or magic. This can be the case with simple, inanimate objects that not only become "alive" through use, but also reveal extraordinary qualities through sound: An example of a rather simple object being enriched and personified in this way is the jade sword in "Crouching Tiger, Hidden Dragon" (Ang Lee, 2000). The sword emits glass-like singing sounds, conveying a fine, precious identity, fragile and sharp at the same time. When used inappropriately it starts to oscillate, which is accompanied by a wobbling sound, as if it was responding in annoyance.

Sonnenschein suggests the use of "archetypal templates" to create nonhuman sounds that we can relate to. We might "find squeals, squeaks or hisses from compressed air hoses; groans from old wooden doors; and a laughter-like craziness from bending a saw." ([33], p. 61) Often sounds of animals are not only used to animate objects but also serve as sign carriers and connotational devices. Lion growls, yapping chimps or cat purrs can be layered over or mixed into "carrier sounds", suggesting strength, wackiness or seductiveness [33].
4.4 The Limitation of the Semiotic Paradigm

Several of the examples given are described through semiotic analysis, referring to concepts of sign, index, icon, symbol and so forth. In product sound quality research and design the semiotic viewpoint is becoming increasingly established. It suggests that sounds of objects convey meaning by a significative relationship to experiences or schemas [21]. But similar to objects, sounds can never be pure signs. Even in its most abstracted, electroacoustic form sound always carries an immediate, naturalizing component in it. This effect is at work in film, where sounds are taken for granted and are accepted as the "natural" accompaniment of events on screen, although the audio-visual product is entirely constructed. But this effect is also potentially at play - and very important to be aware of - in sounding physical artifacts.

The semiotic analysis actually works well in audiovisual media like film because movies are spatially and temporally finite artifacts, consumed in a strictly standardized way. Linguistic concepts (and semiotics is deeply permeated by a linguistic notion of meaning making) can be applied, the codes are known, and new codes can be easily developed and decoded, based on previous knowledge. And, most importantly, there is always an obvious presence of an authorship, of an actual communication through the medium of film.

In the interaction with actual, physical, computerized commodities, however, the experience changes fundamentally. Although there is a designer involved, she is not designing the entire experience of using the interactive artifact. The designer can not predict and control the interpretative and appropriative strategies people will choose as they go about their situated actions in everyday life. Also, the sound designer disappears as the initiator of a communicative act. In such a situation sounds will be perceived as "naturally happening", even if they are not, and the interpretative act will be influenced by this. Already with a rather simple mechanical process such as a squeaking door there is an intuitive judgement of an inherent quality of the hinge, which will be interpreted by the listener on an immediate level, before any higher level sign systems take effect. The communication is ad-hoc and interpretative, the sign is constructed through personal experience rather than cultural conventions10. According to Krippendorff, semiotics has fundamental limits when dealing with such dynamic design issues: It is rooted in a two-world ontology, dividing between a world of signs or signifiers and a world of referents or objects, and it excludes human agency in the sense that meaning always evolves in an iterative and interactive process of ongoing interpretation [25]. This suggests that the analysis and design of sounds for interactive commodities should take a critical distance from semiotics (without abandoning its valuable contributions) and put an emphasis on hermeneutics, which acknowledges that interpretation and understanding, not only of texts, but of the world, is essentially a dynamic circular movement between a single element and the whole, and can not be crystallized to a specific structure.

10 This may change of course if we are confronted with mass-produced artifacts, where a certain class of similar devices creates structural standards comparable to film or game genres. But this will only shift the emphasis towards structurally coded meaning; the ad-hoc interpretation of the sounds and their naturalizing, immediate expressiveness is still present.

5 Conclusion

Designing interactive commodities poses new challenges on several levels of design. Such objects are often integrated in everyday life practices which can not be framed by traditional task and goal oriented methods. Despite certain routines there are always new configurations of contexts and events as we go about our daily lives. Instead of simply decoding pre-structured signs in artifacts, which requires a relatively high level of standardization as it can be found e.g. in mainstream film, we often interpret the world in an iterative, ad-hoc manner. This is especially true if the artifacts we encounter are complex, networked and to some extent autonomous and intelligent.

Sound design in this context can be seen as designing an expressive channel for such animated artifacts and creating micro-narratives of interactions. This means that we see artifacts as characters, as actors in a micro-dramaturgy, and sound tells us something both about the temporal and dynamic development of the process and about the inherent semantic complexity in the nature of the artifact, which can range from hidden or metaphoric material qualities to emotional expressions of a kind of "genie in a bottle".

5.1 The Aesthetic Dialectics of Sonically Augmented Objects

In the design of the sounds for such artifacts the sociocultural significance of artifacts in everyday life plays an important role. This happens on two levels:

Firstly, "objective" qualities can be rendered into artifacts, for example with indexical sounds pointing to certain artifacts and thus to a range of potential meanings arising from them. Phenomenologically, sound as a resource for meaning creation in the context of physical artifacts has the characteristic that it is always rooted in some sort of physical motion in time. The possibility of electroacoustic processing and reproduction of sound does not eliminate this archetypal connection, rather it extends and modulates it, significantly altering the perception of an object's properties and its meaning potential. For instance we have described the practice of mixing concrete and abstract sounds in film sound. Thus sound design for interactive commodities builds upon the fundamental perceptive link between material objects and processes and sound. At the same time it extends this fundamental level to socio-semiotical and psychological aspects of the design domain. By enhancing artifacts with electroacoustic sounds that are derived from interactions with other artifacts, an effect can be achieved that is related to the dialectical relationship between objects and their sounds described in this paper.

Second, these new artifacts become sociocultural devices themselves, subjected to individual and collective interpretative practices. This requires a sensitivity from the designer towards techno-cultural history and how artifacts grow into the fabric of society, shape discourse and reflect, as well as support, cultivation.

5.2 Goodbye Authenticity - Welcome Schizophonia

The relationship between sounds and objects has always been dialectic. Often sounds reveal the true "authentic" quality of an object, but through the engineering of sounds for industrial products this quality of sound is questioned. Does the convincing sound of a car door really display its high quality, or is it just engineered by changing the resonance body or maybe even by using materials of lower quality but with better acoustic properties? In any case the sounds of objects are perceived as "natural" and authentic to some extent, which implies a certain ethical dimension in the discourse about sound design for artifacts.
The possibility to integrate an even wider range of sounds by means of miniaturized electroacoustic devices, and to control them through computer technology, makes the relationship between a physical object and its sound arbitrary11. Never before was there so much control over the sonic appearance of an artifact.

11 Every recording of a sound in fact already is a distortion of the authentic link between a physical process and the resulting sound.

R. Murray Schafer was deeply concerned about the increasing schizophonia resulting from the separation of sounds from their "natural" sources. But schizophonia is already the normal cultural condition we live in and have learned to deal with (most of the time at least). While we might still have some resistance towards the schizophonic reality, the generation of teenagers walking on the streets listening to distorted and filtered pop songs from their mobile phones, obviously enjoying it, will most likely have no problems at all with this notion.

As designers we have to embrace this new aesthetics of the sonically extended artifact as a new field where practices and vocabularies are still to be developed. To pick just one example consequence indicated above: It might be a good idea not to stick too tightly to sounds that are "well identified" in laboratory settings, thinking of them as the only way of conveying a clear meaning. For designers it will be necessary to deal with the fact that most decontextualized (recorded!) sounds are "badly identified" or at least reinterpreted. They are the sound designer's raw material and the question is: What creative and interpretative potential lies in them?

6 Future Work

In terms of design many methodological questions remain to be answered. One central issue is the conceptual and practical integration of the macro and micro levels of the design. The macro level deals with the overall experience of the interaction with the object and how it is embedded into the sociocultural context. This requires a suitable method of design and evaluation that takes into account cultural factors and interpretative processes.

On the micro level the challenge for the sound designer is the integration of sounds into each other, into the artifacts and into the interaction dynamics. For example, an open question is how "hybrid sound gestalts" can be designed that merge with a device. How far can we stretch an object's (sonic) identity? What sounds are acceptable on a metal, plastic or wooden object? The methods and tools for the fine-tuning of the fit of the electroacoustic enhancements and the object's own sounds to processes and dramaturgies exist and are widely used in film and game sound design: Manipulation of pitch, amplitude and spectrum over time with envelopes, adjusting the mix, crossfading, masking, layering, or using semantically complex effects like reverb, delay, filtering, and so on. And designing sounds for narrative purposes means that we have to design transition points, find linking strategies between sounds to create continuity, and model the dynamic time-space relationship of interactive applications. This requirement has already led to advanced game sound middleware such as FMOD12 or Wwise13. Finally, there are several techniques for "material transfer": Convolution, manual filtering and manipulation of resonance "by ear", or physical modeling. Technically very simple and often very successful is the subtle layering of filtered recordings of physical processes onto other sounds to suggest a certain material quality.

12 http://www.fmod.org
13 http://www.audiokinetic.com
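As a purely illustrative sketch of the convolution-based "material transfer" just mentioned (and not the author's tooling), the following Python/NumPy fragment convolves a dry excitation signal with a synthetic, damped-resonance impulse response standing in for the recorded resonance of a physical object; the partial frequencies, decay rates and the excitation are invented for the example.

import numpy as np

SR = 44100  # sample rate in Hz

def damped_resonance_ir(freqs_hz, decays, length_s=0.5):
    # Synthetic impulse response: a sum of exponentially decaying sine partials,
    # standing in for the recorded resonance of a physical object.
    t = np.linspace(0.0, length_s, int(SR * length_s), endpoint=False)
    ir = sum(np.exp(-d * t) * np.sin(2.0 * np.pi * f * t)
             for f, d in zip(freqs_hz, decays))
    return ir / np.max(np.abs(ir))

def transfer_material(dry, ir):
    # Impose the object's resonant character onto the dry sound via convolution.
    wet = np.convolve(dry, ir)
    return wet / np.max(np.abs(wet))

# Hypothetical excitation: a short click-like noise burst.
dry = np.zeros(SR // 2)
dry[:64] = np.random.randn(64)

# Invented partials and decay rates loosely suggesting a small wooden box.
wet = transfer_material(dry, damped_resonance_ir([180.0, 410.0, 950.0],
                                                 [18.0, 25.0, 40.0]))
print(len(wet))

The same principle applies whether the impulse response is synthesised, as here, or taken from a recording of the target object.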
But the relevant questions are: What happens if such "transnatural" materials are present in an actual, physical object which already has a material identity? How can design methods which are mostly used in immersive media be adapted to the context of actual physical objects with their non-immersive qualities? And of course this sound design is not "free"; in many ways it is much more restricted by external factors than the sound design for traditional audiovisual media. These factors are technical limitations as to what sounds can be chosen (e.g. resonance body of the device, loudspeaker specifications), but also how the object is handled or used (hitting, shaking, heating, etc.). In short: Further investigation is also needed into synchresis and diegetics of sound for interactive commodities. We propose to draw on material from game sound research as a good starting point, as it is able to link the narrative audio-visual world of film with interactive media and physical interfaces.

Finally, we will have to investigate strategies for defining and modulating the "threshold of attention" by shaping how sounds integrate into a soundscape and how they may emerge from it in a meaningful way. Traditionally, much research has been conducted in order to evaluate the perceived urgency of alarm sounds. In reverse, we suggest that it should also be possible to shape sounds in such a way that they easily blend into the background of our attention, becoming a keynote sound and being available to listening-in-search or listening-in-readiness [31]. In addition, designing meaningful, rich, subtle sounds with a certain degree of individuality, freshness and unpredictability in terms of what they are supposed to mean will be a good foundation to actually prevent them from being annoying, helps to create "hi-fi" soundscapes (in the Schaferian sense), and supports the human ability to interpret and appropriate whatever we encounter and make sense out of the alien.

To make progress in these directions a collaboration between the still quite separate communities of sound and music computing, auditory display, and traditional sound design is required. The design of rich sounds for interactive commodities will have to integrate the two design cultures and their strengths: Dynamic and interactive control on the one hand and rich, detailed sonic semantics on the other.

References

[1] Federico Avanzini and Paolo Crosato. Haptic-Auditory Rendering and Perception of Contact Stiffness, volume 4129/2006 of Lecture Notes in Computer Science. Springer, 2006.

[2] Maribeth Back. Micro-Narratives in Sound Design: Context, Character, and Caricature in Waveform Manipulation. In Proceedings of the 3rd International Conference on Auditory Display, Palo Alto, California, 1996.

[3] P. Barreiro, C. Ortiz, M. Ruiz-Altisent, V. De Smedt, S. Schotte, Z. Andani, L. Wakeling, and P.K. Beyts. Comparison Between Sensory and Instrumental Measurements for Mealiness Assessment in Apples. A Collaborative Test. Journal of Texture Studies, 29:509-525, 1998.

[4] Roland Barthes. Semantik des Objektes. In Das semiologische Abenteuer. Suhrkamp, 1988.

[5] Roland Barthes. Der entgegenkommende und der stumpfe Sinn. Suhrkamp, 1990.

[6] Jean Baudrillard. The System of Objects. Verso, 1996.
[7] Robin Beauchamp. Designing Sound for Animation. Elsevier, Burlington, MA, 2005.

[8] Karin Bijsterveld. The diabolical symphony of the mechanical age. In Michael Bull and Les Back, editors, The Auditory Culture Reader, pages 165-189. Berg, 2003.

[9] Gernot Böhme. Der Glanz des Materials - Zur Kritik der ästhetischen Ökonomie. In Atmosphäre. Suhrkamp, 1995.

[10] Stephen Brewster. Overcoming the lack of screen space on mobile computers. Personal Ubiquitous Comput., 6(3):188-205, 2002.

[11] Daniel Chandler. Technological or media determinism, 1995. Available from World Wide Web: http://www.aber.ac.uk/media/Documents/tecdet/tecdet.html [cited 24.1.2008].

[12] Michel Chion. Le Son. Editions Nathan, Paris, 1998.

[13] Steven Connor. Edison's teeth: Touching hearing. In Veit Erlmann, editor, Hearing Cultures - Essays on Sound, Listening and Modernity. Berg, 2004.

[14] Mihaly Csikszentmihalyi and Eugene Rochberg-Halton. The meaning of things - Domestic symbols and the self. Cambridge University Press, Cambridge, 1981.

[15] Barbara Flückiger. Sounddesign: Die virtuelle Klangwelt des Films. Schüren Verlag, Marburg, 2001.

[16] Karmen Franinovic, Daniel Hug, and Yon Visell. Sound embodied: Explorations of sonic interaction design for everyday objects in a workshop setting. In Proceedings of the 13th International Conference on Auditory Display, 2007.

[17] W. W. Gaver. What in the world do we hear? An ecological approach to auditory event perception. Ecological Psychology, (5):1-29, 1993.

[18] B. L. Giordano. Everyday listening, an annotated bibliography. In D. Rocchesso and F. Fontana, editors, The Sounding Object, pages 1-16. Edizioni di Mondo Estremo, 2003.

[19] Adam Greenfield. Everyware: the dawning age of ubiquitous computing. New Riders, 2006.

[20] Daniel Hug. Towards a hermeneutics and typology of sound for interactive commodities. In Proc. of the CHI 2008 Workshop on Sonic Interaction Design, Firenze, 2008.

[21] Ute Jekosch. Assigning Meaning to Sounds - Semiotics in the Context of Product-Sound Design. In Jens Blauert, editor, Communication Acoustics. Springer, 2005.

[22] Victor Kaptelinin and Bonnie A. Nardi. Acting with Technology. MIT Press, Cambridge, Massachusetts, 2006.

[23] C. Kayser, C. I. Petkov, M. Augath, and N. K. Logothetis. Integration of touch and sound in auditory cortex. Neuron, 48:373-384, 2005.

[24] Gregory Kramer, Bruce Walker, Terri Bonebright, Perry Cook, John Flowers, Nadine Miner, and John Neuhoff. Sonification report: Status of the field and research agenda. 1999.

[25] Klaus Krippendorff. The semantic turn - A new foundation for design. Taylor and Francis, 2006.

[26] Bruno Latour. Reassembling the Social - An Introduction to Actor-Network-Theory. Oxford University Press, 2005.

[27] Winfried Nöth. Handbuch der Semiotik. J. B. Metzler, 2., vollständig neu bearb. und erw. Aufl., 2000.

[28] P. Susini, D. Rocchesso, K. Franinovic, Y. Visell, K. Obermayer, et al. Closing the Loop of Sound Evaluation and Design. In ISCA Workshop on Perceptual Quality of Systems, 2006.

[29] Stephen Harold Riggins, editor. The Socialness of Things - Essays on the Socio-Semiotics of Objects. Mouton de Gruyter, 1994.

[30] Luigi Russolo. L'art des Bruits, Luigi Russolo, textes établis par Giovanni Lista. L'Age d'homme, 2001.

[31] R. Murray Schafer. The Soundscape: Our Sonic Environment and the Tuning of the World. Destiny Books, New York, 2nd edition, 1994. First published 1977.

[32] Mark M. Smith. Listening to the Heard Worlds of Antebellum America. In Michael Bull and Les Back, editors, The Auditory Culture Reader, pages 137-163. Berg, 2003.

[33] David Sonnenschein. Sound Design - The Expressive Power of Music, Voice, and Sound Effects in Cinema. Michael Wiese Productions, 2001.

[34] Barry Truax. Acoustic Communication. Ablex, 2nd edition, 2000.

[35] Sherry Turkle. The Second Self: Computers and the Human Spirit. MIT Press, 20th anniversary edition, 2004.

[36] Theo van Leeuwen. Speech, Music, Sound. Palgrave Macmillan, 1999.

[37] Joan M. Vastokas. Are artifacts texts? Lithuanian woven sashes as social and cosmic transactions. In Stephen Harold Riggins, editor, The Socialness of Things - Essays on the Socio-Semiotics of Objects, pages 337-362. Mouton de Gruyter, 1994.

[38] G. B. Vicario. Prolegomena to the perceptual study of sounds. In D. Rocchesso and F. Fontana, editors, The Sounding Object, pages 17-31. Edizioni di Mondo Estremo, 2003.

[39] William Whittington. Sound Design & Science Fiction. University of Texas Press, Austin, 2007.
Saturday Night or Fever?
Context Aware Music Playlists

Stuart Cunningham, Stephen Caulder & Vic Grout


Centre for Applied Internet Research (CAIR), Glyndŵr University,
Plas Coch Campus, Mold Road, Wrexham, LL11 2AW, North Wales, UK
{s.cunningham | s.caulder | v.grout}@glyndwr.ac.uk

Abstract. Context awareness provides opportunities for enhanced user experience, interaction and customisation of electronic
devices, particularly those which hold large data sets of information which may often only be relevant to a user in certain scenarios.
In this work, we examine how context awareness can be applied to the automatic generation of music playlists on mobile music
devices, such as MP3 players and mobile phones. We hypothesise that the type of music which a person might wish to listen to will
often be influenced by external factors such as the time of day, the ambient temperature, amount of ambient or background noise,
their current amount of physical activity, and their emotive state, to name a few.

We detail the results and data sets of preliminary investigation into several human movement scenarios, emotional status and external
factors. These results are obtained by employing the cost-effective Wiimote controller to record acceleration profiles. The Wiimote is
assessed against a professional level, high-cost, motion capture device to identify if such portable devices are useful in everyday
scenarios. Base values for subject locomotion were investigated for the Wiimote device and verified and analysed using the Qualisys
3D-motion capture system. This was done to set a baseline for the subject's forward velocity, but also to allow research into further, more complex locomotion studies for this project. A model of the playlist generation system is provided, which can be used to
simulate responses to various types of context-informing input.

It is noted that the system has been implemented using a fuzzy rule based system (FRBS). This allows the initial construction to be based on a knowledge base that relates a suggested emotional state (E-state) to various inputs. The longer-term concept is to
investigate the adaptable nature of the initial knowledge base and allow it to adapt to an individual's actual emotional-state
preferences. Further research into the implementation of a Self-Learning Fuzzy Rule Based System (SL-FRBS) is therefore suggested.

1. Introduction choices. A diagrammatic overview of the complete system we


propose is given in Figure 1. Not only this, but in order to
The generation of an automatic playlist is a useful resource for provide such a customised system, we propose that techniques
portable music players, especially given the storage capacities of of fuzzy logic and self-learning systems are suitable to
currently available which are continually increasing. This means efficiently carry out this task.
that the average listener with an MP3 or other digital music
player will often have a database which has a membership of
tens of thousands of songs, possibly even more! The choice of
music a listener wants to hear will be influenced by many
factors, but especially the user's
current mood or emotional state, their current activity (if any),
and the range of other external and environmental factors around
the listener, such as the temperature, amount of background
noise, time of day, etc. Rather than have a user cycle through
such large databases to find music which they want to play,
automatic playlist generation, in an ideal scenario, attempts to
select and play music which it has determined the user would
currently like to listen to.

Existing recommendation and playlist generation systems rely


upon the ability to make correlations and measurements between
songs in a music database in order to automatically generate or
order the music in such a manner that the listener is
automatically provided with music which (it is hoped) that they
will like. To date, this has mainly been achieved by analysing
the user’s listening trends and habits as well as analysing the
songs in the database to extract information about the musical Figure 1: Context-Aware Playlist Generation
content itself.
2. Related Work
In this work we propose, and provide initial results of, the
development of an automatic playlist generation system which Implementing automatic playlist generation is not a new field
not only implements these previous approaches identifying user and has a long history relating to the organisation and
trends and content, but also consider the current context of the recommendation of music tracks present on a music player. A
listener and how this might influence their desired musical detailed overview of alternative, historical approaches for

automatic playlist generation is beyond the scope of this work, initially of interest comes from two main sources: the user or
however, we briefly present an overview of the field and refer listener and the environment in which the listener exists. This is
the reader to the references made in this section should he or she further ratified by Reynolds et al. who also consider contextual
wish to gain a deeper knowledge of playlist generation input parameters from these two domains [7].
techniques [1, 2, 3, 4, 5, 6, 7].
3.1. The Listener
Initially, recommendation and playlist generation systems relied Information which can be extracted from the user is arguably the
on abstract or meta-data level information and user preference in most useful data which can be acquired if one wishes to
order to order the music tracks. These systems are not hugely determine contextual information regarding the listener’s current
different from Automated Collaborative Filters (ACFs) [8] in emotional state and level of activity. This is illustrated in more
that they track and correlate user preference and build up table detail in Figure 2 which is presented as a subset of the previous
of similarity based on musical information such as artist and diagram in Figure 1.
genre [1, 2, 3]. Although these can take simplistic forms by
counting number of plays, favourite artists, etc. the processes of The emotional state of the user is highly likely to influence the
learning user preferences purely based on these factors can also type of music which they wish to listen to. Listeners who are
take more complex forms [3]. happy or contented are likely to desire their favourite music
tracks and music which is from genres which is known to have
More recent work has been focussed on extracting and analysing positive effects on happiness and reflect and stimulate their
content present within the user’s music collection and making current emotional state. Equally, a listener who is unhappy
decisions based upon similarity metrics or correlations, available might wish to listen to slower, calmer music that fits with their
as a result of content analysis. A more notable example of this current mood. However, this is not to negate the fact that a sad
type of analysis is in the field of audio thumbnails [4, 5, 6]. Such or unhappy listener might listen to upbeat, happy music in order
content information can then be coupled with the meta-data and to change their mood. Therefore, the listening requirements
user preferences mentioned previously in order to provide, what cannot be based purely upon determination of emotion, or at
is generally agreed to be, a more suitable and effective system least, multiple inputs are needed to determine if a sad listener
for playlist generation and music recommendation [4, 5]. wishes to remain sad or wants to be cheered up. We propose that
mechanisms such as skin conductivity and heart rate might be
In similar work to this paper, Reynolds et al. propose systems acquired directly from the user, before being sent to a decision
more advanced than the more conventional approaches to making tool which also took into account other parameters and
playlist generation mentioned earlier. Their work supports the semantic knowledge of the music database.
theories that contextual information is also highly valuable and
appropriate when suggesting or ordering musical tracks for the
listener. In fact, they mull over many of the factors which we
propose to be crucial in our own work, and explain in more
detail later in this paper. Reynolds et al. consider variables such
as temperature, activity and location to be incorporated as meta-
data, and also indicate that the mood of the listener is another
key variable which must be considered in automatic playlist
generation. Their work also presents an excellent overview of
the history of automatic playlist generation and the links
between music and mood or emotion [7]. A detailed exploration
of emotional states, measurement and music goes beyond the
intended scope and context of this paper, however the reader is
further referred to the work of Meyers, who provides an in-depth
exploration of the links between emotion and music [9].

We approach playlist generation from a similar set of initial


hypotheses and notions of employing contextual knowledge. We
see our work as a natural extension of the work by Reynolds et al. A
limitation of their work was that there was only minimal
identification of the practicalities and implementation present. In
our work we begin to examine how to measure and put into
practice context-aware playlist generation based on a number of Figure 2: Sample Human Input Parameters
bio-physical measurements.
The other major factor of interest at this stage is the amount of
3. Context movement or physical activity the listener is engaged in. If the
user is moving a lot then it is reasonable to suppose that they
might be exercising or engaging in a focussed physical
Mobile devices have become increasingly computationally
exertion, in which case they would be likely to desire music
powerful and as well as functioning as digital music players,
which reflects this physical motion and might feature strong,
contain more and more peripheral devices such as cameras,
driving beats and tempos which are relatively high, greater than
touch screens and accelerometers. The Apple iPhone and iPod,
120 beats-per-minute (BPM), for instance. An easily available,
in particular, is a high profile example of such a device,
low cost device which is already on the market and can be used
although there are other competing products available and in
to detect three-dimensional (3D) motion is the Wiimote,
development. The power, connectivity options and range of data
illustrated in Figure 3, the remote control device designed for
sources becoming available in mobile devices means that a
use with the Nintendo Wii games console. The Wiimote can be
range of mechanisms can be devised to allow the extraction of
used independent of the Wii console to communicate with
contextual information [10]. Principal contextual information
Bluetooth-enable devices, such as computers and provides are present, we suggest that two scenarios are possible. The first
valuable motion information via its accelerometers [11]. may be that the user will be tired and at rest (determined via
motion detection) in which case music of lower tempo and
which provides a more relaxing experience may be required.
However, an alternative may be that the user wants music to
stimulate them perhaps because they are exercising or deriving a
positive, happy feeling from the strong temperatures (such as
when on the beach during a holiday). This can be further refined
by measuring the amount of light. If light levels are low then it
is most likely night time and, again, in combination with
movement and ambient temperature the user might wish to
either dance or relax and chill-out. The amount of ambient noise
is useful firstly to help ensure that the listener is provided with a
constant, desired volume level proportional to the amount of
noise in the environment (within reason). It might also be used
to determine if the user is inside or outside.

4. Initial Implementation & Results

To begin to assess the ability to gain contextual information


from sensors and the usefulness of this information in deriving
Figure 3: Nintendo Wiimote Controller an emotional state (E-state) we implemented a small-scale
system for playlist generation which would analyse a number of
The ability of mobile devices to provide motion data is initially input factors and apply these in ordering a small music database
a very interesting concept when it comes to considering consisting of eight songs, shown in Table 1.
applications for these features, beyond those originally
conceived for the device. This combined with the ability to Table 1: Music Database used in Testing
attain other real-world and user information from sensors and
direct input means that information regarding the user’s current ID Artist Song
context can be described and learnt from observations of 0 Daft Punk One More Time (Radio Edit)
subjects using the devices. 1 Fun Lovin’ Criminals Love Unlimited
2 Hot Chip Over and Over
3.2. The Environment 3 Metallica Harvester of Sorrow
Factors in the environment around the user too, are likely to 4 Pink Floyd Comfortably Numb
have an effect on many factors which influence the listener’s 5 Sugababes Push The Button
habits and preferences when playing music. A number of 6 The Prodigy Breathe
parameters related to the physical environment as well as 7 ZZ Top Gimme All Your Lovin’
environmental metrics can provide suitable input to a decision-
making system. Figure 4 provides illustration of how 4.1. Defining Input Parameters
environmental inputs fit into the playlist generation system. Through some initial studies of the input parameters we are able
to define and categorise a number of states against which any
incoming data to the recommendation system can be matched in
order to tag the current input state and make decisions based on
this state attainment. This applies across all of the sensor input
available to the recommendation system. It should be noted
that the states currently defined are not necessarily
absolute at this stage and are often indicative. These are easily
refined, and extra levels of granularity or simplicity can be
introduced by adjusting the size of the membership sets and/or
adjusting the state value thresholds.

Initially we define four locomotive states, which can be


extracted from the Wii controller or another similar device such
as the motion data from an iPhone/iPod or from a higher level
source such as the Qualisys motion capture system. These four
possible states and their associated values are defined in Table 2.
These values are obtained from input acquired directly from the
user of the system using such a device.
Table 2: Locomotive States
Standing Walking Jogging Running
0 m/s 1.1 m/s 2.2 m/s 3.5 m/s
Figure 4: Sample Environmental Inputs
Similarly, we must define some standard states and ranges for
Consider firstly, physical phenomena around the listener, such
any other parameters which we intend to implement at this
as the ambient temperature, humidity, acoustic noise, and levels
stage. Due to limitations of equipment availability and time, we
of light, for example. When high temperature and high humidity
decided to focus on additional inputs from the environment at

this stage, rather than other factors directly read from the listener provide hands-free and low cost motion feedback to the FRBS.
or user. Therefore, we define states and parameters for a number The Wii controller provides acceleration feedback for the three
of environmental factors, which are presented in Tables 3 to 5. the x, y and z axis. Initial usage of the device is to obtain an
approximate value for the forward linear velocity V from
the acceleration ∆a across a period of time ∆t as
Table 3: Temperature States
Cold Warm Hot
V = Δa · Δt .    (1)
15-18 °C 20-23 °C 27-30 °C
Wiimote controller and compared this to the data which can be
extracted from a full-blown motion capture system; the
Table 4: Lighting States
Qualisys. Figure 6 shows how placement of the Wiimote on
Dark Grey Day Light Light Sunny subjects was achieved and Figure 7 provides an image of a
0-3 2-4 3-5 4-7 6-10 subject being tracked by the Qualisys system.

Table 5: Weather Condition States Subject


Heavy Rain Light Rain Drizzle Dry
0-2 2-5 4-6 5-10 Initial
test
At this stage, the parameters are loosely defined from empirical
and historical knowledge of the individual and are not yet finely
tuned for the playlist generation system. In order to make the
input functions more realistic, usable and suitable to the listener, Wii
they must first be fuzzified. controller
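To illustrate how a crisp sensor reading could be tagged with one of the states defined above prior to fuzzification, the following Python sketch matches a value against nominal state ranges. The temperature, lighting and weather ranges follow Tables 3 to 5; the locomotive boundaries are only interpolated around the nominal velocities of Table 2 and, like the helper names, are our own assumption rather than part of the implemented system.

# Nominal state ranges per input. Temperature, lighting and weather follow
# Tables 3 to 5; the locomotive boundaries are an assumption interpolated
# around the nominal velocities of Table 2.
STATE_RANGES = {
    "locomotion_mps": {"standing": (0.0, 0.5), "walking": (0.5, 1.6),
                       "jogging": (1.6, 2.8), "running": (2.8, 10.0)},
    "temperature_c": {"cold": (15, 18), "warm": (20, 23), "hot": (27, 30)},
    "lighting": {"dark": (0, 3), "grey day": (2, 4), "light": (3, 5),
                 "light sunny": (4, 7), "sunny": (6, 10)},
    "weather": {"heavy rain": (0, 2), "light rain": (2, 5),
                "drizzle": (4, 6), "dry": (5, 10)},
}

def tag_state(parameter, value):
    # Return every labelled state whose range contains the reading. Overlapping
    # ranges (e.g. lighting) can yield several candidates, which is precisely
    # the ambiguity that the subsequent fuzzification step is meant to resolve.
    return [name for name, (lo, hi) in STATE_RANGES[parameter].items()
            if lo <= value <= hi]

print(tag_state("locomotion_mps", 1.1))  # ['walking']
print(tag_state("lighting", 3.5))        # ['grey day', 'light']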

4.2. Fuzzy Logic Model


To make a decision on the emotional state of the user, based
upon the input data sets, we employ a fuzzy logic system,
specifically a Fuzzy Rule Based System (FRBS). Fuzzy set
theory was proposed by Zadeh [12]. The concept of fuzzy logic is
the ability to formalise approximate reasoning. An application
of this theory is the FRBS, which utilises the concept of fuzzy rules.
Such systems comprise two main features: an inference
engine and a knowledge base. Figure 5 shows an illustration of a
generic FRBS system. Mamdani and Takagi-Sugeno-Kang
(TSK) FRBS were considered in this study, with the latter (TSK)
being implemented as the main FRBS in this paper. Our initial
test FRBS is implemented using the Matlab FIS environment.
The Matlab Fuzzy Logic toolbox provides a useful tool for the
initial realisation of the fuzzy model.
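As a concrete, hand-evaluated illustration of the TSK idea (a toy stand-in for the Matlab FIS model described above, not the actual rule base), the following Python sketch fuzzifies two inputs with triangular membership functions, combines rule firing strengths with the min operator for the fuzzy AND, and defuzzifies to a crisp E-state score as the firing-strength-weighted average of constant rule outputs. All membership parameters, rules and output constants are invented for the illustration.

def tri(x, a, b, c):
    # Triangular membership function with feet at a and c and peak at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def estimate_e_state(velocity_mps, temperature_c):
    # Fuzzify the inputs (membership parameters are invented for illustration).
    resting = tri(velocity_mps, -0.5, 0.0, 0.8)
    running = tri(velocity_mps, 2.0, 3.5, 5.0)
    cold = tri(temperature_c, 10.0, 15.0, 20.0)
    hot = tri(temperature_c, 24.0, 28.0, 34.0)

    # Zero-order TSK rules: IF <antecedent> THEN e_state = constant.
    # The fuzzy AND is taken as the minimum of the memberships.
    rules = [
        (min(running, hot), 8.0),   # exercising in the heat -> upbeat region
        (min(resting, cold), 3.0),  # resting in the cold    -> subdued region
        (resting, 5.0),             # resting alone          -> neutral region
    ]

    # Defuzzification: firing-strength-weighted average of the rule outputs,
    # giving a crisp score on an assumed 0-10 E-state scale.
    total = sum(w for w, _ in rules)
    return sum(w * out for w, out in rules) / total if total else 5.0

print(round(estimate_e_state(3.0, 28.0), 2))  # high score: energetic context
print(round(estimate_e_state(0.2, 16.0), 2))  # lower score: resting in the cold

A Mamdani variant would instead attach fuzzy output sets to the rules and defuzzify, for example, by centroid; the TSK form is shown here because it is the one adopted in the paper.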

Knowledgebase

Input
Decision
Fuzzification Making Defuzzification
logic

Output
Figure 5: Schematic of FRBS

A fuzzy set L defined on a Universe of discourse U may be


characterised by a membership function μL(x) which takes values
in the interval [0, 1]. For this case the range is based on
measurements taken for a single subject. The following figures
define how the input parameters are configured in the fuzzy Figure 7: Motion Capture with Qualisys
logic model.
The Qualisys is a high-end motion capture system which uses an
The definition of membership of locomotive states is shown in array of infra-red detection cameras to track reflective markers
Figure 8 and was obtained by empirical observations and that can be placed on the subject of interest. Motion vectors are
experimentation with motion detection using both the Wiimote plotted for each marker on the subject and these can be recorded
and a Qualisys system to capture motion. The Wiimote and visualised in real-time. Detailed information of the Qualisys
controller was selected for this study due to its suitability to system can be found at www.qualisys.com.

As can be seen, the Qualisys system is more cumbersome than
the Wiimote and actually requires a number of cameras to track
motion across a very fine range. However, the Wiimote is
much less intrusive and can be attached to a belt or put in a
pocket. Furthermore, provided the Wiimote subject remains
within the transmission radius of the Bluetooth transceiver
(either 10 meters or 100 meters in laboratory conditions), the
user will have complete freedom of movement. This makes the
Wiimote not only better for recording natural movement, but
much more practical for deployment into real-world scenarios,
such as playlist generation.

Figure 10: Lighting Set

Figure 8: Fuzzy Locomotion Set

Ambient temperature is illustrated in Figure 9 and membership


criteria were gained by measuring a range of location conditions
and consulting average temperature conditions in the UK. It is
expected that for implementation in other countries, this kind of
contextual information is subjective and would be defined at
initialisation.
Figure 11: Weather Conditions Set

These factors are combined as inputs to the FRBS system to


allow an output of emotional state estimation as can be seen in
Figure 12.

Figure 9: Temperature Set

The set of lighting condition classification is shown in Figure 10


and possible weather conditions set is in Figure 11. Again,
membership criteria are based upon empirical observations.

Figure 12: Suggested TSK-type FRBS
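The fuzzy sets sketched in Figures 8 to 11 are built from triangular or trapezoidal membership functions. A minimal sketch of how such a triangular membership could be evaluated for the temperature input follows; the breakpoints are illustrative and only loosely follow Table 3, they are not the authors' actual sets.

# Sketch of triangular fuzzy membership functions for the temperature input.
# Breakpoints are illustrative and only loosely follow Table 3; the actual
# sets were built in the Matlab Fuzzy Logic Toolbox.

def triangular(x, a, b, c):
    """Membership in a triangular fuzzy set rising from a to b and falling to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def temperature_memberships(t_celsius):
    return {
        "cold": triangular(t_celsius, 10, 16, 22),
        "warm": triangular(t_celsius, 16, 22, 28),
        "hot":  triangular(t_celsius, 22, 28, 34),
    }

if __name__ == "__main__":
    print(temperature_memberships(20))  # partly 'cold', mostly 'warm'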

The current rule base has been derived from the expert or suggested emotional state of a subject for various states of inputs. Input ranges for locomotion and temperature have been determined experimentally, while lighting and weather are based on an incremental scheme. An initial triangular or trapezium membership function was chosen, and further research on the rule base, the fuzzy set distribution and the membership functions will provide refinement of the system performance.
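For illustration, a heavily simplified zero-order TSK-style evaluation over two example rules is sketched below; the membership degrees, the rule consequents and the use of min() as the AND operator are assumptions, as the actual rule base is defined in the Matlab FIS.

# Sketch of a zero-order TSK-style inference over two illustrative rules.
# Membership degrees, rule consequents and the use of min() as the AND
# operator are assumptions; the real rule base is defined in the Matlab FIS.

def tsk_output(rules, memberships):
    """Weighted average of rule consequents, weighted by rule firing strength."""
    weights, outputs = [], []
    for antecedents, consequent in rules:
        w = min(memberships[a] for a in antecedents)   # AND of the antecedent degrees
        weights.append(w)
        outputs.append(consequent)
    total = sum(weights)
    return sum(w * o for w, o in zip(weights, outputs)) / total if total else 0.0

if __name__ == "__main__":
    memberships = {"walking": 0.8, "hot": 0.6, "stationary": 0.2, "cold": 0.0}
    rules = [
        (("walking", "hot"), 6.0),        # e.g. "IF walking AND hot THEN E-state = 6"
        (("stationary", "cold"), 2.0),
    ]
    print(round(tsk_output(rules, memberships), 2))   # 6.0, dominated by the first rule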

4.3. Defining Output of Emotional State

Currently a single output state is generated from the fuzzy logic model, which is used to indicate the predicted emotional state of the user based upon the input parameters. We define a set of 10 emotional states or outcomes which are broadly defined as having membership of 5 categories. This can be seen in Table 6, and a schematic of the integration of output states is shown in Figure 13.

Table 6: Emotional States
Depressed   Unhappy   Neutral   Happy   Zoned
0-3         3-4       4-6       5-8     7-9

Figure 13: PlayList Organising Tool (PLOT)

5. Initial Results

5.1. The Wiimote as a Motion Device

The Qualisys motion capture system allowed comparative data capture of the three primary directions of subject motion. For this paper we are primarily interested in forward velocity. It is noted from the Qualisys results that a potential coupling is possible due to the placement of the Qualisys sensor. The data recordings from the Wiimote and the Qualisys system were conducted under the same conditions in order to fully investigate the comparative effectiveness of each device.

Results measuring motion using the Qualisys are provided in Figure 15 for the walking, jogging and running states. The graphs from the Qualisys plot forward motion, over a fixed distance, across time for each of the three tests. The fixed distance is covered over increasingly short times as the subject increases speed from walking to jogging and running. Interpretation of time comes from multiplying the number of samples by the fixed sampling rate used on the Qualisys for each test.

Figure 15: Motion Data from Qualisys

Figure 14: Wii Acceleration Curves for Motion

Comparative data from the Wiimote is presented in Figure 14 and demonstrates that the locomotive state can be simply established by correct interpretation of the accelerometer data over a fixed-rate sampling period of 10 Hz. The illustrations from the Wiimote present the velocity of the user over samples (time). As can be seen from the illustration, the states of walking, jogging and running are clearly identifiable.
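A minimal sketch of how such an interpretation step might look is given below; the velocity thresholds are hypothetical, and in the actual system the mapping is handled by the fuzzy locomotion set of Figure 8 rather than hard cut-offs.

# Sketch: classify the locomotive state from forward-velocity estimates sampled at 10 Hz.
# The thresholds are hypothetical; in the actual system the mapping is handled by
# the fuzzy locomotion set (Figure 8) rather than hard cut-offs.

def locomotive_state(velocities_mps):
    """Return a coarse state from the mean of a window of velocity estimates."""
    mean_v = sum(velocities_mps) / len(velocities_mps)
    if mean_v < 0.2:
        return "stationary"
    if mean_v < 2.0:
        return "walking"
    if mean_v < 3.5:
        return "jogging"
    return "running"

if __name__ == "__main__":
    window = [1.4, 1.6, 1.5, 1.3, 1.5]   # half a second of 10 Hz estimates (m/s)
    print(locomotive_state(window))       # walking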

5.2. Fuzzy Playlist Generation


We configured the system with a number of expert-informed parameters which related emotional state to a sample collection of input parameters such as locomotion and environmental factors. We carried these tests out over ten subjects to begin with, who were experienced at using portable digital music players in a variety of scenarios, and established the knowledge in the fuzzy system using the average of these tests. The scenario configurations used are given below in code, along with a textual description of each scenario; the average set of results relating to emotional state is presented in Figure 16. Although an average response is employed, there were strong correlations between the majority of the subjects and their perceived emotional state indicators, which indicates that the data recorded is reliable for those circumstances.

%1 Walking, temperature is hot, lighting is dark/grey and weather is light rain.
ip1=[1.5;28;2;2.5]
E(1)=evalfis(ip1, a)

%2 Stationary, temperature is cold, lighting is dark and weather is raining.
ip2=[0.1;14;1;0.5]
E(2)=evalfis(ip2, a)

%3 Stationary, temperature is warmish, lighting is brightening and weather is dry.
ip3=[0.1;21;9;9]
E(3)=evalfis(ip3, a)

%4 Running, temperature is hot, lighting is daylight/getting brighter and dry.
ip4=[3.5;30;5;9.5]
E(4)=evalfis(ip4, a)

%5 Walking, temperature is getting hot, lighting is dark and weather is drizzling.
ip5=[1.5;16;1;6]
E(5)=evalfis(ip5, a)

%6 Stationary, temperature is hot, lighting is grey and weather is dry.
ip6=[1.4;32;2.5;9.5]
E(6)=evalfis(ip6, a)

%7 Walking/Jogging, temperature is mild, daylight and it's dry.
ip7=[2;17;4.5;9]
E(7)=evalfis(ip7, a)

Emotions=floor(E*100)

Figure 16: Results of Emotion Scenario Indicator Testing

Additionally, we attached an emotional state range to each of the songs in our small database, to which the output emotional state can be correlated. Table 7 shows the E-state range which, through pilot testing, we attached to each song.

Table 7: Song Database with E-States
ID   Song                          E-state   E-state Median
0    One More Time (Radio Edit)    5-8       6.5
1    Love Unlimited                4-6       5
2    Over and Over                 7-9       8
3    Harvester of Sorrow           0-3       2
4    Comfortably Numb              3-4       3.5
5    Push The Button               7-9       8
6    Breathe                       0-3       1.5
7    Gimme All Your Lovin'         5-8       6.5

Table 8 shows the results of each of the seven experimental scenarios presented, along with the resulting playlist to be generated. The grade G in the playlist is determined by taking a simple Euclidean distance measurement of the form

G(p, q) = √((p − q)²)    (2)

from the song E-state median and the current E-state of the listener, based on the scenario. The playlist is shown as a ranked set of song ID numbers from the database for each E-state.

Table 8: Playlist Order for Test Scenarios
Scenario   E-state   Playlist order
1          4.3       1; 4; 0; 7; 3; 6; 2; 5
2          0         6; 3; 4; 1; 0; 7; 2; 5
3          6.8       0; 7; 2; 5; 1; 4; 3; 6
4          7.7       2; 5; 0; 7; 1; 4; 3; 6
5          3         4; 3; 6; 1; 0; 7; 2; 5
6          3.8       4; 1; 3; 6; 0; 7; 2; 5
7          6.5       0; 7; 1; 2; 5; 4; 3; 6
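A minimal sketch of the ranking step in Equation (2), using the E-state medians from Table 7, is given below; the tie-breaking order for equal distances is an assumption, since the paper does not state it.

# Sketch: rank the Table 7 songs by the Equation (2) distance between their
# E-state median and the listener's current E-state. Tie-breaking by song ID
# is an assumption; the paper does not specify how equal distances are ordered.

E_STATE_MEDIANS = {0: 6.5, 1: 5.0, 2: 8.0, 3: 2.0, 4: 3.5, 5: 8.0, 6: 1.5, 7: 6.5}

def playlist_for(e_state):
    """Return song IDs ordered by closeness of their E-state median to e_state."""
    grade = lambda song_id: abs(E_STATE_MEDIANS[song_id] - e_state)   # sqrt((p - q)^2)
    return sorted(E_STATE_MEDIANS, key=lambda sid: (grade(sid), sid))

if __name__ == "__main__":
    print(playlist_for(6.8))   # scenario 3: [0, 7, 2, 5, 1, 4, 3, 6], as in Table 8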


6. Conclusions & Future Work

We have demonstrated that the Wiimote can function as a highly useful instrument for measuring forms of human motion, and in future we plan to assess the functionality of other low-cost motion devices, such as the accelerometers which have been integrated into mobile music players like the iPhone and iPod Touch. These devices, and the Wiimote functionality, can be further ratified by more comparisons with data extracted using high-end motion capture hardware, such as the Qualisys mentioned earlier.

A way to attain a more reliable, and possibly multiple-faceted, emotional state indicator would be to develop a self-learning algorithm to provide adaptive context list generation. This is something which we intend to pursue in the near future, as it will provide a much more customised interpretation of the user's emotional states and musical preferences. Given that emotion is such a personal and almost unique experience for each individual, this is high on the list of priorities for future developments. The current system operation is to obtain an estimate of the emotional state of the subject and to use this as a factor for the subject's playlist organisation. The list update is based on a predetermined time update interval. The current E-state is used to modify the playlist. We have used a single output value based on a Multiple Input Single Output (MISO) model. Further research based on a Multiple Input Multiple Output (MIMO) model would allow a multiple meta-data selection process based on, for example, the beat or tempo of the music file.

The initial FRBS has been implemented and provides a baseline model for the development of a more refined model. Fuzzy sets and membership functions need to be refined to allow further optimisation of the system, extending the range of inputs to allow a far more flexible system. Of interest is the modification of the suggested rule base, in particular the ability to modify the initial rule base with rules more selective of the subjects' individual emotional states. The approaches being investigated are based on a self-learning strategy:

1. Modifying the fuzzy set definitions. Modification of the fuzzy sets implies potential issues with the fundamental nature of the linguistic meaning. Mamdani suggested that a change in the fuzzy set definitions should be avoided, although minor or small modifications may be possible [13].

2. Modifying the set of rules in the rule base. A fuzzy relation is used to describe each rule in the rule base of the FRBS. The general rule expression is:

Rn: IF m is Mn AND t is Tn AND l is Ln AND w is Wn THEN u is Un

where Rn is the fuzzy relation of rule n, and M, T, L, W and U are the linguistic labels assigned to each variable of rule n. The general relation R is constructed as the union of the individual relations:

R = ∪_{i=1}^{n} R_i .    (3)

Using Zadeh's compositional rule of inference, and assuming for this case a Mamdani model, the output fuzzy set is then

u_o = (m_o × t_o × l_o × w_o) ∘ R .    (4)

It is therefore feasible to modify the rule base based on factors such as the individual's emotional preference, or on additional inputs such as heart rate. The system would then potentially provide an emotional state ranking that is more reflective of that individual.

Another point for development is to attach a number of E-states to each song in the database, since certain types of music may be suitable for two or more E-states. Through further investigation, we propose to develop a primary emotional rating and then a secondary and possibly a third E-state. Furthermore, the E-states suitable for each song can be investigated until a much deeper, more accurate estimation of the range of suitable E-states can be determined.

The system presented at this stage is more limited than the system outlined in Section 1 of this paper. As mentioned, this was due to equipment and time constraints; future work would therefore include employing further sensor systems as inputs to the playlist generation system, particularly those which are concerned with input from the listener rather than the environment. It can be argued that inputs from the environment are factors which, we predict, are likely to influence the listener's emotional state. However, deriving measurements and indicators directly from the listener suggests a much more robust and reliable evaluation of the emotional state.

Acknowledgments

The authors wish to thank Sue Taylor and Janet Hayes of Sports and Exercise Sciences at Glyndŵr University.

References

[1] Aucouturier, J-J. & Pachet, F., Scaling up music playlist generation, Proceedings of the IEEE International Conference on Multimedia Expo, Lausanne, Switzerland (2002).

[2] French, J.C. & Hauver, D.B., Flycasting: On the Fly Broadcasting, Joint DELOS-NSF International Workshop on Personalization and Recommender Systems in Digital Libraries, Dublin, Ireland, 18th–20th June (2001).

[3] Platt, J.C., Burges, C.J.C., Swenson, S., Weare, C., & Zheng, A., Learning a Gaussian process prior for automatically generating music playlists, Advances in Neural Information Processing Systems 14, pp. 1425–1432 (2002).

[4] Gasser, M., Pampalk, E., & Tomitsch, M., A Content-Based User-Feedback Driven Playlist Generator and its Evaluation in a Real-World Scenario, Audio Mostly 2007, Ilmenau, Germany, 27th–28th September (2007).

[5] Logan, B., Content-Based Playlist Generation: Exploratory Experiments, 3rd International Conference on Music Information Retrieval (ISMIR 2002) (2002).

[6] Kukharchik, P., Martynov, D., & Kheidorov, I., Indexing and retrieval scheme for content-based search in audio databases, Audio Mostly 2007, Ilmenau, Germany, 27th–28th September (2007).

[7] Reynolds, G., Barry, D., Burke, T., & Coyle, E., Towards a Personal Automatic Music Playlist Generation Algorithm: The Need for Contextual Information, Audio Mostly 2007, Ilmenau, Germany, 27th–28th September (2007).

[8] Cunningham, S., Bergen, H., & Grout, V., A Note on Content-Based Collaborative Filtering of Music, Proceedings of IADIS International Conference on WWW/Internet, Murcia, Spain, 5th–8th October (2006).

[9] Meyers, O.C., A Mood-Based Music Classification and Exploration System, MS Thesis, Massachusetts Institute of Technology (MIT), USA (2007).

[10] Tamminen, S., Oulasvirta, A., Toiskallio, K. & Kankainen, A., Understanding mobile contexts, Personal and Ubiquitous Computing, 8(2), 135–143 (2004).

[11] Maurizio, V. & Samuele, S., Low-cost accelerometers for physics experiments, European Journal of Physics, 28, pp. 781–787, Institute of Physics (2007).

[12] Zadeh, L.A., Fuzzy logic and soft computing: Issues, contentions and perspectives, Proceedings of the 3rd International Conference on Fuzzy Logic, Neural Networks and Soft Computing, Iizuka, Japan (1994).

[13] Mamdani, E.H., Application of fuzzy algorithms for simple dynamic process, Proceedings of the IEE, 121, pp. 1585–1588 (1974).

A Musical Instrument based on
3D Data and Volume Sonification Techniques
Lars Stockmann, Axel Berndt, Niklas Röber
Department of Simulation and Graphics
Otto-von-Guericke University of Magdeburg
Lars.Stockmann@Email.de
{aberndt|niklas}@isg.cs.uni-magdeburg.de

Abstract. Musical expressions are often associated with physical gestures and movements, which represents
the traditional approach of playing musical instruments. Varying the strength of a keystroke on the piano results
in a corresponding change in loudness. Computer-based music instruments often miss this important aspect,
which often results in a certain distance between the player, his instrument and the performance.
In our approach for a computer-based musical instrument, we use a system that provides methods for an inter-
active auditory exploration of 3D volumetric data sets, and discuss how such an instrument can take advantage
of this music-based data exploration. This includes the development of two interaction metaphors for musical
events and structures, which allows the mapping of human gestures onto live performances of music.

1 Introduction

Over the past years, computers have contributed to musical performances in several ways. Already in the late 1960s, computers were employed to control analogue instruments. The GROOVE synthesizer developed by Max Mathews was one of the first computer controlled analogue synthesizers [14]. Since the introduction of the MIDI standard as a communication protocol, computers have been used as a means for conduction and arrangement in many music productions, but also as a bridge between input devices and synthesizers. In this context, computers have also been used to augment a performance by adding algorithmically generated notes that fit musical structures, as for example in Music Mouse [19] or MIDI composing software like Bars & Pipes 1. Intelligent instruments like Music Mouse facilitate an easier, more intuitive approach to the creation of music for the musically inexperienced. At the same time they offer new ways of creating music – even for professional musicians.

In today's productions, external synthesizers are often omitted. Their place is taken by virtual instruments, such as Native Instruments' 2 simulation of the B3 organ or the virtual acoustic and electric piano. Even standard consumer hardware is powerful enough for their deployment, and they are used to imitate any kind of instrument in realtime. In contrast to the achievements in sound synthesis, input devices other than MIDI keyboards are still not common in music production, although recently a new research area solely focussing on new musical interaction methods has been established. One example 3 that is planned to be commercially available in the near future is the reacTable system, which is described in [10, 12]. Like Crevoisier et al., who developed an instrument called Sound Rose (see [6]), Jordà et al. use a tangible interface as a new intuitive way for live music performances.

Computer-based instruments are designed in a way that a musical controller generates data that is passed to a computer and therein mapped to a single acoustic stimulus of a certain pitch and volume, or to parameters that somehow control an algorithmic composition. The advantage of this approach is that virtually any type of data can be used as input for these instruments. The mapping of arbitrary data to sound (including music) is part of another very important area of research, namely sonification. It is often used in the development of auditory display systems, and employed to acoustically convey scientific data. While for a long time sonification has merely been a part of visualization research, the techniques which were outlined by Gregory Kramer (see [13]) have been developed and successively improved to provide an enhancement, and at places even a superior alternative, to visual representations in science (e.g. [7]), especially when it comes to the visualization of the inner and outer structures of 3D volumetric data sets. The auditory channel can be used to reduce the load of information that otherwise has to be absorbed by the visual channel alone.

1 Bars & Pipes: www.alfred-j-faust.de/bp/MAIN.html
2 Native Instruments: www.nativeinstruments.de
3 An overview of some musical controllers can be found at www-ccrma.stanford.edu/~serafin/NBF/Newport.htm


The main challenge for sonification research is to find an expressive, intuitive, and comprehensible mapping from the data domain towards sound.

In our sonification system, we employ spatial interactions to facilitate an intuitive method for an auditory exploration of 3D volumetric data sets. It uses a strictly functional mapping of data to complex sounds, based on differences in pitch and volume. This system is the basis for a novel computer-based instrument that can be used without musical experience. The instrument is designed around two metaphors: the Tone Wall metaphor allows a performer to directly generate a melody, while the Harmonic Field is used for a computer-aided accompaniment. Both techniques can be used at the same time. The instrument produces diverse sounds and allows for a highly interactive performance. It can be shown that spatial interactions hold great potential for the use in computer-based instruments.

The paper is organized as follows: After an introduction to the sonification of volumetric data sets in Section 2, we advance by presenting our sonification system in Section 2.1. This includes some technical details regarding our implementation. We then elaborate in Section 2.2 on how sonification and computer-based instruments connect, and how live music performances can benefit from an instrument that uses our sonification system. In Section 3 we describe how musical data can be derived from spatial gestures in volumetric data sets. The Tone Wall metaphor (Section 3.1) specifies the pitch, loudness, and timbre space for melodic purposes. The Harmonic Field (Section 3.2) describes how volume data can be used to represent harmonies, broken chord play, and musical textures. Section 3.3 is concerned with a combination of both concepts for the presentation of a one-man polyphonic performance. Finally, the results are discussed in Section 3.4, which also includes possible improvements for further research.

2 Volume Data Sonification

Data sonification is an underdeveloped, but growing field of research. In this section we describe how sonification can be applied to acoustically describe 3D volume data sets. Before we describe our method, we discuss several advantages that make sonification techniques at times superior to a more classic visual examination and presentation of scientific data sets. Examples are monitoring applications, or any type of unfocused operations and processes. The generated acoustic stimuli can be heard without paying direct attention. This yields an improved mobility. Furthermore, Kristine Jørgensen states that the presence of sound increases attention and eases the perception by intentionally utilizing channel redundancy [11, 8]. A simple example is a flash light that is augmented with a sound while flashing. The proper use of acoustic stimuli in combination with the visual representation also generates a deeper sense of immersion, especially in interactive 3D environments [17]. Gregory Kramer stated that 'spatialized sound can, with limitations, be used to [...] represent three-dimensional volumetric data' [13]. One reason is that spatialized sound provides a direct mapping to the physical 3D space.

3D volume data occurs in countless fields of research and is used to represent the inner and outer structure of objects or materials in a voxel representation. To find an expressive mapping of these voxels to sound is one of the main challenges when designing a sonification system.

Since the development of powerful graphics accelerators, there has been much research on finding a good mapping in the visualization domain, but only a few attempts exist to exploit the possibilities of sonification to convey 3D volume data. Minghim and Forrest have suggested methods like the "Volume Scan Process", in which the density inside a volume probe is mapped to the pitch of a generated tone [16]. David Rossiter and Wai-Yin Ng traverse the voxels of a 3D volume and map their values to different instrument timbres, amplitudes and pitches [18]. Both systems are controlled through a quite simple mouse/keyboard interface. However, for the sonification of 3D volume data, interaction must not be seen as a requirement, but as a key aspect. In fact, it is the second most important aspect after the mapping. A direct exploration of the data by, e.g., moving the hand through an interactive 3D environment can provide the user with a better understanding of extent or local anomalies. Both examples of related work lack this ability of a responsive user interface for 3D input, like a realtime tracking system, or need to compile the audio data before one can listen to it. The next passage outlines our sonification system, which focuses on direct interactions and an expressive mapping of the inner structure of 3D volume data.

2.1 Spatial Exploration of Volume Data

As mentioned before, a sonification system can greatly benefit from tracking devices that allow a direct exploration of the volume data. In the visualization domain, this is generally done using a certain viewpoint metaphor, such as the ones presented by Colin Ware and Steven Osborne [23]. With respect to data sonification, the eye-in-hand metaphor can be easily transformed into the above described volume probe. Instead of a spherical or cubical shape, our approach uses the metaphor of a chime rod, as illustrated in Figure 1.


Figure 1: 3D Volume scan through chime rod

The rod can be moved freely through the 3D volume, and is controlled by an interactor that is connected to a 3D tracking device. The advantage of using a rod instead of a spherical or cubical shape is that the pitch of a tone can be directly associated with a position along the rod. Together with an amplitude modeling depending on the density value at a certain position, a complex tone is generated. This allows for an intuitive exploration of the inner structures of the volume data.

Unfortunately, the system could not be implemented using a MIDI-controlled synthesizer. Instead, we devised our own sound synthesis. A sound is rendered depending on the density distribution of the volume that is in close vicinity of the chime rod. The listener's head is orientation-tracked, and the generated sound is spatialized to provide an additional localization cue for a more immersive experience.

The realtime tracking is achieved using a Polhemus FASTRAK that allows four sensors to be connected. The input data is processed in the client PC which, besides the sonification and sound rendering, also performs the visualization. Figure 2 shows the setup of our sonification system.

Figure 2: Setup of the sonification system

Using sonification and visualization at the same time does not only induce the aforementioned redundancy that eases perception of the data by dispensing the information on two channels, but also allows multivariate data to be presented directly, without the need to switch between different representations. However, it is a crucial aspect of the system that the visualization, which requires powerful hardware, does not interfere with the audio streaming, even if the system is not equipped with the latest graphics accelerator. Thus, we make great use of multi-threading, running the visualization on a low priority to ensure that the audio stream is never interrupted. A scheme of the whole sonification system is illustrated in Figure 3.

Figure 3: Schematics of the sonification system (hardware: input and tracking devices, sound card; software: a main loop receiving position, angle and button events and driving the sonification and thread management; sound synthesis, spectral shaping and spatialization turn the 3D volume data into the audio stream sent to the output device)

For the sound processing and output in a multi-threading environment we use an audio API that is specially designed for realtime audio applications [20], which we revised for our purposes. The results were promising and bore the idea to introduce music elements into the system. In the next section we elaborate on how sonification methods and computer-based instruments are connected, and show how our system can contribute to the research field of the latter.

2.2 Volume Sonification in a Musical Environment

Hunt and Hermann, who advance the research of model-based sonification, impose interaction with the physical world to be a natural cause for acoustic feedback [9]. This feedback is used to gather information about an object. E.g., a bottle of water that is shaken reveals information about its contents. This cause-and-effect chain can not only be used to convey abstract information like the number of messages in the e-mail inbox of a mobile phone [24], but is also a powerful paradigm for computer-based instruments. In the broadest sense, one could consider these instruments as merely a special case of sonification: the sonification of interaction itself. In a musical improvisation, interaction can be seen as an expression of emotion and mood.


A computer that is asked to improvise could, of course, not use mood or emotion as the basis for its performance, but arbitrary, or specially arranged, data. Using music to convey data can have some advantages. Often, sonification suffers from annoyance. Paul Vickers and Bennett Hogg state that 'Sonification designers concentrated more on building systems and less on those systems' æsthetic qualities' [22]. Acoustic stimuli that abide by the rules of music are generally more appealing to the listener than sounds that use arbitrary pitch and timbre. It may even stimulate the interactive exploration of data, as the listener self-evidently becomes a music performer by interacting with the dataset. She or he will try to achieve the most pleasant musical result. A distinct variation in the data means a distinct variation in music. Its location can be memorized more easily when the performer 'explores it intentionally', because she or he feels that this particular variation fits best in the current music progression.

However, finding a meaningful mapping of arbitrary multi-dimensional data to music must be considered highly challenging. Some approaches can be found in projects like the Cluster Data Sonification or the Solar Songs by Marty Quinn. In his Image Music 4 sonification, the user can interactively explore a 2D image through music. However, nothing has been done yet in the domain of 3D volume data. Furthermore, the said examples are not intended for live music performances. The interaction is limited to mouse input that does not meet the high responsiveness demanded by a music performer.

4 Design Rhythmics Sonification Research Lab www.drsrl.com/

Besides the mapping, the method for interacting with the system is crucial for its efficiency. Like the aforementioned sonification systems, computer-based instruments mostly use either mouse/keyboard interaction, or are designed to be played with MIDI keyboards. These demand a certain skill in order to be adequately handled. Systems using the elements of direct interaction as a means for acoustic excitation are scarce. Instruments like the Fractal Composer introduced by Chapel, for example, provide a mouse-driven graphical user interface [5]. The system composes music in realtime using the MIDI protocol, depending on parameters which are set by the user. She or he has no direct control over the melody or harmony that is generated. This induces a big distance between the performer and the instrument. She or he can only influence the composition on a fairly high level. These systems are referred to as interactive instruments [4] or active instruments [5]. In contrast, the reacTable and the Sound Rose mentioned earlier are collaborative instruments that use direct interaction.

Indeed, the tangible interface is very intuitive, though these attempts are momentarily limited to two-dimensional space. Besides the aforementioned reacTable and Sound Rose, the "Morph Table" system that uses the morphing techniques presented in [25] is a good example of how this interface can be used for music generation [2]. However, the music is also controlled on a rather high level. The system generates transitions between a source and a target pattern, which is applied to precomposed melodies and rhythms. It is not possible to create a melody directly. Furthermore, it is limited to two dimensions.

Chadabe describes a system called Solo that uses modified theremins (see [21]) as 3D input devices to guide the system [3]. Again, the melody is generated algorithmically. The performer controls variables like tempo and timbre. The computer is used for sound synthesis. Thus, this approach is similar to those described in [5] and [2], as the performer has only a global influence on the generated music. However, we think that 3D input devices can be used to intuitively control both melody and accompaniment, where the former is generated through a direct mapping of position to pitch, while the latter could benefit from semi-automatic composition or precomposed elements. This not only opens the path for diverse improvisations, but can also be considered more immersive than just influencing certain aspects of music that is otherwise algorithmically generated.

Our system for interactive exploration of 3D volume data is applicable in that it provides the necessary degrees of freedom to have both aspects in one instrument, as well as the responsiveness demanded for a live performance. This makes it possible to develop metaphors for music and sound generation. Two are described in the next section.

3 Volumetric Music

Along the lines of traditional musical instruments, computer-based musical instruments have to find intuitive performative metaphors for musical events. A typical example: To strike one key on the piano means playing its corresponding pitch. The keystroke velocity regulates its loudness. The following sections will describe and discuss this mapping of spatial gestures to musical events and structures, in analogy to the previously discussed image and volume data sonification techniques. The volumetric data thereby represents the medium of interaction and defines the basis for the music processing.
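As a point of reference for these metaphors, the following minimal sketch illustrates the rod-based mapping of Section 2.1: voxel densities sampled along the rod set the amplitudes of partials whose pitch rises with the position on the rod. The data structures, rod placement and pitch range are assumed for illustration and are not the authors' implementation.

# Sketch of the Section 2.1 chime-rod mapping: densities sampled along the rod
# control the amplitudes of tones whose pitch rises with the position on the rod.
# The voxel grid, rod placement and pitch range are assumed for illustration.

def sample_along_rod(volume, start, end, steps):
    """Sample voxel densities at evenly spaced points between start and end (3D)."""
    densities = []
    for i in range(steps):
        t = i / (steps - 1)
        x, y, z = (int(round(s + t * (e - s))) for s, e in zip(start, end))
        densities.append(volume[x][y][z])
    return densities

def rod_to_partials(densities, low_hz=110.0, high_hz=880.0):
    """Map each sample to (frequency, amplitude): position -> pitch, density -> loudness."""
    n = len(densities)
    return [(low_hz + i / (n - 1) * (high_hz - low_hz), d) for i, d in enumerate(densities)]

if __name__ == "__main__":
    # Tiny 3x3x3 volume with densities in [0, 1]; the rod runs along the y axis.
    vol = [[[0.0, 0.2, 0.1], [0.5, 0.9, 0.4], [0.1, 0.3, 0.0]]] * 3
    print(rod_to_partials(sample_along_rod(vol, (1, 0, 1), (1, 2, 1), steps=3)))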


3.1 Tone Wall

A question that arises is: How can different tones be represented in the 3D space? A very intuitive way is a mapping along the vertical axis: low pitches go down, high pitches go up.

But an expressive performance necessitates more than the on/off switching of simple tones. It must be possible to form them. One of the most important means for this is dynamics (i.e., loudness). In correspondence to the keystroke velocity on the piano, we consider the tone space as a wall. The deeper the performer/interactor punches through that virtual wall (in z-direction), the louder the tone will be played. Short punches produce staccato notes, whereas to hold a tone, the interactor remains in the wall for as long as desired.

An additional parameter is the punch velocity, which affects the attack and onset behavior of the tone. A fast punch causes a short attack (a very direct beginning of the tone) and a more percussive onset; a punch performed at a slow velocity results in a softer tone beginning, independent of its dynamic level.

Thus, the y- and z-axis open up the complete bandwidth of expressive tone forming known from keyboard instruments, like the piano, and the punch velocity is a new means to specify details of the tone beginning. However, it would be unwise not to additionally exploit the potential lying in the x-axis. Many instruments allow the player to vary their timbre to a certain extent, for which the x-axis is predestined. Different timbres can be blended from left to right, e.g. from a very dark sinusoidal waveform over relaxed, clear sound characteristics up to brilliant and very shrill sounds. There are no limitations in sound design in comparison to traditional musical instruments. The complete Tone Wall concept is illustrated in Figure 4.

Figure 4: Tone Wall (pitch along the vertical axis, dynamics with the punch depth, timbre from left to right, and the punch velocity shaping the onset)

For more timbral variance and freedom, it is possible to fill the Tone Wall with volumetric data of varying density. It can be employed as static behavior or react on interactions, e.g. like particles that are charged with kinetic energy when they are hit by the interactor device. Due to the freedom to apply any sound synthesis method, the Tone Wall interface is not restricted to pitch-based melodic structures, but is also open to more complex sound structures and noises for contemporary music styles.

3.2 Harmonic Field

In contrast to the Tone Wall concept, which specifies an interface to create basic musical events, the Harmonic Field is already a pre-composed musical environment, which can be freely explored by the performer. It defines a number of regions (as illustrated in Figure 5) with their own harmonic content, e.g. a C major harmony in the grey area (harmony 1), a minor in the yellow (harmony 2), a cluster chord in the area of harmony 5, and so on. The performer can move his focus via a head-tracking interaction over the regions to change the harmony that is currently played; he literally looks to the harmonies to play them.

Each harmonic area defines a density gain towards the peak in its center. The density allocation can, of course, also feature more complex shapes, and define multiple peaks, holes and hard surfaces. The values can be used for fading techniques, such as those described in [1]; high density can be implemented with a louder volume than low density. But the Harmonic Field is not restricted to static tones only. Chords can be ornamented by arpeggiated figures, and compositional textures can be defined. Instead of using a simple in/out fading, the texture density can be adapted: very simple, transparent textures at lower density areas and figures rich in detail at higher densities.
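A minimal sketch of the region lookup and density-based gain just described is given below; the region shapes and the linear gain curve are assumptions, whereas the real Harmonic Field uses precomposed volumetric density data.

# Sketch of the Harmonic Field lookup described above: the point the performer
# looks at selects a harmonic region, and the region's density at that point
# gives a playback gain. Region shapes and the gain curve are assumed.

import math

REGIONS = [
    {"name": "harmony 1", "chord": ["C", "E", "G"], "centre": (0.0, 0.0, 0.0), "radius": 1.0},
    {"name": "harmony 2", "chord": ["A", "C", "E"], "centre": (1.5, 0.0, 0.0), "radius": 1.0},
]

def region_at(point):
    """Return the region containing the point (closest centre within its radius), or None."""
    best = None
    for r in REGIONS:
        d = math.dist(point, r["centre"])
        if d <= r["radius"] and (best is None or d < best[0]):
            best = (d, r)
    return best and best[1]

def gain_at(point, region):
    """Density rises towards the region centre; use it directly as a gain in [0, 1]."""
    d = math.dist(point, region["centre"])
    return max(0.0, 1.0 - d / region["radius"])

if __name__ == "__main__":
    focus = (0.4, 0.0, 0.0)              # point selected by the head-tracked gaze
    r = region_at(focus)
    print(r["name"], r["chord"], round(gain_at(focus, r), 2))   # harmony 1 ['C', 'E', 'G'] 0.6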


Since harmonic areas can overlap, we applied a number of transition techniques other than fading, which does not satisfy in every situation. Held chords are transitioned part by part. Each part moves stepwise towards its targeted pitch, where the steps are chosen according to the underlying scale of the harmony (e.g., major, minor, or chromatic scale). Instead of a stepwise movement, the transition can also be done by a linear glissando. The transitional pitch is an interpolation of the pitches of each harmonic area according to their density weightings. The goal pitch is reached when the old harmonic area is left, or a hole with zero density is found. With complex cluster-like harmonies, the resulting metrumless clouds do wake associations with György Ligeti's Clocks and Clouds for women's choir and orchestra.

Figure 5: Harmonic Field (overlapping regions harmony 1 to harmony 5, explored along the performer's line of sight)

Compositional textures, in any respect, are not metrumless. They are well-defined sequences of pitches/events in a certain tempo and rhythm. In the case of different tempi, the transitional tempo is an interpolation depending on the density weighting. Since the textures are repetitive, the morphing techniques of Wooller and Brown [25], and the interpolation technique of Mathews and Rosler [15], can be applied to combine the figural material.

However, generative textures were not included at the current state. Therefore, transition techniques for generative algorithms have to be developed and are classified as future work.

3.3 Poly Field

When performing music, it is always desirable to be able to handle both melodic and harmonic data simultaneously. Thus, both interfaces, the Tone Wall and the Harmonic Field, have to be accessible and controllable by one person at the same time.

This is achieved by employing two input devices which can be controlled independently. The user plays melodic gestures on the Tone Wall using hand and arm gestures, and thereby controls the harmonic progression on the Harmonic Field through head gestures and a simple look. Furthermore, tilting the head can be used to steer timbral aspects of the Harmonic Field play.

Since it turned out to be of some difficulty to play melodic figures that harmonize with the Harmonic Field play, a further quantization is implemented on the Tone Wall: the scale that is playable on the Tone Wall is matched to the current harmonic base, and the punch height is quantized to this scale.

3.4 Discussion

As with all musical instruments, it is necessary to invest a certain amount of practice to learn the intuition and motoric sensitiveness for a confident expressive play. The intuitive correspondence between gestural and musical events, especially in the case of the Tone Wall interface, turned out to be very supportive for a steep training curve. Nonetheless, a few practical issues have to be discussed.

The interaction with the Tone Wall is subject to a motoric limitation; it is quite exhausting to create fast-paced melodies with a proper play over a long period of time. Tracking latencies (ranging between 8–10 ms) and sampling artifacts (the interaction sample rate is 60 Hz with two interactors) also slightly interfere with the play and the possible speed of interaction.

Because of the absence of any visual reference points, it is at times difficult to meet the intended pitches. A calibration, according to the size of the performer, can lower this problem; his body can provide several reference points.

For playing melodic intervals, the interactor has to leave the wall, jump over the unwanted pitches, and punch back into it. Moving the interactor within the wall would trigger pitches in-between. Thus, melodic intervals are always adherent with short pauses. A legato articulation is not possible within this approach. Therefore, an interactor speed dependency has to be incorporated: a pitch is only played if the interactor's velocity is below a certain threshold. Pitches can be skipped by faster movements even within the wall. Since this raises the problem of creating fast-paced melodies, this mode has to be detachable, e.g. by a button on the hand interactor.
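A minimal sketch of the two refinements described above (quantizing the punch height to the scale of the current harmony, and suppressing pitches while the interactor moves fast) follows; the scale, wall height and threshold values are assumptions for illustration.

# Sketch of the Poly Field refinements described above: the punch height is
# quantized to the scale of the current harmony, and a pitch is only triggered
# when the interactor moves slowly enough. Scale, range and threshold are assumed.

C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]   # MIDI pitches of one octave

def quantized_pitch(height, wall_height, scale=C_MAJOR):
    """Map a punch height in [0, wall_height] to the nearest pitch of the scale."""
    idx = round(height / wall_height * (len(scale) - 1))
    return scale[int(idx)]

def maybe_trigger(height, speed, wall_height=2.0, speed_threshold=0.5):
    """Return a pitch to play, or None if the interactor is moving too fast."""
    if speed > speed_threshold:
        return None                      # fast movement skips pitches inside the wall
    return quantized_pitch(height, wall_height)

if __name__ == "__main__":
    print(maybe_trigger(height=1.1, speed=0.2))   # 67 (G4): slow punch plays a scale tone
    print(maybe_trigger(height=1.1, speed=1.5))   # None: the pitch is skipped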


The same approach could be useful to reduce the wah-effect when playing a pitch. The punch always hits the low-dynamics area at the wall surface first, and the loud dynamics afterward. Hence, each tone fades in, even with fast punches that only effect a more direct tone attack. Although the interaction sampling rates used lower this effect, a velocity-dependent sampling of the interactor would make the dynamic level more accessible.

However, all performative means of expression are available and easy to perform: dynamics and emphasis, articulation, (de-)tuning, and timbral and articulational (glissando, trill etc.) effects.

For the Harmonic Field, the composer is free to define any chords, assign them to any timbral instrumentation and figurative ornamentation, and combine them by overlapping. He can actually define any compositional and timbral texture, and it can be explored freely by the player. The player, however, is fixed to this predefined set, unable to create new chords and textures interactively during the performance. Furthermore, the three-dimensional space cannot be explored adequately using head-orientation alone, i.e. looking at a harmonic area from a relatively fixed position, which allows only an exploration in 2D. The player should be able to move freely in 3D space. This raises conflicts with the Tone Wall metaphor. A possible solution is to position the Tone Wall always in front of the player and reposition it when the player moves through the Harmonic Field.

However, the combination of the Harmonic Field with the Tone Wall interface opens up a very large musical bandwidth with more timbral freedom than any traditional musical instrument can offer. The three-dimensional setup of harmonic structures and their density-dependent ornamentation textures are also unique and provide an inspiring platform, especially for performing contemporary music.

4 Conclusion, Future Work

In this paper we presented a gesture-based approach towards virtual musical instruments. We introduced the conceptual basis, which is a novel interaction mechanism developed for the interactive auditory exploration of volumetric data sets. For their sonification we devised the musical metaphors of the Tone Wall and the Harmonic Field, and conceived their sonic behavior in a way that the interaction with them produces musical events and aesthetic structures, like tones, melodies, timbre effects, chords, and textures. We discussed assets and drawbacks of these metaphors and outlined advancements.

3D interaction devices open up a multitude of new possibilities for the design of computer-based instruments. Their big potential lies in their intuitive association with physical human gestures and musical events, for which the interaction with virtual volume data turned out to be the medium of choice. Future work includes the development of further metaphors and the integration of serial and generative concepts. The volumetric interaction interface also opens up a promising possibility for the conduction of music.

The musical volume representation concept is also a novel view on musical structure and elements, enabling new compositional forms and means of expression. Here lies the biggest potential of new computer-based instruments. It is unnecessary to imitate traditional instruments to create music that is performed better with the real ones. If one wants to play a piano, violin, trombone etc., the real ones always perform better. New instruments should not imitate them, but stand for a confident self-reliance, to open up new possibilities for new music to constitute their right to exist.

References

[1] A. Berndt, K. Hartmann, N. Röber, and M. Masuch. Composition and Arrangement Techniques for Music in Interactive Immersive Environments. In Audio Mostly 2006: A Conf. on Sound in Games, pages 53–59, Piteå, Sweden, Oct. 2006. Interactive Institute, Sonic Studio Piteå.

[2] A. R. Brown, R. W. Wooller, and T. Kate. The Morphing Table: A collaborative interface for musical interaction. In A. Riddel and A. Thorogood, editors, Proceedings of the Australasian Computer Music Conference, pages 34–39, Canberra, Australia, July 2007. Australian National University Canberra.

[3] J. Chadabe. Interactive Music Composition and Performance System. United States Patent Nr. 4,526,078, July 1985. Filed Sep. 1982.

[4] J. Chadabe. The Limitations of Mapping as a Structural Descriptive in Electronic Instruments. In Proceedings of the Conference on New Instruments for Musical Expression (NIME-02), Dublin, Ireland, May 2002.

[5] R. H. Chapel. Realtime Algorithmic Music Systems From Fractals and Chaotic Functions: Towards an Active Musical Instrument. PhD thesis, University Pompeu Fabra, Department of Technology, Barcelona, Spain, Sept. 2003.

[6] A. Crevoisier, C. Bornand, A. Guichard, S. Matsumura, and C. Arakawa. Sound Rose: Creating Music and Images with a Touch Table. In NIME '06: Sixth meeting of the International Conference on New Interfaces for Musical Expression, pages 212–215, Paris, France, 2006. IRCAM—Centre Pompidou.

[7] W. T. Fitch and G. Kramer. Sonifying the body electric: Superiority of an auditory over a visual display in a complex multivariate system. In G. Kramer, editor, Auditory Display: Sonification, Audification, and Auditory Interfaces, Boston, MA, USA, 1994. Addison-Wesley.

[8] C. Heeter and P. Gomes. It's Time for Hypermedia to Move to Talking Pictures. Journal of Educational Multimedia and Hypermedia, Winter 1992.

[9] A. Hunt and T. Hermann. The Importance of Interaction in Sonification. In ICAD 04—Tenth Meeting of the International Conference on Auditory Display, Sydney, Australia, July 2004.

[10] S. Jordà, M. Kaltenbrunner, G. Geiger, and R. Bencina. The reacTable. In Proceedings of the International Computer Music Conference, Barcelona, Spain, 2005. International Computer Music Association.

[11] K. Jørgensen. On the Functional Aspects of Computer Game Audio. In Audio Mostly 2006: A Conf. on Sound in Games, pages 48–52, Piteå, Sweden, Oct. 2006. Interactive Institute, Sonic Studio Piteå.

[12] M. Kaltenbrunner, S. Jordà, G. Geiger, and M. Alonso. The reacTable: A Collaborative Musical Instrument. In Proceedings of the Workshop on "Tangible Interaction in Collaborative Environments" (TICE), at the 15th International IEEE Workshops on Enabling Technologies, Manchester, U.K., 2006.

[13] G. Kramer, editor. Auditory Display: Sonification, Audification, and Auditory Interfaces. Addison-Wesley, Boston, MA, USA, 1994.

[14] M. V. Mathews. The Digital Computer as a Musical Instrument. Science, 142:553–557, Nov. 1963.

[15] M. V. Mathews and L. Rosler. Graphical Language for the Scores of Computer-Generated Sounds. Perspectives of New Music, 6(2):92–118, Spring–Summer 1968.

[16] R. Minghim and A. R. Forrest. An Illustrated Analysis of Sonification for Scientific Visualisation. In IEEE Conference on Visualization, Atlanta, USA, Oct. 1995.

[17] Niklas Röber and Maic Masuch. Playing Audio-only Games: A compendium of interacting with virtual, auditory Worlds. In Proceedings of 2nd DIGRA Gamesconference, Vancouver, Canada, 2005.

[18] David Rossiter and Wai-Yin Ng. A system for the complementary visualization of 3D volume images using 2D and 3D binaurally processed sonification representations. In Proceedings of the 7th conference on Visualization, pages 351–354, San Francisco, USA, 1996. IEEE Computer Society Press.

[19] Laurie Spiegel. Music Mouse. http://retiary.org/ls/programs.html, 2004.

[20] Lars Stockmann. Designing an Audio API for Mobile Platforms. Internship report, 2007.

[21] L. S. Theremin. Method of and Apparatus for the Generation of Sounds. United States Patent Nr. 73,529, Dec. 1924.

[22] Paul Vickers and Bennett Hogg. Sonification abstraite/sonification concrète: An 'Æsthetic perspective space' for classifying auditory displays in the ars musica domain. In ICAD 06—12th International Conference on Auditory Display, June 2006.

[23] Colin Ware and Steven Osborne. Exploration and virtual camera control in virtual three dimensional environments. SIGGRAPH Comput. Graph., 24(2):175–183, 1990.

[24] J. Williamson, R. Murray-Smith, and S. Hughes. Shoogle: Excitatory Multimodal Interaction on Mobile Devices. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 121–124, New York, USA, 2007. ACM.

[25] R. W. Wooller and A. R. Brown. Investigating morphing algorithms for generative music. In Third Iteration: Third International Conference on Generative Systems in the Electronic Arts, Melbourne, Australia, Dec. 2005.

Same but Different – Composing for Interactivity

Anders-Petter Andersson, Interactive Sound Design, Kristianstad University, Anders-Petter.Andersson@hkr.se


Birgitta Cappelen, AHO- The Oslo School of Architecture and Design, Birgitta.Cappelen@aho.no

Abstract. Based on experiences from practical design work, we try to show what we believe are the similarities and differences between composing music for interactive media and composing linear music. In our view, much is the same, built on traditions that have been around for centuries within music and composition. The fact that the composer writes programming code is an essential difference. Instead of writing one linear work, he creates infinite numbers of potential musics that reveal themselves as answers to user interactions in many situations. Therefore, we have to broaden our perspectives. We have to put forward factors that earlier were implicit in the musical and music-making situations, no matter if it was the concert hall, the church, or the club. When composing interactive music we have to consider the genre, the potential roles the listener might take, and the user experience in different situations.

What and Why

Interactive media are increasingly becoming a significant part of our daily lives, and the extreme ongoing developments in mobile communication services are making auditive interactive media in particular important. This is a big challenge for everyone who wants to take part in the creation and understanding of the new auditive interactive media. But to what degree is it new? And to what degree does the composition of interactive music build on traditional music composition?

In this paper we want to show how the composition of music for interactive media is similar to, but also different from, linear music composition. We would like to show to what degree we can reason and follow the same line of thought when composing interactive music as in linear music. We want to show what perspectives, factors and conditions one has to acknowledge when composing music for interactive media.

The ORFI Example

Figure 1: The ORFI landscape, the modules and the dynamic video projection.

We would like to use the interactive installation ORFI 1 [1] as an example, showing how we were thinking when creating the music for the installation. We do this in order to be able to discuss in what way the compositional work is similar and different when comparing interactive music to linear music.

1 ORFI is an interactive audio tactile experience environment created by the group MusicalFieldsForever (Anders-Petter Andersson (concept, music, composition rules, sound design), Birgitta Cappelen (field theory, concept, design, interaction design) and Fredrik Olofsson (concept, music, software and hardware)). www.musicalfieldsforever.com

ORFI is a new audio tactile interactive installation (see Figure 1). It consists of around 20 tetrahedron-shaped soft modules, like specially shaped cushions. The modules are made in black textile and come in three different sizes, from 30 to 90 centimetres. Most of the tetrahedrons have orange origami-shaped "wings" mounted with an orange transparent light stick along one side (see Figure 2).

Figure 2: A user playing and interacting with ORFI's wings.

The "wings" contain bendable sensors. By interacting with the wings the user creates changes in light, video and music. ORFI is shaped as a hybrid, a hybrid between furniture, an instrument and a toy, in order to motivate different forms of interaction. One can sit down in it as in a chair, play on it as on an instrument, or play with it as with a friend. The largest modules, of a suitable size to sit on, have no wings with sensors, but speakers instead. Every module contains a microcomputer and a radio transmitter and receiver, so they can communicate wirelessly with each other. The modules can be connected together in a Lego-like manner into large interactive landscapes. Or, the modules can be spread out in a radius of 100 meters. So one can interact with each other sitting close, or far away from each other. There is no central point in the installation, the field [2]. The users can look at each other or at the dynamic video, like a

The users can look at each other or at the dynamic video, like a living tapestry, which they create together. Or they can just chill out and feel the vibrations from the music sitting in the largest modules.
The installation has a 4-channel sound system that makes listening a distributed experience. ORFI consists for the time being of 8 genres, or collections of rules, which the user can change between. Our use of the term "genre" refers to popular culture, such as music, and everyday activities performed when consuming the music, such as dancing [3, 4]. In ORFI we explore 8 different musical genres:

- JAZZ (bebop jazz band, dancing, ambient)
- FUNK (groove, dancing)
- RYTM (techno, club)
- TATI (speech, onomatopoeic, movie)
- GLCH (noise, club)
- ARVO (ambient, relaxation)
- MINI (minimalist instruments, playing with toys)
- VOXX (voice recordings generated dynamically by the user).

In this paper we have chosen to describe the compositions in the JAZZ and MINI genres, because they represent opposites in terms of genre and therefore serve as explanatory examples.
The many possibilities, such as many distributed wireless modules and many genres to choose between, reflect our goal to facilitate collaboration and communication on equal terms, between different users in different use situations.

New Situations and Roles
One of the aspects we have put a lot of effort into when creating ORFI is the use or consumption situation. We don't know and cannot control in what situation and for how long ORFI will be played and listened to. This differs in a fundamental way from composing music for a stage performance. In a stage performance one knows implicitly that the audience will sit in the dark, facing the stage, quietly listening for one or two hours. Radio listening is more like our situation, but here one usually knows, by knowing the time, what everyday ritual the radio programme is part of [5]. Music as an ambient sound tapestry in a home is more like our situation, but here the user is limited to turning the music on or off, or changing to the next tune. These actions represent a break in the continuity of the listening experience.
In ORFI the interaction shall be a seamless [6] part of the music experience. The music must therefore dynamically invite and motivate interaction, and the co-creation of the music experience. In this sense, ORFI is more like the improvisational musician, but in ORFI we cannot count on the user having professional musical know-how. It must be satisfying to both musical professionals and people with little musical competence, if we are to reach our ambitious goals.
In ORFI the audience changes continuously between roles in different situations, from being a passive listener, to a musician and a composer. Through long use the user gains deeper knowledge about ORFI's complexity, and the user becomes more like the improvisational musician, who with his competence creates music on an instrument in real time. But the user also comes closer to being a composer over time. The user becomes a co-composer who, based on the potential the composer and writer of the software has formulated, "composes" music by choosing and mixing music together. In this way the real composer, who has written the software, is present in the installation by continuously giving musical answers, offering new musical possibilities to the user, the co-composer. This is also a major difference between linear and interactive music composition.

The Interactive Challenge
The composer of interactive music does not write notes on paper or mix sound samples together into a linear track. He creates music and software which totally or partly are the same thing, depending on whether the music elements are programs or sound samples. This means that the composer composes potential music, and software that controls the potential relations between music elements. Music elements may follow each other or lie as layers on top of each other and be distributed in the 4-channel system, all depending on what the user does, which can never be predicted exactly. Writing software represents a totally different potential than writing for an instrument, because the computer can wait, remember and learn in a more or less intelligent manner. One can therefore write software so that the installation or the interactive medium behaves more or less like an active actor [7] instead of as an instrument. When playing a traditional instrument, a musical gesture will produce an immediate mechanical sound response [8]. Writing software, one can decide that a gesture or interaction from the user will, after a while, create a more complex musical answer. This is more like the improvisational musician, who after some time comes with his answer to your solo play.
In ORFI we use both strategies in order to offer multiple possibilities in all situations [2]. This means that the user interacting with ORFI gets a direct, immediate answer in light and sound, as when playing on an acoustic instrument. But after a little while he gets a complex musical answer to motivate the user to further co-creation with ORFI. Examples of how this is composed will be presented under "Interactive music composition".
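The following SuperCollider sketch illustrates the two strategies in miniature. It is our own example, not the ORFI software: the synth name \ping, the interval pattern and the two-beat delay are all invented for illustration, and a running audio server is assumed.

(
s.waitForBoot {
    SynthDef(\ping, { |out = 0, freq = 440, amp = 0.2|
        var env = EnvGen.kr(Env.perc(0.01, 0.3), doneAction: 2);
        Out.ar(out, SinOsc.ar(freq, 0, amp * env) ! 2);
    }).add;
    s.sync;

    // direct, immediate answer, as when touching an acoustic instrument
    ~direct = { |freq| Synth(\ping, [\freq, freq]) };

    // delayed, more complex answer: a short phrase scheduled two beats later
    ~delayed = { |root|
        TempoClock.default.sched(2, {
            Pbind(\instrument, \ping, \freq, Pseq(root * [1, 5/4, 3/2, 2], 1), \dur, 0.25).play;
            nil
        });
    };

    // simulate one user interaction
    ~direct.(330);
    ~delayed.(330);
};
)

Writing the delayed answer as a scheduled pattern, rather than as a fixed sound file, is one simple way to let the same gesture produce different phrases in different situations.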
ORFI is created so that it continuously invites collaboration in different ways and through different media and forms [9]. Since we have the very ambitious goal that ORFI shall work satisfactorily for most users in many situations over a long time, an open concept of collaboration is necessary. Nothing is the right way to do something. Nothing is wrong. It is as right to listen and sleep in the interactive landscape as it is to throw the modules between each other while playing. It is equally right to build and shape one's own interactive landscape, as it is to concentrate and move the wings rhythmically for some minutes. To offer this amount of openness we have put a lot of effort into the design of ORFI in order to offer qualities like robustness, sensitivity, ambiguity, musicality and consistency.
ORFI has to be very robust physically to handle being thrown, stepped on and bent intensely. But it also has to be robust, tolerant and sensitive in software and hardware, to register a weak movement from a child's hand and attempt to follow its rhythm.
ORFI offers many visual, physical and musical possibilities in many situations. It tries to answer and encourage the users in different ways to musical interaction, regardless of their competence, or lack of competence, in music. This means that for some users, in some genres, the lights' rhythmical blinking motivates rhythmic interaction. In other situations, for other users, it is the complex dynamic graphics that give the user a visual image and motivate the user to create the musical narrative.
All these possibilities open up various experiences for different users depending on the individual's competences and experiences with ORFI. It has been very important for us to design ORFI to offer many possibilities in every situation, ambiguity [10]. But it has also been very important to give ORFI a clear and unique identity, so that ORFI might act as a convincing actor in a collaboration or improvisation. The continuous change of roles the user can make, the many possibilities the users are offered, the potentially infinite uses and

the many consumption situations make the interactive composition challenge much more complex than in linear music. The fact that the nature of interactive music composition is software, and not notes or samples, also makes it necessary to structure the interactive composition in another way than linear music.

Interactive Music Composition
So how have we met the interactive challenge of composing potential music? How have we created algorithms and rules in programming code that regulate the relations between musical elements, the conditions for potential music? And how did we compose music that motivated both professional musicians and laymen to interact?
With concrete examples we will try to show how we have composed the music for ORFI, exemplified by the two most diverse genres.
We have chosen to structure ORFI's software and music composition into the following layers: sound node, composition rule, and narrative structure (see Figure 3). Sound nodes are the smallest musically defined elements, such as tones, chords, or rhythmic patterns. The sound nodes can be joined into sequences or parallel events by composition rules (algorithms), forming phrases, cadences, or rhythmic patterns. The user experiences these phrases as narrative structures based on a genre.

Figure 3: Structure of the interactive music composition software and interface in ORFI.

Figure 3 shows two users interacting (bottom) with input sensors A and B. The composition rules, written in programming code, select the saxophone sound nodes (sax 1, 5, 7, 2) based on the users' interaction. Another composition rule creates the switch from "ground 1" in high tempo to "ground 2" in slow tempo, so that it synchronises smoothly with the pulse without creating a break. Over time the user creates a narrative structure of an 8 bar jazz blues that motivates further interaction.

Sound Node
definition, mediation and qualities
We call the smallest musically defined elements sound nodes. They are categorized by sound qualities like length, instrument, pitch, harmony, tempo, meter, etc. Based on the sound qualities and the composition rules, the program chooses and creates the narrative, e.g. a melodic motif, where the expressive qualities depend on user interaction.
A sound node can be a linear sound file or programming code. We have chosen to present the JAZZ and MINI genres to show our solutions in the two cases, sound samples vs. code. The accompaniments (ground), horn riffs and saxophone sound nodes in the JAZZ genre are sound files. The melodic patterns in the MINI genre are programming code. The difference between a sound file and programming code is that the auditive result formulated in programming code varies dynamically for each interaction, while the sound file is essentially the same each time it is played. This makes the programming code potentially more flexible, since it can vary with user interaction.

creation of a node
Similar to traditional jazz, the blues in our JAZZ genre is composed and recorded by a jazz ensemble [11]. Each musician has recorded his instrument until the result is mixed down to a jazz song. After recording the music, we have cut and grouped the recorded instruments into separate sound files. Then we have arranged the files interactively by writing rules for ORFI [2]. Our arrangement builds on the style's traditional "improvisation on a theme".

nodes and node structure
In traditional jazz the musicians play on instruments with direct response only. In ORFI the user might instead play on 20 physical soft modules. When interacting, each module plays three different saxophone sound samples depending on the situation. The reason for this solution is that we wanted to be able to vary the expression from soft, Ben Webster-like sound nodes within the Dorian scale, to hard, growling and dissonant sound nodes outside the Dorian scale, and percussive saxophone pad sound nodes. Which sound nodes the program combines, and how, depends on whether the users are active or passive, interact on their own or collaborate, synchronise to the musical beat or not.
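Purely as an illustration of such a selection rule (the node names, thresholds and inputs below are our inventions, not taken from the ORFI code), the decision could be sketched in SuperCollider as a small function:

(
~chooseSaxNode = { |activity, onBeat|   // activity: 0.0-1.0, onBeat: true/false
    if(activity < 0.3) {
        \saxSoftDorian            // calm, solitary use: soft nodes inside the Dorian scale
    } {
        if(onBeat) {
            \saxPadPercussive     // lively use synchronised to the beat: percussive pad nodes
        } {
            \saxGrowlOutside      // many unsynchronised gestures: growling nodes outside the scale
        }
    }
};

~chooseSaxNode.(0.1, false).postln;   // -> saxSoftDorian
~chooseSaxNode.(0.9, true).postln;    // -> saxPadPercussive
~chooseSaxNode.(0.9, false).postln;   // -> saxGrowlOutside
)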
roles and experience
Similar to jazz, our interpretation has separate roles tied to the different instruments. We use the tenor saxophone in the soloist role, blues drums and walking bass as accompaniment (ground), and horn riffs played on saxophone, trombone and trumpet. An important difference is that the interacting users can continuously choose to change between the roles of improviser, soloist and accompanist by choosing which module to play on, etc. Therefore the roles are potential and open for interpretation rather than definite.
An example: when improvising, a saxophonist creates music from pre-composed short motifs. He also creates phrases from two contradictory curves of tension, amplitude and vibrato. The amplitude curve goes from strong to weak (>), and the curve for vibrato goes in the opposite direction, from little to much vibrato (<). We use the same strategy when composing interactive music. This results in a sound node that potentially has two gestures at the same time, a decreasing and an increasing gesture. These contradictory curves of tension could potentially function as a start tone, building up tension in a phrase, and as an end tone finishing the phrase and creating a release. The user, layman or musician, can choose to hear it as tension or release depending on the situation.
In our JAZZ genre laymen can use the saxophone's tension-creating curve for other purposes than the professional musician. For instance, speeding up a movement, rolling his body over the soft modules spread out on the floor, while communicating with a friend.

response and experiences
The melodic pattern generated by the programming code changes dynamically with the user interactions and with the other melodic patterns playing at the same time.

So when the user interacts, the software realises one out of many possible melodies.
Similar to playing on traditional instruments, the user gets a direct response when interacting. And at the same time the user contributes to a musical whole. The MINI genre gives an immediate response in simple 2-6 note melodic motifs. The sound nodes also contribute to a complex and musically satisfying response that motivates users to interact with others over a longer time.
Similar to traditional music, our JAZZ genre uses different instruments to create complex variations and contrasts between the instruments. This is also the case for groups of sound nodes within an instrument.

code vs. tune
The sound nodes in the MINI genre are inspired by minimalist music in the style of Steve Reich [12]. Similar to minimalist music, our genre is characterised by repetitions and small variations of short rhythmical and melodic motifs, rather than large-scale development such as phrasing or sonata form. With less happening on a macro level, the focus is directed towards the surface and the micro level of small changes in melody, rhythm and timbre.
What makes our MINI genre different is that every sound node is a program (see Figure 4).

SynthDef(\pattSynth, {|out= 0, freq= 440, amp= 0.1, atk= 0, rel= 0.5, max= 40|
var e= EnvGen.kr(Env.perc(atk, rel), doneAction:2);
var f= EnvGen.kr(Env.perc(0, 0.01), 1, 1, 0, Rand(0.95, 1.1));
var z= SinOsc.ar(freq*[1, IRand(0, 3).round(1).max(0.5)+Rand(1, 1.02)], f*Rand(10, 40), amp*0.1);
Out.ar(out, e*z);
}).store;

Figure 4: The sound node programming code for a synthesised marimba, written by Fredrik Olofsson in SuperCollider [13, 14].
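To give a feel for how such a code-based sound node can be driven, here is a minimal sketch of our own (not Fredrik Olofsson's pattern code): assuming the SynthDef above has been evaluated and the audio server is running, a repeating motif with small octave and decay variations can be played through \pattSynth.

(
Pbind(
    \instrument, \pattSynth,
    \scale, Scale.dorian,
    \degree, Pseq([0, 2, 4, 1], inf),        // a short repeating motif
    \octave, Prand([5, 6], inf),             // small octave displacements for variation
    \dur, Pseq([0.25, 0.25, 0.5], inf),
    \amp, 0.1,
    \atk, 0,
    \rel, Pwhite(0.2, 0.6)                   // a slightly different decay on every note
).play;
)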
Composition Rule
definition and mediation
Similar to traditional music, the composition rule is the compositional knowledge the composer uses to create the traditional musical work. It can for instance be knowledge about how to create relations between tones, rhythm, melody, timbre and harmony in music.
Different from traditional music is that the composition rules are programming code realised through use. For instance, the sound files in the JAZZ genre are controlled by the composition rules in a program that considers both the musical and the interactive development over time. Another example is the MINI genre melodic pattern, where both sound nodes and composition rules are formulated as programming code (Figure 4) and where the difference between sounds and rules disappears. The result is that the music can change dynamically, and that it sounds different over time, with different users and in different situations.

competence and experience
Similar to traditional music, a musician can use his improvisational competence to make music by joining pre-composed elements together.
Different from traditional music is that a layman with less musical competence can interact with the program and its composition rules. The program interprets the interaction, delays and changes the response in order to make it musically satisfying, according to the composition rules.
The composition rules regulate the synchronisation of motifs to the pulse after every user input. They also regulate tonal and harmonic development so that they don't contradict the genre rules. Instead, the program waits for a rhythmically suitable moment to play back motifs, and selects sound nodes that add variation to the harmony and musical phrasing.
Unlike traditional music, laymen can communicate with each other directly through the music. In ORFI laymen interact actively and the program responds to individual as well as collective interactions.
An obstacle in traditional music is that it is hard to make music. It is hard for a layman to keep the rhythm, pick the right notes and create musically satisfying phrases. In ORFI it's different: the program and its composition rules are tolerant, making it easier for laymen to synchronise to the pulse. The composition rules tolerate deviations from what is rhythmically correct and synchronise motifs to the harmony and the pulse in other sound nodes. The result is the avoidance of technically difficult situations, and the laymen can instead focus on communication and collaboration with others.

composition techniques in JAZZ
We have been inspired by traditional cool jazz and its modal harmony, and the rhythmically improvised and laid back performing style of such artists as Miles Davis and Ben Webster [15, 16]. In cool jazz a saxophonist can make it sound great by letting the instrument wander casually along a modal scale. He searches his way along the background of drums and the falling fifths of a walking bass. In cool jazz it is customary to use themes in modal scales with fewer chords in order to make it easier to improvise freely, with focus on rhythm and musical expression. Similarly, we use Dorian modality in ORFI to motivate improvisation and interaction.
As in cool jazz, our interactive music composition also uses effects like growling and dissonant saxophone with harsh timbre as a musical rhetorical technique. Unlike traditional jazz, we use growling and dissonant saxophone to express, stage and dramatize the conflict when many users interact simultaneously, for instance when many people play and tease each other by interacting with many ORFI modules at the same time. The result is that the program creates many growling noises outside the Dorian scale, in addition to the user-created, soft and consonant tones.
As in traditional jazz, we use soft consonant leading notes for making musical ornaments. These motivate improvisation, such as call-and-response communication, and duets between musicians. What is different is that the soft and consonant leading notes are used to express pauses, motivating turn-taking between laymen. When one or many users make pauses in a sequence of interact-stop-interact-stop-interact, etc., the composition rules create soft leading notes in the Dorian scale, in addition to the ones that the user has chosen. The result is that the user becomes aware of the silent pause between the interactions, and of the relations between his own actions and the actions of others. This motivates dialogue, imitation and play in a call-and-response manner.

composition techniques in MINI
Similar to traditional minimalist music, the ORFI motifs borrow polyrhythmic techniques from Gamelan, African and Middle Ages music. Here, polyrhythmic and harmonic gaps in the rhythmic patterns make them fit into each other, creating "hocket" patterns. This motivates improvisation and interaction.
A difference is that the minimalist motifs are used to express contrasting and varying responses that motivate laymen. The composition rules then vary the pattern so that the hocket effects disappear, in order to reappear when the rule for variation is active again. Another difference is that the music varies with the number of users interacting, giving dub-delay effects to one user and reverb to another. The result is a blurred and distorted effect. The effect is used to separate two individual laymen, motivating them to collaborate.
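The tolerant synchronisation of motifs to the pulse described in the Composition Rule section can be illustrated with a small SuperCollider sketch of our own (the built-in \default synth stands in for an ORFI sound node; the clock tempo and motif are invented): whatever the exact moment of the user's gesture, the motif is quantised to the next beat of a shared pulse.

(
~clock = TempoClock(2);                     // a shared pulse of 120 BPM

~onUserInput = { |degrees|
    Pbind(
        \instrument, \default,              // stand-in for an ORFI sound node
        \scale, Scale.dorian,
        \degree, Pseq(degrees, 1),
        \dur, 0.5
    ).play(~clock, quant: 1);               // starts on the next beat, absorbing sloppy timing
};

~onUserInput.([0, 2, 4]);                   // e.g. called from a sensor callback
)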

Narrative Structure
definition
We use the term narrative structure to describe structures for connecting series of events in ORFI, creating experience and expectations about future musical output.

role, action and expectations
The difference between linear music and ORFI's music is that the composer has to negotiate the narrative structure with interacting users as well as passive listeners. Similar to linear music, in ORFI there are often opposing or contradictory expectations about what the narrative structures mean. For example, a melodic structure driving the music forward creates expectations of tension and a crescendo. In the same piece, a rhythmic pulse in the ground might create expectations of bodily movement and dance to the pulse. Similar to linear music, it is possible for the users in ORFI to negotiate meaning, following or denying expectations about the narrative structure.
Similar to traditional jazz, the narrative structure of our JAZZ genre follows the development and tension of an 8 bar jazz blues structure. Traditionally the blues structure is the ground for the soloist's improvisation over a repeated series of chords and a pulse. Often the soloist creates expectations that follow the convention, playing as many rounds as he thinks suits him. When he feels ready to hand over to the next soloist he gives a sign, making cadences or finishing riffs. And the next person, eager to make his interpretation and show off to the audience, takes over. Building up to the moment just before the start of a new round in bars 7-8, there is a short period of 6-8 beats where the tension is at its peak and the negotiation is strongest.

interpretation and negotiation
A difference in our JAZZ genre is that the system can analyse whether the user is synchronising his actions to the pulse in the accompaniment. If he succeeds, ORFI answers with rewarding horn riffs, stressing the harmonic, periodic and rhythmic development in the blues.
Another difference is that the blues accompaniment with drums and bass in our JAZZ genre can be used to negotiate the narrative structure. This is often done by users playing and craving more musical variations of a certain riff. We have found that the accompaniment can, in addition, motivate two users playing a game, dancing, or a person lying down resting without focus on the music.
Another difference is that the accompaniment is divided into three ground beats in different tempi, creating possibilities for the user to start, stop, change tempo, play together with horns, etc. It increases the possibilities for the user to negotiate what the narrative structure should be and how strong. When interacting, the active users' actions and the references to activities like dancing, playing, creating music etc. produce, uphold and nurture a narrative structure that potentially invites other users.

genres and experience
The traditional minimalist narrative structure follows the development on a micro level, with fewer expectations of large-scale formal development. It is almost a contradiction that anything relevant should happen on the macro level in minimalist music. Instead the expectations should be directed towards the micro level and the tiny variations we can hear if we sharpen our senses.
A difference in our MINI genre is that the system organises the synchronisation of the melodic patterns to the pulse. The synchronisation frees the user from the responsibility of keeping track of the beat. Instead, we have found, it creates possibilities for the user to focus on the communication and improvisation with others.
The biggest difference, however, is that the user can choose to negotiate what role to play, and whether he wants ORFI to be a tolerant minimalist sound carpet to sink down into, or a melodic toy to throw and play with, or an improvising partner and active actor, continuously inviting and motivating communication.

same and different
We have tried to show how we have composed music for the MINI and JAZZ genres in ORFI. We have found the comparison between traditional popular music and its application to interactive music to be very successful. Much of music's expressive qualities, variation and repetition techniques are the same in interactive and linear music. A great deal of traditional knowledge about analysis and composition of music can be transferred to interactive music.
The differences we found in the design of ORFI are primarily tied to user expectations, and to structures in composition rules and narrative structures to support those expectations. Our experience is that interacting users need an immediate response to be able to orient themselves and find their way, as well as a more complex response in order to be motivated to continue interacting over time. We often found that musically complex structures or processes found in traditional music could strengthen other situations, with laymen interacting and playing alone and in collaboration with other people.

Add Perspectives
In our paper we have tried to show what we believe are the similarities and differences between composing music for interactive media and for linear music. In our view, much is the same, built on traditions that have been around for centuries within music and composition. However, our main conclusion about the new auditive medium is that we have to broaden our perspectives. We have to put forward factors that earlier were implicit in the musical and music making situations, no matter if it was the concert hall, the church, or the club. When composing interactive music we have to consider the genre, the potential roles the listener might take, and the user experience in different situations.
The consumption situation in interactive media is dynamically changeable. Interactive music consumption can take place at home, in the street, at school. It doesn't need to be static, predestined and hierarchical, with the professional and recognised musician on stage and an anonymous audience in darkness. In the concert hall or the club the sound comes from a centrally placed sound system. In interactive media, however, the sound can be distributed and mobile, so that it moves and follows the persons interacting.
The persons consuming the sound are not passive listeners anymore, but active users, able to dynamically shift between roles by choosing their position in space and their relations and roles to other people and to the music. The user can take part in changing the sound experience in real time, based on the rules the composer has created as a potentiality in the software. This differs in a significant way from the jazz improviser or the professional musician. The fact that the composer writes programming code is an essential difference. Instead of writing one linear work, he creates an infinite amount of potential music that reveals itself as answers to user interactions in many situations. This might be like an instrument responding to a musical gesture, or a competent and intelligent actor answering musically in an improvisation session. But everything has to be formulated in advance as rules in the software. The challenge is to create music, through user interaction, that motivates further co-creation of the music and the moving image narrative. Everything has to be formulated in advance, based on genre and music knowledge and on competence in social behaviour. It's all about broadening the perspective, looking wider, further and deeper.

Acknowledgements
Without Fredrik Olofsson’s unique artistic and technological
competence and knowledge in development of music, hardware
and software, ORFI would not have been possible to create. We
also thank Jens Lindgård, Petter Lindgård and Sven Andersson
for their work with music. We thank the Swedish
Inheritance Fund and Borgstena Textile AB for their
contributions. We thank Interactive Institute and K3 Malmö
University for being a source of inspiration to our work in the
group MusicalFieldsForever.

References
[1] Andersson, Anders-Petter, Cappelen, Birgitta, Olofsson,
Fredrik, ORFI, interactive installation, MusicalFieldsForever,
Art’s Birthday Party, Museum of Modern Art, Stockholm,
(2008)
[2] Cappelen, Birgitta & Andersson, Anders-Petter, From
Designing Objects to Designing Fields - From Control to
Freedom, Digital Creativity 14(2): 74-90, (2003)
[3] Fabbri, Franco, A theory of Popular Music Genres: Two
Applications, Popular Music Perspectives, Horn, D. & Tagg. P.
(ed.), Göteborg and Exeter: A. Wheaton, 52-81, (1982)
[4] Holt, Fabian, Genre in Popular Music, University of
Chicago, (2007)
[5] Tacchi, Jo, Radio Texture: between self and others, Material
Cultures, Why some things matter, (ed.) Miller D., London,
(1998)
[6] Weiser, Marc, The Computer for the Twenty-First Century,
Scientific American, 256(3): 94-104, (1991)
[7] Latour, Bruno, Pandora's Hope, Essays on the Reality of
Science Studies, Cambridge, MA; London, UK: Harvard
University Press, (1999)
[8] Godøy, Rolf Inge, Haga, Egil, Refsum Jensenius, Alexander,
Exploring Music-Related Gestures by Sound-Tracing. A
Preliminary Study, Congas, Leeds, (2006)
[9] Crawford, Chris, On game Design, US, (2003)
[10] Andersson, Anders-Petter & Cappelen, Birgitta, Ambiguity
- a User Quality, Collaborative Narrative in a Multimodal User
Interface, Proceedings AAAI, Smart Graphics, Stanford, (2000)
[11] Lindgård, Petter, Lindgård, Jens, Andersson, Sven (music),
Andersson, Anders-Petter (arr. & composition rules for
interactive installation), JAZZ genre, Do-Be-DJ/Mufi,
MusicalFieldsForever, (2000)
[12] Reich, Steve, Music for 18 Musicians, Recording, ECM,
(1978)
[13] SuperCollider, http://www.audiosynth.com, (2008)
[14] Olofsson, Fredrik (music and composition rules for
interactive installation), MINI genre, ORFI/Mufi,
MusicalFieldsForever, (2007)
[15] Davis, Miles, Birth of the Cool, Recording, Capitol,
(1950)
[16] Webster, Ben (arr.), Arlen H., Koehler, T, I’ll Wind,
Recording, Soulville, Verve, (1957)

The HarmonyPad
- A new creative tool for analyzing, generating and teaching tonal
music
G. Gatzsche, Fraunhofer IDMT, Ilmenau, Germany, gabriel.gatzsche@tu-ilmenau.de
M. Mehnert, Technische Universität, Ilmenau, Germany, markus.mehnert@tu-ilmenau.de
D. Gatzsche, Staatl. Berufsbildende Schule "Janusz Korczak", Weimar, Germany, david.g@tzsche.de
K. Brandenburg, Technische Universität , Ilmenau, Germany, Karlheinz.brandenburg@tu-ilmenau.de

Abstract. Learning a classical musical instrument is a challenging task that requires long term practise, high
motor skills and intensive training. Within that relationship the following challenges exist: 1.) Students often
perceive pure score reading and reinterpretation as being boring. To teach musical improvisation additionally
would solve this problem. 2.) But to be able to improvise the student has to reach a certain technical level first.
3.) Pure score reading and reinterpretation does not automatically train the ability to improvise. Often very
good score players have difficulties to accompany a given melody. To overcome these problems a new musical
instrument is proposed that has the following properties: The instrument is very easy to play. Its interface
is designed so that important structural properties of tonal music become geometrically apparent: the relationships
between tones, intervals, chords and keys, functional aspects, aspects of consonance and dissonance, and aspects
of tension and resolution. The instrument can be played without having extensive motor skills or prior music
theoretical knowledge. Through the usage of the proposed device the student implicitly acquires knowledge
about musical structure which again helps to compose, improvise, analyze musical pieces or to accompany a
given melody. Teachers can use the instrument to teach music theory.

1 Introduction

Many musicians are very good at score reading and playing. But if they are required to improvise or to compose, difficulties arise. Yet expressing one's own feelings, mainly done through improvisation and composition, is usually the heaven of every musician. So why are there so many people who can play an instrument but are not able to improvise and compose? In our point of view there are three main reasons:

1. In the first years a musician is mainly concentrated on learning the required motor techniques and on being able to read and play scores. But to get a good grasp of this, much practice and a long playing period is required.
2. Additionally a musician has to learn music theory; beside playing the instrument, much effort has to be put into learning music theoretical terms and the basics of composition. This is often not done, or only partially, because of lack of time, the complexity of the music theory and personal demotivation.
3. Without practicing technical motor skills and training music theory a musician is not able to learn the highest level of music creation, that is improvisation. The expression of one's own feelings during play and musical articulation needs practice and, of course, a set of different harmonic building blocks.

To bypass or diminish the difficulties denoted before, the following preconditions have to be provided: 1.) Improvisation must not depend on the musician's personal motor skills. This means a musical beginner should have the chance to improvise in very early stages of his musical career. 2.) A much better connection of theoretical knowledge and its practical realization has to be obtained. 3.) Musical elements like tones, chords, keys, cadences, modulations etc. have to be linked much more strongly to visual associations. This would help to maintain a conscious, logical representation of the learned musical elements beside the felt and intuitive one. Furthermore the visual representation helps the student to build up and maintain a broad musical vocabulary which can be used in improvisation.
In the following sections a creative tool is proposed with the goal of realizing the points denoted before. The tool is called HarmonyPad, because it is mainly designed for the creation of homophone harmonic song and melody accompaniment. The HarmonyPad is very easy to play: with only one finger it is possible to play tones, intervals and complete chords, and with a second finger the key can easily be changed. Therefore a musical beginner can easily learn to play chords. Furthermore the harmonic functions subdominant, dominant and tonic are arranged in a way similar to how human perception handles music.
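As a toy illustration of the one-finger chord idea (our own SuperCollider sketch, not the HarmonyPad implementation; the array below encodes the key-related circle of thirds of C major that is described later in the paper), three neighbouring pitch classes on that circle always form a diatonic triad, so a single touch position is enough to select a chord:

(
~tr = [2, 5, 9, 0, 4, 7, 11];                    // d F a C e G b as pitch classes (C major)
~triadAt = { |startIndex, octave = 5|
    3.collect { |i| ~tr.wrapAt(startIndex + i) + (12 * octave) };
};

~triadAt.(3).postln;                              // [60, 64, 67] -> C major (C e G)
~triadAt.(2).postln;                              // [69, 60, 64] -> A minor (a C e)
~triadAt.(3).midicps.postln;                      // frequencies, ready to send to a synth
)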

space), the selected area (e.g. the shape, translation or


rotation of the selected area. Furthermore the user inter-
face specifies the way the pitch space and the selected
area are presented to the user. The user interface can e.g.
be a touch screen, a game controller or a also a motion
capturing system.
Figure 1 shows the HarmonyPad as an example of
the things introduced to this point: In the inside of the
HarmonyPad shows the pitch space. The pitches are
represented by the small circles together with the note
names. The selected area is the grey circular segment
that marks the three tones C, e and G. Figure 2 shows
different controllers that can be used to move the se-
lected area to play chords or to change the pitch space
currently displayed.

2 The user interface
The user interface of the HarmonyPad has to be some
kind of touch sensitive surface. This can be a normal
Figure 1: The HarmonyPad touch screen, a multi touch surface, a button matrix, a
pen display or also an innovative controller like the Re-
actable1 . The examples of the following explanations
1.1 Pitch space based musical instruments have been implemented on the JazzMutant Lemur as
an multi touch controller and a Elo Touch Screen. The
The proposed creative tool is a so called pitch space
advantage of the Elo touch screen is the possibility to
based musical instrument. Pitch space base musical
use the standard graphics driver to visualize the pitch
instruments consists of three main components: The
space, the selection function and also musical data. This
pitch space, the pitch selection and the user interface:
again allows a direct interaction with the visualized
pitch space. The player can touch and play the shown
The pitch space defines a geometric arrangement of
tones or can transform the selected area. The drawback
pitches. Similar to color spaces the pitch space arranges
of the Elo touch screen is that the screen it is not able to
tones in a way such that semantic relationships between
process multiple touches simultaneously. This problem
the tones become geometrically apparent. Such aspects
is solved with JazzMutant Lemur which provides the
are for example consonance or dissonance, chordal or
information of up to ten touch points. Another big ad-
melodic grouping, cognitive similarities or simply the
vantage of the Lemur is the possibility to define different
pitch height. The better the tones are organized within
user interface configurations very easily. The drawback
a pitch space the easier it is to generate a wanted sound
of the Lemur is that it is not possible to program and
i.e. tone combination.
show complex data visualizations and geometric mod-
The selected area is a subregion of a given pitch space els.
which contains the tones to be played. If tones that
belong musically together are located in neighborhood 3 The pitch space
then it is possible to create meaningful tone combina- The HarmonyPad consists of two pitch spaces, a first
tions through the definition of simple shaped selected space to setup the key and a second pitch space to gen-
areas. By transforming the selected area (e.g. trans- erate the actual chords. The pitch space that is used to
lation, scaling, rotating, inversion, ...) it is possible to setup the key is the circle of fifth which is shortly denoted
transform a selected chord into another one. Further-
more it is possible to assign alpha values to every point
1 Within this paper we focus on touch based techniques but it is
of the selected area which additionally assigns an indi-
also possible to use other controllers to control the pitch space
vidual weight to every selected pitch.
and the selected area (see Section 2) to get the music out of
the pitch space. So it would be possible to use a joystick like
The user interfaceis the mean by which the musician controller like the 3DConnextion SpaceNavigator, a Theremin
controls the parameters of the pitch space (e.g. the based Controller or a innovative game controller like the Nin-
geometric arrangement of the pitches within the pitch tendo WiiMote.

of the symmetry axis form dominant chords (e-G-b,


G-b-D) and pitch classes centered around the sym-
metry axis form tonic chords (e.g. a-C-e, C-e-G).
This helps the student to become conscious about
functional relationships of chords.
As it can be seen the key related circle of thirds TR ex-
presses many music theoretical and cognitive relation-
Figure 2: Possible controllers applied to con- ships by arranging only seven pitch classes. But pitch
trol the HarmonyPad: From left to right: The classes alone are not enough to generate music because
Nintendo WiiMote, The multi touch controller every pitch class represents not an unique pitch but an
JazzMutant Lemur, the 3dConnexion SpaceNav- infinite number of pitches. An example is the pitch class
c, which represents the pitches c0 , c1 , c2 ,... cn . But what
igator
of these pitches shall be played if the pitch class c is
selected? As an answer to this question we propose to
extend the key related circle of thirds TR and all other
with F.2 In Figure 1 F is represented by the twelve dots pitch class based pitch spaces in the following way: The
shown in the outer ring. Every dot represents one of pitch class of a pitch is represented at the first dimension
the twelve major keys and its relative minor key in an of a two dimensional space. We call this dimension the
order according to the circle of fifths. Represented at a pitch class axis. In the case of a polar coordinate sys-
touch screen the player can change the key by touching tem this could be the angular axis. This is has already
the corresponding dot. The second pitch space is the been done in the case of the TR. The pitch height4 , i.e.
so called key related circle of third, shortly denoted with the different octaves of a pitch class are assigned to a
TR.3 Key related means that the pitch space does not second dimension. We call this dimension pitch height
contain all twelve chromatic pitch classes but only the axis. In the case of a polar coordinate system this is
seven pitch classes of a selected key. In Figure 1 TR is the radial axis. This is illustrated Figure 3: The pitch
represented inside of F. The seven dots d, F, a, C, e, G, b classes d, F, a, C, e, G, b are assigned to the y-axis and
represent the seven pitch classes of the key C-Major. If the pitch height (d0 , e0 , .. f 00 ) is assigned to the x-axis
the player would select the key E-major by touching the of a two dimensional coordinate system. Therefore the
appropriate point at the outer circle, the TR of E-major two dimensional coordinate of a pitch follows from the
would be shown, i.e. the pitch classes f] , A, c] , E, g] , B, pitch’s chroma and the pitch’s height. In Figure 3 the
d] . The TR was selected to be the basis of the proposed resulting pitch positions are represented as black dots.
musical instrument for the following reasons: This solution has two big advantages which will become
obvious if the selected area (Figure 3, grey area) is in-
• The pitch space consists of only seven pitch classes. troduced: The shown selected area covers the a-minor
This helps the musical beginner to be not confused. chord i.e. the tones e0 , a0 and c00 . Moving the selected
If chords from other keys shall be played can use area towards the pitch class axis transforms the selected
the outer circle to change the key very easily. a-minor chord into a C-major chord, but not by a sim-
• The pitch classes of the TR are in a order such ple shift (e0 , a0 , c00 → g0 , c00 , e00 ) which causes "forbidden"
that three neighbored pitch classes generate a ma- fifths parallels. Through the proposed pitch arrangement
jor triad or a minor triad. This helps the player to the second chord (dashed frame) is inversed automati-
find chords very easily and to become conscious cally such that a well sounding chord transition is gen-
about the pitch classes that form a certain chord. erated (e0 , a0 , c00 → e0 , g0 , c00 ). The second advantage of
the proposed way to represent pitches in a two dimen-
• The pitch classes of the TR are symmetrically ar-
sional pitch class/ pitch height space is the possibility to
ranged around a so called symmetry axis. Tones
generated very simply inversions of a given chord. In
to the left of the symmetry axis form subdominant
Figure 3 the selected a-minor chord can simply be in-
chords (e.g. d-F-a, F-a-C), pitch classes to the right
verted by moving the selected area (grey area) towards
the pitch height axis (solid grey frame). In the example
2 The symbol F represents the term ”fifths”. More details and
semantic properties expressed by F can be found in [1] and [5]. 4 The separation of pitch in the components pitch height and
3 The character T represents the term "thirds" and R stands for pitch chroma is a fundamental of human auditory pitch per-
"related", which means that the pitch space only contains pitch ception. A model that illustrates this is Roger Shepard’s spiral
classes of a certain key. The TR is part of the symmetry model, helix [6]. Warren et al. [7] showed that pitch chroma resp.
a larger framework of circular pitch spaces which had been pitch class and pitch height are processed in different regions
introduced in [1] and [4] of the auditory cortex.

shown in Figure 3 the chord transition e0 , a0 , c00 → a0 , c00 , pressed button sound completely different then a whole
e00 is generated. tone shift? Through the geometric arrangement of tones
within the pitch space meaningful or often used tone
combinations are in geometric neighborhood and stand
in a simple geometric relationship. This again makes
it possible to define a simple formed shape, which cov-
ers neighbored pitch classes. Through a translation of
the shape or a controlled change of the shape's dimen-
sions the desired sound can be formed. To make this
possible we have to think about a simple parameter set
to control the shape of the selected area. Such a set of
parameters is presented now. The proposed parameter
set is explained using the TR but it can be applied to
every circular pitch space5 . In section 3 we proposed to
represent pitches in a two dimensional coordinate sys-
tem. In the example of the TR this coordinate system is
a polar coordinate system whereas the first dimension
is the angular dimension and the second dimension is the
radial dimension. Therefore the parameters start angle,
apex angle, start radius and apex radius are proposed to
control the shape of the selected area. This is illustrated
Figure 3: The pitch class/ pitch height space: The in Figure 4: The grey shape represents the selected area
first dimension of a two dimensional space repre- which is described by the four parameters named be-
sents the pitch height and the second dimension fore. We will discuss these parameters in more detail
the pitch class. The pitch space allows to generate now.
important chords and chord transitions by defin-
ing and moving rectangular areas.
4 The selected area
Principally it is possible to define arbitrary shaped se- r1 r2
lected areas to select and play tones. But the goal of Apex radius r2
this section is to propose an effective set of parameters Start radius r1
to describe the selected area. Through this parame-
terization it is possible to reduce the number of tasks
that have to be performed by the player to generate a F b

wished tone combination. In the example illustrated


in Figure 5 the selected area is predefined such that it
covers major or minor triads. This results in the fact
that the musician needs only to touch a single point at Figure 4: A possibility to parameterize the se-
the instruments surface to play a whole chord. By mov- lected area is to use the four parameters start angle,
ing only one finger into a certain direction the original
apex angle, start radius and apex radius [2]
chord can be easily inversed or transformed into another
one. At the piano for example this task would be much
more complicated: The player has to define three points
i.e. he has to press three buttons. To transform the
played chord into another one he has to know which The start angle: The start angle defines the angular po-
of the three played tones have to be changed. There sition where the selected area starts. Figure 5 shows
are many possibilities to move the fingers in the wrong the selected area with three different start angles. The
way and the required finger movement does not stand
in a simple understandable relationship with the result- 5 Circular pitch spaces arrange pitch classes in a circular way [1].
ing perceptual effect: Why does a semi tone shift of a An alternative representation is the array representation [5].

apex angle of the selected area has been set in the way
C e C e C e
that three neighbored tones are covered. Therefore it is
possible to play complete chord cadences by touching a G a G a G

the appropriate start angle at the touch surface. In Fig-


ure 5 the chord cadence F-major, C-major and G-major
h h h
is shown. The active tones have been marked red. Fur- F F F
d d d
thermore it possible to crossfade a first chord into a
IDMT001271

second by drawing the selected area from one point to


another.
The apex angle: The apex angle decides how many Figure 5: Through changing the start angle of the
tones are covered by the selected area. A small apex selected areaÄnderungen
it is possible to play tone and chord
bitte in IDMT001278 vornehmen und hier rein kopieren!

angle like shown in the first example of Figure 6 covers sequences.


only a single pitch class. A continuous increase of the
apex angle leads to the effect that a single tone is cross-
faded into an interval and then into a triad and then into
C e C e C e
a chord consisting of three tones (Figure 6). Through this
it is possible to also play single or two tone arpeggios a G a G a G
or also melodic lines. The example shown in Figure 6
crossfades the tone C into the interval C-e and then into
the major chord C-e-G. F h F h F h

The start radius: In section 3 we proposed to arrange d d d

the different pitches of a certain pitch class along the IDMT001272

radius. This means that a movement of the selected


area along the radial dimension generates different in- Figure 6: Increasing the apex angle makes it pos-
versions of the selected chord. In the current imple- sible to transform a single tone into an interval, a
mentation we decided to assign lower tones to lower triad or a four tone combination.
radius and higher tones to higher radius. This means
that the height of a chord can be defined by locating
the selected area at the desired radius. This is shown in
Figure 7a: The figure shows a selected area that covers section 3 we know that the key related circle of thirds
the C-major chord in the lower octaves. The selected TR only represents the pitch classes of a one key. While
area is moved along the radial dimension i.e. the pitch for beginners this is a enormous simplification it will
height axis. Through this the selected C-major chord become a limitation the more someone’s gets familiar
is shifted towards higher octaves which means that it with the instrument. Therefore we have to look for ways
sounds brighter and brighter. Analog to the crossfade how to play chords from other keys or to play chords
of different chords along the angular dimension it is that fit not into an unique key (e.g. the augmented C-
also possible to crossfade inversions of chords along the major chord c-e-g] ). All in all there are two possibilities,
radial dimension. Through the possibility to assign a to generate such chords: 1.) A change of the current
pitch height to complete chords the player can inter- key. 2.) Rising or lowering single pitch classes of the
wove melodic elements into his or her chord play. underlying pitch space.
The apex radius: Analog to the increase of the apex
angle it is also possible to increase the apex radius of 5.1 Changing the key
the selected area. This means that tones from addi-
tional octaves are added to the chord currently selected. In standard portable keyboards beginners can change
Through the octave doubling of chord tones the chord the key by simply transposing all tones by a number
sounds brighter and fuller. Figure 7b illustrates how a of semitones. This has the following disadvantages:
relative thin and dark C-major chord consisting of only The number of semitones that the current key has to be
tones of one octave is transformed into a bright and full transposed do not stand in a relationship with the result-
chord consisting of tones from several octaves. This is ing perceptual effect: A key change by a half semitone
simply done by increasing the apex radius. would transpose the chord C-major into a the chord C] -
major. But such a chord combination is very seldom
needed. Other key changes that are more often used,
5 Playing non diatonic chords
for example to change the chord G-major into the chord
In the following section it is discussed how chords can D − major would require seven five changes. Another
be played that do not belong to the current set key. From drawback of the transposition of the whole pitch system

key for example, he/she has to touch the point located
at -90◦ . All key changes become immediately visible
and audible. Therefore it is possible to play the cadence
C-major, F-major, G-major and C-major by selecting the
chord C-major (like done in Figure 1) and touching the
appropriate key points at the circle of fifths.
With the help of the circle of fifths we could reach a
more intuitive way to change the key and to play key
from other chords. But the problems of fifths parallels
Figure 7: a) Increasing the start radius crossfades
still exists: If we select the key C-major and change
a chord through different inversions. b) Increasing
to the key G-major then the played chord C-major is
the apex radius adds tones from other octaves to transposed by seven semitones and a forbidden fifths
the chord parallel occurs. To solve this problem we have to go
back to the pitch class/ pitch height space shown in
Figure 3: The space is not simply transposed but shifted
is the generation of fifths parallels which are forbidden along the pitch class axis. As described in section 3 this
in classical composing. But the problems shown before leads to the automatic generation of a well sounding
can be solved with the following two steps: 1.) Instead chord transition.
of using a semitone based transposition we propose to
perform the key changes based on the circle of fifths. 5.2 Rise or lower single pitch classes
The circle of fifths represents all diatonic keys in an or- The second way to play chords from other keys is to rise
der, that keys with more common tones are located close or lower single pitch classes. This means for example to
together, keys with less common tones are located more transform the chord C-major into a chord C-minor the
far away. For example the keys C-major and G-major pitch class e can be lowered by a half semitone to become
have six of seven tones in common, that are the tones c, the pitch class e[ . Like denoted above this alternative
d, e, g, a and b. In the circle of fifths C-major and G-major allows it also to generate non diatonic or other dissonant
are in neighborhood. Another advantage of using the chords. A possibility to implement such a feature is to
circle of fifths for key changes is that the relative minor assign an appropriate user interface element to every of
key and the parallel major key of a given key are local- the pitch classes. With such an interface element it is
ized in a simple geometric ratio with the current key. If possible to rise the appropriate pitch class by a certain
the current key is the key C-major then the parallel key number of semitone.
c-minor can be found by selecting the key −90◦ of the
current one6 . The last advantage of using the circle of
fifths is that it reveals symmetries between different key 6 Pedagogical application
changes: To find the parallel major key of a minor key Combined with a substantiated pedagogical concept the
is done in nearly the same way like finding the parallel HarmonyPad becomes a helpful tool in early music edu-
minor key of a major key. The difference is that not the cation e.g. in kindergartens, in primary schools but also
key -90◦ but the key +90◦ of the current one has to be se- in schools, music schools. Music students should use
lected. Revealing this symmetries helps music students the tool to improve their music theoretical knowledge
to recognize structural redundancies and to internalize and their ability to compose. Through the simplicity of
music theoretical knowledge much effective. play older people can use the HarmonyPad to learn a
The realization of the proposed way to change keys musical instrument still in advanced years. The sub-
can be seen in Figure 1. The outer ring consists of twelve sequent paragraphs summarize music theoretical resp.
grey points which all represent a key. The assignment tonal relationships that can be taught using the Har-
of keys to the grey points is not absolute: The point monyPad:
at the circle’s top represents the key currently selected.
To the left and to the right of the current key the other 1. The student learns the tones that build the most
keys in an order according to the circle of fifths follow. often used major and minor chords. By selecting
If the musician wants to change into the parallel minor a narrow region at the surface of the HarmonyPad
the student can listen to single tones. Enlarging the
6 We do not use the standard mathematical system where the selected region the single tones can be crossfaded
angle 0◦ is located at the x-axis and angles increase counter into major and minor chords.
clockwise. We use the musical coordinate system. Here the
angle 0◦ is located at the positive y-axis i.e. the circle’s top. 2. The student learns functional relationships: By se-
Furthermore the angle increases clockwise lecting an area on the left side of the HarmonyPad
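The fifths-based key change described in section 5.1 can be sketched in a few lines of SuperCollider (our illustration only, not the HarmonyPad code): stepping one position around the circle of fifths, instead of transposing by semitones, yields a new key that shares six of its seven tones with the old one.

(
~keyRoot  = { |fifthSteps| (fifthSteps * 7) % 12 };              // 0 = C, 1 = G, -1 = F, ...
~diatonic = { |root| ([0, 2, 4, 5, 7, 9, 11] + root) % 12 };     // major-scale pitch classes

~diatonic.(~keyRoot.(0)).postln;    // C major: [0, 2, 4, 5, 7, 9, 11]
~diatonic.(~keyRoot.(1)).postln;    // G major: six of seven tones shared with C major
~diatonic.(~keyRoot.(-1)).postln;   // F major: one step the other way round the circle
)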

the subdominant chords S; Sp , by selecting an area required to optimize the HarmonyPad for a given tar-
on the right side the dominant chords D; Dp and get group, this can be infants, children, music pupils,
by selecting an area in the upper center the tonic music students but also older people. All in all it can
chords T; Tp can be played7 . be said that the HarmonyPad complements the piano
3. The HarmonyPad links visual geometric posi- keyboard in a very good way: While the piano key-
tions, musical gestures, musimathematical struc- board organizes the tones along melodic relationships
tures and of course the resulting sound effect. This the HarmonyPad does the same for harmonic relation-
again trains to think in harmonies and to remember ships. For this beside the piano the HarmonyPad should
musical elements. become a central part of every scholar and private music
education.
4. The circular arrangement of pitch classes in the
HarmonyPad allows to visualize chord progres-
References
sions as geometric tracks. Chord progressions can
be linked with a geometric shape. For example the [1] G, Gabriel ; M, Markus ; B-
chord progression T,S, D, T can be visualized as a , Karlheinz ; A, Daniel: Circular Pitch
triangle at the HarmonyPad’s surface. Space based Musical Tonality Analysis. In: 124th
5. By playing the HarmonyPad students learn to as- AES Convention (2008)
sign minor and minor chords to its relative minor [2] G, Gabriel ; M, Markus ;
and major chords. This is possible because major S̈, Christian: Interaction with tonal
and its relative minor chords occupy neighbored pitch spaces. In: Proceedings of the 8th International
regions8 the HarmonyPad. Conference on New Interfaces for Musical Expression
6. The student learns automatically common tones of NIME08, 2008
third related chords [3, S. 142]: For example the [3] K̈, Thomas: Harmonielehre im Selbststudium.
chords a-minor and C-major have an overlapping Neuausg. Wiesbaden u.a. : Breitkopf und Härtel,
region which contains the common tones C and e. 2006. – ISBN 9783765102615
7. With the HarmonyPad it is very easy to learn, [4] M, Markus ; G, Gabriel ; B-
which tones can be used to accompany a given , Karlheinz ; A, Daniel: Circular Pitch
tone in a given key. To find the right tone the se- Space based Harmonic Change Detection. In: 124th
lected area has to be setup such that it contains AES Convention (2008)
three (or four) tones. After that all chords that can [5] M, Markus ; G, Gabriel ; G,
be covered by the previously defined selected area David ; B, Karlheinz: The analysis of
and that contain the tone to be accompanied can be tonal symmetries in musical audio signals. In: Inter-
used to accompany the denoted tone9 . national Symposium on Musical Acoustics ISMA 2007,
8. Last but not least the student learns that there are 2007
different inversions of every chord. This inversions [6] S, Roger: Geometrical approximations to the
can be easily created by moving the selected area structure of musical pitch. In: Psychological Reveview
towards the pitch class axis. 89(4) (1982), Jul, S. 305–333
[7] W, J. D.: Seperating pitch chroma and pitch
7 Summary and conclusion
height in the human brain. In: Proceedings of the
The state of the HarmonyPad described to this point is National Academy of Sciences of the United States of
a base around which many improvements like a pitch America, 2003, S. 10038–10042
height dependent change of the apex angle to prevent
dissonant tone combinations in the lower frequency re-
gions, the integration of multi touch controllers like the
JazzMutant Lemur or other geometric representations
like a cartesian one have been implemented. The out-
come of these further developments is that it is now

7 To find out more about the symbols S, Sp , T, ... look at [3]


8 For example a − C − e and C − e − G
9 At this point it becomes clear that the material for the harmonic
accompaniment are won from the own scale’s tones. [3, S.11]

Sonic interactions with hand clap sounds
Antti Jylhä and Cumhur Erkut
Dept. Signal Processing and Acoustics
Helsinki University of Technology
antti.jylha@tkk.fi

Abstract. In this paper, we present a control interface, which applies the hand claps of the user as control input.
With this system, we aim at providing more realistic and engaging control over the output. We present three
exemplary use cases: controlling a synthetic audience, controlling the tempo of a musical piece, and controlling
a simple sampler. Qualitative evaluation shows that the system performs well in the use cases. The control
interface has potential in other types of human-computer interaction as well.

1 Introduction

Analysis and synthesis of everyday sounds is gaining more and more attention in the research field and among practitioners. Sound designers of computer games and movies, for example, could make use of novel ways of adding synthetic, yet realistic and engaging sounds to their products. Furthermore, sonic interaction design1 (SID), a new branch of interaction design [12], has increased the demand for and discussion of using everyday sounds as conveyors of information. As a young research area, SID still lacks tools for systematic development and evaluation of sonic interaction. The same goes for sound design. In this study, we explore the possibility of applying hand clapping, a simple sonic gesture, as the control signal of an interactive system. By a sonic gesture we refer to a human action which is intended to result in a specific sound.

The analysis and synthesis of hand clap sounds has previously been studied for example by Peltola et al. [8], who describe two different software implementations for the synthesis and parametric control of hand clapping sounds. Although the implementations differ, both systems aim at the synthesis of realistic clap sounds by physically-based modeling. An alternative applause synthesis implementation has been presented by Farnell2.

Lesser and Ellis [7] have studied hand claps in the context of rhythm therapy. They propose a technique to detect and evaluate the claps of a student who has been instructed to repeat a given rhythmic pattern by hand claps. Their aim is a multi-student system which could be applied simultaneously by all the students in a classroom.

Hanahara, Tada, and Muroi [4] have recently proposed a novel method for human-robot and robot-robot interaction based on a hand clap language. Instead of two separate communication systems, their proposal implies a uniform communication system to enable more genuine social interaction among humans and robots. They have defined a set of hand clap patterns as syllables, which can be concatenated to form words and meaningful constructs for both humans and robots to understand.

In any context, to extract meaningful control information from hand clap sounds, the claps must be identified. Considering onset detection in audio signals and the tempo tracking of a musical piece, numerous techniques exist. For onset detection, a detailed review of techniques has been presented by Bello et al. [1]. A real-time onset detection algorithm and its implementation for the GPL-licensed Pure Data (PD)3 software [9] has been described by Puckette et al. in [10]. For a review of tempo tracking and beat tracking techniques, see e.g. [3] and [11]. A tempo tracker based on the latter work has also been implemented for PD under the name rhythm estimator4.

Audio Gestures5 is a real-time software implementation for using general non-speech audio gestures in human-computer interaction. Audio Gestures provides means to train the computer with discrete commands based on the user's non-speech audio gestures.

1 http://www.cost-sid.org
2 http://obiwannabe.co.uk/tutorials/html/tutorial_applause.html
3 Available at http://puredata.info/downloads.
4 Available at ftp://iem.kug.ac.at/pd/Externals/RHYTHM/
5 Audio Gestures was implemented as a course project at Princeton University on the course Computer Science 436 Human-Computer Interface Technology, see http://www.cs.princeton.edu/courses/archive/fall07/cos436/ and http://www.cs.princeton.edu/courses/archive/fall07/cos436/FinalProj07.html.


The implementation has been built on sndpeek6, a package for real-time audio visualization.

6 http://soundlab.cs.princeton.edu/software/sndpeek/

In this study, we aim at providing a prototype of a control interface which applies hand claps as control input. In the current implementation, the control interface can be applied to control a synthetic crowd of clappers, the tempo of a musical piece, or a simple sampler, by hand clap gestures in a continuous fashion. This implementation should be considered an early exploration of the possibilities of such a system, and we aim at extending the framework to other sound synthesis and human-computer interaction applications as well. Examples of intended future implementations include interactive games, novel HCI schemes, and a tool for sound and interaction designers for prototyping and evaluation.

2 Applications

The current implementation of the system can perform in three different functional modes. In the audience mode, the user aims at synchronizing a synthetic audience with her clapping tempo and possibly with a musical piece by clapping to the beat of the music. In the music tempo control mode, the user claps her hands to control the tempo of music. In the sampler mode, the user claps her hands to control a simple table-read sampler. Technical details of the implementation will be discussed in Sec. 3.

2.1 Synchronizing the audience to the beat

Combining sound synthesis of a virtual audience with tempo tracking of hand claps, we can provide a means for controlling a crowd of clappers to the beat of music. This use case is illustrated in Fig. 1. Given a piece of music, the user (the Patron) claps her hands to the beat of the music. This way, the user becomes the music information retrieval system, and the tempo of the user's clapping hands is applied to control the audience. Once the user claps to the beat, the audience follows, and so the user becomes the conductor of the audience, being also part of the clapping crowd. As a result, the audience becomes synchronized with the music. The BPM (beats per minute) of the music is presented to the user based on the clap tempo. This way, the user can, as a by-product of the system, also determine the BPM of a musical piece, assuming that she claps to the correct beat.

Figure 1: The audience synchronization system.

The user interface can also be applied directly to control the virtual audience of clappers without any reference music. This way, it is possible for example to train the crowd to clap to a desired beat, making it possible for the user to perform some other music on the beat.

It is noteworthy that the three-agent system of the user, a piece of music, and the synthetic audience is not restricted to this form of interaction. In another possible scheme, it could be the audience controlling the user to clap, and the music would follow. Alternatively, the user could become the conductor of both the audience and the music simultaneously.

2.2 Controlling the tempo of music

The clapping interface can be applied to control the tempo of a musical piece. For this, there are two options. First, the music can be a pre-recorded piece, which is then time-stretched in real time according to the tempo of the clapping. In a second scheme, the clap tempo can be applied to control a sequencer in PD. Given for example a pre-programmed musical tune or a random-walk sequencer, it is straightforward to map the clapping tempo to the tempo of the music. The strength of the claps can be mapped to the volume, for example.

In this study, we implemented the first option. When the user claps her hands to control the tempo, auditory feedback on the claps is also presented to the user to indicate the detected claps.
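To make the second, sequencer-based scheme concrete, the following is a minimal sketch of the mapping described above. It is not part of the authors' prototype (which implements the phase-vocoder option); the function and parameter names are illustrative assumptions.

```python
def claps_to_sequencer(ooi_ms, clap_strength, max_strength=100.0):
    """Map clap features to sequencer parameters (illustrative sketch).

    ooi_ms        -- estimated onset-to-onset interval of the claps in ms
    clap_strength -- attack power reported by the onset detector
    max_strength  -- assumed full-scale strength used for normalisation
    """
    bpm = 1000.0 * 60.0 / ooi_ms                      # same OOI-to-BPM conversion as Eq. (1) in Sec. 3.1
    volume = min(clap_strength / max_strength, 1.0)   # clap strength -> output gain, clipped to [0, 1]
    return bpm, volume

# Example: claps roughly 500 ms apart at moderate strength -> 120 BPM, gain 0.6
tempo, gain = claps_to_sequencer(ooi_ms=500.0, clap_strength=60.0)
print(tempo, gain)
```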


2.3 Hand-clap-driven sampler

Another simple application for the interface in the musical domain is a sampler controlled by the clapping hands. The user can select an audio sample (a .wav file) and then control the playback of the sample with the claps. This application can also function in two ways. The sample can be automatically looped, so that the tempo of the claps controls the looping rate and the rate of reading the sample from a table. Alternatively, the sample can be played every time a clap is detected.

A clap-driven sampler could be applied, for example, to control the tempo of a drum loop in a musical performance, by mapping the clap tempo to the loop-playing rate. Another example could be the sound design of movies, when discrete auditory events are to be placed in the movie soundtrack in a rhythmic way or at instantaneous locations. The sound designer could for example mimic the rhythm of a monster's footsteps with her hand claps to glue the samples into the desired place.

3 System architecture and technique

The architecture of our system consists of a computer running PD, a microphone, and the user. The user is an integral part of the system, as the user's gestures are needed to control the system.

Capturing the hand clap sounds is easy to do with any conventional microphone. To extract control information from hand claps in our system, the microphone does not have to be of high quality. This makes the system widely applicable, because consumer-oriented computer microphones are not expensive, and many computers even have a sufficient built-in microphone.

PD is a graphical programming language that was originally developed for audio signal processing [9]. PD programs consist of graphical patches, which may contain different objects (functional elements), messages (parameters for the objects), arrays, and other programming elements. These elements are connected to each other from their inlets and outlets by drawing a line between them, i.e., routing the data through programming commands. For example, a summation element contains two inlets and one outlet to accept two numbers to sum and to give their sum as output. A PD patch may contain many nested parts, which may be either subpatches (saved along with the main file), abstractions (graphically programmed individual .pd files), or externals (PD objects compiled from C code). Control data, e.g., numbers and strings, is processed without real-time requirements, while signals are processed in blocks of data at a different rate.

For the virtual audience application of Sec. 2.1, PD runs a modified version of the hand clap synthesis engine ClaPD [8], with a new control interface and additional functionality. In its previous versions, the control parameters of ClaPD have only been adjustable by conventional HCI techniques, i.e., the mouse and the keyboard. Here we propose a technique for using hand claps as input to extract control parameters for the system. The hand clap audio data is processed with PD to yield parametric control data for the synthesis engine. ClaPD and the control interface are described in detail in the remainder of this section. The techniques for implementing the music tempo control and sampler applications are also discussed.

3.1 Extracting control parameters

From the user's hand claps, we extract three types of control parameters: onsets, tempo, and strength. For onset detection, we have experimented with different envelope-based and band-based methods. In this prototype, we chose to apply a readily implemented PD object known as bonk∼, which is designed for detecting and classifying percussive sounds [10]. The algorithm is based on an analysis of the incoming signal in 11 frequency bands in overlapping time windows. The overall change in the power of the bands is applied for detecting an attack. The bonk∼ object can also be trained to classify percussive sounds by template matching [10].

The output of bonk∼ is the power of each frequency band, the tag of the class if the classification task is relevant, and the summed-up power of the subbands. We use the output to determine the onset-to-onset interval (OOI) between subsequent hand claps with a simple deterministic tempo tracker. Every time bonk∼ gives an output, an onset has occurred, and we can use this onset information for further rhythm estimation. From bonk∼, we also directly obtain the power of the attack as an estimate for the strength of the current clap. Although the clap strength remains unused in this prototype, its potential is acknowledged for future work.
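As a rough illustration of this kind of band-based attack detection, the sketch below computes band energies over short frames and reports an onset (with a strength estimate) when the summed band power rises sharply. It is a simplified stand-in, not the bonk∼ algorithm itself; the band layout, threshold and frame sizes are assumptions.

```python
import numpy as np

def detect_onsets(x, sr, n_bands=11, frame=256, hop=128, rise_db=8.0):
    """Very simplified band-based attack detector (illustrative, not bonk~).

    Returns a list of (time_in_seconds, strength) tuples, where strength is
    the summed band power of the attack frame.
    """
    onsets, prev_power = [], None
    for start in range(0, len(x) - frame, hop):
        spectrum = np.abs(np.fft.rfft(x[start:start + frame] * np.hanning(frame))) ** 2
        bands = np.array_split(spectrum, n_bands)        # crude split into frequency bands
        power = np.array([b.sum() for b in bands])
        if prev_power is not None:
            rise = 10.0 * np.log10((power.sum() + 1e-12) / (prev_power.sum() + 1e-12))
            if rise > rise_db:                            # sharp overall rise in band power -> attack
                onsets.append((start / sr, float(power.sum())))
        prev_power = power
    return onsets
```

A real detector would additionally apply per-band weighting and a short refractory period so that a single clap does not trigger several consecutive onsets.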


The onset information obtained from bonk∼ is further processed by the rhythm estimator object, which is an online tempo tracking object for onset information, i.e., bang messages in PD. The rhythm estimator algorithm was originally developed for tatum grid analysis of musical signals [11], but it serves well to provide an estimate of the intended clapping tempo in our research, too. Tatum is the smallest metrical level in musical rhythm, which in the case of quasi-periodic hand clapping translates to the average duration between successive claps. The algorithm is described in more detail in [11]. It is based on analyzing the OOIs by storing the OOI values in a time-varying OOI histogram (inter-onset interval (IOI) histogram in [11]). This analysis is performed continuously, not only for each consecutive onset pair but for each detected onset within a specified time frame. The histogram is updated for every new onset as a leaky integrator, and a remainder error function is calculated for the histogram information as a function of the tatum period. The minimum of the error function is searched for by parametric thresholding, and if the threshold is met, the point is chosen as the estimate of the tatum period.

The rhythm estimator object's right outlet gives the estimate of the average OOI as intended by the performer of the rhythmic sequence. The value is in milliseconds, so it can be converted to BPM as

    BPM = 1000 · 60 / OOI.   (1)
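The following sketch shows how such a deterministic tracker could be realized: each new onset updates a leaky OOI histogram, the most populated bin is taken as the tatum (average OOI) estimate, and Eq. (1) converts it to BPM. This is a simplified stand-in for the rhythm estimator object, not its actual implementation; the 13 ms bin width anticipates the resolution discussed below, while the leak factor and histogram length are assumptions.

```python
class OOITempoTracker:
    """Leaky OOI-histogram tempo tracker (simplified sketch, not the PD rhythm estimator)."""

    def __init__(self, bin_ms=13, max_ooi_ms=1040, leak=0.9):
        self.bin_ms, self.leak = bin_ms, leak
        self.hist = [0.0] * (max_ooi_ms // bin_ms)
        self.last_onset_ms = None

    def onset(self, t_ms):
        """Register a clap onset at time t_ms; return (ooi_ms, bpm) or None."""
        if self.last_onset_ms is not None:
            ooi = t_ms - self.last_onset_ms
            idx = int(ooi / self.bin_ms)
            if idx < len(self.hist):
                self.hist = [h * self.leak for h in self.hist]   # leaky decay of old evidence
                self.hist[idx] += 1.0                            # accumulate the new OOI
        self.last_onset_ms = t_ms
        if any(self.hist):
            best = max(range(len(self.hist)), key=lambda i: self.hist[i])
            ooi_est = (best + 0.5) * self.bin_ms                 # bin centre as OOI estimate
            return ooi_est, 1000.0 * 60.0 / ooi_est              # Eq. (1)
        return None

# Example: regular clapping every 500 ms settles near 120 BPM
tracker = OOITempoTracker()
for t in range(0, 5000, 500):
    estimate = tracker.onset(t)
print(estimate)
```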
While experimenting with rhythm estimator, we found out that the OOI histogram array in PD was of fixed size. In practice, this means that the histogram bin width (i.e., resolution) is proportional to the maximum OOI allowed by the histogram. This leads to a tradeoff between BPM resolution and lower bound. With default settings (OOI resolution = 5 ms), the maximum OOI detected by the system is 395 ms, yielding a lowest possible BPM of 151.9 according to Eq. 1, which is very high for most musical styles. Although the rhythm estimator example patches show how the histogram could be made longer by PD messaging, this feature seemed not to work on the test platforms. However, changing the histogram resolution did work, so we changed the resolution to 13 ms in order to obtain a maximum OOI of more than one second. This would be required to detect BPMs of 60, which we assumed as a reasonable lower bound for this prototype.

By combining bonk∼ and rhythm estimator, we created a clap tracker PD abstraction to detect the claps of the user and to estimate the OOI, BPM, and strength of the clap. The abstraction is depicted in Fig. 2. To make it possible to compare the gestural control interface to a more conventional input, the clap tracker also accepts bang messages into its other inlet. In our implementation, these messages can be generated by mouse clicking in the graphical control interface depicted in Sect. 3.6.

Figure 2: The clap tracker PD abstraction. The two inlets accept audio data (left) and bang messages (right). The object outputs the estimated clap strength as the velocity output of the bonk∼ object and the OOI and BPM values calculated by the rhythm estimator object. The ooi2bpm object converts OOI values to BPM. For bonk∼, a threshold value is provided to make the abstraction more robust against environmental noise. The spigot tests if clapping was selected as input type before announcing the detection of a real clap.

3.2 Synthesis of hand claps with ClaPD

ClaPD7 is a PD library for hand clapping synthesis and control [8]. It contains low-level synthesis and higher-level control blocks, and primitive event generators, which are fine-tuned by hand-clapping statistics. It can produce expressive, human-like synthetic applause of a single clapper with adjustable hand configuration, or asynchronous or synchronous applause of a clapper population (audience).

7 ClaPD has been released as free software under the GNU Public License (GPL) and can be downloaded from http://www.acoustics.hut.fi/software/clapd.

The three layers in ClaPD form a sound synthesis and control hierarchy: the event layer determines the clapping mode (synchronous/asynchronous), the control layer determines the tempi and the hand configuration of the individual clappers, and the synthesis layer carries out the actual audio calculation. The possibility of interacting with the audience via a master clapper, that is, the master clock each clapper listens to in the control layer, was indicated in [8], but left unexplored.
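The layered organization can be pictured with a small skeleton: an event layer schedules claps for a population, a control layer keeps per-clapper tempo and hand-configuration state, and a synthesis layer would render the actual clap sounds from these events. This is only an illustrative Python sketch of that structure, not ClaPD itself; all names and default values are assumptions.

```python
import random

class Clapper:
    """Control-layer state of one virtual clapper (illustrative sketch)."""
    def __init__(self, period_ms=440, hand_config=0):
        self.period_ms = period_ms                      # preferred onset-to-onset interval
        self.hand_config = hand_config                  # index into a set of hand configurations
        self.next_clap_ms = random.uniform(0, period_ms)

class Audience:
    """Event layer: schedules claps for a clapper population (illustrative only)."""
    def __init__(self, n_clappers=8, synchronous=False):
        self.clappers = [Clapper(period_ms=440 + random.uniform(-40, 40),
                                 hand_config=random.randrange(4))
                         for _ in range(n_clappers)]
        self.synchronous = synchronous

    def events_until(self, t_ms):
        """Return (time, clapper) pairs that a synthesis layer would turn into audio."""
        events = []
        for c in self.clappers:
            while c.next_clap_ms <= t_ms:
                events.append((c.next_clap_ms, c))
                c.next_clap_ms += c.period_ms
        return sorted(events, key=lambda e: e[0])
```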


After the initial release of ClaPD (rel-0.1), its audience mode has been significantly extended [2]. Currently the audience can be dynamically generated and multiple sound generators can be hosted in ClaPD. The last property allows us to implement a special Patron object, which, by listening to the tapping or clapping of the user, can drive the audience to synchronize with it within the limits of the virtual audience. These limits are imposed by the cosc (coupled oscillator) object that performs the entrainment of each virtual clapper.

In the synchronized mode, each clapper aims to clap at the same rate (frequency-locking) and absolute time (phase-locking) as the Patron, and calculates its phase difference (measured in milliseconds) with the Patron. Since the Patron has a quasi-stationary OOI_P, the phase difference can be considered a uniform distribution U(0, OOI_P) with a mean of OOI_P/2 ms. If a clapper is trailing behind the Patron, the phase difference is smaller than OOI_P/2 ms and its clapping rate is accelerated. Similarly, if the clapper is ahead of the Patron (phase difference greater than OOI_P/2 ms), its clapping rate is slowed down. The exact expressions for the acceleration and deceleration are given in [8] for the constant Patron OOI of 440 ms.

This scenario can be extended to generate synchronous virtual applause along an external rhythmic piece of music, as explicated in Sec. 2.1. In our experiments, we found synchronizing the ClaPD clappers to be easy when clapping around the built-in preferred clapping rate of the virtual clappers (440 ms), but clapping much faster or slower did not lead to synchronization. Therefore, we have made the Patron rate variable. Admittedly, this approach makes the original model behave in a less natural way, but gives good results in synchronizing the audience.

3.3 Music tempo control

To provide control over the tempo of recorded music without pitch artifacts in the resulting sounds, a time-stretching technique is required. As a prototype time-stretching method, we use a phase vocoder provided with the PD distribution3 in the help patch I07.phase.vocoder.pd, with added functionality responding to external user control. The phase vocoder applies the short-time Fourier transform (STFT) to decompose the incoming signal into frequency bands, and resynthesizes the signal based on the analysis of the decomposed signals [6]. In the resynthesis, it is possible to scale the time of the output, which we apply for adjusting the tempo while maintaining the original pitch.

Basically, the estimated BPM of the clapper is mapped to the BPM of the music. This is done by adjusting the precession speed of the phase vocoder synthesis stage according to the user input. The phase vocoder accepts the precession speed in hundredths of a second. In order for the mapping to work realistically, that is, for the clapper's claps to match the beats of the music, the reference BPM of the original song needs to be known. If the reference is not known beforehand, it can be determined by the user by clapping to the beat of the song in the virtual audience mode. The mapping of the clap BPM to the phase vocoder precession speed is linear, defined as

    v = 100 · BPM_clap / BPM_ref,   (2)

where v is the precession speed, BPM_clap is the estimated clap BPM, and BPM_ref is the reference BPM. If a wrong reference BPM value is applied, the claps of the user will still control the tempo, but they will not match the beats of the music.

When the user claps to control the music, sonic feedback is provided to the user to indicate the detected claps. This feedback is a hand clap sound sample8, which has been written to a table and is played back every time a clap has been detected. While it would also be possible to route the clapper's own clapping sounds back to the audio output, this would result in feedback problems if loudspeakers are used to reproduce the sounds.

8 The sample can be freely obtained from http://bigsamples.free.fr/d_kit/clap/handclap.wav.

The major downside of this time-stretching technique is that the applied phase vocoder implementation does not readily work with streaming audio. There are also some audible artifacts in the processed sound, such as "phasiness" and "loss of presence", which are characteristic of phase vocoders [6]. However, for prototyping purposes, the simple phase vocoder proved to be quite sufficient.
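The arithmetic of Eq. (2) is summarized in the short sketch below. It only illustrates the mapping; it does not drive the actual PD phase vocoder patch, and the example values are invented.

```python
def precession_speed(bpm_clap, bpm_ref):
    """Eq. (2): linear mapping of clap BPM to phase-vocoder precession speed.

    The PD phase vocoder patch expects the speed in hundredths of a second,
    hence the factor of 100.
    """
    return 100.0 * bpm_clap / bpm_ref

# Example: reference song at 96 BPM, user currently clapping at 120 BPM
# -> the vocoder is asked to run 1.25x its normal speed (v = 125).
print(precession_speed(120.0, 96.0))
```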


3.4 Sampler

Our implementation for the sampler follows a simple interpolating table-read approach. The user can select the sample she wishes, which is then written to a table in PD. Depending on the user's preferences, the sample can be read from the table either every time a clap is detected, or sequentially according to the tempo of the user's claps.

In the non-looping mode, the user's claps act as a trigger for the samples. For each detected clap the sample is simply read from the table it has been written to, and played. In the looping mode, the estimated OOI of the user's claps is mapped directly to the looping rate, i.e., the sample is repeated sequentially according to the tempo of the claps. Also the pitch of the sample is allowed to change by reading the sample from the table with a speed inversely proportional to the estimated OOI. The table reading speed σ (samples per second) is

    σ = L / OOI,   (3)

where L is the length of the sample in samples. Thus, slower clapping results in a sequence of slowly repeated low-frequency sounds, while rapid clapping leads to a rapid looping of high-pitched instances of the sample.
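A worked example of Eq. (3), assuming the OOI is expressed in seconds here: a one-second sample (L = 44100 samples at 44.1 kHz) clapped with an OOI of 0.5 s is read at twice its original rate and therefore sounds an octave higher. The variable names below are illustrative.

```python
def table_read_rate(sample_len, ooi_s):
    """Eq. (3): table reading speed in samples per second for looping playback."""
    return sample_len / ooi_s

sample_len = 44100            # one second of audio at 44.1 kHz
for ooi in (1.0, 0.5, 0.25):  # clapping slower or faster
    rate = table_read_rate(sample_len, ooi)
    print(ooi, rate, rate / 44100.0)   # OOI, read rate, resulting speed/pitch factor
```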

3.5 Latency

In the applications, there is an amount of latency introduced by the system between the excitation and the response. The latency as a whole is a sum of many software- and hardware-related latencies, including the soundcard latency, PD audio buffering, and the computational latency of the PD program. The overall system latency is strongly dependent on the operating system and its audio drivers, and the way these are configured. While it is not straightforward to measure all the different latencies, the latency within the PD program can be estimated by PD's own realtime object, which calculates the elapsed time between two program events.

As a result of the latency, the user's claps and the claps of the virtual audience are not trivially simultaneous. Measuring the latency and taking into account the cyclic nature of clapping, a simple remedy for the problem is delaying the system response to match the time of the user's next predicted clap. This prediction can be calculated as the difference between the estimated time between claps and the measured latency.

Naturally, latency appears also in the other functionalities, where it cannot be compensated for with the simple trick used in the audience mode. On the other hand, informal evaluation of the system indicated that while the latency is noticeable, it is not necessarily disturbing.
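A minimal version of this compensation trick: given the estimated OOI and a measured system latency, the response is postponed so that it coincides with the next predicted clap instead of arriving a fixed latency late. The names below are illustrative and not taken from the PD patch.

```python
def response_delay(ooi_ms, latency_ms):
    """Delay (ms) to add so a synthetic clap lands on the user's next predicted clap.

    If the latency already exceeds one clap period, aim for the clap after that.
    """
    delay = ooi_ms - (latency_ms % ooi_ms)
    return delay % ooi_ms

# Example: clapping every 500 ms with 80 ms of system latency
# -> postpone the response by 420 ms so it falls on the next user clap.
print(response_delay(500.0, 80.0))
```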


3.6 Graphical control interface

The graphical control interface is presented in Fig. 3. Although the actual interaction with the system is performed with hand claps, a conventional interface for selecting the functionality and other relevant controls is required.

The control interface consists of a selector for the functional mode, a selector for the input type, a BPM visualizer, output gain and audio on/off settings, and GO! and STOP! buttons for performing mode-specific start and stop commands. In the virtual audience mode, the GO! button starts the synthetic applause, while the STOP! button ends it. In music tempo control mode, the user should first adjust the song BPM to the normal BPM of the piece. Then loading a song, hitting GO! and starting to clap will start playback of the music at the reference BPM, and STOP! will end the playback. In sampler mode, GO! and STOP! will start and stop a loop, if looping has been selected. There are also buttons for loading the music or the sample, i.e., opening up an "Open file" window, so that the user can select the music she wants to control or the sample to use in the sampler. In this prototype, these must be .wav files.

Figure 3: The PD control interface of the example applications. The user can select the functional mode and adjust the other settings as required. The BPM estimated from the user's claps is presented in all modes.

In the virtual audience mode, a visualization of the clapping crowd is presented as a blinkenlights9 canvas, i.e., a grid of flashing pixels indicating individual clappers. The middlemost pixel in the grid visualizes the claps of the user.

9 http://ydegoyon.free.fr/software.html

4 Informal evaluation of the system

To evaluate the system, we requested two persons to try out the user interface. The emphasis was on the virtual audience mode and the tempo control mode. The persons were first instructed to clap their hands and try to get the virtual audience synchronized with them. Both test subjects reported that the synchronization feels both realistic and engaging. However, an amount of latency is noticeable, especially during accelerandos and ritardandos.

The idea of controlling the tempo of music with hand claps was appealing to the test subjects. It turned out, however, that controlling the tempo of an unfamiliar musical piece required some practice. The original tempo of the music seemed to control the clapping tempo of the user at first, and it required conscious effort to break the cycle. After a few tryouts, the test subject was able to control the tempo. Although the test subjects did not report it, according to the authors' own experiments a similar phenomenon seems to occur in the audience mode. The tempo of the clapping crowd affects the tempo of the user's clapping, and it requires concentration to start clapping in a different tempo than the crowd.

Latency was noticed by both test subjects in the virtual audience application, but it was not reported as too distracting. With real clapping the latency is more severe than with mouse clicks, due to buffering of the incoming audio and also in part due to the bonk∼ object, which is naturally computationally intensive when compared to simply sending a bang message.

5 Conclusions

We introduced a prototype of a gestural control interface for sound synthesis driven by hand clap sounds, and simple applications to demonstrate the potential of such an interface. Hand claps, being a universally understood sonic gesture, make a natural control signal which is easy to learn. The most important steps in the future are to elaborate the system with more efficient and novel algorithms, and to come up with innovative applications for the control interface. Although the possibilities of using hand claps as control signals may seem limited, these possibilities are well worth exploring due to the potential of hand claps as an easily understood means of control.

A drawback of the current system is that it incorporates an amount of latency between the excitation and the response. An important objective in the future is to optimize the computational efficiency of the program in order to reduce this latency to the minimum. Also, the rhythm estimator histogram resolution issue discussed in Section 3.1 needs to be solved in a way that does not affect the resolution of the estimated BPM as much. Alternatively, another algorithm could be developed for rhythm estimation.

During this study, we did not formally compare the two control mechanisms, i.e., hand claps and mouse clicks. To make a final justification for applying a hand clap interface in human-computer interaction, such an experiment should be undertaken. Also the distractiveness of the latency could be subject to subjective testing.

To provide another modality to the interface and more variety in the control signals, the trajectories of the hands can be extracted from the actions of the clapper. In the course of this research, we have already begun experimenting with optical hand tracking. The concept will be studied in more detail for the next prototype of the interface.

The current system makes no effort to distinguish between hand claps and other impulsive excitation signals, such as tapping the microphone. The system would be more robust against impulsive noises if such a distinction could be made. Another important addition to the system is to exploit the identification of different hand clap types. Using this discriminative information, it is possible to construct more diverse control mappings and sophisticated human-computer interaction with simple and natural gestures. Although bonk∼ would serve as a testbed for such an extension, too, we aim at providing another classification scheme for the purpose. An algorithm and first results of offline identification of eight different hand clap types have already been presented [5], and a real-time implementation of the algorithm, as well as its further development, has been indicated as future work.

6 Acknowledgements

This work is supported by the Academy of Finland (project 120583 Schema-SID). Our thanks to Leevi Peltola for the initial release of the ClaPD software, and to Jussi Pekonen and Hannu Pulakka for their comments and for volunteering as test subjects in the evaluation of the system.

References

[1] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. IEEE Trans. Speech and Audio Processing, 13(5):1035–1047, 2005.

[2] C. Erkut and K. Tahiroglu. ClaPD: A testbed for control of multiple sound sources in interactive and participatory contexts. In PureData Convention 2007, Montreal, Canada, August 2007. Online proceedings at http://artengine.ca/~catalogue-pd.

[3] F. Gouyon and S. Dixon. A Review of Automatic Rhythm Description Systems. Comp. Music J., 29(1):34–54, 2005.

[4] K. Hanahara, Y. Tada, and T. Muroi. Human-robot communication by means of hand-clapping (preliminary experiment with hand-clapping language). In Proc. IEEE Intl. Conf. Systems, Man and Cybernetics, pages 2995–3000, Montreal, Canada, October 2007.

[5] A. Jylhä and C. Erkut. Inferring the hand configuration from hand clapping sounds. In Proc. 11th Intl. Conf. Digital Audio Effects (DAFx-08), pages 301–304, Espoo, Finland, September 2008.


[6] J. Laroche and M. Dolson. Improved phase vocoder time-scale modification of audio. IEEE Trans. Speech and Audio Processing, 7(3):323–332, 1999.
[7] N. Lesser and DPW Ellis. Clap detection and dis-
crimination for rhythm therapy. In Proc. IEEE
Intl. Conf. Acoustics, Speech, and Signal Process-
ing, volume 3, pages 37–40, Philadelphia, PA,
USA, March 2005.
[8] L. Peltola, C. Erkut, P.R. Cook, and V. Välimäki.
Synthesis of hand clapping sounds. IEEE
Trans. Audio, Speech and Language Processing,
15(3):1021–1029, 2007.
[9] M. Puckette. Pure data: another integrated com-
puter music environment. In Proc. Second Inter-
college Computer Music Concerts, pages 37–41,
Tachikawa, Japan, 1996.
[10] M. Puckette, T. Apel, and D. Zicarelli. Real-time
audio analysis tools for Pd and MSP. In Proc.
Intl. Computer Music Conference, pages 109–112,
Ann Arbor, MI, USA, October 1998.
[11] J. Seppänen. Computational models of musical
meter recognition. Master’s thesis, Tampere Uni-
versity of Technology, 2001.
[12] H. Sharp, Y. Rogers, and J. Preece. Interaction
Design: Beyond Human Computer Interaction.
Wiley, Indianapolis, IN, USA, March 2007.

Toward a Salience Model for Interactive Audiovisual Applications of
Moderate Complexity
Ulrich Reiter, Q2S - NTNU, Trondheim, Norway, reiter@q2s.ntnu.no

Abstract. To provide users of interactive audiovisual application systems with subjectively high presentation
quality of the content (Quality of Experience, QoE), it is usually not effective to increase the simulation depth
of the rendering process alone. Instead, by focusing on salient parts of the content, perceived overall quality can
be increased without causing additional computational costs. This paper provides the basis for a novel salience
model for interactive audiovisual applications of moderate complexity that is based on influence factors which
have been identified in a coordinated series of experimental studies.

1 Introduction

The question of saliency of objects in audiovisual applications is only recently becoming an issue of examination [1]. Until now, many application systems mainly rely on visual display and feedback to the user, with some kind of "support" in the auditory domain. As the computing power available in home application and consumer electronics systems is constantly increasing, we see a tendency toward integrating more modalities which until now have only been available in specialized Virtual Reality (VR) systems. With this development, users can expect an increased degree of immersion. This is interesting in many aspects, one of them being that applications with higher immersion are generally considered more user-friendly because they provide a feeling of personalization. Additionally, these systems better represent real life and the complexity of real-life experiences by offering information multimodally.

The general problem with this approach is that resources in consumer-oriented application systems are always limited. It is not feasible to perform a fully grown, detailed simulation of multimodal impressions in real-time. Furthermore, the time and investment necessary to develop completely accurate auditory and visual models is as much of a limiting factor for how much detail will be rendered as is the computational power alone. It is therefore reasonable to focus only on the most important stimuli and leave out those that would go unnoticed in a real world situation. In order to do so, it is necessary to estimate what the most important stimuli or objects in the overall audiovisual percept are.

After giving a definition of saliency in the audiovisual context in section 2, section 3 describes the role of interactivity and presence in the perception of audiovisual quality. Section 4 introduces the salience model itself. Section 5 summarizes the experiments performed to verify the factors contained in the model. It also discusses briefly the main results. References to the full papers are given in each subsection. Finally, section 6 gives a short summary and outlook.

2 Saliency of Stimuli

In the absence of information about the history of an interactive process, an object can be considered salient when it attracts the user's visual attention more than the other objects [2]. This definition of salience, originally valid for the visual domain, can easily be extended to what might be called "multimodal salience", meaning that

• certain properties of an object attract the user's general attention more than the other properties of that object

• certain objects attract the user's attention more than other objects in that scene.
Stimulus vidual’s participation in a communication setting pos-


Environment sible and efficient” [7]. It is this individual’s quality
experience that we are interested in.
Steuer holds that interactivity is a stimulus-driven
modifies samples
variable which is determined by the technological struc-
ture of the medium [8]. According to Steuer, interac-
tivity is “the extent to which users can participate in
Perceptual modifying the form and content of a mediated envi-
Schemata
Exploration ronment in real time” - in other words, the degree to
which users can influence the target environment. He
directs
identifies three factors that contribute to interactivity:
• speed (the rate at which input can be assimilated
into the mediated environment)
Figure 1: Neisser’s Perceptual Cycle, after [3], • range (the number of possibilities for action at
modified. any given time)
• mapping (the ability of a system to map its con-
trols to changes in the mediated environment in a
the schemata, which will then in turn direct the ex- natural and predictable manner)
ploratory process. These factors are related to technological constraints
Whereas a picture of a human being or a human that come into place when an application is supposed
speech utterance can be considered more or less equally to provide interactivity to the user. Adding to these
salient to all users, because its significance to humans factors, the perceived quality of the system’s feedback
is embedded genetically, an acoustically trained per- to the user - the quality of the audiovisual stimuli
son might focus more on the reverberation in a virtual generated as a reaction to the user’s input - plays an
room than a visually oriented person. The task model equally important role.
describes the fact that salience depends on intention- Closely related to interactivity is presence. Pres-
ality, so that depending on the task the user is given, ence in interactive audiovisual application systems or
his focus will shift accordingly. Virtual Environments is often described as the feeling
Salience also depends on the physical characteristics of “being there” [9] that generates involvement of the
of the objects themselves. Following the Gestalt theory user. Lombard and Ditton define presence in a broader
introduced by Wertheimer [4], the most salient visual sense as the “perceptual illusion of nonmediation” [10].
form is the one requiring the minimum of sensory in- According to Steuer, the level of interactivity (de-
formation to be treated. In the auditory domain it is gree to which users can influence the target environ-
known that certain noises which can be characterized ment) has been found to be one of the key factors for
with properties like ’sharpness’ or ’roughness’ call the the degree of involvement of a user [8]. Steuer has
attention more than others [5], often by skirting mask- found vividness (ability to technologically display sen-
ing effects in the time or frequency domain due to their sory rich environments) to be the second fundamental
spectral or temporal characteristics. Adding to this, component of presence. Along the same lines, Sheridan
salience can be due to spatial or temporal disposition assumed the quality and extent of sensory information
of the objects. Thus a classification of the properties that is fed back to the user as well as exploration and
that can make an object salient in a particular context, manipulation capabilities to be crucial for the subjec-
the so-called influence factors, have to be established tive feeling of presence [11].
and verified in order to draw any useful conclusions Interactivity requires attention of the user. Only
from the “multimodal salience” approach. when both - the system and the user - react to each
other, true interactivity is in place. Therefore, inter-
3 Influence of Interactivity and Presence activity can be regarded as one way of controlling the
focus of attention of a user. This is important, as ob-
The concept of interactivity has been defined by Lee
jects in the user’s focus will naturally be more salient
et al. based on three major viewpoints: technology ori-
than others.
ented, communication-setting oriented, and individual
oriented views [6, 7]. Here, the technology-oriented
view of interactivity is adopted. The “technology-
4 Salience Model
oriented view of interactivity defines interactivity as a A salience model would thus mainly contain (and ide-
characteristic of new technologies that makes an indi- ally quantify) the influence factors that control the

-2-

102
Reiter - Toward a Salience Model for Interactive Audiovisual Applications of Moderate Complexity

INFLUENCE INFLUENCE INFLUENCE QUALITY


FACTORS 1: FACTORS 2: FACTORS 3: ATTRIBUTES:
reproduction setup physiology subject (experience, synchrony (temporal,
setup-dependent (acuity,masking,…) gender, social, spatial)
content expectations,…) balance of modalities
task (interactivity, qualitative attributes
application,…) (motion, mood,
space, action)

experimental
SENSORY COGNITIVE QUALITY evaluation
STIMULUS
PERCEPTION PROCESSING IMPRESSION

REACTION

Figure 2: The suggested salience model for interactive audiovisual applications of moderate complexity.

level of saliency of each perceived object. It is easily cognitive processing on the other hand. Sensory per-
seen that a generalized salience model is too complex ception can be affected by a number of influence factors
and the influence factors too manifold to cope with at of level 2. These involve the physiology of the user
the current state of knowledge. Therefore it is nec- (acuity of hearing and vision, masking effects caused
essary to get away from a generalized salience model. by limited resolution of the human sensors, etc.) as
Instead, it is reasonable to focus on a salience model well as all other factors directly related to the physical
valid for interactive audiovisual applications of moder- perception of stimuli.
ate complexity. Cognitive processing produces a response by the
Fig. 2 shows how such a salience model may be struc- user. This response can be obvious, like an immedi-
tured. The basis of human perception are the stimuli. ate reaction to a stimulus, or it can be an internal re-
For interactive applications these are generated by the sponse like re-distributing attention, shifting focus or
application system itself, so they will depend on a num- just entering another turn of the Perceptual Cycle (see
ber of factors: the influence factors of level 1. These [3]). Obviously, the response is governed by another set
factors comprise the audiovisual reproduction setup, of influence factors of level 3. These span the widest
e.g. (multichannel) loudspeaker setup, headphones, fre- range of factors, and also the most difficult to quantify:
quency range, panning laws and -algorithms applied, experience, expectations, and background of the user;
size and resolution of screen, brightness and color dis- difficulty and aim of task (if any); degree of interactiv-
tribution, etc. Note that the actual weight of these fac- ity; type of application; etc. Influence factors of level 3
tors may also depend on the audiovisual content itself: are related to the processing and interpretation of the
a static, acoustically “dry” sound source in a frontal perceived stimuli.
position will not be critical to different panning laws Cognitive processing will eventually lead to a certain
or number of loudspeaker channels in the back. In- quality impression (Quality of Experience, QoE) that
fluence factors of level 1 also comprise technical input is a function of all influence factors of levels 1-3. This
devices for user feedback to the system. As an ex- quality impression cannot be directly quantified by hu-
ample, navigation in a 3D scene can be controlled via mans. It needs additional processing to be uttered in
computer mouse, keyboard, joystick, accelerometer or the form of (quantitative) ratings on a quality scale, as
other advanced technical devices that offer differing de- (qualitative) semantic identifiers, and so on.
grees of freedom (DOF) and thus differing amount and The common way of assessing the overall quality im-
precision of control. This in turn influences how pre- pression is to evaluate single or combined quality at-
cisely the system can react by appropriately producing tributes. The scientific community has developed a
/ modifying the stimuli. To summarize, influence fac- number of attributes that are believed to be relevant
tors of level 1 are those related to the generation of for an overall audiovisual quality impression. Among
stimuli. these are audiovisual synchrony (both temporal and
The core elements of human perception have been spatial), the localization of events, sound as well as
identified to be sensory perception on the one hand and video quality by themselves (which, nevertheless, influ-

-3-

103
Reiter - Toward a Salience Model for Interactive Audiovisual Applications of Moderate Complexity

-15° 15°
ence each other), responsiveness to interaction (when
applicable), and many more.
2.72 m
Woszczyk et al. have tried to arrange these into a -45° 45°

4×4 matrix of “perceptual dimensions” (Action, Mood,


Motion, Space) vs. attributes (Quality, Magnitude, In-

2.8 m
volvement, Balance) within these dimensions [12]. But
again, a quantification of their impact is hardly possi-
ble as of now. This is because their weight not only
3.4 m
depends on the audiovisual content (the stimulus) un-
der assessment, but also on the experimental evalua-
tion (the test methodology) itself. An attribute that is -105° 105°

explicitly asked for will probably be assumed to be of


higher importance by the test subject (we know from
our experience that only important things are asked
for in any kind of test). The subject’s attention will be
directed toward the attribute currently under assess-
ment, an act that distorts unbiased perception of the -165° 165°

overall multimodal stimulus. Therefore, the subject’s


reaction might be influenced as well. Figure 3: Overview of the test setup: large pro-
jecting screen in the front and eight channel loud-
5 Experimental Evaluation speaker setup unevenly distributed between front
A number of subjective assessments have been per- and back. The higher number of loudspeakers in
formed to verify the influence factors that were pre- the screen area (front) increases audiovisual coher-
viously identified for a typical prototype reproduction ence.
setup of interactive audiovisual content. This exem-
plary setup makes use of a large projecting screen, a
multichannel loudspeaker setup, and real-time room- loudspeaker channels as low as possible without intro-
acoustic simulation rendered on a standard PC. Fig. ducing deteriorations in localization quality.
3 adumbrates the top view of the reproduction setup It has been shown that the necessary number of
with the test subject located in the center. A descrip- loudspeaker channels mainly depends on the content
tion of the audio rendering engine can be found in [13]. itself. Among all tested factors, the different motion
The results of these experiments reveal a number of paths through the scene that were presented in the
interesting points: assessment (and therefore the directions of incidence
of the sound source) had the greatest impact on the
5.1 Multimodality vs. Unimodality perceived subjective quality of the loudspeaker setup
The first two assessments focused on a possible reduc- used. As a rule of thumb and generalized result, the
tion of algorithmic complexity for the bimodal (audio- well-known five-channel setup defined in ITU-R BS.775
visual) case compared to the unimodal (audio only) [15] is suitable for such systems.
case. The simplifications assessed were directly related The second experiment [16] evaluated the number
to the computational load that the real-time rendering of internal workchannels for the MPEG-4 Audio Per-
of audio imposes on the processor. ceptual Approach reverberation algorithm [17]. The
The first experiment [14] evaluated the number of MPEG-4 Audio Perceptual Approach is also known
loudspeakers necessary in interactive audiovisual appli- as the SPAT reverberation algorithm for Cycling’74’s
cation systems of moderate complexity using a Vector Max/MSP. As opposed to the image source method, it
Base Amplitude Panning (VBAP) approach to posi- generates reverberation generically, i.e. without a phys-
tion the sound sources. The room acoustic simulation ical/geometrical reference of the room to be simulated.
method applied was an image source model simulat- The amount of so-called internal workchannels involved
ing the early reflections, combined with uncorrelated determines the complexity/density of the reverberation
diffuse reverberation created separately for each loud- pattern. Increasing the number of workchannels there-
speaker channel. For these situations, the complexity fore generates a higher density reverberation pattern.
of the algorithm (and thus the computational load) is The density of the reverberation pattern is usually at-
significantly growing with the number of loudspeakers tributed to be the foremost quality criteria of artificial
involved. Thus, it is desirable to keep the number of reverberation.

-4-

104
Reiter - Toward a Salience Model for Interactive Audiovisual Applications of Moderate Complexity

The assessment compared three versions of the Perceptual Approach algorithm to each other. The test results make clear that - for the audiovisual case - subjects were not able to identify the three versions of the algorithm under assessment. Increasing the density of the diffuse reverberation part remains without perceivable improvement of the quality in bimodal (audiovisual) perceptual situations. Therefore the Perceptual Approach algorithm as specified in MPEG-4 Scene Description can be simplified to use only four internal workchannels without degrading the overall perceived quality in the audiovisual context.

5.2 User Interaction

The next three assessments [18, 21, 22] focused on the effect that user interaction with the audiovisual application might have on the perceived overall quality. Here the general assumption was that by offering an attractive interactive content or by assigning the user a challenging task, the user would become more involved and thus experience a subjectively higher overall quality.

The first experiment in this series compared the perceived overall quality of audiovisual scenes under different degrees of interaction [18]. The actual amount of interaction was determined by three different tasks that the test subjects had to fulfill during the assessment. These were:

1. Listen and watch task: Test subjects were presented with an automated movement through the virtual scene. No activity on their side was required. The automated movement lasted around 30s, selected from two different predefined motion paths.

2. Listen and press a button task: Again, test subjects were asked to experience an automated movement through the virtual scene. This time, an object automatically appeared within the field of view. It was subsequently approached and (again automatically) collected. Then, a new object would appear, and so on. Test subjects were asked to immediately press a button whenever the object appeared.

3. Listen and collect an object task: Test subjects were using the computer mouse to navigate freely inside the virtual scene. Their task was to collect the object that was positioned somewhere on the floor. When they had approached the object closely enough, it was collected and re-appeared in another location. The new location was either within the field of view, or the subjects had to turn around to see it again. They were asked to collect as many objects as possible within a given time limit of 30s.

Interestingly, and contrasting with results obtained by others (e.g. Zielinski, Kassier, Rumsey, Bech et al. in [19] and [20], both based on smaller sample sizes), the different tasks that test subjects had to perform did not have an effect on the quality evaluation (Friedman: χ2 = 3.3, df = 2, p > 0.05, p = 0.190, ns). Two possible explanations exist:

• The subjects’ task of navigating through the scene was not demanding enough, making the differences in quality too obvious.

• Whereas user interaction was related to visual and haptic modalities, the quality rating was based on audiovisual percepts. The distraction generated by interaction was not high enough to be significant across modalities.

These possible explanations were examined in the next two experiments [21, 22]. On the one hand, user interaction, rating process and tasks aimed at sharing the same modality. On the other hand, test subjects were confronted with a mentally complex, yet easily scalable task: the n-back working memory paradigm. In this, the subject typically is required to monitor a series of stimuli and to indicate whether or not the stimulus currently presented is the same as the one presented n steps before.
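The n-back matching rule just described is easy to state precisely. The following is a minimal illustrative sketch of that rule and of how responses could be scored; it is not the implementation used in [21, 22], and the function names, the example stimuli and the use of Python are assumptions made here purely for illustration.

```python
# Minimal sketch of the n-back matching rule described above.
# Stimuli, n and the scoring scheme are illustrative assumptions,
# not the actual test procedure of [21, 22].

def nback_targets(stimuli, n):
    """Return True at every position whose stimulus equals the one n steps back."""
    return [i >= n and stimuli[i] == stimuli[i - n] for i in range(len(stimuli))]

def nback_accuracy(stimuli, responses, n):
    """Fraction of positions where the subject's yes/no response matches the target."""
    targets = nback_targets(stimuli, n)
    hits = sum(1 for t, r in zip(targets, responses) if t == r)
    return hits / len(stimuli)

if __name__ == "__main__":
    spoken_numbers = [3, 7, 3, 3, 5, 3, 5, 5]                      # presented sequence
    subject_answers = [False, False, True, False, False, False, False, True]
    print(nback_targets(spoken_numbers, 2))                         # expected 2-back matches
    print(nback_accuracy(spoken_numbers, subject_answers, 2))
```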

Here, a sequence of spoken numbers was presented and subjects had to compare the numbers. At the same time, the reverberation time was varied and subjects were subsequently asked to correctly rate the length of reverberation in comparison to a reference reverberation time¹. Unlike in previously published experiments, both the attribute to be rated and the distracting task were located in the same modality. An analysis of the collected data indicates that the precision with which auditory parameters can be rated / discriminated by humans is dependent on the degree of distraction present in the same modality. A highly significant difference in rating accuracy was shown for the “navigation only” condition vs. the “navigation with 2-back task” condition using Wilcoxon’s T test (matched pairs signed ranks, T = 20, p ≤ 0.01).

¹ In [18], an additional semi-structured interview had revealed that reverberation time was regarded as one of the most important attributes for the given type of interactive audiovisual content by all test subjects.

This result further confirms and specifies the findings of [18, 19, 20]: Whereas cross-modal division of attention only renders a small significant effect and - apart from being listener-specific - depends on the experimental conditions, with inner-modal distraction test subjects would predictably commit errors in their ratings. Apparently, inner-modal influence is significantly greater than cross-modal influence. This is also supported by some of the theories of capacity limits in human attention [23].

5.3 Cross-Modal Interaction

Finally, the last assessment [24] in this series investigated the possibility of cross-modal influence of interaction upon perceived quality. Whereas in the previous two assessments the influence of interaction within the same modality was investigated, here the influence of a visual (-motion) task upon the perceived audio quality was evaluated. This experiment is borrowing from what Zielinski et al. [19] and Kassier et al. [20] have described, but the test panel was significantly larger (31 test subjects opposed to 6 and 7, respectively), thus allowing a profound statistical analysis.

For this experiment, a computer game was designed to assess the effect of divided attention in the evaluation of audio quality during involvement in a visual task. Subjects had to collect selected flying objects (donuts) by running into them and avoid the collision with other objects (snowballs). For the navigation, test subjects used the left and right arrow keys of a computer keyboard. Movement was only possible to the sides, at a fixed distance from the source of the flying objects.

A game score was recorded for each subject to verify subjects’ involvement in the game and to prod the subjects to actively play the game. By collecting the right object (donut) the score was increased by one point, whereas a collision with a snowball decreased the score by one point.

For the experiment, each subject carried out a passive and an active session. The active session consisted in playing the computer game and evaluating the audio quality. This session was designed to cause a division of attention between the process of rating the audio quality and the involvement in the computer game. In the passive session, subjects were asked to evaluate the audio quality while only watching a game demo. The audio quality degradations were realized by modifying the tonal quality. The original music signal (16kHz) was low-pass filtered using three different cut-off frequencies fc = 11kHz, 12kHz and 13kHz. Additionally, an anchor with a low-pass filtering at the cut-off frequency fc = 4kHz was created.

The Wilcoxon T test showed that the quality ratings of the active session varied significantly from the ratings of the passive session for cut-off frequencies up to 12kHz. A significant decrease in rating correctness was shown for the Game condition in comparison to the No Game condition for the anchor item (T = 37, p ≤ 0.01), the cut-off frequency fc = 11kHz (T = 452.50, p ≤ 0.01), and the cut-off frequency fc = 12kHz (T = 812, p ≤ 0.01). The low-pass filtering in the active session (Game condition) was rated as being generally less perceptible.

This assessment showed that a cross-modal influence of interaction is possible when stimuli and interaction are carefully balanced. Interaction performed in one modality (e.g. visual-haptic) can dominate the perception of stimuli in another modality (here: auditive). Yet, at this time it is not possible to determine or quantify that balance a priori.

5.4 Conclusions

The experiments have clearly identified a number of factors that influence the perceived quality of audiovisual content. These are of technical nature, i.e. depending on the reproduction setup and simulation algorithm used, but also of contextual and subjective nature, i.e. depending on user task, on degree and modality of interaction offered, and on individual attention capacity limits.

6 Summary and Outlook

The model introduced here identifies and classifies the most important influence factors that determine the saliency of objects in a multimodal perceptual situation. It has been specifically developed to describe the perception of audiovisual content in interactive application systems of moderate complexity, yet it can be extended to include true multimodality. It is based on the experimental evaluation of perceived overall quality (Quality of Experience) tested in a coordinated series of subjective assessments.

The model needs further refinement to be put to use in real-world applications. One of the tasks that remain is the context-dependent quantification of the influence factors: in its current state of development, the model is a purely qualitative one that does not yet allow a priori statements (quantified estimations) on the weight of individual factors.

References

[1] Reiter, Ulrich. On the Need for a Salience Model for Bimodal Perception in Interactive Applications. IEEE/ISCE’03, International Symposium on Consumer Electronics. Sydney, Australia, December 3-5, 2003.

[2] Landragin, Frederic; Bellalem, Nadia; Romary, Laurent. Visual Salience and Perceptual Grouping in Multimodal Interactivity. Proc. International Workshop on Information Presentation and Natural Multimodal Dialogue IPNMD. Verona, Italy, December 14-15, 2001.


[3] Farris, J. Shawn. The Human Interaction Cycle: A Proposed and Tested Framework of Perception, Cognition, and Action on the Web. PhD Thesis. Kansas State University, USA, 2003.

[4] Wertheimer, Max. Untersuchungen zur Lehre von der Gestalt II. Psychologische Forschung. 4, 1923, pp 301-350.

[5] Zwicker, Eberhard; Fastl, Hugo. Psychoacoustics - Facts and Models. 2nd updt. ed., Springer Verlag. Berlin, 1999, ISBN 3-540-65063-6.

[6] Lee, Kwan Min; Jin, S. A.; Park, N.; Kang, S. Effects of narrative on feelings of presence in computer/video games, Annual Conference of the Internat. Communication Association (ICA), New York, NY, USA, May 2005.

[7] Lee, Kwan Min; Jeong, Eui Jun; Park, Namkee; Ryu, Seoungho. Effects of Networked Interactivity in Educational Games: Mediating Effects of Social Presence, PRESENCE2007, 10th Annual International Workshop on Presence, Barcelona, Spain, Oct. 25-27, 2007, pp 179-186.

[8] Steuer, Jonathan. Defining Virtual Reality: Dimensions Determining Telepresence. Journal of Communication. 42/4, 1992, pp 73-93.

[9] Larsson, Pontus; Västfjäll, Daniel; Kleiner, Mendel. On the Quality of Experience: A Multi-Modal Approach to Perceptual Ego-Motion and Sensed Presence in Virtual Environments. Proceedings First ISCA ITRW on Auditory Quality of Systems AQS-2003. Akademie Mont-Cenis, Germany, April 23-25, 2003, pp 97-100.

[10] Lombard, Matthew; Ditton, Theresa. At the Heart of it All: The Concept of Presence. Journal of Computer-Mediated Communication, 3, 1997.

[11] Sheridan, Thomas B. Further Musings on the Psychophysics of Presence. Presence, 5/1994, pp 241-246.

[12] Woszczyk, Wieslaw; Bech, Soren; Hansen, Villy. Interactions Between Audio-Visual Factors in a Home Theater System: Definition of Subjective Attributes. AES 99th Convention, New York, USA, 1995, Preprint 4133.

[13] Reiter, Ulrich. TANGA - an Interactive Object-Based Real Time Audio Engine. Audio Mostly 2007, 2nd Conference on Interaction with Sound, Ilmenau, Germany, September 27-28, 2007.

[14] Reiter, Ulrich. Subjective Assessment of the Optimum Number of Loudspeaker Channels in Audio-Visual Applications Using Large Screens. Proc. AES 28th Internat. Conf., Pitea, Sweden, June 30 - July 2, 2006, pp 102-109.

[15] Recommendation ITU-R BS.775-1. Multichannel stereophonic sound system with and without accompanying picture. International Telecommunication Union, Geneva, Switzerland, 1994.

[16] Reiter, Ulrich; Partzsch, Andreas; Weitzel, Mandy. Modifications of the MPEG-4 AABIFS Perceptual Approach: Assessed for the Use with Interactive Audio-Visual Application Systems. Proc. AES 28th Internat. Conf., Pitea, Sweden, June 30 - July 2, 2006, pp 110-117.

[17] Int. Std. (IS) ISO/IEC 14496-11:2004. Information technology - Coding of audio-visual objects - Part 11: Scene description and Application engine. Geneva, Switzerland, 2004.

[18] Reiter, Ulrich; Jumisko-Pyykkö, Satu. Watch, Press and Catch - Impact of Divided Attention on Requirements of Audiovisual Quality. 12th Internat. Conf. on Human-Computer Interaction, HCI2007, Beijing, PR China, July 22-27, 2007.

[19] Zielinski, Slawomir; Rumsey, Francis; Bech, Soren; de Bruyn, Bart; Kassier, Rafael. Computer Games and Multichannel Audio Quality - the Effect of Division of Attention Between Auditory and Visual Modalities. Proc. AES 24th International Conference on Multichannel Audio, Banff, Alberta, Canada, June 2003.

[20] Kassier, Rafael; Zielinski, Slawomir; Rumsey, Francis. Computer Games and Multichannel Audio Quality Part 2 - Evaluation of Time-Variant Audio Degradation under Divided and Undivided Attention. AES 115th Convention, New York, USA, October 2003, Preprint 5856.

[21] Reiter, Ulrich; Weitzel, Mandy; Cao, Shi. Influence of Interaction on Perceived Quality in Audio Visual Applications: Subjective Assessment with n-Back Working Memory Task, Proc. AES 30th International Conference, Saariselkä, Finland, March 15-17, 2007.

[22] Reiter, Ulrich; Weitzel, Mandy. Influence of Interaction on Perceived Quality in Audio Visual Applications: Subjective Assessment with n-Back Working Memory Task, II. AES 122nd Convention, Vienna, Austria, May 5-8, 2007.

[23] Pashler, Harold. The Psychology of Attention. 1st paperback edition, The MIT Press, Cambridge, MA, USA, 1999, ISBN 0-262-66156-X.

[24] Reiter, Ulrich; Weitzel, Mandy. Influence of Interaction on Perceived Quality in Audiovisual Applications: Evaluation of Cross-Modal Influence. Proc. 13th International Conference on Auditory Displays (ICAD), Montreal, Canada, June 26-29, 2007.

An Embedded Audio-Based Vehicle Classification
Based on Two-level F-ratio

Jiqing Han, Tao Jiang


School of Computer Science and Technology, Harbin Institute of Technology, Harbin, P.R.China, 150001
jqhan@hit.edu.cn

Abstract. The human auditory system can be regarded as a complex signal and information processing machine. With the rapid progress of information technology, it is reasonable to use computers to simulate the abilities of the human auditory system. One branch of this research is audio sense technology, which uses computers to analyze environmental sounds and to identify the sound source from the audio signals, and which can thus extend the human sense of hearing. One application of audio sense technology is vehicle classification, in which the audio features of different vehicles are used to design a classifier that identifies the vehicle type. For embedded applications, a linear weighted classifier is generally used for vehicle classification, and the F-ratio, which expresses the contribution of each audio feature dimension, is adopted as the weight. In this paper we find that using the F-ratio alone as the weight leaves confusion errors between certain classes; an approach using a two-level F-ratio as the weights is therefore proposed to overcome this problem. We first use the F-ratio of all patterns as the weight to select a candidate set of confusable patterns, and then, within the candidate set, we use the F-ratio of the confusable patterns as the weight to obtain the final classification result. A real-time accuracy of 82.1% is achieved on an embedded platform based on an MSP430F149 microcontroller.

1. Introduction

The human auditory system can be regarded as a complex signal and information processing machine. With the rapid progress of information technology, it is reasonable to use computers to simulate the abilities of the human auditory system. One branch of this research is audio sense technology, a main branch of the human-computer interface that is important for the automation and intelligence of computers. This technology uses computers to analyze environmental sounds automatically and to identify the sound source from the audio signals, giving the computer a form of auditory intelligence. Audio sense technology will be widely used, because it can extend the human sense of hearing and can become a primary assistant to human beings in the information acquisition domain.

The motivation of this paper is to discuss approaches that use audio sense technology to identify different kinds of vehicles. This is an important signal processing task with widespread applications such as intelligent transportation systems and sensor networks [e.g. 1-3]. Much research has been carried out on this problem. In [1], the classification of moving vehicles in a distributed, wireless sensor network was investigated, where a local pattern classifier at each sensor node first made a local decision on the vehicle type based on its own feature vector. The probability of correct classification could also be estimated. The local decision, together with the estimated probability of it being correct, could then be encoded and transmitted efficiently via the wireless channel to a local fusion center for decision fusion. It was found that data fusion and decision fusion enhanced the vehicle classification performance. In [2], an adaptive threshold algorithm was proposed for real-time vehicle detection applications. It is a time-domain energy-distribution-based algorithm, which first computes the time-domain energy distribution curve and then slices the energy distribution curve using a threshold that is updated adaptively according to a set of decision states. Finally, the decision results from threshold slicing are passed to a finite state machine, which makes the vehicle detection decision. In [3], a traffic monitoring detector for vehicle classification was developed to aid traffic management systems; a Time Delay Neural Network (TDNN) was chosen to classify individual traveling vehicles based on their speed-independent acoustic signature, and Linear Predictive Coding (LPC) preprocessing and feature extraction techniques were applied. In [4], a fusion framework using both video and acoustic sensors for vehicle detection and tracking was proposed. In the detection phase, a rough estimate of the target direction-of-arrival was first obtained from acoustic data through beam-forming techniques. This initial estimate designates the approximate target location in the video. Given the initial target position, the estimate was refined by moving-target detection using the video data, and Markov Chain Monte Carlo techniques were then used for joint audio-visual tracking.

Considering that many embedded applications are based on processors with limited computational ability, approaches with simple computation and small storage requirements are needed. Generally, simple feature extraction and a linear classification algorithm are adopted to reduce the computational complexity. In order to represent the contributions of the different audio feature dimensions, a feature-weighted Euclidean distance is used for pattern matching in the linear classifier, and the F-ratios of the different audio feature dimensions are selected as the weights. In this paper, we find that when the F-ratios of all patterns are used as the weights, classification errors between certain confusable patterns remain. We analyze these confusion patterns and find that their features with larger F-ratio values are nearly identical, while the features with smaller F-ratio values are very different. Consequently, in order to improve the performance, a two-level F-ratio weighted algorithm is proposed, in which we first use the F-ratio of all patterns as the weight to select the candidate set of confusion patterns, and then, within the candidate set, we use the F-ratio of the confusion patterns as the weight to obtain the final classification result. Audio signal acquisition, audio feature extraction and the embedded system implementation are also analyzed in the paper. The schematic diagram of our vehicle classification system is shown in Figure 1. Furthermore, an embedded real-time audio sense processing system for identifying different kinds of vehicles based on a Texas Instruments MSP430F149 microcontroller is implemented.

* This work is supported by the National Natural Science Foundation of China under Grant No. 60672163.

[Figure 1. Schematic diagram for acoustic signature processing for vehicle classification: sound of vehicle → preprocessing → feature extraction → two-level F-ratio method → vehicle classification]

The paper is organized as follows. In Section 2, the pre-processing and feature extraction of the vehicle audio signal are first introduced, including preemphasis, windowing and feature computation; four traditional features and two new features are obtained, and the proposed two-level F-ratio based method is then discussed. In Section 3, the experiments and discussion are given, including the corpus of our experiments, the audio signal preprocessing and feature analysis, the experimental results of vehicle classification based on the audio signal and some discussion. Finally, Section 4 gives the summary of our work and the conclusions.

2. Two-Level F-ratio Method

2.1. Pre-processing

Pre-processing is a most important part of acoustic signal processing; it converts the sound waveform to some type of parametric representation (generally at a considerably lower information rate) for further analysis and processing. It includes the following steps [5]:

(1) Preemphasis, in which the digitized acoustic signal is put through a low-order digital system to spectrally flatten the signal and to make it less susceptible to finite precision effects later in the signal processing. The digital system used in the preemphasizer is either fixed or slowly adaptive. The most widely used preemphasis network is the fixed first-order system

H(z) = 1 - \tilde{a} z^{-1},   0.9 ≤ \tilde{a} ≤ 1.0    (1)
(2) Frame Blocking. The acoustic signal is a slowly time-varying signal in the sense that, when examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary, so it is reasonable to cut the continuous acoustic signal into short-time parts, each of which is called a frame. In this step the preemphasized acoustic signal is blocked into frames of N samples, with adjacent frames being separated by R samples.

(3) Windowing. The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N-1, then the result of windowing is the signal

\tilde{y}_l(n) = y_l(n) w(n),   0 ≤ n ≤ N-1    (2)

where y_l(n) and \tilde{y}_l(n) are the l-th frames of the signal before and after windowing, respectively. A typical window is the Hamming window, which has the form

w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right),   0 ≤ n ≤ N-1    (3)
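As an illustration of how these three steps fit together, the following sketch implements them with NumPy. It is a minimal sketch under stated assumptions rather than the code of the embedded system described here: the function names are ours, and the example parameters simply reuse the 25 ms frame length, 12.5 ms frame shift and preemphasis factor 0.97 quoted later in Section 3 at an assumed sampling rate.

```python
import numpy as np

def preprocess(signal, frame_len, frame_shift, a=0.97):
    """Preemphasis (Eq. 1), frame blocking and Hamming windowing (Eqs. 2-3).

    Illustrative sketch of Section 2.1; names and structure are assumptions,
    not the original MSP430F149 implementation.
    """
    signal = np.asarray(signal, dtype=float)

    # (1) Preemphasis: filtering with H(z) = 1 - a * z^-1
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # (2) Frame blocking: frames of N samples, adjacent frames separated by R samples
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])

    # (3) Windowing with the Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window

# Usage example (sampling rate of 8 kHz is an assumption; the paper does not state it):
# fs = 8000
# windowed = preprocess(audio, frame_len=int(0.025 * fs), frame_shift=int(0.0125 * fs))
```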

2.2. Feature Extraction

Generally, the feature analysis of an acoustic signal is based in three domains: the time, frequency and time-frequency domains. Acoustic signal processing in the time domain is a natural approach, but not an optimal one due to the complexity of the environment, i.e., the time-domain signatures of acoustic signals can be hampered by noise from other moving vehicles. Features of the frequency and time-frequency domains are useful for obtaining a good classification performance, but they are not suitable for real-time vehicle classification since they tend to require intensive computation and samples from a long period of time. Considering the requirement of real-time processing and the limited computation ability of the MSP430F149, time-domain features are adopted in this paper. Six features in the time domain are used, which include four general ones - the short-time energy (STE), the short-time zero-cross-rate (ZCR), the short-time band-pass zero-cross-rate (BZCR) and the number of short-time peaks (PN) - and two new time features, i.e., the average value of the short-time peaks (AP) and the variance of the short-time peaks (VP). These features are defined as follows.

Let \tilde{y}_l(n) (n = 0, 1, ..., N-1) denote the acoustic signal samples in a frame. The short-time energy is defined as

STE(l) = \sum_{n=0}^{N-1} \tilde{y}_l(n)    (4)

ZCR is another basic acoustic feature that can be computed easily. It is equal to the number of zero-crossings of the waveform within a given frame,

ZCR(l) = \frac{1}{2} \sum_{n=0}^{N-1} \left| sgn[\tilde{y}_l(n)] - sgn[\tilde{y}_l(n-1)] \right|    (5)

where sgn(x) is the sign function,

sgn(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}    (6)

ZCR has the following characteristics: (1) In general, the ZCR of both unvoiced sounds and environment noise is larger than that of voiced sounds. (2) It is hard to distinguish unvoiced sounds from environment noise by using ZCR alone, since they have similar ZCR values. (3) ZCR is often used in conjunction with the volume for end-point detection; in particular, ZCR is used for detecting the start and end positions of unvoiced sounds. However, the ZCR parameter is easily affected by lower-frequency noise, thus the short-time band-pass ZCR is used, as follows,

BZCR(l) = \frac{1}{2} \sum_{n=0}^{N-1} \left\{ \left| sgn[\tilde{y}_l(n) - T] - sgn[\tilde{y}_l(n-1) - T] \right| + \left| sgn[\tilde{y}_l(n) + T] - sgn[\tilde{y}_l(n-1) + T] \right| \right\}    (7)

where T is a threshold.

Let p_l(n) represent the peaks in the frame, defined as

p_l(n) = \begin{cases} 1, & \tilde{y}_l(n) > \tilde{y}_l(n-1) \text{ and } \tilde{y}_l(n) > \tilde{y}_l(n+1) \\ 0, & \text{otherwise} \end{cases}    (8)

Let N_l denote the number of peaks (PN) in the frame,

N_l = \sum_{n=-\infty}^{\infty} p_l(n)    (9)

The average value of the short-time peaks (AP) is defined as

P_l = \frac{\sum_{n=-\infty}^{\infty} \left| \tilde{y}_l(n) p_l(n) \right|}{N_l}    (10)

The variance of the short-time peaks (VP) is defined as

v_l = \frac{\sum_{n=-\infty}^{\infty} \left[ \tilde{y}_l(n) p_l(n) - M_l \right]^2}{N_l}    (11)

where M_l is the average value of the amplitudes of the short-time signal in a frame.

The above six parameters are assembled into a feature vector for the frame, and a total of forty frames are used as an analysis window; the features of a window are then used to represent the information of one kind of vehicle.
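To make the feature definitions concrete, the sketch below computes the six frame-level features following Eqs. (4)-(11) as printed above. It is an illustrative sketch only: the function names are ours, the BZCR threshold value T is an assumption (the paper does not state it), and the formulas are evaluated literally as given.

```python
import numpy as np

def sgn(x):
    """Sign function of Eq. (6)."""
    return np.sign(x)

def frame_features(y, y_prev_last=0.0, T=0.01):
    """Six time-domain features of one windowed frame y (Eqs. 4-11).

    Illustrative sketch; T and y_prev_last (the sample preceding the frame,
    needed for n = 0) are assumptions, not values given in the paper.
    """
    y = np.asarray(y, dtype=float)
    y1 = np.append(y_prev_last, y[:-1])                       # y(n-1)

    ste = np.sum(y)                                           # Eq. (4) as printed
    zcr = 0.5 * np.sum(np.abs(sgn(y) - sgn(y1)))              # Eq. (5)
    bzcr = 0.5 * np.sum(np.abs(sgn(y - T) - sgn(y1 - T))
                        + np.abs(sgn(y + T) - sgn(y1 + T)))   # Eq. (7)

    # Eq. (8): p(n) = 1 where y(n) exceeds both neighbours (local peaks)
    p = np.zeros_like(y)
    p[1:-1] = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])

    pn = np.sum(p)                                            # Eq. (9), number of peaks
    ap = np.sum(np.abs(y * p)) / pn if pn > 0 else 0.0        # Eq. (10), average peak value
    m = np.mean(np.abs(y))                                    # M_l, mean amplitude of the frame
    vp = np.sum((y * p - m) ** 2) / pn if pn > 0 else 0.0     # Eq. (11), summed over all n as printed

    return np.array([ste, zcr, bzcr, vp, ap, pn])
```

A forty-frame analysis window would then simply stack forty such vectors (or average them) before classification, as described above.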
2.3. Two-Level F-ratio Based Method

Vehicle classification is a pattern recognition problem and has four steps [5]:

(1) Feature measurement, in which a sequence of measurements is made on the input signal to define the "test pattern".

(2) Pattern training, in which one or more test patterns corresponding to sounds of the same class are used to create a pattern representation of the features of that class. The resulting pattern, generally called a reference pattern, can be a template, derived from some type of averaging technique, or it can be a model that characterizes the statistics of the features of the reference pattern. Because of the non-stationary nature of the acoustic signal, the short-time feature extraction of Section 2.2 is performed sequentially over time, producing a sequence of feature vectors as the pattern.

(3) Pattern classification, in which the unknown test pattern is compared with each class reference pattern and a measure of similarity (distance) between the test pattern and each reference pattern is computed.

(4) Decision logic, in which the reference pattern similarity scores are used to decide which reference pattern (or possibly which sequence of reference patterns) best matches the unknown test pattern.

The factors that distinguish different pattern recognition approaches are the types of feature measurement, the choice of templates or models for the reference patterns, and the method used to create reference patterns and classify unknown test patterns. In this paper, a linear classification algorithm is designed and a feature-weighted Euclidean distance metric is adopted, in which the relevant weight is the F-ratio of the different feature dimensions.

The F-ratio is a statistic which is often used in speaker recognition [6]. It is proportional to the ratio of the variance of the means of each speaker's feature distribution to the average value of the variance of each distribution. The farther apart the individual distributions are with respect to their average spread, the higher the F-ratio. Thus the F-ratio is an indication of a feature's effectiveness. In vehicle classification, the feature parameters are evaluated in terms of their ability to discriminate vehicles and their dependence on other parameters. For the former purpose the F-ratio of the analysis of variance is used. For a given parameter, the values obtained from the repetitions by each vehicle may be regarded as samples from a probability distribution associated with that vehicle. For vehicle classification, a good parameter is one for which these individual vehicle distributions are as narrow and as widely separated as possible.

The F-ratio is given by [6],

F = \frac{1}{m-1} \sum_{j=1}^{m} (\mu_j - \mu)^2 \Bigg/ \frac{1}{m(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{m} (x_{ij} - \mu_j)^2    (12)

where x_{ij} is the parameter value on the i-th repetition by the j-th vehicle, i = 1, ..., n, j = 1, ..., m; \mu_j is the estimated mean for the j-th vehicle,

\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}    (13)

and \mu is the estimated over-all mean,

\mu = \frac{1}{m} \sum_{j=1}^{m} \mu_j    (14)
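Read in the usual way - n repetitions per vehicle class, m classes, evaluated independently for each feature dimension - Eqs. (12)-(14) can be computed as in the following illustrative sketch. The array layout and the function names are our own assumptions, not the original embedded code.

```python
import numpy as np

def f_ratio(x):
    """F-ratio of Eq. (12) for one feature dimension.

    x has shape (n, m): n repetitions (rows) of m vehicle classes (columns).
    Illustrative sketch only.
    """
    n, m = x.shape
    mu_j = x.mean(axis=0)                                # Eq. (13), per-class means
    mu = mu_j.mean()                                     # Eq. (14), over-all mean
    between = np.sum((mu_j - mu) ** 2) / (m - 1)         # variance of the class means
    within = np.sum((x - mu_j) ** 2) / (m * (n - 1))     # average within-class variance
    return between / within

def f_ratio_weights(features):
    """F-ratio of every feature dimension.

    features has shape (n, m, d): n repetitions, m classes, d feature dimensions.
    The returned vector can be used directly as the classifier weights.
    """
    return np.array([f_ratio(features[:, :, k]) for k in range(features.shape[2])])
```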

Using the F-ratio as the weight, we obtained a basic performance for the classification of five kinds of vehicles. However, there are still classification errors. We analyzed the confusion patterns behind these errors and found that their features with larger F-ratio values are nearly identical, while the features with smaller F-ratio values are very different. Consequently, in order to improve the performance, a two-level F-ratio weighted algorithm is proposed. In the algorithm, we first use the F-ratio of all patterns as the weight to select the candidate set of confusion patterns, and then, within the candidate set, we use the F-ratio of the confusion patterns as the weight to give the final classification result. The algorithm is as follows:

Step 1. Calculate the F-ratio of the over-all patterns (Fall) and the F-ratio of the confusion patterns (Fconfu).
Step 2. For the input acoustic signal of the vehicle, get the feature vectors of the analysis window.
Step 3. Using Fall as the weight, calculate the weighted Euclidean distances between the input feature vectors and the different patterns.
Step 4. Select the two best patterns as the candidates.
Step 5. Using the Fconfu of the candidate patterns as the new weight, recalculate the weighted Euclidean distances between the input feature vectors and the candidate patterns.
Step 6. Take the best pattern as the final result.
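A compact sketch of Steps 1-6 is given below. It assumes that each reference pattern is a single template feature vector and that the weighted Euclidean distance applies the F-ratio weights to the squared feature differences; both the naming and that exact weighting form are assumptions made here for illustration, not a transcription of the MSP430 implementation.

```python
import numpy as np

def weighted_distance(x, ref, w):
    """F-ratio-weighted Euclidean distance between feature vectors x and ref.

    The weights are assumed to act on the squared differences.
    """
    return np.sqrt(np.sum(w * (x - ref) ** 2))

def two_level_classify(x, patterns, f_all, f_confu):
    """Two-level F-ratio weighted classification (Steps 1-6); illustrative sketch.

    patterns : dict mapping vehicle class -> reference feature vector
    f_all    : F-ratio weights computed over all patterns (Fall)
    f_confu  : F-ratio weights computed over the confusion patterns (Fconfu)
    """
    # Steps 2-3: weighted distances to every reference pattern using Fall
    d_all = {c: weighted_distance(x, ref, f_all) for c, ref in patterns.items()}

    # Step 4: keep the two closest patterns as the candidate (confusion) set
    candidates = sorted(d_all, key=d_all.get)[:2]

    # Steps 5-6: re-rank the candidates with the confusion-pattern weights Fconfu
    d_confu = {c: weighted_distance(x, patterns[c], f_confu) for c in candidates}
    return min(d_confu, key=d_confu.get)
```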
3. Experiments

The audio data of the experimental corpus were selected from the Internet [7]. The corpus includes the acoustic signals of five kinds of vehicles, i.e., airplane, helicopter, truck, tank and jeep. The MSP430F149 is used to construct the embedded real-time processing platform. It is a Texas Instruments 16-bit ultra-low-power microcontroller with a powerful 16-bit RISC CPU, a fast 12-bit ADC and a hardware multiplier, and its typical applications include sensor systems that capture analog signals, convert them to digital values, and process and transmit the data to a host system.

In the experiments, the audio signal is first parameterized using a feature analysis. Each analysis frame is processed with a Hamming window; the frame length is 25 milliseconds, the frame overlap is 12.5 milliseconds, and the preemphasis factor \tilde{a} is set to 0.97. Each testing file is cut into pieces of about 2 seconds. There is only one kind of vehicle in a piece, and the system gives one result per piece. The six features are calculated for each frame: STE, ZCR, BZCR, VP, AP and PN. Typical F-ratios of the six features for a vehicle are shown in Figure 2. It is noted that the F-ratios of the two new features, VP and AP, have larger values; thus it is reasonable to think that they contribute strongly to the classification results.

Using the F-ratio as the weight of the Euclidean distance between the test vehicle and the different patterns, a classification accuracy of 77.3% is obtained. This result is not very good; there are confusion classification errors.

[Figure 2. Typical F-ratios of the six time-domain features (BZCR, VP, AP, STE, ZCR, PN) for a vehicle]

We analyzed the confusion patterns, and Figure 3 shows the normalized feature parameters of two confusion patterns. It is noted that where the F-ratios of the two patterns are larger, there are only small differences between the two patterns (for BZCR, AP and VP), and where the F-ratios of the two patterns are small, there are larger differences between the two patterns (for STE, ZCR and PN). Using the two-level F-ratio can overcome this problem, since the first-level F-ratio represents the over-all character of the patterns, while the second-level F-ratio represents the individual character between confusion patterns. We then carried out the experiments with the two-level F-ratio weighted method.

[Figure 3. Normalized features (BZCR, VP, AP, STE, ZCR, PN) of two confusion patterns]

The average performances of the F-ratio based method and the two-level F-ratio based method are given in Table 1.

Table 1. The performances of the F-ratio and two-level F-ratio based methods

              F-ratio based    Two-level F-ratio based
  accuracy    77.3%            82.1%

It is noted that the two-level F-ratio based method improves the accuracy by about 5% compared with the F-ratio based method, and a precision rate of 82.1% is obtained. This shows that the two-level F-ratio based method is useful for improving the vehicle classification performance.

In the proposed method there is a changeable factor, the length of the analysis window. In order to evaluate whether it affects the classification performance, we carried out experiments comparing the performance for different lengths; the results obtained with different lengths of the analysis window are shown in Figure 4.

[Figure 4. Classification performance (accuracy in %) for analysis window lengths of 20, 40, 80 and 120 frames]

From Figure 4 it is noted that when 20 frames are used as the length of the analysis window, the accuracy is lower, just above 65%. When the analysis window is increased to 40 frames, the accuracy increases to 82.1%. After that the accuracy increases only slowly, while the processing time increases, which is not suitable for real-time processing. Considering both the performance and the processing time, 40 frames is therefore selected as the length of the analysis window in our system.

4. Conclusion

It is very useful to apply audio sense technology to classify different kinds of vehicles. In this paper, we have presented a two-level F-ratio based method for vehicle classification, which first uses the F-ratio of all patterns as the weight to select the candidate set of confusion patterns, and then, within the candidate set, uses the F-ratio of the confusion patterns as the weight to give the final classification result. The performance is evaluated on an embedded platform based on the MSP430F149. It can be seen that the two-level F-ratio based method gives a better result than the F-ratio based method, with nearly a 5% performance improvement.

Although the proposed method performs better, there is much work that should be researched in more depth. We should enlarge the corpus to evaluate the performance of the proposed method on a larger test set, and also explore approaches to classification for continuous acoustic signals of different kinds of vehicles. All of the above are our future work.

References
[1] Marco F. Duarte and Yu Hen Hu, Vehicle Classification in Distributed Sensor Networks, Journal of Parallel and Distributed Computing, 64: 826-838, (2004)
[2] Jiagen (Jason) Ding, Sing-Yiu Cheung, Chin-Woo Tan and Pravin Varaiya, Signal Processing of Sensor Node Data for Vehicle Detection, ITSC2004, (2004)
[3] A. Y. Nooralahiyan, H. R. Kirby, Vehicle Classification by Acoustic Signature, Mathl. Comput. Modelling, Vol. 27, 205-214, (1998)
[4] Rama Chellappa, Gang Qian and Qinfen Zheng, Vehicle Detection and Tracking Using Acoustic and Video Sensors, ICASSP2004, III-793-796, (2004)
[5] Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, (1993)
[6] Jared J. Wolf, Efficient Acoustic Parameters for Speaker Recognition, Journal of the Acoustical Society of America, 51: 2044-2056, (1972)
[7] http://spib.rice.edu/spib/data/signals/noise
[8] Jean-Luc Rouas, Jérôme Louradour, Sébastien Ambellouis, Audio Events Detection in Public Transport Vehicle, ITSC06, (2006)
[9] Daniel P. W. Ellis, Prediction-Driven Computational Auditory Scene Analysis, Ph.D. thesis, Massachusetts Institute of Technology, (1996)
[10] Darryl Godsmark, Guy J. Brown, A Blackboard Architecture for Computational Auditory Scene Analysis, Speech Communication, 27: 351-366, (1999)
dots: an Audio Entertainment Installation using Visual and Spatial-
based Interaction

Nikolaos Grigoriou, Nikolaos Moustakas, Andreas Floros and Nikolaos Kanellopoulos

Dept. of Audiovisual Arts, Ionian University, 49 100 Corfu, Greece


{av200432, av200406, floros, kane}@ionio.gr

Abstract. “dots” is an interactive sound installation that takes into account the spatial position of an arbitrary number of participants
in order to algorithmically synthesize an audio stream in real-time. The installation core is a software application developed during
this work, which employs advanced video and audio processing techniques in order to detect the exact participants’ positions and to
weight-mix short audio granules. Audio mixing is performed using a virtual spatial gridding of the installation space in two
dimensions. The synthesized audio stream reproduction is combined with a number of appropriately designed visual effects, which
aim to enhance the participants’ comprehension and render the “dots” installation a high-quality interactive audiovisual platform.

1. Introduction

The continuous growth and evolution of digital audiovisual technologies is providing the scientific framework to design and develop/demonstrate new ways of artistic expression [1]. The recently established high-definition video and multichannel audio formats and standards are becoming widely accepted by both the media-producers’ and the end-users’ market. Additionally, new terms and ideas originating from the general concept of interaction are nowadays frequently used to provide novel means of audio and visual production, where the audience is actively participating in the production process [2], [3]. As a consequence, a continuously growing number of new media artists have now the option to express their thoughts and feelings by developing and demonstrating their installations, using high quality audio/visual synthesis and playback techniques and equipment [4].

The above interactive installations can be also considered as advanced tools for creating and synthesizing in real-time high-quality audio/visual content. For example, in a typical case, the visual information obtained or processed from an interactive installation can provide feedback to a sound synthesis system for producing novel sound streams in real-time. The opposite approach can be alternatively employed for synthesizing complex visual content and environments using sound recorded signals as input. In any case, the user interaction is usually realized using an interaction algorithm that defines the rules under which selected synthesis parameters are varied. The variation amount is usually obtained from the installation environment, typically using a wide range of wired or wireless sensors, image capturing devices and video and/or audio recording equipment [5].

In this article we present the “dots” interactive platform, which aims to produce complex audio streams taking into account the movement and the instantaneous spatial positions of the participants. In order to provide a motion motivation to the participants, an additional visual component is produced and concurrently projected in real-time, which also depends on the participants positions and motion. Hence, a complex spatial, audio and visual interaction effect is achieved which apart from dynamically creating audio content can be also considered as an interactive game with colours and sounds. The visual content consists of a controlled number of dots with varying colours, providing the name “dots” to the overall installation platform.

The “dots” interactive platform presentation is here mainly performed in technological terms, typically focusing on the platform architecture and some elementary design / algorithmic issues and concepts. Additionally, within the framework of this work, we investigated and evaluated the interaction functionality and the overall installation performance under real-world conditions during an audiovisual festival / exhibition. This evaluation process also allowed the collection and interpretation of some observations related to the behaviour and the means of interaction of the participating audience.

The rest of the paper is organized as following: In Section 2, the “dots” installation architecture is described in detail, focusing on both the core application and the installation room requirements and design. Next, a demonstration of the installation interactive features is presented in Section 3, followed by a brief analysis of the functional and behavioural observations made during an installation exhibition. Finally, Section 4 concludes this work and accents further interaction and audio/visual enhancements that may be integrated in the “dots” platform in the future.

2. Analytic installation description

As mentioned previously, the “dots” platform developed during the present work is an interactive audio/video installation, an interactive game, combining space, time and audio/visual content. From the design point of view, it consists of (a) the core application, which is responsible for handling the basic input / output video and audio signals and for realizing all the signal processing necessary for recognizing the spatial placement of the participants as well as for appropriately mixing and finally producing the audio (and video) signal and (b) the installation space, where the participants are moving and interacting with the core application, producing the final, complex audiovisual output. Both core application and installation space subsystems are described in detail in the following two Sections.

2.1. The “dots” core application

The application was developed and programmed with the open source tool “Processing” [6], which represents a powerful software sketchbook and professional production tool used in many fields of signal processing, science/technology and arts. A number of particular Processing libraries were additionally used for implementing the “dots” installation application (see Figure 1). For example, advanced image and signal processing


techniques were employed to identify blobs from video real-time captures and extract all the spatial information needed. More specifically, the libraries that were used are the following: i) the video capture library (which is included in the basic Processing installation package) ii) the minim library [7] and iii) the “blob detection” library [8]. With the order that they were listed above, the first code library is responsible for the input of the video image into the core application. The second employs the well-known JavaSound API to provide an easy-to-use audio library and it is here used for the reproduction/playback of the synthesized audio component. Finally, the blob detection library is employed for detecting blobs in sequential video frames in real-time (i.e. for the recognition of the outlines from the footage of the video signal).

[Figure 1: Core application architecture]

For the accurate position recognition of the people moving into the installation space and in order to further process the incoming spatial information and to convert the extracted spatial information to an audio stream, a virtual grid was created. This grid was designed based on the coordinates of the specific installation space (and more particularly after testing where a participating person could stand or move inside the installation’s room, in order to avoid non-active grid areas). Finally, for the purposes of this work, provided that the selected installation room has simple geometry with parallel surfaces only, the grid was designed to have nine discrete rectangle areas with equal surface areas. It is also significant to mention that apart from the above discrete rectangle areas, some audio-fading grid territories were also defined within a specific distance of the grid borders. This distance was selected to be equal to 10% of a grid area width/length. The process of audio-fading within the grid borders is analytically described later in this paragraph.

For synthesizing the visual component of the installation, the initially defined goal was to design very simple geometrical and coloured shapes that do not require detailed visual analysis and attention by the participants. Hence, after an extensive sequence of design experiments, the final decision was to employ three circles (dots) with a set of colours of blue, yellow and red. As it is shown in Figure 2, these three dot types are also included into the “dots” installation logo, giving the name to the overall project.

The appearance of these dots on the projection area of the installation is algorithmically defined by the instantaneous positions and the total number of the participants within a specific grid area. More specifically, a dot appears when one of the grid areas noted above gets triggered by the presence of a participant. On the other hand, the dot colour is defined by the number (N) of the detected participants within the i-th specific grid area, using the following equation:

colour[i] = \begin{cases} \text{blue}, & N[i] = 1 \\ \text{yellow}, & N[i] = 2 \\ \text{red}, & N[i] \geq 3 \end{cases}    (1)

that is, the blue dot appears when only one person is traced, the yellow when two people are traced and the red when there are three or more people in the specific active grid area.

[Figure 2: The dots colour shapes designed and the “dots” logo]
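The position-to-grid mapping and the colour rule of Eq. (1) can be sketched as follows. This is illustrative Python pseudocode rather than the installation's actual Processing code; the coordinate convention and the function names are assumptions made here.

```python
# Illustrative sketch of the grid mapping and the dot-colour rule of Eq. (1);
# not the original Processing code of the "dots" application.

GRID_ROWS, GRID_COLS = 3, 3   # nine discrete rectangular grid areas

def grid_area(x, y, room_w, room_h):
    """Map a participant position (x, y) inside the room to a grid-area index."""
    col = min(int(x / room_w * GRID_COLS), GRID_COLS - 1)
    row = min(int(y / room_h * GRID_ROWS), GRID_ROWS - 1)
    return row * GRID_COLS + col

def dot_colour(n_participants):
    """Dot colour for a grid area occupied by n_participants, as in Eq. (1)."""
    if n_participants == 1:
        return "blue"
    if n_participants == 2:
        return "yellow"
    if n_participants >= 3:
        return "red"
    return None   # no dot when the area is empty
```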
Focusing on the audio output synthesis module, a number (typically five) of short sound granules were statically attached to each grid area. These granules are original, exclusively produced for the purposes of this work by the authors, with the use of an external synthesizer and a typical, open source sound processing software tool. Alternatively, some of the audio granules were produced using a real-time sound synthesis engine (typically based on additive or FM synthesis) with a predefined set of synthesis parameters.

The audio synthesis engine is activated similarly to the visual component, that is, when a participant is detected to be within a specific rectangle grid area. In this case, an audio granule attached to the specific grid area is randomly selected and mixed to form the final audio output. When the playback of this granule stops, another one is selected using the same selection algorithm, provided that the participant is still within the specific grid area borders. The random granule selection process is always performed using a normal probability distribution. When more than one participant is detected in the same grid area, no additional audio granules are selected. Instead, the playback volume for the corresponding audio granule is increased by 1dB for each additional participant, that is

G[i] = V + (N[i] - 1)    (2)

where i denotes the specific grid area, V is a playback gain constant (in dB) that obviously depends on the standard playback gain imposed by the sound-card and the active loudspeaker initial setup, and N[i] represents the number of the participants traced within the current grid area. Clearly, when all the audio granules are linearly mixed, the number of participants within a grid area defines the audio mixing-weights.

The playback of an audio granule immediately stops when no participants are detected within a specific grid area. Additionally, as mentioned previously, some audio-fading territories are defined closely to the grid border. When a participant is moving towards a 10% grid border line, an audio fade-out process is activated, by linearly re-adapting the overall audio granule gain as a function of the distance to the grid border. A fade-in gain is correspondingly applied when a participant is moving away from the 10% grid border (but he is still within the fade territory). In that case, the final playback gain for the specific sound granule is progressively decreased as a function of the participants distance within the 10% grid border area.
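The granule gain of Eq. (2) and the linear border fading described above can be summarised in the following illustrative sketch; the names and the exact structure are assumptions made here, not the installation's Processing implementation.

```python
# Illustrative sketch of the per-area granule gain (Eq. 2) with linear border
# fading; names and structure are assumptions, not the original "dots" code.

FADE_FRACTION = 0.10   # audio-fading territory: 10% of a grid area's width/length

def granule_gain_db(v_db, n_participants):
    """Playback gain in dB: +1 dB for each additional participant (Eq. 2)."""
    if n_participants == 0:
        return None                      # playback stops immediately
    return v_db + (n_participants - 1)

def fade_factor(border_distance, area_size):
    """Linear fade factor in [0, 1] as a function of the distance to the grid border."""
    fade_width = FADE_FRACTION * area_size
    if border_distance >= fade_width:
        return 1.0                       # outside the fade territory: full gain
    return border_distance / fade_width  # fades towards zero at the border
```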


2.2. The “dots” installation space

The technical equipment that is needed for realizing the complete installation is a personal computer (typically with a core2duo processor, 2GB of basic memory and ideally a high performance video processing expansion card), a High Definition (HD) video camera for creating the input video signal, a video projector with a minimum resolution of 1024x768 pixels and two active monitor loudspeakers. As it is shown in Figure 3, the camera is located on the ceiling so that it can track and trace the movement of people that exist and move into the installation room. In the same Figure, the 3x3 virtual grid mentioned in the previous Section is also illustrated.

The computer required for executing the core application is located inside the room, and preferably towards the rear of it. Beyond the minimum requirements that the computer system must meet, it should be taken into consideration that the technical equipment must be silent so that there will not be any audience noise disturbance. Hence the computer was placed inside a special sound absorptive construction. The “dots” core application is running on the computer, receiving the data from the camera and projecting the final result on the opposite wall with the help of the projector. As is understandable, the projector must be installed at the back of the room, projecting the final visual effect on the opposite front wall.

The stereo loudspeakers employed are located next to the front two corners of the room and give feedback to the user with sounds that are produced each time with a different combination. In a future enhanced version of the installation, there will be surround sound reproduction from four or more loudspeakers in the room.

During this work, a number of important aspects related to the effective realization of the installation were considered, such as the existence of light (and more preferably the presence of ambient light) which is shed equally in the whole room, and the selection of a light coloured floor or a white carpet so that the blob detection accuracy is significantly increased. Moreover the colour used for painting the walls is an important installation parameter, as it should be the same for all room surfaces and preferably light (for example light blue or white).

[Figure 3: Typical installation space layout]

Finally, a small CRT (or LCD) monitor is installed outside the installation room, on which the same reproduction of the projector inside the room is made. This means that one visitor outside the installation room can see the exact position of each person, together with the number of them that are in a specific part of the grid inside the room.

3. “dots” installation demonstration

The complete “dots” platform was realized for the purposes of the second annual festival of the department of audiovisual arts, held in Corfu, Greece on May 2008. The installation room dimensions were 4 m (width) x 4m (length) x 3.4m (height). As mentioned previously, the camera was placed on the ceiling. Under these dimensions, the effective detection surface covered a satisfactory part of the installation room.

The colour of the room was plain and bright so that the contrast between the human and the room would be high. This condition is very significant for the blob detection algorithm, as it clearly defines the human edges/borders. Accordingly, a white carpet was placed on the floor in order to additionally increase the final video signal contrast. Additionally, the illumination inside the room had to be uniform and ambient; otherwise it would cause problems to the blob detection process, while the sensitivity of the detection was adjusted according to the specific illumination during the initial installation setup.

The equipment described in Section 2.2 had to be hidden in order to be transparent to the people participating in the installation and not to attract their attention. The camera used was connected via a typical firewire interface, allowing fast and robust digital video signal transmission to the core application.

Finally, the gain of the different sound sources was appropriately and independently adjusted, depending on the acoustic properties of the room and the disturbance it would cause to neighbouring installations.

An additional, important point of interest was to design an appropriate audiovisual sub-system to attract the attention of the people passing outside of the installation, as it was difficult to notice the installation functionality from outside. An ambient sound together with the small monitor that was installed outside of the installation room (as mentioned previously), showing the dots’ patterns created by the participants’ interaction, finally (and efficiently) solved this problem (see Figure 4). The ambient sound excited the visitors’ curiosity, so that most of them finally decided to enter into the installation room.

During a series of observations organized under real usage and interaction conditions with participants that didn’t know anything about the experiments (see Figures 5(a) – (d)), it was found that the combination of both sound/audio and visual effects were the basic installation characteristics which gave “energy” to the audience movement. The visual content creates a strange feeling of following up, which in a small number of cases makes somebody move and watch the projection without necessarily thinking about the existence and production of sound. On the other hand, some other participants were mainly interested in the audio signal composition without paying much attention to the visual component. It was interesting that these users were making strange movements following the rhythm of the sound composition. However, the majority of the participants focused on discovering the relation between the audio and visual components by creating complex motion traces and using various spatial body movements. The discovery approach was audio-based (that is, the participants were trying to relate the sound produced with the dots visual effect) or visual-based (the visitors were associating the visual effect with the produced audio stream). In most cases, the interaction algorithm was described as a complex audio/visual interaction, without a


specific algorithmic structure. Specific movements towards the room borders were also frequently observed, aiming to determine the spatial interaction limits.

[Figure 4: The external monitor subsystem]

[Figure 5 (a)–(d): Demonstrating the “dots” installation]

Finally, it should be noted that during the interaction sessions, the complete installation was considered by many participants to be a complex game of sound and colours. Hence, in many cases, two or more participants decided to play together. Driven by both sound and projected colours they were able to produce / synthesize interesting audio and visual effects.

4. Conclusions and ongoing work

In this work, an interactive sound installation prototype called “dots” is presented which aims to synthesize and produce high-quality audio streams in real-time, using a novel interaction algorithm that employs spatial and motion characteristics of the installation participants. More specifically, within the installation room the participants are motivated to move using a dynamically controlled visual content (i.e. a number of coloured dots, the number and colour of which depends on the number and the relative position of the participants within a virtually designed, two-dimensional grid). Accordingly, the human movement dynamically controls the sound synthesis engine, which performs weight-based mixing of short synthesized granules, taking into account the instantaneous participants’ positions within the virtual grid. Hence, a complex spatial, audio and visual interaction effect is achieved, which produces continuous audio variations and non-repetitive and time-variant sound patterns.

Apart from the analytic technical description of the proposed interactive sound synthesis platform and the discussion of a number of design issues and concepts, during this work we additionally demonstrated the “dots” installation in the context of an audiovisual festival recently organized by the department of Audiovisual Arts, Ionian University. A number of test sessions and audience observations organized during the real-world installation demonstration have shown that the proposed interactive platform can be indeed considered as both a visual-based or audio-based synthesis platform, as well as an audio-visual game with complex synthesis rules. In all test cases, after a relatively small acquaintance time interval (typically in the range of 10 seconds up to half a minute) the majority of the participants were able to interact with the system, aiming to control the interaction parameters and produce specific sound and visual effects. It should be also noted that depending on their specific artistic interest, some of the participants considered the visual effects as the basis of sound production and vice versa, confirming the dual synthesis nature of the proposed installation.

Taking advantage of this approach, future enhancements of the proposed platform will be considered, mainly in terms of spatial audio reproduction through modern multichannel audio systems (such as 5.1, binaural [9], wavefield synthesis or ambisonics),


panoramic, three-dimensional visual content projection using multiple video projectors and complicated, dynamically adjusted motion and interaction scenarios.

5. Acknowledgments

The authors wish to thank Mr. Loukas Ziaras for providing the
photographic material included in this work.

References
[1] D. Birchfield, K. Phillips, A. Kidané and D. Lorig,
“Interactive Public Sound Art: a case study”. In Proc. 2006
International Conference on New Interfaces for Musical
Expression (NIME06), Paris, France (2006)
[2] D. Birchfield, D. Lorig, and K. Phillips, "Network Dynamics
in Sustainable: a robotic sound installation", Organised Sound,
10 (2005), pp. 267-274
[3] S. Boxer, "Art That Puts You in the Picture, Like It or Not",
New York Times, April 27, 2005.
[4] N. A. Streitz, C. Rocker, Th. Prante, R. Stenzel, & D. Van
Alphen, “Situated Interaction with Ambient Information:
Facilitating Awareness and Communication in Ubiquitous Work
Environments”. In Proc. HCI International, Crete, Greece (2003)
[5] I. Zannos and J.P Hebert, "Multi-Platform Development of
Audiovisual and Kinetic Installations" in Proc. of the 8th
International Conference New Interfaces for Musical Expression
(NIME 2008), Genova, June 5-7 (2008)
[6] http://www.processing.org (last visited May 29th, 2008)
[7] http://code.compartmental.net/tools/minim/ (last visited May
29th, 2008)
[8] http://www.v3ga.net/processing/BlobDetection (last visited
May 29th, 2008)
[9] Ch. Tsakostas and A. Floros, “Real-time Spatial
Representation of Moving Sound Sources”, Audio Eng. Soc.
123rd Convention, New York, October 2007 (preprint 7279)

AllThatSounds: Associative Semantic Indexing of Audio
Data
Authors: Hannes Raffaseder, Matthias Husinsky, Julian Rubisch
University of Applied Sciences St.Pölten, Austria

Abstract
Motivation: Demand for and supply of digitally stored sound files have increased rapidly
during recent years and have reached an unmanageable volume. For media producers, the
search for suitable sounds is an essential and time-consuming part of their work.
The research project AllThatSounds tries to improve the search procedure by indexing the
files in an associative, semantic way. A method for the systematic categorization of
sounds is introduced to simplify the annotation of audio files with metadata.
Furthermore, additional data is collected by the evaluation of user profiles and by
analyzing the sounds with signal processing methods.
The project's result is a tool for structuring sound databases with an efficient search
component, which is meant to guide users to suitable sounds for the sound track of their
media productions.

Introduction
Supply of and demand for digitally stored audio have increased rapidly in recent years.
The number and diversity, as well as the quality, of available sound files have reached
an unmanageable level. Efficient approaches to the search for audio data play an
important role in the process of media production. Most of the search tools available
today require the user to know important features of the sound before the search can be
carried out. A search request using semantic features, which are closer to human
perception than technical parameters, is hardly possible. In a broader perspective, the
search procedure is also made harder by the volatility of the medium sound, which makes
it difficult to retain acoustic events. This results in an inability to describe sounds
verbally in a proper way. Usually it is not the sounds themselves but rather the
preceding events that caused them that are described.
The research project AllThatSounds aimed at simplifying the process of finding suitable
sounds for media productions. For this purpose, many different possibilities to
categorize and describe sonic events were analyzed, evaluated and linked. Apart from the
practical use of the tool, the research questions raised by the work on these topics
trigger a discussion process about perception, meaning, function and effect of the sound
track of a media product.

The meaning of the sound track in media production


The immense influence of the sound track on the artistic and commercial success of an
audio-visual product is undisputed in contemporary media theory. Examples like "2001 – A
Space Odyssey" by Stanley Kubrick or "The Birds" by Alfred Hitchcock prove this
impressively. At the same time it is astounding that their directors found the music of
György Ligeti and the sound of the Trautonium of Oskar Sala respectively, as both
composers were largely unknown at the time of production.
These examples show that it is often these insider tips, these seldom-played pieces
unheard by a broad audience, that make the works special. Apparently, enormous
specialist knowledge, far greater than that of the director or the producer, is needed,
together with an excellent overview of the available music and sounds, to find the best
sounds that fit optimally to the mood and image of the product. Since only a few persons
have sufficient knowledge in this area, the search for audio is performed more or less
arbitrarily, at best guided by intuition and personal taste. Many interesting pieces of
music and audio files are thus never taken into account.

Description and categorisation of audio files


The main goal of the project AllThatSounds is the description and categorization of digitally
stored audio files. In particular, four different approaches are considered:
• Descriptive Analysis
• Listener’s Analysis
• Machine Analysis
• Semantic Analysis

Descriptive Analysis
The descriptive analysis enables a uniform description of acoustical events from the
sound designer's perspective. The aim is that a sound can be sufficiently and distinctly
registered already at the time it is uploaded into the database.
Based on the insights of Murray Schafer (3), David Sonnenschein (4), Barry Truax (5),
Theo van Leeuwen (6) and earlier thoughts of the author (2), a general classification of
sounds was developed, which enables a differentiated description in the following
categories: type of excitation, sound source, timbre, pitch, dynamics, room, familiarity
and possible use.
The difficulties of describing sonic events in an adequate and universal way become
apparent very soon. On the one hand, a detailed and satisfactory description of the
sound requires high complexity and accuracy; on the other hand, the process of
describing the sound must not take too long in practical use. To keep the time needed to
describe a sound at upload short, a full description cannot be achieved.
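
To make the categories tangible, a single annotation could be stored along the following
lines. The field names and values are illustrative assumptions for this example and are
not the schema actually used in AllThatSounds.

    # Hypothetical descriptive annotation covering the categories listed above
    # (type of excitation, sound source, timbre, pitch, dynamics, room,
    # familiarity, possible use); not the project's actual data model.
    annotation = {
        "file": "door_creak_03.wav",
        "excitation": "friction",
        "source": "wooden door hinge",
        "timbre": "harsh, squeaking",
        "pitch": "rising, indefinite",
        "dynamics": "crescendo",
        "room": "small, dry",
        "familiarity": "high",
        "possible_use": ["horror scene", "transition"],
    }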

Listener’s Analysis
The context, the personal mood of the listener, the field of application or the
intention of the production have a big influence on the description of an acoustical
event. There is a semantic gap between human perception and objectively measurable
signal parameters. Sometimes two technically very similar signals have a totally
different effect (2)(6).
A single description of an acoustical event might therefore lead to an unwanted result
in another context. To mitigate this problem, AllThatSounds uses collaborative
interfaces and description methods which became popular under the buzzword Web 2.0.

Machine Analysis
In recent years many research institutions have started working on methods to give
machines perceptual capabilities similar to those of humans. In the research field of
Music Information Retrieval, fundamental work has been done on using the results of
signal processing methods on audio data to get an idea of its musical or semantic
content. Classification is often done using artificial intelligence models. Basic
examples are automatic beat and tempo tracking, pitch estimation and melody or
instrument extraction.
For AllThatSounds, the application of sound similarity models promised to be very
meaningful. In many audiovisual productions, sounds are used which have no relation at
all to the objects shown on the screen but still appear authentic. Therefore a
similarity model based on Mel Frequency Cepstrum Coefficients (MFCCs) is used to compare
each sound in the database with every other. This model has proved to be very useful in
comparing and classifying sounds and music (1). The 20 most closely related sounds are
stored with each sound and in that way enable a new kind of browsing for viable sounds.
Furthermore, for each sound, descriptors corresponding to the MPEG-7 standard and some
other psychoacoustic parameters like roughness and sharpness are calculated and can be
used in search requests.
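
A simple, statistics-based stand-in for the MFCC similarity model can be sketched as
follows. The use of librosa and of mean/standard-deviation summaries is an assumption
made for this example; the text above does not specify how the similarity model is
implemented.

    import numpy as np
    import librosa  # assumed here only for MFCC extraction

    def mfcc_signature(path, n_mfcc=20):
        """Summarize a sound file by the mean and standard deviation of its MFCCs."""
        y, sr = librosa.load(path, sr=None, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def most_related(paths, query_index, k=20):
        """Return the k sounds whose MFCC signatures are closest to the query."""
        sigs = np.array([mfcc_signature(p) for p in paths])
        dist = np.linalg.norm(sigs - sigs[query_index], axis=1)  # Euclidean distance
        order = np.argsort(dist)
        return [paths[i] for i in order if i != query_index][:k]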

Semantic Analysis
In media production, the semantic content of acoustic events is extremely important for
achieving a certain perception in the consumer. Until today it has been almost
impossible to extract semantic meaning from audio signals automatically, since not only
signal parameters but to a wide extent also the cultural and social experience of the
listener play a role. How these denotations of specific aural events arise, and how they
manifest and are described, is in many respects still uninvestigated. Theo van Leeuwen
shows (6) that different intermodal interplays play an important role, as does the
social, cultural or historic environment.
AllThatSounds aims at combining the metadata annotated by the users and the calculated
features to explore possibilities to extract semantic denotations automatically.
Conclusions can be drawn once the database has been used for a longer time by numerous
users.
Also, to investigate this topic further, a library of short film and video clips that
stand out because of their interesting or typical sound design was created. These clips
were tagged and indexed in categories like "event", "symbol" and "acoustic material".
That way it is easy to search for clips with certain events like "murder" (event),
"grief" (symbol) or "orchestra music" (acoustic material). A detailed evaluation of the
material in this clip library has yet to be done.
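
A minimal filter over such tagged clip records might look as follows; the clip entries
and tag values below are invented for illustration and are not part of the actual clip
library.

    # Illustrative only: searching the clip library by category tags.
    clips = [
        {"title": "clip_012", "event": ["murder"], "symbol": ["grief"],
         "acoustic_material": ["orchestra music"]},
        {"title": "clip_045", "event": ["chase"], "symbol": ["fear"],
         "acoustic_material": ["heartbeat", "drone"]},
    ]

    def find_clips(clips, category, term):
        """Return all clips tagged with `term` in the given category."""
        return [c for c in clips if term in c.get(category, [])]

    print(find_clips(clips, "symbol", "grief"))  # -> the clip_012 record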

Prototype and Perspective


The first prototype of the AllThatSounds sound library with an improved search tool was
released to the public on May 30th, 2008 as a web-based application at
www.allthatsounds.net. At the time of release, the database contained about 3000 sounds,
recorded by students of the Media Engineering course at the University of Applied
Sciences St. Pölten. The sounds can be downloaded in uncompressed wave format and used
under the Creative Commons Sampling Plus license. Everyone is welcome to participate in
the library. The release of a standalone client using AllThatSounds technology is
planned.
Even though the prototype application proved to be extremely useful in the search for
the right sound, in many respects it revealed a lot of new research questions. In future
research projects a deeper examination of the semantic content of acoustic events is
absolutely necessary. Nonetheless, AllThatSounds offers a great basis to build on.

Acknowledgement
Starting in October 2005, a research team at the University of Applied Sciences
St. Pölten worked under project leader Hannes Raffaseder together with the Vienna-based
companies Audite / Interactive Media Solutions and Team Teichenberg and, since April
2007, also with the University of Applied Sciences Vorarlberg. The work was supported by
the Österreichische Forschungsförderungsgesellschaft (FFG) in its FHplus funding
programme until May 2008.

References
[1] Feng, D., Siu, W.C., Zhang, H.J. (2003): Multimedia Information Retrieval and Management, Springer-Verlag, Berlin.
[2] Raffaseder, H. (2002): Audiodesign. Hanser-Fachbuchverlag, Leipzig.
[3] Schafer, Murray R. (1994): The Soundscape – Our Sonic Environment and the Tuning of the World. Destiny Books,
Rochester.
[4] Sonnenschein, D. (2001): Sound Design – The Expressive Power of Music, Voice, and Sound Effects in Cinema.
Michael Wiese Productions, Studio City.
[5] Truax, B. (2001): Acoustic Communication (2nd ed.), Ablex Publishing, Westport.
[6] van Leeuwen, T. (1999): Speech, Music, Sound. MacMillan Press Ltd., London.

IMPROVe: The Mobile Phone as a Medium for Heightened Sonic
Perception
Richard Widerberg
Dånk! Collective
Karl Johansgatan 47 G
SE-41455 Göteborg, Sweden
rwiderberg@gmail.com
+46708977452

Zeenath Hasan
School of Arts and Communication
Malmö University
SE-20506 Malmö, Sweden
zeenath.hasan@gmail.com

Abstract. In this paper, we describe the design and research phase of a project that aims to create conditions for heightened sonic
perception through a mobile phone based software application. The initial design concept is that of an aural architecture for sonic
socio-cultural exchange where sonic realities of the everyday are improvised live in a non-linear mode. The design approach adopted
is collaborative. The project is a work in progress.

1. Introduction

Our experience of our everyday lives is mediated through a 'multitude of mechanically
produced sounds' [1]. The everyday sounds that we experience are produced outside of our
own volition. The capability to capture sounds and play them back has made it possible
to listen to sounds outside of their original context [2]. Sound recording technology
can be used as an extension of listening and enhance our aural awareness [3]. The mobile
phone is also a medium through which sounds are heard outside of their original context.
However, the normative definition of the mobile phone as a medium for communication has
restricted its potential as a medium for sounds that exist outside of immediate
tele-communication. This design and research project explores the potential of the
mobile phone as a medium of communication beyond its currently dominant role as a
transmitter of sounds. The design space for exploration is the mobile phone as a digital
networked medium that is appropriated by social networks to communicate across
boundaries of time, space and context [4]. The project thus proposes the design of the
mobile phone as a medium for the exchange of everyday sounds within communities and
across socio-cultural contexts by mobilizing the potential of the mobile phone as a tool
for the production of everyday sounds.

2. Approach

The project adopts a collaborative design approach by gathering a community of interest
[5] consisting of members from different communities of practice [6] who conduct their
activities at various levels of involvement. The design process gathers participants
around an initial design concept that is used as a boundary object [7] or as a common
point of reference. The design concept that emerges in the interactions with the
gathered participants is regarded as an artifact [8] because perceiving it as such
reveals the direct and incidental connections between the different aspects that come
together in its creation.

3. Initial Design Concept

3.1. Background
A community of practice with a history and tradition of working closely with found
sounds through the means of electronic and digital tools is that of electro-acoustic
musicians. When viewed on a scale of involvement from active to passive, the members of
this community of practice include not only those who actively engage in the creation
and reproduction of their own tools or instruments for mixing sounds, but also those who
passively listen to the sounds that are produced. Lastly, one mode in which
electro-acoustic musicians might work to compose sounds in a group is spontaneously
through live improvisation.

3.2. Improvisation
Professional musicians have practiced improvisation to create compositions
spontaneously. Melodies, harmonies and rhythms are combined within the traditional
structures of music that the professional musician has been trained in. Musical
instruments have been known to tear away from their established histories to accommodate
and challenge each other. When the mobile device is used as a musical instrument in an
improvisation, what musical structures, if any, emerge? Improvisation is a collective
activity. Professional musicians practice it to scope the boundaries of the musical
form. What pursuits will the untrained improviser indulge in when involved in sonic
improvisation?

4. Working Prototype

A working prototype was developed for use as a common point of reference for discussions
on the design concept with gathered participants and other stakeholders.

4.1. Scenarios
Scenarios were constructed as a way to unfold the initial design concept at work and
also to have a shared understanding within the project team. Provided below are two
scenarios in brief.

4.1.1. Scenario 1
A group of friends, untrained in music, record soundscapes from their daily life. The
group meets at a local pub, where there is a sound system for playing the gathered
sounds. They perform a live remix of the sounds on their mobile devices. The aural
exchange affects individual and group understanding at the cultural and social level
through a sharing of the everyday soundscape.

4.1.2. Scenario 2
Trained music practitioners, like cellists, record sound objects through a mobile phone.
The group meets in a concert hall. They perform a group improvisation with the collected
sounds through their respective mobile devices. The exchange is an exploration of the
formal aspect of aural composition that builds on traditional music structures and
creates new forms of music.

4.2. System Functionality
Sounds are collected via a mobile phone and sent to a location where they can be played
back into a sound system. The same mobile phone controls the playback of the collected
sounds in the sound system. Playback control occurs in the physical location of the
sound system. The sounds that are played back are processed live via interaction through
the mobile phone. The output of the processed sound can be heard directly through the
sound system.

4.3. Prototype Application
The Python programming language was used for rapid prototyping on Nokia Series 60
devices [9]. The phone microphone is used for the recording interaction. As current
audio processing capabilities on the phone require some amount of work, mixing and
playback functions are processed on an external computer using Pure Data [10]. The
processing is then controlled live by the mobile phone via a Bluetooth connection.

4.4. Graphical User Interface
For the recording of sounds, a simple interface allows recording, listening to the
recording and then uploading of the recording to a server. The recording interface
consists of three buttons for these three functions. For the improvisation interface,
the four-way directional button was the only key activated for interaction with the GUI.
The 'Play' command selects recorded sounds at random. Three options are provided for
live mixing of the played-back sounds, allowing the participant to control the volume,
speed and loop length. The 'Stop' command stops the playback.

5. Field Activity

The working prototype has been used in various venues and with different settings: for
example at a private coffee lounge consisting of 15 people, at a dinner for four in a
restaurant, and in two separate week-long workshops, each resulting in a live
performance by 20-25 people per workshop at two different public venues. Video
documentation from different locations, which also clarifies the concept, can be viewed
on the project's web site [11]. The initial design concept was also introduced
separately to eight participants who were selected based on their active to passive
involvement with found sounds. Participants were given a mobile phone to carry for a
month's period, after which separate discussions were held with each of them.

6. Findings

When the mobile phone was used as an instrument for recording everyday sounds by
participants who do not pursue musical performance as a profession, they reported
emotional, nostalgic, anecdotal and politically analytical associations with the sounds
they chose to record. Participants who pursued music production as a career mainly
recorded aesthetically pleasing sounds for use in their next performance. The
possibility of recording sounds and then processing and playing them back into a sound
system makes the mobile device a musical instrument among professional musicians.
Although the above two groups have been presented as a dichotomy, it is not a strict
division because both groups reported associations with sounds other than the aesthetic.

7. Conclusion

The project began with the objective to conceive of the mobile phone as a medium for
heightened sonic awareness. It has achieved proof of concept on people's reception to
the existence and use of such an application on the mobile phone. The next threshold is
to use this application for heightened sonic awareness in different pedagogical
contexts.

References
[1] Bull, M., Back, L. (Eds) The Auditory Culture Reader. Berg Publishers, Oxford, New
York, (2003)
[2] Kahn, D. Noise, Water, Meat: a history of sound in the arts. The MIT Press, London,
(2001)
[3] Truax, B. Acoustic Communication. 2nd edition. Ablex Publishing, Westport, (2001),
219
[4] Rheingold, H. Smart Mobs. Perseus Publishing, Cambridge, Massachusetts, (2002)
[5] Fischer, G. Beyond 'Couch Potatoes': From Consumers to Designers and Active
Contributors. First Monday, volume 7, number 12. (2002)
[6] Wenger, E. Communities of Practice: Learning, Meaning and Identity. Cambridge
University Press, USA, (1998)
[7] Marick, B. Boundary Objects. http://www.visibleworkings.com
[8] Diaz-Kommonen, L. Art, Fact and Artifact Production: Design Research and
Multidisciplinary Collaboration. University of Art and Design Helsinki, Finland, (2002)
[9] PyS60. http://sourceforge.net/projects/pys60/
[10] Pure Data real-time graphical programming environment for audio, video, and
graphical processing. http://www.puredata.org/
[11] IMPROVe project web site. http://www.riwid.net/improve

Audio Interface as a Device for Physical Computing

Kazuhiro Jo, RCAST, University of Tokyo / Culture Lab, Newcastle University, jo@jp.org

Abstract. In this paper, we describe the employment of the audio interface as a device
for physical computing. We compare the audio interface with other devices and describe
its characteristics. We also present examples of this employment in three different art
works, Monalisa "shadow of the sound", The SINE WAVE ORCHESTRA stay amplified, and AEO.
We explain the implementation of each work with its different physical components.
Finally, we discuss some of the potential of the audio interface for future
implementations.

1. Introduction

An audio interface is a device that allows a computer to receive/send audio signals
from/to the outside. It converts a stream of bits (digital) into a time-varying voltage
(analog) and vice versa. Most computers currently have a built-in audio interface. The
interface generally provides two (stereo) built-in input/output channels with 44.1 kHz
sampling rate and 16-bit sampling resolution. People employ such audio interfaces for
listening to or recording audio in the fields of music, film, and/or gaming. Some
external audio interfaces provide more input/output channels, higher sampling rates, and
higher sampling resolutions (e.g. 56-ch, 192 kHz, 24-bit).

Physical computing is an approach to sensing and controlling the physical world with
computers [8]. People create a conversation between the physical world and the virtual
world of the computer by employing not only standard computer interfaces (e.g. keyboard,
mouse, display, speaker) but also several kinds of physical components (e.g. sensor,
switch, actuator). The conversation comes into sight in the form of musical controllers,
gaming consoles, interactive installations etc.

To realize such conversations, people have developed several input/output devices, such
as Infusion I-CubeX (http://infusionsystems.com/), Arduino (http://www.arduino.cc/), and
the MAKE Controller Kit (http://www.makezine.com/controller/). These devices allow a
computer to communicate with the physical world by employing physical components. The
devices convert the signals from/to the components with several protocols: I-CubeX
employs MIDI, Arduino employs serial, and the MAKE Controller Kit employs OSC (Open
Sound Control). Programming environments such as Cycling74 MaxMSP
(http://www.cycling74.com/), Adobe Flash (http://www.adobe.com/), and Processing
(http://www.processing.com/) manage multiple input/output signals from/to the physical
world and map them into other types of representation (e.g. graphic, text, sound).

In this paper, we would like to propose the employment of the audio interface as a
device for physical computing. It allows a computer to communicate with the physical
world by audio. The signals from/to the components in the physical world are treated as
audio and processed in the programming environments on the computer.

The audio interface is unique compared with the other devices, I-CubeX, Arduino, and the
MAKE Controller Kit, in at least three points: attachment, sampling resolution, and
sampling rate (Table 1).

Table 1. Comparison of devices (specifications are taken from the websites)

Attachment:
Most computers currently have a built-in audio interface with two (stereo) input/output
channels. It enables people to connect several components (e.g. headphone, microphone,
speaker) without additional input/output equipment. External audio interfaces offer
extra input/output channels.

Sampling resolution:
Sampling resolution is a measure of how precisely the device treats the quantity of an
object. The audio interface basically offers 16-bit sampling resolution, while the other
devices provide 10- or 12-bit resolution. Some external audio interfaces have higher
resolution such as 20-bit or 24-bit.

Sampling rate:
Sampling rate is a measure of how frequently the device looks at the input from the
outside. Most audio interfaces support sampling rates of 44.1 kHz or more because of the
range of hearing (about 20 Hz to 20 kHz). Owing to the limitations of the protocol or
the hardware, the other devices offer lower sampling rates than the audio interface.

2. Implementation

We have employed the audio interface as a device for physical computing in three
different art works, Monalisa "shadow of the sound", The SINE WAVE ORCHESTRA stay
amplified, and AEO. In each work, we employed different physical components with the
audio interface: a push button for Monalisa "shadow of the sound", a fader controller
and a foot switch for The SINE WAVE ORCHESTRA stay amplified, and two accelerometers, a
distance sensor set, and two custom-built converter boxes for AEO. In the following
sections, we explain the implementation of each work in both its hardware and software
settings.

2.1. Monalisa "shadow of the sound"
Monalisa "shadow of the sound" is an installation created with Norihisa Nagano. It is
part of the software platform Monalisa, which enables one to "see the sound, hear the
image" by treating all the image and the sound as sequences of numbers [7].


by binary codes [7]. The work was premiered at Open Space at of a set of control computer, sound synthesis computer, light,
NTT InterCommunication Center from 9th June 2006 to 11th foot switch, fader controller, rotational controller, and 16
March 2007. The work consists of a set of computers with speakers circularly situated in a room (Figure 3). We briefly
custom version of Monalisa applications, projector, camera, describe the procedure of the work below.
microphone, push button, and speaker situated in a room (Figure
1). We briefly describe the procedure of the work below.

Figure 1: Monalisa "shadow of the sound"

When entering the room, each participant saw his/her image,


caught with the camera, projected on the screen. When he/she Figure 3: The SINE WAVE ORCHESTRA stay amplified
pushed the push button, the image was captured as a static
bitmap image data. The image data was transformed to the When entering the room, each participant was exposed to a
application and automatically played as a stream of sound collective sound representation. The sound representation
through the speaker. The microphone simultaneously captured consisted of sine waves that previous participants had produced.
the stream of sound and transformed it to the application. The 16 speakers that horizontally encircled the participant produced
application sequentially re-projected the incoming sound data as the sounds. When a participant stood on the foot switch in front
an image on the screen from top left to down right of the screen. of the controllers, a new sine wave started to play. As he/she
Because of the acoustics of the room, the re-projected image moved the fader controller, the frequency of the sine wave was
was blurred like a shadow with reverberations of the sound. changed. As he/she rotated the rotational controller, the sound
source position (i.e. speaker) of the sine wave was changed.
2.1.1. Push button He/she selected the frequency and the sound source position of
We employed a push button and a set of input/output channel of the sine wave. By leaving from the food switch, he/she left
a built-in audio interface to receive the action from participants. his/her own sine wave as a part of collective sound
The push button has two terminals. The terminals conduct with representation. The volume of the sine gradually attenuated and
the push of the button. To detect the conduction with the audio disappeared after the exhibition period. As more participants left
interface, we cut a standard audio into two fragments. Each their sine waves, more sounds were accumulated. During the
fragment of the cable is attached to each terminal of the button. exhibition period, the collective sound representation was
Each other side of the fragments connected to the input/output changing from a phase where each sine wave was discriminable
channel of the audio interface (Figure 2). to a cluster that consisted of mutually interfering sine waves like
a white noise that contains all frequencies.

2.2.1. Foot switch, Fader controller


We employed a foot switch, a fader controller, a rotational
controller, and two sets (stereo) of input/output channels of a
built-in audio interface to receive the action from participants.
The foot switch and the fader controller are connected with a
control computer with the audio interface and the rotational
Figure 2: Push button
controller is directly connected with the control computer
We detected the conduction of the button as audio signal on through USB. The foot switch has two terminals. The terminals
MaxMSP. We assigned a sine wave for the output channel. We conduct with the push of the foot switch. The fader controller
set up a gate for the input channel. When the terminals conduct also has two terminals. It behaves as a variable resistor. The
with a push of a participant, the gate identifies an incoming move of the fader changes the value of the resister. The foot
audio signal. With the identification, the application captures the switch and the fader controller are separately connected with
image of the participant. two sets of input/output channels of a built-in audio interface in
the same manner of the push button of Monalisa "shadow of
2.2. The SINE WAVE ORCHESTRA stay amplified sound" (Figure 4).
The SINE WAVE ORCHESTRA stay amplified is a work
which was premiered at WAVES exhibition at the Exhibition
Hall ARSENALS of the Latvian National Museum of Art from
25th August to 17th September 2006 [3]. It is a work of a
participatory sound performance project The SINE WAVE
ORCHESTRA (http://swo.jp). The author serves as the one of
four core organizers of the project. In the project, under the
basic concept that each participant plays a sine wave, people are Figure 4: Foot switch (left) and Fader controller (right)
invited to create a sea of sine waves as a collective sound
representation. Each of the work in the project employed the We detected the conduction of the foot switch and the resistance
different instrument with different temporal, physical, of the fader controller as audio signals on MaxMSP. We
environmental, and procedural settings [6]. This work consists assigned a sine wave for each output channel. For the input
channel of the foot switch, we set up a gate to identify the


conduction in the same manner as the push button of Monalisa "shadow of the sound". For
the input channel of the fader controller, we set up a volume calculator. It reports the
change of the resistance of the fader as a change of volume of the incoming audio
signal. With the change of the volume, the control computer sends a message to the
synthesis computer engine to change the frequency of a sine wave. The sampling
resolution for the volume is 16-bit (i.e. 65535 levels). Therefore it is sufficient to
accomplish subtle changes of the frequency of the sine wave.

2.3. AEO
AEO is a sound performance project consisting of three members: Eye, Taeji Sawai and the
author. AEO has performed at several international festivals (e.g. Dutch Electronic Art
Festival 2004, Radar 7 at 24 Festival de MEXICO 2008). In the project, each member takes
one of three roles: performance (Eye), sound design (Sawai) and instrument design (the
author). During AEO performances, the performer holds an instrument in each hand and
shakes, sways, or swings them. These movements by the performer produce patterns of
sound and light through devices and a computer. The instrument has undergone a
transition in function and form over six iterations [5].

In the latest iteration, for the performance at Fujirock Festival 2008
(http://www.fujirockfestival.com/) on 27th July 2008, the instruments consist of two
sets of hand-held plastic spheres. Each sphere contains a three-axis accelerometer, a
part of the distance sensor, a small light bulb, and a switch. One sphere has the
transmitter of the distance sensor, while the other has the receiver. The accelerometer
converts the inclination and acceleration of each axis (i.e. x, y, z) into changes in
analog voltage. The distance sensor measures the distance between the two spheres by
employing an ultrasonic sine wave (40 kHz) with the transmitter and the receiver
(Figure 5). The small light bulb changes its brightness with the change of analog
voltage. The switch changes the represented patterns of sound.

The device for the instrument was changed in each iteration. We employed I-CubeX for the
first and second iterations, and MakingThings Teleo (http://www.makingthings.com/teleo/)
for the third and fourth iterations. In the fifth iteration, we developed two sets of
converter boxes for the accelerometer and the distance sensor. We also employed the MAKE
Controller Kit to change the brightness of the bulb with PWM (Pulse Width Modulation).
In the latest iteration, in addition to the converter boxes, we developed an amplified
box for the small light bulb. The accelerometer, the distance sensor, and the switch are
connected to the converter box with a multi cable, while the small light bulb is
connected to the amplified box with another cable.

Figure 5: AEO instruments

2.3.1. Accelerometer, Distance sensor, Small light bulb
We developed two sets of converter boxes for the accelerometer and the distance sensor,
and an amplified box for the small light bulb. Each converter box has a connector to the
sphere, one audio input, three audio outputs, one extra audio input/output, a power
socket, and three ring modulation circuits (Figure 6). All audio inputs/outputs of the
converter boxes are directly connected to the output/input channels of a 56-ch, 192 kHz,
24-bit external audio interface, an RME Fireface 800 (http://www.rme-audio.com/).

Figure 6: Converter boxes (black for the sphere with the transmitter, white for the
sphere with the receiver)

The amplified box has two audio outputs, one volume control, two audio inputs, a power
socket, and a two-channel amplifier (Figure 7).

Figure 7: Amplified box

The audio outputs are connected to the sphere and the audio inputs are connected to the
output channels of the RME Fireface 800. We treat all the audio signals on MaxMSP.

Accelerometer:
For each accelerometer, we employ one audio input, three ring modulation circuits, and
three audio outputs of the converter box. We assign an audio signal to the audio input.
The audio signal is distributed to the three ring modulation circuits. The amplitude of
the audio signal carries the change in analog voltage from each axis of the
accelerometer (i.e. x, y, z) by employing a ring modulation circuit with two
transformers (Sansui ST-75) and four germanium diodes (1K60) (Figure 8). Each audio
output produces the change of acceleration in each axis as a modulated audio signal.

Figure 8: Ring modulation circuit

We detected the inclination and acceleration of each axis of the accelerometer as a
separate audio signal. For each signal, we set up a volume calculator. It reports the
change in analog voltage from each axis as the change of volume of the incoming audio
signal. We employ the reported value to control the parameters of the represented sound
of the performance.
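
The "volume calculator" used for the fader and the accelerometer axes (and, below, for
the distance sensor) can be approximated in a few lines: the sensor value is recovered
as the short-term amplitude of the incoming audio signal. The block-based RMS estimate
and the threshold gate shown here are assumptions about how such a calculator could be
realized outside MaxMSP; in the works themselves it is implemented as a MaxMSP patch.

    import numpy as np

    def volume_calculator(block):
        """Short-term amplitude (RMS) of one block of audio samples in [-1, 1]."""
        return float(np.sqrt(np.mean(np.square(block))))

    def gate(block, threshold=0.05):
        """Switch detection: True while the assigned tone is present on the input."""
        return volume_calculator(block) > threshold

    def to_control(block, lo=0.0, hi=1.0):
        """Map the demodulated amplitude to a control-parameter range."""
        v = min(volume_calculator(block), 1.0)
        return lo + v * (hi - lo)

    # Example: a 10 ms block at 44.1 kHz carrying an amplitude-modulated sine
    t = np.arange(441) / 44100.0
    block = 0.3 * np.sin(2 * np.pi * 1000 * t)   # amplitude encodes the sensor value
    print(gate(block), round(to_control(block), 3))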


Distance sensor:
For the distance sensor, we employ one extra audio input and one extra audio output of
the converter boxes. The extra audio input is connected to the transmitter and the extra
audio output is connected to the receiver. We assign an ultrasonic sine wave (40 kHz) to
the extra audio input. The audio interface enables the utilization of such a high
frequency thanks to its higher sampling rate compared with the built-in audio interface.
The extra audio output produces the audio signal from the receiver. We detected the
distance between the two spheres as an audio signal. For the signal, we set up a volume
calculator. It reports the change of the distance as the change of volume of the audio
signal. We employ the reported value to control the parameters of the represented sound
of the performance.

Small light bulb:
For each small light bulb, we employ one audio output, one audio input, and one channel
of the amplifier of the amplified box. By feeding audio signals into an appropriate
amplifier, it is possible to bring the coil of the bulb into an excited state. We assign
the represented sound of the performance to the audio input. We amplify the sound and
connect it to the small light bulb. In the performance, the brightness of the bulb
reflects the changes of the represented sound. The volume of the amplified box adjusts
the threshold of the brightness.

3. Related works
There are some precursors who have also tried to employ audio signals as a means to
communicate with the physical world. Allison and Place [1] developed the SensorBox. The
device accepted six sensor inputs and two audio inputs. The data from each sensor was
carried as the amplitude of a sine wave, which was located in the 18 kHz to 20 kHz
range, and mixed back onto the two audio inputs. They did not provide its technical
detail in full; however, their approach was quite the same as ours with the converter
box of AEO. The Canadian artists artificiel also explored the light bulb as a sound
source in their electro-acoustic installation "beyond6281" [2]. They feed processed
audio signals into powerful amplifiers in a similar way to the small light bulb of AEO.
PingPongPlus [4] employed audio signals for position tracking. It detects the position
of a ping-pong ball with a set of microphones mounted under the ping-pong table. They
detect the time of the hit with each microphone and calculate the position from the time
differences. Ms. Pinky [9] by Scott Wardle consists of a set of vinyl records and
software running on MaxMSP. The vinyl contains special signals, and from the signal the
software decodes the velocity, direction, and physical position of the needle on the
surface of the vinyl in real time.

4. Discussions
We explored the characteristics of the audio interface as a device for physical
computing. The audio interface accomplished higher sampling resolution and sampling rate
than the other devices. In the installation Monalisa "shadow of the sound", we simply
employ the push button with the built-in audio interface. In The SINE WAVE ORCHESTRA
stay amplified, we accomplish subtle changes of the frequency of the sine wave with
16-bit sampling resolution. With the latest instrument of AEO, the sampling resolution
and sampling rate of the audio interface enable it to represent subtle movements of the
performer as patterns of sound and light better than the other devices. During the
latest implementation of the AEO instruments, we conducted an informal comparison
between the audio interface and the MAKE Controller Kit. We assigned the change of the
acceleration to the change of the volume of a white noise sound. With the audio
interface, when the performer shakes the sphere, the resulting sound could be heard as a
kind of maracas percussion with its fine time precision. However, with the MAKE
Controller Kit, the resulting change of the sound did not reflect the subtle movement of
the performer. As Wessel and Wright pointed out [10], low latency between gesture and
gesture-controlled audio output is essential for live computer music performance. We
believe that our approach suggests a sensitive way to communicate with the physical
world with its precision of measurement and time.

Wessel and Wright argued for the possibility to apply existing audio DSP (Digital Signal
Processing) modules (e.g. filters, Fast Fourier Transforms, linear predictors) to
process the signals from/to the physical world [10]. We have not investigated such
possibilities well. However, we are considering handling multiple sensor signals with
one channel of the audio interface by employing band-pass filters.

As future work, we consider publishing our developments for the public. We plan to
provide various instructions and examples of the employment of audio signals to
communicate with the physical world. We hope to encourage people to stimulate each other
to discover the potential of the audio interface as a device for physical computing.

Acknowledgements
We would like to thank the other members of Monalisa, The SINE WAVE ORCHESTRA, and AEO.
Monalisa "shadow of the sound" was developed under the support of the NTT
InterCommunication Center and the FY2005 IPA (Information-Technology Promotion Agency)
Exploratory Software Project (Project Manager: KITANO Hiroaki).

References
[1] Allison, J., Place, T., SensorBox: practical audio interface for gestural
performance, In Proceedings of the 2003 Conference on New Interfaces for Musical
Expression, pp.208-210 (2003)
[2] artificiel, beyond6281, Art + Communication 2006: WAVES, RIXC, the Center for New
Media Culture, Riga, Latvia, pp.38-39 (2006)
[3] Furudate, K., Jo, K., Ishida, D., Noguchi, M., The SINE WAVE ORCHESTRA stay
amplified, Art + Communication 2006: WAVES, RIXC, the Center for New Media Culture,
Riga, Latvia, pp.104-105 (2006)
[4] Ishii, H., Wisneski, C., Orbanes, J., Chun, B., Paradiso, J., PingPongPlus: design
of an athletic-tangible interface for computer-supported cooperative play, In
Proceedings of CHI '99, ACM, New York, NY, pp.394-401 (1999)
[5] Jo, K., Transition of an Instrument: The AEO Sound Performance Project, Leonardo
Music Journal No.17, pp.46-48 (2007)
[6] Jo, K., Furudate, K., Ishida, D., Noguchi, M., Transition of instruments on The SINE
WAVE ORCHESTRA, ACM Computers in Entertainment, October 2008 (to be published)
[7] Jo, K., Nagano, N., Monalisa: "see the sound, hear the image", In Proceedings of the
8th International Conference on New Interfaces for Musical Expression, pp.315-318 (2008)
[8] O'Sullivan, D., Igoe, T., Physical Computing, Boston, USA: Thomson Course Technology
(2004)
[9] Wardle, S., Ms Pinky, http://www.mspinky.com/
[10] Wessel, D. and Wright, M., Problems and Prospects for Intimate Musical Control of
Computers, Computer Music Journal, 26, 3, pp.11-22 (2002)

Automatic genre and artist classification by analyzing improvised
solo parts from musical recordings
Jakob Abesser, Christian Dittmar, Holger Grossmann ({abesjb,dmr,grn}@idmt.fraunhofer.de)
Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany

Abstract. This paper introduces a set of high-level features to describe instrumental solo-parts. The set consists
of 148 single- and multidimensional features related to the melodic, harmonic, rhythmic and structural properties
of four instrumental domains. A simple yet common instrumentation model has been applied to describe both
the soloing and the accompanying instruments as well as rhythmic and melodic interaction between them. To
evaluate the features’ discriminative power related to different musical styles, an evaluation for content-based
genre and artist classification has been performed each with two different test sets consisting of symbolic and
real audio data. Two different classifier approaches have been utilized, one commonly used support vector
machine (SVM) classifier with preliminary linear discriminant analysis (LDA) and one novel approach based on the
Rhythmical Structure Profile which allows a tempo-adaptive representation of the rhythmic context provided by
the accompanying instruments. For both classification scenarios, ensemble decisions based on single instrument-
related classifiers led to the highest scores of 84.0% for genre and 58.8% for artist classification.

1 Introduction

Within all musical cultures, music pieces are never performed exactly the way they are
transcribed. Each musician has an individual conception and understanding of aspects
like timing, dynamics and articulation which directly affects his rendition of the
piece. These nuances enhance our perception of music as lively, exciting and rich in
variety. The majority of publications within the Music Information Retrieval (MIR)
community aim to characterize musical pieces by means of features like tempo, bar
measure or key. Our goal is to analyze the semantics of the performance of these pieces
by investigating the individual style of the participating musicians. We focus on solo
parts because they offer the most freedom of individual musical expression to the
soloist in spite of a mostly predefined rhythmical and harmonical composition. Since
many contemporary musicians are usually active within one or only a few music genres,
the assumption can be made that there exist typical playing styles within songs of a
certain music genre besides general rhythmic, harmonic and melodic characteristics.

1.1 Related work

The extraction of high-level features is covered in several publications in the MIR
literature. Different approaches to extract rhythmic features, as for instance derived
from typical rhythmical deviations [8], from the percussion-related instrumentation [9]
of a music piece or from different statistical spectrum descriptors based on periodic
rhythm patterns [11], have been reported. Melodic and harmonic high-level features are
commonly derived from the progression of pitches and intervals. Basic statistical
attributes like mean, standard deviation, entropy as well as complexity-based
descriptors are therefore applied, such as in [13], [12], [5] and [11]. Genre
classification systems are commonly based on low- and mid-level features. A combination
of high- and low-level features for this purpose is described in [11]. In [13], a set of
109 musical high-level features is applied for genre classification for three root
genres, each with three sub-genres, achieving very good classification results. To the
current knowledge of the authors, there are no works dealing with genre classification
solely based on high-level features extracted from the solo part of a song. An overview
of publications concerned with performance analysis and the investigation of
improvisation and interaction can be found in [3] and [15]. In this context, different
application scenarios were introduced within the literature, such as the analysis of
improvisations performed in clinical music therapy [7] or artist classification based on
typical sequences derived from the progression of tempo and dynamics within piano
performances [14].

2 Exposition

To investigate solo parts from songs of different music genres, a simple yet prevalent
instrumentation model consisting of four different instrument categories has been used.
It contains the melody instrument (MEL) played by the soloist, the harmony (HAR), bass
(BAS) and percussion instrument (DRU) as musical accompaniment providing the rhythmic
and harmonic context within the solo part. All songs within the assembled

test sets (described later on in 2.3) fit into this model.

2.1 System overview

2.1.1 Transcription and pre-processing

The implemented system allows the processing of both symbolic and real audio data (MIDI
and audio files). Our experiments are all based upon excerpts of 20 to 40 seconds length
from the analyzed solo parts. To extract the score parameters from symbolic audio files,
the MIDI Toolbox for MATLAB [6] has been used for data conversion. It allows one to
derive a list of all notes containing the parameters note onset and duration (both in
seconds and bars), velocity, MIDI pitch and MIDI channel. To process real audio data,
the Transcription Toolbox [4] (developed at the Fraunhofer Institute for Digital Media
Technology) has been utilized. It is a software toolbox that encapsulates four different
algorithms to perform a separate transcription of the melody, harmony, bass and drum
track of a music piece. It furthermore offers the user manifold ways to correct the
transcription results, e.g. by choosing a temporal quantization grid or a pitch
correction causing all notes to fit the manually selectable key of the analyzed excerpt
of the song. The Transcription Toolbox also extracts the beat grid of the song, which
enables a subsequent projection of all detected note onsets from their absolute values
in seconds to certain multiples of the bar length and thus allows a tempo-independent
onset representation.

2.1.2 Quantization and harmonic analysis

For some of the extracted rhythmic high-level features, the note onsets and durations
have additionally been quantized to a 64th-note beat grid. Furthermore, a simplified
harmony analysis has been applied to the harmony track. The goal was to determine the
root note of each played chord. The system is able to detect the most common 2-, 3- and
4-note chords in all possible inversions by using chord interval templates. In case the
chord was unknown, the lowest note was assumed to be the root note. For internal
representation, all played chord notes are artificially elongated to allow a detection
of the harmonic context for each note played by the soloist. By mapping the interval
between each note of the solo melody and the detected root note of the simultaneously
sounding chord, a representation called functional pitch was defined. Here, only the
type of the interval (third, fifth, etc.) is projected to the corresponding integer
value (3, 5, etc.); the size (e.g. major or minor third) is not taken into account, to
increase the independence from the key type (major or minor).

2.2 Feature extraction

To describe the soloist's way of playing within a solo part, three main questions have
been investigated. Which notes are played within the given harmonic and rhythmic
context? How is the solo part structured? To what extent does the soloist interact with
the accompanying instruments? Timbral characteristics of the instrument, the precise
instrumentation of a solo (e.g. whether the soloist plays an electric guitar or a
saxophone) as well as the playing styles applied by the soloist (like glissando or
vibrato) have explicitly not been taken into account here. A total of 148 high-level
features, both single- and multi-dimensional, have been implemented, and a certain
sub-set of them can be extracted for each one of the four instrumental tracks. In the
following four sections, a selection of the implemented features will be explained in
detail.

2.2.1 Melodic and harmonic features

Three different representations of the melodic progression have been examined to derive
melodic and harmonic high-level features. Besides the absolute and relative pitch
(intervals between adjacent notes in half-tone steps mapped to one octave), the
functional pitch (see 2.1.2) of each note within the solo is determined based on the
aforementioned harmony analysis. A wide range of different features characterizing the
melody have been extracted. These are e.g. the pitch range in halftones, a measure of
chord-tone ratio (derived by analyzing simultaneously played chord notes of the harmony
instrument) as well as the temporal ratio of polyphonic parts, chromatic note sequences
(with consecutive intervals of a half-tone step) and note sequences with constant pitch.
Additionally, the progression of the relative pitch was also converted into the
corresponding functional pitch values to derive a key- and scale-independent
representation of the applied intervals. All single probabilities (e.g. of a fifth
downwards or a third upwards) as well as some other basic statistical features like
zero- and first-order entropy and the D'Agostino measure [2] have furthermore been
computed as melodic features. The temporal ratio of fragments with a constant melodic
direction is mapped to a measure of balanced direction; furthermore the dominant
direction (ascending or descending) is thereby determined as an additional feature.

2.2.2 Rhythmic features

For the computation of rhythmic high-level features, the note onsets, durations and
inter-onset intervals have been analyzed. To characterize the perceived rhythmical
precision of a track related to different beat


tion cost was calculated as an inverse measure of rhyth- the distribution of both rhythmic and melodic patterns
mic precision within the particular beat grid. Further- within the solo.
more, a swing ratio was also calculated for the beat
grids mentioned above using a similar approach as de- 2.2.4 Interaction-related features
scribed in [8]. To derive a rhythmical representation of To describe the interaction between the soloist and
all notes of an instrumental track that is independent the accompanying musicians, two approaches have
from tempo and bar measure, we introduce the Rhyth- been followed. By calculating the euclidean distance
mical Structure Profile (RSP) which is derived from the between bar-wise RSPs one can determine whether
un-quantized note onsets. The RSP is based on parti- two musicians play rhythmically in unison or use
tioning each bar length into k equidistant grid points, complementary rhythms. The aforementioned chord-
where different corresponding binary and ternary val- tone ratio (see 2.2.1) is furthermore calculated bar-
ues of k (2-3, 4-6 etc.) have been investigated, each grid wise to characterize the progression of the harmony-
as an un-shifted and a shifted version related to down- relatedness of the solo melody. For both vectors, both
and off-beat positions. Each note of the instrumental mean and standard deviation are calculated as fea-
track is mapped onto these grids that contain a grid tures.
point aound the note’s onset time. By summing up
the note’s normalized velocities mapped to all defined
2.3 Evaluation
grid-points, the RSP can be calculated and saved in
form of a three-dimensional matrix. Afterwards, one The partitioning of the data sets into training and
can both analyze the temporal distribution of notes test data generally has been performed class-wise to
over all grids and as well as within each grid. They al- a proportion 50% - 50% randomly for each iteration,
low the calculation of the features dominant rhythmical whereas a total of 50 iterations were passed through
grid (containing the majority of all notes), dominant for each evaluation scenario.
rhythmical feeling (down- or off-beat) and dominant
rhythmical characteristic (binary or ternary). Further- 2.3.1 Genre classification
more an algorithm to detect syncopations within dif- For the genre classification experiments, a 6-fold-
ferent rhythmical grids based on the RSP was imple- taxonomy has been utilized, consisting of the music
mented. genres Swing (SWI), Latin (LAT), Funk (FUN), Blues
2.2.3 Structure-related features

To describe the structure of a solo, both rhythmical and melodic repetitions within the instrumental tracks have been sought. For this purpose, an algorithm for detecting repeating patterns within character strings (Correlative Matrix approach [10]) has been utilized. These character strings are derived from the absolute pitches as well as from the quantized onset and duration values. All detected patterns were mapped into a three-dimensional representation consisting of the parameters length, incidence rate and mean distance. As a fourth parameter, intended to characterize the recall value of a detected pattern, the so-called relevance has been calculated from the normalized pattern parameter values as r_Pat = l_Pat,Norm + f_Pat,Norm + (1 - d_Pat,Norm)^2. It is based on the simple assumption that the recall value increases with ascending pattern length and frequency and decreases with ascending temporal distance, whereas the impact of the distance is furthermore reduced by the squaring operation. Basic statistical features like mean, median, standard deviation, minimum and maximum value are calculated for each of the four pattern parameters, as well as the number of patterns related to the overall number of notes of the current track. In total, 63 feature values capture manifold information on the structure within the solo.
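To make the relevance measure defined above concrete, here is a minimal sketch of the formula; the two example parameter sets are invented purely for illustration.

    def pattern_relevance(length_norm, frequency_norm, distance_norm):
        """Relevance r_Pat = l_Pat,Norm + f_Pat,Norm + (1 - d_Pat,Norm)^2,
        with all three pattern parameters normalized to the range [0, 1]."""
        return length_norm + frequency_norm + (1.0 - distance_norm) ** 2

    # A long, frequent pattern whose repetitions lie close together scores high:
    print(pattern_relevance(0.9, 0.8, 0.1))   # 2.51
    # A short, rare pattern with widely spaced repetitions scores low:
    print(pattern_relevance(0.2, 0.1, 0.9))   # 0.31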
2.2.4 Interaction-related features

To describe the interaction between the soloist and the accompanying musicians, two approaches have been followed. By calculating the euclidean distance between bar-wise RSPs one can determine whether two musicians play rhythmically in unison or use complementary rhythms. The aforementioned chord-tone ratio (see 2.2.1) is furthermore calculated bar-wise to characterize the progression of the harmony-relatedness of the solo melody. For both vectors, both mean and standard deviation are calculated as features.

2.3 Evaluation

The partitioning of the data sets into training and test data generally has been performed class-wise to a proportion of 50%-50% randomly for each iteration, whereas a total of 50 iterations were passed through for each evaluation scenario.

2.3.1 Genre classification

For the genre classification experiments, a 6-fold taxonomy has been utilized, consisting of the music genres Swing (SWI), Latin (LAT), Funk (FUN), Blues (BLU), Pop-Rock (POP) and Metal-Hardrock (MHR). Besides instrument-related single classifiers, the efficiency of ensemble classifiers (based on a probabilistic majority decision) was investigated. Two different approaches have been chosen: a common support vector machine (SVM) classifier with preliminary linear discriminant analysis (LDA) and a nearest-neighbor classifier based on the aforementioned RSP.
LDA-SVM classifier: Before the evaluation, all feature vectors are extracted from solo excerpts of a particular data set. After a feature-wise variance normalization of the training data, LDA has been performed for dimensionality reduction of the feature space to 5 dimensions (since we are dealing with a six-class problem). Support vector machines have been chosen as classifier approach, more precisely C-Support Vector Classification (C-SVC) using the radial basis function (RBF) kernel as described in [1]. Subsequent to variance normalization and dimension reduction, the optimal classifier parameters C and γ are determined using a threefold grid-search and the classifier model is trained afterwards. To evaluate the trained classifier, all feature vectors from the test data passed the same two preliminary steps. Finally, the classifier output was compared with the ground truth label vector.
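A pipeline of this kind could be sketched with scikit-learn roughly as follows. The random placeholder data, the grid-search ranges and the feature count are assumptions for illustration, not values taken from the paper.

    # Hypothetical sketch of the described variance normalization -> LDA -> RBF-SVM chain.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    # Placeholder random data standing in for the extracted high-level features and genre labels.
    X_train, y_train = rng.normal(size=(120, 63)), rng.integers(0, 6, 120)
    X_test, y_test = rng.normal(size=(60, 63)), rng.integers(0, 6, 60)

    pipeline = Pipeline([
        ("scale", StandardScaler()),                          # feature-wise variance normalization
        ("lda", LinearDiscriminantAnalysis(n_components=5)),  # six classes -> at most 5 dimensions
        ("svm", SVC(kernel="rbf")),                           # C-SVC with RBF kernel
    ])

    # Threefold grid search over C and gamma (the grid itself is an assumption):
    param_grid = {"svm__C": [1, 10, 100], "svm__gamma": [0.01, 0.1, 1.0]}
    search = GridSearchCV(pipeline, param_grid, cv=3)
    search.fit(X_train, y_train)
    accuracy = search.score(X_test, y_test)   # compare predictions with ground-truth labels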

RSP classifier: The main idea behind this novel approach is to model the rhythmic context provided by the accompanying instruments during the solo part, which is usually specific for each music genre. Therefore, the RSPs of the bass track, the harmony track and the drum track, with separate investigation of the bass drum (B) and snare drum (S) track, are computed globally over the total length of the analyzed excerpt to extract the most frequent rhythms and to minimize the influence of rhythmical breaks and variations. After their computation, the instruments' RSP matrices of each song in the training data set are stored in genre- and instrument-related containers for later use. After applying the same computation step to the songs within the test data set, the euclidean distance between each extracted RSP matrix and all stored matrices related to the same instrument is calculated. The minimum distances to each container can be converted into assignment probabilities for the corresponding genres due to rhythmic similarity.
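A minimal sketch of such a nearest-container classifier might look as follows; the container layout and the distance-to-probability conversion (a simple inverse-distance normalization) are assumptions for illustration, not the authors' exact method.

    import numpy as np

    def rsp_genre_probabilities(test_rsp, containers, eps=1e-9):
        """test_rsp: dict instrument -> RSP matrix of the test song.
        containers: dict genre -> dict instrument -> list of stored training RSP matrices."""
        scores = {}
        for genre, per_instrument in containers.items():
            distances = []
            for instrument, stored in per_instrument.items():
                if instrument in test_rsp:
                    # minimum euclidean distance to any training RSP of this genre and instrument
                    distances.append(min(np.linalg.norm(test_rsp[instrument] - m) for m in stored))
            scores[genre] = 1.0 / (np.mean(distances) + eps)   # smaller distance -> higher score
        total = sum(scores.values())
        return {genre: score / total for genre, score in scores.items()}  # normalize to probabilities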
Listening test: To compare the results of the two classifier approaches with the ability of human listeners to assign an excerpt from a solo part to a music genre, a listening test has been performed. 25 test persons between 20 and 42 years of age (µ = 26 years) with a relatively high average musical background of µ = 12 years participated. Each test person had to assign 15 excerpts from different solo parts (randomly selected from the symbolic-audio genre test set) to one of the given music genres. The instrumentation of the excerpts has been unified (melody, harmony and bass instrument assigned to a piano sound) to prevent a genre assignment based on commonly appearing instruments (e.g. Metal-Hardrock with electric guitar). Three different instrumentation scenarios have been investigated: the first five pieces only consisted of the melody instrument, the second five pieces of the melody and harmony instrument and the last five pieces of the complete instrumentation (see 2). A simple metronome has furthermore been added within the first two scenarios to provide a rhythmical orientation to the test persons.

2.3.2 Artist classification

To evaluate the features' discriminative power to identify the artist who is playing a certain solo, two experiments have been performed. For each of them, four musicians playing the same instrument and being allocated to related music genres have been chosen, and for each of them 30 excerpts of solos have been collected. In detail, the first set consists of four famous saxophone players (John Coltrane, Dexter Gordon, Charlie Parker and Joshua Redman) and the second one of four well-known electric guitar players (Eric Clapton, Rory Gallagher, Jimi Hendrix and Stevie Ray Vaughan). Since the accompanying musicians are not supposed to have an impact on artist classification, only the features derived from the melody track have been provided to the artist LDA-SVM classifier. Training and evaluation are performed as described for genre classification.

3 Results

The genre classification results are listed in Table 1. Besides the two classifier approaches described in 2.3.1, the results of the listening test related to the three investigated scenarios are presented in the fifth column. The single classifiers of both the LDA-SVM and the RSP approach achieved classification scores up to 71.7% for MIDI and up to 51.8% for real audio input. Using ensemble-based classification, scores up to 84.0% and 63.4% respectively within the aforementioned 6-fold genre taxonomy were achieved. We assume that partly incomplete or erroneous transcription results are the main reasons for lower scores for real audio data. The achieved scores for artist classification are 58.8% (electric guitar) and 56.0% (saxophone).

                LDA-SVM            RSP        Human
Input           MIDI     Audio     MIDI       MIDI (re-synth.)
MEL             63.8     44.4      -          37.6
HAR             57.3     45.1      63.7       -
MEL + HAR       71.7     -         -          58.8
BAS             70.1     51.8      66.3       -
DRU             62.2     35.9      61.0 (B)   -
                                   47.7 (S)
ALL             -        -         -          63.1
ENS             84.0     63.4      73.2       -

Table 1: Genre classification results in %

4 Summary and future work

In this paper, we presented different high-level features related to the melodic, rhythmic, structural and interaction-related description of improvised solo parts. A simple but common instrumentation model allows an application of these features for a wide range of different music genres. Using the extracted information of all four instrumental tracks by applying an ensemble classifier, classification rates up to 84.0% within a 6-fold genre taxonomy were achieved. As the listening test's results show, a genre classification solely based on the solo part of a song is a difficult task.


Despite the dominant solo instrument, the genre assignment is primarily based on the characteristics of the accompanying instruments. Considering that timbre- and instrumentation-related features have not been taken into account here and only the solo part has been analyzed, the results are encouraging for further research within this topic. As the results of the artist classification reveal, describing the way of playing by using high-level features basically allows a discrimination between different performing artists. On the other hand, there is still a lack of semantic information. To overcome this, additional features to describe playing styles in detail as well as specific instrumentation and timbre aspects need to be implemented to derive better results for artist classification. Regardless of the classification task, one has to emphasize the importance of a well-performing transcription system in order to analyze real audio data by the use of high-level features based on score parameters.

References

[1] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm (last called: 10.09.2008), 2001.

[2] P. J. Ponce de León and J. M. Inesta. Pattern recognition approach for music style identification using shallow statistical descriptors. In IEEE Transactions on Systems, Man and Cybernetics - Part C: Applications and Reviews, volume 37, pages 248-257, March 2007.

[3] R. López de Mántaras and J. L. Arcos. AI and music: From composition to expressive performances. AI Magazine, 23:43-57, 2002.

[4] C. Dittmar, K. Dressler, and K. Rosenbauer. A toolbox for automatic transcription of polyphonic music. In Proc. of the Audio Mostly Conference, 2007.

[5] T. Eerola and A. C. North. Expectancy-based model of melodic complexity. In Proc. of the 6th Int. Conf. of Music Perception and Cognition (ICMPC), 2000.

[6] T. Eerola and P. Toiviainen. MIDI toolbox: Matlab tools for music research. www.jyu.fi/musica/miditoolbox/ (last call: 10.09.2008), University of Jyväskylä, Jyväskylä, Finland, 2004.

[7] J. Erkkilä, O. Lartillot, G. Luck, K. Riikkilä, and P. Toiviainen. Intelligent music systems in music therapy. In Music Therapy Today, volume 5, 2004.

[8] F. Gouyon, L. Fabig, and J. Bonada. Rhythmic expressiveness transformations of audio recordings - swing modifications. In Proc. of the 6th Int. Conf. on Digital Audio Effects (DAFX), September 2003.

[9] P. Herrera, V. Sandvold, and F. Gouyon. Percussion-related semantic descriptors of music audio files. In Proc. of the 25th Int. AES Conf., 2004.

[10] J.-L. Hsu, C.-C. Liu, and A. L. P. Chen. Discovering nontrivial repeating patterns in music data. In IEEE Transactions on Multimedia, volume 3, pages 311-324, September 2001.

[11] T. Lidy, A. Rauber, A. Pertusa, and J. M. Inesta. Improving genre classification by combination of audio and symbolic descriptors using a transcription system. In Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR), 2007.

[12] S. T. Madsen and G. Widmer. A complexity-based approach to melody track identification in MIDI files. In Proc. of the Int. Workshop on Artificial Intelligence and Music (MUSIC-AI), January 2007.

[13] C. McKay and I. Fujinaga. Automatic genre classification using large high-level musical feature sets. In Proc. of the Int. Conf. on Music Information Retrieval (ISMIR), pages 525-530, 2004.

[14] C. Saunders, D. R. Hardoon, J. Shawe-Taylor, and G. Widmer. Using string kernels to identify famous performers from their playing style. In Proc. of the 15th European Conference on Machine Learning (ECML), pages 384-395, 2004.

[15] G. Widmer and W. Goebl. Computational models of expressive music performance: The state of the art. In Journal of New Music Research, volume 33, pages 203-216, 2004.

The Heart as an Ocean
exploring meaningful interaction with biofeedback
Pieter Coussement, Marc Leman, Nuno Diniz, and Michiel Demey

IPEM, Dept. of Musicology, Ghent University, Blandijnberg 2, B-9000 Ghent, Belgium
pieter.coussement@ugent.be

Abstract. This paper discusses the need to redefine the concept of ‘interaction’ within the context of interactive (audio) installations.
This discussion is based on the realization of ‘The Heart as an Ocean’, a media piece that explores the relationship between auditory
senses and biometric feedback.

1. Interactivity in the context of arts

'The Heart as an Ocean' is a new media piece (designed by the first author) that is based on the artistic use of the participant's auditory senses and biometric feedback. In a broader context, 'The Heart as an Ocean' also functions as an experimental setting in which new forms of interactivity are explored, more particularly in the context of media installations and new technologies. The media piece explores the fundamentals of meaningful interaction by looking at to what extent the physiology of the body can be both sensor and actuator in an art context. The installation was first exhibited at Gallery Jan Colle in Ghent, Belgium [i] in February 2007.

Figure 1: Gallery Jan Colle, Belgium.

2. Problem definition

Within the arts, interactive media installations become more and more prominent, although interactive media installations are seldom part of the permanent collection of museums. Interactive media installations have mostly been exhibited at special festivals like the Ars Electronica festival in Austria or during SIGGRAPH in the U.S. Recently, some private organizations started to build collections of media installations (in Belgium, see the Verbeke Foundation [ii]). Although these are oriented towards a more general public rather than a public of specialists, they still require a specialised exhibition environment and some advanced maintainability of the installations.

2.1. Interface and Usability

Usability and user interaction are among the most defining factors when developing interactive media installations for art. When dealing with usability it is important to take into account the technical complexity of both the user interface and the sensory data mapping, which mainly influences the experience. On the one hand, the complexity can be due to the user interface being too complex; on the other hand, the reactions of the system may have little to no obvious correlation with the public's interaction. Until now, this has resulted in a way of thinking about 'interactivity', usability and interface design as a subtle equilibrium between the need for easy-to-use interfaces and a certain amount of complexity. This should result in an exciting experience where people are challenged to explore and play with the installation. Although this may be sufficient to explore some technical issues surrounding new media installations, it is not sufficient to explore a more conceptualised, meaningful interaction.

2.2. Meaningful interaction

In the design of 'The Heart as an Ocean', art has been conceived as a way to communicate between artist and public, but also to communicate on a broader social level among the public itself. Art communicates ideas through the senses, and the artistic experience is a result of the effectiveness of this communication. It involves a conversation between artist and art piece, and between art piece and public. Within an ideal interaction this relation is symbiotic both in concept and in realisation. There is a need to differentiate between responsiveness and interaction, even though both may have their distinct use in digital arts and entertainment.

The responsiveness of interactive art can be situated on a range between 100% responsive and 100% autonomous. From that perspective, hyperinstruments [iii], for example, are 100% responsive, since they always respond in the same manner to the same stimuli. However, in using hyperinstruments in interactive installations, the public is often confronted with a learning curve during which the technical possibilities and functions of the device have to be explored and learned. Of course this may be fun and exciting in itself. Yet, in the end this focus on the instrument may result in a rather limited experience of interaction, since the interaction does not necessarily imply a goal-direction. Therefore no effect of non-mediation or implicit conceptual meaning can be developed. As a result, the artist may have the feeling that the public is not able to transcend beyond the barrier of the technological mediator, and the public may have the feeling that it never experienced the artist's intentions.

The question is whether it is possible to cope with this problem of technological mediation and learning curves. Are there ways to overcome the inherent limitations of hyperinstruments?

3. Basic concept

In 'The Heart as an Ocean', the goal was to achieve a natural flow of communication without the restrictions of a too technical interface that could obtrude the intended interaction. The interaction had to work like an affordance: no sophisticated explanations should be necessary to interact, and user feedback should be based on a very strong homogeneity in 'experiencing'.

3.1. In depth concept

The media piece was designed in such a way that the state of mind of any person who interacts with the installation could be sensed. This way it would be possible to influence that person's physiology through sound in such a way that the outcome would be similar for every participant.

To achieve this goal, a synthesised ocean wave was created that imitated the breaking of a wave on an imaginary shore. The intensity, level, duration and amplitude of the wave are all derived from the heart rate of the person who interacts with the installation. The way in which the musical parameters relate to heart rate is as follows: an agitated person, with a strong and fast heart rate, would generate strong, loud and fast waves. A calm person, with a weak and slow heart rate, would generate slow and gentle waves. Since a new wave is generated at every heartbeat, the auditory illusion of a sea breaking on a shore is created. This effect is emphasised by a spatial movement of each wave in a setup with eight speakers. Each wave starts its cycle randomly at one position and moves through the auditory space using the other speakers. The sound of the sea was initially chosen because of its soothing effect. Secondly, water has played an important role in spiritual, psychological and physical ablution throughout history. Moreover, the sound of the sea contains all frequency bands and can therefore be conceived as a sort of white noise signal spread out in space. Because of this, it largely drowns out all other surrounding sounds, resulting in a very personal auditory space. Michael Wenger, Dean of Buddhist Studies at the San Francisco Zen Centre [iv], speaks about this:

"Moving water is 'white noise,' in which you can hear many things. Each individual may hear a different song in the water. Just listening to the sound--not tying it to anything, just letting sound wash over you--is a way of letting go of your ideas and directly experiencing things as they are."
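As a rough illustration of this mapping, the sketch below derives a wave's duration and loudness from the current heart rate and picks a random starting speaker for its spatial trajectory. The scaling constants and function names are invented for illustration and are not taken from the installation's Max/MSP patches.

    import random

    def wave_from_heartbeat(bpm, n_speakers=8):
        """Map the current heart rate (beats per minute) to the parameters of one synthesized wave."""
        wave_ms = 60000.0 / bpm                 # one wave per heartbeat, so its length is one beat period
        level = min(1.0, bpm / 120.0)           # a faster, 'agitated' heart rate gives louder waves
        start = random.randrange(n_speakers)    # each wave starts at a random speaker position
        path = [(start + i) % n_speakers for i in range(n_speakers)]  # and travels through the others
        return {"duration_ms": wave_ms, "level": level, "speaker_path": path}

    # A calm participant (55 bpm) produces slow, gentle waves; an agitated one (110 bpm) fast, loud ones:
    print(wave_from_heartbeat(55))
    print(wave_from_heartbeat(110))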

4. Technical realisation

When the installation was first presented there were some difficulties related to the use of the heart rate sensor, which had to be taken into consideration while programming the software. During a recent upgrade of the project, the heart rate sensor has been replaced with a wireless sensor. This gives better results and leads to a less obtrusive interaction.

4.1. Hardware

'The Heart as an Ocean' consisted of seven satellite speakers, one subwoofer and a heart rate sensor hooked up to an Arduino [v] board connected to a MacBook. The seven satellite speakers were spread across a wall spanning eight meters. The subwoofer was discreetly placed in the room. An M-Audio FireWire Audiophile was used in conjunction with the computer's line output to create an aggregated device providing eight line level outputs. An extra nineteen-inch screen showed the software GUI. The speakers were hidden in order to emphasize the atmosphere of the exhibition space, giving more room to the audio.

4.2. Software

The software is developed using Cycling '74's Max/MSP [vi]. On the top level there is a GUI running, which enables a real-time HD recording of the interaction. This can be rendered to a DVD and is offered as a multiple.

Figure 2: GUI enabling a named recording.

Beneath the GUI level there are several patches working together to capture and calculate the heart rate from the sensor, to create the waves, to take care of the spatial position of the wave and to render the recordings.

Figure 3: Calculating wave size in milliseconds according to heart rate.


Figure 4: Spatial position of wave.

5. Action-reaction cycle.

In order to get a meaningful interaction, a feedback loop was installed, based on an action-reaction cycle model for ecological interaction (Leman, 2007) [vii]. The cycle consists of four stages: Play, Listen, Judge and Change. Play is the stage in which sound is generated. Listen involves the perception of sound, while Judge involves its evaluation. In the final stage the action is changed to modify the resulting sound. Within interactive art, both the Judge and Change stages are often left unexplored, resulting either in responsive art intended as responsive art or in responsive art intended as interactive art. Both have little to do with what I call meaningful interaction. Within the stages Judge and Change lies the essence of meaningful interaction, because they rely on the activity of the participant, even if this activity is unconsciously stimulated.

In the case of 'The Heart as an Ocean', the Judge and Change stages are implemented in a rather direct way. Judge is calculated as the relationship between the speed and amplitude of previous waves and the heart rate. Change is calculated as the amount of energy the installation will put into the next wave. To complete the cycle, the installation will first Listen to the heart rate and only then start to Play.
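The following sketch expresses this loop in Python; the concrete formulas for Judge and Change are placeholders meant only to show the structure of the Listen-Judge-Change-Play cycle, not the installation's actual patch logic.

    def interaction_cycle(heart_monitor, synth, previous_waves):
        """One pass through the Listen -> Judge -> Change -> Play cycle."""
        bpm = heart_monitor.read_bpm()                      # Listen: perceive the participant's heart rate
        if previous_waves:
            mean_speed = sum(w["speed"] for w in previous_waves) / len(previous_waves)
            mean_level = sum(w["level"] for w in previous_waves) / len(previous_waves)
            judge = bpm / (mean_speed * mean_level + 1e-9)  # Judge: relate heart rate to earlier waves
        else:
            judge = 1.0
        energy = min(1.0, 0.5 * judge)                      # Change: decide the energy of the next wave
        wave = {"speed": bpm / 60.0, "level": energy}
        synth.play_wave(wave)                               # Play: generate the next ocean wave
        previous_waves.append(wave)
        return wave

Here heart_monitor and synth stand for whatever objects deliver heart-rate readings and render audio; they are illustrative stand-ins rather than components of the actual installation.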

Even though the mapping is rather obvious, the results were impressive. People interacting with the installation all made similar comments. The participants found the interaction to have a soothing effect. Most of the participants identified with the sound. They liked to listen to themselves. Some of them fantasized about what kind of beach (either sandy or rocky, etc.) they were on by analysing the sound. Most of them wanted to walk in the exhibition space instead of sitting down, which has led me to change the sensor into a wireless sensor for future exhibitions.

6. Problems to solve / Problems solved.

Concerning the installation, some changes have further improved the basis for interacting. Originally, the heart rate sensor used a three-point measuring technique, with the consequence that the public had to sit down in front of the installation. In addition, there was also a lot of noise in the sensor data that had to be filtered out. More accurate commercial heart rate sensors were at that time too expensive or too hard to implement. As a result, not every heartbeat came across and sometimes, although largely filtered out, noise was interpreted as a heartbeat. By using new commercial wireless heart rate sensors, these problems have now been fixed. Although a speaker setup spanning eight meters is enough to have a distinct spatial impression, a lot of extra reverb effects were added to ameliorate the experience. Having a bigger exhibition space with more speakers and more generated waves is preferred.

7. Discussion

The reaction of the public corresponded with the intended design of this project. And after some time, depending on the subjects, the effects of the installation on the public's physiology were quite similar. This leads to some points of interest in extending our research of interactive art. First of all, there is indeed the need to further (re)define interaction within interactive arts. The idea that the installation needs to mediate intelligently between the artist's concept and what the public conceives is of great importance. Secondly, if you choose to implement interactivity, it should be a necessity from the point of view of the affordance of the installation, which engages the user in an interaction that is originally intended, rather than in an interaction about the technological mediator. Finally, it can be stated that interactive and responsive art are essentially different from each other. The main difference is that interactive art subscribes to an action-reaction cycle model in which feedback has an effect on the conditions for interaction.

References.

[i] http://www.jancolle.be/ (last visited 20.08.2008)
[ii] http://www.verbekefoundation.com/ (last visited 20.08.2008)
[iii] http://www.media.mit.edu/hyperins/ (last visited 20.08.2008)
[iv] http://www.sfzc.org/ (last visited 20.08.2008)
[v] http://www.arduino.cc (last visited 20.08.2008)
[vi] http://www.cycling74.com (last visited 20.08.2008)
[vii] Marc Leman, Embodied Music Cognition and Mediation Technology, The MIT Press, Cambridge, Massachusetts, 2007.

EarMarkIt: An Audio-Only Game for Mobile Platforms

David Black, Kristian Gohlke, and Jörn Loviscach


Hochschule Bremen (University of Applied Sciences), Bremen, Germany
dblack@stud.hs-bremen.de, kgohlke@stud.hs-bremen.de, joern.loviscach@hs-bremen.de

Abstract. The fields of both mobile gaming and audio-only gaming have rapidly expanded in the last few years. In this work, we
attempt to combine these two fields and present an immersive audio-only game prototype meant for future application in mobile
gaming platforms. The game is based on the scenario of an open-air market, in which the player roams freely by actually walking
around, guided by the auralization played back via headphones. We created a wireless stride- and direction-sensing device with
attached manual input controls to simulate future mobile gaming devices that allow the user to explore an open space while playing a
game.

1. Introduction

Our objective for this project is to create an audio-only game in which a player can navigate a virtual audio space by walking around in the real world. By using audio cues present in the game, the player, wearing a wireless sensor device, can detect his position relative to the objects and interact with them. Our project aims at developing an audio-only game that combines the accessible and innovative features of immersive audio with the freedom and ubiquity of mobile gaming.

Previous work [1] demonstrated a similar audio-only immersive game that could be controlled via a stationary handheld input device. We seek to extend this concept and add a mobile device with location sensing. Given the technical issues with location sensing based on GPS, cellular phones or WLAN for small distances [2], we elected to use a stride and direction sensor to create a game in which the user could interact with his own real-life space. We keep the important elements of creating audio-only games in mind, including the simplification of auditory cues, meaningful motion tracking, and narrative interactivity [3].

Today's handheld mobile electronic devices, such as phones, PDAs, or MP3 players, often provide large computing power. Some of these devices are also equipped with sophisticated sensing and communication capabilities. However, most of them still have rather tiny buttons and screens and thus do not appear to be platforms that could yield an immersive gaming experience. But such an argument would neglect the audio capabilities: almost any mobile music player combined with decent headphones can deliver an immersive audio experience that is comparable to the output quality of most computers.

Because they do not rely on any visual feedback, audio-only games are highly accessible for blind or visually impaired people [3, 4]. However, common audio-only games rely solely on buttons as the primary means of interacting with the game [1]. What differentiates our game is the integration of a wireless stride and direction sensor box that allows the player to interact with the game in a more pleasurable way by taking full advantage of his or her physical and perceptual capabilities.

2. Game Outline

The basic premise of the presented game is that of an open-air food market in which the player navigates through a corridor filled with different sounds of vendors, musicians, and other typical market events, see Fig. 1. In the game, the player needs to purchase a certain amount of food items and then find a tram station before the tram leaves the station. The player is allotted a purse of money, which he can use to pay for the food items. In the course of the game, there is also a thief who tries to steal food that the player has collected. Points are awarded based on the speed in which the player completes the entire set of purchases and the number of food items remaining in his or her basket.

Figure 1: The user walks across the market, buying food and trying to catch the tram in the end.

2.1. Gameplay

The game is based on a number of auditory cues that help the player navigate the audio space of the market and provide information about the food items that need to be collected. "Environmental" sounds are original self-recorded or freely available recordings of typical markets that are spaced along the one-dimensional playing corridor to simulate a market. "Interaction" sounds are recordings that involve the gameplay elements including the food vendors, the tram and the thief. Recordings of the vendors include those in which they are heard trying to sell one of four single food items: apples, sugar, donuts, and fish. The tram sound is that of a tram passing a station. The thief sound is made up of footsteps that gradually become audible if the user does not move for a specified amount of time.


The primary difficulty in the game is for the player to distinguish between the environmental and interaction sounds, and to recognize when a vendor is selling a certain food product for an acceptable price. When the player encounters a vendor selling food, he or she can make a purchase by pushing a momentary switch button on the wireless sensor and input device that correlates to that type of food. If the player presses the wrong type of food button, the vendor becomes "upset" that the player requested food that is not being sold at that location, and that food vendor disappears, making that instance of food no longer available for the player to purchase at this place. If the player requests the correct type of food, the food item is added to his basket, the price of the food is deducted from his purse, and that specific vendor becomes silent. Pressing a query button in conjunction with one of the food buttons allows the player to check the remaining food that he must still purchase. The player must seek out the best price for each food item so that the money in the purse is sufficient for all of the items required.

The second difficulty in the game is that of limited time. The player must find the tram station and arrive in time. Since the player is required to search the market for the best prices of food, he cannot simply buy the first item of each type of food that he encounters. He or she must remember where certain items were sold and possibly return to them.

The third difficulty in the game is the thief, who continually follows the player through the corridor. If the player does not take a certain number of steps within a specified time, the thief advances towards the player and by random selection steals one of the food items from the basket. After a food item is stolen, the player must navigate again to try to find a remaining vendor - vendors do not reappear after selling the player an item - in order to meet the game requirements. The thief element is included in the game to keep the player active and moving.

The game ends when either the player arrives at the tram station or the tram leaves the station first. A score is calculated based on how much money the user has left, how much time is left before the tram leaves, and how many items from the required list the user has collected in his basket.
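As a minimal illustration of such a scoring rule, consider the sketch below; the weights are invented, since the paper only states which quantities contribute, not how they are combined.

    def final_score(money_left, seconds_left, items_collected, items_required,
                    w_money=1.0, w_time=0.5, w_items=100.0):
        """Combine remaining money, remaining time and collected items into one score.
        The weights are placeholders; only the three ingredients come from the game design."""
        return (w_money * money_left
                + w_time * seconds_left
                + w_items * items_collected / max(1, items_required))

    print(final_score(money_left=12.5, seconds_left=40, items_collected=3, items_required=4))   # 107.5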

2.2. Navigation

Similar to a real market layout, the auditory cues are placed along a corridor with a set beginning and end. The player starts the gameplay at the beginning of the corridor and is able to navigate through the corridor using the wearable wireless sensor device. Except for the thief sound, the auditory cues take up a certain portion of the corridor and the user can hear them pass by as he navigates through. The cues are panned either to the right or left side of the player's position. As the player passes them, the localization of the auditory cues in the stereo field changes based on the user's position relative to them. This, coupled with a high-shelf filter with variable cutoff frequency, simulates the effect of the sounds traveling around the user. When the player changes his heading (as measured by the direction sensor), the stereo field adjusts accordingly so that sounds rotate relative to the user's orientation.
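A possible way to realize this per-cue rendering is sketched below: constant-power stereo panning from the cue's bearing relative to the player, plus a distance-dependent cutoff for a high-shelf attenuation. The formulas are generic audio-programming idioms chosen for illustration, not the actual Max/MSP patch.

    import math

    def render_cue(player_pos, player_heading, cue_pos):
        """Return stereo gains and a high-shelf cutoff for one auditory cue.
        Positions are 2D points in steps; the heading is in radians."""
        dx, dy = cue_pos[0] - player_pos[0], cue_pos[1] - player_pos[1]
        distance = math.hypot(dx, dy)
        bearing = math.atan2(dy, dx) - player_heading           # angle of the cue relative to the player
        pan = max(-1.0, min(1.0, math.sin(bearing)))            # -1 = hard left, +1 = hard right
        left = math.cos((pan + 1.0) * math.pi / 4.0)            # constant-power panning law
        right = math.sin((pan + 1.0) * math.pi / 4.0)
        attenuation = 1.0 / (1.0 + distance)                    # simple distance attenuation
        cutoff_hz = max(800.0, 8000.0 / (1.0 + distance))       # far cues sound duller (high-shelf cut)
        return left * attenuation, right * attenuation, cutoff_hz

    print(render_cue((0.0, 0.0), 0.0, (3.0, 4.0)))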
3. Prototyping Process and Game Hardware

Our audio-only game is meant in the future to be played on mobile devices such as phones, PDAs, or even MP3 players. However, for the development of the current prototype we used flexible off-the-shelf components in both hardware and software. The use of these well-known tools allows us to set the focus of development on the game concept and the evaluation of the game, rather than having to deal with platform-specific problems that might arise when developing for a wide range of mobile devices (also see [5] for another example of rapid audio prototyping using Max/MSP). As has been shown in previous audio-for-gaming implementations [6], this rapid prototyping approach also makes it possible to evaluate the use of technologies that are not yet implemented in the current mobile devices for the mass market.

For the current prototype of the game, we chose sensors that are already available in some modern mobile phones today (such as the Nokia 6210 Navigator, Apple iPhone and similar). These phones could be a near-ubiquitous platform to control, or given enough processing power, even run the entire game. As the gaming hardware is thus already available to many people, the game could be quickly deployed to any mobile phone that provides the required technical capabilities. The vibration motor of the phone could also be used to augment the audible feedback by basic haptic sensation.

In our prototype, the game is controlled by a small, battery-powered, wireless device that the user wears on the belt, a headband or similar. The device contains different sensors that allow us to determine the current heading of the player, detect strides, and register button presses. To achieve this, we use a three-axis accelerometer, a magnetometer ("compass sensor"), a Bluetooth modem and pushbuttons tied together by an Arduino microcontroller board, see Fig. 2. The compass is used to determine the player's heading and needs to be calibrated once at the beginning of the game. As the player walks, the accelerometer data is used to detect footsteps [7]. Each step the player makes in the real world also changes his position in the game. By integrating the steps with the current heading of the player, it is possible to move freely inside the game in two dimensions. The soundscape is adapted to the player's current position by changing the stereo field and using some basic psychoacoustic effects to simulate sounds coming from different directions.
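The position update described here can be sketched as follows; the step-detection threshold, the fixed stride length and the function names are assumptions for illustration, with the actual implementation living in the microcontroller firmware and the Max/MSP patch.

    import math

    STEP_THRESHOLD = 1.3   # acceleration magnitude (in g) above which a footstep is assumed
    STRIDE = 1.0           # one step moves the player by one game unit

    def update_position(position, heading_rad, accel_magnitude, was_above):
        """Dead-reckoning update: on a new acceleration peak, take one step
        along the current compass heading. Returns the new position and peak state."""
        is_above = accel_magnitude > STEP_THRESHOLD
        if is_above and not was_above:                     # rising edge = one detected footstep
            position = (position[0] + STRIDE * math.cos(heading_rad),
                        position[1] + STRIDE * math.sin(heading_rad))
        return position, is_above

    # Feed a stream of (heading, |acceleration|) samples through the detector:
    pos, above = (0.0, 0.0), False
    for heading, accel in [(0.0, 1.5), (0.0, 0.9), (math.pi / 2, 1.6), (math.pi / 2, 1.0)]:
        pos, above = update_position(pos, heading, accel, above)
    print(pos)   # two steps, one east and one north -> roughly (1.0, 1.0)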
Figure 2: To test the prototype software, we developed a wireless handheld controller and sensor box.

To handle the incoming raw values from the sensors more flexibly on the PC that computes the audio, we developed a slim serial-to-OSC proxy. The proxy requests the raw sensor readings from the microcontroller. Inside the proxy this data can be filtered and interpreted to generate OSC messages [8]. These messages can be streamed to any network address, thus the different modules of the game could also run on different computers if necessary. This approach has proven to be a robust and easy way to interchange data between the different modules of the prototype, while keeping the architecture of the prototype as flexible as possible.
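A slim proxy of this kind could look roughly like the sketch below, here using the pyserial and python-osc packages; the serial port name, baud rate, message format and OSC address patterns are assumptions, not details given in the paper.

    # Hypothetical serial-to-OSC proxy: poll raw sensor readings from the microcontroller
    # over a serial line and forward them as OSC messages to the game patch.
    import serial                                  # pyserial
    from pythonosc.udp_client import SimpleUDPClient

    port = serial.Serial("/dev/ttyUSB0", 57600)    # placeholder port and baud rate
    osc = SimpleUDPClient("127.0.0.1", 9000)       # game software listening for OSC

    while True:
        line = port.readline().decode(errors="ignore").strip()
        try:
            ax, ay, az, heading, buttons = (float(v) for v in line.split(","))
        except ValueError:
            continue                               # skip malformed readings
        # Raw values could be filtered here (e.g. smoothed) before being forwarded.
        osc.send_message("/sensor/accel", [ax, ay, az])
        osc.send_message("/sensor/heading", heading)
        osc.send_message("/sensor/buttons", int(buttons))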


The OSC stream from the proxy is sent to the actual game software: a Max/MSP patch processes the OSC messages and handles the game logic and sound output. Based on the user's current orientation and estimated position, the game logic generates the audio stream. To play the game, the player does not have to be aware of the computer. It remains in the background and merely processes the user input from the wireless device to generate the audio stream that is then sent to a set of wireless headphones.

As the interaction with the game is not bound to a specific location, the player can use any open space to interact with the game, yielding a user experience that is very similar to using a mobile device. Whereas it allows simulating the experience of using a mobile device, the prototype stays open for rapid extensions and changes to both hardware and software during development. Once the development of the prototype is finished, the game can be ported to any sufficiently equipped mobile device.

4. Obstacles

When a player steps in one place instead of actually walking, each of these steps still moves the position of the game character. However, the sensors we use only allow a rough estimate of the position of the player in the real world. This is not crucial for the game, as the game logic does not need to know the absolute distance the player moved, but only a relative position measured in steps and the current heading. In addition, this effect allows the game to be played in small places.

Some commercially available pedometers that are meant to measure absolute distances suffer from similar problems, although some claim a quite decent positioning error of about 3% [9]. A further issue is that the one-axis magnetometer inside the wireless device gives erroneous headings whenever it is tilted. The integration of the compass readings with the accelerometer data or the use of inertial sensors, such as gyroscopes, could eliminate this obstacle.

Commonly, computer games provide some display of inventory, score or other information concerning the state of the game. These displays are often placed in the visual periphery of the player, to be readily available. Creating a similar information channel in an audio-only game appears rather difficult. Text-to-speech synthesis for auditioning the player's score could be a way to provide information whenever it is requested. This would be a pragmatic way to provide the data to the player, but it appears as a brute-force approach. We are currently investigating other means to convey the necessary information without having to interrupt the soundscape of the game.

5. Conclusion

In terms of our objectives of creating an immersive audio-only game that could be applied to mobile devices, we have succeeded insofar as we have developed a working model of the game that can be played wirelessly via our prototype sensor and control box.

We have demonstrated that audio-only games with an immersive gaming experience are ready to be implemented on mobile devices that only have a very limited amount of screen space, and that the process of prototyping applications for mobile devices can be accelerated by using off-the-shelf tools. In addition, the absence of a graphical user interface also makes the game accessible to the visually impaired.

We will continue to develop our game to include improved inventory-checking controls and more cohesive sets of sounds. We will develop our wireless sensor algorithms to give a more accurate reading, leading to improved game navigation.

The corridor-based game stage lends itself well to being played on an actual train or tram platform, in which the user might be able to synchronize the end of the game with the arrival of the real-life tram or train. Also, the technology demonstrated here may fit into interactive installations such as those found in museums or exhibits.

The sound output is optimized for stereo headphones, as they are a ubiquitous audio playback device for mobile devices. When being played on a personal computer, surround output would deepen the immersion in the game.

The prototype of our game hardware could also work as a controller for a wide range of different virtual instruments or basically all audio applications that can receive OSC messages. These applications could even be used as a sound source for the game.

References

[1] Cahill, Conor, The Interactive Audio Game Project, http://www.audiogame.com [retrieved on 27.05.08] (1997)

[2] Strahan, Robert, Location Sensing Technologies. University College Dublin, Report No. e=mC2.2.1.1.2002, (2002)

[3] Röber, N., and Masuch, M., Playing Audio-only Games: A Compendium of Interacting With Virtual, Auditory Worlds. DiGRA Conference 2005, (2005)

[4] Donker, H., Klante, P., and Gorny, P., The Design of Auditory User Interfaces for Blind Users. Proc. NordCHI 2002, (2002)

[5] Hobbs, Scott. http://pushyourdesign.com/Scott/exectechnology.html [retrieved on 25.05.08]

[6] Paul, Leonard. Audio Prototyping with Pure Data, http://www.gamasutra.com/resource_guide/20030528/paul_01.shtml [retrieved on 26.05.08] (2003)

[7] Scarlett, Jim. Enhancing the Performance of Pedometers Using a Single Accelerometer, http://www.analog.com/library/analogdialogue/archives/41-03/pedometer.html [retrieved on 26.05.08] (2008)

[8] Wright, M., Freed, A., and Momeni, A., Open Sound Control: State of the Art 2003, International Conference on New Interfaces for Musical Expression, Montreal, pp. 153-159 (2003)

[9] Schneider, P. L., Crouter, S. E., Lukajic, O., and Bassett, D. R. Jr, Accuracy and Reliability of 10 Pedometers for Measuring Steps over a 400-m Walk, Medicine & Science in Sports & Exercise, Vol. 35, No. 10, pp. 1779-1784, (2003)

