Professional Documents
Culture Documents
JON FRANCOMBE, TIM BROOKES, AES Member, AND RUSSELL MASON, AES Member
(j.francombe@surrey.ac.uk)
There are a wide variety of spatial audio reproduction systems available, from a single
loudspeaker to many spatially distributed loudspeakers. An important factor in the selection,
development, or optimization of such systems is listener preference and the important per-
ceptual characteristics that contribute to this. An experiment was performed to determine
the attributes that contribute to listener preference for a range of spatial audio reproduction
methods. Experienced and inexperienced listeners made preference ratings for combinations
of seven program items replayed over eight reproduction systems and reported the reasons for
their judgments. Automatic text clustering reduced redundancy in the responses by approx-
imately 90%, facilitating subsequent group discussions that produced clear attribute labels,
descriptions, and scale end-points. Twenty-seven and twenty-four attributes contributed to
preference for the experienced and inexperienced listeners respectively. The two sets of at-
tributes contain a degree of overlap (ten attributes from the two sets were closely related);
however, the experienced listeners used more technical terms while the inexperienced listeners
used more broad descriptive categories.
As discussed by Francombe et al. [3], the numerous stud- audio attribute elicitation; for example, in Zacharov and
ies produce a complex picture and do not leave clear guide- Koivuniemi’s preference mapping experiment [7]. How-
lines as to which attributes should be investigated or opti- ever, Francombe et al. [17] found that non-trained listeners
mized. For example, in a review of academic and popular were able to perceive and report important differences be-
sources, Pedersen and Zacharov [12] found 200 descriptive tween auditory situations, and Francombe et al. [18] found
terms, even after removing redundant words. The relative that groups of experienced and inexperienced listeners per-
importances of such terms are unclear, and their relation- ceived similar attributes when comparing real and repro-
ship to listener preference is not known. This problem is duced audio—albeit with less detail from the inexperienced
compounded when different studies define concepts with listeners. In order to determine attributes that can be used
different labels, or use the same label for apparently dis- for improving audio quality for the general public, it is
tinct attributes [13]; for example, Berg [14] highlighted important to consider the characteristics of spatial audio
contrasting definitions of the term “envelopment.” replay that non-trained listeners deem to be important.
One factor that contributes to the complexity is that pre-
vious studies have elicited all terms that can be used to 1.1 Experiment Aims
describe differences between the stimuli under test, rather
The brief literature review above suggests that there is a
than only those that are important or relevant to listener
need to investigate expert and inexpert listener perceptions
preference. It is likely that in real listening situations, some
of the spatial and timbral characteristics of current spatial
of the attributes may make little or no contribution to lis-
audio reproduction systems, including headphones and ex-
tener preference and are, therefore, less important. Where
tended channel-based systems (with height loudspeakers),
studies that try to map the attributes to preference judg-
in order to derive a clearly-defined list of the attributes
ments have been performed (e.g., [7]), the results have been
that are likely to be important to listener preference. Con-
inconclusive.
sequently, an experiment was designed to determine the
Another significant challenge is presented by the rapid
important perceptual differences between spatial audio re-
development of new and competing technologies. It is likely
production methods that contribute to listener preference.
that new loudspeaker technologies or signal processing
The experiment methodology used in this paper was
methods introduce distortions, artifacts, and new capabili-
designed to compensate for the problems identified in
ties that have not previously been considered. This is a prob-
Sec. 1 and, therefore, to produce an attribute set that: (i)
lem that is difficult to overcome without periodically rein-
contains only the most important attributes (i.e., those that
vestigating the important perceptual characteristics, per-
directly contribute to listener preference) by basing the ex-
forming a wide-ranging and future-proof experiment, or
perimentation on attributes elicited alongside preference
determining key attributes that will always be desirable re-
ratings (Sec. 3); (ii) is relevant to a representative range of
gardless of the system under test. This means that additional
loudspeaker arrangements and reproduction methods, in-
elicitation experiments are required when paradigm shifts
cluding channel-based systems with height loudspeakers
occur. For example, many spatial audio elicitation studies
(Sec. 2.1); (iii) is derived from a variety of program mate-
have been limited to horizontal-only loudspeaker config-
rial selected to encompass a wide range of both timbral and
urations or, where periphony has been considered, have
spatial characteristics (Sec. 2.2); (iv) takes the similarities
used ambisonics rather than emerging channel-based setups
and differences between the experience of trained and non-
such as 9.1 and 22.2; elicitation of terms related to elevated
trained listeners into account (Sec. 5.2); and (v) provides
loudspeakers in these emerging systems is now required.
clear, concise attribute labels, descriptions, and end-point
Studies have tended to focus on a small range of techniques
scales developed by listeners through group discussions
rather than comparing across different “families” of re-
(Sec. 5).
production methods, and more recent developments (such
This methodology was intended to produce an attribute
as channel-based reproduction with height loudspeakers)
set that closely reflects listener preference and could ulti-
have not been considered. For example, Lorho [15] inves-
mately contribute to the development of a perceptual model
tigated the perceptual characteristics of stereo audio replay
of listener preference based on spatial audio quality.
in mobile devices, which might result in a different set of
attributes. While it is not possible to include every potential
reproduction method in an experiment, nor to completely 1.2 Experiment Design
future-proof against new technologies or the adaptation of The experiment had three stages: a preference rating and
listeners to such developments, it is still important to cover free elicitation task, an automatic text clustering procedure,
a wide range of potential reproduction methods when at- and a group discussion. These stages are discussed in more
tempting to elicit an attribute set that is widely applicable. detail below. The methodology was based upon that used by
Similar considerations must also be made about the pro- Francombe et al. [17], which drew on techniques used in the
gram material content used in experiments. Audio Descriptive Analysis and Mapping (ADAM) method
It is commonly held in the attribute elicitation literature [19]. The reproduction methods and program material used
that determining the differences between products, or mak- for the experiment, and the participants, are described in
ing ratings on sensory attributes, should be performed by Sec. 2.
experts, while preference ratings should be made by con- In the preference rating and free elicitation stage, par-
sumers [16, p. 343]. This approach has been taken in spatial ticipants were asked to make preference ratings for audio
program material items replayed over various spatial re- computer monitor at 0 degrees azimuth, −15 degrees
production methods in a paired comparison paradigm and elevation)
to give reasons for their judgments. Stage one is discussed r Stereo (layout “A”)
further in Sec. 3. r 5-channel surround (layout “B”)
A reduction stage was then used to remove redundancy r 9-channel surround (layout “D”)
from the text data and facilitate the group discussion phase; r 22-channel surround (layout “H”)
without removing such redundancy, the task would have r Cuboid (loudspeakers at ±45 degrees and ±135 degrees
been lengthy and repetitive, which risks annoyance and azimuth, ±30 degrees elevation)
boredom for the participants, resulting in lower quality r Headphones
results. An automatic text clustering algorithm was de-
signed based upon the similarity of word use between The reproduction methods were set up on the Surrey
the elicited phrases. Stage two is discussed further in Sound Sphere (a metal geodesic sphere of radius 1.9 m) in
Sec. 4. a room with dimensions of 7.85 m × 12.38 m (with a heavy
Finally, the resulting clusters were presented to groups curtain at 8.23 m) × 4.00 m, and an RT60 of 0.52 s to 0.26
of listeners who were tasked with turning the individual s between 125 Hz and 4 kHz. All loudspeakers were Gen-
responses into a set of attributes covering the percep- elec 8020A, with the exception of the subwoofers (Mackie
tual differences that were experienced during stage one. HRS120) and relatively low quality computer speaker (Log-
The group discussion involved putting clusters that de- itech S150). The digital-to-analog converters (DACs) used
scribed essentially the same perceptual attribute into sets were RME Fireface 800s with the exception of the com-
and then labelling, describing, and creating scale end- puter speaker, which had a built-in DAC. The headphones
points for the sets. Finally, relationships between the at- used were Sennheiser HD600 open-back headphones, with
tributes produced by the experienced and inexperienced a Focusrite Virtual Reference Monitoring (VRM) Box used
listeners were explored. Stage three is discussed further in as the DAC (only the DAC, and not the additional VRM
Sec. 5. engine, was used).
The overall results are discussed in Sec. 6 alongside sug- For all methods with the exception of headphones and
gestions for further work. low quality mono, a bass management algorithm was im-
plemented in Max/MSP; the crossover frequency was set to
66 Hz (the lower limit of the Genelec free field frequency
2 EXPERIMENT SETUP response [20]), and the subwoofer nearest to each speaker
A limitation of existing spatial audio attribute elicitation was used for the low frequency reproduction. Where the
experiments highlighted in the introduction was that the re- loudspeaker was equidistant from both subwoofers, i.e.,
sults are to some extent specific to the reproduction methods on the median plane, a single subwoofer was selected
selected. The systems used in this experiment were selected arbitrarily.
to make the test as externally valid and future-proof as pos- The loudspeakers on the sphere were calibrated so that
sible within reasonable constraints. It was also necessary a −18 dBFS (RMS) pink noise signal replayed from a
to select program material items that covered a range of single loudspeaker measured 85 dBA (slow) at the listening
timbral and spatial properties. Details pertaining to the se- position [21]. The subwoofers were calibrated so that the
lection of reproduction methods and program materials are sound pressure levels (SPLs) of a single loudspeaker and
given in the following sections, along with a description of each subwoofer were matched within a frequency band in
the experiment participants. which the devices each exhibited a flat frequency response
[22].
r Brass quintet (recorded live in the University of Surrey each program item. The repeated items were selected ran-
Studio 1 with different capture techniques for each re- domly from the possible 28 combinations. Different random
production method) selections were made for each program item, but all listen-
r Jazz quintet (as above) ers repeated the same stimuli. This resulted in a total of 31
r Pop recording (studio multitrack recording separately judgments per participant per program item. All judgments
mixed for each reproduction method) for a single program item were made in one test session,
r Big band (multi-microphone recording in the Royal requiring a total of seven sessions per participant. The pro-
Albert Hall, separately mixed for each reproduction gram items were presented in a different random order for
method) each participant. Within each session, the individual com-
r Sport (football match broadcast with commentary sep- parisons were presented on separate user interface pages
arately mixed for each reproduction method, including and randomly ordered. The preference ratings—including
first-order ambisonic decode of Soundfield crowd micro- further experimental details, a depiction of the user inter-
phone) face, and analysis of listener reliability—are further dis-
r Experimental music (first-order B-format rendering cussed by Francombe et al. [23].
of Rotating psychoacoustic tuning curves by Florian The user interface was created in Max/MSP, with one
Hecker, decoded to each reproduction method) paired comparison of stimuli on each test page. Participants
r Film excerpt (5.1 clip from Skyfall with dialogue, music, were asked to make their preference ratings on a continu-
and sound effects; manually downmixed and upmixed ous horizontal slider and required to type their reasons for
to each different reproduction method, with added first- giving a particular preference rating on separate lines into
order B-format rain decoded to different loudspeaker ar- a text box on the screen. The wording of the instructions
rangements) pertaining to the free elicitation response was as follows.
“In the box provided, please type the factors that led
It should be noted that it is not possible to fully sep- to your preference choice. Please type each factor on a
arate the reproduction method from the production tech- separate line. You are not asked to list all of the differences
nique used, and that future capture, mixing, and encoding between the two stimuli; please list only those factors that
techniques may potentially change the listening experience contributed to your preference decision. However, please
for a particular reproduction method. However, a range of try to be as specific as possible. For example, it is not
state-of-the-art methods (including mixing by experienced sufficient to simply say that you felt A was “better” or
practitioners and direct microphone capture) was used in “worse” than B; please say which aspects of the stimuli led
order to maximize the range of elicited attributes and to to you feeling this way. You may use positive or negative
ensure generalizable results. terms. It is possible that for a given stimulus pair you might
have a large preference for some aspects of A and a large
2.3 Participants preference for other aspects of B; this might lead to an
All stages of the experiment were performed by the same overall preference that is not especially large one way or
two groups of participants: seven experienced listeners and the other, or no preference. If you preferred some aspects
eight inexperienced listeners. The experienced participants of one stimulus and some aspects of the other, please type
were fourth year undergraduate students on the Music and all of these aspects.”
Sound Recording course at the University of Surrey, Guild-
ford, UK, all of whom had passed a technical ear training
module and had critical listening experience in recording
studios. The inexperienced participants were current stu- 3.1 Free Elicitation Results
dents or recent graduates in a range of disciplines. None of Prior to the group discussion stage, simple analysis was
the inexperienced listeners had specific technical ear train- performed on the text data to begin to understand the re-
ing, although they may have had a musical background sponses.
and/or have participated in listening tests before. A total of 6806 responses were collected from the 15
listeners (39107 words). The mean phrase length was
5.75 words. Of the 6806 responses, 4220 were unique
3 STAGE ONE: FREE ELICITATION phrases. The experienced listeners produced a total of 3773
The purpose of the free elicitation stage was to collect a responses—of which 1967 were unique (52.13%)—with a
pool of text data that contained listeners’ reasons for giving mean of 4.72 words per phrase. The inexperienced listen-
particular preference judgments, so that these phrases could ers produced a total of 2740 responses—of which 2270
form the basis of subsequent group discussion sessions. were unique (82.85%)—with a mean of 7.12 words per
The stimuli were presented as paired comparisons be- phrase. These results suggest that, on the whole, the inex-
tween the reproduction methods, and listeners were asked perienced listeners were more verbose and their responses
to rate their degree of preference for stimulus A or B (or were more varied, which could be attributed to their lack of
no preference). With eight methods, there are a total of prior knowledge of suitable descriptive terminology. How-
ever, the number of responses given by each individual
(8
2 ) = 28 comparisons. In order to facilitate analysis of listener (Fig. 1) indicates that there are substantial individ-
participant reliability, three comparisons were repeated for ual differences even within the two listener groups, in terms
1000 form for each group with 263 and 317 unique phrases for
Unique responses
Total responses the experienced and inexperienced listeners respectively.
800
Extrapolating to the number of elicited phrases for this ex-
Number of phrases
0
E1 E2 E3 E4 E5 E6 E7 I1 I2 I3 I4 I5 I6 I7 I8 4 STAGE TWO: AUTOMATIC CLUSTERING
Participant
Fig. 1. Number of responses (total and unique) in the free elicita- Using automatic reduction methods to reduce the num-
tion task for each listener. The prefix “E” indicates an experienced ber of responses from the free elicitation stage could signif-
listener; “I” indicates an inexperienced listener. icantly reduce the length of the discussion process and thus
lead to more considered judgments and a higher quality
outcome. However, using automatic reduction has the dis-
of the total number of responses and the proportion that are advantage of potentially obfuscating subtle nuances used
unique. in the language, especially when the participants that con-
A quick, informal method for investigation of a large tributed the words are present in the group discussion and
text data set is to create “word clouds,” which visualize able to explain their meanings. Consequently, it is neces-
the data by representing the frequency of word use (after sary to achieve a compromise in reduction that facilitates
the removal of stop words) by the size of a word. Such the group discussion while not obscuring unique attributes
word clouds were generated using Wordle1 [24] for the from the human participants.
experienced and inexperienced listener responses (includ- Reduction methods have been used in previous stud-
ing all elicited text, i.e., without filtering duplicated re- ies: Zacharov and Koivuniemi [7] truncated words from a
sponses). The experienced listener cloud is dominated by free elicitation to their first five letters and rejected words
the term enveloping, suggesting that this may be an im- with similar roots in order to reduce the terms for a group
portant factor. Other technical terms (mono, distorted, hf, discussion; and Guastavino and Katz [8] reduced elicited
image, wider, and so on) also stand out, as well as some terms to their root forms ("lemmata"), grouped synonyms,
descriptive language (e.g., sounds, better, much). The in- and used a thesaurus to group terms by “semantic themes.”
experienced listener cloud highlights a lot of less specific While both of these methods can successfully reduce terms,
language, more often focussing on describing aspects of the they suffer from some disadvantages. Simply truncating
experimental methodology (words such as sound(ed), head- words and completely rejecting other variants means that
phones, prefer, like, feels, and stimulus); however, some some terms are not carried forward to the group discussion;
descriptive terminology is also present (unidirectional, sur- this reduces the benefit gained by holding a group dis-
round, clearer, and so on). cussion with participants who have listened to the stimuli
and produced the descriptive terms. However, using hu-
3.2 Free Elicitation Conclusions mans to perform a preliminary reduction (possibly with
the aid of a thesaurus) could potentially introduce a source
The free elicitation was designed to create a pool of
of bias and is not objective or repeatable. It was there-
textual responses that described the listeners’ reasons for
fore considered necessary to develop a repeatable algo-
giving particular preference judgments and could be used
rithmic method of grouping the elicited responses, while
as the input to a group discussion stage in order to pro-
still being able to present all of the data to the listen-
duce attribute sets. By asking listeners to write down only
ers so that the automatic grouping could be overridden if
the factors that significantly contributed to their prefer-
desired.
ence judgments, it was hoped that the responses would be
The process of analyzing large sets of textual data to
succinct and avoid many terms that had only a small in-
search for relationships is known as “text mining”; it varies
fluence. However, the results from the first stage suggest
greatly in scope from the simple stemming described above
that this may not have been the case, with 1967 and 2270
to complex algorithms that aim to extract information from
unique phrases written by the experienced and inexperi-
large data sets [25, pp. 1–13]. In this case, a clustering
enced listeners respectively. It was consequently consid-
algorithm based on the similarity of word use between
ered necessary to perform a redundancy reduction stage
each response was implemented in MATLAB. The result-
prior to the group discussion (see Sec. 4). In the exper-
ing clusters—containing elicited phrases that the algorithm
iment upon which this methodology was based [17], the
considered to have similar content—were then presented
group discussion stage took approximately 5 hours to per-
to the participants. This had the effect of greatly reduc-
ing redundancy in the dataset while allowing participants
1
It should be noted that Wordle uses a set of stop words that to potentially identify terms that had been misidentified.
may differ from those used in the clustering procedure described The clustering algorithm is described in more detail in the
in this study. following section.
Fig. 2. Word clouds showing all responses from both groups of listeners. The word size is proportional to frequency of use.
The following procedure was followed (separately for Included words Stop words
1000
the results from each listener group) based on techniques
described by Weiss et al. [25, pp. 114–116]. Prior to the
Frequency of use
800
clustering, the data were preprocessed to remove the exact
duplicate responses described in Sec. 3.1 (the automatic 600
clustering should combine any duplicates, so removing
them at this stage aided in clarifying the output that was 400
lo ls
froke
s
so ral
hah
er
en f hf
im to
s
so of
ds
mo is
in
d
a r
mund
had
v d
widery
be ge
wae
y
me
too
re
b
a
tte
htl
c
un
de
th
ve ee
li
un
tu
likes
un s
ch
so i
ry
pr as
clelittle
of
ds
d
is
to
ty
fro n
in
mu r
me
tha t
m
wad
so it
the
re
b
a
no
are
an
un
de
sti fee
ali
b
mo
un
4.1.2 Dictionary Generation It should be noted that words that are designated as stop
words are not used for determining the cluster into which a
A dictionary was generated after removal of the stop
response should fall; however, they do remain present in the
words. Any words felt to have limited predictive power,
final clustered responses (i.e., the responses are unchanged
but that had not been identified as stop words, were man-
by the procedure).
ually added to the stop word list (52 and 61 words for the
experienced and inexperienced listeners respectively). The
words removed included intensifiers and instrument/sound 4.1.4 Determination of the Number of Clusters
names. The agglomerative clustering algorithm produced the
maximum number of clusters that it was allowed to in
each case. Therefore, it was necessary to determine an ap-
4.1.3 Clustering propriate number of clusters to allow. This number was a
The clustering was performed using the following algo- compromise between too few clusters (which increases the
rithm. chance of obscuring unique attributes) and too many clus-
ters (which would leave the subjects with many clusters to
1) Each word of the responses, the dictionary, and the stop group). The clustering algorithm was run multiple times to
word list was “stemmed”; that is, the words were trun- generate from 1 to 500 clusters (selected as a suitable upper
cated to the first N letters (where N was set at 5 based limit based on the time taken to perform group discussion
upon the value used by Zacharov and Koivuniemi [7] as in a similar experiment [17]), and various statistics that
well as experience with using the clustering algorithm). described the clusters were examined (as well as manual
Stemming words ensures that similar terms are grouped observation of the clusters to check their suitability).
together and also helps to mitigate the effect of spelling To ensure that participants were able to easily classify
mistakes on the clustering (e.g., enveloped, enveloping, the terms in each cluster, it was desirable that each cluster
and envelpoing [sic] would all be coded as envel). contained a reasonable number of responses and that no
2) The dictionary and stop word list were regenerated to cluster contained a very large number of responses. If too
account for any new overlap caused by the stemming few phrases were clustered, the participants would waste
process. time grouping similar responses; if too many phrases were
3) An n-by-m matrix was generated, where n was the num- clustered, it would be time consuming and complicated for
ber of responses and m the number of words in the dic- participants to divide clusters. It was also notable that when
tionary. the clustering algorithm produced a single cluster with a
4) For each row—that is, each response—each cell of the large number of responses, the content was much less fo-
matrix was populated with a one if the dictionary word cused. Therefore, the mean and maximum cluster size was
in that column was contained in the phrase, and a zero if considered. Fig. 4 shows the mean and maximum cluster
it was not. size for 1 to 500 clusters for both groups of listeners. As the
5) Any all-zero rows (i.e., phrases where all of the words number of clusters increases, the mean number of words
had been classified as stop words) were removed from the per cluster drops off quickly; the maximum cluster size
matrix (and therefore these phrases were unclustered). stabilizes at approximately 60 after around 200 clusters.
6) Agglomerative hierarchical clustering was performed on It may also be expected that the clustering is most ef-
the resulting matrix using the “Ward” linkage method. fective (i.e., the created clusters contain separate attributes)
This technique initiates each row of the matrix as an in- when the number of dictionary words within each cluster is
dependent cluster and merges the two clusters that result small. In order to evaluate this, the proportion of each dic-
in the minimum increase in within-cluster variance at tionary word’s contribution to each cluster was evaluated.
each step. The total number of dictionary words T in any particular
attribute labelling, attribute definition, and end-point la- 27 attributes, which are listed alongside their descriptions
belling. and scale end-points in Table 1.
In the grouping stage, the clusters were presented se-
quentially to the group, who were asked to put into sets the 5.1.2 Inexperienced Listeners
clusters that contained terms relating to the same perceptual
The inexperienced listeners spent two sessions perform-
experience. The following wording was used to introduce
ing the grouping task (1 hr 31 mins and 1 hr 26 mins),
the task.
before finishing the grouping and performing the attribute
“The aim of these sessions is to determine the charac-
development in one further extended session (2 hrs 59 mins
teristics (or ‘attributes’) of audio replay that contribute to
plus breaks). The three sessions took a total of 5 hrs 56
listener preference. In the first stage, you each made pref-
mins. The participants converted 244 clusters into 24 at-
erence ratings and gave reasons for your judgments. There
tributes, which are listed alongside their descriptions and
were a large number of reasons collected, and it’s likely
scale end-points in Table 2.
that some actually refer to the same perceptual experience
(even if the word used was not necessarily the same). In
these group discussion sessions, the reasons you gave will 5.2 Attribute Set Comparison
be presented back to you, and your first task is to group The two attribute sets were produced by listeners with
them into sets of terms that describe essentially the same different backgrounds and experience, and it is therefore
experience or reason for preferring different types of audio possible that the two types of listeners find different aspects
replay. Because there was such a large number of terms of the experience to be important when comparing spatial
elicited, they have been pre-clustered into sets using an audio reproduction methods. However, it is also likely that
automatic process designed to attempt to predict similar there is at least some overlap between the two attribute sets.
groups. Because the process is automated, it may not al- To investigate this, participants were asked to compare the
ways get it right, so it might be necessary for you to divide two attribute sets. The participants were each sent a spread-
or join clusters. The clusters have been labeled with up to sheet with the attribute labels and descriptions from the set
three of the most commonly used word stems in that cluster, that their group elicited in rows and the attribute labels and
and there is also a bar chart showing the frequency of word descriptions that the other group elicited in columns (both
use within the cluster. Words that were used to create the lists were randomly ordered, with different orders for each
cluster are shown in bold font.” participant). They were asked individually to complete the
During the grouping, participants were allowed to discard spreadsheet by entering a number from 0–2 into each cell,
clusters if they felt that none of the responses were relevant where 0 indicated that there was no overlap between the two
or if they felt that it was difficult to determine exactly attributes; 1 indicated that there was some overlap between
which group a cluster fell in yet they were confident that the attributes but that they were not identical; and 2 indi-
the percepts referred to had all been covered in other groups. cated that the attributes were identical (they described the
This helped to streamline the process by preventing long, same aspect of the listening experience, even if they used
complicated discussions that would not have uncovered any different words). Participants were asked to make judg-
novel attributes. ments based primarily on the attribute descriptions, using
Following the grouping task, participants were asked to the labels for guidance if necessary. It should be noted that
suggest an attribute label, attribute definition, and scale the inexperienced listeners did not receive any training on
end-point labels for each group. The following wording technical vocabulary before filling out the spreadsheet; they
was used to introduce the task. had the attribute descriptions to refer to and were instructed
“Once all of the terms (or clusters of terms) have been to contact the experimenter if they had any questions (no
put into groups, the next task is to produce: one word or such questions were asked).
short phrase to label that group; a brief description that From the 15 participants, 10 spreadsheets were returned
you could use to explain to someone that hadn’t taken part (6 by experienced listeners, and 4 by inexperienced lis-
in the original experiment what is meant by that group; and teners). Table 3 contains the attributes that were felt to be
a set of scale end-points if that attribute was to be rated on identical (i.e., given a score of 2) by 50% or more of the
a scale.” participants that returned the spreadsheets. The results are
unsurprising given the similarities of the definitions. Look-
5.1 Results ing at the attributes that were felt to be similar but not
identical presents a much more complex picture; only 146
In the following subsections data from the experienced
of the 648 (22.5%) attribute combinations were never con-
and inexperienced listeners are analyzed in turn.
sidered to have some degree of overlap. This points to the
difficulty of using language to describe the characteristics
5.1.1 Experienced Listeners of spatial sound and possibly goes some way towards ex-
The experienced listeners completed the grouping task plaining the difficulty of determining a small set of useful
in two sessions (1 hr 25 mins and 1 hr 43 mins) and the attributes (as discussed in Sec. 1).
attribute development tasks in two further sessions (1 hr 27 The attribute comparison data can also be used to sug-
mins and 1 hr 35 mins). The four sessions took a total of 6 gest attributes that are unique to each set. No single attribute
hrs 10 mins. The participants converted the 228 clusters into from the experienced or inexperienced listener set scored
zero overlap with every other attribute; however, the sum overlap even with the simple descriptions suggests that dif-
(over participants and attributes) of similarity values for ferent attributes are important to the two listener groups.
each attribute was calculated and the attributes put into This is supported by similar findings in literature; for ex-
rank order. The attributes that scored lower than the lowest ample, Rumsey et al. [6] found that frontal spatial fidelity
scoring attribute from Table 3 (overall level and echo for the was more important to experienced listeners than inexperi-
experienced and inexperienced listeners respectively) were enced listeners, and that surround spatial fidelity was more
audibility of compression, dynamic range, spatial move- important to inexperienced listeners than experienced lis-
ment, bandwidth, overall spectral balance, phasiness, sub- teners.
jective quality of reverb, horizontal width, spatial natural-
ness, and amount of distortion for the experienced listeners,
and can’t hear difference, treble, emotional reaction, bass, 5.3 Group Discussion Conclusions
and headphones for the inexperienced listeners. The unique In the group discussions, a total of 51 attributes were
experienced listener attributes are mainly technical terms, developed: 27 for the experienced listeners and 24 for the
while the unique inexperienced listener attributes are either inexperienced listeners. There is undoubtedly some over-
broad descriptive categories or, potentially, sub-attributes lap between the two attribute sets, as highlighted by the
of others in the experienced listener set. results of the comparison analysis. The attributes produced
With the available data, it is not possible to say defini- are similar to attributes seen in previous studies, with some
tively whether the differences in listener groups arose due to exceptions that are likely due to the more extended range
differences in perception, preference, or vocabulary. How- of reproduction methods and program material items in-
ever, the fact that some attributes appeared to have little cluded as well as the use of experienced and inexperienced
Table 3. Attributes that were felt to be identical by 50% of under the category realism. Others relate to aspects of the
participants or more. stimulus such as the position of sounds or the presence of
particular technical elements. Finally, some attributes relate
Attribute to aspects of the overall listening experience; for example,
Experienced Inexperienced Pct. (%)
ease of listening from the inexperienced listener attributes.
Physical sensation Physical reaction 100 The participants reported that the task was understand-
Overall level Volume 100 able and achievable although difficult to complete. Rea-
Realism Realism 100 sons for this included the large number of terms (even after
Perceived transducer quality Output quality 100
Enveloping Immersion 90 clustering) and the high cognitive load involved, resulting
Depth of field Distance 80 in participants becoming fatigued. Consequently, the pro-
Spatial balance Spatial balance 70 cess was time-consuming (approximately six hours for each
Ensemble balance Prominence 60 group).
Spatial clarity Discernibility 50
Level of reverb Echo 50
6 DISCUSSION AND CONCLUSIONS
terms rather than repeating themselves, regardless of the particular application and to steer future system develop-
instructions given. Using a multiple stimulus presentation ment towards a perceptually optimal listening experience.
when conducting a free elicitation (as in Francombe et al. The aim of the work documented in this paper was to deter-
[17]) may help to mitigate this. However, the methodology mine the attributes that differentiate spatial audio systems
was successful at producing text data that could feed into in terms of listener preference.
a group discussion and also facilitated detailed analysis of A literature review revealed a number of limitations in
the written responses alongside the preference data. Word the existing work: (i) spatial and timbral attributes were not
clouds of the free elicitation responses suggested differ- often considered together; (ii) most studies included a lim-
ences between the types of language used by experienced ited range of reproduction methods, often without emerging
and inexperienced listeners; the experienced listeners were channel-based methods such as 9.1 and 22.2; (iii) prefer-
more technical while the inexperienced listeners were more ence was often evaluated by inexperienced listeners, while
descriptive. The term “enveloping” stood out as being very attributes were elicited by experienced listeners; and (iv)
frequently used by experienced listeners. the resulting attribute lists were complex and not necessar-
ily relevant to listener preference. Consequently, an experi-
6.2 Stage Two: Automatic Clustering ment was designed in which trained and untrained listeners
The automatic clustering stage was required to reduce auditioned program material embodying a wide range of
the vast text dataset collected in the first stage in order that both timbral and spatial characteristics, reproduced over a
group discussions could reasonably be performed. Simple range of current reproduction systems (from low-quality
reduction strategies have been used in the past; however, mono to surround sound systems with height loudspeak-
the authors are not aware of such a clustering method be- ers, and including headphones), and gave the reasons for
ing employed in an audio attribute elicitation experiment their preferences for one system or another. An automatic
and feel that the method possesses significant advantages. clustering procedure was followed by group discussions to
The method gave a large reduction of redundancy in the generate clear, concise attribute labels, descriptions, and
data (approximately 90%), providing time savings that al- scale end-points.
lowed the group discussions to be manageable. Clustering The perceptual differences between reproduction meth-
redundant terms rather than merging or removing them also ods are described by the two sets of attributes in Tables
enabled participants to access all of the data and make their 1 and 2. There is a degree of agreement between experi-
own decisions as to any responses that belonged in differ- enced and inexperienced listeners (described in Sec. 5.2)
ent groups. Using such an algorithm also produces results but some attributes were unique to the experienced listen-
that are objective and repeatable. Any use of an automatic ers (mainly technical terms) and others were unique to the
clustering method can result in the loss of some nuance in inexperienced listeners (broad descriptive categories and
the text data, but this was a necessary trade-off in order to possible sub-attributes of those in the experienced listener
allow the participants in the group discussion to perform the set).
task without becoming fatigued or bored. The method ap- Further work will provide a fuller understanding of how
peared to work well; the group discussions were performed the elicited attributes correspond to the preference ratings
in a comparable time to those reported by Francombe et that were collected simultaneously. Future work might also
al. [17] even given the higher number of responses (before focus on determining correlations between physical proper-
clustering). ties of the reproduced sound and the perceptual attributes,
enabling the creation of perceptual models. Such models
are beneficial when testing new or existing systems in or-
6.3 Stage Three: Group Discussion
der to ensure that the systems are optimized to produce the
In the group discussion, a total of 51 attributes were best possible listening experience. Predictive models can
developed. It is likely that participants can differentiate be- also be used in adaptive systems so that parameters can be
tween stimuli on all of the attributes that were produced, but adjusted in real-time to ensure an optimal listening expe-
it is still of interest to determine those that contribute most rience. Identifying the important perceptual characteristics
to listener preference. The attributes that were produced in of spatial audio reproduction is important as it enables de-
this study are in many cases similar to those produced in pre- velopment of new systems (or improvement to existing sys-
vious elicitation experiments, but some stand out as being tems) to focus on improving the characteristics that matter
notably different; for example, spatial movement, audibil- most to listeners.
ity of dynamic compression, physical sensation, and depth
of field from the experienced listeners’ attributes, and tar-
geting, distance, ease of listening, odd sounds, emotional 7 ACKNOWLEDGMENTS
response, headphones, and physical reaction from the in- This work was supported by the EPSRC program
experienced listeners’ attributes. Grant S3A: Future Spatial Audio for an Immersive
Listener Experience at Home (EP/L000539/1) and the
6.4 Conclusions and Future Work BBC as part of the BBC Audio Research Partner-
Many spatial audio reproduction systems are currently ship. Details about the data underlying this work, along
in use. Understanding why listeners prefer some systems with the terms for data access, are available from
to others could help to determine the best system for any https://doi.org/10.15126/surreydata.00808897.
THE AUTHORS
Jon Francombe graduated with a first-class honors de- where he is now senior lecturer in audio and director of
gree in music and sound recording (Tonmeister) from the research. His teaching focuses on acoustics and psychoa-
University of Surrey, Guildford, UK, in 2010, and received coustics and his research is in psychoacoustic engineering:
a Ph.D. in perceptual audio quality evaluation from the measuring, modeling, and exploiting the relationships be-
same institution in 2014. His Ph.D. research investigated tween the physical characteristics of sound and its percep-
the experience of a listener in an audio-on-audio inter- tion by human listeners.
ference situation. He is currently working as a research •
fellow on the EPSRC-funded “S3A: Future Spatial Audio” Russell Mason graduated from the University of Sur-
project, investigating the perceptual attributes of spatial au- rey in 1998 with a B.Mus. in music and sound recording
dio reproduction. He has also worked as a music technician, (Tonmeister). He was awarded a Ph.D. in audio engineer-
freelance musician, and sound engineer. ing and psychoacoustics from the University of Surrey
• in 2002 and was subsequently employed as a Research
Tim Brookes received the B.Sc. degree in mathematics Fellow. He is currently a senior lecturer in the Institute
and the M.Sc. and D.Phil. degrees in music technology of Sound Recording, University of Surrey, and is pro-
from the University of York, York, UK, in 1990, 1992, gram director of the undergraduate Tonmeister program.
and 1997, respectively. He was employed as a software Russell’s research interests are focused on psychoacous-
engineer, recording engineer, and research associate be- tic engineering, including the development of methods for
fore joining, in 1997, the academic staff at the Institute subjective evaluation and modelling aspects of auditory
of Sound Recording, University of Surrey, Guildford, UK, perception.