
A Study in Perceived Believability of Computer Generated Imagery

Utilizing the Limitations of the Human Visual System

Søren Thinggaard Andersen
Carla Cecilie Lind-Valdan
Anders Lumbye
Louise Blom Pedersen

2011 Medialogy, Aalborg University Copenhagen

ABSTRACT
The limitations of the human visual system in scotopic, mesopic, and photopic vision were utilized in order to present a computer generated imagery (CGI) sequence with a low level of detail (LoD) that appears as believable as a high LoD version of the same sequence. To imitate the three types of vision, color grading was applied together with different levels of detail added to the CGI. A quantitative method in the form of a questionnaire was used, where the answers were interpreted and evaluated as a believability response rating (Ḃ), which was correlated with a subjective quality assessment. Furthermore, a dataset containing the answers to the explicitly asked binary question "is it believable" was collected. The results showed a significant difference in believability between the graded photopic and the graded scotopic/mesopic sequences, but no significant difference in perceived believability between the LoDs. There was a strong correlation between the Ḃ rating and the subjective quality assessment, suggesting the metric measures the perceived quality of different LoDs. The dataset containing the binary responses showed no significant differences between any of the possible combinations, suggesting ambiguous interpretations of the question asked.

Keywords: Human Visual System, believability, level of detail, Computer Generated Imagery, color grading,
visual effects.

1 INTRODUCTION
Alex Alvarez (http://www.alexalvarez.com/bio.html), who in 2010 held an online Master Class on the development of a digital creature from concept art to final animation and compositing, is one of many 3D artists who spend hours working on small details and do not stop until reaching perfection. On March 14th 2010, Alvarez posted a video from his Master Class on Vimeo (www.vimeo.com) showing the different steps of the development [Alvarez, 2010]. The video shows how much time was spent on details which, in the end result, were barely noticeable due to the dimly lit conditions.
This evoked our interest in whether it is possible to make a less detailed 3D CGI element (from this point referred to as CGI) look as convincing as a fully detailed CGI in a color graded and dimly lit scene.
Composing a less detailed version of Alvarez' scene required an understanding of the human visual system and its limitations, in relation to using color grading to imitate photopic, mesopic and scotopic vision.
There are many opinions as to what color grading is, and it goes by many names, including color correction, grading, and color grading [Ganbar, 2011, p. 97] [Hurkman, 2011, p. ix]. Color grading is a postproduction technique that consists of two main elements: color correction and grading. The colorist color-corrects the image or sequence to bring it to a neutral state and then uses grading to achieve the desired look of the scene and to make certain objects stand out [Øhlenschlæger and Jónsson, 2011, 00.13.49-00.14.38].
Next, we investigated whether increasing the level of detail (from this point referred to as LoD) in a 3D CGI element would affect the perceived believability, and whether the perceived degree of detail in a CGI would correlate with the color grading applied to the scene. Related research regarding realism and believability was reviewed in order to establish how the two terms are used in this project.
We proposed an optimization of a visual effects workflow that utilizes the human visual system together with color grading to make a low detailed scene appear as believable as a high detailed scene.

2 PREVIOUS RESEARCH
Optimizing CGI is not a new field; research has been done, primarily within the field of real-time CGI, into developing new algorithms that avoid unnecessary rendering of polygons [Myszkowski, 2002], and into progressive simplification of CG model meshes [Reddy, 1997] [Williams, 2003]. Commonly, the theory of the human visual system (from this point referred to as HVS) is used in the development of such algorithms. Humans are very sensitive to contrast and to the density of detail. This perception of fine detail can be described as visual acuity [Wolfe, 2009, p. 40], measured in cycles/degree (cy/de) in grating patterns. In the HVS, a contrast sensitivity function describes how spatial frequency affects the overall perception of contrast in a given grating pattern, ranging from low (0.1 cy/de) to high spacing (100 cy/de), with the peak sensitivity to fine detail around 3-4 cy/de [Wolfe, 2009, p. 56]. This can be exploited when CGI elements differ in distance from the viewer, due to the relationship between acuity and contrast sensitivity [Reddy, 1997, p. 27]. Ferwerda et al. describe a technique for visually masking faceting, aliasing, noise and tessellation by means of an algorithmic procedure, using the HVS as a guideline for when the different elements are no longer perceivable. The algorithm was tested with a detection model finding the relative change in artifacts between the levels of detail in the different parameters. Their findings show that increasing texture contrast and tessellation effectively improves the realism of a given CGI [Ferwerda, 1997].
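To make the role of a contrast sensitivity function concrete, the following minimal sketch uses the Mannos-Sakrison analytic CSF as a stand-in (it is not the model used in the papers cited above, and its peak lies somewhat higher than the 3-4 cy/de cited from Wolfe) to estimate whether detail at a given spatial frequency and contrast should still be visible; the 0.5% peak-contrast threshold is an assumed constant.

    import math

    def csf(f_cyc_per_deg):
        """Normalized contrast sensitivity at spatial frequency f (cycles/degree),
        using the Mannos-Sakrison analytic approximation."""
        f = f_cyc_per_deg
        return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))

    def detail_is_visible(f_cyc_per_deg, contrast, peak_threshold=0.005):
        """Crude visibility test: detail is visible if its contrast exceeds the
        contrast threshold implied by the sensitivity at that frequency."""
        return contrast > peak_threshold / csf(f_cyc_per_deg)

    # Low-contrast detail that is visible at 4 cy/de may vanish at 40 cy/de,
    # which is what perceptually guided LoD algorithms exploit.
    print(detail_is_visible(4.0, 0.02), detail_is_visible(40.0, 0.02))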
An important aspect of the HVS is how the rods and cones behave during shifts in lighting conditions (rods are responsive during low illumination and do not mediate color; cones are responsive during high illumination and do mediate color [Wolfe, 2009, p. 36]). This is known as scotopic and photopic vision [Smith Kinney, 1957], or, more commonly, night vision and day vision, respectively. In the HVS, certain elements change when the light intensity changes; the rods and cones are sensitive at different luminance levels [Ferwerda, 1995, p. 7]. In scotopic vision, the rods are sensitive to photons hitting the retinal area from 10^-6 to 10 cd/m², whereas the cones are sensitive from 0.01 to 10^8 cd/m² [Ferwerda, 1995, p. 2]. The overlap from 0.01 to 10 cd/m² is called the mesopic range, where both rods and cones are sensitive and active. In the mesopic range certain colors are not perceivable (dependent on the illumination); reddish colors tend to be visible, whereas shorter wavelength colors like blue and purple are not [Ferwerda, 1995, p. 4].
In short, photopic vision perceives sharp details and colors, while in scotopic vision there are no colors and contrast sensitivity is low. In mesopic vision, which is a combination of the two, colors are less saturated and blurred [Ferwerda, 1995, p. 9].
Another important element that changes as light varies is acuity, a measure of the visual system's ability to resolve spatial detail [Ferwerda, 1995, p. 5]. Generally speaking, the lower the luminance, the lower the acuity expected. The relationship is not directly linear but assumes a sigmoid (S-shaped) curve [Hecht, 1927, p. 256]. Figure 1 shows luminance levels from photopic to scotopic vision against the corresponding highest resolvable spatial frequency in grating patterns.

Figure 1: Luminance level (log cd/m²) correlated to the highest resolvable spatial frequency (cycles/degree) [Ferwerda, 1995, p. 5]
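A minimal sketch of the ranges just described (the boundaries are taken directly from the Ferwerda values; real adaptation is continuous, so the hard cut-offs are a simplification):

    def vision_regime(luminance_cd_m2):
        """Classify a luminance level into the regimes described by
        [Ferwerda, 1995]: rods only below 0.01 cd/m2, rods and cones in the
        0.01-10 cd/m2 overlap, cones only above 10 cd/m2."""
        if luminance_cd_m2 < 0.01:
            return "scotopic"
        if luminance_cd_m2 <= 10:
            return "mesopic"
        return "photopic"

    # A starlit night, dusk, and a typical office, respectively.
    print([vision_regime(lum) for lum in (0.001, 1.0, 300.0)])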
Much has been done within the field of perceptually guided rendering using this knowledge of the HVS. For instance, McNamara used different lighting levels on simulated imagery to test the subjective perception of realism. Her results showed that even a low quality CGI render was still perceived as real under low lighting conditions [McNamara, 2006, p. 237]. Another example is a perceptual study by Rademacher et al., who explored different aspects of CGI rendering in the context of perceived realism. Key elements such as the number of lights used, surface smoothness and shadow softness were explored, revealing where the boundary between realistic and non-realistic rendering appeared. An interesting aspect of this study was the method: the users were told that both CGI and real photography were present in a forthcoming slideshow, yet no CGI was present [Rademacher, 2001, p. 3]. This shows the significance of manipulating the users to be active assessors. As the definition of realism in an image is not clear, Rademacher et al. stressed not to inform subjects what kind of realism to look for, as this would cause a bias towards the type of realism measured [Rademacher, 2001, p. 3].
Comparing CGI and the correctness of such is difficult. Ulbricht et al. present three components for the validation of physically based rendering algorithms: (1) the correctness of the light reflection model; (2) verifying the light transport model; and (3) generating an image suitable for display [Ulbricht, 2005, p. 1]. Ulbricht focused on verifying the light transport model, whereas in this project we focused on the first component as a subjective metric instead of an algorithmic approach, as the interest of this project lies in the limitations of the HVS under different light conditions.

2.1 BELIEVABILITY
Within the field of perceptually guided rendering algorithms, the objective is often to compare CGI and live-action footage in terms of numerical metrics, where changes between rendering iterations are measured as the relative change of LoD, as in [Williams, 2003] and [Ferwerda, 1997]. However, whether or not the CGIs are believable had not yet been thoroughly investigated. In this paper, we focus on the subjective term believability, effectively omitting computer-evaluated algorithms as a means of measurement.
As opposed to realism in graphics, the believability of a graphic element is a less defined and standardized metric. Previous attempts to measure such a metric have been made within the field of, e.g., AI behavior. Mac Namee used a comparison model for assessing two different implementations of proactive AI characters: he simply asked which of the two implementations the viewer found most believable, and followed up by asking which differences, if any, the viewer had noticed [Mac Namee, 2004]. A similar approach could be used when evaluating different levels of detail in CGI. Magerko used a metric approach to measure "dramatic believability", where several factors affect the believability of an AI agent in computer games. Magerko argued that the believability of AI consists of what is expected of the agent and the user's own expectation of some measure of reality [Magerko, 2007, p. 81]. Magerko's methodology for measuring believability was an experimental method with ambiguous results; thus the experiment was prone to bias. However, an important aspect of the research was the subjective metric proposed as an aspect of the term believability, e.g. how can an agent be believable if you cannot engage in a conversation with it, as in the real world? This suggests that viewers should have a comparison metric, e.g. how lifelike is the model, or how naturally does the model interact with the environment?
It is also worth mentioning that characters are not limited to humanoids, but can in fact be any creature [Livingstone, 2006, p. 7]. However, the creature has to demonstrate real life behavior.
3 METHOD
3.1 VARIABLES
The theories regarding the HVS in relation to scotopic, mesopic and photopic vision were used when considering the color grading and the size of and distance to the CGI. The LoD was based on the previously mentioned Alvarez video, although three types of LoD were created in order to build an experiment for the hypothesis. Based on Magerko's theories, believability was considered to be whether or not the CGI belongs to the scene. From the motivation and previous research, three variables arose for this experiment: the believability of the CGI, the LoD in the CGI, and the color grading to make up for lower detail in the CGI; i.e., we want to test the effect of color grading on the believability of a CGI across three levels of detail.
When talking about the believability of a CGI in a sequence, it is important to take into account that viewers have expectations [Magerko, 2007, p. 1] regarding the believability of the CGI, due to state-of-the-art films such as Iron Man 2 and Transformers: Dark of the Moon. Because, no matter how you define it, believability is a subjective matter, it needs to be measured implicitly [Livingstone, 2006, p. 4] [Mac Namee, 2004, p. 100]; this means that subjects are not asked directly whether the CGI is believable, but that the believability measure is hidden in questions that comprise believability.
To accommodate these expectations, we developed five statements that might help measure the believability of CGI in film (see Appendix A):
1. Does it interact with the scene?
2. Does it have sufficient detail compared to the surroundings?
3. Does the robot's appearance look like it was really there?
4. Is the robot integrated in the scene?
5. Does the robot fit visually in the scene?
The questions were phrased specifically to cover more than one aspect. The 3rd question was designed to cover movement, texture and colors through the word "appearance". Likewise, in the 4th question, aspects such as shadows, perspective and compositing were covered by the word "integrated".
The operational definition of the dependent variable believability in this project concerns whether or not the CGI looks as if it belongs in the context of the scene. It is measured as a subjective metric with implicit questions regarding the behavior of the model and its interaction with the (real) environment, resulting in an average believability response rating Ḃ; i.e., a viewer who generally rates the implicit questions low produces a low response rating.
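As a concrete illustration with hypothetical answers, Ḃ for a single viewing is simply the mean of the five Likert responses:

    import numpy as np

    # One subject's ratings of the five believability statements (1-5 Likert);
    # the values here are hypothetical.
    answers = np.array([4, 3, 4, 2, 3])
    b_rating = answers.mean()
    print(b_rating)  # 3.2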
The LoD works as an independent variable affecting the believability of the CGI, and the experiment presented later is designed to investigate whether this claim holds true.
What we find interesting is whether the independent variable grading can make up for a lack of detail. The grading performed on the scenes is held up against the theories regarding photopic, mesopic, and scotopic vision, meaning that through grading we strive to create scenes resembling the three types of vision.

3.2 HYPOTHESIS ONE
In order to investigate whether the grading can make up for the lack of detail, it was important to first investigate whether the lack of detail really has an impact on the believability of the CGI, which led us to the first hypothesis:

An increase in the LoD of a given CGI affects the perceived believability of said element, resulting in a corresponding increase in the perceived believability of the given sequence.

For this purpose we developed a short sequence featuring a CGI resembling a robot insect, which scrambles for spare metal. The viewer sees this through the camera as a point of view (PoV) shot. When the robot sees the viewer, it gets scared and tries to escape by blinding the viewer.

3.3 IMPLEMENTATION
For the LoD, we were inspired by Alvarez' workflow, and to some extent we used his creature as a reference for our LoD. We divided the rendering of our CGI into render passes (including Z-depth, shadows, diffuse, specular, ambient occlusion, color, displacement and normal maps, and alpha matte [Kerlow, 2009, p. 256]). This gave us the opportunity to enable and disable the amount of detail we needed for each sequence in compositing.
The three sequences each had a different set of detail. The low detailed sequence consisted of a diffuse shader, namely the Lambert [Kerlow, 2009, p. 252], combined with a shadow pass. In the medium detailed sequence, we used the mia_material_x_passes shader (a mental ray shader [Autodesk Maya, 2010]) with the diffuse, shadow and reflection passes enabled. In the high detailed sequence, all render passes were enabled, and additional detailing such as textures, final gather, an area light and image based lighting (IBL) was added using an HDRI image [Kerlow, 2009, p. 185] from the scene. All of the render passes were rendered using mental ray [Kerlow, 2009, p. 27].
The live-action footage was shot with a Canon EOS 550D camera with an 18-55mm IS lens, recording at 1920x1080 resolution and 23.97 FPS. The focal length was 18mm, effectively giving a film back width of 22.3x12.54mm. The aperture was set at f/3.5, allowing more light to hit the sensor, which in turn allowed us to set the ISO at 200, limiting unwanted noise. When match-moving live-action footage, an often problematic issue is motion blur caused by heavy movement; this was to an extent avoided by using a fast shutter speed of 1/120s. All of the above-mentioned data was entered into Adobe Matchmover to get as accurate a track as possible. This made it possible to edit the footage and make a day-to-night conversion for the hypothesis two experiment.
The scene was live-action footage filmed on a cloudy day in a backyard in Copenhagen. In this scene, the CGI was match-moved and animated. The CGI represents a rusty old robot created, textured and animated in Autodesk Maya.
To make sure our robot was at the right distance from the camera, we investigated the HVS further. A typical computer screen emits 150-400 cd/m² [ScreenTek, 2003], resulting in an average pupil size of 6mm to 5.2mm in the age group 20-40 years old [Winn, 1994, p. 1135]. At this pupil size, the acuity is 2.7 minutes of arc [Hartridge, 1947, p. 532]. The model chosen for this purpose was a robot consisting of both low and high amounts of grating levels. We assumed a 4 cy/de spatial frequency for the model, which is where the human eye is most responsive to contrast [Wolfe, 2009, p. 56]. At this level of acuity, the maximum distance from object to eye (or camera in this case) at which fine details are perceivable follows from the angular size relation [1728 Software Systems, 2011]:

arctan(detail size / distance from object) = 2.7 minutes of arc

evaluated for detail at the assumed 4 cycles/degree.
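A minimal sketch of that calculation, solving the relation above for the distance (the 2 mm feature size is a hypothetical example, not a measurement from the robot model):

    import math

    def max_resolvable_distance(feature_size_m, acuity_arcmin=2.7):
        """Distance beyond which a feature of the given size subtends less
        than the acuity limit: rearranges arctan(size / distance) = acuity
        into distance = size / tan(acuity)."""
        acuity_rad = math.radians(acuity_arcmin / 60.0)
        return feature_size_m / math.tan(acuity_rad)

    # A hypothetical 2 mm surface detail stops being resolvable beyond ~2.5 m.
    print(round(max_resolvable_distance(0.002), 2))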

Beyond this distance there should be no perceivable
difference in acuity. This distance calculation did not take
into account the focal length and any chromatic aberration
in the camera lens, meaning the distance was used as a
guideline rather than an absolute threshold.
After match moving and animating the robot, taking care not to violate the guidelines above, we color corrected it to match the tone of the scene.

3.4 EXPERIMENT DESIGN
For the experiment concerning hypothesis one, we presented three sequences of raw footage with an animated and match-moved CGI, color corrected to match the scene's composition, where the only difference was the LoD. The independent variable "LoD" would only have needed two levels of detail, high and low, to be sufficient for this experiment. However, with only two levels, an independent variable provides less, and possibly incorrect, information about its relationship to the dependent variable "believability" [Cozby, 1997, p. 147]. Therefore, the independent variable "LoD" was designed to have three levels: low, medium, and high.
The dependent variable "believability" was operationally defined as whether or not the CGI looked as if it belonged in the context of the scene, and was measured as a subjective metric. Since believability is a subjective matter [Mac Namee, 2004, p. 100], the subjects were asked to rate the five statements previously found to contribute to the believability of a CGI in film on a 5-point Likert scale, with 1 = Strongly Disagree, 5 = Strongly Agree, and a middle 3 = Undecided. The average score of the five statements resulted in a believability response rating (Ḃ).
Believability was in this context also distinguished in two levels, believable or not, to get an indication of self-reported believability perception. Whether the CGI was believable was asked explicitly and answered with "yes" or "no".
The three levels of detail and the perceived believability were evaluated within subjects, meaning each subject experienced all three levels of detail. A within subjects design carries the risk of carryover effects; to avoid this, we counterbalanced by combining two 3x3 Latin squares [Cozby, 1996, p. 309], ending up with a 3x6 matrix such that each condition appeared once in each position. The subjects were presented with one of the six different orders, and all orders were viewed an equal number of times; the counterbalancing is sketched below.
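A minimal sketch of this counterbalancing (the condition labels are ours for illustration; the actual assignment is in Appendix E):

    CONDITIONS = ["low", "medium", "high"]

    def counterbalanced_orders(conditions):
        """Combine a 3x3 Latin square with its mirror image, yielding all six
        presentation orders; each condition appears in each serial position
        an equal number of times across the six orders."""
        n = len(conditions)
        square = [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]
        mirror = [row[::-1] for row in square]
        return square + mirror

    for i, order in enumerate(counterbalanced_orders(CONDITIONS), start=1):
        print(i, order)  # subjects are assigned to the six orders in rotation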
The three LoDs can be seen in Picture 1.

Picture 1: The visual stimuli consisting of the three LoDs presented to the subjects

3.4.1 PROCEDURE
The visual stimulus was presented on an Asus N53 laptop, and the sequences were shown in their original resolution, 1280x720. Brightness and contrast settings were identical for all subjects.
After each sequence, the subjects were asked to evaluate the five statements on the previously described 5-point Likert scale; helping text was added to aid the subjects in understanding the questions. Next, they were asked to rate the sequence they had just seen according to its level of quality on a 10-point scale, where 1 corresponded to "amateur quality", i.e. made by a user with little filmmaking experience, and 10 to Hollywood production quality, comparable to a blockbuster (see Appendix B for the full questionnaire). Lastly, they were presented with a definition of believability and asked to evaluate whether or not they found the CGI in the sequence they had just seen believable: yes or no.

3.4.2 SUBJECTS
The subjects taking part in this experiment were 30 adults (15 male and 15 female, with a mean age of 23.8 years (±3.09)) with self-reported normal or corrected-to-normal sight.
We conducted the experiment in a room in The Black Diamond (Den Sorte Diamant), a part of the Danish Royal Library, using convenience sampling. The subjects were offered cake in exchange for their participation.

4 RESULTS
The null hypothesis for this experiment was based on the first hypothesis, which stated that there is a measurable difference in the perceived believability of three sequences with differing LoD. The null hypothesis was: "There is no difference in perceived believability between the three sequences".

4.1 BELIEVABILITY RESPONSE RATING
Each subject rated the five categories that together result in a believability response rating (Ḃ). The viewers could rate strongly disagree, disagree, undecided, agree or strongly agree, resulting in a value of 1, 2, 3, 4 or 5, respectively. The sequence with high detail had an average Ḃ of 3.08 (±0.72), the medium detail an average Ḃ of 3.12 (±0.71), and the low detailed version an average Ḃ of 2.96 (±0.74).
The self-reported quality assessment of the high detailed model was on average 5.43 (±1.57), the average for the medium detailed model was 5.1 (±1.65), and the average quality assessment for the low detailed model was 4.93 (±1.64). See Appendix C for further graphs visualizing the results.

We used the found Ḃ value for each LoD to correlate with the subjective quality assessment data. Using the Pearson correlation coefficient, we measured the high detail Ḃ to have a strongly correlated r-value, r(28) = .60, p < 0.01; the medium detail a strongly correlated r-value, r(28) = .49, p < 0.01; and the low detailed sequence a likewise strong correlation, r(28) = .50, p < 0.01. This suggested a strong correlation between the measured construct Ḃ and the subjective quality assessment for all the sequences [Cozby, 1997, pp. 295 and 307].
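A hedged sketch of how such a correlation can be computed; the data below are simulated stand-ins, not the study's responses:

    import numpy as np
    from scipy.stats import pearsonr

    # Simulated per-subject data for one LoD condition (n = 30, so df = 28):
    # Ḃ ratings (mean of five Likert items) and 10-point quality scores.
    rng = np.random.default_rng(0)
    b_ratings = rng.uniform(1, 5, size=30)
    quality = 2 * b_ratings + rng.normal(0, 1.5, size=30)  # correlated toy scores

    r, p = pearsonr(b_ratings, quality)
    print(f"r(28) = {r:.2f}, p = {p:.4f}")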
on the believability of the CGI had been tested previously, it
4.2 ANALYSIS OF THE BINARY METRIC
The viewers were furthermore prompted to explicitly rate the perceived believability of the sequence they had just seen. This resulted in a binary rating for each sequence: the high detailed sequence was rated 53% believable, the medium detailed sequence 60% believable, and the low detailed sequence 40% believable (30 answers for each sequence).
To check for a significant difference in the believability metric, a Cochran's Q test was conducted; the results showed no significant difference between the groups (Q = 3.500, df = 2, Q critical value = 5.991, p = 0.174, α = .05).
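Cochran's Q for k related binary samples can be computed directly from its definition; a minimal sketch with hypothetical yes/no answers:

    import numpy as np
    from scipy.stats import chi2

    def cochrans_q(x):
        """Cochran's Q test for k related binary samples.
        x: (subjects, conditions) array of 0/1 answers; Q is approximately
        chi-square distributed with k - 1 degrees of freedom."""
        x = np.asarray(x)
        k = x.shape[1]       # number of conditions (here: 3 LoDs)
        col = x.sum(axis=0)  # "yes" answers per condition
        row = x.sum(axis=1)  # "yes" answers per subject
        t = x.sum()
        q = (k - 1) * (k * (col ** 2).sum() - t ** 2) / (k * t - (row ** 2).sum())
        p = chi2.sf(q, df=k - 1)
        return q, p

    # Hypothetical believability answers, one row per subject.
    answers = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 1],
                        [0, 0, 0], [1, 0, 1], [0, 1, 0]])
    print(cochrans_q(answers))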
The dependent variable “believability” had not changed
4.3 ANOVA
By conducting a one-way ANOVA on the means of Ḃ, we checked whether the three groups were significantly different. The ANOVA gave an F-value of .26 with df = (2, 87). Using the critical value table with these df and α = 0.05, we obtained a critical F-value of 3.10. The found F-value of .26 is lower than the critical value, and thus we could not reject the null hypothesis.
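For reference, the same test can be reproduced in code; the groups below are simulated from the reported means and standard deviations, so the exact F-value will differ:

    import numpy as np
    from scipy.stats import f_oneway

    # Simulated Ḃ ratings per LoD group: 30 subjects each, giving df = (2, 87).
    rng = np.random.default_rng(1)
    low = rng.normal(2.96, 0.74, 30)
    medium = rng.normal(3.12, 0.71, 30)
    high = rng.normal(3.08, 0.72, 30)

    f, p = f_oneway(low, medium, high)
    print(f"F(2, 87) = {f:.2f}, p = {p:.3f}")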
4.4 CONCLUSION
The results from the first hypothesis were somewhat ambiguous. We found that even though the LoD of the CGI was dramatically changed, there seemed to be no statistical difference in the perception of believability. However, there was an indication that the explicitly asked "Is it believable" metric was actually working as intended.
grading. We had a total of 18 different orders which were
5 HYPOTHESIS TWO
With the ambiguous results and no clear difference between the groups, we conducted a similar experiment in which the change was based on the limitations of the HVS: each sequence was graded to simulate either a scotopic, mesopic or photopic environment. Due to the limitations of the HVS, we believed there would be no change in perceived believability amongst the three sequences with different LoDs:

HVS-imitated color grading nullifies the perception of LoDs, resulting in equally perceived believability across different LoDs.

6 HVS-IMITATED COLOR GRADING
The sequences were visually altered in coherence with the limitations of the HVS. All the sequences were based on the same footage. The scotopic sequence had an approximated luminance level of -2.5 log cd/m², with a blur filter limiting the highest perceivable spatial frequency to 4 cy/de; furthermore, saturation was lowered to a level where no colors were perceivable. The mesopic sequence had an approximated luminance level of -0.5 log cd/m², with a blur filter limiting the highest perceivable spatial frequency to 25 cy/de; saturation was lowered to a level where only colors above 575 nm were slightly perceivable [Ferwerda, 1995, p. 4]. The photopic sequence had no change in luminance, acuity or color clarity.
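A minimal sketch of this kind of grade (our own illustration, not the exact grade applied to the sequences; mapping the blur radius to a cy/de cutoff depends on viewing distance and screen resolution, and all parameter values here are assumptions):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def grade(rgb, exposure, desaturation, blur_sigma):
        """Approximate an HVS-imitating grade: darken, desaturate towards a
        Rec.601 luminance image, and low-pass to cut high spatial frequencies.
        rgb is a float array in [0, 1] with shape (H, W, 3)."""
        luma = rgb @ np.array([0.299, 0.587, 0.114])
        out = (1.0 - desaturation) * rgb + desaturation * luma[..., None]
        out = out * exposure
        out = gaussian_filter(out, sigma=(blur_sigma, blur_sigma, 0))
        return np.clip(out, 0.0, 1.0)

    # Illustrative settings: full desaturation and a strong blur for the
    # scotopic look, partial desaturation and a light blur for the mesopic one.
    frame = np.random.random((720, 1280, 3))
    scotopic = grade(frame, exposure=0.1, desaturation=1.0, blur_sigma=4.0)
    mesopic = grade(frame, exposure=0.4, desaturation=0.6, blur_sigma=1.5)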

7 EXPERIMENT DESIGN
For the experiment concerning hypothesis two, we presented the sequences from hypothesis one, although now with three levels of grading intended to make up for the lack of detail in the CGI.
The LoD was the same as when testing hypothesis one: low, medium, and high. Even though the LoD and its effect on the believability of the CGI had been tested previously, it was important to include all levels of detail and grading in this experiment to see whether there was in fact a connection between the LoD and the grading.
Just like the LoD, the level of grading was an independent variable. There were three levels of grading, corresponding to the three types of vision: scotopic, mesopic, and photopic. Audio was implemented in coherence with the results from [Andersen, 2011], which showed that sound effects and ambience were the most appropriate soundscape for the scene. The visual stimulus was therefore supported by the auditory stimulus from [Andersen, 2011].
The dependent variable "believability" had not changed in definition and was still distinguished in only two levels, believable or not, supported by Ḃ measured on five parameters assessed on the same 5-point Likert scale as in the previous experiment.
Since we had two independent variables, three levels of detail and three levels of grading, on one dependent variable, the obvious experiment design choice was a 3x3 factorial experiment, meaning that we crossed the two independent variables, each with three levels, giving us nine different sequences. Due to the extensive number of subjects needed for a between subjects design, and the heightened risk of carryover effects in a within subjects design with nine sequences, the nine sequences and their perceived believability were evaluated in a mixed subjects design, meaning that each subject experienced only one LoD but all three levels of grading. We had a total of 18 different orders, which were seen an equal number of times.
The three types of grading can be seen in Picture 2.

Picture 2: The visual stimuli the subjects were presented with. This example shows the high LoD with corresponding grading. The videos can be found in Appendix D located on the DVD.

7.1 PROCEDURE
The visual stimulus was presented on a MacBook Pro 13" screen in the original resolution, 1280x720, using Bose On-Ear headphones to present the auditory stimuli. Volume, brightness, and contrast settings were identical for all subjects.
This experiment was otherwise identical to the experiment for hypothesis one.

7.2 SUBJECTS
The subjects taking part in this experiment were 36 adults (11 male and 25 female, with an average age of 24.2 years) with self-reported normal or corrected-to-normal sight. There were 12 subjects in each test group, cf. the 36 in total. Each group had an appertaining LoD (low, medium or high) with the corresponding gradings imitating day, dusk and night.
The experiment took place in a room in KUBIS (Copenhagen University's Library and Information Service), a part of the Danish Royal Library, using convenience sampling. The subjects were offered cake in exchange for their participation.

8 RESULTS
In this experiment we used a factorial design to test every possible combination of grading and LoD, i.e. the two independent variables. Each group was run twice, such that each possible order of sequences was shown twice. The order was taken from the Latin square, which can be seen in Appendix E.

8.1 BELIEVABILITY RESPONSE RATING
The believability response rating Ḃ for the photopic graded sequence was 2.77 (±1.12), the mesopic graded sequence had an average of 3.47 (±0.80), and the scotopic graded sequence was rated 3.50 (±0.77). All the responses can be seen in Table 1.

Believability Response Rating
LoD/Grading     Photopic       Mesopic        Scotopic       Average
High Detail     2.85 (±1.01)   3.43 (±0.96)   3.25 (±0.69)   3.18 (±0.90)
Medium Detail   2.45 (±1.26)   3.35 (±0.92)   3.70 (±0.80)   3.17 (±1.12)
Low Detail      3.00 (±1.11)   3.63 (±0.49)   3.55 (±0.82)   3.39 (±0.87)
Average         2.77 (±1.12)   3.47 (±0.80)   3.50 (±0.77)
Table 1: Believability response rating

The self-reported quality assessment was calculated as an average across all possible combinations and can be seen in Table 2.

Quality Assessment
LoD/Grading     Photopic       Mesopic        Scotopic       Average
High Detail     4.67 (±1.50)   5.75 (±1.96)   5.33 (±1.92)   5.25 (±1.81)
Medium Detail   5.08 (±2.02)   5.50 (±1.17)   5.08 (±1.24)   5.22 (±1.49)
Low Detail      4.25 (±2.45)   6.00 (±2.13)   5.42 (±2.35)   5.22 (±2.37)
Average         4.67 (±2.00)   5.75 (±1.76)   5.28 (±1.85)
Table 2: Self-reported quality assessment

Correlated to the subjective quality assessment, we again saw a strong relationship between the Ḃ rating and the quality assessment. Except for one, each of the possible conditions had a positive Pearson r-value above .50, as can be seen in Table 3. The correlation was significant in all but two of the conditions (df = 10, critical r-value = .576, α = 0.05), namely the Scotopic/High LoD and the Mesopic/Medium LoD combinations. For further graphs visualizing the results, see Appendix F.

LoD/Grading   Photopic   Mesopic   Scotopic
High          +0.60      +0.67     +0.54
Medium        +0.69      -0.38     +0.70
Low           +0.89      +0.78     +0.66
Table 3: Ḃ rating correlated with self-reported quality assessment (r-values)

8.2 ANOVA
To test the hypothesis, we conducted a two-way ANOVA on the three grading conditions and the three LoDs. A post-hoc analysis using the Tukey HSD (Honestly Significant Difference) procedure was conducted to find any significant differences between the interactions of the independent variables.
The ANOVA showed a significant difference between grading groups (F(2, 99) = 8.04, p = 0.00058), but no significant interaction between the variables (F(4, 99) = 0.82, p = 0.514). The post-hoc Tukey HSD analysis showed a significant difference between the photopic Ḃ values (M = 2.77, SD = 1.12) and both the mesopic (M = 3.47, SD = 0.80) and scotopic (M = 3.50, SD = 0.77) gradings. There was no significant difference between the mesopic and scotopic gradings.
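A hedged sketch of how the reported two-way ANOVA and Tukey HSD post-hoc could be reproduced with statsmodels; the data frame is simulated from the cell means in Table 1, not the study's raw data:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Long-format data: one Ḃ rating per subject per sequence
    # (12 subjects per LoD group, 3 gradings each, as in the mixed design).
    rng = np.random.default_rng(2)
    rows = []
    for lod in ["low", "medium", "high"]:
        for grading, mean in [("photopic", 2.77), ("mesopic", 3.47), ("scotopic", 3.50)]:
            for _ in range(12):
                rows.append({"lod": lod, "grading": grading,
                             "b": rng.normal(mean, 0.9)})
    df = pd.DataFrame(rows)

    # Two-way ANOVA on grading, LoD, and their interaction.
    model = smf.ols("b ~ C(grading) * C(lod)", data=df).fit()
    print(anova_lm(model, typ=2))

    # Tukey HSD post-hoc comparison between the three gradings.
    print(pairwise_tukeyhsd(df["b"], df["grading"]))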
8.3 ANALYSIS OF THE BINARY METRIC
A Cochran's Q test on the binary question "Is it believable" was conducted to see if any of the possible combinations were significantly different. The results can be seen in Table 4.

Table 4: Binary metric response overview
The results from the group-wise comparison Q test, that is, comparison between the LoDs, showed no significant difference (Q = 0.269, df = 2, Q critical value = 5.991, α = .05). Comparison along the other independent variable, color grading, showed no significant difference for the photopic gradings (Q = 0.200, df = 2, Q critical value = 5.991, p = 0.905, α = .05), no significant difference between the mesopic gradings (Q = 1.000, df = 2, Q critical value = 5.991, p = 0.607, α = .05), and no significant difference between the scotopic gradings (Q = 0.182, df = 2, Q critical value = 5.991, p = 0.913, α = .05). This meant we could not reject the null hypothesis for the binary metric: there was no difference between the gradings or LoDs.

9 DISCUSSION
The results from the second hypothesis show a significant difference between the photopic and the mesopic/scotopic sequences, which implies a difference in perceived believability. By these results we can reject the null hypothesis, which is in direct contrast to the main hypothesis stating that there would be no measurable difference in perceived believability across the imitated vision sequences.
We cannot assure a high face validity of the measured construct (the believability response rating), and as believability is not a standardized metric, with few previous attempts (in a CGI context) to measure such a metric, we cannot assure convergent validity either. We based the believability response rating on a pre-test conducted prior to the hypothesis experiments. We cannot say whether the data gathered there was sufficient or biased in any way, which may have led to accumulated errors in the final data.
The explicitly asked question "Is it believable or not" proved not to show any significantly different results amongst the groups of gradings, and no clear pattern was apparent. This begs the question: did the viewers understand the question in the same manner as we did? In an explanatory text, we described how the viewer should understand "believable", but we had no means of telling whether they actually assessed the question the way they were meant to. We did, however, see a strong correlation between the Ḃ values and the self-reported quality assessment, which led us to believe that even though the believability rating does not significantly differ between LoDs within the grading groups, it nevertheless evaluates the quality of the sequences with regard to LoD.
Since the experiment was conducted using a mixed design, the subjects could to some degree be argued to be passive assessors, since they were not told what to look for in the sequences. However, during the second sequence the assessment becomes a bit ambiguous: it could be argued that the subjects turned into active assessors after the first questionnaire, since they were watching the same sequence and may have had an idea of what to look for. On the other hand, they were not told that the three questionnaires were identical, and if they did not figure this out, they could still be argued to be passive assessors. When watching the last sequence, the subjects might have noticed a pattern based on the two previous sequences and their appertaining questionnaires, and could be argued to be active assessors at this point. What is interesting is that this may have influenced their results, since they may have been aware of certain details that they would not have been aware of had the experiment been conducted using a between subjects design.
One of the viewers said the reason for his evaluation of believability being "not believable" was the fact that the CGI interacted with the bicycle but not with the cardboard box: even though the CGI seemed to bump into it, nothing happened to it. Another subject stated that the raw footage was a bit noisy compared to the CGI, which may then have come across as being too perfect.
The sequences were semantically rich, i.e. there were a lot of items the viewer could take note of besides the CGI. As we did not specifically state what the viewer should take note of, viewers may have looked at less important elements in each sequence. This was a conscious choice, as we did not want active assessors, but it carried the possible consequence of ambiguous results, which was the case in this project. A possible change would be to dramatically decrease the semantic richness of the sequences, ensuring a less ambiguous experience.

CONCLUSION
In this project we investigated the subjective element "believability" in a film-watching context. The interest was to measure where the border between believable and non-believable CGI occurs, and through a workflow optimization we used the HVS to implement three sequences that imitated day vision, night vision and a combination of the two. The results from the experiments showed no difference in perceived believability between LoDs, and a significant difference between the photopic and mesopic/scotopic sequences; in both cases this was not expected. The believability metric proposed for this experiment lacks convergent validity and shows no clear pattern for assessing believability; however, there seems to be a strong correlation between the quality assessment and the believability response rating.
Even though the results stand inconclusive, we are certain that there is a connection between perceived believability and CGI. By applying the changes outlined in the discussion, we feel confident that a measurable change in believability can be recorded.

ACKNOWLEDGEMENTS
We would like to thank Thomas Øhlenschlæger and Magnús Sveinn Jónsson from Ghost A/S for taking the time to talk to us about this research. Their answers were enlightening and useful.
nevertheless evaluates the quality of the sequences in REFERENCES
regards to LoD. 1728 Software Systems Angular Size Calculator [Website] – 2001
Since the experiment was conducted using a mixed design 1728 Software Systems – Link: http://www.1728.org/angsize.htm
the subjects could to some degree be argued as being a
Alvarez, Alex 3D Creature Development [Video] – March 14th 2010
passive assessors since the subjects were not told what to – Vimeo – Link: http://vimeo.com/10163279
look for in the sequences. However, during the second
sequence the assessment was a bit ambiguous; it could be Andersen, et al. Multimodal Perception and Cognition: Sound
argued for that the subjects turned to active assessors after Check, does sound have an influence on the believability of a CGI
the first questionnaire, since they were watching the same element in a scene? [Paper] – 2011 Medialogi : Aalborg University
sequence and may have had an idea of what to look for, Copenhagen
however, they were not told that the three questionnaires
Autodesk Maya mental ray for Maya nodes [Website] – 2010
were identical, and they may not have figured this out, thus Autodesk Maya – Link:
the subjects could still be argued for as being passive http://download.autodesk.com/us/maya/2010help/index.html?url
assessors. When watching the last sequence, the subjects =Shading_Nodes_mental_ray_for_Maya_nodes.htm,topicNumber=d0
might have noticed a pattern based on the two previous e542881
sequences and appertaining questionnaires, and they could
be argued as being active assessors at this point. What is Cozby, Poul C. Methods in Behavioural Research [Book]. - Mountain
interesting about this is that it may have influenced their View : Mayfield Publishing Company, 1997. - 6th Edition. - ISBN: 1-
55934-659-0.
results since they may have been aware of certain details
that they may not have been if the experiment had been Ferwerda, James A. A Model of Visual Adaptation for Realistic
conducted using a between subjects design. Image Synthesis [Paper] – 1995 Cornell University : Program of
One of the viewers said the reason for his evaluation of Computer Graphics – Ithaca NY, USA
believability being “not believable” was the fact that the CGI

7
Ferwerda, James A. A Model of Visual Masking for Computer Williams, Nathaniel et al. Perceptually Guided Simplification of Lit,
Graphics [Paper] – 1997 Cornell University : Program of Computer Textured Meshes [Paper] – 2003 University of North Carolina :
Graphics – Ithaca NY, USA Chapel Hill

Ganbar, Ron Nuke 101 – Professional Compositing and Visual Winn, Barry et al. Factors Affecting Light-Adapted Pupil Size in
Effects” [Book] – PeachpitPress 2011 – Berkeley, California, USA – Normal Human Subjects [Paper] – 1994 Association for Research in
ISBN: 978-0-321-73347-4 Vision and Ophthalmology : Investigative Ophthalmology & Visual
Science, Vol. 35 No. 3 – Department of Vision Sciences : Glasgow
Hartridge, H. The Visual Perception of Fine Detail [Report] – 1947 Caledonian University, Glasgow, UK
The Royal Society : Philosophical Transactions of the Royal Society
of London : Series B, Biological Sciences, Vol. 232 No. 592 p. 519-671 Wolfe, Jeremy M. et al. Sensation & Perception [Book] – Second
Edition – 2009 Sinauer Associates, Inc. : Sutherland, Massachusetts,
Height, Selig The Relation Between Visual Acuity and Illumination USA – ISBN: 978-0-87893-953-4
[Report] – 1927 The Journal of General Physiology – Laboratory of
Biophysics, Columbia University, New York Øhlenschlæger, T. and Jónsson, M. S. [Interview] 2011 December
1st 57mins interwiew with Thomas Øhlenschlæger VFX Supervisor /
Hurkman, Alexis Van Color Correction Handbook – Professional 3D Artist and Magnús Sveinn Jónsson I/O Data Handling at Ghost
Techniques for Video and Cinema [Book] – Peachpit Press 2011 – VFX.
Berkeley, California, USA – ISBN: 978-0-321-71311-7

Kerlow, Isaac The Art of 3D Computer Animation and Effects


[Book] Fourth edition – 2009 John Wiley & Sons Inc., Hoboken, NJ,
USA – ISBN: 978-0-470-08490-8

Livingstone, Daniel Turing’s Test and Believable AI in Games


[Report] – ACM Computers in entertainment, Vol. 4 No. 1, 2006 –
University of Paisley - United Kingdom

Mac Namee, Brian Proactive Persistent Agents : Using Situational


Intelligence to Create Support Characters in Character-Centric
Computer Games [Report] – 2004 Department of Computer Science :
University of Dublin, Trinity College – Ph.D. thesis

McNamara, Ann Exploring Visual and Automatic Measures of


Perceptual Fidelity in Real and Simulated Imagery [Report] – 2006
ACM Transactions on Applied Perception Vol. 3 No. 3 p. 217-238 –
Department of Mathematics and Computer Science : St. Louis,
MO63103

Myszkowski, Karol Perception-Based Global Illumination,


Rendering, and Animation Techniques [Paper] – 2002 Maz-Planck-
Institut für Informatik : Saarbrücken, Germany

Magerko, Brian Measuring Dramatic Believability [Report] – 2007


Games for Entertainment and Learning Lab – Michigan State
University

Rademacher, Pablo M. et al. Measuring the Perception of Visual


Realism in Images [Paper] – 2001 Microsoft Research : University of
North Carolina – Chapel Hill

Rademacher Pablo M. Measuring the Perceived Visual Realism of


Images [Report] – 2002 Department of Computer Science -
University of North Carolina - Chapel Hill - Ph.D. thesis

Reddy, Martin Perceptually Modulated Level of Detail for Virtual


Environments [Report] – Doctor of Philosophy : 1997 University of
Edinburgh – Ph.D. thesis

ScreenTek Brightness and Contrast Ratio: How Brightness and


Contrast Affect Notebook LCD Screens [Website] – 2003 ScreenTek –
Link: http://www.screentekinc.com/resource-center/brightness-
and-contrast-ratio.shtml

Smith Kinney, Jo Ann Comparison of Scotopic, Mesopic, and


Photopic Spectral Sensitivity Curves [Paper] – U. S. Naval Medical
Research Laboratory : New London, Connecticut – 1957 Journal of
the Optical Society of America : Vol. 48 No. 3

Ulbricht, Christiane et al. Verification of Physically Based


Rendering Algorithms [Report] – 2006 The Eurographics
Association and Blackwell Publishing Ltd. : Computer Graphics
forum Vol. 25 No. 2 p. 237-255 – Institute of Computer Graphics and
Algorithms : Vienna University of Technology, Austria

APPENDIX A
QUESTIONNAIRE METHODOLOGY AND RESULTS
We constructed a questionnaire to find out why people find CGI believable or not believable. The questionnaire can be found on www.cgi-believability.speedsurvey.com.

SUBJECTS
We tested 25 people from the Medialogy department at Aalborg University Copenhagen, using convenience sampling. The reason for testing Medialogists was to get evaluators who, to some extent, have knowledge about movies and computer graphics, as we expected to get more buzzwords from them than from people without said knowledge. Sex and age were not considered important.

PROCEDURE
We showed the subjects snapshots from three different movies: Iron Man 2 (2010), Transformers: Dark of the Moon (2011), and Mega Shark vs. Crocosaurus (2010). The reason for choosing these three films is that the first features CGI applied to humans, while the second and third feature extreme CGI objects interacting with humans; the first and second are what we define as believable CGI and the last is not believable CGI. We introduced the subjects to the operational definition of believability and, after four snapshots of the first movie, asked them whether they found the CGI believable (yes/no), and what elements/properties of the CGI, the scene or both made them answer the way they did; the same was then done for the next two movies. The reason for asking whether the CGI was believable, when we only wanted to know what elements/properties make it believable or not, was to compare the answers with the believability judgment, so as to avoid cases where, for example, a subject says the lighting is amazing and yet rates the CGI not believable, or vice versa.

RESULTS
We found 15 parameters for evaluating the believability of CGI in film, which we then applied to our own sequence, as we wanted it to represent, to the farthest extent possible, a film-watching situation.
The following buzzwords occurred most frequently; those in parentheses are examples mentioned only once or twice.
 Scene
o Real life situation
o Interaction
 Creation
o Details
o Texture
 (Too smooth (bad))
 Composition
o Perspective
o Proportion
 (Scaled/zoomed (bad))
 (Weight)
o Light
 Cast shadows
 Reflections
 (Depth)
o Color
 Contrast
 (Depth)
 Realism
o Comparison to real life situation/object
o Natural/Unnatural
The exact results can be found on the CD.

Appendix B: Questionnaire
[Questionnaire form not reproduced]

Appendix C
Graphs visualizing the hypothesis one data

[Graphs: frequency histograms of the quality assessment responses (0-10) for the low, medium, and high detail sequences, and scatter plots of the Ḃ value against the subjective quality assessment for each level of detail.]
Appendix E
Latin Square order of participants

Hypothesis 1: [order matrix not reproduced]

Hypothesis 2: [order matrix not reproduced]
Appendix F
Graphs visualizing data from the hypothesis two test

[Graphs: scatter plots correlating the Ḃ rating with the quality assessment for each LoD (high, medium, low) under each grading (photopic, mesopic, scotopic).]