



A Dissertation submitted to the

Graduate School—New Brunswick
Rutgers, The State University of New Jersey
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
Graduate Program in Computer Science

Written under the direction of

Doug DeCarlo
and approved by

New Brunswick, New Jersey

May, 2005

The Art of Seeing: Visual Perception in Design and
Evaluation of Non-Photorealistic Rendering

by Anthony Santella

Dissertation Director: Doug DeCarlo

Abstract

Visual displays such as art and illustration benefit from concise presentation of in-
formation. We present several approaches for simplifying photographs to create such
concise, artistically abstracted images. The difficulty of abstraction lies in selecting
what is important. These approaches apply models of human vision, models of image
structure, and new methods of interaction to select important content. Important loca-
tions are identified from eye movement recordings. Using a perceptual model, features
are then preserved where the viewer looked, and removed elsewhere. Several visual
styles using this method are presented. The perceptual motivation for these techniques
makes predictions about how they should affect viewers. In this context, we validate
our approach using experiments that measure eye movements over these images. Re-
sults also provide some interesting insights into artistic abstraction and human visual

Acknowledgements

Thanks go to the many people whose help and support were essential in making this
work possible. None of this would have happened without my advisor Doug DeCarlo.
Thanks go also to my other committee members: Adam Finkelstein, Eileen Kowler,
Casimir Kulikowski and Peter Meer for their advice and encouragement at various (in
some cases many) stages of this process.

Thanks go also to the many friends and family members who have supported and
kept me sane through this long process. I wouldn’t have survived it without my parents
and brothers Nick and Dennis. Special thanks go to Bethany Weber. Thanks also to
Jim Housell, all the old NYU crowd, the grad group at St. Peters and all the supportive
souls in the CS Department, RuCCS and the VILLAGE.

Finally, thanks go to Phillip Greenspun for photos used in several renderings that
appear in chapters 7 and 9, as well as models Marybeth Thomas, Adeline Yeo and
Franco Figliozzi. Special thanks to Georgio Dellachiesa for looking equally thoughtful
in countless illustrative examples.

Table of Contents
Abstract
Acknowledgements
List of Figures
1. Introduction
1.1. Inspirations
1.1.1. Artistic Practice
1.1.2. Psychology
1.1.3. Computer Graphics
1.2. Our Goal
2. Abstraction in Computer Graphics
2.1. Manual Annotation
2.2. Automatic Methods
2.3. Level Of Detail
3. Human Vision
3.1. Eye Movements
3.1.1. Eye Movement Control
3.1.2. Salience Models
3.2. Eye Tracking
3.3. Limits of Vision
3.3.1. Models of Sensitivity
3.3.2. Sensitivity Away from the Visual Center
3.3.3. Applicability to Natural Imagery
4. Vision and Image Processing
4.1. Image Structure Features and Representation
4.2. Segmentation
4.3. Edge Detection
5. Our Approach
5.1. Eye tracking as Interaction
5.2. Using Visibility for Abstraction
6. Painterly Rendering
6.1. Image Structure
6.2. Applying the Limits of Vision
6.3. Rendering
6.4. Results
7. Colored Drawings
7.1. Feature Representation
7.1.1. Segmentation
7.1.2. Edges
7.2. Perceptual Model
7.3. Rendering
7.4. Results
8. Photorealistic Abstraction
8.1. Image Structure
8.2. Measuring Importance
8.3. Results and Discussion
9. Evaluation
9.1. Evaluation of NPR
9.1.1. Analysis of Eye Movement Data
9.2. Experiment
9.2.1. Stimuli
9.2.2. Subjects
9.2.3. Physical Setup
9.2.4. Calibration and Presentation
9.3. Analysis
9.3.1. Data Analysis
9.3.2. Statistical Analysis
9.4. Results
9.4.1. Quantitative Results
9.4.2. Discussion
9.5. Evaluation Conclusion
10. Future Work
10.1. Image Modeling
10.1.1. Segmentation
10.1.2. Edges
10.2. Perceptual Models
10.3. Applications
11. Conclusion
References
Curriculum Vita

List of Figures
1.1. (a) Henri de Toulouse-Lautrec’s “Moulin Rouge—La Goulue” (Lithographic
print in four colors, 1891). (b) Odd Nerdrum’s “Self-portrait as Baby”
(Oil, 2000). Artists control detail as well as other features such as color
and texture to focus a viewer on important features and create a mood. La
Goulue’s swirling under-dress is a highly detailed focal point of the image,
and contributes to the picture’s air of reckless excitement. Artists have a
fair amount of latitude in how they allocate detail to create an effect.
Nerdrum renders his eyes (usually one of the most prominent features in a
portrait) in a sfumato style that makes them almost nonexistent. Detail is
instead allocated to the child’s prophetic gesture. These choices change a
common baby picture into something mysterious and unsettling.

1.2. Judith Schaechter’s “Corona Borealis” (Stained glass, 2001). Skillful
artists use the formal properties and constraints of a medium for expressive
purposes. The high dynamic range provided by transmitted light and the heavy
black outlines of the lead caming that holds the glass together are used to
set the figure off from the background, creating a powerful image of joy in
isolation.

2.1. Direct placement of strokes. Complete control of abstraction is possible
when a user provides actual strokes that are rendered in a given style.
Reproduced from [Durand et al., 2001].

2.2. Manual annotation for textural indication. Important edges on a 3D model
are marked and have texture rendered near them, while it is omitted in the
interior. Reproduced from [Winkenbach and Salesin, 1994].

2.3. Manual local importance images. Hand painted images can indicate
important areas to be rendered in greater detail or fidelity. Reproduced
from [Hertzmann, 2001].

2.4. (a) Original image. (b) Corresponding salience map [Itti et al., 1998].
(c) Corresponding salience map [Itti and Koch, 2000]. Salience methods pick
out potentially important areas on the basis of contrast in some space (not
limited to intensity). The two methods pictured here differ in the method of
normalization used to enhance contrast between salient and nonsalient
regions.

3.1. Patterns of eye movements of a single subject over an image when given
different instructions. Note (1), free observation, which shows fixations
that are relatively dispersed yet still focused on relevant areas. Contrast
it with (3), where the viewer is instructed to estimate the figures’ ages.
Reproduced from [Yarbus, 1967].

3.2. Similar effects to [Yarbus, 1967] are easily (even unintentionally)
achieved when using eye tracking for interaction. Circles are fixations;
their diameter is proportional to duration. The first viewer was instructed
to find the important subject matter in the image. The second viewer was
told to “just look at the image”. This viewer assumed, from prior experience
in perceptual experiments, that he was going to be later asked detailed
questions about the contents of the scene. This resulted in a much more
diffuse pattern of viewing.

3.3. Log-log plot of contrast sensitivity from equation (3.2). This function
is used to define a threshold between visible and invisible features.

3.4. Cortical Magnification describes the drop-off of visual sensitivity with
angular distance from the visual center.

4.1. (a) Scale space of one dimensional signal. Features disappear through
scale space but no new features appear. (b) Plot of inflection points of
another one dimensional signal through scale space. Reproduced from
[Witkin, 1983].

4.2. Interval tree for 1D signal illustrating decomposition of the signal into
a hierarchy. Reproduced from [Witkin, 1983].

5.1. (a) Computing eccentricities with respect to a particular fixation at p.
(b) A simple attention model defined as a piecewise-linear function for
determining the scaling factor a_i for fixation f_i based on its duration
t_i. Very brief fixations (below t_min) are ignored, with a ramping up (at
t_max) to a maximum level of a_max.

6.1. Painterly rendering results. The first column shows the fixations made
by a viewer. Circles are fixations; size is proportional to duration, and
the bar at the lower left is the diameter that corresponds to one second.
The second column illustrates the painterly renderings built based on that
fixation data.

6.2. Detail in background adjacent to important features can be
inappropriately emphasized. The main subject has a halo of detailed shutter
slats.

6.3. Sampling strokes from an anisotropic scale space avoids giving the
image an overall blurred look, but produces a somewhat jagged look in
background areas.

6.4. Color and contrast manipulation. Side by side comparison of rendering
with and without color and contrast manipulation (precise stroke placement
varies between the two images due to randomness).

7.1. Slices through several successive levels of a hierarchical segmentation
tree generated using our method.

7.2. Line drawing style results.

7.3. Stylistic decisions. Lines in isolation (a) are largely uninteresting.
Unsmoothed regions (b) can look jagged. Smoothed regions (c) have a
somewhat vague and bloated look without the black edges superimposed.

7.4. Renderings with uniform high and low detail.

7.5. Several derivative styles of the same line drawing transformation. (a)
Fully colored, (b) color comic, (c) black and white comic.

8.1. Mean shift filtering tends to create images that no longer look like
photographs.

8.2. Photo abstraction results.

8.3. Photo in (a) is abstracted using fixations in (b) in a variety of
different styles. (c) Painterly rendering, (d) line drawing, (e) locally
disordered [Koenderink and van Doorn, 1999], (f) blurred, (g)
anisotropically blurred.

8.4. (a) Detail of our approach, (b) the same algorithm using an importance
map where total dwell is measured locally. Notice in (b) the leaking of
detail to the wood texture from the object on the desk. Here differences
are relatively subtle, but in general it is preferable to allocate detail in
a way that respects region boundaries.

8.5. The range of abstraction possible with this technique is limited. With
greater abstraction the scene begins to appear foggy. In some sense it
no longer looks like the same scene.

9.1. Example stimuli. Detail points in white are from eye tracking, black
detail points are from an automatic salience algorithm.

9.2. Illustration of data analysis, per image condition. Each colored
collection of points is a cluster. Ellipses mark 99% of variance. Large black
dots are detail points. We measure the number of clusters, distance
between clusters and nearest detail point, and distance between detail
points and nearest cluster.

9.3. Statistical significance is achieved for number of clusters over a wide
range of clustering scales. The magnitude of the effect decreases, but
its significance remains quite constant over a wide interval. Our results
do not hinge on the scale value selected.

9.4. Average results for all analyses per image.

9.5. Average results for all analyses per viewer.

9.6. Original photo and high detail NPR image with viewers’ filtered eye
tracking data. Though we found no global effect across these image
types, there are sometimes significantly different viewing patterns, as
can be seen here.

10.1. A rendering from our line drawing system (b) can be compared to
an alternate locally varying segmentation (c). This segmentation more
closely follows the shape of shading contours.

10.2. Locally varying segmentation cannot replace a segmentation hierarchy.
Another example of a locally varying segmentation controlled by
a perceptual model (c), compared to a rendering from our line drawing
system. Note fine detail in the brick preserved near the subject’s head
in (c). This is a consequence of the threshold varying continuously as
a function of distance from the fixations on the face.

10.3. A rendering from our line drawing system demonstrates how long but
unimportant edges can be inappropriately emphasized. Also, prominent
lower frequency edges like creases in clothing are detected in
fragments and filtered out because edges are detected at only one scale.

10.4. Attempting technical illustration of mechanical parts pushes our image
analysis techniques close to (if not over) their limits.


Chapter 1
Introduction

In all eras and visual styles, artists control the amount of detail in the images they
create, both locally and globally. This is not just a technique to limit the effort in-
volved in rendering a scene. It makes a definite statement about what is important
and streamlines understanding. Our goal is to largely automate this artistic abstrac-
tion in computer renderings. The hope is to remove detail in a meaningful way, while
automating individual decisions about what features to include. Eye tracking allows
the capture of what a viewer looks at and, indirectly, what they find important. We
demonstrate that this information alone is sufficient to control detail in an image-based
rendering, and change the way successive viewers look at the resulting image. Our
method is grounded in the mechanisms and nature of vision—how we see and un-
derstand the world. This is an intuitive idea, if often overlooked. Artists must first
be viewers [Ruskin, 1858] and viewers ultimately consume the resulting images. So,
vision must be central in the design of algorithms for creating imagery.

Vision appears simple and effortless. Because under most circumstances it requires
no conscious effort or exertion, it seems like a trivial operation, something that just
happens, as if the light falling on the eye made one see in the same way it warms a stone.
But sight is the product of an extraordinarily developed and complicated visual system.
In seeing we are all experts, and experts make things seem easy. Without any effort we
can navigate and act in the world and recognize objects even under difficult conditions.
The abilities of our sight outreach even our awareness of them. Experiments have
shown that the eyes of radiologists searching for tumors linger longer over tumors
that they fail to notice and report [Mello-Thoms et al., 2002]. The limited success of
attempts to mimic these human abilities in computer vision systems highlights both the
difficulty of the computations involved, and our phenomenal success at them.

The apparent ease with which we see slips when our vision is stressed: struggling
to keep a written page in focus as we fall asleep, searching for a loved one’s face in the
shifting crowd of an airport. At these times we become conscious of sight as a struggle
to organize and make sense of the world. This struggle has continual victories, but also
failures. An old friend waving to us on the street is passed by; a typo makes its way into
an important document. The apparent ease of vision also masks our limitations. We
miss much, and are easily overloaded. Sometimes our failures are engineered: a cam-
ouflaged soldier, the proverbial fine print. More often, however, they are accidental.
Some information was present, or presented, and we failed to notice it.

Well-designed displays of visual information ensure we don’t miss anything important
by careful arrangement and manipulation. A wide variety of techniques are used
to make meaning clear. Detail is put just where it is important, shapes can be changed
or removed, colors and textures enhanced or suppressed. Paintings, sketches, technical
illustrations, and even the most apparently photorealistic of art—all products of the hu-
man hand—have been simplified and manipulated for ease of understanding. Reality
is complicated and messy. Rather than realism, what is more often desired is verisimil-
itude. We want the appearance of reality which has been organized and structured to
make its meaning clearer, if necessarily more limited than the infinite complexity of

Achieving this kind of clarity has always been the job of artists and designers who
make subjective, but not arbitrary, decisions about what is important, and how to con-
vey it. The ubiquity of digital media creates a need for automation in achieving this
kind of good design. The goal is not to replace the artist who creates a carefully crafted
one-off display, but instead to create a potentially vast number of adaptive displays,
tailored to particular situations and viewers. This information would otherwise be dis-
played in some less well-designed manner, laying more of a cognitive burden on the
user. It has been argued, in fact, that avoiding this burden is one of the primary
characteristics of powerful art [Zeki, 1999]. If good design can be formalized, this will
enhance understanding and aid effective communication, as well as improve our own
understanding of the workings of visual communication. This thesis presents some
initial steps toward this goal.

1.1 Inspirations

There are many techniques proposed by various artists, and perhaps even more theory
proposed by various researchers and critics on how to achieve good visual design. Yet
it remains imperfectly understood in all of the fields where it has been studied. Because
of this, a successful practical approach must necessarily draw on elements from many
areas of practice and theory. If a practical system is designed to be as general as
possible, its creation can improve understanding of what visual clarity means, and how
it relates to communication. It can also provide a framework in which to unify concepts
and techniques from many fields.

1.1.1 Artistic Practice

One important source of inspiration for this work is artistic practice and practical the-
ory. Artists have always had strong motivation to capture the attention and interest of
uninterested, sometimes hostile viewers. Much ingenuity has been applied to creating
images that are as gripping and clearly communicative as possible. Careful observation
of such images can yield interesting insights (see Figure 1.1). Similarly, artists have
throughout history given advice on the practice of their craft. Theorists and art histo-
rians have tried to make generalizations and analyze techniques [Ruskin, 1857, Gom-
brich et al., 1970, Graham, 1970, Arnheim, 1988]. This is true in graphic design as
well as fine art. Classical texts like Tufte [1990] try to explore the qualities of good
and bad presentations of information and make generalizations from carefully chosen

Figure 1.1: (a) Henri de Toulouse-Lautrec’s “Moulin Rouge—La Goulue” (Lithographic
print in four colors, 1891). (b) Odd Nerdrum’s “Self-portrait as Baby” (Oil,
2000). Artists control detail as well as other features such as color and texture to focus
a viewer on important features and create a mood. La Goulue’s swirling under-dress is
a highly detailed focal point of the image, and contributes to the picture’s air of reckless
excitement. Artists have a fair amount of latitude in how they allocate detail to create
an effect. Nerdrum renders his eyes (usually one of the most prominent features in a
portrait) in a sfumato style that makes them almost nonexistent. Detail is instead
allocated to the child’s prophetic gesture. These choices change a common baby picture
into something mysterious and unsettling.

However, these instructions and recommendations are often difficult to apply. They
are sometimes limited in scope, providing specific instructions for a particular narrow
problem. More often, guidelines are too broad and vague in their application. They
depend for their functioning on the judgment of the artist. The advice of artists and
designers often comes in the form of heuristics, rules of thumb to be taken with a grain
of salt, kept in the back of one’s mind, and applied when the moment seems right.
Becoming an expert in a visual field is often a question of cultivating, through practice
and observation, an instinctive sense of when to apply such rules, and conversely when
to break them.

Figure 1.2: Judith Schaechter’s “Corona Borealis” (Stained glass, 2001). Skillful
artists use the formal properties and constraints of a medium for expressive purposes.
The high dynamic range provided by transmitted light and the heavy black outlines
of the lead caming that holds the glass together are used to set the figure off from the
background creating a powerful image of joy in isolation.

1.1.2 Psychology

A somewhat different approach is to study good design with the methodologies of psy-
chology, psychophysics and neuroscience. This is in essence an attempt to understand
good design from first principles: the functioning of the human mind and visual sys-
tem. Visual perception obviously mediates all information that passes from a display
to a user. So, as a form of visual communication, art must be constrained by the laws
of psychology and the visual system [Arnheim, 1988, Zeki, 1999, Ramachandran and
Hirstein, 1999]. This is an attractive idea. By understanding the strengths and weaknesses
of the process that allows us to see, it should be possible to maximize use of the
limited cognitive bandwidth between a display and viewer.

This is perhaps not so far from what artists have done all along. One could view
every daub of paint, every pen stroke as an informal experiment in vision. Artists test
their actions against the evidence of their own visual systems, and make predictions
about how they will affect others. Formal attempts to understand perception and art
are simply more conscious, more systematic, and more interested in understanding the
creative process itself than making a statement through it. A number of psychologists
have speculated on this, and pointed to specific examples from art history [Arnheim,
1988, Leyton, 1992, Zeki, 1999, Ramachandran and Hirstein, 1999]. Studies have in-
deed found empirical evidence of perceptual effects resulting from artistic style or com-
position [Ryan and Schwartz, 1956, Locher, 1996].

Like most attempts to do anything complicated from first principles, looking at art
and design using cognition is hard. There is much that has been understood about the
visual system, but also much that is not. The more basic and low level the area of
visual function is, the more we know about it, and the less useful that information is
for design. Much, for example, is known about the physical mechanism of how we perceive
color; substantially less is known about how we parse shapes out of a background
and assemble them into objects. It’s not surprising that many researchers looking at art
from a cognitive standpoint consider primarily 20th century painters, like Mondrian,
Kandinsky, or even Picasso at his more abstract, who themselves were largely con-
cerned with the purely formal aspects of pictorial space rather than the semantics of
subject matter. The semantic aspects of vision which reference the rest of the world
and its non-visual aspects are ill understood, so little cognitive research can be brought
to bear on the semantics of art.

Given the limited basic knowledge, general theories of how art functions cognitively
are, almost of necessity, rather vague in their application. Ramachandran and Hirstein
[1999], for example, suggest that all art is guided by the peak shift principle. This principle,
found in a number of situations in psychology, says that if a response is trained to some
stimulus, the greatest, or peak, response will be found with a stimulus that is greater than
the one used in training. A depiction functions by emphasizing the features that nor-
mally let one know what it is. In this view all art is a form of caricature. However,
this does not tell us the qualities of a successful caricature. In another example, Leyton
[1992] argues that art maximally encodes a causal history that can be read by viewers.
Good art should contain as much information in the form of asymmetry as possible to
stimulate viewers, but not too much, which will disturb them. Though a reasonable
sounding standard, this only hints at what the correct level of complexity is.

The application of psychology to design is difficult. However, we do not need to
build a system directly on these principles. Inspired by them, we can apply knowledge
from low-level vision and computer graphics techniques to build practical systems.

1.1.3 Computer Graphics

A large body of work in computer graphics ignores all these difficulties and sets out
to create attractive synthetic art and illustration. Attempts at algorithmic definitions of
good design surface in a number of areas of computer science: graphics, scientific
visualization, document layout, human-computer interaction, and interface design.
Concerns of effective art-like visual communication have particularly come to the forefront
in the realm of non-photorealistic rendering, or NPR. This area is perhaps excessively
broad. It includes almost any part of graphics that aims to create images that are not an
imitation of reality. It includes things as diverse as computer generation of geometrical
patterns, instructional diagrams and impressionist paintings. NPR images run a gamut
between the purely ornamental and those designed to convey very specific information.
A large area of research in NPR has been the production of many, often quite impres-
sive, phenomenological models for rendering in various traditional media and styles.
There is, however, an increasing interest in NPR not just as a way to imitate traditional
visual styles, but also as a set of techniques for trying to display visual information in
a concise and abstract way.

The link between concise presentation and imitating traditional artistic styles is not
accidental. Almost all the visual styles of traditional media (line drawings, wood-block
prints, comics, expressionist or impressionist paintings, pencil sketches) necessarily
discard vast amounts of information as a direct consequence of their visual style. There
is, for example, no color or shading in a pure line drawing. However, these images still
carry the essential content that the artist (and viewer) requires of them. Skillful artists
can use the properties and constraints of a medium to enhance the expressiveness of
a work (see Figure 1.2). A brief time spent working with photo filters in a program
like Adobe Photoshop suggests that computer implementations of these styles capture
some of the effects of traditional media, but often in a way that does not adapt to
particular situations with an artist’s flexibility. Artists ultimately can judge their results
as they go. Applying a technique in a blanket manner is often less satisfactory. What
is acceptable as reality in a photograph can look fussy and crowded as a painting.

1.2 Our Goal

Though today’s algorithms cannot model the general intelligence of an artist, we argue
that carefully designed systems can make use of minimal user interaction to create
much more expressive images. Specifically, we look at modulation of local detail, an
important cue used in traditional art and visualization. Including detail only where it
is needed focuses viewer interest and can help clarify the point of an image. As well
as being a feature of art and illustration, applications in visualization could benefit
from this. It would allow the computer to hand-craft displays for clarity and efficient
understanding in a particular situation.

This work does not directly address specific visualization applications. Rather than
exploring visualization directly, art remains the focus, and this thesis remains firmly in
the realm of artistic NPR. Our hope, however, is that insights gained in this way should
be applicable to a number of areas in visualization. Art is a particularly good place to
explore the link between cognition and design of displays. Specific applications tend
to distract with their own implementation details and domain constraints. Radiology,
for example, is a domain where complexity and high stakes greatly constrain practical
applications. Art encourages a wider view, in which it is easier to look at general
techniques and patterns that are widely useful. Similarly, in evaluation, validation of
a particular system is of limited interest, while evaluation of more general techniques
can provide insights into cognition and be more widely relevant.

Grounding our work in knowledge of visual perception also helps focus attention
away from application engineering and towards general concepts. We are interested in
interactively efficient methods for achieving expressive NPR images. Knowledge of
visual perception suggests that by exploiting the visual system we can reserve human
effort for just the hardest parts of the process of crafting images, and pass the major-
ity of the work over to a computer. For a computer application, the hardest part of
abstraction is deciding what is important. This is not hard for people, since it is done
instinctively. Deciding what to paint a picture of is the easy part for an artist. It is the
mechanics of turning that intention into an image that takes training, time and effort.
This leads us to a simple, minimally interactive method for controlling detail via eye
tracking. As we will soon see, vision research leads us to believe that where people
look indicates importance. Such areas should be portrayed in detail. Conversely, what
viewers don’t look at is unimportant to them and can be removed or de-emphasized.
The same insights about vision that lead to this methodology also lead us to quantitative
methods for evaluation. If our approach is successful, increased interest in areas
highlighted with detail should be reflected in eye movements. This methodology holds
the promise of images that are carefully crafted for understanding on sound principles,
and can be formally evaluated for effectiveness. Such images and techniques can in
turn serve as a tool for further investigating human vision in a way targeted toward the
questions that are important for crafting images. With more information, even better
techniques and images can be built.

In this thesis we begin in Chapter 2 by laying out the basic problem of control-
ling detail in NPR imagery, and look at the range of techniques that have been used
to address it. In Chapters 3 and 4 we then review the basic background in human and
computer vision underlying our approach to this problem. The nature of vision leads
us to an approach of capturing the intentionality central to design via eye tracking.
Information about where people look alone is sufficient to control detail in a directed
way, allowing us to craft semi-automatic NPR images with much of the attractive and
engaging intentionality of completely hand made art. The basic nature of this interaction
is described in Chapter 5. In Chapters 6, 7 and 8 we then present several systems
for creating NPR renderings built on this idea, and discuss their strengths and weaknesses.
An evaluation of one of these systems is presented in Chapter 9, which not only
validates the general approach but gives some interesting insights into abstraction and
human vision. Finally, in Chapter 10 we discuss some directions for future research.

Chapter 2
Abstraction in Computer Graphics
In any work of art all parts of the picture plane do not receive equal attention from the
artist. Critical areas are more detailed, while others are left relatively abstract. This is
the case even in quite realistic styles, and in technical illustration. Such effects have not
been ignored in computer graphics and NPR. Local control of detail has been addressed
in several visual styles. Whatever the rendering techniques used, important areas can
be identified and depicted with greater detail, or emphasis on fidelity. Deciding what is
important is difficult to do automatically. Two broad approaches to selecting important
areas can be characterized: manual user annotation, and simple heuristics.

Figure 2.1: Direct placement of strokes. Complete control of abstraction is possible
when a user provides actual strokes that are rendered in a given style. Reproduced from
[Durand et al., 2001].

2.1 Manual Annotation

Figure 2.2: Manual annotation for textural indication. Important edges on a 3D model
are marked and have texture rendered near them, while it is omitted in the interior.
Reproduced from [Winkenbach and Salesin, 1994].

Figure 2.3: Manual local importance images. Hand painted images can indicate important
areas to be rendered in greater detail or fidelity. Reproduced from [Hertzmann,
2001].

At one extreme, near complete control of detail can remain in the hands of a user.
This provides many expressive possibilities at the expense of much interaction. At its
furthest extreme the computer becomes merely a digital paintbrush the user directly
manipulates [Baxter et al., 2001]. A number of intermediate approaches exist that aid
the user in the technicalities of creating an image while still giving them complete
control over detail. The earliest work creating a painting-like appearance, or painterly
rendering effect [Haeberli, 1990] took this approach. A user places strokes entirely
by hand, their color being sampled from an underlying source image. The approach
is in effect a form of tracing, where the user ultimately remains in control of stroke
placement and size while, like a traditional media artist, making their own decisions
about which details are important as they go. A similar kind of interaction has been
used [Durand et al., 2001] in generating pencil renderings (see Figure 2.1). The user
places strokes which are shaded and shaped automatically to create a final drawing.
The same stroke based interactive methods are applicable in 3D [Kalnins et al., 2002].

One step distant from actually drawing strokes, it is also possible to indicate in-
creased importance for some areas of a rendering using an importance map, where
higher intensity indicates the need for more attention or detail in that area. For exam-
ple in a painterly rendering framework [Hertzmann, 2001], a hand drawn importance
map was used to indicate that a source image should be more closely approximated in
certain locations (see Figure 2.3). Similarly, in 3D, hand drawn lines have been used
to indicate locations near which textural detail should be included [Winkenbach and
Salesin, 1994] (see Figure 2.2). In another painterly rendering application [Gooch and
Willemsen, 2002] rectangles to be painted in greater detail could be drawn by hand.
Various digital versions of other media, such as pen and ink [Salisbury et al., 1994]
and watercolor [Curtis et al., 1997] have been developed that provide the user with
significant control over the detail present in different areas. Such approaches can yield
attractive results, but require careful attention on the part of a user.
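
As an illustration of how an importance map might modulate detail, the following minimal
sketch maps importance values in the range 0 to 1 to brush radii, so that more important
areas receive finer strokes. The function name, the linear mapping and the radius range
are assumptions made for illustration; they are not taken from any of the systems cited
above.

```python
def stroke_radius(importance, min_radius=2.0, max_radius=16.0):
    """Map an importance value in [0, 1] to a brush radius in pixels.

    Hypothetical linear mapping: importance 1 gives the finest stroke,
    importance 0 the coarsest. Values outside [0, 1] are clamped.
    """
    w = min(1.0, max(0.0, importance))
    return max_radius - w * (max_radius - min_radius)


# A tiny hand-made importance map (rows of values in [0, 1]).
importance_map = [
    [0.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
]

# Per-pixel brush radius: small where importance is high.
radius_map = [[stroke_radius(v) for v in row] for row in importance_map]
```

A painterly renderer could consult such a radius map when choosing stroke size at each
location; any monotonically decreasing mapping would serve the same purpose.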


Figure 2.4: (a) original image. (b) corresponding salience map [Itti et al., 1998]. (c)
corresponding salience map [Itti and Koch, 2000]. Salience methods pick out potentially
important areas on the basis of contrast in some space (not limited to intensity).
The two methods pictured here differ in the method of normalization used to enhance
contrast between salient and nonsalient regions.

2.2 Automatic Methods

More common in NPR have been purely automatic methods. Automatic methods also
run a gamut, from approaches that process an image in a completely local, uniform
manner to those that automatically extract some quantity from an image as a proxy for

importance. Uniform approaches perform some (not necessarily local) operation uni-
formly across an image, and have been used extensively in painterly rendering [Hertz-
mann, 1998, Litwinowicz, 1997, Shiraishi and Yamaguchi, 2000]. A global effect pro-
vides users with only limited control. Rather than being truly uniform, some of these
approaches make a (largely implicit) simple assumption that some low level features
are important and worth preserving. Automatic painterly rendering methods for ex-
ample, largely assume strong high frequency features are important and should be
preserved in a rendering. In fact, painterly techniques vary largely in their method
for respecting these boundaries: aligning strokes perpendicular to the image gradi-
ent [Haeberli, 1990], terminating strokes at edges [Litwinowicz, 1997], or drawing in
a coarse-to-fine fashion [Hertzmann, 1998, Shiraishi and Yamaguchi, 2000, Hays and
Essa, 2004]. Similarly, automatic line drawing approaches (both 2D and 3D) assume
the importance of all lines that meet certain purely geometrical definitions: occluding
contours and creases [Saito and Takahashi, 1990, Interrante, 1996, Markosian et al., 1997],
and suggestive contours [DeCarlo et al., 2003]. Such techniques can create attractive
images, but lack the selective omission which gives art much of its expressive power.

The kind of omission commonly used in depicting specific objects can sometimes
be explicitly stated. In drawing trees, for example, one can avoid drawing detail in the
center of the tree, especially as the tree is drawn smaller [Kowalski et al., 1999,
Deussen and Strothotte, 2000]. Though this may be an accurate characterization of a particular
common style of depiction, it is not generally applicable to any subject.

For general images, there are relatively few options for automatically selecting
important areas. Some attempts have been made to predict importance using various
image analysis techniques. In 3D, image pyramids have been applied to omit detail in
the interior of a shape [Grabli et al., 2004]. In 2D, drawing on vision research, some
approaches have attempted to use salience measures to capture importance. Salience
measures are a guess at the ability of a feature to capture interest based on its low level
properties [Itti et al., 1998, Itti and Koch, 2000]. Similarly motivated salience measures
have been applied to attempt to predict features worth preserving in painterly rendering
[Collomosse and Hall, 2003]. Because faces are often an important component of
images, detecting them also provides a useful (though not always reliable) automatic
cue for what areas are important. Face detection has been used alongside salience
methods in other areas of graphics loosely related to NPR where identifying important
features is useful, such as automatic cropping [Chen et al., 2002, Suh et al., 2003] and
recomposing of photographs [Setlur et al., 2004].

2.3 Level Of Detail

An area of computer graphics left out in the above discussion has dealt with many of
these same issues. Various adaptive rendering and level of detail (LOD) schemes have
used the visibility or potential interest of features to skip computations that are unlikely
to be noticed. This is different from our goal. We are interested in detail modulation for
stylistic and expressive reasons. Level of detail seeks to control the computational cost
of rendering through approximation, not abstraction. Though both are concerned with
simplification, LOD and various other corner cutting is usually meant to be invisible,
or nearly so, while expressive abstraction is meant to be seen and indeed have a strong
effect on the way a viewer looks at an image. Though the goals are different, some
of the methodologies overlap. The goal of imperceptible omission has encouraged
researchers to look at perceptually motivated methods. Salience measures have been
applied to concentrate computation on noticeable areas [Yee et al., 2001, Cater et al.,
2003]. In addition, a variety of low level perceptual models have been applied to try
to quantify the visibility of features and guarantee that simplification is invisible, or
minimize visibility. We adopt several of these metrics in our own efforts. One of
our contributions can be seen as applying and expanding perceptual models originally
adopted in LOD to create expressive artistic abstraction.

Both perceptually motivated LOD methods and the methods we present in this
thesis use models of vision to identify expendable areas of an image. It is the functional
definition of an expendable area that differs between the two. In the following chapter
we present the relevant background in human vision necessary for understanding why
such areas exist, and how they may be identified.

Chapter 3
Human Vision
A background in human vision is essential in computationally defining artistic abstraction.
We have extraordinarily complex abilities to analyze images, and these abilities have
both weaknesses and strengths. Level of detail simplification methods seek to exploit the
limits of vision to cut corners in an unnoticeable way. In contrast, we hope to use the
related strengths of the visual system to improve visual design, clarifying content and
making things that need to pop out, pop out. Our interactive technique uses eye movements
and the limits of vision to indirectly measure the importance of features. Some
background will clarify the motivation for this approach.

3.1 Eye Movements

The human eye is maximally sensitive over a relatively small central area called the
macula. This area of relatively high resolution is approximately 5 degrees across, while
the most sensitive region (the fovea) is only 1.3 degrees (from a total visual angle of
about 160 degrees) [Wandell, 1995]. Sensitivity rapidly degrades outside of this central
region. Our perception of uniform detail throughout space is a result of continually
switching the point at which our eyes are looking (the point of regard or POR).

Figure 3.1: Patterns of eye movements of a single subject over an image when given
different instructions. Note (1) free observation which shows fixations that are relatively
dispersed yet still focused on relevant areas. Contrast it with (3) where the
viewer is instructed to estimate the figures’ ages. Reproduced from [Yarbus, 1967].

This process involves two important types of eye motions: fixations, relatively long
periods spent looking at a particular spot, and saccades, very rapid changes of eye position.
These are not the only kinds of motion of which the eye is capable. In smooth
pursuit the eye follows a moving object, and even when fixated the eye continually
makes very small jittery motions. Fixations and saccades however are the most significant
motions when viewing static scenery. Saccades can be initiated consciously, but
for the most part occur naturally as we explore a scene. Though fixating on a location
is not identical to attending it, for the most part an attended location is fixated (i.e. if
we pay attention to something, we strongly tend to look at it directly) [Underwood and
Radach, 1998].
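
To make the distinction between fixations and saccades concrete, the sketch below groups
a stream of point-of-regard samples into fixations using a simple dispersion threshold.
This is one common heuristic, shown here only as an illustration; it is not necessarily
the method used by any tracker or study discussed in this chapter. The function name,
the threshold values and the assumption of evenly spaced samples measured in degrees
are all hypothetical.

```python
def detect_fixations(samples, max_dispersion=1.0, min_samples=5):
    """Group consecutive point-of-regard samples into fixations.

    samples: list of (x, y) points of regard, assumed to be in degrees
    and recorded at a fixed rate. A run of samples counts as a fixation
    if (max x - min x) + (max y - min y) stays within max_dispersion
    for at least min_samples samples.
    Returns a list of (centroid_x, centroid_y, sample_count) tuples.
    """
    def dispersion(window):
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    fixations = []
    start = 0
    n = len(samples)
    while start + min_samples <= n:
        end = start + min_samples
        if dispersion(samples[start:end]) <= max_dispersion:
            # Grow the window while the points stay tightly clustered.
            while end < n and dispersion(samples[start:end + 1]) <= max_dispersion:
                end += 1
            window = samples[start:end]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((cx, cy, len(window)))
            start = end
        else:
            # Probably part of a saccade: slide the window forward.
            start += 1
    return fixations
```

With evenly spaced samples, the sample count of each fixation is proportional to its
duration, which is the quantity usually visualized by the size of a fixation marker.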

Figure 3.2: Similar effects to [Yarbus, 1967] are easily (even unintentionally) achieved
when using eye tracking for interaction. Circles are fixations, their diameter is propor-
tional to duration. The first viewer was instructed to find the important subject matter
in the image. The second viewer was told to ‘just look at the image’. This viewer assumed,
from prior experience in perceptual experiments, that he was going to be later
asked detailed questions about the contents of the scene. This resulted in a much more
diffuse pattern of viewing.

3.1.1 Eye Movement Control

Qualitatively, a great deal is known about fixations. Eye movements are highly goal
directed. Viewers don’t just look around at random. Instead, they fixate meaningful
parts of images [Mackworth and Morandi, 1967, Underwood and Radach, 1998, Hen-
derson and Hollingworth, 1998], and fixation duration is related to processing [Just
and Carpenter, 1976, Henderson and Hollingworth, 1998]. Viewing is highly influ-
enced by task. The classic example of this [Yarbus, 1967] showed that viewers ex-
amining the same image, with different tasks to perform, showed drastically differ-
ent patterns of viewing, in which they focused on the features relevant to their task
(see Figure 3.1). Given the same task, the motions of a particular viewer over an
image at different viewings can be quite different, yet the overall distribution of fixations
remains similar [Yarbus, 1967]. In real activities, actions, even those thought
of as automatic, are usually preceded by (largely unperceived) fixations of relevant
features [Land et al., 1999]. These effects have been noted from some of the earliest
research in the field [Yarbus, 1967], but the mechanisms involved remain for the most
part informally understood.

In general, understanding of most higher-level aspects of eye movement control
is largely qualitative. In limited domains such as reading, attempts have been made
to formulate mathematical models of viewing behavior. For complex natural scenes,
much less is known [Henderson and Hollingworth, 1998]. Clearly any information
used in guiding eye movements must come from the scene. Likewise, the process of
selecting a new location to view must be guided in part by low frequency information
gathered from the periphery during earlier fixations. A matter of debate is whether low-
level visual information gained like this is a direct control of behavior or whether it is
primarily used when integrated into a higher level understanding. The precise factors
involved in control and planning of eye movements are an active and highly debated
topic [Kowler, 1990].

3.1.2 Salience Models

Much effort has gone into attempts to identify purely low-level image measurements
that can account for a significant amount of viewing behavior. Clearly it would be inter-
esting if what appears to be a highly complex behavior requiring general understanding
could be modeled or at least reasonably predicted by a simple approach. Results have
been mixed. Fixation locations do not correlate very well over time with the presence
of simple low level image features such as areas of high contrast, junctions, etc.
[Underwood and Radach, 1998].

More complex models have been formulated, such as the salience methods men-
tioned earlier. All measure contrast in one sense or another. In general, salience meth-
ods embody the assumption that unusual features are likely to be important and looked
at. Choice of feature space, and scale of measurement and comparison differ. One
popular approach [Itti et al., 1998, Itti and Koch, 2000] uses center surround filters to
measure local contrast in color, orientation and intensity to model general viewing behavior.
Rosenholtz [2001] uses a probabilistic framework to measure the probability of
a feature given a Gaussian model of color or velocity in the surround. This was used to
predict visual search performance. A related salience framework was proposed [Walker
et al., 1998] to select unique image locations to match for image alignment. This ap-
proach used kernel estimation to measure the rarity of local differential features in the
global, image-wide distribution of those features.
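
The shared idea of measuring local contrast can be illustrated with a deliberately
simplified center surround operator on a grayscale image. The sketch below compares the
mean intensity in a small window with the mean in a larger one. It uses box averages on a
single scale and a single feature (intensity), which is an assumption made for brevity;
the published models cited above use multiple scales, several feature channels and a
normalization step that are not reproduced here. All names are hypothetical.

```python
def box_mean(image, row, col, radius):
    """Mean intensity in a square window clipped to the image bounds."""
    rows, cols = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(max(0, row - radius), min(rows, row + radius + 1)):
        for c in range(max(0, col - radius), min(cols, col + radius + 1)):
            total += image[r][c]
            count += 1
    return total / count


def center_surround(image, center_radius=0, surround_radius=2):
    """Absolute difference between a small and a large local mean."""
    return [
        [abs(box_mean(image, r, c, center_radius)
             - box_mean(image, r, c, surround_radius))
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]


# A single bright pixel on a dark background stands out from its surround.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
salience = center_surround(image)
```

In this toy example the response is largest at the isolated bright pixel, and a uniform
image produces no response anywhere, matching the intuition that unusual features are
the ones flagged as salient.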

These approaches share the same basic idea but vary in what they attempt to model.
This raises the question of what one is really trying to capture with salience. One can
look at salience as simply a quantitative method of deciding whether something is
present in a particular location in the visual field. In this context, salience doesn’t actu-
ally state the location is important, just that it might be because something is there. It
seems quite plausible that a measure like this plays a role in perception. However, more
is usually claimed for salience, for example that it predicts most of viewing behavior
or the valuable content in an image.

Salience would seem to have some additional predictive power because in a wide
class of images the semantically important subject does contrast with the rest of the
scene. Relatively few people take pictures of their family members dressed in camouflage
and lurking in the bushes. Nobody takes a picture of a single blade of grass in a field. The
tendency of meaningful features to be visually prominent is by no means universal. It
is also unclear if this is really a property of the world, or a property of pictures people
take, but it does seem to underlie some of the success of salience as an engineering tool
in graphics.

Salience models have also been used to model viewing in narrower domains where
their applicability is more clear. The presence or absence of pop out effects in search
for example [Rosenholtz, 1999, Rosenholtz, 2001] is effectively modeled by simple
salience models that measure how distracting a distracter actually is.

Debate about how useful salience is in understanding general viewing is ongoing.
Some optimistically state that salience predictions correlate well with real eye motions
of subjects free viewing images [Privitera and Stark, 2000, Parkhurst et al., 2002]. Others
are more doubtful and claim that when measured more carefully and in the context
of a goal driven activity, the correlation is quite poor [Land et al., 1999, Turano et al.,
2003]. This mismatch in experimental results fits the intuition that visually prominent,
‘eye catching’ features might be more correlated with idle exploration of a scene,
and much less related to eye movements made during a task. In spite of this contro-
versy, salience methods are quite popular and have seen a fair amount of application
in computer graphics. They show some correlation with visually prominent features
and are fairly simple to implement. Code for some is publicly available. Clearly both
semantics and low-level features play a part in eye movements. Further investigation
is necessary to clarify the contributions to viewing behavior of salience and scene se-
mantics. Though they seem unable to model important aspects of viewing behavior,
salience models may provide important measures of visual prominence.

3.2 Eye Tracking

Much of the knowledge above about human eye motion has been gained through the
use of eye-tracking. A system measures a viewer’s eye in one of several manners
and records the point where it is looking, termed the point of regard or POR. One
common approach involves a video camera and an infrared light source. The relative
positions of the pupil and corneal reflection in the resulting image are used to calculate
point of regard [Duchowski, 2000]. These systems are reasonably reliable and accurate
and improve with each generation, though they are still subject to drift over time and
variability between viewers. The same technology is used in producing units that sit
in front of a fixed display, and in head mounted units for use in more general scenes.
Video based trackers have the virtue of not interfering directly with a viewer, making
them useful as both a natural interactive method and a research tool.

Outside of research in human vision, eye-trackers have seen increasing use as a
mode of human computer interaction. Eye tracking has also enabled the use of eye movements
as a gauge of cognitive activity for psychological investigations and for evaluation of
visual displays.

Eye position has been used as a cursor for selection tasks in a GUI [Sibert and Jacob,
2000]. Eye movements have also been used to indicate a user’s attention to others in a
videoconferencing environment [Vertegaal, 1999]. Another class of use, related to ours, uses
POR to control simplifying images or scenes for efficiency purposes. Knowing where
a user looks enables pruning of information that is not perceptible, and need not be
transmitted in a video stream [Duchowski, 2000]. Similarly, unexamined content need
not be rendered in a 3D environment. In practice, few current systems that make use
of such simplification actually use eye tracking, presumably because of limited availability;
head tracking is typically used instead [Reddy, 2001].

On the whole, eye tracking has been found more useful in interaction where it
serves as an indirect measure of user interest. Eye movements are not under full vol-
untary control. Because of this, when viewers attempt to explicitly point with their
eyes the result tends to lack control and suffer from the so called “Midas Touch” prob-
lem [Jacob, 1993] where struggling to control eye position, like a cursor, based on
visual feedback creates even more uncontrolled looking, touching on many irrelevant
or undesirable locations.

The same involuntary link of eye movement to thought processes that makes eye
tracking a bad mouse has made it useful as an indirect measure of interest and cognitive
activity. Eye tracking has been used to evaluate the effectiveness of informational
displays including application interfaces [Crowe and Narayanan, 2000], web
pages [Goldberg et al., 2002], and air traffic control systems [Mulligan, 2002]. As
mentioned earlier, eye movements may even reveal information that viewers are trying
to report, but cannot, because it is not consciously available. Experiments have shown

that professional radiologists examining slides look longer at locations where tumors
are present, even when they fail to identify and report them [Mello-Thoms et al., 2002].
In the future, this might hold the promise of computer assisted technologies to avoid
such mistakes. Several consulting companies currently sell evaluation services using
eye tracking to graphic design houses and web content creators among others.

3.3 Limits of Vision

Eye movements are related to the resolutional limitations of the eye. At any of the fix-
ations with which a viewer explores a scene, the most detailed information is received
only in the fovea, but lower frequency information is received throughout the visual
field. These limits on sensitivity within the visual field are not a weakness of the visual
system. On the contrary, they are part of our ability to efficiently process wide fields
of view and integrate information across eye movements and changes in viewpoint.

3.3.1 Models of Sensitivity

Quantitative models of visual acuity and contrast sensitivity have been developed to
model sensitivity to stimuli with different properties. Models of acuity predict whether
an observer can detect a black feature of a particular size on a white background. Con-
trast sensitivity measures an observer’s ability to discriminate a repeating pattern of a
particular contrast and frequency from a uniform gray field. The drop-off in these sen-
sitivities away from the visual center is modeled as a function of eccentricity, location
relative to the point of fixation.
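
Since these models are expressed in degrees of visual angle, applying them to an image
on a screen requires converting pixel offsets into eccentricity. The following sketch
shows one way this conversion might be done. It assumes that the line of sight to the
fixation point is perpendicular to a flat display, and the function and parameter names
are hypothetical.

```python
import math


def eccentricity_degrees(point, fixation, pixels_per_cm, viewing_distance_cm):
    """Angular distance, in degrees, of a screen point from the fixation point.

    point, fixation: (x, y) positions in pixels.
    Assumes the viewer looks straight at the fixation point on a flat screen.
    """
    dx = point[0] - fixation[0]
    dy = point[1] - fixation[1]
    offset_cm = math.hypot(dx, dy) / pixels_per_cm
    return math.degrees(math.atan2(offset_cm, viewing_distance_cm))
```

For instance, a point whose on-screen distance from the fixation equals the viewing
distance lies at an eccentricity of 45 degrees, while the fixation point itself has
eccentricity zero.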

Contrast sensitivity has been studied extensively in a variety of conditions usually
using monochromatic sinusoidal gratings (smoothly varying, repeating patterns of light
and dark bands). This sensitivity declines sharply with eccentricity [Kelly, 1984, Mannos
and Sakrison, 1974, Koenderink et al., 1978]. Contrast threshold is defined as the
(unitless) contrast value (0 to 1 with 1 being maximal contrast) at which a grating and
uniform gray become indistinguishable. Contrast sensitivity is the reciprocal of this
threshold.

Figure 3.3: Log-log plot of contrast sensitivity (inverse contrast) against frequency
(cycles/degree) from equation (3.2). This function is used to define a threshold between
visible and invisible features.

Many researchers have empirically studied human contrast sensitivity and several
have developed mathematical models from their data. Researchers in computer science
have also used existing data and models in applications. Different aspects of a stimulus
are important in different situations. Fitting models to data collected from different
viewers under different circumstances gives somewhat different results. Two examples
are given here to illustrate the form these mathematical models take.

Kelly [1984] developed a mathematical model for the contrast sensitivity curve (at
the center of the visual field) including appropriate scaling factors describing the effects
of velocity (v) as well as frequency (f in cycles/degree) of a grating on sensitivity.

\[ A(f, v) = \left(6.1 + 7.3 \left(\log_{10}(v/3)\right)^{3}\right) v f^{2} e^{-2 f (v+2)/4.59} \tag{3.1} \]

Mannos and Sakrison [1974] fit a mathematical model appropriate to still imagery
to results of prior empirical studies for use as a metric in evaluating image compression.

\[ A(f) = s_{max} \, 2.6 \, (0.0192 + 0.144 f) \, e^{-(0.144 f)} \tag{3.2} \]

where $s_{max}$ is the peak contrast sensitivity (this is around 400, but varies from
person to person).
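
As a rough illustration of how such a curve can act as a visibility threshold, the
sketch below evaluates equation (3.2) as it appears above and compares a contrast value
against the reciprocal of the resulting sensitivity. The function names are hypothetical,
and the default peak sensitivity of 400 is simply the approximate figure quoted in the
text; a real application would need to calibrate it.

```python
import math


def contrast_sensitivity(f, s_max=400.0):
    """Equation (3.2) as written in the text; f is in cycles/degree."""
    return s_max * 2.6 * (0.0192 + 0.144 * f) * math.exp(-0.144 * f)


def is_visible(contrast, f, s_max=400.0):
    """True if a grating of the given contrast (0 to 1) exceeds the threshold.

    The contrast threshold is taken to be the reciprocal of the sensitivity.
    """
    threshold = 1.0 / contrast_sensitivity(f, s_max)
    return contrast > threshold
```

Under these assumptions a grating with contrast 0.01 is predicted to be visible near
the peak of the curve (around 7 cycles/degree) but not at 30 cycles/degree, reflecting
the hump-like shape of the function.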

3.3.2 Sensitivity Away from the Visual Center

A number of researchers have explored how sensitivity varies with eccentricity [Kelly,
1984, Rovamo and Virsu, 1979]. At larger eccentricities (expressed in degrees of visual
angle) the contrast sensitivity function is multiplied by another function which models
the drop-off of sensitivity in the visual periphery. This function is termed the cortical
magnification factor. It is not radially symmetric, but drops off faster vertically than
horizontally. It can be approximated [Rovamo and Virsu, 1979] with separate formulas
for decrease in sensitivity in four areas. For simplicity a bound from the most sensitive
area can be used in estimating visibility [Reddy, 2001, Reddy, 1997].

\[ M(e) = \frac{1}{1 + 0.29 e + 0.000012 e^{3}} \tag{3.3} \]

The cubic term can usually be ignored, as its contribution in the range of eccentricities
normal in a screen display is negligible [Reddy, 1997]. The contrast sensitivity is then
M(e) · A( f ).
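
The combined model can be sketched in a few lines. The code below assumes a unit
numerator in equation (3.3), so that M(0) = 1 and sensitivity at the visual center is
unchanged, and it reuses the form of equation (3.2) given earlier. Function names and
the default peak sensitivity are illustrative assumptions rather than part of any cited
model.

```python
import math


def cortical_magnification(e):
    """Equation (3.3) with an assumed unit numerator; e is in degrees."""
    return 1.0 / (1.0 + 0.29 * e + 0.000012 * e ** 3)


def sensitivity(f, e, s_max=400.0):
    """Contrast sensitivity M(e) * A(f) at frequency f and eccentricity e."""
    a = s_max * 2.6 * (0.0192 + 0.144 * f) * math.exp(-0.144 * f)
    return cortical_magnification(e) * a


def feature_visible(contrast, f, e, s_max=400.0):
    """True if the contrast exceeds the threshold 1 / (M(e) * A(f))."""
    return contrast > 1.0 / sensitivity(f, e, s_max)
```

With these definitions, a low contrast feature that is visible when fixated directly can
fall below threshold at an eccentricity of 20 degrees, which is the behavior the text
relies on when removing detail away from where a viewer looked. At an eccentricity of 10
degrees the cubic term contributes only 0.012 to a denominator of about 3.9, which
illustrates why it can usually be ignored.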

3.3.3 Applicability to Natural Imagery

Some caution is necessary in applying these models derived from simple monochro-
matic repeating patterns to complex natural imagery. Though these models have been
applied with good results in graphics [Reddy, 2001], our goal of creating visible ab-
straction rather than conservative level of detail is more ambitious, and more likely to
stress the models involved.

Figure 3.4: Cortical Magnification describes the drop-off of visual sensitivity with
angular distance from the visual center. (Plot of cortical magnification against
eccentricity in degrees, from 0 to 100.)

How to measure contrast is relatively obvious in gratings: there are only two extrema.
A single contrast exists for the entire grating. Between two regions in a scene
the meaning of contrast is less clear. Regions are neither uniform in color nor uni-
formly varying. No strong perceptually motivated approach to this problem appears to
have been formulated. Lillesaeter [1993] attempts to address this by defining a contrast
between a nonuniform figure and ground. This contrast measure is a weighted aver-
age of the contrast between the region and background and the integral of the contrast
along the edge of the region. This is demonstrated to provide more intuitive results
than simpler alternatives on regions with flat colors. Issues related to sampling in real
images are not addressed. Measuring contrast in a color image presents another prob-
lem. Contrast in colored gratings has been studied, and much work has been done in
general on color perception. However, there does not appear to be a simple general
contrast sensitivity model defined in color space [Regan, 2000]. Adapting a luminance
based model therefore remains a plausible course of action in designing a model for a
practical application.

Applying the notion of visibility for a grating to a non-repeating pattern of regions
also presents problems. The hump-like shape of the contrast sensitivity curve tells us
something counterintuitive if the size of an area is treated as proportional to an inverse
frequency [Reddy, 2001]. Very low frequencies are much less visible than some higher
ones at a given contrast. This is because detectability of a grating is related to the
density of receptive fields of corresponding size. There are upper bounds on the size of
human receptive fields. Intuitively, a large slowly varying sine wave may be difficult
to see.

This has been less of a concern in previous work where judgments were being made
mostly about high frequency parts of the curve [Reddy, 2001], but will be noticeable
when visibly abstracting images.

It can be argued [Reddy, 1997] that natural images, at least in places (and certainly
the uniform color regions that we will ultimately use in rendering), more closely
resemble square wave, rather than sine, gratings. Since a square wave can be written
as the sum of an infinite series of sine waves, and sensitivity to combined sinusoidal
patterns is closely related to that of the independent components [Campbell and
Robson, 1968], one might think the visibility for low frequency square waves would
be higher than that for equal frequency sine waves. The actual relation has been studied
empirically [Campbell and Robson, 1968] and confirms this intuition. For square
waves at frequencies below about 1 cycle/degree sensitivity levels off rather than drop-
ping. A theoretical derivation of the difference is presented in [Campbell and Robson,
1968]. It matches some but not all features of the empirical data.
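
The decomposition referred to above is the standard Fourier series of a square wave. For a unit-amplitude square wave of spatial frequency f (the position variable x, in degrees, is introduced here only for illustration), it can be written as:

```latex
s(x) = \frac{4}{\pi} \sum_{k=1}^{\infty} \frac{\sin\bigl(2\pi (2k-1) f x\bigr)}{2k-1}
```

Only odd harmonics appear, and the fundamental has amplitude 4/π, which is larger than the amplitude of the square wave itself.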

These concerns remind us that when applying these models to real images they
cannot serve as an accurate absolute perceptual measure of visibility. Rather, they
provide a plausible relative sense of the visibility of different features. The absolute
contrast or acuity threshold at which a feature becomes visible is not necessary for our
application. What is important is the relative ordering of feature visibility, which allows
us to create a prioritization. It is necessary to model visual sensitivity only up to the
level where results correspond to our intuitions about this prioritization.

To apply these models in actual scenes, we need to decide on a definition of the
features whose visibility we are judging with these methods. For example, these models
have been used in 3D level of detail [Reddy, 1997] to avoid rendering invisible
features. In this context the obvious choice of feature is a polygon which may or may
not be included in the rendering. For images the choice is less clear, as image prop-
erties can be measured in an unstructured, local way or an image can be partitioned
into a more structured representation. We review some of the possibilities for image
representation in the following chapter.

Chapter 4
Vision and Image Processing

4.1 Image Structure Features and Representation


Figure 4.1: (a) Scale space of one dimensional signal. Features disappear through
scale space but no new features appear. (b) Plot of inflection points of another one
dimensional signal through scale space. Reproduced from [Witkin, 1983].

Image representation and processing is a large field of relevance in both human and
computer vision. We concentrate on some basic concepts relevant to the task of simpli-
fying images. Scale space theory provides a way of characterizing the different scales
of information present in an image and making correspondences between features at
different scales. Segmentation divides an image into distinct regions, enabling an ex-
plicit, non-local representation of image content. Edge detection provides a measure
of the prominent boundaries in an image.

An important unifying concept in image analysis is that the same image data can
be represented in many forms. In any of these, certain information in the image is
explicit and other information is less easy to access [Marr, 1982]. The appropriate
information and representation are task dependent. A variety of representations with different
properties are available. With the exception of 3D techniques, NPR applications have
largely used low-level representations, often functioning locally on the original image
itself. However, human artistic processes operate on richer representations. Ruskin,
one of the 19th century’s most prominent art historians and theorists, famously argued
that in teaching art technique, the most important lesson was teaching the student to
see [Ruskin, 1858]. There seems to be an assumption in image based NPR that seeing
is simply capturing a bitmap representation of the scene, and that it can be considered
accomplished in the presence of a source photo. Human vision, however, is much more
than simply capturing an image. If a computer is to produce artistic renderings that
capture some of the expressiveness of real art, especially in highly abstracted styles,
some higher level representation is necessary, analogous to those created in the artist’s
head as she understands the scene before her and begins to paint. The better suited
this representation is to the task, the easier it should be to drastically simplify an image
while retaining its important features.

The lowest level representation is the image itself, analogous to the retinal image.
This is the starting point of any further representation, making explicit the light inten-
sities at each pixel. There is structure here that can be more explicitly represented in
other ways. Information in the image exists over a variety of scales, small and large
features, making up parts and whole objects in the scene.

One common way to come to terms with the multiple scales of information in an
image is through its scale space. From a single image, a three dimensional stack of
images is generated in which each contains progressively coarser scale information.
Again, this representation has an analogue in human vision where neurons have recep-
tive fields of different sizes, in effect generating a multi scale representation from the
retinal image.

Scale space has come to refer to such a space of increasingly simple images gener-
ated by a range of processes. Generically this can be thought of as a stack of images
with decreasing information contained at each level as scale increases. This stack is
in theory continuous, in practice sampled at some discrete interval. Starting with the
original image, detail is progressively lost until a uniform color is all that remains (see
Figure 4.1).

A number of constructions for such a space have been developed. Perhaps the sim-
plest approach creates something like an image pyramid, successively downsampling
the image so it is more coarsely pixelated. This approach has a problem in that de-
tailed, high frequency information (the edges between the new larger pixels) may have
been introduced which was not in the original image. This is the problem of spurious
resolution [Koenderink, 1984]. New information has been hallucinated into existence
by imposing a coarser grid structure on the data. Convolution with a Gaussian kernel
(blurring) generates a space that avoids this problem [Witkin, 1983, Koenderink, 1984].
In fact this blurring has been proven [Koenderink, 1984] to be the unique way to generate
a scale space which is both uniform, or uncommitted (i.e., the process is uniform
across image space and through the scale dimension), and also avoids spurious resolution.
Information disappears but cannot be created. In one dimension, this ensures
that any feature will only disappear as scale increases. In two dimensions new features,
maxima for example, can appear. However, in both cases clear judgments can be made
about what features exist at what range of scales.
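
The one dimensional case can be illustrated with a minimal sketch that blurs a noisy signal at several scales and counts the local extrema that survive. The test signal, the list of scales, the 4-sigma kernel truncation, and the reflecting border handling are all assumptions made for this example. A sampled, truncated Gaussian only approximates the continuous theory, so the counts are indicative rather than a proof.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Sampled, normalized Gaussian truncated at about 4 sigma."""
    radius = max(1, int(np.ceil(4.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def blur(signal, sigma):
    """Blur a 1D signal, reflecting it at the borders to limit edge artifacts."""
    k = gaussian_kernel(sigma)
    radius = len(k) // 2
    padded = np.pad(signal, radius, mode="reflect")
    return np.convolve(padded, k, mode="valid")  # same length as the input


def count_local_extrema(signal):
    """Count sign changes of the first difference (interior maxima and minima)."""
    d = np.diff(signal)
    d = d[d != 0]
    return int(np.sum(d[:-1] * d[1:] < 0))


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 256)
signal = np.sin(2 * np.pi * 3 * x) + 0.3 * rng.standard_normal(x.size)

sigmas = [1.0, 2.0, 4.0, 8.0, 16.0]          # coarser scale at each level
stack = np.stack([blur(signal, s) for s in sigmas])
counts = [count_local_extrema(level) for level in stack]
print(counts)
```

Each row of `stack` is one level of the scale space. For an image the same construction yields the three dimensional stack described above.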

That the process of blurring is uniform is an advantage in that filtering can be
applied to any signal; one doesn’t need a model of what the important features
present are. A disadvantage is that coarser features are more coarsely located: the
blurring process that reveals them distorts their spatial extent.

If you know what you’re looking for, there is no reason why the blurring operation
must be uniform or uncommitted. A number of nonuniform or nonlinear scale spaces
have been formulated which do not introduce false content but remove information
selectively in certain locations. One of the best known of such methods is anisotropic
diffusion [Perona and Malik, 1990]. Here the rate of diffusion is not uniform but rather
decreases as the magnitude of the gradient at a position increases. This results in
an edge preserving blurring which removes low contrast detail while preserving strong
edges. This has the advantage that edges are better preserved in their initial location
until the point at which they disappear. Niessen et al. [1997] compare this and several
other nonlinear methods in the context of segmentation. Nonlinear methods perform
well but are significantly more expensive.
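
A minimal sketch of this kind of edge preserving diffusion is given below, using an explicit four-neighbor update. The conductance function 1/(1 + (d/kappa)^2), the parameter values, and the periodic border handling are assumptions chosen to keep the example short; they are not taken from the systems described in this work.

```python
import numpy as np


def perona_malik(image, n_iter=20, kappa=0.1, lam=0.2):
    """Explicit 4-neighbor anisotropic diffusion with periodic borders.

    lam must stay at or below 0.25 for the update to remain stable.
    """
    img = image.astype(float).copy()
    for _ in range(n_iter):
        update = np.zeros_like(img)
        for axis in (0, 1):
            for shift in (1, -1):
                diff = np.roll(img, shift, axis=axis) - img  # neighbor minus center
                g = 1.0 / (1.0 + (diff / kappa) ** 2)         # small across strong edges
                update += g * diff
        img += lam * update
    return img


# A vertical step edge with a little noise added to both sides.
rng = np.random.default_rng(1)
image = np.zeros((32, 32))
image[:, 16:] = 1.0
noisy = image + 0.02 * rng.standard_normal(image.shape)
smoothed = perona_malik(noisy)

print("noise before:", noisy[:, 4:12].std(), "after:", smoothed[:, 4:12].std())
print("edge height after:", smoothed[:, 20:28].mean() - smoothed[:, 4:12].mean())
```

Low contrast noise is smoothed almost as in ordinary blurring, while the flux across the high contrast step stays small, so the step keeps its position and most of its height.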

A practical application must sample the continuous scale space at some discrete
intervals. One would like to sample sufficiently finely to capture interesting events,
the order of disappearance of different features, but not more densely than need be.
Looking at the linear scale space, Koenderink [1984] derives an appropriate sampling
as logarithmic along the scale axis corresponding to a uniform sampling in the scale
parameter t, the standard deviation of the Gaussian kernel used in blurring. This is
intuitive. At small scales many tiny regions are merging quite often, requiring dense
sampling. At higher scales, there are fewer regions, fewer events to capture, and much
less dense sampling in t is required. The issue is the same for nonlinear spaces. Re-
lating scales in different spaces is not straightforward. Some attempt at doing this has
been made in [Niessen, 1997].

Figure 4.2: Interval tree for 1D signal illustrating decomposition of the signal into a
hierarchy. Reproduced from [Witkin, 1983].

While a scale space such as this begins to capture structural relations of features
across scales, this is still largely an implicit representation. To make this explicit,
features at different scales need to be directly related to each other. Witkin [1983]
addresses this problem in 1D signals. In the scale space of a one-dimensional signal,
new features never appear at coarse scales. So, any feature found at a coarse scale
can (if the sampling is dense enough) be traced directly back to its fine scale origin.
This allows localization of features found at a coarse level. Witkin demonstrates this
using as features the zero crossings of the second derivative, which are the inflection
points of the signal (Figure 4.1).

Similarly, using these correspondences across scale it is also possible to create
a structure that captures the relationship between all features at all scales. Intervals
between two zero crossings (which again correspond to sections of the signal between
two inflection points) disappear in only one way. Two successive zero crossings merge
together, with the result that three intervals, the one between the crossings and those on
either side, merge into one. These three intervals can be made children of the resulting
interval to create an interval tree which characterizes the structure of the signal at all
scales. Witkin observes that those intervals which have longer persistence through
scale space appear to be those identified by human observers as subjectively salient or
important in the signal.

Extending this nice analytical derivation to a practical application in 2D is not
trivial. In 2D, features such as maxima, or curves defined by inflection points, may split
into two at coarser scales. Koenderink [1984] suggests the use of equiluminance curves
in the image as a 2D equivalent to Witkin’s intervals. Generic equiluminance curves
form a single closed curve. There are two singularities: extrema, where the curve is
just a point, and saddle points, where the curve forms multiple loops which intersect at
one point. Each loop may contain other saddle points and has to contain at least one
extremum [Koenderink and van Doorn, 1979]. The nesting of these saddle points gives
the structure of the image regions. Though new saddle points may appear inside a loop,
centermost saddle points must disappear before outer ones. Because of this the saddle
points present at all scales can be represented as a tree. Such a structure is difficult to
calculate in practice. It is not obvious how to find these saddle points efficiently or if
they provide a subjectively intuitive partitioning of the image. In addition it’s not clear
how color could be handled. In a naive approach, each band would produce its own
surface with its own saddle points, resulting in 3 separate scale space trees that would
need to be unified in some way.

4.2 Segmentation

The process described above of dividing up a signal based on the intervals between
features is a particular approach to the general problem of segmentation. This problem
again occurs in both computer and human vision. Segmentation makes explicit the
association (or disassociation) between different areas of an image. It produces an
explicit representation of parts of the image that are associated with each other, assigning
each pixel to one, usually connected, group or region. These regions should be uniform
by some measure. Separate regions, at least the adjoining ones, should be markedly
different. How people do this, parsing shapes and objects from the background, is only
partially understood. In computer vision, a tremendous variety of methods have been
devised to define similarity measures for this using color, gray scale intensity, texture
etc. This segmentation is usually a partitioning of an image at a single scale. However
it is sometimes desirable to define a segmentation over a range of scales.

Scale space has been considered in segmentation. It is typically used to make segmentations
produced with other methods more robust. Niessen et al. [1997] link pixels
with neighbors that have similar color in both the spatial and scale dimensions
to create a hierarchy. The end product is a single flat segmentation taking its set of
regions from a coarse scale and their spatial extent from a fine scale. A similar ap-
proach is taken in Bangham et al [1998]. Here, the desire is to create a hierarchical
segmentation tree that describes the image over a variety of scales. An alternate ap-
proach [Ahuja, 1996] creates a multi-scale representation without explicitly generating
a scale space.

Each of these methods computes a hierarchical representation of image structure.
However, there is no clear relation between the hierarchy and the theoretical hierarchy
induced by scale space. This is not a major concern; scale space structure is attractive
because of its simple formal definition, but is not the single correct answer in any
meaningful sense for a given practical application. Hierarchical representations are
not general purpose; desirable properties depend on the application. For the purposes
of image abstraction, an important question is whether each subtree in the structure
represents some coherent area or region. This is guaranteed in some geometric sense
by scale space proper, since nodes occur in the tree only when features disappear. In
contrast, methods for building a hierarchy that iteratively merge regions may have
many intermediate nodes that consist of fragments of regions that have not yet all
merged together. Such hierarchical representations are harder to use directly for tasks
that require meaningful regions like image abstraction.

Scale space captures the structure of intensities in the image, not the structure of
what is pictured. In some cases these may correspond closely (e.g., eyes, nose, and
mouth in a head), but this is not necessarily the case. A hierarchy that corresponds to
actual objects in the image would allow abstraction on an object by object basis. This
is a difficult problem. Some attempts have been made to rearrange subtrees based on
color (so for example a hole in an object would be associated with the background,
not the object) [Bangham et al., 1998]. Any general solution would need to draw on
analogues of high-level vision tasks that do not currently exist.

4.3 Edge Detection

Edges are another important image feature. A region is a uniform area. An edge is
the boundary or discontinuity between uniform regions. Edges are important in human
vision; they are one of the low level features built into the visual system. Edges are
commonly detected using derivative filters. Discontinuities produce a filter response
that can be processed to extract edges as chains of positions. Like regions, edges
themselves exist at a number of scales. Edges at different scales are usually detected
by using derivative filters of different widths or, equivalently, by first convolving the
image with a blurring kernel. A popular procedure for performing the detection and
thresholding is the Canny edge detector [Shapiro and Stockman, 2001]. An interest-
ing modification on this procedure adds a measure of how closely the local image
resembles an edge model [Meer and Georgescu, 2001]. This allows faint edges to be
captured while false alarm responses from texture can be suppressed. As with regions,
larger scale features are detected with less spatial resolution. Nonlinear anisotropic dif-
fusion [Perona and Malik, 1990] was originally proposed as a better method of blurring
images for coarse scale edge detection since it removes fine scale detail while better
preserving the shape and position of high contrast low frequency edges.
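
The effect of filter width can be sketched in one dimension with a derivative-of-Gaussian filter applied to a step edge plus a fine ripple standing in for texture. The signal, the two scales, and the helper names are assumptions for illustration. This is not the Canny detector or the edge-confidence method cited above, only the derivative filtering step they build on.

```python
import numpy as np


def gaussian_derivative_kernel(sigma):
    """Sampled first derivative of a Gaussian, truncated at about 4 sigma."""
    radius = max(1, int(np.ceil(4.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-0.5 * (x / sigma) ** 2)
    g /= g.sum()
    return -x / sigma ** 2 * g


def edge_response(signal, sigma):
    """Derivative filter response; borders are extended with constant values."""
    k = gaussian_derivative_kernel(sigma)
    radius = len(k) // 2
    padded = np.pad(signal, radius, mode="edge")
    return np.convolve(padded, k, mode="valid")  # same length as the input


i = np.arange(100)
step = (i >= 50).astype(float)                 # one strong edge between samples 49 and 50
ripple = 0.1 * np.sin(np.pi * i / 2.0)         # fine scale "texture"
signal = step + ripple

fine = edge_response(signal, 1.0)
coarse = edge_response(signal, 4.0)

print("strongest fine response at", int(np.argmax(np.abs(fine))))
print("strongest coarse response at", int(np.argmax(np.abs(coarse))))
print("texture response, fine vs coarse:",
      np.abs(fine[20:30]).max(), np.abs(coarse[20:30]).max())
```

The wide filter suppresses the ripple almost entirely but produces a broader, flatter peak at the step, which is the loss of spatial resolution described above.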

Edges and regions are related to each other. A region identifies a homogeneous
area. An edge indicates a break between two areas. Since edges and regions are closely
related but their identification usually draws on different measures, each feature can be
used to improve results from the other [Christoudias et al., 2002]. These two features,
regions and edges, provide a fairly complete representation of image content, one that
knowledge of human vision suggests is biologically important. These features are the
building blocks on which our rendering techniques will be built.

Chapter 5
Our Approach

The ideas and techniques above suggest a particular path for achieving minimally
interactive abstraction in computer renderings. Eye tracking can serve as a bridge between
the computer and a user’s interests and intentions. As the evidence discussed above
suggests, locations of a viewer’s fixations can be reasonably interpreted as identifying
content that is important to that viewer. Preserving detail where the viewer looked
and removing it elsewhere should create a sensible NPR image that captures what the
viewer found important. The nature of abstraction in art suggests this is worthwhile
because it may help future viewers see the main point of the image more easily. It may
even be able to steer successive viewers into the same patterns of viewing as the original
viewer, encouraging a particular interpretation of the image.

5.1 Eye tracking as Interaction

Eye tracking is used in our work as a minimal form of interaction to answer the ques-
tion of what is important in an image. Because eye movements are informed by seman-
tic features and fixations are economically placed, they provide a guide to important
features that require preservation. Because they are natural and effortless, they are a
desirable modality for interaction. Abstracting an image becomes as simple as looking
at the image, an action that requires no conscious effort.

Interaction in all of our systems proceeds in the same way. A viewer simply looks
at a photograph for a set period of time, usually around five seconds, while their eye
movements are monitored by an eye tracker. The recording is then processed to identify
fixations and discard saccades. The fixations found are taken to represent important
features, and detail in these locations will be preserved.

An apparent contradiction is worth noting here. There seem to be two possibilities
concerning this interaction. Either the viewer will look at distracting image features,
and these will be included in the rendering, making fixation data useless for determin-
ing importance. Or, the viewer will not look at supposedly distracting elements, in
which case removing the information seems to be of negligible value. This is related to
the basic question of whether abstracting images is itself worthwhile, or if we should
just let people do the abstraction in their head.

There are several responses to this. Firstly, there is a great deal of content like
texturing on a wall or detail in grass that is not directly examined but is certainly visible.
This kind of information will be removed. Artists certainly seem to manipulate this
kind of detail in a purposeful way, suggesting it is worth doing. In a particular style,
the prominence of small features could be emphasised inappropriately without this kind
of omission. An ink drawing of a field of grass with each leaf depicted in silhouette
would be distracting.

Even on occasions when the eyes fixate low level visual distracters, or regions of
high contrast or brightness, which don’t say anything about the image meaning, these
fixations can be expected to be shorter in duration [Underwood and Radach, 1998]. So
we can still hope to remove such distracters despite their being looked at, and avoid the
distraction for future viewers.

This suggests it will be important to have a model of attention that takes into ac-
count the length of a fixation. Considering fixation duration will allow brief fixations
to be discounted as distraction, while giving more weight to longer fixations at which
it can be assumed more processing occurs. The simple model we use for this is shown
in Figure 5.1 (b). This model is a piecewise linear function in which fixations below a
certain duration have no effect, ramping relatively quickly up to a maximal weight at
fixations of at least a particular duration. This is a very simple model considering the
complexity of visual cognition. More sophisticated attention models may be useful.
Various information might be useful such as the time course and grouping of fixations.

[Figure 5.1 panels: (a) labels p, (xi , yi), di (p), ei (p); (b) vertical axis 0 to 1, horizontal axis ti with tmin and tmax marked.]

Figure 5.1: (a) Computing eccentricities with respect to a particular fixation at p. (b)
A simple attention model defined as a piecewise-linear function for determining the
scaling factor ai for fixation fi based on its duration ti . Very brief fixations (below tmin )
are ignored, with a ramping up (at tmax ) to a maximum level of amax .
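
The piecewise-linear attention model of Figure 5.1 (b) can be sketched as a short function. The function name and the numeric values of tmin and tmax used in the example (0.1 and 0.3 seconds) are placeholders chosen for illustration; this section does not state the values used in the actual systems.

```python
def attention_weight(t, t_min, t_max, a_max=1.0):
    """Scaling factor a_i for a fixation of duration t (same units as t_min, t_max).

    Zero at or below t_min, a_max at or above t_max, linear ramp in between.
    """
    if t <= t_min:
        return 0.0
    if t >= t_max:
        return a_max
    return a_max * (t - t_min) / (t_max - t_min)


# Hypothetical thresholds, in seconds.
T_MIN, T_MAX = 0.1, 0.3
for duration in (0.05, 0.2, 1.0):
    print(duration, attention_weight(duration, T_MIN, T_MAX))
```

Lowering a_max below 1 scales every fixation's contribution down uniformly, which is one way to read the role of a global abstraction constant.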

5.2 Using Visibility for Abstraction

To apply fixation data to simplification, we need to link fixations to individual decisions
about what content to include. The models of visual sensitivity discussed above allow
us to decide what features are visible. However, a difficulty exists when applying
these models to image simplification. These models all define a boundary between
perceptible and imperceptible features. In our application the goal is to remove features
in a way that is perceptible but makes sense to the viewer and preserves meaning. On
a well positioned monitor, nearly everything should be visible, down to nearly a pixel
resolution in the course of a brief viewing, so an accurate model could tell us to include

In order to accomplish abstraction, the hope is that these models can be scaled back
by some global constant, representative of a particular amount of abstraction. This will
allow us to remove visible detail in a way that still reflects the relative perceptibility of
features given a particular viewer’s eye movements. By making our algorithm interpret
the viewer as having, in a sense, worse and worse eyesight, we can force it to remove
visible information in a way that is still perceptually motivated. To the best of our
knowledge this is the first work to attempt to formalize abstraction in this manner.

This constant scaling factor can be seen either as a separate scaling factor, indicating
degree of abstraction, or as part of the attention model mentioned above. When
folded into the attention model (Figure 5.1 (b)) as amax, it can be seen as representing
a certain background amount of detail, that is not interesting even when a location is

In general, a framework for accomplishing abstraction using this methodology includes
three main choices: first, a system of image analysis to represent the image content
in a meaningful way; second, a method of indicating which of this content is actually
important; third, a style in which to render the important content. In successive chapters
we present several systems that share the same basic interactive technique of eye
tracking described above but vary in the particulars of these decisions.

Chapter 6
Painterly Rendering

Our first system for creating abstracted NPR images does so in a painterly style. Painterly
rendering creates an image from a succession of individual strokes, mimicking the
methodology of a human painter. The intuition behind our system is that in most paint-
ing styles, painters use fewer and larger strokes to model secondary subject matter such
as background figures or scenery. This system is relatively simple, and served as an
initial proof of concept for our interactive technique. The model of image content is
simple and unstructured, the perceptual model equally minimal.

6.1 Image Structure

Our approach to painterly rendering [Santella and DeCarlo, 2002] has no explicit
model of visual form; it uses a simple scale space representation. This is possible because
painterly rendering is a somewhat forgiving style in which imprecision in stroke
placement typically is not distracting.

Our approach follows an existing methodology [Hertzmann, 1998], in which curved
strokes of different widths are painted with colors taken from the appropriate level of
a scale space of blurred versions of the original image. Our only model of image
contents is this scale space of images, and the corresponding image gradient at each
scale. This information is used in a standard algorithm to generate candidate strokes.
These strokes are the features to which we apply our perceptual model. The features
we consider are therefore not image features properly speaking, but strokes that exist
only in the rendering, not in the original image.

6.2 Applying the Limits of Vision

Given a choice of feature, we now need to make judgments about what features to
include. To do this we need to pick a model of visual sensitivity and decide how to
apply it to our system. The simplest model we could use would be an acuity model that
modulates brush size. This corresponds to considering each brush stroke in isolation
as a mark of maximal contrast with its background. Such a model is a fairly large
oversimplification, but provides an interesting starting point.

Simple acuity models like this have been used in graphics before [Reddy, 1997].
In that work, Reddy fit a function to the threshold frequencies provided by a contrast
sensitivity model for maximal contrast features at varying velocities and eccentricity.
This provides an acuity model that is simple to apply. It takes an input speed and
eccentricity and outputs a threshold frequency. Though simple to apply and based on
psychophysical data, this model is not useful for our purposes. Because the model is
crafted to be highly conservative, a fairly large central region 5.79 degrees in width
is assigned a maximal acuity. A conservative estimate like this may be desirable for
imperceptible LOD. Where the goal is to remove unattended information, this model is
overly conservative. The circular region where detail is maximal can be highly visible
and distracting in an abstracted rendering.

We would prefer a function closer in shape to the actual drop-off in visual sen-
sitivity. A similar model that provides more intuitive results is equally simple. The
maximum frequency humanly visible is assigned to just the center of the visual field
(G = 60 cycles/degree). From there sensitivity drops off as a function of the standard
cortical magnification factor, equation (3.3). This produces a continuous degrading of
detail from the center of vision purely on a frequency basis. This value is scaled by the
simple attention model described in Section 5.1 to produce a final frequency threshold.
Each fixation defines a potentially different threshold at each point in the image. The
highest threshold (usually corresponding to the closest fixation) is used as the actual
threshold at point p, termed fmax (p).

We model a brush stroke with width D as a half cycle in a grating and compare
the resulting frequency f = 1/(2D) to the cutoff provided by our perceptual model. The
stroke is included if it is lower in frequency than the cutoff.
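
The threshold computation and stroke test described in this section can be sketched as follows. Several details are assumptions made for the example: eccentricity is obtained from pixel distance with a single hypothetical pixels_per_degree constant (a small-angle approximation), the cubic term of equation (3.3) is dropped, stroke width is expressed in degrees, and the tmin and tmax values are placeholders.

```python
import math

G = 60.0  # maximum visible frequency at the center of vision, cycles/degree


def cortical_magnification(e_deg):
    """Equation (3.3) with the cubic term omitted (assumed numerator of 1)."""
    return 1.0 / (1.0 + 0.29 * e_deg)


def attention_weight(t, t_min, t_max, a_max):
    """Piecewise-linear attention model of Figure 5.1 (b)."""
    if t <= t_min:
        return 0.0
    if t >= t_max:
        return a_max
    return a_max * (t - t_min) / (t_max - t_min)


def f_max(p, fixations, pixels_per_degree, t_min=0.1, t_max=0.3, a_max=1.0):
    """Highest frequency threshold at pixel p over all fixations (x, y, duration)."""
    best = 0.0
    for x, y, duration in fixations:
        e_deg = math.hypot(p[0] - x, p[1] - y) / pixels_per_degree
        a = attention_weight(duration, t_min, t_max, a_max)
        best = max(best, a * G * cortical_magnification(e_deg))
    return best


def stroke_is_included(width_deg, threshold):
    """A stroke of width D is a half cycle, so its frequency is 1/(2D)."""
    return 1.0 / (2.0 * width_deg) < threshold


fixations = [(100.0, 100.0, 1.0)]   # one long fixation, hypothetical data
near = f_max((100.0, 100.0), fixations, pixels_per_degree=30.0)
far = f_max((400.0, 100.0), fixations, pixels_per_degree=30.0)
print(near, far)
print(stroke_is_included(0.02, near), stroke_is_included(0.02, far))
```

With these numbers a fine stroke passes the test at the fixated point but is rejected ten degrees away, where only wider strokes remain below the cutoff.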

6.3 Rendering

To actually produce candidate brush strokes our approach uses a standard algorithm
[Hertzmann, 1998] to approximate the underlying image with constant color spline
strokes that generally run perpendicular to the image gradient. These strokes originate
at points on a randomly jittered grid. An extended stroke is created by successively
lengthening the stroke by a set step size in the direction perpendicular to the image
gradient. When the color difference between the start and end points of the stroke
crosses a threshold, or the stroke reaches a maximal length the stroke terminates (see
[Hertzmann, 1998] for further details).
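
The stroke growing step described above can be sketched for a grayscale image. This is a simplified illustration, not Hertzmann's algorithm as published: it returns raw control points rather than a spline, stops in flat regions instead of continuing straight, compares intensity only, and uses placeholder values for the step size, maximum length, and color threshold.

```python
import numpy as np


def trace_stroke(image, start, step=2.0, max_points=8, color_threshold=0.1):
    """Control points (x, y) of one stroke grown perpendicular to the gradient."""
    gy, gx = np.gradient(image)            # derivatives along rows (y) and columns (x)
    h, w = image.shape
    x, y = float(start[0]), float(start[1])
    start_color = image[int(round(y)), int(round(x))]
    points = [(x, y)]
    prev_dx, prev_dy = 0.0, 0.0
    while len(points) < max_points:        # maximal stroke length
        ix, iy = int(round(x)), int(round(y))
        dx, dy = -gy[iy, ix], gx[iy, ix]   # gradient rotated by 90 degrees
        norm = np.hypot(dx, dy)
        if norm < 1e-12:
            break                          # flat region: no preferred direction
        dx, dy = dx / norm, dy / norm
        if dx * prev_dx + dy * prev_dy < 0:
            dx, dy = -dx, -dy              # do not reverse along the stroke
        nx, ny = x + step * dx, y + step * dy
        if not (0 <= nx <= w - 1 and 0 <= ny <= h - 1):
            break                          # left the image
        end_color = image[int(round(ny)), int(round(nx))]
        if abs(end_color - start_color) > color_threshold:
            break                          # color drifted too far from the start
        x, y = nx, ny
        points.append((x, y))
        prev_dx, prev_dy = dx, dy
    return points


# Horizontal intensity ramp: the gradient points along x, so strokes run along y.
ramp = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
stroke = trace_stroke(ramp, start=(10, 5))
print(stroke)
```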

Our method varies in a few particulars. When strokes are created by moving per-
pendicular to the image gradient, they can be excessively curved and worm like in
appearance. Hertzmann’s strokes are B-splines, which can meander to a sometimes
excessive extent. Real paint strokes do not tend to do this. Even in paintings by artists
that one thinks of as using very salient curving strokes, for example van Gogh, com-
pound curves are made up of multiple gently curving strokes. In response to this our
maximal stroke length is shorter and we use a single Bézier curve for each stroke, using
a subset of the calculated control points. This produces somewhat more natural, sim-
ple curves. However, even these can curve more sharply than is normal with real paint

Strokes are painted from coarse to fine with the entire image being covered with
the coarsest strokes used. For finer scale strokes, a choice is made at each stroke origin
point whether a stroke of that size is necessary based on the perceptual model. Only
necessary strokes are generated.

Figure 6.1: Painterly rendering results. The first column shows the fixations made by
a viewer. Circles are fixations, size is proportional to duration, the bar at the lower
left is the diameter that corresponds to one second. The second column illustrates the
painterly renderings built based on that fixation data.

In addition to removing detail, artists also use color as a vehicle for abstraction.
Vibrant colors and high contrast can enhance the importance of a feature or make it
easier to see. Muted color and contrast can de-emphasize unimportant items. Stroke
colors can be adjusted to achieve this [Haeberli, 1990]. Our perceptual model provides
a means of deciding where to make these adjustments. For instance, lowering the
contrast in unviewed regions makes them even less noticeable; raising it emphasizes
viewed objects.

Since color contrast is not well understood [Regan, 2000] we use a simple ap-
proach to adjust colors. Though where we apply these manipulations is controlled by
our perceptual model, the extent of these manipulations was simply picked by experi-
mentation. We start by defining a function of location u(p) ranging from 0 (where the
user did not look at point p) to 1 (where the user fixated p for a sufficiently long period
of time). This is defined as the ratio between the perceptual threshold at point p, and
the maximal threshold possible:

$$u(p) = \frac{f_{\mathrm{max}}(p)}{a_{\mathrm{max}}\, G} \qquad (6.1)$$

We then adjust color locally for each stroke based on this function.

• Contrast enhancement: Contrast is enhanced by extrapolating from a blurred version
of the image at p out beyond the original pixel value. The amount of extrapolation
changes linearly with u(p), being cmin when u = 0 and cmax when u = 1
(a cmin and cmax of 1 would produce no change). cmin and cmax are global style
parameters for controlling the type of contrast change. For example, choosing
[cmin, cmax] to be [0, 2] raises contrast where the user looked, and lowers contrast
where they didn’t. (Default: [cmin, cmax] = [0, 2].)

• Saturation enhancement: Colors can also be enhanced; colors are intensified
in important regions and de-saturated in background areas. The transformation
proceeds the same as with contrast, now specified using [smin, smax], and extrapolating
between the original pixel value and its corresponding luminance value.
As an example, choosing [smin, smax] to be [0, 1] just desaturates the unattended
portions of the image. (Default: [smin, smax] = [0, 1.2].)
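
The two adjustments above can be sketched in a few lines. This is only an illustration under stated assumptions: colors are floating point RGB in [0, 1], the adjustment is applied per pixel rather than per stroke, and luminance is approximated with Rec. 601 weights (the text does not say which luminance is used). The function names are ours.

```python
import numpy as np

def lerp_gain(u, g_min, g_max):
    """Gain that is g_min where u = 0 and g_max where u = 1."""
    return g_min + np.asarray(u, dtype=float) * (g_max - g_min)

def adjust_contrast(rgb, blurred, u, c_min=0.0, c_max=2.0):
    """Extrapolate from the blurred color through the original color.

    A gain of 1 returns the original; 0 returns the blurred image.
    """
    c = lerp_gain(u, c_min, c_max)[..., None]
    return np.clip(blurred + c * (rgb - blurred), 0.0, 1.0)

def adjust_saturation(rgb, u, s_min=0.0, s_max=1.2):
    """Extrapolate from each pixel's luminance through its original color."""
    # Assumed luminance weights (Rec. 601); not specified in the text.
    lum = (rgb @ np.array([0.299, 0.587, 0.114]))[..., None]
    s = lerp_gain(u, s_min, s_max)[..., None]
    return np.clip(lum + s * (rgb - lum), 0.0, 1.0)
```

Here `rgb` and `blurred` have shape (H, W, 3) and `u` has shape (H, W). With the default [0, 2] contrast range, a pixel with u = 0.5 receives a gain of 1 and is left unchanged.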

Figure 6.2: Detail in background adjacent to important features can be inappropriately
emphasized. The main subject has a halo of detailed shutter slats.

6.4 Results

Results from this technique capture some of the abstraction present in paintings (see
Figure 6.1). Focal objects are emphasized with tighter rendering and more intense color and
contrast. Background features are accordingly de-emphasized. All this is done with
virtually no effort on the part of the user. In contrast, hand painting strokes or even
painting a detail map to control stroke size would require greater effort.

This painterly rendering framework has some limitations. Of course neither the
placement nor individual appearance of paint strokes seriously mimics real paint. Realistic
paint strokes were never a goal here. How to accomplish this to various degrees
of approximation is fairly well understood. Placement of strokes is a more interesting
shortcoming of this approach. Using few strokes is a major part of painterly abstraction.
In our renderings, despite throwing out small strokes in most places, too many
strokes of too small a size are used to approximate any given part of the image. Other
painterly rendering methods exist which could be used to carefully place strokes while
retaining our underlying methods for choosing detail levels [Shiraishi and Yamaguchi,

Figure 6.3: Sampling strokes from an anisotropic scale space avoids giving the image
an overall blurred look, but produces a somewhat jagged look in background areas.

Figure 6.4: Color and contrast manipulation. Side by side comparison of rendering
with and without color and contrast manipulation (precise stroke placement varies between
the two images due to randomness).

Aside from the limitations of the rendering techniques we’ve appropriated, our approach
to abstraction has some inherent limitations of its own. Since there is no explicit
model of image structure, detail can only be modulated in a continuously varying way
across the image. Stroke placement is designed to respect edges; however, the size of
strokes does not. Detail spreads from important locations to neighboring areas, creating
distracting haloing artifacts (see Figure 6.2).

In addition, sampling coarse stroke colors from blurred versions of the image
tends to give results an overall blurry look, especially when downsampled. An artist
wouldn’t actually blur colors and shapes in this way, but would preserve more of a
region’s original appearance while removing detail. Abstraction in paintings, even when
objects are rendered indistinctly, is not just blurring. High frequency information is
not blurred out. Rather, small elements are removed completely. One way to accomplish
this would be to sample strokes from an anisotropic scale space that blurs out
detail while still retaining sharp edges. An example of a painterly rendering that does
this is shown in Figure 6.3. Stroke detail has been modulated by the same perceptual
model. Color has been left unchanged. This image does preserve more of the original
image structure. However, because intensities have not been blurred together in the
background, imperfections in placement of coarse strokes are emphasized. In back-
ground areas strokes appear jagged, their orientations varying excessively. A more
global strategy for orienting [Hays and Essa, 2004] and placing [Gooch and Willem-
sen, 2002] strokes along with blending might help overcome these difficulties.

Chapter 7
Colored Drawings

Some of the limitations of a purely local model of image structure are addressed by our
second NPR system. This approach utilizes a richer model of image structure to create
images in a line drawing style with dark edge strokes and regions of flat color [DeCarlo
and Santella, 2002]. This style resembles ink and colored wash drawings or lithograph
prints such as those of Toulouse-Lautrec. However, it is something of a minimal style.
Images are rendered with the primitives of our new image representation itself. The
more structured image representation used here allows us to create a more simplified
visual style. It also allows us to control detail across the image in a discontinuous way.
This means we can keep detail from leaking from important objects to unimportant
surrounding regions.

7.1 Feature Representation

The primary image representation underlying this visual style is a hierarchical seg-
mentation that approximates the scale space structure of the image. As noted above,
analytical calculation of this structure is problematic. Our choice of approximation was
motivated by the desire for reasonable efficiency and also by a requirement that regions
at each level of the tree should, as much as possible, be reasonable areas to draw. Any
of them might wind up in a rendering with some given distribution of detail.

7.1.1 Segmentation

Our approach is to use a robust mean shift segmentation independently at each level of
a linear scale space of blurred images (which are downsampled for efficiency). We use
publicly available code to accomplish this (

Figure 7.1: Slices through several successive levels of a hierarchical segmentation tree
generated using our method.

To avoid artifacts from downsampling, a dense image pyramid is created. Each
level of the pyramid is smaller than its predecessor by a factor of square root of two.
Each level is segmented. Once each of these independent segmentations has been gen-
erated, the regions from each separate level are then linked to a parent region chosen
from the next coarser segmentation. This is done using a simple method that chooses a
parent region with maximal overlap, using color information to disambiguate difficult
choices, see [DeCarlo and Santella, 2002]. This is not a particularly robust method
of assignment. We are however conservative, putting off merging if there is no parent
that is a reasonably good match. These ill matching regions are propagated up to the
parent level (implying that the tree is not necessarily complete). Because the segmen-
tations on each level are themselves quite good, difficult situations do not often occur.
This approach allows leveraging existing segmentation methods and implementations,
and is flexible enough to incorporate alternate segmentation methods or alternate scale
spaces, like anisotropic diffusion.
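
The maximal-overlap linking step described above might be sketched as follows. This is a simplified illustration, not the implementation from [DeCarlo and Santella, 2002]: it assumes both label maps have been resampled to the same grid, it omits the color-based disambiguation, and the `min_overlap` threshold used to defer a poor match is a hypothetical parameter.

```python
import numpy as np

def link_parents(child_labels, parent_labels, min_overlap=0.5):
    """Map each child region id to the parent id it overlaps most.

    A child whose best parent covers less than `min_overlap` of its area
    maps to None: the merge is deferred and the region is carried upward.
    """
    child = np.asarray(child_labels).ravel()
    parent = np.asarray(parent_labels).ravel()
    links = {}
    for cid in np.unique(child):
        parents_under_child = parent[child == cid]
        ids, counts = np.unique(parents_under_child, return_counts=True)
        best = int(np.argmax(counts))
        fraction = counts[best] / parents_under_child.size
        links[int(cid)] = int(ids[best]) if fraction >= min_overlap else None
    return links
```

Regions mapped to None correspond to the conservative case in the text, where merging is put off because no parent is a reasonably good match.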

Inherently, the way different colored regions corresponding to different objects
merge together at very coarse scales is somewhat unpredictable and unstable. An
engineering decision was to simply not sample very coarse scales. The coarsest scale
segmentation was selected to still contain a fair number of regions. Coarser scales with
only a few regions are not usually useful for rendering images. For all images (most
at 1024x768 resolution) 9 levels of downsampling were used. All the coarsest scale
regions were simply set as children of the tree root. Figure 7.1 illustrates some of the
slices through one of these trees.
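
As a quick check on how coarse the top of the pyramid is, the level sizes can be computed directly. The sketch below assumes that the 9 levels include the full-resolution image as level 0 and that sizes are rounded to the nearest pixel; neither detail is stated in the text.

```python
import math

def pyramid_sizes(width, height, levels=9):
    """Image size at each pyramid level, shrinking by sqrt(2) per level."""
    sizes = []
    for k in range(levels):
        scale = math.sqrt(2.0) ** k
        sizes.append((round(width / scale), round(height / scale)))
    return sizes
```

Under these assumptions a 1024x768 image is reduced by a total factor of 16, so the coarsest level is 64x48 pixels.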

Though good, these segmentations are neither perfect on each level nor in decisions
made across scale. Certain features consistently present difficulties. Textures tend to
be oversegmented. Smoothly varying regions are broken into a number of patchy,
roughly constant color areas. Better segmentation techniques could be incorporated
into our method as they materialize.

A separate concern is the extent to which an image region hierarchy is appropriate
for abstraction, since image structure does not directly correspond to object structure.
Our system makes no attempt to turn a scale space structure into one that represents
the structure of actual scene objects [Bangham et al., 1998]. This is a difficult prob-
lem though there are some interesting possibilities for applying image understanding
techniques [Torralba et al., 2004] to add a top down component to the bottom up image
structure generated here.

For the time being we have addressed this problem by creating a simple interface
that allows interactive editing of the segmentation tree. A user can take a segmentation
tree as a starting point, and manually assign children areas to different parents, or split
and merge regions. With a fair amount of effort this could allow an object hierarchy
to be built. More realistically, occasional segmentation errors can be corrected fairly
easily. On the whole, however, we have not found this hand editing necessary. It was not
used in any of the published results in DeCarlo and Santella [2002]. In creating a larger
set of 50 renderings for [Santella and DeCarlo, 2004a] it was used to correct a few
prominent segmentation errors in a handful of the images. Most segmentation errors
that violate object boundaries appear high in the segmentation tree. These incorrect
regions are rendered only when the area is highly abstracted. In these abstracted areas
violations of object boundaries are usually not particularly distracting.

7.1.2 Edges

Our initial implementation tried rendering dark edges using a subset of the edge borders
of the regions. This did not on the whole provide sufficiently clean and flexible edges.
Edges extended beyond where desired. Because only average color for each region
was stored, significant region boundaries were difficult to distinguish from boundaries
due to smooth shading variation. Because of this a separate representation of edges
was used. Edges were detected with a robust variant of the Canny edge detector [Meer
and Georgescu, 2001]. Edges were detected on only one scale. This presents some
limitations but was enough to capture a reasonable set of important edges.

The combination of edges and regions provides a reasonable model of image con-
tent. The structure of this model will allow us to abstract images by more intelligently
removing detail in a coherent region-based manner.

7.2 Perceptual Model

Our richer model of image structure provides us the opportunity to use richer models
of visual perception to interpret eye tracking data in the context of our image. Though
useful, a perceptual model based only on frequency like that used in our painterly
rendering system is limited.

For example, in the output of our painterly system simplified parts of the image are
still painted with quite a large number of strokes. Most of these strokes are quite similar
in color. If larger strokes were used in all background areas, those background areas
with content would be completely obliterated. As it is, the long uniformly colored
strokes introduce detail, exaggerating the local color variation in mostly blank areas
(this is akin to the problem of aliasing or spurious resolution). This is in sharp contrast
to skilled examples of real painting where coarsely rendered forms of near uniform
color are blocked in with just a few very large strokes. The problem is that strokes
are selected based purely on size. It would be desirable to remove features based on
contrast as well as size.

This system continues to use a frequency model like that in [Santella and DeCarlo,
2002] for judging line strokes. Here, frequency is taken as proportional to the length of
the stroke, rather than its width. This decision is somewhat arbitrary, though it results
in the intuitive behavior of shorter lines being more easily filtered out. Since strokes
are rendered in our system with a width proportional to their length, one could look at
the perceptual model as measuring the prominence of the feature being drawn rather
than the original edge.

We can use a contrast sensitivity model to judge the visibility of regions that have
both a size and color. Of the available contrast sensitivity models, Equation (3.2) seems
appropriate because it is derived from a variety of experimental data and has been
used in a computational setting on real images. This is the model we use [Santella and
DeCarlo, 2002].

As mentioned earlier, a perceptual model by itself will remove little content. Most
of the features we are interested in are visible. Some kind of global scaling down of
relative visibility is necessary to create an interesting amount of abstraction. Several
possibilities for scaling the model exist. Scaling can be applied to frequency making
smaller regions progressively less visible and removing them from the image as was
done for painterly strokes. Scaling in frequency is fairly intuitive in some respects.
Given a desired smallest feature size, it is simple to derive a scaling factor that will
include features of this frequency foveally, and degrade size further at larger eccentric-
ities [Santella and DeCarlo, 2002].

Scaling frequency in a contrast sensitivity model is a bit problematic. The contrast
sensitivity function has a hump-like shape (see Figure 3.3). Visibility degrades at both
extremes of frequency. This produces unintuitive behavior. When scaling up frequen-
cies, a region can become more visible as it becomes smaller, before becoming less
visible. Several approaches to solving this are possible. As mentioned above, there is a
different pattern of contrast sensitivity for square wave gratings. A square wave model
might be more appropriate for our region features since, like a square wave grating,
they have a sharp (high frequency) boundary. A simple mathematical model for the
square wave contrast sensitivity function does not seem to be available but an approx-
imation built by analyzing the frequency spectrum of the square wave signal has been
derived [Campbell and Robson, 1968]. Using this model is a bit complicated. An al-
ternative is to simply use the maximal sensitivity for regions larger than the frequency
corresponding to the peak sensitivity. This replaces the low frequency slope of the
function with a horizontal line at peak sensitivity. This could be considered a first level
of approximation of the square wave model, which very roughly states that visibility is
largely governed by the most visible frequency component in a square grating.

Another possible approach for scaling to remove content is to scale only in the
contrast domain. After calculating a contrast threshold value this can be scaled before
being compared to the actual contrast of a region. This behaves in a more intuitive
manner, as more scaling always removes more content.

From our experimentation, scaling in the contrast domain seems to be the more
useful option for scaling region visibility. It corresponds more to our intuitive sense of
which detail should be removed first, as a fairly wide range of frequency regions are
desired in a final image. Removing large low contrast regions using frequency scaling
also removes the few desired high frequency regions. Contrast scaling preferentially
removes lower contrast regions, which looks better. This may be due in part to our
segmentation method. Because it breaks shading into many patchy regions, there is a
desire to get rid of these sometimes large but low contrast shading regions faster than
proportionally smaller but higher contrast features.
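
The two ideas discussed above, flattening the low frequency side of the sensitivity curve and scaling the threshold in the contrast domain, can be illustrated with a toy model. The curve `toy_csf` below is a hypothetical hump-shaped stand-in, not Equation (3.2), and eccentricity is ignored; only the capping and the contrast scaling are the point of the sketch.

```python
import math

def toy_csf(freq, peak_freq=4.0, peak_sensitivity=100.0):
    """Hypothetical hump-shaped sensitivity curve peaking at peak_freq."""
    x = freq / peak_freq
    return peak_sensitivity * x * math.exp(1.0 - x)

def capped_sensitivity(freq, peak_freq=4.0, peak_sensitivity=100.0):
    """Replace the low-frequency slope with a flat line at peak sensitivity."""
    if freq <= peak_freq:
        return peak_sensitivity
    return toy_csf(freq, peak_freq, peak_sensitivity)

def is_visible(contrast, freq, contrast_scale=1.0):
    """Keep a region when its contrast exceeds the scaled threshold."""
    threshold = 1.0 / capped_sensitivity(freq)
    return contrast > contrast_scale * threshold
```

Because `contrast_scale` only multiplies the threshold, increasing it can never make a previously removed region visible again, which is the monotonic behavior that makes contrast scaling the more intuitive control.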

Ultimately, we combined both of these approaches in our final system [DeCarlo and
Santella, 2002]. Scaling is applied only to contrast and region frequencies are capped
at a minimum value to keep the visibility of low frequency regions from degrading
too much. This is an approximate patch to problems in the applicability of the current
contrast sensitivity model. The most perceptually realistic approach is still an open

Once a contrast threshold has been calculated, the contrast of a particular feature
needs to be measured for comparison to that threshold. As mentioned above, contrast
models are derived over gratings in which there is one relatively obvious way to measure
contrast. Our regions are more complex: they include color and are non-uniform
(in the initial image).

One approach is to measure contrast between the average color of parent and child
regions in the tree. This could be thought of as capturing whether the next branching
represents a significant change. This works reasonably, but can be susceptible to chance
features in the segmentation tree. For example, two pairs of black and white regions
might merge to form two gray regions. These then merge to form a single gray region.
The difference between parent and child region colors on this level would be minor.

An approach that seems more successful is to use the average of the contrasts be-
tween a region’s color and those of adjoining regions in the same level of the tree. In
this case it would be possible to take a cue from [Lillestaeter, 1993] and measure con-
trast between regions as a weighted average of contrast between their average colors
and the contrast across their shared edge in the initial image. The correct scale at which
to measure contrast across the edge is however unclear. We choose ultimately to mea-
sure contrast between a region and adjoining regions on that level using only region
color. The contrast for the region is the average of contrast with each of its neighbors
weighted by extent of the border they share.

Since we are not aware of any simple color contrast framework, we take a simple
approach to measuring contrast between individual pairs of region colors. We use a
slight variation of the Michelson contrast: $\|c_1 - c_2\| / (\|c_1\| + \|c_2\|)$. Each
color exists in the perceptually uniform L*u*v* color space, so the measure provides a
steady increase with
increasing perceptual differences in color. This formula sensibly reduces to the stan-
dard Michelson contrast in monochromatic cases. More sophisticated models of color
perception for regions may help the system work better on a broader class of images.
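
The contrast measure and the border-weighted average over neighbors described above can be written compactly. This is a minimal sketch: colors are assumed to already be L*u*v* triples, the `neighbors` list of (color, shared border length) pairs is a hypothetical data layout, and the zero-denominator guard is our addition.

```python
import numpy as np

def color_contrast(c1, c2):
    """||c1 - c2|| / (||c1|| + ||c2||) for two L*u*v* colors."""
    c1 = np.asarray(c1, dtype=float)
    c2 = np.asarray(c2, dtype=float)
    denom = np.linalg.norm(c1) + np.linalg.norm(c2)
    return 0.0 if denom == 0.0 else float(np.linalg.norm(c1 - c2) / denom)

def region_contrast(color, neighbors):
    """Mean contrast with same-level neighbors, weighted by shared border.

    `neighbors` is a list of (neighbor_color, shared_border_length) pairs.
    """
    total = sum(length for _, length in neighbors)
    if total == 0:
        return 0.0
    weighted = sum(color_contrast(color, c) * length for c, length in neighbors)
    return weighted / total
```

For two achromatic colors with lightness 75 and 25 the measure gives (75 - 25) / (75 + 25) = 0.5, matching the standard Michelson contrast.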

7.3 Rendering

For regions, we choose to emphasize abstraction by removing even more information
than was directly specified by the (scaled) perceptual model. Instead of applying the
perceptual model to regions uniformly across the image, we divide the image into
foreground regions where the standard perceptual model is used, and background regions
where a more aggressive version of the model is used. Unfixated background areas are
identified. In them, a high constant eccentricity is used in place of the actual distance
to the nearest fixation, removing more detail in these regions.

Foreground regions are regions that appear to have been examined by the viewer. A
region is considered examined if a fixation rests in its bounding circle, or if the region
is in a subtree that is approximately fovea sized and centered on a fixation. Foreground
regions could be identified by searching the segmentation tree from its leaves upward.
In practice, it is simplest to descend the tree, applying the default model to regions
that contain fixations. When a subtree that matches a fovea is identified, the standard
model is applied throughout this subtree. When a subtree does not contain a fixation,
the background model is used. This avoids having to touch the majority of elements in
the tree that will not be rendered.
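
One way to picture the top-down descent is the sketch below. It is a simplification under several assumptions: regions carry a bounding circle, "approximately fovea sized" is reduced to a hypothetical radius threshold (`fovea_radius`), and the result only records which model applies at the subtree roots that were visited. The `Region` class and all names are illustrative.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    cx: float
    cy: float
    radius: float                      # bounding-circle radius in pixels
    children: list = field(default_factory=list)

def holds_fixation(region, fixations):
    """True if any fixation (x, y) lies inside the region's bounding circle."""
    return any(math.hypot(x - region.cx, y - region.cy) <= region.radius
               for x, y in fixations)

def choose_models(root, fixations, fovea_radius=25.0):
    """Descend the tree, returning {name: model} for the subtree roots visited.

    Unvisited descendants inherit the model of their nearest visited ancestor.
    """
    models = {}
    stack = [root]
    while stack:
        region = stack.pop()
        if not holds_fixation(region, fixations):
            models[region.name] = "background"   # stop: subtree is unexamined
        else:
            models[region.name] = "standard"
            if region.radius > fovea_radius:     # keep descending above fovea size
                stack.extend(region.children)
    return models
```

Descent stops as soon as a subtree is either unexamined or fovea sized, so most of the tree is never touched.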

Applying the model produces a trimmed tree. The leaves of this smaller tree are
the regions that will be rendered into an image. The set of selected regions and lines
are then rendered by drawing the regions as areas of flat color and drawing the edges
in black on top of them. This is a simple style and there are relatively few explicit style

One important stylistic choice is region smoothing. Before rendering, regions are
smoothed by an amount proportional to their size. This removes high frequency detail
along the border of large regions giving the regions a smooth, organic look. Smooth-
ing is needed because the spatial extent of a region is taken from the union of its
child leaves at the lowest level of segmentation. Without smoothing the region border
would contain inappropriate distracting detail. Edges are smoothed by a small constant
amount. The resulting misalignment between highly smoothed region boundaries in
abstracted areas and the corresponding edges adds to the ’sketchy’ look of results.

Edges are filtered in several ways based on their length, in order to eliminate clutter
from the many fragmentary edges resulting from edge detection. Very short edges of
only a few pixels are directly thrown out. Somewhat longer edges (by default less than
15 pixels in length) are drawn only if near the border of a region and included by the
acuity model. Longer edges are compared only against the acuity threshold.

Selected edges are drawn with a width proportional to their length. Edge width
interpolates between a minimal and maximal thickness (3 and 10 pixels) for edges
between 15 and 500 pixels in length. Thickness is constant outside this range. Edges
also taper to points at either end. This is a rough approximation of the appearance of
many lines in traditional media illustration. Thickness being proportional to length is
a debatable choice. It occasionally produces odd results, but does succeed in adding
some variation to line weight, while capturing the sense that long lines are usually more

Rendering lines tapered at either end serves the additional purpose of disguising the
sometimes broken nature of automatically detected edges. Without this, the fact that a
single edge is often broken or dotted in many places tends to be highly distracting. Our
lines make no attempt to simulate the fine grained look of real brush or pen and ink
lines, something which has been done [Gooch and Gooch, 2001] and could be applied
if desired.
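
The width rule given earlier in this section (3 to 10 pixels for edges between 15 and 500 pixels long, constant outside that range) amounts to a clamped linear interpolation. A minimal sketch, with tapering at the ends not modeled and parameter names of our choosing:

```python
def edge_width(length, min_len=15.0, max_len=500.0, min_width=3.0, max_width=10.0):
    """Line width in pixels: linear in edge length, clamped outside the range."""
    t = (length - min_len) / (max_len - min_len)
    t = min(1.0, max(0.0, t))
    return min_width + t * (max_width - min_width)
```

An edge of 257.5 pixels, halfway through the range, is drawn 6.5 pixels wide.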

7.4 Results

Some results from this system can be seen in Figure 7.2. The visual style of flat colored
regions and dark lines is attractive. The abstraction provided by eye tracking seems to
succeed in highlighting important areas. How important this abstraction is can be seen
by comparing the result with abstraction to those in Figure 7.4. Images with uniformly
high detail look excessively busy, while those with uniform low detail appear overly
simplified. Abstraction is vital in producing clear images with a distribution of detail
that is not distracting.

Uniformly detailed images are created by removing the cortical magnification factor
from the perceptual model. The constant scaling factor on contrast now provides
a single global control for simplification. This model is applied uniformly across the
image. Regions are still removed based on contrast sensitivity; low contrast and small
regions are eliminated first. However, the effect of this uniform simplification is not
nearly as successful.

Figure 7.2: Line drawing style results.

Figure 7.3: Stylistic decisions. Lines in isolation (a) are largely uninteresting.
Unsmoothed regions (b) can look jagged. Smoothed regions (c) have a somewhat vague
and bloated look without the black edges superimposed.

Figure 7.4: Renderings with uniform high and low detail.

The additional scaling factors on the region and edge perceptual models provide
detail sliders for the user. These preserve the relative distribution of detail between
examined and unexamined locations while reducing or increasing the overall amount
of detail. Most images pictured use similar settings for these values. Though there is
some variability from image to image in what global amount of detail looks best, a
single scaling factor usually looks acceptable on most images. Tweaking for the very
best look involves searching only a small area of parameter space.

Figure 7.5: Several derivative styles of the same line drawing transformation. (a) Fully
colored, (b) color comic, (c) black and white comic

Results confirm our intuition that regions and edges are complementary features.
Both are necessary to create comprehensible images. Each feature in isolation fails to
convey the scene. This is illustrated in Figure 7.3. Edges in isolation, in part because
of broken outlines, fail to convey the sense of a scene made up of solid objects. Regions
in isolation do clearly make up a scene, but without edges their smoothed borders are
distracting. As in the ink and colored wash styles commonly used in illustration, dark
edges add a kind of definition that is difficult to achieve with color alone.

Regions and edges make up our rendering style, but they can also be considered
building blocks of many other styles. Figure 7.5 illustrates some trivial derivative
styles, a color comic book look created by thresholding dark regions to black, and
a black and white comic look created by fully thresholding the image. More interest-
ing styles can also be built from the same building blocks. A natural possibility that we
have not attempted is a painterly style taking advantage of this structured image model.
Watercolor would be a particularly interesting possibility [Curtis et al., 1997]. Regions
could provide areas of color to fill with simulated watercolor, while rather than being
explicitly rendered, strong edges could indicate locations where a hard edge, wet in dry
technique should be used.

We have argued that abstraction is a quality of any visual display designed with
the purpose of clear communication. Even depictions usually considered true to life
contain similar kinds of abstraction. Photorealist painters do this with subtle manipu-
lations of tone and texture. Photographers composing studio shots do the same thing
by manipulating the physical objects present. Graphic artists touching up photographs
act similarly, editing out small distracting features. In the next chapter we present a
very simple preliminary attempt to define a semi-automatic photorealistic abstraction
using the same techniques we’ve applied to artistic rendering.

Chapter 8
Photorealistic Abstraction

Our goal is to perform abstraction in a photographic style, removing detail while preserving
the sense that an image is an actual photograph. This is a more challenging
goal than abstraction in an artistic style. Artistic styles provide a clear statement that
an image does not directly reflect reality, and provide a fairly free license to change
content. Viewers are much less forgiving of artifacts in an image that claims to be an
accurate depiction of reality. The approach we present here is far from a solution to
this problem but presents some interesting images, and suggests semi-automatic
photographic abstraction is possible.

8.1 Image Structure

The technique used here is simple. Anisotropic diffusion [Perona and Malik, 1990] is
used to smooth away small, light detail while preserving strong edges. Our contribution
is using eye tracking data to locally control the amount of simplification, allowing for
meaningful photorealistic abstraction.

Figure 8.1: Mean shift filtering tends to create images that no longer look like photographs.

Mean shift filtering is an alternative space for simplification. In theory it should
provide greater control of the simplification process. Achieving a photo-like result with
it is a bit tricky since small areas quickly converge to a constant color. A high contrast
discontinuous border is then visible between that region and adjoining areas that have
converged to a different mode. Though an interesting form of image simplification,
mean shift filtered images no longer look like photographs (see Figure 8.1).

Though its definition implicitly takes into account edges, anisotropic smoothing is
defined on a flat image. In performing abstraction we’d still like to use the structure
of the image to avoid artifacts like those seen in our painterly rendering system, where
detail leaks from important features into surrounding background. To achieve this, we
combine a segmentation with eye tracking data to create a piecewise constant impor-
tance map reflecting the interest shown in each image region. This importance image
will control the amount of smoothing performed.

8.2 Measuring Importance

Though the attention model we used in the work above took dwell time into account
to tell if a fixation was really meaningful, it was not primarily intended to measure the
relative importance of different areas of the image. As long as a fairly long fixation
was present in an area, it was considered important. Relative importance was largely a
function of distance from a fixation. Here however, we want to capture a finer grained
measure of relative feature importance to control a more delicate process of abstraction.

As mentioned above, the length of fixations tells us something about how important
a feature is because they relate to time spent processing the content of a location [Just
and Carpenter, 1976]. Fixations can vary quite widely in duration. It appears from
our initial experiments that the total dwell time in two fixated locations does provide at
least a rough measure of their relative importances. With this in mind, our strategy is to
create an importance map that is brighter in important areas. This is done by coloring
a segmentation using an estimate of the total amount of time spent fixating different
parts of the image.

We begin by breaking an image into regions using a flat mean shift segmentation
of the image. Using a multiscale segmentation based on our previous work is an in-
teresting possibility, which we have so far not explored. Based on fixations, we assign
a weight, which can be considered an importance value, or empirical salience to each
region in the image. Conceptually, we wish to count the amount of time spent fixating
each region in our segmentation. However, a fixation might not actually rest within the
boundary of the region that best represents the feature being examined. No segmen-
tation is perfect, so for example a fixation may rest in a region that represents half of
the object examined. In addition drift and noise in the point of regard as measured by
the eye tracker can cause fixations to sit just over a boundary in another object entirely.
Because eye trackers are at best accurate to about a degree of visual angle (roughly
25 pixels in our setup), noise is particularly apparent when a small object is fixated.
Depending on how small the object is, the corresponding fixation has a good chance
of lying within a surrounding background region. Some smoothing of containment to
deal with this problem was implicitly built into our previous work because bounding
circles were used to calculate intersections between fixations and regions.

To explicitly deal with this we make a soft assignment between each fixation and
each region and weight each region by the sum of these values. This smooths the
containment of fixations in regions because each fixation contributes to a range of
segments near it, rather than simply the one it rests in.

To set the contribution of a fixation $f_i = (x_i, y_i, t_i)$ to a segmented region $r_j$, we
compute the average $A$ of the distances between the fixation and all points in the region.
We define a threshold distance $T = 175$ pixels (more generally about 7 degrees of visual
angle), and the contribution of fixation $f_i$ to $r_j$’s weight is equal to $(1 - A/T)$ if $A < T$,
and 0 otherwise. The weight for each region is the sum of the weights contributed by each
fixation. Weights are capped at a maximum viewing time of 1 second. The region is
then drawn into the importance map using this intensity.
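
The soft assignment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the results: the function name, the dictionary representation of regions, and the `use_duration` flag are hypothetical. In particular, the text does not spell out how the fixation duration $t_i$ enters the sum; scaling each contribution by $t_i$, so that the 1 second cap is in units of time, is an assumption of this sketch.

```python
import math

def region_weights(fixations, regions, threshold=175.0, cap=1.0, use_duration=True):
    """Soft assignment of fixations to regions.

    fixations: list of (x, y, t) tuples, with t the duration in seconds.
    regions:   dict mapping a region id to a list of (x, y) pixel coordinates.
    Returns a dict mapping each region id to a weight in [0, cap].
    """
    weights = {}
    for region_id, pixels in regions.items():
        total = 0.0
        for fx, fy, ft in fixations:
            # A: average distance between the fixation and every pixel of the region.
            avg = sum(math.hypot(px - fx, py - fy) for px, py in pixels) / len(pixels)
            if avg < threshold:
                contribution = 1.0 - avg / threshold
                if use_duration:
                    # Assumption: weight the contribution by fixation duration.
                    contribution *= ft
                total += contribution
        # Cap the accumulated weight (1 second of viewing time by default).
        weights[region_id] = min(total, cap)
    return weights

# Two small regions: one under the fixation, one far outside the threshold.
regions = {
    "cup": [(x, y) for x in range(10, 13) for y in range(10, 13)],
    "desk": [(x, y) for x in range(300, 303) for y in range(10, 13)],
}
weights = region_weights([(11, 11, 0.4)], regions)
```

Here the fixated region receives a weight just under its 0.4 second fixation duration, while the distant region receives no weight because its average distance exceeds the threshold.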

The result of this is an image where each region in the segmentation has a constant
color that reflects the total time spent examining that region. This is fundamentally dif-
ferent from the perceptual measures used in our other approaches. It is not a measure
of visibility, but instead a measure of how much something has been looked at. It is
similar conceptually to the subject/background distinction used to render background
areas particularly abstractly in our line drawing style. While that was a binary dis-
tinction, this approach creates a relative measure of importance using fixation length.
More sophisticated ways of matching fixations and regions are possible but we have
found this sufficient for our prototype.

The resulting importance map is then used in a very straightforward way. At each point
in the image, n iterations of anisotropic diffusion are performed, where n interpolates
linearly between 1 at the brightest parts of the importance map and an abstraction
parameter M at the darkest parts (M = 250 in most results shown here).
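
As a rough sketch of this step, the following Python maps importance to a per-pixel iteration count and applies a simple diffusion in which each pixel stops updating once its count is reached. The text does not name a specific diffusion scheme; the Perona-Malik style update, the conduction function, and the parameters `k` and `step` are assumptions of this example, and the function names are hypothetical.

```python
import math

def iteration_map(importance, max_iters=250):
    """importance: 2D list of values in [0, 1], where 1 is the brightest (most important).
    Returns per-pixel iteration counts: 1 where importance is 1, max_iters where it is 0."""
    return [[round(1 + (1.0 - w) * (max_iters - 1)) for w in row] for row in importance]

def diffuse(image, iters, k=10.0, step=0.2):
    """Perona-Malik style diffusion on a 2D grayscale image (list of lists).

    Pixel (r, c) is updated only during its first iters[r][c] passes.
    k is the gradient weight: large k approaches uniform blurring, small k
    preserves almost all intensity variation.
    """
    height, width = len(image), len(image[0])
    current = [list(row) for row in image]
    for n in range(max(max(row) for row in iters)):
        updated = [list(row) for row in current]
        for r in range(height):
            for c in range(width):
                if n >= iters[r][c]:
                    continue  # this pixel has used up its iterations
                flow = 0.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        diff = current[rr][cc] - current[r][c]
                        # Conduction falls off as the local difference grows.
                        flow += math.exp(-(diff / k) ** 2) * diff
                updated[r][c] = current[r][c] + step * flow
        current = updated
    return current

counts = iteration_map([[1.0, 0.0]], max_iters=250)   # [[1, 250]]
```

A pure Python loop like this is far too slow for full images at M = 250; it is meant only to make the interpolation and the per-pixel stopping rule concrete.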

8.3 Results and Discussion

Some results of this process are pictured in Figure 8.2. The abstraction is much subtler
than in other styles, but when viewed at high resolution an interesting subtle falloff
in detail, largely small scale texture, is noticeable. This captures at least a bit of the
effect that can sometimes be seen in photorealistic paintings where low contrast detail
seems to disappear, while some high contrast details, for example in reflections and
specularities, appear to be emphasized.

Figure 8.4 illustrates the importance of taking region boundaries into account in
assigning importance to image regions. If importance just varies locally based on dwell
time in the vicinity, fixated objects have a halo of detail around them.

A number of limitations to this approach are obvious. Clearly it doesn’t capture
the flexibility an artist uses in abstracting texture and removing entire elements. Even
for performing simple textural abstraction it is limited. Importantly, the total amount
of abstraction possible without creating disturbing artifacts is limited. When relatively
few iterations of smoothing are performed, abstraction is limited and small high contrast
features in the least important areas remain quite distinct. In contrast, if many
iterations of smoothing are performed, blurring becomes very apparent and the image
takes on a foggy appearance that distracts from the scene (see Figure 8.5).

Figure 8.2: Photo abstraction results

Figure 8.3: Photo in (a) is abstracted using fixations in (b) in a variety of different
styles. (c) Painterly rendering, (d) line drawing, (e) locally disordered [Koenderink
and van Doorn, 1999], (f) blurred, (g) anisotropically blurred.

Figure 8.4: (a) Detail of our approach, (b) the same algorithm using an importance
map where total dwell is measured locally. Notice in (b) the leaking of detail to the
wood texture from the object on the desk. Here differences are relatively subtle, but in
general it is preferable to allocate detail in a way that respects region boundaries.

There is an interesting unanswered question here of what features are important
for the percept of a realistic image. What about an image makes it appear like a photograph,
as distinct from a highly finished traditional painting or a painting by a photorealist
artist? The range of contrasts is clearly one cue that has been shown to be
perceptually important for material recognition [Adelson, 2001, Fleming et al., 2003].
Figure 8.3 provides an interesting comparison of abstraction performed using a number
of different methods, which give very different impressions.

Though a principled understanding of the perception involved is ultimately necessary,
there are various techniques that might currently be brought to bear on this
problem. Anisotropic diffusion provides a gradient weight that controls how sensitive
blurring is to the local gradient. At one extreme, blurring is uniform. At the other
extreme, variations in the image are so carefully respected that almost no blurring occurs.
This parameter could also be varied based on importance, though it is not clear
how useful this would be. A similar process of filtering using a mean shift
or bilateral filtering framework might provide more control. Though mean shift filtering
tends to produce images that no longer look like photographs, a careful scheme for
controlling the number of iterations and the color and spatial scale of filtering might overcome
this problem. Ultimately, to capture a wider variety of artistic effects, a more structured
understanding of texture and grouping of scene elements is important; these are more
difficult problems.

Figure 8.5: The range of abstraction possible with this technique is limited. With
greater abstraction the scene begins to appear foggy. In some sense it no longer looks
like the same scene.

Chapter 9
Though the problem of photorealistic abstraction is difficult, our results for artistic
styles suggest we have succeeded in achieving meaningful abstraction. Results look
interesting and the reduction of detail does not seem visually jarring. Often in graphics
this kind of informal impression is evaluation enough.

This is not an illegitimate viewpoint. In the context of art or entertainment, formal
evaluation may not be necessary. An appeal to visual intuitions about what looks good
can be enough. Though our methods are targeted at creating artistic images for entertainment,
we are interested in applying these techniques to illustration or visualization.
Because of this we would like to be able to empirically evaluate the claim that our system
can direct viewers to areas highlighted with detail. Even this does not categorically
prove the technique actually makes images easier to understand. Being able to show
this would require a visualization application, where goals and task related factors are
in play. However, showing that abstraction directs visual interest would prove a quantifiable
perceptual effect resulting from our technique. To establish this we perform
a user study, comparing viewing behavior over our images to viewing of the original
photographs and renderings in our style created with several different distributions of
detail.

We first motivate our choice of evaluation technique, then present the specifics of
how we conducted our experiments. Results and some implications of our findings are
then discussed. Our aim is not just to validate our system, but is instead threefold:

• Present a method of evaluation new to NPR (Section 9.2), one based on tracking
viewers’ eye movements.

• Use this method to provide quantitative validation for our system (Section 9.4)
as well as interesting new insights into the role of detail in imagery (Section 9.5).

• Explain why this evaluation methodology is widely applicable in NPR, even
when the NPR system itself does not use eye tracking.

9.1 Evaluation of NPR

Prior methodologies used to evaluate NPR fall into one of two categories. The first
method polls a representative number of users, collecting their opinions to find out
how they respond to the system. Schumann et al. [1996] polled architects for their
impressions of sketchy and traditional CAD renderings, and based on the results, ar-
gued for the suitability of sketchy renderings for conveying the impression of tentative
or preliminary plans. Similarly, Agrawala and Stolte [2001] demonstrate the effective-
ness of their map design system using feedback from real users.

The second approach measures users’ performance at specific tasks as they use a
system (or its output). When the task depends on information gained from using the
system, performance provides a measure of how effectively the system conveys infor-
mation. An early study [Ryan and Schwartz, 1956] looked at the time required to judge
the position of features in photos and hand rendered illustrations in different styles.
Faster responses suggested that simpler illustrations were clearer. Interrante [1996]
assessed renderings of transparent surfaces using medical imaging tasks. Performance
provided a measure of how clearly the rendering method conveyed shape information.
Gooch and Willemsen [2002] tested users’ ability to walk blindly to a target location
in order to understand spatial perception in a non-photorealistic virtual environment.
Gooch et al. [2004] compared performance on learning and recognition tasks using
photographs and NPR images of faces. Heiser et al. [2004] evaluated automatic instructional
diagrams by having subjects assemble physical objects and assessing their
speed and errors. Investigations like this draw on established research methodologies
in psychology and psychophysics.

Both of these methods have their limitations. For example, the goal of imagery is
not always task related. In advertising or decorative illustration (and possibly in much
fine art) the goal is more to attract the eye than to convey information. Success is mea-
surable, but not by a natural task. Surveys have their own limitations. The information
desired may not be reliably available to subjects by introspection. In addition, both task
performance and user approval ratings assess only the quality of a system as a whole.
Neither directly says why a pattern in performance or experience occurs. To understand
this, the system needs to be systematically changed and the experiment repeated. This
process can be costly and time consuming (or impossible). Any additional information
that aids the interpretation of results is therefore highly valuable.

We evaluate only one of the several styles of rendering presented in this work, the
segmentation based line and color drawings [DeCarlo and Santella, 2002]. This system
was chosen for evaluation in large part because it is the most developed of these sys-
tems. It is also an interesting candidate for evaluation in that it performs a very clean
aggressive kind of simplification. Unlike the other methods, it removes everything
from abstracted regions, leaving them completely featureless. There is also no randomness
in the algorithm, as opposed to the painterly rendering system where there
are random variations in stroke placement. This allows multiple, otherwise identical,
renderings for comparison to be created with different distributions of detail.

Our hope is that removing detail can enhance image understanding. Further, suc-
cessive viewers may be encouraged to examine the image in a way similar to the first
viewer, and take away a similar meaning or impression. There is no natural task in
which to evaluate this effect, because our goal is creating artistic imagery rather than
visualizations for some task. Systematic questioning of viewers might substantiate the
intuition that the images are well designed, but would not inform future work.

Here, we present an alternate evaluation methodology which draws on established
psychophysical research. This approach analyzes eye movements and provides an objective
measure of cognition. It can be the basis of evaluation, or provide complementary
evidence when a task or other method is available. Regardless of the context in
which the user is viewing an image, the common factor is the act of looking. This
mediates all information that passes from the display to the user. In all of this work,
this key insight has provided an easy and intuitive method for abstraction. For the same
reason, we apply eye tracking to evaluation. These choices are independent; evaluation
via eye tracking is a general methodology that can be used regardless of how imagery
is created. Our study also looks at renderings that are created without the use of eye
tracking.

9.1.1 Analysis of Eye Movement Data

Basic parsing of eye movements into fixations and saccades has already been discussed
in Section 3.1. Once individual fixations have been isolated, it is often useful to impose
more structure on the data. In looking at an image, viewers examine many different features,
some closely spaced on a single object, others more distant. A common pattern
of looking is to scan a number of different features and then return to particularly
interesting ones. Multiple close fixations suggest interest and increased processing in
the same location. Because of this, cumulative interest in a location is often a valuable
measurement. This was used as the basis of an importance map in our system for
photorealistic abstraction (Chapter 8). When the location of features is known, this is
often measured by counting viewing time spent within a bounding box [Salvucci and
Anderson, 2001]. When there are no predetermined features, clustering can be used
to characterize regions of interest in a data driven fashion [Privitera and Stark, 2000].
Nearby fixations are clumped together, yielding larger, spatially structured units of visual
interest. The number of clusters indicates the number of regions of interest (ROI)
present, and the number of points contained in them provides a measure of cumulative
interest. This is achieved using a mean shift clustering that considers only the x,y positions
of locations viewed [Santella and DeCarlo, 2004b]. In the experiment described
in the next section this will reveal important information about how viewers look at

Figure 9.1: Example stimuli (panels: Photo, Detail Points, High Detail, Low Detail,
Salience, Eye Tracking). Detail points in white are from eye tracking; black detail
points are from an automatic salience algorithm.

9.2 Experiment

9.2.1 Stimuli

The images used in this experiment were 50 photographs, and four NPR renderings
of each photo for a total of 250 images and five conditions. Most photos were taken
from an on-line database (footnote 1). Photos spanned a broad range of scenes. Images that could
not be processed successfully were avoided, such as blurry or heavily textured scenes.
Prominent human faces were also excluded, although human figures were present in
a number of the images. All NPR images were generated using the method of De-
Carlo and Santella [2002] presented in Chapter 6. The four renderings differed in how
decisions about the inclusion of detail were made.

The five conditions are pictured in Figure 9.1. They are:

Photo: This is the unmodified photograph.

High Detail: A low global threshold on contrast ensures that most detail is retained,
removing primarily areas of low contrast texture and shading.

Low Detail: A high contrast threshold is used, removing most detail throughout
the image. The resulting image is drastically simplified but still for the most part

Eye Tracking: Detail is modulated as in [DeCarlo and Santella, 2002], using a
prior record of a viewer’s eye movements over the image. Detail is preserved in locations
the original viewer examined (here we call these locations detail points) and
removed elsewhere. The eye tracking data was recorded from a single subject who
viewed each image for five seconds (and was instructed to simply look at the image).

Salience Map: Detail is modulated in the same manner as eye tracking, but the
detail points are selected automatically by a salience map algorithm [Itti et al., 1998, Itti
and Koch, 2000] (see footnote 2). The algorithm has a model of the passage of time, so, like
fixations, each point has an associated duration. Five seconds worth of detail points were created.
The locations viewed by people and chosen by the salience algorithm can be similar in
some cases, but in general result in renderings with noticeably different distributions
of detail.

This set of conditions represents a systematic manipulation of an image. The effects
of NPR style, detail, and abstraction are separated. Local simplification is present in
two forms: one based on a viewer, and the other on purely low level features. Because
detail is controlled by choosing the levels of a hierarchical segmentation, simplified
images consist of a subset of the features in higher detail images. The eye tracking and
salience conditions are rendered literally using a part of the tree used to render the high
detail condition, while the low detail case generally includes the least content.

9.2.2 Subjects

Data was collected from a total of 74 subjects, including 50 undergraduates participating
for course credit and 24 subjects (graduate and undergraduate) participating for

9.2.3 Physical Setup

All images were displayed on a 19 inch LCD display at 1240 x 960 resolution. The
screen was viewed at a distance of approximately 33.75 inches, subtending a visual
angle of approximately 25 degrees horizontally. Eye movements were monitored us-
ing an ISCAN ETL-500 table-top eye-tracker (with a RK-464 pan/tilt camera). The
movement of the pan/tilt unit introduced too much noise in practice; it was not active
during the experiment. Instead, subjects placed their heads in an optometric chin rest
to minimize head movements.
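
The stated visual angle can be checked with a short calculation. This is a sketch under one assumption not given in the text: that the 19 inch figure is the diagonal of a 4:3 panel, so the visible width is four fifths of the diagonal. The function name is hypothetical.

```python
import math

def visual_angle_deg(size, distance):
    """Angle in degrees subtended by an extent of `size` viewed head-on
    from `distance` (both in the same units)."""
    return math.degrees(2.0 * math.atan(size / (2.0 * distance)))

diagonal_in = 19.0
width_in = diagonal_in * 4.0 / 5.0   # assumption: 4:3 panel, width = 4/5 of the diagonal
angle = visual_angle_deg(width_in, 33.75)
pixels_per_degree = 1240 / angle
print(round(angle, 1), round(pixels_per_degree, 1))
```

Under this assumption the screen subtends roughly 25 degrees horizontally, which gives a little under 50 pixels per degree, consistent with the figures quoted in this chapter of about 24 to 25 pixels per half degree.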

2 available at

9.2.4 Calibration and Presentation

Eye trackers need to be calibrated in order to map a picture of a subject’s eye to a
position in screen space. This is accomplished by having the viewer look at a series
of predetermined points. In our experiments, a nine point calibration was used. The
quality of this calibration was checked visually, and also recorded. Every 10 images,
the calibration was checked and re-calibration was performed if necessary. Recordings
were used to measure the average quality of the calibrations. Errors had a standard
deviation of approximately 24 pixels (about a half degree), which agrees with the pub-
lished sensitivity of the system. Note that this does not account for systematic drift
from small head movements during a viewing.

After calibration, subjects were instructed to look at a target in the center of the
screen and click the mouse to view the first picture when ready. On the user’s click, the
image was presented for 8 seconds, and eye movements were recorded. After this, the
target reappeared for one second. A question then appeared. The subject clicked on a
radio button to select their response, clicked again to go on, and the process repeated.

Subjects normally saw one condition of each of the 50 images. The condition
and order were randomized. While viewing the images, subjects were told to pay
attention so they could answer questions which came after each image. Questions
were divided into two types, the order of which was randomized. Questions asked the
viewer either to rate how much they liked the image on a scale of 1 to 10, or whether
they had already seen the image, or a variant of it, earlier in the experiment. Occasional
duplicate images were inserted randomly when this question was used; data for these
repeated viewings is not included in the analysis. The questions were selected to keep
the viewer’s attention from drifting, while at the same time not giving them specific
instructions which might bias the way they looked at the image.

Figure 9.2: Illustration of data analysis, per image condition. Each colored collection
of points is a cluster. Ellipses mark 99% of variance. Large black dots are detail points.
We measure the number of clusters, distance between clusters and nearest detail point,
and distance between detail points and nearest cluster.

9.3 Analysis

9.3.1 Data Analysis

Analysis draws upon a number of established measures and techniques tailored to our
experiment, to provide complementary evidence about how stylization and abstraction
modify viewing. Some processing is common to all our analysis. First, all eye movement
data is filtered to discard point of regard samples during saccades. We then perform
clustering on the filtered samples [Santella and DeCarlo, 2004b]. The clusters are
not always meaningful, but on the whole they correspond well to features of interest in
the image. There is reason to believe the number of points contained in a cluster may
reveal how important a feature is; this is not considered here.

Our clustering method requires a scale choice. Clusters whose modes are closer
than this scale value will be collapsed together. We select a scale of 25 pixels (roughly
half a degree) for all analysis, which is about the level of tracker noise present. Results
depend on the scale choice used in the clustering process. Clearly, at coarser scales
there will be fewer clusters and a smaller difference between the condition means. We
argue below that this does not affect interpretation of our results.
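
To make the role of the scale parameter concrete, the following is a minimal mean shift sketch over x,y positions. It is not the procedure of [Santella and DeCarlo, 2004b]; the Gaussian kernel, the convergence tolerance, and the greedy merging of nearby modes are assumptions of this example, and the function name is hypothetical.

```python
import math

def mean_shift_clusters(points, scale=25.0, max_iters=200, tol=1e-3):
    """Cluster 2D points by mean shift with a Gaussian kernel of width `scale`.

    Returns (labels, centers): a cluster index for each input point and the
    list of cluster centers (modes). Modes closer than `scale` are merged.
    """
    pts = [(float(x), float(y)) for x, y in points]
    modes = []
    for start_x, start_y in pts:
        mx, my = start_x, start_y
        for _ in range(max_iters):
            weight_sum = sum_x = sum_y = 0.0
            for qx, qy in pts:
                dist_sq = (qx - mx) ** 2 + (qy - my) ** 2
                weight = math.exp(-dist_sq / (2.0 * scale ** 2))
                weight_sum += weight
                sum_x += weight * qx
                sum_y += weight * qy
            # Move to the kernel-weighted mean of the samples.
            new_x, new_y = sum_x / weight_sum, sum_y / weight_sum
            moved = math.hypot(new_x - mx, new_y - my)
            mx, my = new_x, new_y
            if moved < tol:
                break
        modes.append((mx, my))

    centers, labels = [], []
    for mx, my in modes:
        for index, (cx, cy) in enumerate(centers):
            if math.hypot(mx - cx, my - cy) < scale:
                labels.append(index)   # collapse into an existing mode
                break
        else:
            centers.append((mx, my))
            labels.append(len(centers) - 1)
    return labels, centers

samples = [(100, 100), (104, 98), (97, 103), (400, 300), (403, 296)]
labels, centers = mean_shift_clusters(samples, scale=25.0)
```

With two tight groups of samples several hundred pixels apart, this yields two clusters. Increasing `scale` toward the separation between the groups would eventually collapse them into one, which is the dependence on scale discussed above.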

All clustering was conducted in two ways. In the first, which we will refer to as per
viewer analysis, each viewer’s data was clustered separately. In the second analysis,
which we will refer to as per image analysis, data for all viewers of a particular image
was combined before clustering. It is reasonable to think that as one adds data from
individual viewers, the data will approach some hypothetical distribution of image fea-
ture interest [Wooding, 2002]. This second analysis may therefore provide a better
measure of aggregate effects.

Below, we describe the measurements performed using the clusters. See Figure 9.2
for an illustration of the data.

Clusters: Because clusters roughly correspond to areas examined in the image,
we would expect to find fewer clusters in the eye tracking and salience cases if they
succeed in focusing viewer interest. We might also expect uniform simplification to
reduce the number of clusters, because it reduces detail.

Distance (from data to detail points): In the eye tracking and salience conditions,
we wish to measure whether interest is focused on the locations where detail is
preserved. The change in distance from each of the cluster centers to the closest detail
point between conditions tells us how effective the manipulation is in drawing interest
to these locations. If the abstraction is successful, we would expect that clusters will
be closer. This tests the system as a whole. There will be no change in distance if
our hypothesis is wrong, which would mean that varying detail does not attract more
focused interest. It is also possible that in a particular image there was no detail that
could be put in a particular location, because there was none in the original image, or
because it cannot be represented in our system’s visual style.

Distance (from detail points to data): Implicit in the choice of detail points is
the assumption that viewers should look at all of the locations. This is not captured
by the previous distance measure: a viewer could spend all the time looking at one
detail point, yielding a zero distance. To quantify this, it is possible to measure the
distance from each detail point to the closest cluster. A high average value means the
locations of a significant number of detail points were not closely examined. This
distance will decrease in salience and eye tracking conditions if detail modulation makes
people look at high detail areas that were not normally examined.
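
The asymmetry between the two distance measures can be seen in a small sketch. The helper below is hypothetical and simply averages nearest neighbor distances in pixels; the coordinates are made up for illustration.

```python
import math

def mean_nearest_distance(sources, targets):
    """Average, over `sources`, of the distance to the closest point in `targets`."""
    return sum(
        min(math.hypot(sx - tx, sy - ty) for tx, ty in targets)
        for sx, sy in sources
    ) / len(sources)

# One cluster sitting exactly on one of two detail points.
clusters = [(100.0, 100.0)]
detail_points = [(100.0, 100.0), (300.0, 100.0)]

data_to_detail = mean_nearest_distance(clusters, detail_points)   # 0.0
detail_to_data = mean_nearest_distance(detail_points, clusters)   # 100.0
```

The viewer in this example looked only at the first detail point, so the distance from data to detail points is zero, while the distance from detail points to data reveals that the second detail point was never examined.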


Figure 9.3: Statistical significance is achieved for number of clusters over a wide range
of clustering scales. The magnitude of the effect decreases, but its significance remains
quite constant over a wide interval. Our results do not hinge on the scale value
selected.
[Plot: difference in number of clusters between the high detail and eye tracking
conditions (effect magnitude) and the corresponding p value, against cluster scale from 0 to 200.]
9.3.2 Statistical Analysis

Data for all subjects was clustered as discussed in Section 9.3.1. In total there are 10
eye tracking records for each of the 50 images in each of the 5 conditions for a total
of 2500 individual recordings. More data than this was gathered; a matched number of
recordings for each condition was selected randomly. As noted in Section 9.2.4, data
was recorded in blocks where one of two questions was asked. Analysis showed no
effect of the questions, and these results are based on roughly equal numbers of images
presented in blocks of each question type.

Analyses of variance (ANOVA) are used to test whether differences between conditions
are significant. These tests produce a p value: the probability that the measured
difference could occur by chance. The per viewer case lends itself naturally to statistical
testing by a two-way repeated measure ANOVA. In this context a two-way ANOVA
separately tests both the contribution that the particular image and the condition make
to the results. This lets one look at the effect of a condition while factoring out the
variation among the different images. A repeated measure analysis treats each viewer’s
eye tracking record as an independent measurement, so there are 10 data points per
image and condition pair. In the per image analysis, the 10 recordings are collapsed
together and data is analyzed instead by a simple two-way ANOVA. There is now only
one data point per image and condition pair, so it is more difficult to show a statistically
significant effect. We want to know not only if some of these conditions are different
from each other, but also which pairs are different. This requires a number of tests.
When performing this kind of analysis there is a concern that, since each test is asso-
ciated with a certain probability that the results could occur by chance, there will be
an unacceptably high cumulative risk that some positive results may occur by chance.
Several approaches exist to deal with this problem. We adopt a common methodology
for minimizing this risk. One test is used to establish that not all of the means are
the same, and only if this test succeeds are pairwise tests performed. This method is
implicit in all pairwise test results reported.
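
The size of the cumulative risk is easy to illustrate. The sketch below assumes, purely for illustration, that the pairwise tests are independent and each uses a threshold of 0.05; neither assumption is stated in the text, and the function name is hypothetical.

```python
import math

def familywise_risk(num_tests, alpha=0.05):
    """Probability of at least one false positive among `num_tests`
    independent tests, each run at significance level `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_tests

num_conditions = 5
num_pairs = math.comb(num_conditions, 2)   # 10 pairwise comparisons
risk = familywise_risk(num_pairs)          # about 0.40
```

With five conditions there are ten possible pairwise comparisons, so without some safeguard the chance of at least one spurious positive result would be around 40 percent under these assumptions. Requiring the overall test to succeed first is one way of guarding against this.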

9.4 Results

Figures 9.4 and 9.5 graph the average results for all measures. The take-away message,
quantified below, is that on the whole:

• Eye tracking and salience conditions have fewer clusters than photo and uniform
detail conditions in all analyses. In the per image analysis, eye tracking has fewer
clusters than salience.

• Distance between the viewed locations and the detail points decreased as a result
of modulating detail.

• Distance between detail points and viewed locations showed no change; however,
the distances for salience points were significantly higher than those for eye
tracking points.

Figure 9.4: Average results for all analyses per image (data for all viewers of an image
is clustered together). Panels: Number of Clusters (high, eye, salience, photo, low);
Distance from Cluster to Detail Point and Distance from Detail Point to Cluster (high
and eye conditions measured against eye tracking detail points; high and salience
conditions measured against salience detail points).

Figure 9.5: Average results for all analyses per viewer (data for each viewing is
clustered separately). Panels: Number of Clusters (high, eye, salience, photo, low);
Distance from Cluster to Detail Point and Distance from Detail Point to Cluster (high
and eye conditions measured against eye tracking detail points; high and salience
conditions measured against salience detail points).

9.4.1 Quantitative Results

Clusters: In the per viewer analysis, there was about one fewer cluster in the eye
tracking and salience conditions, compared to the others. This means each viewer
examined one fewer region on average. Analysis showed this difference was significant
(p < .001). There was no significant difference (p > .05) between the photo or uniform
detail conditions, or between eye tracking and salience.

In the per image analysis, eye tracking had about 6 fewer clusters than uniform
detail and photo conditions, while salience had about 3 fewer. Eye tracking differed
significantly from all other conditions including salience (p < .001). Salience differed
from original at p < .01, and from high and low at p < .05.

Distance (from data to detail points): Clusters in the eye tracking condition were
about 20 pixels closer to the eye tracking detail points than high detail clusters, in both
per viewer and per image analysis (p < .0001). Salience clusters were about 10 pixels
closer to salience detail points (per viewer: p < .0001, per image: p < .01). This is
not spatially very large, but it represents a consistent shift of cluster centers towards the
detail points. The magnitudes of the two shifts (10 and 20 pixels) were not significantly
different from each other. For per image analysis, distances measured to eye tracking
detail points were significantly higher (p < .01) than corresponding distances to salience
detail points.

Distance (from detail points to data): There was no significant change (p > .05)
for salience or eye tracking renderings in either analysis. In both analyses, however,
the distances were significantly smaller (p < .001) when measured from eye tracking
detail points than from salience detail points (a difference of about 40 pixels in the per
viewer analysis and 10 in the per image analysis).

All of the two-way ANOVAs tested the significance of both the experimental condi-
tion, and the particular image. In all tests, the effect of the image was highly significant
(p < .001). This is neither surprising nor particularly informative. It simply states that
individual images have varying numbers of interesting features and they are distributed
differently in relation to the detail points.

As mentioned above, all of this analysis used clusters created with a particular
choice of scale. Figure 9.3 shows that results do not depend on this choice. The differ-
ence between mean number of clusters in the high detail and eye tracking conditions
(per viewer analysis) is plotted along with the corresponding p value. Though the
magnitude of the difference varies, p values show an effect of approximately equal sig-
nificance over a range of scales. The effects we have shown are therefore not due to
the particular scale selected.

9.4.2 Discussion

These results provide evidence that local detail modulation does change the way view-
ers examine an image. Eye tracking and salience renderings each had significantly
fewer clusters than all uniform detail images in both per image and per viewer analysis
(significance is stronger in the per viewer analysis, but that is to be expected based
on the number of samples). Distances to detail points also showed an improvement
for both salience and eye tracking renderings. This indicates that not only were fewer
places examined, but the examined points were closer to the detail points. Distances
from detail points to data did not show improvement. This indicates that though interest
was concentrated by the manipulation in places with detail, it did not bring new interest
to detail points that were not already interesting in the high detail renderings. Viewers
look more at detailed locations when other areas have been simplified, but this did not
benefit all locations equally. Rather, locations that were already somewhat interesting
received increased interest. Results do not prove enhanced or facilitated understanding
per se; however, this is strongly suggested by the more focused pattern of looking.
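
The two distance measures discussed above can be sketched as follows. We assume here that "distance to detail points" is the mean distance from each fixation to its nearest detail point, and "distance from detail points" is the mean distance from each detail point to its nearest fixation; the function names and coordinates are illustrative only.

```python
import math

def mean_nearest_distance(sources, targets):
    """Mean over `sources` of the distance to the closest point in `targets`."""
    return sum(min(math.dist(s, t) for t in targets) for s in sources) / len(sources)

def distance_to_detail(fixations, detail_points):
    # How close the examined locations are to where detail was retained.
    return mean_nearest_distance(fixations, detail_points)

def distance_from_detail(fixations, detail_points):
    # How close each detail point is to anything that was actually examined.
    return mean_nearest_distance(detail_points, fixations)

# All fixations sit near the first detail point; the second is never visited.
fixations = [(0, 0), (1, 0), (0, 1)]
detail_points = [(0, 0), (10, 0)]
```

In this toy case the distance to detail points is small while the distance from detail points is large, which is the asymmetry described above: interest is concentrated near detail, but not every detail point attracts interest.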

Results also indicate that although improvement can be seen with detail modulation
based on both eye tracking and salience, the two behave differently. Modulation based
on both produces fewer clusters of interest, and decreased distance to detail points.
However, in the per image analysis, the number of clusters for the eye tracking condi-
tion was significantly lower than the salience condition. Also, the distances measured
from salience points were consistently higher than those from eye tracking points; this
is further evidence that eye tracking points are more closely examined. Distance to
detail points shows the opposite relationship (though more weakly) and argues against
this conclusion. However, we show below that this is almost certainly due to the num-
ber of detail points, and is not meaningful.

These results fit our intuition that the locations a viewer examined will, in general,
be a better predictor of future viewing than a salience model, which has no sense of
the meaning of image contents. There is considerable controversy in the human vision
literature about how much of eye movements can be accounted for by low level feature
salience. Some optimistically state that salience predictions correlate well with real
eye motions [Privitera and Stark, 2000, Parkhurst et al., 2002]. Others are more doubt-
ful and claim that when measured more carefully and in the context of a goal driven
activity, the correlation is quite poor [Turano et al., 2003, Land et al., 1999]. Our results
show salience points (at least those produced by the algorithm used) are less interesting
in general. Abstraction does attract increased interest to salience points: people
look nearer to some of them, but still at fewer of them overall.

It is not clear at first glance that distance values measured against the eye tracking
and salience detail points can legitimately be compared to each other in order to judge
if they are functionally equivalent. Differences may be due to the number and distribu-
tion of the two kinds of detail points, rather than their locations relative to features in
the images and hence the locations of fixation data collected. One very obvious differ-
ence clouds comparison of fixation and salience detail points. The salience algorithm
produces more fixations than real viewers, so there were typically more detail points

in the salience case (10.9 per image on average) than eye tracking (5.96 on average).
This would seem to bias distances to salience detail points toward lower values, possi-
bly bias distances from salience points to higher values and make comparison difficult.
This is not as bad as it might seem because there is usually a fair amount of redundancy
in the salience detail points. Multiple points lie close to each other, so twice as many
points does not mean twice as many actual locations. Still, this complicates quantitative
interpretation. In fact, we do see that distances to detail points are higher overall for
fixation detail points. If interpreted as reflecting on the fixation data, this would suggest
the strange idea that salience points are examined more closely.

Fortunately, some simple controls indicate that the lower distances to salience
points are an unimportant artifact, while the higher distances from salience detail points
are meaningful. Replacing recorded fixation data with random points allows one to test
if a particular effect is due to the relationship between detail points and data, or if detail
points alone drive the effect. If an effect disappears when random data is substituted
for recorded fixations, it was driven by data. If it persists, the detail points drive the effect.

Replacing all fixation data for all viewers with uniform random points eliminates
all effects in distances from detail points. The location of fixation data drives this
difference. In contrast, when assessed using this random fixation data, distances to
detail points are still significantly higher for eye tracking and lower for salience detail
points. This difference is at least partly driven by the detail points themselves, most
likely by the fact that there are more of them in the salience condition: more points in a
confined space will mean a shorter distance to the nearest one.
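
The random-data control can be sketched as follows. This is a minimal illustration under assumed conditions: the image size, the detail point layouts, and the function names are hypothetical, and the distance measure is the assumed mean nearest-neighbour distance rather than the exact statistic from our analysis.

```python
import math
import random

def mean_nearest_distance(sources, targets):
    """Mean over `sources` of the distance to the closest point in `targets`."""
    return sum(min(math.dist(s, t) for t in targets) for s in sources) / len(sources)

def random_fixations(n, width, height, rng):
    """Uniform random stand-ins for recorded fixations."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
fake_data = random_fixations(2000, 100, 100, rng)

# Two hypothetical detail point sets that differ mainly in how many points they hold.
few_detail = [(25, 25), (75, 75)]
many_detail = [(20, 20), (20, 80), (80, 20), (80, 80), (50, 50), (50, 20)]

# Distance *to* detail points, measured with data that carries no information.
d_few = mean_nearest_distance(fake_data, few_detail)
d_many = mean_nearest_distance(fake_data, many_detail)
```

Because the data here are random, any difference between `d_few` and `d_many` can only come from the detail points themselves. The larger set yields the shorter distance, which is the artifact the control is meant to expose.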

A second control adds evidence that the higher number of detail points in the
salience case is responsible for the overall lower distance to salience detail points. We
can discard some detail points in the salience case so that the number of detail points is
equal across conditions. We would expect effects driven by the number of points to dis-
appear in this case. When this is done, there is no qualitative change in the magnitude

of distances from detail points. Distances to salience detail points however become
significantly higher than those to eye track points: a reversal. The main effect of our
experiment (the decreasing distance resulting from detail modulation) is not affected by
this. These two controls provide fairly strong evidence that eye tracking detail points
really are more examined than salience points overall.
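
The second control amounts to equalizing the number of detail points before measuring distances. The sketch below discards salience points at random; the text does not commit to a particular discarding rule, so random selection is an assumption made for illustration, as are the function name and sample data.

```python
import random

def equalize_detail_points(salience_points, eye_points, rng):
    """Randomly discard salience detail points until both sets have equal size."""
    if len(salience_points) <= len(eye_points):
        return list(salience_points)
    return rng.sample(salience_points, len(eye_points))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
salience_points = [(5, 5), (6, 5), (40, 12), (41, 13), (70, 80), (90, 10)]
eye_points = [(6, 6), (42, 12), (68, 79)]

reduced = equalize_detail_points(salience_points, eye_points, rng)
```

Distances would then be recomputed against `reduced` instead of the full salience set, so that any remaining difference between conditions cannot be attributed to point count.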

The second control could have been done before creating renderings and recording data.
This would have provided a more complete match between both conditions. It would,
however, provide some extra information that is not usually available to the salience
algorithm: the number of fixations a viewer made, a rough guide to how many important
features are present.

In contrast to the changes caused by abstraction, there is little evidence that the
style manipulation alone produces a significant change in viewing. There is no sig-
nificant difference between the photo and high detail images in number of clusters. A
qualitative comparison of fixation scatter plots in these two conditions also suggests
the distribution of points in both is largely similar. There are however some large dif-
ferences in the effect on individual images. In some images large areas of low contrast
texture are removed by the stylization itself. In these cases, viewer interest is different
between the high detail and photo conditions (see Figure 9.6 for an example). Remov-
ing prominent but low contrast texture is abstraction, but it is abstraction over which
one has no control. Rather, it is built implicitly into the system (in this case into the
segmentation technique). The opposite effect can also occur: the style can attract at-
tention to less noticeable features. Notice in Figure 9.6, how drawing ripples on the
water in black has attracted the eye to them. A method for quantifying when and where
these effects occur is a topic for future research. These appear to be primarily low-level
effects, so work on salience [Itti et al., 1998] may provide a good starting point.

Interestingly, our results also indicate the number of regions of interest in an image
is not primarily driven by detail. It is surprising how much the pattern of interest in the
low detail case qualitatively and quantitatively resembles the high detail. The highly

detailed and highly simplified renderings have the same number of clusters while the
mixed detail images, in which the number of regions lies between these extremes,
have fewer. This implies it is locally increased detail that attracts the eye. Too much
or too little detail everywhere leads to a broader dispersion of interest. Locally high
detail seems to attract the eye; globally low detail in particular appears to produce
scattered fixations. The distribution of fixations is similar overall, but groups of fixations
are smaller and more scattered (clusters in the low detail case in fact represent lower
total dwell times on average than in the other conditions). This pattern is suggestive
of a (failed) search for interesting content. Substantiating and quantifying this is an
interesting subject for future research and has direct application in designing future
NPR and visualization systems. It would for example, be interesting to see how detail
relates to the time course of viewing and how behavior might vary with longer or
shorter viewing times.

In summary:

• viewers look at fewer locations in images simplified using eye tracking and
salience data,

• these locations tend to be near locations where detail is locally retained,

• neither the NPR style itself, nor application of uniform simplification modify
number of locations examined, and

• this effect exists for both eye tracking and salience detail points, but there is less
interest overall in salience points.

These results might seem like exactly what one would expect, given the use of
abstraction in art. This was not the only possible outcome however. One could imagine
that abstraction performed in this semi-automatic manner would simply not work.
Simplifying arbitrary areas of an image might confuse viewers, and they might, for
example, spend all their time looking at background regions obscured by simplification,

trying to figure out what is there. This bears a certain resemblance to behavior in the
low detail images. But when detail is retained in some locations and removed elsewhere
viewers seem to get the point, and explore the detailed areas.

9.5 Evaluation Conclusion

These results validate our attempt to focus interest by manipulating image detail using
eye tracking data. Results also have broader implications for those designing NPR sys-
tems, using salience maps in graphics, and designing future experimental evaluations
of NPR systems.

Our results show meaningful abstraction is important for effective NPR. Abstrac-
tion that does not carry any meaning is implicit in many NPR styles; for example,
there are no shading cues in a pure line drawing. Uniform control of detail is also
common in NPR systems. These are important considerations. But both were tested
in this study and produced no change in the number of locations viewers examined.
In contrast, meaningful abstraction clearly affected viewers in a way that supports an
interpretation of enhanced understanding. Directed meaningful abstraction should be
considered seriously in designing future NPR systems.

Similarly, although low level (salience map) and high level (eye track) detail points
behave similarly in their ability to capture increased interest, they differ in their abso-
lute capture of interest. The increased capture of interest seems to be a low level effect;
people don’t bother looking where there isn’t anything informative. However, seman-
tic factors are also active. The locations that interest another person are influenced by
image meaning and are a better predictor than salience of where future viewers will look.

This has implications for the use of salience maps in graphics. It would be highly
desirable to automatically locate places viewers will look in a number of applica-
tions, not the least of which would be automatic abstraction in NPR and visualization.

Though salience points behave similarly to eye tracking points in part of our analy-
sis, results indicate that on the whole salience is not suitable for this purpose. It can
be successful in adaptive rendering applications [Yee et al., 2001], where it is only
necessary that people be somewhat more likely to look at selected locations. Salience
does provide information about how likely the structural qualities of a feature are to
attract interest. However, we want to encourage a later viewer to get the same con-
tent from an image that an earlier, perhaps more experienced viewer examined closely.
Current salience map algorithms are hardly expert viewers. This kind of application re-
quires better predictions, motivated by semantic information that salience is generally
unlikely to provide.

In addition, eye tracking may be a useful technique for evaluations of other NPR
systems. It provides an alternative to questionnaires and task performance measures.
Even when a task based method is possible, eye tracking can be useful in investigating
what features underlie the performance observed. Information is, after all, extracted
from the imagery by looking, and the large body of research available suggests that
locations examined indeed reveal the information being used to complete a task.

The experiments performed by Gooch et al. [2004] can serve to illustrate this. If

users can perform a task better using an NPR drawing of a face rather than a photo-
graph, it is valuable to see where clusters of visual interest occur in the two conditions.
This information may explain performance. It could focus future experiments and in-
form design choices about rendering faces, without exhaustive experimental testing.
Similarly, in evaluation of assembly diagrams [Agrawala et al., 2003], eye tracking
can provide very specific information about how people use such instructions. For ex-
ample, eye tracking could help further explain the way users interleave actions in the
world and examination of the relevant part of the instructions. Eye tracking records are
directly and obviously related to the imagery that evokes them. This makes them very
interpretable—a desirable quality in any measurement. In turn, this guards against the
danger [Kosara et al., 2003] of performing a user evaluation that ultimately doesn’t

yield any useful result.

Figure 9.6: Original photo and high detail NPR image with viewers’ filtered eye track-
ing data. Though we found no global effect across these image types, there are some-
times significantly different viewing patterns, as can be seen here.

Our experimental results quantify our intuition that our technique can focus interest
on areas highlighted with increased detail in abstracted images. Loosely interpreted,
our results could even be looked at as an experimental confirmation of the widely held
informal theory [Graham, 1970] that art functions in part by carefully guiding viewers’
eyes through an image. Our results also resonate with findings from the literature
on the psychology of art [Locher, 1996] that suggest viewers spend more time in
long fixations in the somewhat vaguely defined category of ‘well balanced’ images
while fewer long fixations occur in ‘unbalanced images’. This convergence of work
from different fields is encouraging.

Chapter 10
Future Work
We have demonstrated the effectiveness of our approach to artistic abstraction. This
success encourages investigation in a number of related areas. These include improve-
ments and extensions to image processing and representation techniques, better per-
ceptual models to control abstraction, and application of these and related techniques
to practical problems.

10.1 Image Modeling

The models of image features used in our work are drawn from the state of the art,
but this is constantly changing. As image processing and understanding techniques
improve, these can be incorporated into a richer model of image contents. Interesting
developments are possible in a number of areas.

10.1.1 Segmentation

Our model of image regions could be extended in a number of ways. Our segmentation
technique is limited by a piecewise constant color model of image regions. A segmen-
tation technique that could model a segment as a region of uniform texture or smooth
variation would better represent meaningful areas of the image. Once able to capture
coherent textured areas, how to abstractly render them becomes an interesting ques-
tion. Simply rendering them in a mean or median color is possible. More meaningful
textural abstraction presents an interesting challenge. A natural question to ask is what
features of a texture make it look like what it is. The ability to create NPR versions of
textures from images could be applied in 3D as well as image based NPR.

The inability of the segmenter to capture shading on smoothly varying regions is



Figure 10.1: A rendering from our line drawing system (b), can be compared to an
alternate locally varying segmentation (c). This segmentation more closely follows the
shape of shading contours.

also problematic. Ideally, one would like the boundaries of regions created by shading
to be clearly distinguishable from other regions, and to smoothly follow isocontours of
the image. Instead, mean shift segmentation tends to produce patchy jigsaw puzzle like
regions (see Figure 10.1). If segmentation parameters are changed to create a coarser
segmentation, the result collapses an entire area of gradient leaving a number of small
island regions of greater variation dotted throughout.

It also seems that, when available, fixation information itself should be able to
provide an extra guide to the segmentation process. A fixation gives a fairly strong clue
that some important fine scale feature exists in its vicinity. This information should be
of some benefit to segmentation.

We have made some initial experiments addressing these issues. We begin by cre-
ating a segmentation using an alternate segmentation technique that tends to follow
isocontours [Montanvert et al., 1991]. This method iteratively merges regions based
on a class label derived from color information. We also use fixation data to locally
control the color threshold used in the segmentation. The contrast threshold used to




Figure 10.2: Locally varying segmentation cannot replace a segmentation hierarchy.

Another example of a locally varying segmentation controlled by a perceptual model
(c), compared to a rendering from our line drawing system. Note fine detail in the brick
preserved near the subject’s head in (c). This is a consequence of the threshold varying
continuously as a function of distance from the fixations on the face.

decide whether two regions should merge is calculated using the same contrast sensi-
tivity model applied in our region and line drawing system. The result of this is a single
segmentation that displays locally varying resolution with smaller regions being pre-
served where a viewer looked. This achieves a kind of abstraction very similar to the
renderings in our colored line drawing style (see Figures 10.1 and 10.2). Note that shading
region boundaries in these images follow much more natural curves.
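
The idea of a merge threshold that varies with distance from a fixation can be illustrated in one dimension. The sketch below is deliberately simplified: it makes a single left-to-right pass rather than the iterative merging of [Montanvert et al., 1991], and `local_threshold` is a hypothetical linear stand-in for the contrast sensitivity model, with made-up constants.

```python
def local_threshold(x, fixation_x, base=2.0, slope=0.5):
    """Hypothetical stand-in for the perceptual model: the merge threshold
    grows linearly with distance from the fixation."""
    return base + slope * abs(x - fixation_x)

def segment_row(values, fixation_x):
    """Group consecutive samples into regions, starting a new region whenever
    the intensity step is at least the local threshold."""
    regions = [[0]]
    for x in range(1, len(values)):
        step = abs(values[x] - values[x - 1])
        if step < local_threshold(x, fixation_x):
            regions[-1].append(x)   # difference is below threshold: merge
        else:
            regions.append([x])     # difference is visible here: keep a boundary
    return regions

# A row of uniform texture: every step has the same contrast.
row = [0, 5, 0, 5, 0, 5, 0, 5, 0, 5, 0, 5]
regions = segment_row(row, fixation_x=0)
```

Although the texture has the same contrast everywhere, small regions survive only near the fixation and the rest of the row collapses into one region. This is also the behaviour noted as a limitation: detail is kept near a fixation whether or not it is important.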

This technique has limitations as a form of abstraction. Detail is determined
locally, so we again see detail preserved near important features, like the bricks near
the figure’s head in Figure 10.2. Though currently this technique only creates a single
segmentation, it could be easily extended to create a hierarchy that would allow us to
modulate detail discontinuously. Even when creating a hierarchy, it may be useful to
segment important areas more finely. Indeed, even if our goal is not abstraction, but
rather segmentation for its own sake, the locally varying resolution of such segmenta-
tions might be useful.

Various additional data could also be added to our segmentation. Items such as cars
and people can be identified [Torralba et al., 2004] and these labels could be added to
sets of regions in our representation. Knowing what an object is would aid more in-
formed abstraction. This contrasts with our system that views everything as a collection
of blobs. Such information could also be used in automatic attempts to modify the seg-
mentation tree to better reflect object structure. If some set of regions at the finest level
are identified as say, a car, the segmentation hierarchy can be modified to ensure that
these regions form a subtree that does not merge with the background until the whole
car is reduced to one region.
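
One way such a constraint might be imposed is to defer any merge that crosses the object boundary until the object has collapsed into a single region. The sketch below does this for a one-dimensional row of regions with greedy merging by mean color difference. The representation, the function name `build_hierarchy`, and the requirement that the labeled leaves be contiguous are all assumptions made for illustration, not features of our system.

```python
def build_hierarchy(colors, object_leaves):
    """Greedily merge adjacent 1D regions by colour similarity.

    Merges between a partial object region and anything outside the object
    are deferred until the object has become a single region. Assumes the
    labelled leaves are contiguous. Returns merged leaf sets in merge order.
    """
    obj = frozenset(object_leaves)
    # Each region is (leaf set, mean colour); list order encodes adjacency.
    regions = [(frozenset([i]), float(c)) for i, c in enumerate(colors)]
    merges = []

    def allowed(a, b):
        for inside, other in ((a, b), (b, a)):
            # A proper subset of the object may only merge with object parts.
            if inside < obj and not other <= obj:
                return False
        return True

    while len(regions) > 1:
        candidates = [
            (abs(regions[i][1] - regions[i + 1][1]), i)
            for i in range(len(regions) - 1)
            if allowed(regions[i][0], regions[i + 1][0])
        ]
        _, i = min(candidates)  # cheapest permitted merge
        (la, ca), (lb, cb) = regions[i], regions[i + 1]
        merged = la | lb
        mean = (ca * len(la) + cb * len(lb)) / len(merged)
        regions[i:i + 2] = [(merged, mean)]
        merges.append(merged)
    return merges

# Leaves 2 and 3 are labelled as one object even though leaf 3 resembles
# the background more than it resembles leaf 2.
colors = [10, 12, 50, 11, 13, 90]
merges = build_hierarchy(colors, object_leaves={2, 3})
```

Without the constraint, leaf 3 would be absorbed into the background first. With it, the labeled leaves appear as a node of their own in the hierarchy before joining anything else.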

This style of abstracted rendering has been extended to video [Wang et al., 2004].
However, abstraction was performed largely by hand selecting groups of 3D space-
time regions. More fully automatic methods would require a 3D analogue of our 2D
hierarchical segmentation. Such a blob hierarchy could be created at fairly high com-
putational expense by repeated mean shift with successively larger kernels. A careful

iterative merging of regions could potentially create similar results at much less cost.

10.1.2 Edges

Our edge representation is another interesting area for improvement. Edges in our
current results are a weak point. They are detected at only one scale and tend to be
very broken in appearance, with an excessive number of small textural edges scattered
about. Very short edges are discarded. This filtering makes fine detail impossible to
capture with lines. In addition, because edges are detected at only one scale, we know
nothing about the range of frequencies an edge exists at and so are pressed into the
questionable decision of using edge length as a size measure.

Detecting edges at multiple scales is an obvious next step. There are several ways
this might be done. One would be to create a hierarchical edge representation similar
to our region hierarchy. Some work has been done on this problem. Edges have been
detected at multiple scales and correspondences made across scales to trace coarse
scale edges to their fine scale causes [Hong et al., 1982]. Such approaches seek to
achieve both of the normally conflicting goals of robust detection of features at all
scales, and fine localization of feature position. This work could be built on to represent
all edges in an image as a collection of tree structures of connected edges.

A more modern and popular approach to deal with multi-scale edges is scale se-
lection. Only a single set of edges is detected, but the scale at which they are detected
varies locally [Lindeberg, 1998]. Conceptually, one searches everywhere for edges at
a range of scales and picks the scale with the maximal edge response. This approach
does not consider the tracing of coarse scale features to fine scale. But this is not par-
ticularly necessary. Ideally at least, features detected at coarse scales actually exist at a
coarse scale, and finer scales in those locations will only contain noise. This approach
provides a more complete, continuous set of edges. It also provides important additional
information: for each point on each edge there is a corresponding contrast and
scale value. The availability of this information suggests the use of more interesting

Figure 10.3: A rendering from our line drawing system demonstrates how long but
unimportant edges can be inappropriately emphasized. Also, prominent lower fre-
quency edges like creases in clothing are detected in fragments and filtered out because
edges are detected at only one scale.

perceptual models in making decisions about edge inclusion.
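
The principle behind scale selection can be shown on a one-dimensional edge. The sketch below smooths a blurred step at several scales, weights the gradient magnitude by a power of the scale, and keeps the scale with the largest response. The normalization exponent `gamma = 0.5` follows our reading of [Lindeberg, 1998] and should be treated as an assumption, as should the function names and the test signals. A full detector would also search over position and work in two dimensions.

```python
import math

def blurred_step(n, t0):
    """Unit step edge centred at n // 2, blurred by a Gaussian of variance t0."""
    c = n // 2
    return [0.5 * (1.0 + math.erf((i - c) / math.sqrt(2.0 * t0))) for i in range(n)]

def gradient_magnitude(signal, center, t):
    """Magnitude of the first derivative of the signal smoothed with a
    Gaussian of variance t, evaluated at index `center`."""
    radius = int(4 * math.sqrt(t)) + 1
    total = 0.0
    for k in range(-radius, radius + 1):
        g = math.exp(-k * k / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)
        total += (-k / t) * g * signal[center - k]  # derivative-of-Gaussian weight
    return abs(total)

def select_scale(signal, center, scales, gamma=0.5):
    """Return the scale with the largest normalised edge strength."""
    return max(
        scales,
        key=lambda t: t ** (gamma / 2.0) * gradient_magnitude(signal, center, t),
    )

scales = [1, 2, 4, 8, 16, 32, 64]
sharp = blurred_step(201, t0=2)    # a crisp edge
soft = blurred_step(201, t0=16)    # a diffuse edge, such as a soft shadow
```

With this normalization the selected scale tracks the blur of the edge, so each edge point carries a scale estimate along with its contrast.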

10.2 Perceptual Models

Like representation of edges, decisions about edge inclusion are a weak point in our
current approach. Currently, an acuity model uses edge length as a proxy for fre-
quency. This succeeds in removing shorter edges in unimportant regions, but is poorly
motivated perceptually and produces some unintuitive artifacts. This can be seen for
example in Figure 10.3, where unimportant edges in the background are inappropriately
included because of their great length. Edges in our system are, in fact, filtered
not once but three times: first by the hysteresis threshold used in the original edge
detection scheme, second by a global length threshold, and only then are edges judged by our
perceptual model.

If scale selection is used, we would have for each point on each edge a frequency
estimate as well as a contrast measure at that scale. This would allow us to use a
contrast sensitivity model to judge edge inclusion. A decision could be made at each
point along each edge, or a single scale and contrast could be assigned to the whole of

each candidate edge, perhaps using the median value. As we currently do with regions,
we could then plug frequency into the model and receive a contrast threshold that can
be compared to the measured contrast along the edge.
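
A per-edge decision of this kind might look like the following sketch. The sensitivity curve is the widely quoted form attributed to Mannos and Sakrison, used here only as a convenient example and not necessarily the model applied in our system. The `peak_sensitivity` constant, the assumption that frequencies are already expressed in cycles per degree, and the function names are all hypothetical. Dependence on distance from a fixation is omitted.

```python
import math
from statistics import median

def contrast_sensitivity(freq_cpd):
    """Normalised contrast sensitivity as a function of spatial frequency in
    cycles per degree (Mannos-Sakrison form, peak close to 8 cycles/degree)."""
    return 2.6 * (0.0192 + 0.114 * freq_cpd) * math.exp(-((0.114 * freq_cpd) ** 1.1))

def contrast_threshold(freq_cpd, peak_sensitivity=250.0):
    """Smallest visible contrast at this frequency. `peak_sensitivity` is an
    assumed scaling constant, not a measured value."""
    return 1.0 / (peak_sensitivity * contrast_sensitivity(freq_cpd))

def keep_edge(point_freqs, point_contrasts, peak_sensitivity=250.0):
    """Assign the median frequency and contrast to the whole edge and keep
    the edge if its contrast exceeds the threshold at that frequency."""
    freq = median(point_freqs)
    contrast = median(point_contrasts)
    return contrast > contrast_threshold(freq, peak_sensitivity)

# Two hypothetical edges with the same contrast but different frequencies.
mid_frequency_edge = keep_edge([8, 8, 9], [0.02, 0.02, 0.03])
high_frequency_edge = keep_edge([40, 40, 42], [0.02, 0.02, 0.03])
```

Under these assumed constants the mid-frequency edge is kept and the high-frequency edge of equal contrast is dropped. Using the median makes the decision robust to a few noisy points along the edge, at the cost of ignoring variation along it.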

Recall that in applying contrast sensitivity models to regions, some modifications
were made to avoid the unintuitive effect of very large scale regions having lower vis-
ibilities. This was loosely justified by the properties of square wave gratings. For
multi-scale edges the unmodified model makes sense. Very coarse scale edges, such as
the edge of a soft shadow, are in fact less visually prominent and would be correctly
judged as less worthy of inclusion than somewhat higher frequency edges of similar
contrast. A model like this could take over all judgments about what constitutes significant
edges. Such an approach could be used to intuitively filter detected edges outside
of NPR. Scale selection detects a very large number of low contrast high frequency
edges. A variety of strength measures have been used to filter them out. A model
like this would provide a perceptually motivated metric, as well as a way of creating
locally varying thresholds based on viewer input. A complete approach to edge extrac-
tion and filtering would require higher level effects like grouping and completion, but
perceptual metrics like this could be an interesting first step.

Similarly, there is room for improvement in perceptual models of region visibility.
The next step is less clear here. Better psychophysical models of color contrast sen-
sitivity could be applied if available. Better methods of measuring contrast between
regions and their surround would also be useful. Our current approach takes into ac-
count only the mean color of each region. It is easy to think up examples where this
provides a poor measure of the distinguishability of two regions.

A method that measured contrast using the color histograms of two regions would
likely be an improvement. Taking into account both the interior of the regions and the
characteristics of their shared border [Lillestaeter, 1993] could distinguish object and
shading boundaries. Alternatively, we could reduce region visibility to the visibility of
boundaries between two regions. This could be done using the contrast and frequency

of a best fitting edge along the border between them. Since we are measuring not the
size of a region, but the frequency of the boundary between two regions, an unmodified
contrast sensitivity model is again appropriate. Region boundaries due to slow shading
changes would have appropriately low contrasts, low frequencies and therefore low
visibility. In another alternative, the whole range of frequencies and corresponding
contrasts present on the border could be looked at, and visibility could be based on
the most visible among these. This kind of perceptually driven scale selection might
produce some interesting effects.
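
The weakness of a mean-based measure, and the histogram alternative suggested above, can be seen in a small sketch. It uses grey levels rather than color for brevity, and the bin count, the distance measure (half the L1 distance between normalized histograms), and the function names are illustrative assumptions.

```python
def histogram(values, bins, lo=0, hi=256):
    """Normalised histogram of grey levels in [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

def histogram_distance(a, b, bins=8):
    """Half the L1 distance between histograms: 0 is identical, 1 is disjoint."""
    ha, hb = histogram(a, bins), histogram(b, bins)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))

def mean_difference(a, b):
    """Contrast measured from region means only."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

# A flat grey region next to a strongly textured one with the same mean.
flat = [128] * 8
textured = [64, 192] * 4
```

Here the two regions have identical means, so a mean-based measure reports no contrast, while the histogram distance reports that they share no grey levels at all.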

All of these additions and modifications could lead to more interesting and ex-
pressive imagery. We have shown that the abstraction embodied in these images can
communicate what a viewer found important and provide an effective guide to future
viewers. A component that remains missing in our argument that this methodology
will be useful in visualization is a demonstration of its benefits in a practical task.

10.3 Applications

The presence of similar abstraction in many technical and practical illustrations en-
courages us that there are many applications of these techniques in visualization and
illustration. A practical problem has been choosing a domain in which to test our
method. Our approach lends itself to illustrative rather than exploratory applications,
since the methodology requires that someone know what is important, so their fixa-
tions can be used to clarify the point for successive viewers. The domains where this
might be most useful present some challenges for our current image analysis. Med-
ical images for example tend to be low contrast, noisy and difficult to segment with
general-purpose segmentation techniques. Photographs of technical apparatus, such as
a car engine (see Figure 10.4), present their own difficulties. Though clean
man made edges are generally easier to segment, these images are very crowded and
often poorly lit. In these circumstances, segmentations fail to respect object structure in

Figure 10.4: Attempting technical illustration of mechanical parts pushes our image
analysis techniques close to (if not over) their limits.

a way that can be confusing. Extra sources of information, such as sets of photos taken
with flashes in different locations have been used to ease image analysis in situations
like this [Raskar et al., 2004]. Despite these technical challenges, we feel confident
that these methods of abstraction will be useful for illustration in a number of domains.

These applications are not limited to photo abstraction. Similar kinds of abstraction
can be performed in 3D scenes. This removes the difficulties of image analysis, though
it presents a number of new challenges. Beyond textural indication in line drawings,
abstraction in 3D scenes has received relatively little attention. Perceptual metrics like
those we present could provide an interesting basis for a general framework of 3D

Chapter 11
Our goal was to create images that capture some of the expressive omission of art.
Several kinds of such images have been presented. These methods are motivated by
artistic practice and current models of human visual perception. Such images have
been experimentally shown to create a difference in the way viewers look at images.
This suggests our method has the ability to direct a viewer’s gaze, or at least focus
interest in particular areas. We therefore believe that these techniques are applicable
not only to art but also to wider problems of graphical illustration and visualization.

Rather than just a test of our system, our experiments can be seen as empirical validation
on controlled stimuli of the general idea that artists direct viewers’ gaze through
detail modulation. Our success also provides an experimental confirmation of sorts for
the hypothesis [Zeki, 1999] that at least part of the appeal of great art lies in the artist’s
careful control of detail, enticing the viewer with information, while not overwhelming
them with irrelevant detail. This balance serves to engage the viewer, leaving them free
to ponder an image’s meaning, without the burden of having to decipher its contents.

Detail modulation in illustration and art is a complex topic which we have only
begun to investigate. The work presented here has already inspired related approaches
from other researchers to problems in cropping [Suh et al., 2003] and fluid visualization
[Watanabe et al., 2004]. Detail modulation is only part of visual artistry, one
of the many techniques available. Color, contrast, shape, and a host of higher level
concerns are manipulated in art and play a part in well designed images. All of these
techniques have some cognitive motivation. Understanding this perceptual basis is
an important guide in creating effective automatic instantiations of these techniques.
Continuing investigation of the role and functioning of abstraction in its many forms,
especially through building new quantitative models, should yield new ways to create

easily understood illustration.

The work presented here suggests a number of general insights to guide future

• An understanding of the cognitive processing involved in human understanding
of an image or stimulus is important for effective stylized and abstracted illustration.

• The importance of some user input in our system highlights the fact that current
automatic techniques cannot replace the semantic knowledge of a human viewer.
People can perform abstraction but so far computers cannot in the domain of
general images.

• The fact that eye tracking is sufficient for some level of abstraction in our con-
text makes an interesting point. It suggests that the understanding underlying
abstraction, and perhaps other artistic judgments, is not some mysterious abil-
ity of a visionary few, but a basic visual competence. Though not everyone can
draw, everyone it seems can control abstraction in a computer rendering.

• Eye tracking is a useful tool for understanding in this context. It is useful not only
as a minimal form of interaction, but also as a cognitive measure for evaluation
and for understanding what features are attended and hence may be critical to

• In a perceptually motivated framework, experimentation is useful not only to
evaluate or validate a final system, but also to investigate and, if possible, build
quantitative models of perception as it relates to questions of interest. Work in
psychology and cognitive science can provide a framework for understanding a
problem, as well as general methodologies. Sometimes models applicable to
a specific problem are also available. However, these do not always address
questions in the way most useful to those building applied systems. This creates
a need for a cyclical process of cognitive investigation and system engineering
to build more effective systems for visual communication.

These considerations suggest a future path for research in NPR that diverges some-
what from traditional areas of investigation, but holds the promise of a consistent intel-
lectual underpinning for an expanding field, as well, of course, as the promise of more
expressive and perhaps ultimately artistic computer generated imagery.

[Adelson, 2001] Adelson, E. H. (2001). On seeing stuff: the perception of materials by
humans and machines. Proceedings of the SPIE, 4299:1–12.

[Agrawala et al., 2003] Agrawala, M., Phan, D., Heiser, J., Haymaker, J., Klingner, J.,
Hanrahan, P., and Tversky, B. (2003). Designing effective step-by-step assembly
instructions. In Proceedings of ACM SIGGRAPH 2003, pages 828–837.

[Agrawala and Stolte, 2001] Agrawala, M. and Stolte, C. (2001). Rendering effective
route maps: improving usability through generalization. In Proceedings of ACM
SIGGRAPH 2001, pages 241–249.

[Ahuja, 1996] Ahuja, N. (1996). A transform for multiscale image segmentation by
integrated edge and region detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(12):1211–1235.

[Arnheim, 1988] Arnheim, R. (1988). The Power of the Center. University of Cali-
fornia Press.

[Bangham et al., 1998] Bangham, J., Hidalgo, J. R., Harvey, R., and Cawley, G. (1998).
The segmentation of images via scale-space trees. In Proceedings of the British
Machine Vision Conference, pages 33–43.

[Baxter et al., 2001] Baxter, B., Scheib, V., and Lin, M. (2001). Dab: interactive hap-
tic painting with 3d virtual brushes. Proceedings of ACM SIGGRAPH 2001, pages

[Campbell and Robson, 1968] Campbell, F. and Robson, J. (1968). Application of
Fourier analysis to the visibility of gratings. Journal of Physiology, 197:551–566.

[Cater et al., 2003] Cater, K., Chalmers, A., and Ward, G. (2003). Detail to attention:
Exploiting visual tasks for selective rendering. In Proceedings of the Eurographics
Symposium on Rendering, pages 270–280.

[Chen et al., 2002] Chen, L., Xie, X., Fan, X., Ma, W., Shang, H., and Zhou, H.
(2002). A visual attention model for adapting images on small displays. MSR-
TR-2002-125, Microsoft Research, Redmond, WA.

[Christoudias et al., 2002] Christoudias, C., Georgescu, B., and Meer, P. (2002). Syn-
ergism in low level vision. In Proceedings ICPR 2002, pages 150–155.

[Collomosse and Hall, 2003] Collomosse, J. P. and Hall, P. M. (2003). Genetic paint-
ing: a salience adaptive relaxation technique for painterly rendering. Technical
Report CSBU2003-02, Dept. of Computer Science, University of Bath.

[Crowe and Narayanan, 2000] Crowe, E. C. and Narayanan, N. H. (2000). Comparing
interfaces based on what users watch and do. In Proceedings of the Eye Tracking
Research and Applications (ETRA) Symposium 2000, pages 29–36.

[Curtis et al., 1997] Curtis, C. J., Anderson, S. E., Seims, J. E., Fleischer, K. W., and
Salesin, D. H. (1997). Computer-generated watercolor. In Proceedings of ACM
SIGGRAPH 97, pages 421–430.

[DeCarlo et al., 2003] DeCarlo, D., Finkelstein, A., Rusinkiewicz, S., and Santella, A.
(2003). Suggestive contours for conveying shape. In Proceedings of ACM SIGGRAPH
2003, pages 848–855.

[DeCarlo and Santella, 2002] DeCarlo, D. and Santella, A. (2002). Stylization and
abstraction of photographs. In Proceedings of ACM SIGGRAPH 2002, pages 769–776.

[Deussen and Strothotte, 2000] Deussen, O. and Strothotte, T. (2000).
Computer-generated pen-and-ink illustration of trees. In Proceedings of ACM SIGGRAPH
2000, pages 13–18.

[Duchowski, 2000] Duchowski, A. (2000). Acuity-matching resolution degradation
through wavelet coefficient scaling. IEEE Trans. on Image Processing, 9(8):1437–

[Durand et al., 2001] Durand, F., Ostromoukhov, V., Miller, M., Duranleau, F., and
Dorsey, J. (2001). Decoupling strokes and high-level attributes for interactive tradi-
tional drawing. In Proceedings of the 12th Eurographics Workshop on Rendering,
pages 71–82.

[Fleming et al., 2003] Fleming, R. W., Dror, O. R., and Adelson, E. (2003). Real-
world illumination and the perception of surface reflectance properties. Journal of
Vision, 3:347–368.

[Goldberg et al., 2002] Goldberg, J. H., Stimson, M. J., Lewenstein, M., Scott, N., and
Wichansky, A. M. (2002). Eye tracking in web search tasks: design implications.
In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium
2002, pages 51–58.

[Gombrich et al., 1970] Gombrich, E. H., Hochberg, J., and Black, M. (1970). Art,
Perception, and Reality. Johns Hopkins University Press.

[Gooch and Willemsen, 2002] Gooch, A. A. and Willemsen, P. (2002). Evaluating
space perception in NPR immersive environments. In Proceedings of the Second
International Symposium on Non-photorealistic Animation and Rendering (NPAR),
pages 105–110.

[Gooch and Gooch, 2001] Gooch, B. and Gooch, A. (2001). Non-Photorealistic Ren-
dering. A K Peters.

[Gooch et al., 2004] Gooch, B., Reinhard, E., and Gooch, A. (2004). Human facial
illustration: Creation and psychophysical evaluation. ACM Transactions on Graph-
ics, 23:27–44.

[Grabli et al., 2004] Grabli, S., Durand, F., and Sillion, F. (2004). Density measure for
line-drawing simplification. In Proceedings of Pacific Graphics.

[Graham, 1970] Graham, D. (1970). Composing Pictures. Van Nostrand Reinhold.

[Haeberli, 1990] Haeberli, P. (1990). Paint by numbers: Abstract image
representations. In Proceedings of ACM SIGGRAPH 90, pages 207–214.

[Hays and Essa, 2004] Hays, J. H. and Essa, I. (2004). Image and video-based
painterly animation. In Proceedings of the Third International Symposium on Non-
photorealistic Animation and Rendering (NPAR), pages 113–120.

[Heiser et al., 2004] Heiser, J., Phan, D., Agrawala, M., Tversky, B., and Hanrahan,
P. (2004). Identification and validation of cognitive design principles for automated
generation of assembly instructions. In Advanced Visual Interfaces, pages 311–319.

[Henderson and Hollingworth, 1998] Henderson, J. M. and Hollingworth, A. (1998).
Eye movements during scene viewing: An overview. In Underwood, G., editor, Eye
Guidance in Reading and Scene Perception, pages 269–293. Elsevier Science Ltd.

[Hertzmann, 1998] Hertzmann, A. (1998). Painterly rendering with curved brush
strokes of multiple sizes. In Proceedings of ACM SIGGRAPH 98, pages 453–460.

[Hertzmann, 2001] Hertzmann, A. (2001). Paint by relaxation. In Computer Graphics
International, pages 47–54.

[Hong et al., 1982] Hong, T.-H., Shneier, M., and Rosenfeld, A. (1982). Border ex-
traction using linked edge pyramids. IEEE Transactions on Systems, Man and Cy-
bernetics, 12:660–668.

[Interrante, 1996] Interrante, V. (1996). Illustrating Transparency: communicating
the 3D shape of layered transparent surfaces via texture. PhD thesis, University of
North Carolina.

[Itti and Koch, 2000] Itti, L. and Koch, C. (2000). A saliency-based search mechanism
for overt and covert shifts of visual attention. Vision Research, 40:1489–1506.

[Itti et al., 1998] Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based
visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 20:1254–1259.

[Jacob, 1993] Jacob, R. J. (1993). Eye-movement-based human-computer interaction
techniques: Toward non-command interfaces. In Hartson, H. and Hix, D., editors,
Advances in Human-Computer Interaction, Volume 4, pages 151–190. Ablex Pub-

[Just and Carpenter, 1976] Just, M. A. and Carpenter, P. A. (1976). Eye fixations and
cognitive processes. Cognitive Psychology, 8:441–480.

[Kalnins et al., 2002] Kalnins, R. D., Markosian, L., Meier, B. J., Kowalski, M. A.,
Lee, J. C., Davidson, P. L., Webb, M., Hughes, J. F., and Finkelstein, A. (2002).
WYSIWYG NPR: Drawing strokes directly on 3D models. In Proceedings of ACM
SIGGRAPH 2002, pages 755–762.

[Kelly, 1984] Kelly, D. (1984). Retinal inhomogeneity: I. Spatiotemporal contrast
sensitivity. Journal of the Optical Society of America A, 74(1):107–113.

[Koenderink and van Doorn, 1979] Koenderink, J. and van Doorn, A. (1979). The
structure of two dimensional scalar fields with applications to vision. Biological
Cybernetics, 30:151–158.

[Koenderink, 1984] Koenderink, J. J. (1984). What does the occluding contour tell us
about solid shape? Perception, 13:321–330.

[Koenderink et al., 1978] Koenderink, J. J., Bouman, M. A., Bueno de Mesquita, A. E.,
and Slappendel, S. (1978). Perimetry of contrast detection thresholds of moving
spatial sine wave patterns. II. The far peripheral visual field (eccentricity 0-50).
Journal of the Optical Society of America A, 68(6):850–854.

[Koenderink and van Doorn, 1999] Koenderink, J. J. and van Doorn, A. (1999). The
structure of locally orderless images. International Journal of Computer Vision,

[Kosara et al., 2003] Kosara, R., Healey, C., Interrante, V., Laidlaw, D., and Ware,
C. (2003). User studies: Why, how and when? IEEE Computer Graphics and
Applications, 23(4):20–25.

[Kowalski et al., 1999] Kowalski, M. A., Markosian, L., Northrup, J. D., Bourdev, L.,
Barzel, R., Holden, L. S., and Hughes, J. (1999). Art-based rendering of fur, grass,
and trees. In Proceedings of ACM SIGGRAPH 99, pages 433–438.

[Kowler, 1990] Kowler, E. (1990). The role of visual and cognitive processes in the
control of eye movements. In Kowler, E., editor, Eye Movements and Their Role in
Visual and Cognitive Processes, pages 1–70. Elsevier Science Ltd.

[Land et al., 1999] Land, M., Mennie, N., and Rusted, J. (1999). The roles of vision
and eye movements in the control of activities of daily living. Perception, 28:1311–

[Leyton, 1992] Leyton, M. (1992). Symmetry, Causality, Mind. MIT Press.

[Lillestaeter, 1993] Lillestaeter, O. (1993). Complex contrast, a definition for
structured targets and backgrounds. Journal of the Optical Society of America,

[Lindeberg, 1998] Lindeberg, T. (1998). Edge detection and ridge detection with au-
tomatic scale selection. International Journal of Computer Vision, 30(2):117–154.

[Litwinowicz, 1997] Litwinowicz, P. (1997). Processing images and video for an im-
pressionist effect. In Proceedings of ACM SIGGRAPH 97, pages 407–414.

[Locher, 1996] Locher, P. J. (1996). The contribution of eye-movement research to an
understanding of the nature of pictorial balance perception: a review of the
literature. Empirical Studies of the Arts, 14(2):146–163.

[Mackworth and Morandi, 1967] Mackworth, N. and Morandi, A. (1967). The gaze
selects informative details within pictures. Perception and Psychophysics, 2:547–

[Mannos and Sakrison, 1974] Mannos, J. L. and Sakrison, D. J. (1974). The effects of
a visual fidelity criterion on the encoding of images. IEEE Trans. on Information
Theory, 20(4):525–536.

[Markosian et al., 1997] Markosian, L., Kowalski, M. A., Trychin, S. J., Bourdev,
L. D., Goldstein, D., and Hughes, J. F. (1997). Real-time nonphotorealistic ren-
dering. In Proceedings of ACM SIGGRAPH 97, pages 415–420.

[Marr, 1982] Marr, D. (1982). Vision: A Computational Investigation into the Human
Representation and Processing of Visual Information. W.H. Freeman, San Francisco.

[Meer and Georgescu, 2001] Meer, P. and Georgescu, B. (2001). Edge detection with
embedded confidence. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 23(12):1351–1365.

[Mello-Thoms et al., 2002] Mello-Thoms, C., Nodine, C. F., and Kundel, H. L.
(2002). What attracts the eye to the location of missed and reported breast cancers?
In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium
2002, pages 111–117.

[Montanvert et al., 1991] Montanvert, A., Meer, P., and Rosenfeld, A. (1991).
Hierarchical image analysis using irregular tessellations. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 13(4):307–316.

[Mulligan, 2002] Mulligan, J. B. (2002). A software-based eye tracking system for
the study of air traffic displays. In Proceedings of the Eye Tracking Research and
Applications (ETRA) Symposium 2002, pages 69–76.

[Niessen, 1997] Niessen, W. (1997). Nonlinear multiscale representations for image
segmentation. Computer Vision and Image Understanding, 66(2):233–245.

[Parkhurst et al., 2002] Parkhurst, D., Law, K., and Niebur, E. (2002). Modeling the
role of salience in the allocation of overt visual attention. Vision Research, 42:107–

[Perona and Malik, 1990] Perona, P. and Malik, J. (1990). Scale-space and edge
detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 12(7):629–639.

[Privitera and Stark, 2000] Privitera, C. M. and Stark, L. W. (2000). Algorithms for
defining visual regions-of-interest: Comparison with eye fixations. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(9):970–982.

[Ramachandran and Hirstein, 1999] Ramachandran, V. S. and Hirstein, W. (1999).
The science of art. Journal of Consciousness Studies, 6(6-7).

[Raskar et al., 2004] Raskar, R., Tan, K.-H., Feris, R., Yu, J., and Turk, M. (2004).
Non-photorealistic camera: depth edge detection and stylized rendering using
multi-flash imaging. In Proceedings of ACM SIGGRAPH 2004, pages 679–688.

[Reddy, 1997] Reddy, M. (1997). Perceptually Modulated Level of Detail for Virtual
Environments. PhD thesis, University of Edinburgh.

[Reddy, 2001] Reddy, M. (2001). Perceptually optimized 3D graphics. IEEE Computer
Graphics and Applications, 21(5):68–75.

[Regan, 2000] Regan, D. (2000). Human Perception of Objects: Early Visual Processing
of Spatial Form Defined by Luminance, Color, Texture, Motion and Binocular
Disparity. Sinauer.

[Rosenholtz, 1999] Rosenholtz, R. (1999). A simple saliency model predicts a number
of motion popout phenomena. Vision Research, 39:3157–3163.

[Rosenholtz, 2001] Rosenholtz, R. (2001). Search asymmetries? What search
asymmetries? Perception and Psychophysics, 63:476–489.

[Rovamo and Virsu, 1979] Rovamo, J. and Virsu, V. (1979). An estimation and
application of the human cortical magnification factor. Experimental Brain Research,

[Ruskin, 1857] Ruskin, J. (1857). The Elements of Drawing. Smith, Elder and Co.

[Ruskin, 1858] Ruskin, J. (1858). Address at the opening of the Cambridge school of

[Ryan and Schwartz, 1956] Ryan, T. A. and Schwartz, C. B. (1956). Speed of perception
as a function of mode of representation. American Journal of Psychology,
pages 60–69.

[Saito and Takahashi, 1990] Saito, T. and Takahashi, T. (1990). Comprehensible
rendering of 3-D shapes. In Proceedings of ACM SIGGRAPH 90, pages 197–206.

[Salisbury et al., 1994] Salisbury, M. P., Anderson, S. E., Barzel, R., and Salesin, D. H.
(1994). Interactive pen-and-ink illustration. In Proceedings of ACM SIGGRAPH
94, pages 101–108.

[Salvucci and Anderson, 2001] Salvucci, D. and Anderson, J. (2001). Automated eye-
movement protocol analysis. Human-Computer Interaction, 16:39–86.

[Santella and DeCarlo, 2002] Santella, A. and DeCarlo, D. (2002). Abstracted
painterly renderings using eye-tracking data. In Proceedings of the Second
International Symposium on Non-photorealistic Animation and Rendering (NPAR), pages
75–82.

[Santella and DeCarlo, 2004a] Santella, A. and DeCarlo, D. (2004a). Visual interest
and NPR: an evaluation and manifesto. In Proceedings of the Third International
Symposium on Non-photorealistic Animation and Rendering (NPAR),
pages 71–78.

[Santella and DeCarlo, 2004b] Santella, A. and DeCarlo, D. (2004b). Robust
clustering of eye movement recordings for quantification of visual interest. In Proceedings
of the Eye Tracking Research and Applications (ETRA) Symposium 2004, pages 27–34.

[Schumann et al., 1996] Schumann, J., Strothotte, T., and Laser, S. (1996). Assessing
the effect of non-photorealistic rendering images in computer-aided design. In ACM
Human Factors in Computing Systems, SIGHCI, pages 35–41.

[Setlur et al., 2004] Setlur, V., Takagi, S., Raskar, R., Gleicher, M., and Gooch, B.

[Shapiro and Stockman, 2001] Shapiro, L. and Stockman, G. (2001). Computer Vi-
sion. Prentice-Hall.

[Shiraishi and Yamaguchi, 2000] Shiraishi, M. and Yamaguchi, Y. (2000). An
algorithm for automatic painterly rendering based on local source image approximation.
In Proceedings of the First International Symposium on Non-photorealistic
Animation and Rendering (NPAR), pages 53–58.

[Sibert and Jacob, 2000] Sibert, L. E. and Jacob, R. J. K. (2000). Evaluation of eye
gaze interaction. In Proceedings CHI 2000, pages 281–288.

[Suh et al., 2003] Suh, B., Ling, H., Bederson, B. B., and Jacobs, D. W. (2003).
Automatic thumbnail cropping and its effectiveness. In ACM Conference on User Interface
and Software Technology (UIST 2003), pages 95–104.

[Torralba et al., 2004] Torralba, A., Murphy, K., and Freeman, W. (2004). Contextual
models for object detection using boosted random fields. In Adv. in Neural Infor-
mation Processing Systems.

[Tufte, 1990] Tufte, E. R. (1990). Envisioning Information. Graphics Press.

[Turano et al., 2003] Turano, K. A., Geruschat, D. R., and Baker, F. H. (2003). Ocu-
lomotor strategies for the direction of gaze tested with a real-world activity. Vision
Research, 43:333–346.

[Underwood and Radach, 1998] Underwood, G. and Radach, R. (1998). Eye guid-
ance and visual information processing: Reading, visual search, picture perception
and driving. In Underwood, G., editor, Eye Guidance in Reading and Scene Percep-
tion, pages 1–27. Elsevier Science Ltd.

[Vertegaal, 1999] Vertegaal, R. (1999). The gaze groupware system: Mediating joint
attention in multiparty communication and collaboration. In Proceedings CHI ’99,
pages 294–301.

[Walker et al., 1998] Walker, K. N., Cootes, T. F., and Taylor, C. J. (1998). Locating
salient object features. In Proceedings BMVC, 2:557–567.

[Wandell, 1995] Wandell, B. A. (1995). Foundations of Vision. Sinauer Associates


[Wang et al., 2004] Wang, J., Xu, Y., Shum, H.-Y., and Cohen, M. (2004). Video
tooning. In Proceedings of ACM SIGGRAPH 2004, pages 574–583.

[Watanabe et al., 2004] Watanabe, D., Mao, X., Ono, K., and Imamiya, A. (2004).
Gaze-directed streamline seeding. In APGV 2004.

[Winkenbach and Salesin, 1994] Winkenbach, G. and Salesin, D. H. (1994).
Computer-generated pen-and-ink illustration. In Proceedings of ACM SIGGRAPH
94, pages 91–100.

[Witkin, 1983] Witkin, A. (1983). Scale-space filtering. pages 1019–1021.

[Wooding, 2002] Wooding, D. S. (2002). Fixation maps: quantifying eye-movement
traces. In Proceedings of the Eye Tracking Research and Applications (ETRA)
Symposium 2002, pages 31–36.

[Yarbus, 1967] Yarbus, A. L. (1967). Eye Movements and Vision. Plenum Press.

[Yee et al., 2001] Yee, H., Pattanaik, S. N., and Greenberg, D. P. (2001).
Spatiotemporal sensitivity and visual attention in dynamic environments. ACM
Transactions on Graphics, 20:39–65.

[Zeki, 1999] Zeki, S. (1999). Inner Vision: An Exploration of Art and the Brain.
Oxford University Press.

Curriculum Vita
Anthony Santella
2005 Ph.D. in Computer Science, Certificate in Cognitive Science from Rutgers

1999 B.A. in Computer Science from New York University

2001-2004 Research Assistant, The VILLAGE, Department of Computer Science,
Rutgers University

1999-2001 Teaching Assistant, Department of Computer Science, Rutgers University


A. Santella and D. DeCarlo, “Visual Interest and NPR: an Evaluation and Manifesto”.
In Proceedings of the Third International Symposium on Non-Photorealistic
Animation and Rendering (NPAR) 2004, pp 71-78

A. Santella and D. DeCarlo, “Robust Clustering of Eye Movement Recordings for
Quantification of Visual Interest”. In Proceedings of the Third Eye Tracking Research
and Applications (ETRA) 2004, pp 27-34

D. DeCarlo, A. Finkelstein, S. Rusinkiewicz and A. Santella, “Suggestive Contours
for Conveying Shape”. In ACM Transactions on Graphics, 22(3) (SIGGRAPH 2003
Proceedings), pp 848-855

D. DeCarlo and A. Santella, “Stylization and Abstraction of Photographs”. In ACM
Transactions on Graphics, 21(3) (SIGGRAPH 2002 Proceedings), pp 769-776

A. Santella and D. DeCarlo, “Abstracted Painterly Renderings Using Eye-tracking
Data”. In Proceedings of the Second International Symposium on Non-Photorealistic
Animation and Rendering (NPAR) 2002, pp 75-82