BY ANTHONY SANTELLA
A Dissertation submitted to the Graduate School—New Brunswick Rutgers, The State University of New Jersey in partial fulﬁllment of the requirements for the degree of Doctor of Philosophy Graduate Program in Computer Science Written under the direction of Doug DeCarlo and approved by
New Brunswick, New Jersey May, 2005
ABSTRACT OF THE DISSERTATION
The Art of Seeing: Visual Perception in Design and Evaluation of Non-Photorealistic Rendering
by Anthony Santella Dissertation Director: Doug DeCarlo
Visual displays such as art and illustration benefit from concise presentation of information. We present several approaches for simplifying photographs to create such concise, artistically abstracted images. The difficulty of abstraction lies in selecting what is important. These approaches apply models of human vision, models of image structure, and new methods of interaction to select important content. Important locations are identified from eye movement recordings. Using a perceptual model, features are then preserved where the viewer looked, and removed elsewhere. Several visual styles using this method are presented. The perceptual motivation for these techniques makes predictions about how they should affect viewers. In this context, we validate our approach using experiments that measure eye movements over these images. Results also provide some interesting insights into artistic abstraction and human visual perception.
Thanks go to the many people whose help and support was essential in making this work possible. None of this would have happened without my advisor Doug DeCarlo. Thanks go also to my other committee members: Adam Finkelstein, Eileen Kowler, Casimir Kulikowski and Peter Meer for their advice and encouragement at various (in some cases many) stages of this process. Thanks go also to the many friends and family members who have supported and kept me sane through this long process. I wouldn't have survived it without my parents and brothers Nick and Dennis. Special thanks go to Bethany Weber. Thanks also to Jim Housell, all the old NYU crowd, the grad group at St. Peters and all the supportive souls in the CS Department, RuCCS and the VILLAGE. Finally, thanks go to Phillip Greenspun for photos used in several renderings that appear in chapters 7 and 9, as well as models Marybeth Thomas, Adeline Yeo and Franco Figliozzi. Special thanks to Georgio Dellachiesa for looking equally thoughtful in countless illustrative examples.
Table of Contents
Abstract
Acknowledgements
List of Figures
1. Introduction
   1.1. Inspirations
        1.1.1. Artistic Practice
        1.1.2. Psychology
        1.1.3. Computer Graphics
   1.2. Our Goal
2. Abstraction in Computer Graphics
   2.1. Manual Annotation
   2.2. Automatic Methods
   2.3. Level Of Detail
3. Human Vision
   3.1. Eye Movements
        3.1.1. Eye Movement Control
        3.1.2. Salience Models
   3.2. Eye Tracking
   3.3. Limits of Vision
        3.3.1. Models of Sensitivity
        3.3.2. Sensitivity Away from the Visual Center
        3.3.3. Applicability to Natural Imagery
4. Vision and Image Processing
   4.1. Image Structure Features and Representation
   4.2. Segmentation
   4.3. Edge Detection
5. Our Approach
   5.1. Eye Tracking as Interaction
   5.2. Using Visibility for Abstraction
6. Painterly Rendering
   6.1. Image Structure
   6.2. Applying the Limits of Vision
   6.3. Rendering
   6.4. Results
7. Colored Drawings
   7.1. Feature Representation
        7.1.1. Segmentation
        7.1.2. Edges
   7.2. Perceptual Model
   7.3. Rendering
   7.4. Results
8. Photorealistic Abstraction
   8.1. Image Structure
   8.2. Measuring Importance
   8.3. Results and Discussion
9. Evaluation
   9.1. Evaluation of NPR
        9.1.1. Analysis of Eye Movement Data
   9.2. Experiment
        9.2.1. Stimuli
        9.2.2. Subjects
        9.2.3. Physical Setup
        9.2.4. Calibration and Presentation
   9.3. Analysis
        9.3.1. Data Analysis
        9.3.2. Statistical Analysis
   9.4. Results
        9.4.1. Quantitative Results
        9.4.2. Discussion
   9.5. Evaluation Conclusion
10. Future Work
    10.1. Image Modeling
         10.1.1. Segmentation
         10.1.2. Edges
    10.2. Perceptual Models
    10.3. Applications
11. Conclusion
References
Curriculum Vita
List of Figures
1.1. (a) Henri de Toulouse-Lautrec's "Moulin Rouge—La Goulue" (Lithographic print in four colors, 1891). (b) Odd Nerdrum's "Self-portrait as Baby" (Oil, 2000). Artists control detail as well as other features such as color and texture to focus a viewer on important features and create a mood. La Goulue's swirling under-dress is a highly detailed focal point of the image, and contributes to the picture's air of reckless excitement. Artists have a fair amount of latitude in how they allocate detail to create an effect. Nerdrum renders his eyes (usually one of the most prominent features in a portrait) in a sfumato style that makes them almost nonexistent. Detail is instead allocated to the child's prophetic gesture. These choices change a common baby picture into something mysterious and unsettling.

1.2. Judith Schaechter's "Corona Borealis" (Stained glass, 2001). Skillful artists use the formal properties and constraints of a medium for expressive purposes. The high dynamic range provided by transmitted light and the heavy black outlines of the lead caming that holds the glass together are used to set the figure off from the background, creating a powerful image of joy in isolation.

2.1. Direct placement of strokes. Complete control of abstraction is possible when a user provides actual strokes that are rendered in a given style. Reproduced from [Durand et al., 2001].

2.2. Manual annotation for textural indication. Important edges on a 3D model are marked and have texture rendered near them, while it is omitted in the interior. Reproduced from [Winkenbach and Salesin, 1994].

2.3. Manual local importance images. Hand-painted images can indicate important areas to be rendered in greater detail or fidelity. Reproduced from [Hertzmann, 2001].

2.4. (a) Original image. (b) Corresponding salience map [Itti et al., 1998]. (c) Corresponding salience map [Itti and Koch, 2000]. Salience methods pick out potentially important areas on the basis of contrast in some space (not limited to intensity). The two methods pictured here differ in the method of normalization used to enhance contrast between salient and nonsalient regions.

3.1. Patterns of eye movements of a single subject over an image when given different instructions. Note (1), free observation, which shows fixations that are relatively dispersed yet still focused on relevant areas. Contrast it with (3), where the viewer is instructed to estimate the figures' ages. Reproduced from [Yarbus, 1967].

3.2. Similar effects to [Yarbus, 1967] are easily (even unintentionally) achieved when using eye tracking for interaction. Circles are fixations; their diameter is proportional to duration. The first viewer was instructed to find the important subject matter in the image. The second viewer was told to 'just look at the image'. The viewer assumed, from prior experience in perceptual experiments, that he was going to be later asked detailed questions about the contents of the scene. This resulted in a much more diffuse pattern of viewing.

3.3. Log-log plot of contrast sensitivity from equation (3.2). This function is used to define a threshold between visible and invisible features.

3.4. Cortical magnification describes the drop-off of visual sensitivity with angular distance from the visual center.

4.1. (a) Scale space of a one-dimensional signal. Features disappear through scale space but no new features appear. (b) Plot of inflection points of another one-dimensional signal through scale space. Reproduced from [Witkin, 1983].

4.2. Interval tree for a 1D signal illustrating decomposition of the signal into a hierarchy. Reproduced from [Witkin, 1983].

5.1. (a) Computing eccentricities with respect to a particular fixation at p. (b) A simple attention model defined as a piecewise-linear function for determining the scaling factor a_i for fixation f_i based on its duration t_i. Very brief fixations (below t_min) are ignored, with a ramping up (at t_max) to a maximum level of a_max.

6.1. Painterly rendering results. The first column shows the fixations made by a viewer. Circles are fixations, size is proportional to duration, and the bar at the lower left is the diameter that corresponds to one second. The second column illustrates the painterly renderings built based on that fixation data.

6.2. Detail in background adjacent to important features can be inappropriately emphasized. The main subject has a halo of detailed shutter slats.

6.3. Sampling strokes from an anisotropic scale space avoids giving the image an overall blurred look, but produces a somewhat jagged look in background areas.

6.4. Color and contrast manipulation. Side-by-side comparison of renderings with and without color and contrast manipulation (precise stroke placement varies between the two images due to randomness).

7.1. Slices through several successive levels of a hierarchical segmentation tree generated using our method.

7.2. Line drawing style results.

7.3. Stylistic decisions. Lines in isolation (a) are largely uninteresting. Unsmoothed regions (b) can look jagged. Smoothed regions (c) have a somewhat vague and bloated look without the black edges superimposed.

7.4. Renderings with uniform high and low detail.

7.5. Several derivative styles of the same line drawing transformation. (a) Fully colored, (b) color comic, (c) black and white comic.

8.1. Mean shift filtering tends to create images that no longer look like photographs.

8.2. Photo abstraction results.

8.3. Photo in (a) is abstracted using fixations in (b) in a variety of different styles. (c) Painterly rendering, (d) line drawing, (e) locally disordered [Koenderink and van Doorn, 1999], (f) blurred, (g) anisotropically blurred.

8.4. (a) Detail of our approach, (b) the same algorithm using an importance map where total dwell is measured locally. Notice in (b) the leaking of detail to the wood texture from the object on the desk. Here differences are relatively subtle, but in general it is preferable to allocate detail in a way that respects region boundaries.

8.5. The range of abstraction possible with this technique is limited. With greater abstraction the scene begins to appear foggy. In some sense it no longer looks like the same scene.

9.1. Example stimuli. Detail points in white are from eye tracking; black detail points are from an automatic salience algorithm.

9.2. Illustration of data analysis, per image condition. Each colored collection of points is a cluster. Ellipses mark 99% of variance. Large black dots are detail points. We measure the number of clusters, the distance between clusters and the nearest detail point, and the distance between detail points and the nearest cluster.

9.3. Statistical significance is achieved for number of clusters over a wide range of clustering scales. The magnitude of the effect decreases, but its significance remains quite constant over a wide interval. Our results do not hinge on the scale value selected.

9.4. Average results for all analyses per image.

9.5. Average results for all analyses per viewer.

9.6. Original photo and high detail NPR image with viewers' filtered eye tracking data. Though we found no global effect across these image types, there are sometimes significantly different viewing patterns, as can be seen here.

10.1. A rendering from our line drawing system (b) can be compared to an alternate locally varying segmentation (c). This segmentation more closely follows the shape of shading contours.

10.2. Locally varying segmentation cannot replace a segmentation hierarchy. Another example of a locally varying segmentation controlled by a perceptual model (c), compared to a rendering from our line drawing system. Note fine detail in the brick preserved near the subject's head in (c). This is a consequence of the threshold varying continuously as a function of distance from the fixations on the face.

10.3. A rendering from our line drawing system demonstrates how long but unimportant edges can be inappropriately emphasized. Also, prominent lower frequency edges like creases in clothing are detected in fragments and filtered out because edges are detected at only one scale.

10.4. Attempting technical illustration of mechanical parts pushes our image analysis techniques close to (if not over) their limits.
Chapter 1 Introduction
In all eras and visual styles, artists control the amount of detail in the images they create, both locally and globally. This is not just a technique to limit the effort involved in rendering a scene. It makes a definite statement about what is important and streamlines understanding. Our goal is to largely automate this artistic abstraction in computer renderings. The hope is to remove detail in a meaningful way, while automating individual decisions about what features to include. Eye tracking allows the capture of what a viewer looks at and, indirectly, what they find important. We demonstrate that this information alone is sufficient to control detail in an image-based rendering, and to change the way successive viewers look at the resulting image.

Our method is grounded in the mechanisms and nature of vision—how we see and understand the world. This is an intuitive idea, if an often overlooked one. Artists must first be viewers [Ruskin, 1858] and viewers ultimately consume the resulting images. So, vision must be central in the design of algorithms for creating imagery.

Vision appears simple and effortless. Because under most circumstances it requires no conscious effort or exertion, it seems like a trivial operation, something that just happens, as if the light falling on the eye made one see in the same way it warms a stone. But sight is the product of an extraordinarily developed and complicated visual system. In seeing we are all experts, and experts make things seem easy. Without any effort we can navigate and act in the world and recognize objects even under difficult conditions. The abilities of our sight outreach even our awareness of them. Experiments have shown that the eyes of radiologists searching for tumors linger longer over tumors that they fail to notice and report [Mello-Thoms et al., 2002].
The limited success of attempts to mimic these human abilities in computer vision systems highlights both the difficulty of the computations involved, and our phenomenal success at them.
The apparent ease with which we see slips when our vision is stressed: struggling to keep a written page in focus as we fall asleep, searching for a loved one's face in the shifting crowd of an airport. At these times we become conscious of sight as a struggle to organize and make sense of the world. This struggle has continual victories, but also failures. An old friend waving to us on the street is passed by; a typo makes its way into an important document.

The apparent ease of vision also masks our limitations. We miss much, and are easily overloaded. Sometimes our failures are engineered: a camouflaged soldier, the proverbial fine print. More often, however, they are accidental. Some information was present, or presented, and we failed to notice it. Well-designed displays of visual information ensure we don't miss anything important through careful arrangement and manipulation. A wide variety of techniques are used to make meaning clear. Detail is put just where it is important, shapes can be changed or removed, colors and textures enhanced or suppressed. Paintings, sketches, technical illustrations, and even the most apparently photorealistic of art—all products of the human hand—have been simplified and manipulated for ease of understanding.

Reality is complicated and messy. Rather than realism, what is more often desired is verisimilitude. We want the appearance of reality, organized and structured to make its meaning clearer, if necessarily more limited than the infinite complexity of reality. Achieving this kind of clarity has always been the job of artists and designers, who make subjective, but not arbitrary, decisions about what is important and how to convey it. The ubiquity of digital media creates a need for automation in achieving this kind of good design.
The goal is not to replace the artist who creates a carefully crafted one-off display, but instead to create a potentially vast number of adaptive displays, tailored to particular situations and viewers. This information would otherwise be displayed in some less well-designed manner, placing more of a cognitive burden on the user. It has been argued, in fact, that avoiding this burden is one of the primary characteristics of powerful art [Zeki, 1999]. If good design can be formalized, this will
enhance understanding and aid effective communication, as well as improve our own understanding of the workings of visual communication. This thesis presents some initial steps toward this goal.
There are many techniques proposed by various artists, and perhaps even more theory proposed by various researchers and critics on how to achieve good visual design. Yet it remains imperfectly understood in all of the ﬁelds where it has been studied. Because of this, a successful practical approach must necessarily draw on elements from many areas of practice and theory. If a practical system is designed to be as general as possible, its creation can improve understanding of what visual clarity means, and how it relates to communication. It can also provide a framework in which to unify concepts and techniques from many ﬁelds.
One important source of inspiration for this work is artistic practice and practical theory. Artists have always had strong motivation to capture the attention and interest of uninterested, sometimes hostile viewers. Much ingenuity has been applied to creating images that are as gripping and clearly communicative as possible. Careful observation of such images can yield interesting insights (see Figure 1.1). Similarly, artists have throughout history given advice on the practice of their craft. Theorists and art historians have tried to make generalizations and analyze techniques [Ruskin, 1857, Gombrich et al., 1970, Graham, 1970, Arnheim, 1988]. This is true in graphic design as well as fine art. Classical texts like those of Tufte try to explore the qualities of good and bad presentations of information and make generalizations from carefully chosen examples. However, these instructions and recommendations are often difficult to apply. They
Figure 1.1: (a) Henri de Toulouse-Lautrec’s “Moulin Rouge—La Goulue” (Lithographic print in four colors, 1891). (b) Odd Nerdrum’s “Self-portrait as Baby” (Oil, 2000). Artists control detail as well as other features such as color and texture to focus a viewer on important features and create a mood. La Goulue’s swirling under-dress is a highly detailed focal point of the image, and contributes to the picture’s air of reckless excitement. Artists have a fair amount of latitude in how they allocate detail to create an effect. Nerdrum renders his eyes (usually one of the most prominent features in a portrait) in a sfumato style that makes them almost nonexistent. Detail is instead allocated to the child’s prophetic gesture. These choices change a common baby picture into something mysterious and unsettling.
are sometimes limited in scope, providing specific instructions for a particular narrow problem. More often, guidelines are too broad and vague in their application. They rely for their functioning on the judgment of the artist. The advice of artists and designers often comes in the form of heuristics, rules of thumb to be taken with a grain of salt, kept in the back of one's mind, and applied when the moment seems right. Becoming an expert in a visual field is often a question of cultivating, through practice and observation, an instinctive sense of when to apply such rules, and conversely when to break them.
Figure 1.2: Judith Schaechter's "Corona Borealis" (Stained glass, 2001). Skillful artists use the formal properties and constraints of a medium for expressive purposes. The high dynamic range provided by transmitted light and the heavy black outlines of the lead caming that holds the glass together are used to set the figure off from the background, creating a powerful image of joy in isolation.
A somewhat different approach is to study good design with the methodologies of psychology, psychophysics and neuroscience. This is in essence an attempt to understand good design from ﬁrst principles: the functioning of the human mind and visual system. Visual perception obviously mediates all information that passes from a display to a user. So, as a form of visual communication, art must be constrained by the laws of psychology and the visual system [Arnheim, 1988, Zeki, 1999, Ramachandran and
Hirstein, 1999]. This is an attractive idea. By understanding the strengths and weaknesses of the process that allows us to see, it should be possible to maximize use of the limited cognitive bandwidth between a display and viewer. This is perhaps not so far from what artists have done all along. One could view every daub of paint, every pen stroke as an informal experiment in vision. Artists test their actions against the evidence of their own visual systems, and make predictions about how they will affect others. Formal attempts to understand perception and art are simply more conscious, more systematic, and more interested in understanding the creative process itself than making a statement through it. A number of psychologists have speculated on this, and pointed to specific examples from art history [Arnheim, 1988, Leyton, 1992, Zeki, 1999, Ramachandran and Hirstein, 1999]. Studies have indeed found empirical evidence of perceptual effects resulting from artistic style or composition [Ryan and Schwartz, 1956, Locher, 1996].

Like most attempts to do anything complicated from first principles, looking at art and design using cognition is hard. Much has been understood about the visual system, but much also remains unknown. The more basic and low-level an area of visual function is, the more we know about it, and the less useful that information is for design. Much, for example, is known about the physical mechanism of how we perceive color; substantially less is known about how we parse shapes out of a background and assemble them into objects. It is not surprising that many researchers looking at art from a cognitive standpoint consider primarily 20th-century painters, like Mondrian, Kandinsky, or even Picasso at his more abstract, who themselves were largely concerned with the purely formal aspects of pictorial space rather than the semantics of subject matter.
The semantic aspects of vision which reference the rest of the world and its non-visual aspects are ill understood, so little cognitive research can be brought to bear on the semantics of art. Given the limited basic knowledge, general theories of how art functions cognitively are, almost of necessity, rather vague in their application. Ramachandran 
for example, suggests that all art is guided by the peak shift principle. This principle, found in a number of situations in psychology, says that if a response is trained to some stimuli, the greatest, or peak, response will be found with a stimulus that is greater than the one used in training. A depiction functions by emphasizing the features that normally let one know what it is. In this view all art is a form of caricature. However, this does not tell us the qualities of a successful caricature. In another example, Leyton argues that art maximally encodes a causal history that can be read by viewers. Good art should contain as much information in the form of asymmetry as possible to stimulate viewers, but not so much that it disturbs them. Though a reasonable-sounding standard, this only hints at what the correct level of complexity is. The application of psychology to design is difficult. However, we do not need to build a system directly on these principles. Inspired by them, we can apply knowledge from low-level vision and computer graphics techniques to build practical systems.
A large body of work in computer graphics ignores all these difﬁculties and sets out to create attractive synthetic art and illustration. Attempts at algorithmic deﬁnitions of good design surface in a number of areas in computer science, graphics, scientiﬁc visualization, document layout, human computer interaction, and interface design. Concerns of effective art-like visual communication have particularly come to the forefront in the realm of non-photorealistic rendering, or NPR. This area is perhaps excessively broad. It includes almost any part of graphics that aims to create images that are not an imitation of reality. It includes things as diverse as computer generation of geometrical patterns, instructional diagrams and impressionist paintings. NPR images run a gamut between the purely ornamental and those designed to convey very speciﬁc information. A large area of research in NPR has been the production of many, often quite impressive, phenomenological models for rendering in various traditional media and styles. There is however an increasing interest in NPR as not just a way to imitate traditional
visual styles, but also as a set of techniques for trying to display visual information in a concise and abstract way. The link between concise presentation and imitating traditional artistic styles is not accidental. Almost all the visual styles of traditional media (line drawings, wood-block prints, comics, expressionist or impressionist paintings, pencil sketches) necessarily discard vast amounts of information as a direct consequence of their visual style. There is, for example, no color or shading in a pure line drawing. However, these images still carry the essential content that the artist (and viewer) requires of them. Skillful artists can use the properties and constraints of a medium to enhance the expressiveness of a work (see Figure 1.2). A brief time spent working with photo filters in a program like Adobe Photoshop suggests that computer implementations of these styles capture some of the effects of traditional media, but often in a way that does not adapt to particular situations with an artist's flexibility. Artists ultimately can judge their results as they go. Applying a technique in a blanket manner is often less satisfactory. What is acceptable as reality in a photograph can look fussy and crowded as a painting.
Though today's algorithms cannot model the general intelligence of an artist, we argue that carefully designed systems can make use of minimal user interaction to create much more expressive images. Specifically, we look at modulation of local detail, an important cue used in traditional art and visualization. Including detail only where it is needed focuses viewer interest and can help clarify the point of an image. Detail modulation is a feature of art and illustration, but applications in visualization could benefit from it as well. It would allow the computer to hand-craft displays for clarity and efficient understanding in a particular situation. This work does not directly address specific visualization applications. Rather than exploring visualization directly, art remains the focus, and this thesis remains firmly in
the realm of artistic NPR. Our hope, however, is that insights gained in this way should be applicable to a number of areas in visualization. Art is a particularly good place to explore the link between cognition and design of displays. Specific applications tend to distract with their own implementation details and domain constraints. Radiology, for example, is a domain where complexity and high stakes greatly constrain practical applications. Art encourages a wider view, in which it is easier to look at general techniques and patterns that are widely useful. Similarly, in evaluation, validation of a particular system is of limited interest, while evaluation of more general techniques can provide insights into cognition and be more widely relevant. Grounding our work in knowledge of visual perception also helps focus attention away from application engineering and towards general concepts. We are interested in methods that achieve expressive NPR images with a minimum of interaction. Knowledge of visual perception suggests that by exploiting the visual system we can reserve human effort for just the hardest parts of the process of crafting images, and pass the majority of the work over to a computer. For a computer application, the hardest part of abstraction is deciding what is important. This is not hard for people, since it is done instinctively. Deciding what to paint a picture of is the easy part for an artist; it is the mechanics of turning that intention into an image that takes training, time and effort. This leads us to a simple, minimally interactive method for controlling detail via eye tracking. As we will soon see, vision research leads us to believe that where people look indicates importance. Such areas should be portrayed in detail. Conversely, what viewers don't look at is unimportant to them and can be removed or de-emphasized. The same insights about vision that lead to this methodology also lead us to quantitative methods for evaluation.
If our approach is successful, increased interest in areas highlighted with detail should be reﬂected in eye movements. This methodology holds the promise of images that are carefully crafted for understanding on sound principles, and can be formally evaluated for effectiveness. Such images and techniques can in turn serve as a tool for further investigating human vision in a way targeted toward the
questions that are important for crafting images. With more information, even better techniques and images can be built. In this thesis we begin in Chapter 2 by laying out the basic problem of controlling detail in NPR imagery, and look at the range of techniques that have been used to address it. In Chapters 3 and 4 we then review the basic background in human and computer vision underlying our approach to this problem. The nature of vision leads us to an approach of capturing the intentionality central to design via eye tracking. Information about where people look is alone sufficient to control detail in a directed way, allowing us to craft semi-automatic NPR images with much of the attractive and engaging intentionality of completely handmade art. The basic nature of this interaction is described in Chapter 5. In Chapters 6, 7 and 8 we then present several systems for creating NPR renderings built on this idea, and discuss their strengths and weaknesses. An evaluation of one of these systems is presented in Chapter 9, which not only validates the general approach but gives some interesting insights into abstraction and human vision. Finally, in Chapter 10 we discuss some directions for future research.
Chapter 2 Abstraction in Computer Graphics
In any work of art, not all parts of the picture plane receive equal attention from the artist. Critical areas are more detailed, while others are left relatively abstract. This is the case even in quite realistic styles, and in technical illustration. Such effects have not been ignored in computer graphics and NPR. Local control of detail has been addressed in several visual styles. Whatever the rendering techniques used, important areas can be identified and depicted with greater detail, or emphasis on fidelity. Deciding what is important is difficult to do automatically. Two broad approaches to selecting important areas can be characterized: manual user annotation, and simple heuristics.
Figure 2.1: Direct placement of strokes. Complete control of abstraction is possible when a user provides actual strokes that are rendered in a given style. Reproduced from [Durand et al., 2001].
At one extreme, near complete control of detail can remain in the hands of a user. This provides many expressive possibilities at the expense of much interaction. At its
Figure 2.2: Manual annotation for textural indication. Important edges on a 3D model are marked and have texture rendered near them, while it is omitted in the interior. Reproduced from [Winkenbach and Salesin, 1994].
Figure 2.3: Manual local importance images. Hand painted images can indicate important areas to be rendered in greater detail or fidelity. Reproduced from [Hertzmann, 2001].

furthest extreme the computer becomes merely a digital paintbrush the user directly manipulates [Baxter et al., 2001]. A number of intermediate approaches exist that aid the user in the technicalities of creating an image while still giving them complete control over detail. The earliest work creating a painting-like appearance, or painterly rendering effect [Haeberli, 1990], took this approach. A user places strokes entirely by hand, their color being sampled from an underlying source image. The approach is in effect a form of tracing, where the user ultimately remains in control of stroke placement and size while, like a traditional media artist, making their own decisions about which details are important as they go. A similar kind of interaction has been used [Durand et al., 2001] in generating pencil renderings (see Figure 2.1). The user places strokes which are shaded and shaped automatically to create a final drawing. The same stroke based interactive methods are applicable in 3D [Kalnins et al., 2002].
One step distant from actually drawing strokes, it is also possible to indicate increased importance for some areas of a rendering using an importance map, where higher intensity indicates the need for more attention or detail in that area. For example, in a painterly rendering framework [Hertzmann, 2001], a hand drawn importance map was used to indicate that a source image should be more closely approximated in certain locations (see Figure 2.3). Similarly, in 3D, hand drawn lines have been used to indicate locations near which textural detail should be included [Winkenbach and Salesin, 1994] (see Figure 2.2). In another painterly rendering application [Gooch and Willemsen, 2002], rectangles to be painted in greater detail could be drawn by hand. Various digital versions of other media, such as pen and ink [Salisbury et al., 1994] and watercolor [Curtis et al., 1997], have been developed that provide the user with significant control over the detail present in different areas. Such approaches can yield attractive results, but require careful attention on the part of a user.
Figure 2.4: (a) original image. (b) corresponding salience map [Itti et al., 1998]. (c) corresponding salience map [Itti and Koch, 2000]. Salience methods pick out potentially important areas on the basis of contrast in some space (not limited to intensity). The two methods pictured here differ in the method of normalization used to enhance contrast between salient and nonsalient regions.
More common in NPR have been purely automatic methods. Automatic methods also run a gamut, from approaches that process an image in a completely local, uniform manner to those that automatically extract some quantity from an image as a proxy for
importance. Uniform approaches perform some (not necessarily local) operation uniformly across an image, and have been used extensively in painterly rendering [Hertzmann, 1998, Litwinowicz, 1997, Shiraishi and Yamaguchi, 2000]. A global effect provides users with only limited control. Rather than being truly uniform, some of these approaches make a (largely implicit) simple assumption that some low level features are important and worth preserving. Automatic painterly rendering methods, for example, largely assume strong high frequency features are important and should be preserved in a rendering. In fact, painterly techniques differ largely in their method for respecting these boundaries: aligning strokes perpendicular to the image gradient [Haeberli, 1990], terminating strokes at edges [Litwinowicz, 1997], or drawing in a coarse-to-fine fashion [Hertzmann, 1998, Shiraishi and Yamaguchi, 2000, Hays and Essa, 2004]. Similarly, automatic line drawing approaches (both 2D and 3D) assume the importance of all lines that meet certain purely geometrical definitions: occluding contours and creases [Saito and Takahashi, 1990, Interrante, 1996, Markosian et al., 1997], and suggestive contours [DeCarlo et al., 2003]. Such techniques can create attractive images, but lack the selective omission which gives art much of its expressive power. The kind of omission commonly used in depicting specific objects can sometimes be explicitly stated. In drawing trees, for example, detail can be omitted in the center of the tree, especially as the tree is drawn smaller [Kowalski et al., 1999, Deussen and Strothotte, 2000]. Though this may be an accurate characterization of a particular common style of depiction, it is not generally applicable to any subject. For general images, there are relatively few options for automatically selecting important areas. Some attempts have been made to predict importance using various image analysis techniques.
In 3D, image pyramids have been applied to omit detail in the interior of a shape [Grabli et al., 2004]. In 2D, drawing on vision research, some approaches have attempted to use salience measures to capture importance. Salience measures are a guess at the ability of a feature to capture interest based on its low level properties [Itti et al., 1998,Itti and Koch, 2000]. Similarly motivated salience measures
have been applied to attempt to predict features worth preserving in painterly rendering [Collomosse and Hall, 2003]. Because faces are often an important component of images, detecting them also provides a useful (though not always reliable) automatic cue for what areas are important. Face detection has been used alongside salience methods in other areas of graphics loosely related to NPR where identifying important features is useful, such as automatic cropping [Chen et al., 2002, Suh et al., 2003] and recomposing of photographs [Setlur et al., 2004].
Level Of Detail
One area of computer graphics omitted from the above discussion has dealt with many of these same issues. Various adaptive rendering and level of detail (LOD) schemes have used the visibility or potential interest of features to skip computations that are unlikely to be noticed. This is different from our goal. We are interested in detail modulation for stylistic and expressive reasons. Level of detail seeks to control the computational cost of rendering through approximation, not abstraction. Though both are concerned with simplification, LOD and various other corner-cutting is usually meant to be invisible, or nearly so, while expressive abstraction is meant to be seen and indeed have a strong effect on the way a viewer looks at an image. Though the goals are different, some of the methodologies overlap. The goal of imperceptible omission has encouraged researchers to look at perceptually motivated methods. Salience measures have been applied to concentrate computation on noticeable areas [Yee et al., 2001, Cater et al., 2003]. In addition, a variety of low level perceptual models have been applied to try to quantify the visibility of features and guarantee that simplification is invisible, or minimize visibility. We adopt several of these metrics in our own efforts. One of our contributions can be seen as applying and expanding perceptual models originally adopted in LOD to create expressive artistic abstraction. Both perceptually motivated LOD methods and the methods we present in this
thesis use models of vision to identify expendable areas of an image. It is the functional deﬁnition of an expendable area that differs between the two. In the following chapter we present the relevant background in human vision necessary for understanding why such areas exist, and how they may be identiﬁed.
Chapter 3 Human Vision
A background in human vision is essential in computationally defining artistic abstraction. We have extraordinarily complex abilities to analyze images, and these abilities have strengths and weaknesses. Level of detail simplification methods seek to exploit the limits of vision to cut corners in an unnoticeable way. In contrast, we hope to use the related strengths of the visual system to improve visual design, clarifying content and making things that need to pop out, pop out. Our interactive technique uses eye movements and the limits of vision to indirectly measure the importance of features. Some background will clarify the motivation for this approach.
The human eye is maximally sensitive over a relatively small central area called the macula. This area of relatively high resolution is approximately 5 degrees across, while the most sensitive region (the fovea) is only 1.3 degrees (from a total visual angle of about 160 degrees) [Wandell, 1995]. Sensitivity rapidly degrades outside of this central region. Our perception of uniform detail throughout space is a result of continually switching the point at which our eyes are looking (the point of regard, or POR). This process involves two important types of eye motions: fixations, relatively long periods spent looking at a particular spot, and saccades, very rapid changes of eye position. These are not the only kinds of motion of which the eye is capable. In smooth pursuit the eye follows a moving object, and even when fixated the eye continually makes very small jittery motions. Fixations and saccades, however, are the most significant motions when viewing static scenery. Saccades can be initiated consciously, but for the most part occur naturally as we explore a scene. Though fixating on a location
Figure 3.1: Patterns of eye movements of a single subject over an image when given different instructions. Note (1) free observation which shows ﬁxations that are relatively dispersed yet still focused on relevant areas. Contrast it with (3) where the viewer is instructed to estimate the ﬁgures’ ages. Reproduced from Yarbus 1967.
is not identical to attending to it, for the most part an attended location is fixated (i.e., if we pay attention to something, we strongly tend to look at it directly) [Underwood and Radach, 1998].
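Separating raw gaze samples into fixations and saccades is itself a small algorithmic problem. A common solution is dispersion-threshold identification (I-DT, in the taxonomy of Salvucci and Goldberg): grow a temporal window while the samples stay spatially compact, and emit a fixation when the window is long enough. The sketch below is a generic illustration of this idea, not the tracker software used in this thesis; the function name and threshold values are illustrative assumptions.

```python
def detect_fixations(samples, max_dispersion=1.0, min_duration=0.1):
    """Dispersion-threshold (I-DT style) fixation detection.

    samples: list of (t, x, y) gaze samples; x, y in degrees of
    visual angle, t in seconds. Returns a list of fixations as
    (t_start, t_end, center_x, center_y). Thresholds are
    illustrative, not values from this thesis.
    """
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        j = i
        # grow the window while its spatial dispersion stays small
        while j + 1 < n:
            xs = [s[1] for s in samples[i:j + 2]]
            ys = [s[2] for s in samples[i:j + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        duration = samples[j][0] - samples[i][0]
        if duration >= min_duration:
            xs = [s[1] for s in samples[i:j + 1]]
            ys = [s[2] for s in samples[i:j + 1]]
            fixations.append((samples[i][0], samples[j][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j + 1  # consume the whole fixation window
        else:
            i += 1     # saccade or noise sample; skip it
    return fixations
```

Samples between fixations are treated as saccades. Circle plots like those in Figure 3.2 follow directly: draw each fixation at its center with radius proportional to its duration.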
Figure 3.2: Similar effects to [Yarbus, 1967] are easily (even unintentionally) achieved when using eye tracking for interaction. Circles are fixations; their diameter is proportional to duration. The first viewer was instructed to find the important subject matter in the image. The second viewer was told to 'just look at the image'. The viewer assumed, from prior experience in perceptual experiments, that he was going to be later asked detailed questions about the contents of the scene. This resulted in a much more diffuse pattern of viewing.
3.1.1 Eye Movement Control
Qualitatively, a great deal is known about fixations. Eye movements are highly goal directed. Viewers don't just look around at random. Instead, they fixate meaningful parts of images [Mackworth and Morandi, 1967, Underwood and Radach, 1998, Henderson and Hollingworth, 1998], and fixation duration is related to processing [Just and Carpenter, 1976, Henderson and Hollingworth, 1998]. Viewing is highly influenced by task. The classic example of this [Yarbus, 1967] demonstrated that viewers examining the same image, with different tasks to perform, showed drastically different patterns of viewing, in which they focused on the features relevant to their task (see Figure 3.1). Given the same task, the motions of a particular viewer over an image at different viewings can be quite different, yet the overall distribution of fixations remains similar [Yarbus, 1967]. In real activities, actions, even those thought
of as automatic, are usually preceded by (largely unperceived) fixations of relevant features [Land et al., 1999]. These effects have been noted from some of the earliest research in the field [Yarbus, 1967], but the mechanisms involved remain for the most part informally understood. In general, understanding of most higher-level aspects of eye movement control is largely qualitative. In limited domains such as reading, attempts have been made to formulate mathematical models of viewing behavior. For complex natural scenes, much less is known [Henderson and Hollingworth, 1998]. Clearly any information used in guiding eye movements must come from the scene. Likewise, the process of selecting a new location to view must be guided in part by low frequency information gathered from the periphery during earlier fixations. A matter of debate is whether low-level visual information gained like this directly controls behavior or whether it is primarily used when integrated into a higher level understanding. The precise factors involved in control and planning of eye movements are an active and highly debated topic [Kowler, 1990].
Much effort has gone into attempts to identify purely low-level image measurements that can account for a significant amount of viewing behavior. Clearly it would be interesting if what appears to be a highly complex behavior requiring general understanding could be modeled, or at least reasonably predicted, by a simple approach. Results have been mixed. Fixation locations do not correlate very well over time with the presence of simple low level image features such as areas of high contrast, junctions, etc. [Underwood and Radach, 1998]. More complex models have been formulated, such as the salience methods mentioned earlier. All measure contrast in one sense or another. In general, salience methods embody the assumption that unusual features are likely to be important and looked at. Choice of feature space, and scale of measurement and comparison, differ. One
popular approach [Itti et al., 1998, Itti and Koch, 2000] uses center surround filters to measure local contrast in color, orientation and intensity to model general viewing behavior. [Rosenholtz, 2001] uses a probabilistic framework to measure the probability of a feature given a Gaussian model of color or velocity in the surround. This was used to predict visual search performance. A related salience framework was proposed [Walker et al., 1998] to select unique image locations to match for image alignment. This approach used kernel estimation to measure the rarity of local differential features in the global image wide distribution of those features. These approaches share the same basic idea but vary in what they attempt to model. This raises the question of what one is really trying to capture with salience. One can look at salience as simply a quantitative method of deciding whether something is present in a particular location in the visual field. In this context, salience doesn't actually state the location is important, just that it might be because something is there. It seems quite plausible that a measure like this plays a role in perception. However, more is usually claimed for salience, for example that it predicts most of viewing behavior or the valuable content in an image. Salience would seem to have some additional predictive power because in a wide class of images the semantically important subject does contrast with the rest of the scene. Relatively few people take pictures of their family members dressed in camouflage and lurking in the bushes. Nobody takes a picture of a blade of grass in a field. The tendency of meaningful features to be visually prominent is by no means universal. It is also unclear if this is really a property of the world, or a property of pictures people take, but it does seem to underlie some of the success of salience as an engineering tool in graphics.
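The flavor of center-surround salience can be conveyed with a toy single-channel, single-scale sketch: contrast at a pixel is the absolute difference between a fine and a coarse local average of intensity. This is only a stand-in for the multi-channel, multi-scale pyramids of [Itti et al., 1998], not their code; the function names and radii are illustrative assumptions.

```python
import numpy as np

def box_blur(img, r):
    """Mean filter of radius r, via a summed-area table (edge-padded)."""
    pad = np.pad(img, r, mode='edge')
    s = pad.cumsum(0).cumsum(1)
    s = np.pad(s, ((1, 0), (1, 0)))  # leading zero row/col for inclusive sums
    k = 2 * r + 1
    win = s[k:, k:] - s[:-k, k:] - s[k:, :-k] + s[:-k, :-k]
    return win / (k * k)

def salience(img, center_r=1, surround_r=4):
    """Single-channel, single-scale center-surround contrast:
    |fine blur - coarse blur|, normalized to [0, 1]."""
    c = box_blur(img.astype(float), center_r)
    s = box_blur(img.astype(float), surround_r)
    m = np.abs(c - s)
    return m / m.max() if m.max() > 0 else m
```

A full salience model would compute such maps for several channels (color opponency, orientation) at several scales and combine them after normalization; the choice of normalization is exactly what distinguishes the two maps in Figure 2.4.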
Salience models have also been used to model viewing in narrower domains where their applicability is more clear. The presence or absence of pop out effects in search for example [Rosenholtz, 1999, Rosenholtz, 2001] is effectively modeled by simple salience models that measure how distracting a distracter actually is.
Debate about how useful salience is in understanding general viewing is ongoing. Some optimistically state that salience predictions correlate well with real eye motions of subjects free viewing images [Privitera and Stark, 2000, Parkhurst et al., 2002]. Others are more doubtful and claim that when measured more carefully, and in the context of a goal driven activity, the correlation is quite poor [Land et al., 1999, Turano et al., 2003]. This mismatch in experimental results fits the intuition that visually prominent, 'eye-catching' features might be more correlated with idle exploration of a scene, and much less related to eye movements made during a task. In spite of this controversy, salience methods are quite popular and have seen a fair amount of application in computer graphics. They show some correlation with visually prominent features and are fairly simple to implement. Code for some is publicly available. Clearly both semantics and low-level features play a part in eye movements. Further investigation is necessary to clarify the contributions to viewing behavior of salience and scene semantics. Though they seem unable to model important aspects of viewing behavior, salience models may provide important measures of visual prominence.
Much of the knowledge above about human eye motion has been gained through the use of eye-tracking. A system measures a viewer's eye in one of several ways and records the point where it is looking, the point of regard (POR). One common approach involves a video camera and an infrared light source. The relative positions of the pupil and corneal reflection in the resulting image are used to calculate the point of regard [Duchowski, 2000]. These systems are reasonably reliable and accurate and improve with each generation, though they are still subject to drift over time and variability between viewers. The same technology is used in producing units that sit in front of a fixed display, and in head mounted units for use in more general scenes. Video based trackers have the virtue of not interfering directly with a viewer, making
them useful as both a natural interactive method and a research tool. Outside of research in human vision, eye-trackers have seen increasing use as a mode of human computer interaction. Eye tracking has also enabled the use of eye movements as a gauge of cognitive activity for psychological investigations and for evaluation of visual displays. Eye position has been used as a cursor for selection tasks in a GUI [Sibert and Jacob, 2000]. Eye movements have also been used to indicate a user's attention to others in a videoconferencing environment [Vertegaal, 1999]. Another class of use, related to ours, uses POR to control simplifying images or scenes for efficiency purposes. Knowing where a user looks enables pruning of information that is not perceptible, and need not be transmitted in a video stream [Duchowski, 2000]. Similarly, unexamined content need not be rendered in a 3D environment. In practice, few current systems that make use of such simplification actually use eye tracking, presumably because of limited availability; head tracking is typically used instead [Reddy, 2001]. On the whole, eye tracking has been found more useful in interaction where it serves as an indirect measure of user interest. Eye movements are not under full voluntary control. Because of this, when viewers attempt to explicitly point with their eyes the result tends to lack control and suffer from the so-called "Midas Touch" problem [Jacob, 1993], where struggling to control eye position, like a cursor, based on visual feedback creates even more uncontrolled looking, touching on many irrelevant or undesirable locations. The same involuntary link of eye movement to thought processes that makes eye tracking a poor mouse has made it useful as an indirect measure of interest and cognitive activity.
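The pupil/corneal-reflection geometry is usually turned into a POR by a calibration step: the viewer fixates a grid of known screen targets, and a low-order polynomial mapping from pupil-minus-glint image vectors to screen coordinates is fit by least squares. The sketch below shows the general shape of such a calibration; it is a hedged illustration of the common approach, not the procedure of any particular commercial tracker, and all names are our own.

```python
import numpy as np

def _design(pupil_glint):
    """Quadratic design matrix: 1, dx, dy, dx*dy, dx^2, dy^2."""
    dx, dy = pupil_glint[:, 0], pupil_glint[:, 1]
    return np.stack([np.ones_like(dx), dx, dy,
                     dx * dy, dx ** 2, dy ** 2], axis=1)

def fit_gaze_mapping(pupil_glint, screen):
    """Least-squares fit of a quadratic mapping from (N,2) pupil-glint
    vectors to (N,2) known screen positions; returns a (6,2)
    coefficient matrix."""
    C, *_ = np.linalg.lstsq(_design(pupil_glint), screen, rcond=None)
    return C

def predict_por(pupil_glint, C):
    """Map new pupil-glint vectors to estimated screen coordinates."""
    return _design(pupil_glint) @ C
```

Drift over time, mentioned above, shows up here as the fitted mapping slowly going stale, which is why trackers are periodically recalibrated.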
Eye tracking has been used to evaluate the effectiveness of informational displays including application interfaces [Crowe and Narayanan, 2000], web pages [Goldberg et al., 2002], and air trafﬁc control systems [Mulligan, 2002]. As mentioned earlier, eye movements may even reveal information that viewers are trying to report, but cannot, because it is not consciously available. Experiments have shown
that professional radiologists examining slides look longer at locations where tumors are present, even when they fail to identify and report them [Mello-Thoms et al., 2002]. In the future, this might hold the promise of computer assisted technologies to avoid such mistakes. Several consulting companies currently sell evaluation services using eye tracking to graphic design houses and web content creators, among others.
Limits of Vision
Eye movements are related to the resolutional limitations of the eye. At any of the ﬁxations with which a viewer explores a scene, the most detailed information is received only in the fovea, but lower frequency information is received throughout the visual ﬁeld. These limits on sensitivity within the visual ﬁeld are not a weakness of the visual system. On the contrary, they are part of our ability to efﬁciently process wide ﬁelds of view and integrate information across eye movements and changes in viewpoint.
Models of Sensitivity
Quantitative models of visual acuity and contrast sensitivity have been developed to model sensitivity to stimuli with different properties. Models of acuity predict whether an observer can detect a black feature of a particular size on a white background. Contrast sensitivity measures an observer's ability to discriminate a repeating pattern of a particular contrast and frequency from a uniform gray field. The drop-off in these sensitivities away from the visual center is modeled as a function of eccentricity, the angular distance from the point of fixation. Contrast sensitivity has been studied extensively in a variety of conditions, usually using monochromatic sinusoidal gratings (smoothly varying, repeating patterns of light and dark bands). This sensitivity declines sharply with eccentricity [Kelly, 1984, Mannos and Sakrison, 1974, Koenderink et al., 1978]. Contrast threshold is defined as the
(unitless) contrast value (0 to 1 with 1 being maximal contrast) at which a grating and uniform gray become indistinguishable. Contrast sensitivity is the reciprocal of this value.
Figure 3.3: Log-log plot of contrast sensitivity from equation (3.2). This function is used to define a threshold between visible and invisible features.

Many researchers have empirically studied human contrast sensitivity and several have developed mathematical models from their data. Researchers in computer science have also used existing data and models in applications. Different aspects of a stimulus are important in different situations. Fitting models to data collected from different viewers under different circumstances gives somewhat different results. Two examples are given here to illustrate the form these mathematical models take. Kelly [Kelly, 1984] developed a mathematical model for the contrast sensitivity curve (at the center of the visual field) including appropriate scaling factors describing the effects of velocity (v) as well as frequency (f, in cycles/degree) of a grating on sensitivity.
A(f, v) = \left(6.1 + 7.3\left|\log_{10}(v/3)\right|^{3}\right) v f^{2} e^{-2f(v+2)/45.9}
Mannos and Sakrison [Mannos and Sakrison, 1974] fit a mathematical model appropriate to still imagery to the results of prior empirical studies, for use as a metric in evaluating image compression.
A(f) = s_{max} \cdot 2.6\,(0.0192 + 0.114 f)\, e^{-(0.114 f)^{1.1}}
where s_max is the peak contrast sensitivity (this is around 400, but varies from person to person).
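Both fits are straightforward to evaluate directly. The sketch below transcribes the two formulas with the constants as usually quoted in the literature; treat the numbers as illustrative transcriptions rather than authoritative, and note that Kelly's fit requires a strictly positive velocity.

```python
import math

def csf_kelly(f, v):
    """Kelly's spatiotemporal contrast sensitivity.
    f: spatial frequency in cycles/degree; v: retinal velocity in
    degrees/second (must be > 0 for the log term)."""
    return (6.1 + 7.3 * abs(math.log10(v / 3.0)) ** 3) * v * f ** 2 \
        * math.exp(-2.0 * f * (v + 2.0) / 45.9)

def csf_mannos(f, s_max=400.0):
    """Mannos-Sakrison still-image contrast sensitivity, scaled by the
    peak sensitivity s_max (about 400, varying between viewers)."""
    return s_max * 2.6 * (0.0192 + 0.114 * f) * math.exp(-(0.114 * f) ** 1.1)
```

Both functions are band-pass in f: sensitivity peaks at a few cycles/degree and falls off toward both very low and very high frequencies, the hump-like shape plotted in Figure 3.3.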
Sensitivity Away from the Visual Center
A number of researchers have explored how sensitivity varies with eccentricity [Kelly, 1984,Rovamo and Virsu, 1979]. At larger eccentricities (expressed in degrees of visual angle) the contrast sensitivity function is multiplied by another function which models the drop-off of sensitivity in the visual periphery. This function is termed the cortical magniﬁcation factor. It is not radially symmetric, but drops off faster vertically than horizontally. It can be approximated [Rovamo and Virsu, 1979] with separate formulas for decrease in sensitivity in four areas. For simplicity a bound from the most sensitive area can be used in estimating visibility [Reddy, 2001, Reddy, 1997].
M(e) = \frac{1}{1 + 0.29e + 0.000012e^{3}}
The cubic term can usually be ignored, as its contribution in the range of eccentricities normal in a screen display is negligible [Reddy, 1997]. The peripheral contrast sensitivity is then M(e) · A(f).
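Putting the pieces together, the sensitivity to a feature of frequency f viewed at eccentricity e can be sketched as below, using the Mannos-Sakrison fit discussed above for A(f). Converting a pixel distance from the fixation point into degrees requires a pixels-per-degree factor (ppd), which depends on screen size, resolution and viewing distance; the small-angle conversion and all function names here are illustrative assumptions.

```python
import math

def cortical_magnification(e):
    """Rovamo-Virsu style drop-off with eccentricity e (degrees),
    using the bound from the most sensitive area, as in the text."""
    return 1.0 / (1.0 + 0.29 * e + 0.000012 * e ** 3)

def eccentricity_deg(px, py, fx, fy, ppd):
    """Small-angle approximation of the angular distance (degrees)
    between pixel (px, py) and fixation (fx, fy); ppd is
    pixels-per-degree for the display setup (an input assumption)."""
    return math.hypot(px - fx, py - fy) / ppd

def sensitivity(f, e, s_max=400.0):
    """Peripheral contrast sensitivity M(e) * A(f), with A(f) the
    Mannos-Sakrison fit."""
    a = s_max * 2.6 * (0.0192 + 0.114 * f) * math.exp(-(0.114 * f) ** 1.1)
    return cortical_magnification(e) * a
```

At the fixation point M(0) = 1 and the model reduces to the central sensitivity curve; sensitivity then falls off smoothly into the periphery, as in Figure 3.4.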
Applicability to Natural Imagery
Some caution is necessary in applying these models derived from simple monochromatic repeating patterns to complex natural imagery. Though these models have been applied with good results in graphics [Reddy, 2001], our goal of creating visible abstraction rather than conservative level of detail is more ambitious, and more likely to stress the models involved.
Figure 3.4: Cortical Magniﬁcation describes the drop-off of visual sensitivity with angular distance from the visual center.
How to measure contrast is relatively obvious in gratings: there are only two extrema, and a single contrast exists for the entire grating. Between two regions in a scene the meaning of contrast is less clear. Regions are neither uniform in color nor uniformly varying. No strong perceptually motivated approach to this problem appears to have been formulated. Lillesaeter attempts to address this by defining a contrast between a nonuniform figure and ground. This contrast measure is a weighted average of the contrast between the region and background and the integral of the contrast along the edge of the region. This is demonstrated to provide more intuitive results than simpler alternatives on regions with flat colors. Issues related to sampling in real images are not addressed. Measuring contrast in a color image presents another problem. Contrast in colored gratings has been studied, and much work has been done in general on color perception. However, there does not appear to be a simple general contrast sensitivity model defined in color space [Regan, 2000]. Adapting a luminance-based model therefore remains a plausible course of action in designing a model for a practical application. Applying the notion of visibility for a grating to a non-repeating pattern of regions
also presents problems. The hump-like shape of the contrast sensitivity curve tells us something counterintuitive if the size of an area is treated as proportional to an inverse frequency [Reddy, 2001]: very low frequencies are much less visible than some higher ones at a given contrast. This is because the detectability of a grating is related to the density of receptive fields of corresponding size, and there are upper bounds on the size of human receptive fields. Intuitively, a large slowly varying sine wave may be difficult to see. This has been less of a concern in previous work, where judgments were being made mostly about high frequency parts of the curve [Reddy, 2001], but will be noticeable when visibly abstracting images. It can be argued [Reddy, 1997] that natural images, at least in places (and certainly the uniform color regions that we will ultimately use in rendering), more closely resemble square-wave rather than sine gratings. Since a square wave can be approximated by the sum of an infinite sequence of sine waves, and sensitivity to combined sinusoidal patterns is closely related to that of the independent components [Campbell and Robson, 1968], one might expect the visibility of low frequency square waves to be higher than that of equal frequency sine waves. The actual relation has been studied empirically [Campbell and Robson, 1968], and the results confirm this intuition: for square waves at frequencies below about 1 cycle/degree, sensitivity levels off rather than dropping. A theoretical derivation of the difference is presented in [Campbell and Robson, 1968]. It matches some but not all features of the empirical data. These concerns remind us that when applying these models to real images they cannot serve as an accurate absolute perceptual measure of visibility. Rather, they provide a plausible relative sense of the visibility of different features. The absolute contrast or acuity threshold at which a feature becomes visible is not necessary for our application.
What is important is the relative ordering of feature visibility, which allows us to create a prioritization. It is necessary to model visual sensitivity only up to the level where results correspond to our intuitions about this prioritization.
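To make the notion of a relative ordering concrete, consider the classic Mannos-Sakrison analytic fit to the contrast sensitivity curve. This is a stand-in for illustration only, not the specific model used in this work:

```python
import math

def csf(f):
    """Contrast sensitivity as a function of spatial frequency f
    (cycles/degree), using the Mannos-Sakrison analytic fit.
    An illustrative stand-in, not this dissertation's model."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-(0.114 * f) ** 1.1)
```

The hump shape reproduces the counterintuitive behavior noted above: at equal contrast, a very low frequency grating scores as less visible than a mid-frequency one. For prioritization only the ordering of these scores matters, not their absolute values.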
To apply these models in actual scenes, we need to decide on a deﬁnition of the features whose visibility we are judging with these methods. For example, these models have been used in 3D level of detail [Reddy, 1997] to avoid rendering invisible features. In this context the obvious choice of feature is a polygon which may or may not be included in the rendering. For images the choice is less clear, as image properties can be measured in an unstructured, local way or an image can be partitioned into a more structured representation. We review some of the possibilities for image representation in the following chapter.
Chapter 4 Vision and Image Processing
4.1 Image Structure Features and Representation
Figure 4.1: (a) Scale space of a one-dimensional signal. Features disappear through scale space but no new features appear. (b) Plot of inflection points of another one-dimensional signal through scale space. Reproduced from [Witkin, 1983].

Image representation and processing is a large field of relevance in both human and computer vision. We concentrate on some basic concepts relevant to the task of simplifying images. Scale space theory provides a way of characterizing the different scales of information present in an image and making correspondences between features at different scales. Segmentation divides an image into distinct regions, enabling an explicit, non-local representation of image content. Edge detection provides a measure of the prominent boundaries in an image. An important unifying concept in image analysis is that the same image data can be represented in many forms. In each of these, certain information in the image is explicit and other information is less easy to access [Marr, 1982]. The appropriate information and representation are task-dependent. A variety of representations with different properties are available. With the exception of 3D techniques, NPR applications have largely used low-level representations, often functioning locally on the original image itself. However, human artistic processes operate on richer representations. Ruskin,
one of the 19th century's most prominent art historians and theorists, famously argued that in teaching art technique the most important lesson was teaching the student to see [Ruskin, 1858]. There seems to be an assumption in image-based NPR that seeing is simply capturing a bitmap representation of the scene, and that it can be considered accomplished in the presence of a source photo. Human vision, however, is much more than simply capturing an image. If a computer is to produce artistic renderings that capture some of the expressiveness of real art, especially in highly abstracted styles, some higher-level representation is necessary, analogous to those created in the artist's head as she understands the scene before her and begins to paint. The better suited this representation is to the task, the easier it should be to drastically simplify an image while retaining its important features. The lowest-level representation is the image itself, analogous to the retinal image. This is the starting point of any further representation, making explicit the light intensities at each pixel. There is structure here that can be more explicitly represented in other ways. Information in the image exists over a variety of scales: small and large features, making up parts and whole objects in the scene. One common way to come to terms with the multiple scales of information in an image is through its scale space. From a single image, a three-dimensional stack of images is generated in which each contains progressively coarser scale information. Again, this representation has an analogue in human vision, where neurons have receptive fields of different sizes, in effect generating a multi-scale representation from the retinal image. Scale space has come to refer to such a space of increasingly simple images generated by a range of processes. Generically, this can be thought of as a stack of images containing less and less information at each level as scale increases.
This stack is continuous in theory; in practice it is sampled at some discrete interval. Starting with the original image, detail is progressively lost until a uniform color is all that remains (see Figure 4.1).
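Such a stack can be sketched for a one-dimensional signal by repeated Gaussian blurring, one of the constructions discussed below. The kernel truncation and replicated-border handling here are illustrative choices:

```python
import math

def gaussian_kernel(sigma):
    # normalized 1D Gaussian, truncated at three standard deviations
    r = max(1, int(3 * sigma))
    k = [math.exp(-x * x / (2.0 * sigma * sigma)) for x in range(-r, r + 1)]
    s = sum(k)
    return [v / s for v in k]

def blur(signal, sigma):
    # convolution with replicated (clamped) borders
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    n = len(signal)
    return [sum(w * signal[min(max(i + j - r, 0), n - 1)]
                for j, w in enumerate(k))
            for i in range(n)]

def count_maxima(signal):
    # strict local maxima, a simple stand-in for "features"
    return sum(1 for i in range(1, len(signal) - 1)
               if signal[i - 1] < signal[i] > signal[i + 1])

# a scale-space "stack": the original signal plus increasingly blurred copies
signal = [0, 1, 0, 1, 0, 1, 0, 1, 0]
stack = [signal] + [blur(signal, s) for s in (0.5, 1.0, 2.0)]
```

As the theory predicts for one dimension, local maxima can merge and disappear as scale increases, but no new ones appear.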
A number of constructions for such a space have been developed. Perhaps the simplest approach creates something like an image pyramid, successively downsampling the image so it is more coarsely pixelated. This approach has a problem: detailed, high-frequency information (the edges between the new larger pixels) may be introduced that was not in the original image. This is the problem of spurious resolution [Koenderink, 1984]. New information has been hallucinated into existence by imposing a coarser grid structure on the data. Convolution with a Gaussian kernel (blurring) generates a space that avoids this problem [Witkin, 1983, Koenderink, 1984]. In fact, this blurring has been proven [Koenderink, 1984] to be the unique way to generate a scale space which is both uniform or uncommitted (i.e., the process is uniform across image space and through the scale dimension) and avoids spurious resolution. Information disappears but cannot be created. In one dimension, this ensures that any feature will only disappear as scale increases. In two dimensions new features, maxima for example, can appear. However, in both cases clear judgments can be made about what features exist over what range of scales. That the process of blurring is uniform is an advantage in that filtering can be applied to any signal; one does not need a model of what the important features are. A disadvantage is that coarser features are more coarsely located: the blurring process that reveals them distorts their spatial extent. If you know what you are looking for, there is no reason why the blurring operation must be uniform or uncommitted. A number of nonuniform or nonlinear scale spaces have been formulated which do not introduce false content but remove information selectively in certain locations. One of the best known of such methods is anisotropic diffusion [Perona and Malik, 1990].
Here the diffusion process is not uniform but rather inversely related to the magnitude of the gradient at each position. This results in an edge-preserving blurring which removes low-contrast detail while preserving strong edges. This has the advantage that edges are better preserved in their initial location until the point at which they disappear. Niessen et al. compare this and several
other nonlinear methods in the context of segmentation. Nonlinear methods perform well but are significantly more expensive. A practical application must sample the continuous scale space at discrete intervals. One would like to sample finely enough to capture interesting events, the order of disappearance of different features, but no more densely than necessary. Looking at the linear scale space, Koenderink derives an appropriate sampling as logarithmic along the scale axis, corresponding to a uniform sampling in the scale parameter t, the standard deviation of the Gaussian kernel used in blurring. This is intuitive: at small scales many tiny regions are merging quite often, requiring dense sampling. At larger scales there are fewer regions, fewer events to capture, and much less dense sampling in t is required. The issue is the same for nonlinear spaces. Relating scales in different spaces is not straightforward; some attempt at doing this has been made in [Niessen, 1997].
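A one-dimensional sketch of Perona-Malik diffusion illustrates the edge-preserving behavior; the conduction function and all parameter values below are illustrative:

```python
import math

def diffuse_step(u, kappa=0.2, dt=0.2):
    """One explicit step of Perona-Malik diffusion on a 1D signal.
    The conduction coefficient exp(-(g/kappa)^2) is near zero across
    strong gradients, so high-contrast edges diffuse very slowly while
    low-contrast detail is smoothed away. Illustrative parameters."""
    out = u[:]
    for i in range(1, len(u) - 1):
        gl = u[i - 1] - u[i]               # gradient toward the left
        gr = u[i + 1] - u[i]               # gradient toward the right
        cl = math.exp(-(gl / kappa) ** 2)  # conduction to the left
        cr = math.exp(-(gr / kappa) ** 2)  # conduction to the right
        out[i] = u[i] + dt * (cl * gl + cr * gr)
    return out

# a high-contrast step edge plus a small low-contrast blip on the flat side
u = [0.0] * 8 + [1.0] * 8
u[3] = 0.1
for _ in range(20):
    u = diffuse_step(u)
```

After twenty steps the low-contrast blip has been smoothed into its surroundings, while the high-contrast step remains essentially untouched.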
Figure 4.2: Interval tree for a 1D signal, illustrating decomposition of the signal into a hierarchy. Reproduced from [Witkin, 1983].

While a scale space such as this begins to capture structural relations of features across scales, this is still largely an implicit representation. To make it explicit, features at different scales need to be directly related to each other. Witkin
addresses this problem in 1D signals. In the scale space of a one-dimensional signal, features will never appear at coarse scales. So, any feature found at a coarse scale can (if the sampling is dense enough) be traced directly back to its fine-scale origin. This allows localization of features found at a coarse level. Witkin demonstrates this by choosing as features the zero crossings of the second derivative, that is, inflection points in the signal (Figure 4.1). Similarly, using these correspondences across scale it is also possible to create a structure that captures the relationship between all features at all scales. Intervals between two zero crossings (which again correspond to sections of the signal between two inflection points) disappear in only one way: two successive zero crossings merge together, with the result that three intervals, the one between the crossings and those on either side, merge into one. These three intervals can be made children of the resulting interval to create an interval tree which characterizes the structure of the signal at all scales. Witkin observes that the intervals which persist longer through scale space appear to be those identified by human observers as subjectively salient or important in the signal. Extending this nice analytical derivation to a practical application in 2D is not trivial. In 2D, features such as maxima, or curves defined by inflection points, may split in two at coarser scales. Koenderink suggests the use of equiluminance curves in the image as a 2D equivalent of Witkin's intervals. Generic equiluminance curves form a single closed curve. There are two kinds of singularities: extrema, where the curve is just a point, and saddle points, where the curve forms multiple loops which intersect at one point. Each loop may contain other saddle points and must contain at least one extremum [Koenderink and van Doorn, 1979]. The nesting of these saddle points gives the structure of the image regions.
Though new saddle points may appear inside a loop, centermost saddle points must disappear before outer ones. Because of this the saddle points present at all scales can be represented as a tree. Such a structure is difﬁcult to calculate in practice. It is not obvious how to ﬁnd these saddle points efﬁciently or if
they provide a subjectively intuitive partitioning of the image. In addition, it is not clear how color could be handled. In a naive approach, each band would produce its own surface with its own saddle points, resulting in three separate scale space trees that would need to be unified in some way.
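Returning to the tractable one-dimensional case, the core of Witkin's observation, that inflection points can vanish but should not appear as scale increases, can be sketched as follows. This smooths at two scales and compares inflection counts; it is not the full interval-tree construction, and the signal and scales are illustrative:

```python
import math

def smooth(sig, sigma):
    # Gaussian smoothing with replicated borders
    r = max(1, int(3 * sigma))
    k = [math.exp(-x * x / (2.0 * sigma * sigma)) for x in range(-r, r + 1)]
    s = sum(k)
    k = [v / s for v in k]
    n = len(sig)
    return [sum(w * sig[min(max(i + j - r, 0), n - 1)]
                for j, w in enumerate(k)) for i in range(n)]

def inflections(sig):
    # indices where the second difference changes sign (inflection points)
    d2 = [sig[i - 1] - 2 * sig[i] + sig[i + 1] for i in range(1, len(sig) - 1)]
    return [i for i in range(1, len(d2)) if d2[i - 1] * d2[i] < 0]

# a narrow bump and a wide bump
sig = [math.exp(-(i - 10) ** 2 / 4.0) + math.exp(-(i - 30) ** 2 / 50.0)
       for i in range(45)]
fine = inflections(smooth(sig, 1.0))
coarse = inflections(smooth(sig, 3.0))
```

Each coarse-scale crossing could then be traced back through intermediate scales to its fine-scale position, which is the basis of the interval tree.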
The process described above of dividing up a signal based on the intervals between features is a particular approach to the general problem of segmentation. This problem again occurs in both computer and human vision. Segmentation makes explicit the association (or disassociation) between different areas of an image. It produces an explicit representation of parts of the image that are associated with each other, assigning each pixel to one, usually connected, group or region. These regions should be uniform by some measure; separate regions, at least the adjoining ones, should be markedly different. How people do this, parsing shapes and objects from the background, is only partially understood. In computer vision, a tremendous variety of methods have been devised to define similarity measures for this using color, gray-scale intensity, texture, etc. This segmentation is usually a partitioning of an image at a single scale. However, it is sometimes desirable to define a segmentation over a range of scales. Scale space has been considered in segmentation; it is typically used to make segmentations produced with other methods more robust. Niessen et al. link pixels with neighbors that have similar color in both the spatial and scale dimensions to create a hierarchy. The end product is a single flat segmentation taking its set of regions from a coarse scale and their spatial extent from a fine scale. A similar approach is taken by Bangham et al., where the desire is to create a hierarchical segmentation tree that describes the image over a variety of scales. An alternate approach [Ahuja, 1996] creates a multi-scale representation without explicitly generating a scale space.
Each of these methods computes a hierarchical representation of image structure. However, there is no clear relation between these hierarchies and the theoretical hierarchy induced by scale space. This is not a major concern; scale space structure is attractive because of its simple formal definition, but it is not the single correct answer in any meaningful sense for a given practical application. Hierarchical representations are not general purpose; desirable properties depend on the application. For the purposes of image abstraction, an important question is whether each subtree in the structure represents some coherent area or region. This is guaranteed in some geometric sense by scale space proper, since nodes occur in the tree only when features disappear. In contrast, methods that build a hierarchy by iteratively merging regions may have many intermediate nodes that consist of fragments of regions that have not yet all merged together. Such hierarchical representations are harder to use directly for tasks that require meaningful regions, like image abstraction. Scale space captures the structure of intensities in the image, not the structure of what is pictured. In some cases these may correspond closely (e.g., eyes, nose, and mouth in a head), but this is not necessarily the case. A hierarchy that corresponds to actual objects in the image would allow abstraction on an object-by-object basis. This is a difficult problem. Some attempts have been made to rearrange subtrees based on color (so, for example, a hole in an object would be associated with the background, not the object) [Bangham et al., 1998]. Any general solution would need to draw on analogues of high-level vision tasks that do not currently exist.
Edges are another important image feature. A region is a uniform area; an edge is the boundary or discontinuity between uniform regions. Edges are important in human vision; they are one of the low-level features built into the visual system. Edges are commonly detected using derivative filters. Discontinuities produce a filter response
that can be processed to extract edges as chains of positions. Like regions, edges exist at a number of scales. Edges at different scales are usually detected by using derivative filters of different widths or, equivalently, by first convolving the image with a blurring kernel. A popular procedure for performing the detection and thresholding is the Canny edge detector [Shapiro and Stockman, 2001]. An interesting modification of this procedure adds a measure of how closely the local image resembles an edge model [Meer and Georgescu, 2001]. This allows faint edges to be captured while false-alarm responses from texture are suppressed. As with regions, larger-scale features are detected with less spatial resolution. Nonlinear anisotropic diffusion [Perona and Malik, 1990] was originally proposed as a better method of blurring images for coarse-scale edge detection, since it removes fine-scale detail while better preserving the shape and position of high-contrast, low-frequency edges. Edges and regions are related to each other: a region identifies a homogeneous area, while an edge indicates a break between two areas. Since edges and regions are closely related but their identification usually draws on different measures, each feature can be used to improve results from the other [Christoudias et al., 2002]. These two features, regions and edges, provide a fairly complete representation of image content, one that knowledge of human vision suggests is biologically important. They are the building blocks on which our rendering techniques will be built.
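A minimal sketch of derivative-filter edge detection, with central differences standing in for the filters of a full detector such as Canny's; the test image and threshold are illustrative:

```python
def gradient_magnitude(img):
    # central-difference derivative filters; border pixels are left at zero
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag

def edge_pixels(img, threshold=0.25):
    # thresholding the filter response yields candidate edge positions,
    # which a full detector would then thin and link into chains
    mag = gradient_magnitude(img)
    return [(x, y) for y in range(len(img)) for x in range(len(img[0]))
            if mag[y][x] > threshold]

# a vertical step edge between a dark and a bright region
img = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
edges = edge_pixels(img)
```

The response is confined to the discontinuity; flat regions produce none.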
Chapter 5 Our Approach
The ideas and techniques above suggest a particular path for achieving minimally interactive abstraction in computer renderings. Eye tracking can serve as a bridge between the computer and a user's interests and intentions. As the evidence discussed above suggests, the locations of a viewer's fixations can reasonably be interpreted as identifying content that is important to that viewer. Preserving detail where the viewer looked and removing it elsewhere should create a sensible NPR image that captures what the viewer found important. The nature of abstraction in art suggests this is worthwhile because it may help future viewers see the main point of the image more easily. It may even be able to steer successive viewers into the same patterns of viewing as the original viewer, encouraging a particular interpretation of the image.
Eye Tracking as Interaction
Eye tracking is used in our work as a minimal form of interaction to answer the question of what is important in an image. Because eye movements are informed by semantic features and fixations are economically placed, they provide a guide to important features that require preservation. Because they are natural and effortless, they are a desirable modality for interaction. Abstracting an image becomes as simple as looking at the image, an action that requires no conscious effort. Interaction in all of our systems proceeds in the same way. A viewer simply looks at a photograph for a set period of time, usually around five seconds, while their eye movements are monitored by an eye tracker. The recording is then processed to identify fixations and discard saccades. The fixations found are taken to represent important features, and detail in these locations will be preserved.
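The fixation-identification step can be sketched with a standard dispersion-threshold procedure (often called I-DT); the dispersion and duration thresholds below are illustrative, not the values used by our system:

```python
def dispersion(pts):
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def find_fixations(samples, dt, max_disp=25.0, min_dur=0.10):
    """Identify fixations in a gaze recording sampled every dt seconds.
    A window lasting at least min_dur whose x+y dispersion stays under
    max_disp (pixels) is grown as far as possible and reported as a
    fixation (centroid x, centroid y, duration); remaining samples are
    treated as saccades and discarded. Thresholds are illustrative."""
    fixations = []
    min_len = max(1, int(round(min_dur / dt)))
    i = 0
    while i + min_len <= len(samples):
        if dispersion(samples[i:i + min_len]) <= max_disp:
            j = i + min_len
            while j < len(samples) and dispersion(samples[i:j + 1]) <= max_disp:
                j += 1
            pts = samples[i:j]
            cx = sum(p[0] for p in pts) / len(pts)
            cy = sum(p[1] for p in pts) / len(pts)
            fixations.append((cx, cy, len(pts) * dt))
            i = j
        else:
            i += 1
    return fixations
```

On a recording with two stable clusters separated by a brief excursion, the excursion is treated as a saccade and discarded.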
An apparent contradiction is worth noting here. There seem to be two possibilities concerning this interaction. Either the viewer will look at distracting image features, and these will be included in the rendering, making fixation data useless for determining importance; or the viewer will not look at supposedly distracting elements, in which case removing the information seems to be of negligible value. This is related to the basic question of whether abstracting images is itself worthwhile, or whether we should just let people do the abstraction in their heads. There are several responses to this. First, there is a great deal of content, like texturing on a wall or detail in grass, that is not directly examined but is certainly visible. This kind of information will be removed. Artists certainly seem to manipulate this kind of detail in a purposeful way, suggesting it is worth doing. In a particular style, the prominence of small features could be emphasized inappropriately without this kind of omission: an ink drawing of a field of grass with each leaf depicted in silhouette would be distracting. Second, even on occasions when the eyes fixate low-level visual distracters, regions of high contrast or brightness which say nothing about the image's meaning, these fixations can be expected to be shorter in duration [Underwood and Radach, 1998]. So we can still hope to remove such distracters despite their being looked at, and avoid the distraction for future viewers. This suggests it will be important to have a model of attention that takes into account the length of a fixation. Considering fixation duration allows brief fixations to be discounted as distraction, while giving more weight to longer fixations, at which it can be assumed more processing occurs. The simple model we use for this is shown in Figure 5.1 (b).
This model is a piecewise-linear function in which fixations below a certain duration have no effect, with weight ramping relatively quickly up to a maximum for fixations of at least a particular duration. This is a very simple model considering the complexity of visual cognition, and more sophisticated attention models may be useful; various additional information, such as the time course and grouping of fixations, might be exploited.
Figure 5.1: (a) Computing eccentricities with respect to a particular fixation at p. (b) A simple attention model, defined as a piecewise-linear function, for determining the scaling factor ai for fixation fi based on its duration ti. Very brief fixations (below tmin) are ignored, with a ramp up to a maximum level of amax (reached at tmax).
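The piecewise-linear attention model of Figure 5.1 (b) can be sketched directly; the breakpoint values for tmin, tmax, and amax below are illustrative, not those used by our system:

```python
def attention_weight(t, t_min=0.1, t_max=0.4, a_max=1.0):
    """Piecewise-linear attention model: a fixation of duration t
    (seconds) below t_min gets zero weight, the weight ramps linearly
    between t_min and t_max, and saturates at a_max thereafter.
    Breakpoint values are illustrative."""
    if t <= t_min:
        return 0.0
    if t >= t_max:
        return a_max
    return a_max * (t - t_min) / (t_max - t_min)
```

Very brief fixations are thereby discounted as likely distraction, while long fixations receive full weight.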
Using Visibility for Abstraction
To apply fixation data to simplification, we need to link fixations to individual decisions about what content to include. The models of visual sensitivity discussed above allow us to decide what features are visible. However, a difficulty exists when applying these models to image simplification. These models all define a boundary between perceptible and imperceptible features. In our application, the goal is to remove features in a way that is perceptible but makes sense to the viewer and preserves meaning. On a well-positioned monitor, nearly everything should be visible, down to nearly pixel resolution, in the course of a brief viewing, so an accurate model could tell us to include everything. In order to accomplish abstraction, the hope is that these models can be scaled back by some global constant representative of a particular amount of abstraction. This allows us to remove visible detail in a way that still reflects the relative perceptibility of features given a particular viewer's eye movements. By making our algorithm interpret the viewer as having, in a sense, worse and worse eyesight, we can force it to remove visible information in a way that is still perceptually motivated. To the best of our knowledge this is the first work to attempt to formalize abstraction in this manner.
This constant can be seen either as a separate scaling factor indicating degree of abstraction, or as part of the attention model mentioned above. When folded into the attention model of Figure 5.1 (b) as amax, it can be seen as representing a certain background amount of detail that is not interesting even when a location is fixated. In general, a framework for accomplishing abstraction using this methodology involves three main choices: first, a system of image analysis to represent the image content in a meaningful way; second, a method of indicating which of this content is actually important; and third, a style in which to render the important content. In successive chapters we present several systems that share the same basic interactive technique of eye tracking described above but vary in the particulars of these decisions.
Chapter 6 Painterly Rendering
Our ﬁrst system for creating abstracted NPR images does so in a painterly style. Painterly rendering creates an image from a succession of individual strokes, mimicking the methodology of a human painter. The intuition behind our system is that in most painting styles, painters use fewer and larger strokes to model secondary subject matter such as background ﬁgures or scenery. This system is relatively simple, and served as an initial proof of concept for our interactive technique. The model of image content is simple and unstructured, the perceptual model equally minimal.
Our approach to painterly rendering [Santella and DeCarlo, 2002] has no explicit model of visual form; it uses a simple scale space representation. This is possible because painterly rendering is a somewhat forgiving style in which imprecision in stroke placement typically is not distracting. Our approach follows an existing methodology [Hertzmann, 1998], in which curved strokes of different widths are painted with colors taken from the appropriate level of a scale space of blurred versions of the original image. Our only model of image content is this scale space of images and the corresponding image gradient at each scale. This information is used in a standard algorithm to generate candidate strokes. These strokes are the features to which we apply our perceptual model. The features we consider are therefore not image features properly speaking, but strokes that exist only in the rendering, not in the original image.
Applying the Limits of Vision
Given a choice of feature, we now need to make judgments about what features to include. To do this we need to pick a model of visual sensitivity and decide how to apply it to our system. The simplest model we could use would be an acuity model that modulates brush size. This corresponds to considering each brush stroke in isolation as a mark of maximal contrast with its background. Such a model is a fairly large oversimplification, but provides an interesting starting point. Simple acuity models like this have been used in graphics before [Reddy, 1997]. In that work, Reddy fit a function to the threshold frequencies provided by a contrast sensitivity model for maximal contrast features at varying velocities and eccentricities. This provides an acuity model that is simple to apply: it takes an input speed and eccentricity and outputs a threshold frequency. Though simple to apply and based on psychophysical data, this model is not useful for our purposes. Because the model is crafted to be highly conservative, a fairly large central region, 5.79 degrees in width, is assigned maximal acuity. A conservative estimate like this may be desirable for imperceptible LOD; where the goal is to remove unattended information, it is overly conservative. The circular region where detail is maximal can be highly visible and distracting in an abstracted rendering. We would prefer a function closer in shape to the actual drop-off in visual sensitivity. A similar model that provides more intuitive results is equally simple. The maximum frequency humanly visible is assigned to just the center of the visual field (G = 60 cycles/degree). From there sensitivity drops off as a function of the standard cortical magnification factor, equation (3.3). This produces a continuous degradation of detail from the center of vision purely on a frequency basis. This value is scaled by the simple attention model described in Section 5.1 to produce a final frequency threshold.
Each fixation defines a potentially different threshold at each point in the image. The highest threshold (usually corresponding to the closest fixation) is used as the actual
threshold at point p, termed fmax(p). We model a brush stroke of width D as a half cycle of a grating and compare the resulting frequency f = 1/(2D) to the cutoff provided by our perceptual model. The stroke is included if it is lower in frequency than the cutoff.
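The inclusion test can be sketched as follows. The pixels-per-degree factor, the inverse-linear eccentricity drop-off (a stand-in for equation (3.3), which is not reproduced here), and the fixed attention weight are all illustrative assumptions:

```python
import math

G = 60.0            # peak visible frequency at the center of gaze (cycles/degree)
PX_PER_DEG = 40.0   # assumed viewing geometry (illustrative)

def frequency_threshold(px, py, fixations, c=0.3):
    """fmax(p): the highest cutoff frequency induced at pixel (px, py)
    by any fixation (x, y, a), where a is its attention-model weight.
    Sensitivity falls with eccentricity e as 1/(1 + c*e), an
    inverse-linear stand-in for the cortical magnification drop-off."""
    best = 0.0
    for (fx, fy, a) in fixations:
        e = math.hypot(px - fx, py - fy) / PX_PER_DEG  # eccentricity in degrees
        best = max(best, a * G / (1.0 + c * e))
    return best

def include_stroke(width_px, px, py, fixations):
    # a stroke of width D is treated as half a cycle of a grating,
    # giving frequency f = 1/(2D), converted here to cycles/degree
    f = PX_PER_DEG / (2.0 * width_px)
    return f <= frequency_threshold(px, py, fixations)

fix = [(200.0, 200.0, 1.0)]  # one fixation at full attention weight
```

Fine strokes survive only near fixations, while coarse strokes survive everywhere.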
To produce candidate brush strokes, our approach uses a standard algorithm [Hertzmann, 1998] to approximate the underlying image with constant-color spline strokes that generally run perpendicular to the image gradient. These strokes originate at points on a randomly jittered grid. An extended stroke is created by successively lengthening the stroke by a set step size in the direction perpendicular to the image gradient. When the color difference between the start and end points of the stroke crosses a threshold, or the stroke reaches a maximal length, the stroke terminates (see [Hertzmann, 1998] for further details). Our method varies in a few particulars. When strokes are created by moving perpendicular to the image gradient, they can be excessively curved and worm-like in appearance. Hertzmann's strokes are B-splines, which can meander to a sometimes excessive extent. Real paint strokes do not tend to do this. Even in paintings by artists one thinks of as using very salient curving strokes, van Gogh for example, compound curves are made up of multiple gently curving strokes. In response, our maximal stroke length is shorter and we use a single Bezier curve for each stroke, using a subset of the calculated control points. This produces somewhat more natural, simple curves; however, even these can curve more sharply than is normal with real paint strokes. Strokes are painted from coarse to fine, with the entire image being covered by the coarsest strokes used. For finer scale strokes, a choice is made at each stroke origin point whether a stroke of that size is necessary, based on the perceptual model. Only
Figure 6.1: Painterly rendering results. The ﬁrst column shows the ﬁxations made by a viewer. Circles are ﬁxations, size is proportional to duration, the bar at the lower left is the diameter that corresponds to one second. The second column illustrates the painterly renderings built based on that ﬁxation data.
necessary strokes are generated. In addition to removing detail, artists also use color as a vehicle for abstraction. Vibrant colors and high contrast can enhance the importance of a feature or make it easier to see; muted color and contrast can de-emphasize unimportant items. Stroke colors can be adjusted to achieve this [Haeberli, 1990]. Our perceptual model provides a means of deciding where to make these adjustments. For instance, lowering the contrast in unviewed regions makes them even less noticeable; raising it emphasizes viewed objects. Since color contrast is not well understood [Regan, 2000], we use a simple approach to adjust colors. Though where we apply these manipulations is controlled by our perceptual model, their extent was simply chosen by experimentation. We start by defining a function of location u(p) ranging from 0 (where the user did not look at point p) to 1 (where the user fixated p for a sufficiently long period of time). This is defined as the ratio between the perceptual threshold at point p and the maximal threshold possible:

u(p) = fmax(p) / (amax G)     (6.1)
We then adjust color locally for each stroke based on this function.

• Contrast enhancement: Contrast is enhanced by extrapolating from a blurred version of the image at p out beyond the original pixel value. The amount of extrapolation changes linearly with u(p), being cmin when u = 0 and cmax when u = 1 (a cmin and cmax of 1 would produce no change). cmin and cmax are global style parameters for controlling the type of contrast change. For example, choosing [cmin, cmax] to be [0, 2] raises contrast where the user looked, and lowers contrast where they didn't. (Default: [cmin, cmax] = [0, 2].)

• Saturation enhancement: Colors can also be enhanced; colors are intensified in important regions and de-saturated in background areas. The transformation
proceeds the same as with contrast, now specified using [smin, smax], and extrapolating between the original pixel value and its corresponding luminance value. As an example, choosing [smin, smax] to be [0, 1] just desaturates the unattended portions of the image. (Default: [smin, smax] = [0, 1.2].)
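Both adjustments can be sketched per stroke as follows. Color triples are RGB in [0, 1]; the stroke color, local blurred color, and luminance inputs are hypothetical values, and the defaults mirror the style parameters above:

```python
def adjust_color(color, blurred, lum, u,
                 c_range=(0.0, 2.0), s_range=(0.0, 1.2)):
    """Contrast and saturation manipulation driven by u(p) in [0, 1].
    Contrast extrapolates from the local blurred color `blurred`
    through the stroke color by a factor interpolated in c_range;
    saturation then extrapolates from the gray value `lum` by a
    factor interpolated in s_range. A factor of 1 leaves the color
    unchanged; factors above 1 exaggerate it."""
    c = c_range[0] + u * (c_range[1] - c_range[0])
    s = s_range[0] + u * (s_range[1] - s_range[0])
    out = []
    for ch, b in zip(color, blurred):
        v = b + c * (ch - b)      # contrast: push away from the blurred color
        v = lum + s * (v - lum)   # saturation: push away from gray
        out.append(min(1.0, max(0.0, v)))
    return tuple(out)
```

With u = 0 a stroke collapses toward a muted gray of the blurred surround; with u = 1 its contrast and saturation are exaggerated.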
Figure 6.2: Detail in background adjacent to important features can be inappropriately emphasized. The main subject has a halo of detailed shutter slats.
Results from this technique capture some of the abstraction present in paintings (see Figure 6.1). Focal objects are emphasized with tighter rendering, intense color, and contrast. Background features are accordingly de-emphasized. All this is done with virtually no effort on the part of the user. In contrast, hand painting strokes, or even painting a detail map to control stroke size, would require greater effort. This painterly rendering framework has some limitations. Of course, neither the placement nor the individual appearance of paint strokes seriously mimics real paint. Realistic paint strokes were never a goal here; how to accomplish this to various degrees of approximation is fairly well understood. Placement of strokes is a more interesting shortcoming of this approach. Using few strokes is a major part of painterly abstraction. In our renderings, despite throwing out small strokes in most places, too many
Figure 6.3: Sampling strokes from an anisotropic scale space avoids giving the image an overall blurred look, but produces a somewhat jagged look in background areas.
Figure 6.4: Color and contrast manipulation. Side by side comparison of rendering with and without color and contrast manipulation (precise stroke placement varies between the two images due to randomness).
strokes of too small a size are used to approximate any given part of the image. Other painterly rendering methods exist that could be used to place strokes more carefully while retaining our underlying methods for choosing detail levels [Shiraishi and Yamaguchi, 2000]. Aside from the limitations of the rendering techniques we've appropriated, our approach to abstraction has some inherent limitations of its own. Since there is no explicit model of image structure, detail can only be modulated in a continuously varying way across the image. Stroke placement is designed to respect edges; however, the size of
strokes does not. Detail spreads from important locations to neighboring areas, creating distracting haloing artifacts (see Figure 6.2). In addition, sampling coarse stroke colors from blurred versions of the image tends to give results an overall blurry look, especially when downsampled. An artist wouldn't actually blur colors and shapes in this way, but would preserve more of a region's original appearance while removing detail. Abstraction in paintings, even when objects are rendered indistinctly, is not just blurring. High frequency information is not blurred out. Rather, small elements are removed completely. One way to accomplish this would be to sample strokes from an anisotropic scale space that blurs out detail while still retaining sharp edges. An example of a painterly rendering that does this is shown in Figure 6.3. Stroke detail has been modulated by the same perceptual model. Color has been left unchanged. This image does preserve more of the original image structure. However, because intensities have not been blurred together in the background, imperfections in placement of coarse strokes are emphasized. In background areas strokes appear jagged, their orientations varying excessively. A more global strategy for orienting [Hays and Essa, 2004] and placing [Gooch and Willemsen, 2002] strokes, along with blending, might help overcome these difficulties.
Chapter 7 Colored Drawings
Some of the limitations of a purely local model of image structure are addressed by our second NPR system. This approach utilizes a richer model of image structure to create images in a line drawing style with dark edge strokes and regions of ﬂat color [DeCarlo and Santella, 2002]. This style resembles ink and colored wash drawings or lithograph prints such as those of Toulouse-Lautrec. However, it is something of a minimal style. Images are rendered with the primitives of our new image representation itself. The more structured image representation used here allows us to create a more simpliﬁed visual style. It also allows us to control detail across the image in a discontinuous way. This means we can keep detail from leaking from important objects to unimportant surrounding regions.
The primary image representation underlying this visual style is a hierarchical segmentation that approximates the scale space structure of the image. As noted above, analytical calculation of this structure is problematic. Our choice of approximation was motivated by the desire for reasonable efﬁciency and also by a requirement that regions at each level of the tree should, as much as possible, be reasonable areas to draw. Any of them might wind up in a rendering with some given distribution of detail.
Our approach is to use a robust mean shift segmentation independently at each level of a linear scale space of blurred images (which are downsampled for efﬁciency). We use publicly available code to accomplish this (http://www.caip.rutgers.edu/riul).
Figure 7.1: Slices through several successive levels of a hierarchical segmentation tree generated using our method.
To avoid artifacts from downsampling, a dense image pyramid is created. Each level of the pyramid is smaller than its predecessor by a factor of the square root of two. Each level is segmented. Once each of these independent segmentations has been generated, the regions from each separate level are then linked to a parent region chosen from the next coarser segmentation. This is done using a simple method that chooses a parent region with maximal overlap, using color information to disambiguate difficult choices; see [DeCarlo and Santella, 2002]. This is not a particularly robust method of assignment. We are, however, conservative, putting off merging if there is no parent that is a reasonably good match. These ill-matching regions are propagated up to the parent level (implying that the tree is not necessarily complete). Because the segmentations on each level are themselves quite good, difficult situations do not often occur. This approach allows leveraging existing segmentation methods and implementations, and is flexible enough to incorporate alternate segmentation methods or alternate scale spaces, like anisotropic diffusion. Inherently, the way different colored regions corresponding to different objects merge together at very coarse scales is somewhat unpredictable and unstable. An engineering decision was to simply not sample very coarse scales. The coarsest scale segmentation was selected to still contain a fair number of regions. Coarser scales with only a few regions are not usually useful for rendering images. For all images (most at 1024x768 resolution) 9 levels of downsampling were used. All the coarsest scale regions were simply set as children of the tree root. Figure 7.1 illustrates some of the slices through one of these trees. Though good, these segmentations are neither perfect on each level nor in decisions made across scale. Certain features consistently present difficulties. Textures tend to be over-segmented.
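The cross-level linking step described above might be sketched as follows (a simplification: regions are represented as sets of pixel coordinates, and the color test used to disambiguate difficult choices is omitted):

```python
def link_levels(fine_regions, coarse_regions, min_overlap=0.5):
    """Assign each fine-level region to the coarse-level region with
    maximal pixel overlap.  Regions with no reasonably good match are
    propagated up unmerged (returned in `deferred`), so the resulting
    tree is not necessarily complete."""
    links, deferred = {}, []
    for fid, fine in fine_regions.items():
        best, best_ov = None, 0
        for cid, coarse in coarse_regions.items():
            ov = len(fine & coarse)   # pixels shared by the two regions
            if ov > best_ov:
                best, best_ov = cid, ov
        if best is not None and best_ov / len(fine) >= min_overlap:
            links[fid] = best
        else:
            deferred.append(fid)      # defer merging; try the next level up
    return links, deferred
```

The `min_overlap` threshold is our stand-in for the "reasonably good match" test; the published system also consults region color before committing to a parent.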
Smoothly varying regions are broken into a number of patchy, roughly constant color areas. Better segmentation techniques could be incorporated into our method as they materialize. A separate concern is the extent to which an image region hierarchy is appropriate
for abstraction, since image structure does not directly correspond to object structure. Our system makes no attempt to turn a scale space structure into one that represents the structure of actual scene objects [Bangham et al., 1998]. This is a difficult problem, though there are some interesting possibilities for applying image understanding techniques [Torralba et al., 2004] to add a top-down component to the bottom-up image structure generated here. For the time being, we have addressed this problem by creating a simple interface that allows interactive editing of the segmentation tree. A user can take a segmentation tree as a starting point, and manually assign child regions to different parents, or split and merge regions. With a fair amount of effort, this could allow an object hierarchy to be built. More realistically, occasional segmentation errors can be corrected fairly easily. On the whole, however, we have not found this hand editing necessary. It was not used in any of the published results in [DeCarlo and Santella, 2002]. In creating a larger set of 50 renderings for [Santella and DeCarlo, 2004a], it was used to correct a few prominent segmentation errors in a handful of the images. Most segmentation errors that violate object boundaries appear high in the segmentation tree. These incorrect regions are rendered only when the area is highly abstracted. In these abstracted areas violations of object boundaries are usually not particularly distracting.
Our initial implementation tried rendering dark edges using a subset of the edge borders of the regions. This did not on the whole provide sufﬁciently clean and ﬂexible edges. Edges extended beyond where desired. Because only average color for each region was stored, signiﬁcant region boundaries were difﬁcult to distinguish from boundaries due to smooth shading variation. Because of this a separate representation of edges was used. Edges were detected with a robust variant of the Canny edge detector [Meer and Georgescu, 2001]. Edges were detected on only one scale. This presents some limitations but was enough to capture a reasonable set of important edges.
The combination of edges and regions provides a reasonable model of image content. The structure of this model will allow us to abstract images by more intelligently removing detail in a coherent region-based manner.
Our richer model of image structure provides us the opportunity to use richer models of visual perception to interpret eye tracking data in the context of our image. Though useful, a perceptual model based only on frequency like that used in our painterly rendering system is limited. For example, in the output of our painterly system simpliﬁed parts of the image are still painted with quite a large number of strokes. Most of these strokes are quite similar in color. If larger strokes were used in all background areas, those background areas with content would be completely obliterated. As it is, the long uniformly colored strokes introduce detail, exaggerating the local color variation in mostly blank areas (this is akin to the problem of aliasing or spurious resolution). This is in sharp contrast to skilled examples of real painting where coarsely rendered forms of near uniform color are blocked in with just a few very large strokes. The problem is that strokes are selected based purely on size. It would be desirable to remove features based on contrast as well as size. This system continues to use a frequency model like that in [Santella and DeCarlo, 2002] for judging line strokes. Here, frequency is taken as proportional to the length of the stroke, rather than its width. This decision is somewhat arbitrary, though it results in the intuitive behavior of shorter lines being more easily ﬁltered out. Since strokes are rendered in our system with a width proportional to their length, one could look at the perceptual model as measuring the prominence of the feature being drawn rather than the original edge. We can use a contrast sensitivity model to judge the visibility of regions that have
both a size and color. Of the available contrast sensitivity models, Equation (3.2) seems appropriate because it is derived from a variety of experimental data and has been used in computational settings on real images. This is the model we use [Santella and DeCarlo, 2002]. As mentioned earlier, a perceptual model by itself will remove little content. Most of the features we are interested in are visible. Some kind of global scaling down of relative visibility is necessary to create an interesting amount of abstraction. Several possibilities for scaling the model exist. Scaling can be applied to frequency, making smaller regions progressively less visible and removing them from the image, as was done for painterly strokes. Scaling in frequency is fairly intuitive in some respects. Given a desired smallest feature size, it is simple to derive a scaling factor that will include features of this frequency foveally, and degrade size further at larger eccentricities [Santella and DeCarlo, 2002]. Scaling frequency in a contrast sensitivity model is a bit problematic. The contrast sensitivity function has a hump-like shape (see Figure 3.3). Visibility degrades at both extremes of frequency. This produces unintuitive behavior. When scaling up frequencies, a region can become more visible as it becomes smaller, before becoming less visible. Several approaches to solving this are possible. As mentioned above, there is a different pattern of contrast sensitivity for square wave gratings. A square wave model might be more appropriate for our region features since, like a square wave grating, they have a sharp (high frequency) boundary. A simple mathematical model for the square wave contrast sensitivity function does not seem to be available, but an approximation built by analyzing the frequency spectrum of the square wave signal has been derived [Campbell and Robson, 1968]. Using this model is a bit complicated.
An alternative is to simply use the maximal sensitivity for regions larger than the frequency corresponding to the peak sensitivity. This replaces the low frequency slope of the function with a horizontal line at peak sensitivity. This could be considered a ﬁrst level of approximation of the square wave model, which very roughly states that visibility is
largely governed by the most visible frequency component in a square grating. Another possible approach for scaling to remove content is to scale only in the contrast domain. After calculating a contrast threshold value, this can be scaled before being compared to the actual contrast of a region. This behaves in a more intuitive manner, as more scaling always removes more content. From our experimentation, scaling in the contrast domain seems to be the more useful option for scaling region visibility. It corresponds more to our intuitive sense of which detail should be removed first, as a fairly wide range of frequency regions are desired in a final image. Removing large low contrast regions using frequency scaling also removes the few desired high frequency regions. Contrast scaling preferentially removes lower contrast regions, which looks better. This may be due in part to our segmentation method. Because it breaks shading into many patchy regions, there is a desire to get rid of these sometimes large but low contrast shading regions faster than proportionally smaller but higher contrast features. Ultimately, we combined both of these approaches in our final system [DeCarlo and Santella, 2002]. Scaling is applied only to contrast, and region frequencies are capped at a minimum value to keep the visibility of low frequency regions from degrading too much. This is an approximate patch to problems in the applicability of the current contrast sensitivity model. The most perceptually realistic approach is still an open question. Once a contrast threshold has been calculated, the contrast of a particular feature needs to be measured for comparison to that threshold. As mentioned above, contrast models are derived over gratings in which there is one relatively obvious way to measure contrast. Our regions are more complex: they include color and are non-uniform (in the initial image).
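The combined scheme, a low-frequency cap plus contrast-domain scaling, might be sketched as follows. Equation (3.2) is not reproduced here, so a normalized Mannos–Sakrison curve stands in for the actual contrast sensitivity model, and all parameter values are illustrative:

```python
import math

def csf(f):
    """Stand-in contrast sensitivity function (normalized Mannos-Sakrison
    form, peaking near 8 cycles/degree).  The dissertation's Equation (3.2)
    would be used in its place."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-(0.114 * f) ** 1.1)

def is_visible(contrast, freq, scale=0.05, f_min=2.0):
    """Visibility test for a region.  Frequencies below f_min are capped,
    flattening the CSF's low-frequency slope so large regions do not lose
    visibility; `scale` sets the threshold contrast at peak sensitivity,
    and raising it always removes more content (contrast-domain scaling)."""
    f = max(freq, f_min)            # cap: replace low-frequency falloff
    threshold = scale / csf(f)      # scaled contrast threshold
    return contrast >= threshold
```

Note that without the cap, a large low-frequency region can fail the test even at high contrast, which is exactly the unintuitive behavior the cap is meant to patch.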
One approach is to measure contrast between the average color of parent and child regions in the tree. This could be thought of as capturing whether the next branching represents a signiﬁcant change. This works reasonably, but can be susceptible to chance
features in the segmentation tree. For example, two pairs of black and white regions might merge to form two gray regions. These then merge to form a single gray region. The difference between parent and child region colors on this level would be minor. An approach that seems more successful is to use the average of the contrasts between a region’s color and those of adjoining regions in the same level of the tree. In this case it would be possible to take a cue from [Lillestaeter, 1993] and measure contrast between regions as a weighted average of contrast between their average colors and the contrast across their shared edge in the initial image. The correct scale at which to measure contrast across the edge is however unclear. We choose ultimately to measure contrast between a region and adjoining regions on that level using only region color. The contrast for the region is the average of contrast with each of its neighbors weighted by extent of the border they share. Since we are not aware of any simple color contrast framework, we take a simple approach to measuring contrast between individual pairs of region colors. We use a slight variation of the Michelson contrast:
‖c1 − c2‖ / (‖c1‖ + ‖c2‖). Each color exists in the perceptually uniform L*u*v* color space, so the measure provides a steady increase with increasing perceptual differences in color. This formula sensibly reduces to the standard Michelson contrast in monochromatic cases. More sophisticated models of color perception for regions may help the system work better on a broader class of images.
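A sketch of this border-weighted contrast measure (function names are ours; colors are L*u*v* triples, and the Michelson-style norm ratio is our reading of the variation described above):

```python
import math

def color_contrast(c1, c2):
    """Michelson-like contrast between two L*u*v* colors:
    ||c1 - c2|| / (||c1|| + ||c2||).  Reduces to the standard Michelson
    contrast when both colors are achromatic (u* = v* = 0)."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
    mag = (math.sqrt(sum(a * a for a in c1))
           + math.sqrt(sum(b * b for b in c2)))
    return diff / mag if mag else 0.0

def region_contrast(color, neighbors):
    """Contrast of a region: the average of its contrasts with adjoining
    regions on the same tree level, weighted by shared border length.
    `neighbors` is a list of (neighbor_color, border_length) pairs."""
    total = sum(w for _, w in neighbors)
    return sum(w * color_contrast(color, c) for c, w in neighbors) / total
```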
For regions, we choose to emphasize abstraction by removing even more information than was directly speciﬁed by the (scaled) perceptual model. Instead of applying the perceptual model to regions uniformly across the image, we divide the image into foreground regions where the standard perceptual model is used, and background regions where a more aggressive version of the model is used. Unﬁxated background areas are identiﬁed. In them, a high constant eccentricity is used in place of the actual distance
to the nearest fixation, removing more detail in these regions. Foreground regions are regions that appear to have been examined by the viewer. A region is considered examined if a fixation rests in its bounding circle, or if the region is in a subtree that is approximately fovea-sized and centered on a fixation. Foreground regions could be identified by searching the segmentation tree from its leaves upward. In practice, it is simplest to descend the tree, applying the default model to regions that contain fixations. When a subtree that matches a fovea is identified, the standard model is applied throughout this subtree. When a subtree does not contain a fixation, the background model is used. This avoids having to touch the majority of elements in the tree that will not be rendered. Applying the model produces a trimmed tree. The leaves of this smaller tree are the regions that will be rendered into an image. The set of selected regions and lines is then rendered by drawing the regions as areas of flat color and drawing the edges in black on top of them. This is a simple style, and there are relatively few explicit style choices. One important stylistic choice is region smoothing. Before rendering, regions are smoothed by an amount proportional to their size. This removes high frequency detail along the border of large regions, giving the regions a smooth, organic look. Smoothing is needed because the spatial extent of a region is taken from the union of its child leaves at the lowest level of segmentation. Without smoothing, the region border would contain inappropriate, distracting detail. Edges are smoothed by a small constant amount. The resulting misalignment between highly smoothed region boundaries in abstracted areas and the corresponding edges adds to the 'sketchy' look of results. Edges are filtered in several ways based on their length, in order to eliminate clutter from the many fragmentary edges resulting from edge detection.
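The top-down descent that produces the trimmed tree might be sketched like this (a simplification: `refine` stands in for the perceptual visibility test, the fovea-matching check is reduced to the bounding-circle test, and all names and constants are ours):

```python
import math
from dataclasses import dataclass, field

BACKGROUND_ECC = 40.0  # high constant eccentricity for unexamined subtrees

@dataclass
class Node:
    cx: float                      # bounding circle center
    cy: float
    r: float                       # bounding circle radius
    children: list = field(default_factory=list)

def trim(node, fixations, refine):
    """Return the leaves of the trimmed tree.  `refine(node, ecc)` is the
    perceptual test: True if the node's children are still visible at
    eccentricity `ecc` and descent should continue."""
    examined = any(math.hypot(node.cx - fx, node.cy - fy) <= node.r
                   for fx, fy in fixations)
    if examined:
        ecc = min(math.hypot(node.cx - fx, node.cy - fy)
                  for fx, fy in fixations)
    else:
        ecc = BACKGROUND_ECC       # unexamined: aggressive background model
    if node.children and refine(node, ecc):
        return [leaf for c in node.children
                for leaf in trim(c, fixations, refine)]
    return [node]                  # this node becomes a rendered leaf
```

Because subtrees that contain no fixation are cut off at a high constant eccentricity, most of the tree is never visited, matching the efficiency argument above.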
Very short edges of only a few pixels are directly thrown out. Somewhat longer edges (by default, less than 15 pixels in length) are drawn only if they are near the border of a region and are included by the acuity model. Longer edges are compared only against the acuity threshold.
Selected edges are drawn with a width proportional to their length. Edge width interpolates between a minimal and maximal thickness (3 and 10 pixels) for edges between 15 and 500 pixels in length. Thickness is constant outside this range. Edges also taper to points at either end. This is a rough approximation of the appearance of many lines in traditional media illustration. Thickness being proportional to length is a debatable choice. It occasionally produces odd results, but does succeed in adding some variation to line weight, while capturing the sense that long lines are usually more important. Rendering lines tapered at either end serves the additional purpose of disguising the sometimes broken nature of automatically detected edges. Without this, the fact that a single edge is often broken or dotted in many places tends to be highly distracting. Our lines make no attempt to simulate the fine-grained look of real brush or pen-and-ink lines, something that has been done [Gooch and Gooch, 2001] and could be applied if desired.
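The length-based filtering and width rules can be sketched as follows (the minimum length for "very short" edges is our assumption; the other constants come from the text):

```python
def keep_edge(length, near_region_border, passes_acuity,
              min_len=4, short_len=15):
    """Filtering rules: tiny edges are discarded outright; short edges
    survive only if they hug a region border and pass the acuity model;
    longer edges are tested against the acuity threshold alone."""
    if length < min_len:
        return False
    if length < short_len:
        return near_region_border and passes_acuity
    return passes_acuity

def edge_width(length, w_min=3.0, w_max=10.0, l_min=15.0, l_max=500.0):
    """Stroke width grows linearly with edge length between l_min and
    l_max pixels, and is clamped to [w_min, w_max] outside that range."""
    t = (length - l_min) / (l_max - l_min)
    return w_min + (w_max - w_min) * max(0.0, min(1.0, t))
```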
Some results from this system can be seen in Figure 7.2. The visual style of flat colored regions and dark lines is attractive. The abstraction provided by eye tracking seems to succeed in highlighting important areas. How important this abstraction is can be seen by comparing the results with abstraction to those in Figure 7.4. Images with uniformly high detail look excessively busy, while those with uniform low detail appear overly simplified. Abstraction is vital in producing clear images with a distribution of detail that is not distracting. Uniformly detailed images are created by removing the cortical magnification factor from the perceptual model. The constant scaling factor on contrast now provides a single global control for simplification. This model is applied uniformly across the image. Regions are still removed based on contrast sensitivity; low contrast and small
Figure 7.2: Line drawing style results.
(c) Figure 7.3: Stylistic decisions. Lines in isolation (a) are largely uninteresting. Unsmoothed regions (b) can look jagged. Smoothed regions (c) have a somewhat vague and bloated look without the black edges superimposed.
Figure 7.4: Renderings with uniform high and low detail. regions are eliminated first. However, the effect of this uniform simplification is not nearly as successful. The additional scaling factors on the region and edge perceptual models provide detail sliders for the user. These preserve the relative distribution of detail between examined and unexamined locations while reducing or increasing the overall amount of detail. Most images pictured use similar settings for these values. Though there is some variability from image to image in what global amount of detail looks best, a single scaling factor usually looks acceptable on most images. Tweaking for the very best look involves searching only a small area of parameter space.
Figure 7.5: Several derivative styles of the same line drawing transformation. (a) Fully colored, (b) color comic, (c) black and white comic
Results confirm our intuition that regions and edges are complementary features. Both are necessary to create comprehensible images. Each feature in isolation fails to convey the scene. This is illustrated in Figure 7.3. Edges in isolation, in part because of broken outlines, fail to convey the sense of a scene made up of solid objects. Regions in isolation do clearly make up a scene, but without edges their smoothed borders are distracting. As in the ink and colored wash styles commonly used in illustration, dark edges add a kind of definition that is difficult to achieve with color alone. Regions and edges make up our rendering style, but they can also be considered building blocks of many other styles. Figure 7.5 illustrates some trivial derivative styles: a color comic book look created by thresholding dark regions to black, and a black and white comic look created by fully thresholding the image. More interesting styles can also be built from the same building blocks. A natural possibility that we have not attempted is a painterly style taking advantage of this structured image model. Watercolor would be a particularly interesting possibility [Curtis et al., 1997]. Regions could provide areas of color to fill with simulated watercolor, while rather than being explicitly rendered, strong edges could indicate locations where a hard edge, wet in dry technique should be used. We have argued that abstraction is a quality of any visual display designed with the purpose of clear communication. Even depictions usually considered true to life contain similar kinds of abstraction. Photorealist painters do this with subtle manipulations of tone and texture. Photographers composing studio shots do the same thing by manipulating the physical objects present. Graphic artists touching up photographs act similarly, editing out small distracting features.
In the next chapter we present a very simple preliminary attempt to deﬁne a semi-automatic photorealistic abstraction using the same techniques we’ve applied to artistic rendering.
Chapter 8 Photorealistic Abstraction
Our goal is to perform abstraction in a photographic style, removing detail while preserving the sense that an image is an actual photograph. This is a more challenging goal than abstraction in an artistic style. Artistic styles provide a clear statement that an image does not directly reflect reality, and provide a fairly free license to change content. Viewers are much less forgiving of artifacts in an image that claims to be an accurate depiction of reality. The approach we present here is far from a solution to this problem, but it produces some interesting images and suggests that semi-automatic photographic abstraction is possible.
The technique used here is simple. Anisotropic diffusion [Perona and Malik, 1990] is used to smooth away small, light detail while preserving strong edges. Our contribution is using eye tracking data to locally control the amount of simpliﬁcation, allowing for meaningful photorealistic abstraction.
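A minimal Perona–Malik diffusion step for a grayscale image, using the exponential conductance g(d) = exp(−(d/κ)²); this is the textbook scheme, not the dissertation's exact implementation:

```python
import math

def diffuse(img, iters=10, kappa=20.0, lam=0.2):
    """Perona-Malik anisotropic diffusion on a 2D list of intensities.
    Each pixel moves toward its 4-neighbours, weighted by a conductance
    that falls toward zero across strong gradients, so low-contrast
    detail is smoothed away while strong edges are preserved.
    lam <= 0.25 keeps the explicit update stable."""
    h, w = len(img), len(img[0])
    for _ in range(iters):
        out = [row[:] for row in img]
        for y in range(h):
            for x in range(w):
                s = 0.0
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        d = img[ny][nx] - img[y][x]
                        s += math.exp(-(d / kappa) ** 2) * d
                out[y][x] = img[y][x] + lam * s
        img = out
    return img
```

With a small κ relative to an edge's contrast, the conductance across that edge is effectively zero and the edge survives arbitrarily many iterations, which is what makes iteration count a usable local control for abstraction.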
Figure 8.1: Mean shift ﬁltering tends to create images that no longer look like photographs.
Mean shift filtering is an alternative approach to simplification. In theory it should provide greater control of the simplification process. Achieving a photo-like result with it is a bit tricky since small areas quickly converge to a constant color. A high contrast, discontinuous border is then visible between that region and adjoining areas that have converged to a different mode. Though an interesting form of image simplification, mean shift filtered images no longer look like photographs (see Figure 8.1). Though its definition implicitly takes into account edges, anisotropic smoothing is defined on a flat image. In performing abstraction we'd still like to use the structure of the image to avoid artifacts like those seen in our painterly rendering system, where detail leaks from important features into surrounding background. To achieve this, we combine a segmentation with eye tracking data to create a piecewise constant importance map reflecting the interest shown in each image region. This importance image will control the amount of smoothing performed.
Though the attention model we used in the work above took dwell time into account to tell if a fixation was really meaningful, it was not primarily intended to measure the relative importance of different areas of the image. As long as a fairly long fixation was present in an area, it was considered important. Relative importance was largely a function of distance from a fixation. Here, however, we want to capture a finer-grained measure of relative feature importance to control a more delicate process of abstraction. As mentioned above, the length of fixations tells us something about how important a feature is, because fixation durations relate to the time spent processing the content of a location [Just and Carpenter, 1976]. Fixations can vary quite widely in duration. It appears from our initial experiments that the total dwell time in two fixated locations does provide at least a rough measure of their relative importance. With this in mind, our strategy is to create an importance map that is brighter in important areas. This is done by coloring
a segmentation using an estimate of the total amount of time spent fixating different parts of the image. We begin by breaking an image into regions using a flat mean shift segmentation of the image. Using a multiscale segmentation based on our previous work is an interesting possibility, which we have so far not explored. Based on fixations, we assign a weight, which can be considered an importance value or empirical salience, to each region in the image. Conceptually, we wish to count the amount of time spent fixating each region in our segmentation. However, a fixation might not actually rest within the boundary of the region that best represents the feature being examined. No segmentation is perfect, so for example a fixation may rest in a region that represents half of the object examined. In addition, drift and noise in the point of regard as measured by the eye tracker can cause fixations to sit just over a boundary in another object entirely. Because eye trackers are at best accurate to about a degree of visual angle (roughly 25 pixels in our setup), noise is particularly apparent when a small object is fixated. Depending on how small the object is, the corresponding fixation has a good chance of lying within a surrounding background region. Some smoothing of containment to deal with this problem was implicitly built into our previous work because bounding circles were used to calculate intersections between fixations and regions. To explicitly deal with this, we make a soft assignment between each fixation and each region and weight each region by the sum of these values. This smooths the containment of fixations in regions because each fixation contributes to a range of segments near it, rather than simply the one it rests in. To set the contribution of a fixation fi = (xi, yi, ti) to segmented region rj, we compute the average A of the distances between the fixation and all points in the region.
We define a threshold distance T = 175 pixels (more generally, about 7 degrees of visual angle); the contribution of fixation fi to rj's weight is (1 − A/T) if A < T, and 0 otherwise. The weight for each region is the sum of the weights contributed by each fixation. Weights are capped at a maximum viewing time of 1 second. The region is
then drawn into the importance map using this intensity. The result of this is an image where each region in the segmentation has a constant color that reﬂects the total time spent examining that region. This is fundamentally different from the perceptual measures used in our other approaches. It is not a measure of visibility, but instead a measure of how much something has been looked at. It is similar conceptually to the subject/background distinction used to render background areas particularly abstractly in our line drawing style. While that was a binary distinction, this approach creates a relative measure of importance using ﬁxation length. More sophisticated ways of matching ﬁxations and regions are possible but we have found this sufﬁcient for our prototype. The resulting subject map is then used in a very straightforward way. At each point in the image, n iterations of anisotropic diffusion are performed where n interpolates linearly between 1 at the brightest parts of the importance map and an abstraction parameter M at the darkest parts (M =250 in most results shown here).
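The weighting and iteration-count scheme might be sketched as follows (we assume each fixation's contribution is scaled by its duration t, consistent with counting dwell time and the 1-second cap; the names and the rounding in the interpolation are ours):

```python
import math

def soft_weight(fx, fy, region, T=175.0):
    """Soft assignment of a fixation at (fx, fy) to a region: (1 - A/T)
    where A is the mean distance to the region's pixels and T is about
    7 degrees of visual angle (175 px here); 0 if A exceeds T."""
    A = sum(math.hypot(px - fx, py - fy) for px, py in region) / len(region)
    return max(0.0, 1.0 - A / T)

def region_importance(region, fixations, cap=1.0):
    """Total dwell-weighted importance of a region, capped at `cap`
    seconds of viewing time.  Each fixation is (x, y, duration)."""
    return min(cap, sum(t * soft_weight(x, y, region)
                        for x, y, t in fixations))

def iterations(importance, M=250, cap=1.0):
    """Diffusion iterations for a pixel: 1 where importance is maximal,
    the abstraction parameter M where importance is zero."""
    return round(1 + (M - 1) * (1.0 - importance / cap))
```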
Results and Discussion
Some results of this process are pictured in Figure 8.2. The abstraction is much subtler than in other styles, but when viewed at high resolution an interesting subtle falloff in detail, largely small scale texture, is noticeable. This captures at least a bit of the effect that can sometimes be seen in photorealistic paintings where low contrast detail seems to disappear, while some high contrast details, for example in reﬂections and specularities, appear to be emphasized. Figure 8.4 illustrates the importance of taking region boundaries into account in assigning importance to image regions. If importance just varies locally based on dwell time in the vicinity, ﬁxated objects have a halo of detail around them. A number of limitations to this approach are obvious. Clearly it doesn’t capture the ﬂexibility an artist uses in abstracting texture, and removing entire elements. Even
Figure 8.2: Photo abstraction results
(g) Figure 8.3: Photo in (a) is abstracted using ﬁxations in (b) in a variety of different styles. (c) Painterly rendering, (d) line drawing, (e) locally disordered [Koenderink and van Doorn, 1999], (f) blurred, (g) anisotropically blurred.
Figure 8.4: (a) Detail of our approach, (b) the same algorithm using an importance map where total dwell is measured locally. Notice in (b) the leaking of detail to the wood texture from the object on the desk. Here differences are relatively subtle; but in general it is preferable to allocate detail in a way that respects region boundaries. for performing simple textural abstraction it is limited. Importantly, the total amount of abstraction possible without creating disturbing artifacts is limited. When relatively few iterations of smoothing are performed, abstraction is limited and small high contrast features in the least important areas remain quite distinct. In contrast, if many iterations of smoothing are performed, blurring becomes very apparent and the image takes on a foggy appearance that distracts from the scene (see Figure 8.5). There is an interesting unanswered question here of what features are important for the percept of a realistic image. What about an image makes it appear like a photograph, as distinct from a highly finished traditional painting or a painting by a photorealist artist? The range of contrasts is clearly one cue that has been shown to be perceptually important for material recognition [Adelson, 2001, Fleming et al., 2003]. Figure 8.3 provides an interesting comparison of abstraction performed using a number of different methods, which give very different impressions. Though a principled understanding of the perception involved is ultimately necessary, there are various techniques that might currently be brought to bear on this problem. Anisotropic diffusion provides a gradient weight that controls how sensitive blurring is to the local gradient. At one extreme, blurring is uniform. At the other
Figure 8.5: The range of abstraction possible with this technique is limited. With greater abstraction the scene begins to appear foggy; in some sense it no longer looks like the same scene.

extreme, variations in the image are so carefully respected that almost no blurring occurs. This parameter could also be varied based on importance, though it is not clear how useful doing so would be. A similar process of ﬁltering using a mean shift or bilateral ﬁltering framework might provide more control. Though mean shift ﬁltering tends to produce images that no longer look like photographs, a careful scheme for controlling the number of iterations and the color and spatial scales of ﬁltering might overcome this problem. Ultimately, capturing a wider variety of artistic effects requires a more structured understanding of texture and of the grouping of scene elements; these are more difﬁcult problems.
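The importance-scaled diffusion discussed above can be made concrete with a small sketch. The following is a minimal Perona-Malik-style diffusion in plain Python, with the update additionally scaled by an importance map so that important regions resist blurring. The function name, parameter values, and exact weighting are illustrative assumptions, not the system's actual implementation.

```python
import math

def diffuse(img, importance, iters=20, k=0.1, lam=0.2):
    """Perona-Malik-style diffusion of a single-channel image (list of
    lists of floats).  The gradient weight g = exp(-(d/k)^2) suppresses
    blurring across strong edges, and scaling the update by
    (1 - importance) preserves important regions.  Parameter values are
    illustrative assumptions, not those used in the dissertation."""
    h, w = len(img), len(img[0])
    for _ in range(iters):
        out = [row[:] for row in img]          # Jacobi-style update
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                c = img[y][x]
                flux = 0.0
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    d = img[ny][nx] - c
                    g = math.exp(-(d / k) ** 2)   # gradient weight
                    flux += g * d
                out[y][x] = c + lam * (1.0 - importance[y][x]) * flux
        img = out
    return img
```

With a small `k`, strong edges receive near-zero weight and survive smoothing; where the importance map is 1, pixels are left untouched entirely.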
Chapter 9 Evaluation
Though the problem of photorealistic abstraction is difﬁcult, our results for artistic styles suggest we have succeeded in achieving meaningful abstraction. Results look interesting, and the reduction of detail does not seem visually jarring. Often in graphics this kind of informal impression is evaluation enough. This is not an illegitimate viewpoint: in the context of art or entertainment, formal evaluation may not be necessary, and an appeal to visual intuitions about what looks good can sufﬁce. Though our methods are targeted at creating artistic images for entertainment, we are interested in applying these techniques to illustration or visualization. Because of this, we would like to empirically evaluate the claim that our system can direct viewers to areas highlighted with detail. Even this does not categorically prove that the technique makes images easier to understand; showing that would require a visualization application, where goals and task-related factors are in play. However, showing that abstraction directs visual interest would demonstrate a quantiﬁable perceptual effect of our technique. To establish this we perform a user study, comparing viewing behavior over our images to viewing of the original photographs and of renderings in our style created with several different distributions of detail. We ﬁrst motivate our choice of evaluation technique, then present the speciﬁcs of how we conducted our experiments. Results and some implications of our ﬁndings are then discussed. Our aim is not just to validate our system; rather, our goal is threefold:
• Present a method of evaluation new to NPR (Section 9.2)—one based on tracking viewers’ eye movements.
• Use this method to provide quantitative validation for our system (Section 9.4), as well as interesting new insights into the role of detail in imagery (Section 9.5).

• Explain why this evaluation methodology is widely applicable in NPR, even when the NPR system itself does not use eye tracking.
Evaluation of NPR
Prior methodologies used to evaluate NPR fall into one of two categories. The ﬁrst polls a representative number of users, collecting their opinions to ﬁnd out how they respond to the system. Schumann et al. polled architects for their impressions of sketchy and traditional CAD renderings and, based on the results, argued for the suitability of sketchy renderings for conveying the impression of tentative or preliminary plans. Similarly, Agrawala and Stolte demonstrate the effectiveness of their map design system using feedback from real users. The second approach measures users' performance at speciﬁc tasks as they use a system (or its output). When the task depends on information gained from using the system, performance provides a measure of how effectively the system conveys information. An early study [Ryan and Schwartz, 1956] looked at the time required to judge the position of features in photos and hand-rendered illustrations in different styles; faster responses suggested that simpler illustrations were clearer. Interrante assessed renderings of transparent surfaces using medical imaging tasks, where performance provided a measure of how clearly the rendering method conveyed shape information. Gooch and Willemsen tested users' ability to walk blindly to a target location in order to understand spatial perception in a non-photorealistic virtual environment. Gooch et al. compared performance on learning and recognition tasks using photographs and NPR images of faces. Heiser et al. evaluated automatic instructional diagrams by having subjects assemble physical objects and assessing their speed and errors. Investigations like this draw on established research methodologies
in psychology and psychophysics.

Both of these methods have their limitations. For example, the goal of imagery is not always task related. In advertising or decorative illustration (and possibly in much ﬁne art) the goal is more to attract the eye than to convey information. Success is measurable, but not by a natural task. Surveys have their own limitations: the information desired may not be reliably available to subjects by introspection. In addition, both task performance and user approval ratings assess only the quality of a system as a whole. Neither directly says why a pattern in performance or experience occurs. To understand this, the system needs to be systematically changed and the experiment repeated. This process can be costly and time-consuming (or impossible). Any additional information that aids the interpretation of results is therefore highly valuable.

We evaluate only one of the several styles of rendering presented in this work, the segmentation-based line and color drawings [DeCarlo and Santella, 2002]. This system was chosen for evaluation in large part because it is the most developed of these systems. It is also an interesting candidate for evaluation in that it performs a very clean, aggressive kind of simpliﬁcation. Unlike the other methods, it removes everything from abstracted regions, leaving them completely featureless. There is also no randomness in the algorithm, as opposed to the painterly rendering system, where there are random variations in stroke placement. This allows multiple, otherwise identical, renderings to be created for comparison with different distributions of detail. Our hope is that removing detail can enhance image understanding. Further, successive viewers may be encouraged to examine the image in a way similar to the ﬁrst viewer, and take away a similar meaning or impression.
There is no natural task in which to evaluate this effect, because our goal is creating artistic imagery rather than visualizations for some task. Systematic questioning of viewers might substantiate the intuition that the images are well designed, but would not inform future work. Here, we present an alternate evaluation methodology which draws on established
psychophysical research. This approach analyzes eye movements and provides an objective measure of cognition. It can be the basis of evaluation, or provide complementary evidence when a task or other method is available. Regardless of the context in which the user is viewing an image, the common factor is the act of looking. This mediates all information that passes from the display to the user. In all of this work, this key insight has provided an easy and intuitive method for abstraction. For the same reason, we apply eye tracking to evaluation. These choices are independent; evaluation via eye tracking is a general methodology that can be used regardless of how imagery is created. Our study also looks at renderings that are created without the use of eye tracking.
Analysis of Eye Movement Data
Basic parsing of eye movements into ﬁxations and saccades has already been discussed in Section 3.1. Once individual ﬁxations have been isolated, it is often useful to impose more structure on the data. In looking at an image, viewers examine many different features, some closely spaced on a single object, others more distant. A common pattern of looking is to scan a number of different features and then return to particularly interesting ones. Multiple close ﬁxations suggest interest and increased processing in the same location. Because of this, cumulative interest in a location is often a valuable measurement; this was used as the basis of an importance map in our system for photorealistic abstraction in Chapter 8. When the locations of features are known, cumulative interest is often measured by counting viewing time spent within a bounding box [Salvucci and Anderson, 2001]. When there are no predetermined features, clustering can be used to characterize regions of interest in a data driven fashion [Privitera and Stark, 2000]. Nearby ﬁxations are clumped together, yielding larger, spatially structured units of visual interest. The number of clusters indicates the number of regions of interest (ROI) present, and the number of points contained in them provides a measure of cumulative
interest. This is achieved using a mean shift clustering that considers only the x,y positions of locations viewed [Santella and DeCarlo, 2004b]. In the experiment described in the next section this will reveal important information about how viewers look at images.
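A minimal sketch of this clustering step, assuming a Gaussian kernel over the x, y fixation positions and plain Python lists; the actual implementation in [Santella and DeCarlo, 2004b] may differ in kernel and convergence details.

```python
import math

def mean_shift_clusters(points, scale=25.0, iters=30):
    """Cluster 2-D fixation positions by shifting each point to a local
    density mode (Gaussian kernel of bandwidth `scale`), then merging
    modes closer than `scale`.  Returns a list of [mode, member_indices]."""
    modes = []
    for px, py in points:
        x, y = px, py
        for _ in range(iters):
            wsum = sx = sy = 0.0
            for qx, qy in points:
                w = math.exp(-((x - qx) ** 2 + (y - qy) ** 2)
                             / (2 * scale ** 2))
                wsum += w
                sx += w * qx
                sy += w * qy
            x, y = sx / wsum, sy / wsum     # shift toward the local mean
        modes.append((x, y))
    clusters = []
    for i, (x, y) in enumerate(modes):
        for c in clusters:                  # merge modes closer than `scale`
            cx, cy = c[0]
            if math.hypot(x - cx, y - cy) < scale:
                c[1].append(i)
                break
        else:
            clusters.append([(x, y), [i]])
    return clusters
```

The number of clusters returned estimates the number of ROIs, and the size of each member list gives the cumulative-interest count described above.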
Figure 9.1: Example stimuli. Detail points in white are from eye tracking, black detail points are from an automatic salience algorithm.
The images used in this experiment were 50 photographs plus four NPR renderings of each photo, for a total of 250 images and ﬁve conditions. Most photos were taken from an on-line database. Photos spanned a broad range of scenes. Images that could not be processed successfully were avoided, such as blurry or heavily textured scenes. Prominent human faces were also excluded, although human ﬁgures were present in a number of the images. All NPR images were generated using the method of DeCarlo and Santella presented in Chapter 6. The four renderings differed in how decisions about the inclusion of detail were made. The ﬁve conditions are pictured in Figure 9.1; they are:

Photo: The unmodiﬁed photograph.

High Detail: A low global threshold on contrast ensures that most detail is retained, removing primarily areas of low-contrast texture and shading.

Low Detail: A high contrast threshold is used, removing most detail throughout the image. The resulting image is drastically simpliﬁed but still for the most part recognizable.

Eye Tracking: Detail is modulated as in [DeCarlo and Santella, 2002], using a prior record of a viewer's eye movements over the image. Detail is preserved in locations the original viewer examined (here we call these locations detail points) and removed elsewhere. The eye tracking data was recorded from a single subject who viewed each image for ﬁve seconds (and was instructed to simply look at the image).

Salience Map: Detail is modulated in the same manner as in the eye tracking condition, but the detail points are selected automatically by a salience map algorithm [Itti et al., 1998, Itti and Koch, 2000]. The algorithm has a model of the passage of time, so, like ﬁxations, each point has an associated duration. Five seconds' worth of detail points were created.

The locations viewed by people and those chosen by the salience algorithm can be similar in some cases, but in general the two result in renderings with noticeably different distributions of detail. This set of conditions represents a systematic manipulation of an image: the effects of NPR style, detail, and abstraction are separated. Local simpliﬁcation is present in two forms, one based on a viewer and the other on purely low-level features. Because detail is controlled by choosing the levels of a hierarchical segmentation, simpliﬁed images consist of a subset of the features in higher-detail images. The eye tracking and salience conditions are rendered literally using a part of the tree used to render the high detail condition, while the low detail case generally includes the least content.
Data was collected from a total of 74 subjects including 50 undergraduates participating for course credit and 24 subjects (graduate and undergraduate) participating for pay.
All images were displayed on a 19 inch LCD display at 1240 x 960 resolution. The screen was viewed at a distance of approximately 33.75 inches, subtending a visual angle of approximately 25 degrees horizontally. Eye movements were monitored using an ISCAN ETL-500 table-top eye-tracker (with a RK-464 pan/tilt camera). The movement of the pan/tilt unit introduced too much noise in practice; it was not active during the experiment. Instead, subjects placed their heads in an optometric chin rest to minimize head movements.
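As a check on the stated geometry, the horizontal visual angle can be computed from the display diagonal, resolution, and viewing distance. The panel width here is inferred from the (approximately 4:3) resolution; this is an illustrative calculation, not a measurement.

```python
import math

# Geometry of the display described above.
diag_in, res_x, res_y, dist_in = 19.0, 1240, 960, 33.75

diag_px = math.hypot(res_x, res_y)
width_in = diag_in * res_x / diag_px                  # horizontal panel width
angle = 2 * math.degrees(math.atan(width_in / 2 / dist_in))   # ~25 degrees
px_per_inch = res_x / width_in
px_per_deg = px_per_inch * dist_in * math.tan(math.radians(1))
```

This gives roughly 25 degrees horizontally, matching the text, and roughly 49 pixels per degree near the screen center, so half a degree corresponds to about 24 pixels.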
Calibration and Presentation
Eye trackers need to be calibrated in order to map a picture of a subject’s eye to a position in screen space. This is accomplished by having the viewer look at a series of predetermined points. In our experiments, a nine point calibration was used. The quality of this calibration was checked visually, and also recorded. Every 10 images, the calibration was checked and re-calibration was performed if necessary. Recordings were used to measure the average quality of the calibrations. Errors had a standard deviation of approximately 24 pixels (about a half degree), which agrees with the published sensitivity of the system. Note that this does not account for systematic drift from small head movements during a viewing.
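Mapping the eye-camera measurement to a screen position can be illustrated with a least-squares affine fit over the nine calibration points. This is a hypothetical sketch with invented names; the ISCAN system's actual calibration model is not described here and may be more elaborate.

```python
def fit_affine(eye_pts, screen_pts):
    """Least-squares affine map from eye-camera coordinates to screen
    coordinates, fit over the calibration points.  A hypothetical sketch,
    not the tracker's real calibration procedure."""

    def solve3(m, v):
        # Gaussian elimination with partial pivoting on a 3x3 system.
        for i in range(3):
            p = max(range(i, 3), key=lambda r: abs(m[r][i]))
            m[i], m[p] = m[p], m[i]
            v[i], v[p] = v[p], v[i]
            for r in range(i + 1, 3):
                f = m[r][i] / m[i][i]
                for c in range(i, 3):
                    m[r][c] -= f * m[i][c]
                v[r] -= f * v[i]
        x = [0.0, 0.0, 0.0]
        for i in (2, 1, 0):
            x[i] = (v[i] - sum(m[i][c] * x[c]
                               for c in range(i + 1, 3))) / m[i][i]
        return x

    rows = [(ex, ey, 1.0) for ex, ey in eye_pts]
    coeffs = []
    for dim in range(2):   # solve normal equations for screen x and y
        ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
               for i in range(3)]
        atb = [sum(r[i] * s[dim] for r, s in zip(rows, screen_pts))
               for i in range(3)]
        coeffs.append(solve3(ata, atb))
    return coeffs  # [[a, b, c] for screen x, [a, b, c] for screen y]
```

With nine well-spread calibration points the system is heavily over-determined, which is what makes a visual quality check and the residual-error estimate described above meaningful.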
After calibration, subjects were instructed to look at a target in the center of the screen and click the mouse to view the ﬁrst picture when ready. On the user’s click, the image was presented for 8 seconds, and eye movements were recorded. After this, the target reappeared for one second. A question then appeared. The subject clicked on a radio button to select their response, clicked again to go on, and the process repeated.
Subjects normally saw one condition of each of the 50 images. The condition and order were randomized. While viewing the images, subjects were told to pay attention so they could answer questions that came after each image. Questions were divided into two types, the order of which was randomized: the viewer was asked either to rate how much they liked the image on a scale of 1 to 10, or to say whether they had already seen the image, or a variant of it, earlier in the experiment. Occasional duplicate images were inserted randomly when the second question was used; data for these repeated viewings is not included in the analysis. The questions were selected to keep the viewer's attention from drifting, while at the same time not giving them speciﬁc instructions that might bias the way they looked at the image.
Figure 9.2: Illustration of data analysis, per image condition. Each colored collection of points is a cluster. Ellipses mark 99% of variance. Large black dots are detail points. We measure the number of clusters, the distance between clusters and the nearest detail point, and the distance between detail points and the nearest cluster.
Data Analysis
Analysis draws upon a number of established measures and techniques tailored to our experiment, to provide complementary evidence about how stylization and abstraction modify viewing. Some processing is common to all our analyses. First, all eye movement data is ﬁltered to discard point-of-regard samples during saccades. We then perform clustering on the ﬁltered samples [Santella and DeCarlo, 2004b]. The clusters are not always meaningful, but on the whole they correspond well to features of interest in the image. There is reason to believe the number of points contained in a cluster may reveal how important a feature is; this is not considered here. Our clustering method requires a scale choice: clusters whose modes are closer than this scale value will be collapsed together. We select a scale of 25 pixels (roughly half a degree) for all analysis, which is about the level of tracker noise present. Results depend on the scale choice used in the clustering process; clearly, at coarser scales there will be fewer clusters and a smaller difference between the condition means. We argue below that this does not affect interpretation of our results.
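The saccade-discarding step can be sketched as a simple velocity threshold on consecutive point-of-regard samples. The threshold and geometry values below are illustrative, not those of the actual parsing described in Section 3.1.

```python
def drop_saccade_samples(samples, dt, vmax_deg_s=30.0, px_per_deg=48.0):
    """Keep only point-of-regard samples whose velocity relative to the
    previous sample is below a threshold (a standard velocity criterion
    for saccade detection; the threshold and pixel geometry here are
    illustrative assumptions)."""
    max_step = vmax_deg_s * px_per_deg * dt   # max travel per sample, pixels
    kept = [samples[0]]
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        step = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        if step <= max_step:
            kept.append((x1, y1))
    return kept
```

Samples during a saccade travel far between consecutive frames and are dropped; the surviving slow samples are what the clustering step then operates on.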
All clustering was conducted in two ways. In the ﬁrst, which we will refer to as per viewer analysis, each viewer's data was clustered separately. In the second, which we will refer to as per image analysis, data for all viewers of a particular image was combined before clustering. It is reasonable to think that as one adds data from individual viewers, the data will approach some hypothetical distribution of image feature interest [Wooding, 2002]. This second analysis may therefore provide a better measure of aggregate effects. Below, we describe the measurements performed using the clusters. See Figure 9.2 for an illustration of the data.

Clusters: Because clusters roughly correspond to areas examined in the image, we would expect to ﬁnd fewer clusters in the eye tracking and salience cases if they succeed in focusing viewer interest. We might also expect uniform simpliﬁcation to reduce the number of clusters, because it reduces detail.

Distance (from data to detail points): In the eye tracking and salience conditions, we wish to measure whether interest is focused on the locations where detail is preserved. The change between conditions in the distance from each of the cluster centers to the closest detail point tells us how effective the manipulation is in drawing interest to these locations. If the abstraction is successful, we would expect that clusters will be closer. This tests the system as a whole. There will be no change in distance if our hypothesis is wrong, which would mean that varying detail does not attract more focused interest. It is also possible that in a particular image there was no detail that could be put in a particular location, because there was none in the original image, or because it cannot be represented in our system's visual style.

Distance (from detail points to data): Implicit in the choice of detail points is the assumption that viewers should look at all of the locations. This is not captured by the previous distance measure.
A viewer could spend all their time looking at a single detail point, yielding a zero distance. To quantify this, it is possible to measure the distance from each detail point to the closest cluster. A high average value means the locations of
a signiﬁcant number of detail points were not closely examined. This distance will decrease in salience and eye tracking conditions if detail modulation makes people look at high detail areas that were not normally examined.
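The two asymmetric distance measures can be sketched directly. The helper name `mean_nearest` and the points below are toy data for illustration only.

```python
import math

def mean_nearest(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.hypot(sx - dx, sy - dy) for dx, dy in dst)
               for sx, sy in src) / len(src)

# Toy data: cluster centers from the clustering step, and detail points.
clusters = [(100, 100), (300, 250)]
details = [(110, 105), (305, 255), (500, 400)]

d_to = mean_nearest(clusters, details)    # distance from data to detail points
d_from = mean_nearest(details, clusters)  # distance from detail points to data
# The unexamined detail point at (500, 400) inflates d_from but not d_to,
# which is exactly the asymmetry the two measures are meant to capture.
```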
Figure 9.3: Statistical signiﬁcance is achieved for number of clusters over a wide range of clustering scales. The magnitude of the effect decreases, but its signiﬁcance remains quite constant over a wide interval. Our results do not hinge on the scale value selected.
Data for all subjects was clustered as discussed in Section 9.3.1. In total there are 10 eye tracking records for each of the 50 images in each of the 5 conditions, for a total of 2500 individual recordings. More data than this was gathered; a matched number of recordings for each condition was selected randomly. As noted in Section 9.2.4, data was recorded in blocks where one of two questions was asked. Analysis showed no effect of the questions, and these results are based on roughly equal numbers of images presented in blocks of each question type. Analyses of variance (ANOVA) are used to test whether differences between conditions are signiﬁcant. These tests produce a p value: the probability that the measured
difference could occur by chance. The per viewer case lends itself naturally to statistical testing by a two-way repeated measures ANOVA. In this context a two-way ANOVA separately tests the contributions that the particular image and the condition make to the results. This lets one look at the effect of a condition while factoring out the variation among the different images. A repeated measures analysis treats each viewer's eye tracking record as an independent measurement, so there are 10 data points per image and condition pair. In the per image analysis, the 10 recordings are collapsed together and the data is analyzed instead by a simple two-way ANOVA. There is now only one data point per image and condition pair, so it is more difﬁcult to show a statistically signiﬁcant effect. We want to know not only whether some of these conditions differ from each other, but also which pairs differ. This requires a number of tests, which raises a concern: since each test carries some probability of a positive result occurring by chance, many tests create an unacceptably high cumulative risk of spurious positives. Several approaches exist to deal with this problem. We adopt a common methodology for minimizing this risk: one test is used to establish that all of the means are not the same, and only if this test succeeds are pairwise tests performed. This method is implicit in all pairwise test results reported.
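The protected-testing procedure (an omnibus test, with pairwise tests only if it succeeds) can be sketched as follows. To stay dependency-free, this sketch uses a one-way F statistic with a permutation p-value as a stand-in for the two-way repeated measures ANOVA actually used; it illustrates only the protection logic, not the real analysis.

```python
import itertools
import random

def f_stat(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    if ss_within == 0:          # degenerate case: perfectly separated groups
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def perm_p(groups, trials=2000, rng=random.Random(0)):
    """Permutation p-value for the F statistic (a dependency-free stand-in
    for the parametric F distribution used in the real analysis)."""
    obs = f_stat(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        it = iter(pooled)
        perm = [list(itertools.islice(it, s)) for s in sizes]
        if f_stat(perm) >= obs:
            hits += 1
    return hits / trials

def protected_pairwise(groups, alpha=0.05):
    """Run pairwise comparisons only if the omnibus test succeeds."""
    if perm_p(groups) >= alpha:
        return []               # omnibus failed: report no pairwise tests
    return [(i, j, perm_p([groups[i], groups[j]]))
            for i in range(len(groups)) for j in range(i + 1, len(groups))]
```

When the omnibus test fails, no pairwise p-values are reported at all, which is what keeps the cumulative risk of spurious positives under control.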
Figure 9.4 graphs the average results for all measures. The take-away message, quantiﬁed below, is that on the whole:

• Eye tracking and salience conditions have fewer clusters than photo and uniform detail conditions in all analyses. In the per image analysis, eye tracking has fewer clusters than salience.

• Distance between the viewed locations and the detail points decreased as a result of modulating detail.
Figure 9.4: Average results for all analyses per image (per image analysis: data for all viewers of an image is clustered together). Panels show the number of clusters, the distance from clusters to detail points, and the distance from detail points to clusters, for the high detail, eye tracking, and salience conditions.
Figure 9.5: Average results for all analyses per viewer (per viewer analysis: data for each viewing is clustered separately). Panels show the number of clusters, the distance from clusters to detail points, and the distance from detail points to clusters.
• Distance between detail points and viewed locations showed no change; however, the distances for salience points were signiﬁcantly higher than those for eye tracking points.
Clusters: In the per viewer analysis, there was about one fewer cluster in the eye tracking and salience conditions, compared to the others. This means each viewer examined one fewer region on average. Analysis showed this difference was signiﬁcant (p < .001). There was no signiﬁcant difference (p > .05) between the photo and uniform detail conditions, or between eye tracking and salience. In the per image analysis, eye tracking had about 6 fewer clusters than the uniform detail and photo conditions, while salience had about 3 fewer. Eye tracking differed signiﬁcantly from all other conditions including salience (p < .001). Salience differed from the original at p < .01, and from high and low detail at p < .05.

Distance (from data to detail points): Clusters in the eye tracking condition were about 20 pixels closer to the eye tracking detail points than high detail clusters, in both per viewer and per image analysis (p < .0001). Salience clusters were about 10 pixels closer to salience detail points (per viewer: p < .0001; per image: p < .01). This is not spatially very large, but it represents a consistent shift of cluster centers towards the detail points. The magnitudes of the two shifts (10 and 20 pixels) were not signiﬁcantly different from each other. For per image analysis, distances measured to eye tracking detail points were signiﬁcantly higher (p < .01) than corresponding distances to salience points.

Distance (from detail points to data): There was no signiﬁcant change (p > .05) for salience or eye tracking renderings in either analysis. In both analyses, however, the distances were signiﬁcantly smaller (p < .001) when measured from eye tracking detail points than from salience detail points (a difference of about 40 pixels in the per viewer and 10 in the per image analysis).
All of the two-way ANOVAs tested the signiﬁcance of both the experimental condition, and the particular image. In all tests, the effect of the image was highly signiﬁcant (p < .001). This is neither surprising nor particularly informative. It simply states that individual images have varying numbers of interesting features and they are distributed differently in relation to the detail points. As mentioned above, all of this analysis used clusters created with a particular choice of scale. Figure 9.3 shows that results do not depend on this choice. The difference between mean number of clusters in the high detail and eye tracking conditions (per viewer analysis) is plotted along with the corresponding p value. Though the magnitude of the difference varies, p values show an effect of approximately equal signiﬁcance over a range of scales. The effects we have shown are therefore not due to the particular scale selected.
These results provide evidence that local detail modulation does change the way viewers examine an image. Eye tracking and salience renderings each had signiﬁcantly fewer clusters than all uniform detail images in both per image and per viewer analysis (signiﬁcance is stronger in the per viewer analysis, but that is to be expected based on the number of samples). Distances to detail points also showed an improvement for both salience and eye tracking renderings. This indicates that not only were fewer places examined, but the examined places were closer to the detail points. Distances from detail points to data did not show improvement. This indicates that though the manipulation concentrated interest in places with detail, it did not bring new interest to detail points that were not already interesting in the high detail renderings. Viewers look more at detailed locations when other areas have been simpliﬁed, but this did not beneﬁt all locations equally. Rather, locations that were already somewhat interesting received increased interest. Results do not prove enhanced or facilitated understanding per se; however, this is strongly suggested by the more focused pattern of looking.
Results also indicate that although improvement can be seen with detail modulation based on both eye tracking and salience, the two behave differently. Modulation based on either produces fewer clusters of interest and decreased distance to detail points. However, in the per image analysis, the number of clusters for the eye tracking condition was signiﬁcantly lower than for the salience condition. Also, the distances measured from salience points were consistently higher than those from eye tracking points; this is further evidence that eye tracking points are more closely examined. Distance to detail points shows the opposite relationship (though more weakly) and argues against this conclusion. However, we show below that this is almost certainly due to the number of detail points, and is not meaningful. These results ﬁt our intuition that the locations a viewer examined will, in general, be a better predictor of future viewing than a salience model, which has no sense of the meaning of image contents. There is considerable controversy in the human vision literature about how much of eye movement behavior can be accounted for by low level feature salience. Some optimistically state that salience predictions correlate well with real eye movements [Privitera and Stark, 2000, Parkhurst et al., 2002]. Others are more doubtful, claiming that when measured more carefully, and in the context of a goal driven activity, the correlation is quite poor [Turano et al., 2003, Land et al., 1999]. Our results show salience points (at least those produced by the algorithm used) are less interesting in general. Abstraction does attract increased interest to salience points: people look nearer to some of them, but still look at fewer of them overall. It is not clear at ﬁrst glance that distance values measured against the eye tracking and salience detail points can legitimately be compared to each other in order to judge if they are functionally equivalent.
Differences may be due to the number and distribution of the two kinds of detail points, rather than their locations relative to features in the images and hence the locations of ﬁxation data collected. One very obvious difference clouds comparison of ﬁxation and salience detail points. The salience algorithm produces more ﬁxations than real viewers, so there were typically more detail points
in the salience case (10.9 per image on average) than in the eye tracking case (5.96 on average). This would seem to bias distances to salience detail points toward lower values, possibly bias distances from salience points toward higher values, and make comparison difﬁcult. This is not as bad as it might seem, because there is usually a fair amount of redundancy in the salience detail points: multiple points lie close to each other, so twice as many points does not mean twice as many actual locations. Still, this complicates quantitative interpretation. In fact, we do see that distances to detail points are higher overall for ﬁxation detail points; if interpreted as reﬂecting on the ﬁxation data, this would suggest the strange idea that salience points are examined more closely. Fortunately, some simple controls indicate that the lower distance to salience points is an unimportant artifact, while the higher distance from salience detail points is meaningful. Replacing recorded ﬁxation data with random points allows one to test whether a particular effect is due to the relationship between detail points and data, or whether detail points alone drive the effect. If an effect disappears when random data is substituted for recorded ﬁxations, it was driven by the data; if it persists, the detail points drive the effect. Replacing all ﬁxation data for all viewers with uniform random points eliminates all effects in distances from detail points. The location of ﬁxation data drives this difference. In contrast, when assessed using this random ﬁxation data, distances to detail points are still signiﬁcantly higher for eye tracking and lower for salience detail points. This difference is at least partly driven by the detail points themselves (most likely the fact that there are more of them in the salience condition; more points in a conﬁned space will mean a shorter distance to the nearest one).
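The random-points control can be sketched as follows, using toy data and a hypothetical `mean_nearest` helper. A real effect shows up as a measured distance well below the random baseline; the display dimensions match the experimental setup, but everything else is illustrative.

```python
import math
import random

def mean_nearest(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.hypot(sx - dx, sy - dy) for dx, dy in dst)
               for sx, sy in src) / len(src)

def random_fixation_control(detail_pts, fixations, w=1240, h=960,
                            trials=200, rng=random.Random(1)):
    """Compare the real detail-point-to-data distance against the same
    measure computed with uniform random 'fixations'.  A real value well
    below the random baseline indicates the effect is driven by where
    viewers actually looked, not by the detail points alone."""
    real = mean_nearest(detail_pts, fixations)
    baseline = sum(
        mean_nearest(detail_pts,
                     [(rng.uniform(0, w), rng.uniform(0, h))
                      for _ in fixations])
        for _ in range(trials)) / trials
    return real, baseline
```

If substituting random points leaves a difference between conditions intact, that difference is attributable to the detail points themselves rather than to viewing behavior, which is exactly the logic applied in the text.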
A second control adds evidence that the higher number of detail points in the salience case is responsible for the overall lower distance to salience detail points. We can discard some detail points in the salience case so that the number of detail points is equal across conditions. We would expect effects driven by the number of points to disappear in this case. When this is done, there is no qualitative change in the magnitude
of distances from detail points. Distances to salience detail points, however, become signiﬁcantly higher than those to eye track points: a reversal. The main effect of our experiment (the decreasing distance resulting from detail modulation) is not affected by this. These two controls provide fairly strong evidence that eye tracking detail points really are more examined than salience points overall. The second control could have been done before creating renderings and recording data, which would have provided a more complete match between the conditions. It would, however, have provided the salience algorithm with extra information that is usually not available: the number of ﬁxations a viewer made, a rough guide to how many important features are present.

In contrast to the changes caused by abstraction, there is little evidence that the style manipulation alone produces a signiﬁcant change in viewing. There is no signiﬁcant difference between the photo and high detail images in number of clusters. A qualitative comparison of ﬁxation scatter plots in these two conditions also suggests the distribution of points in both is largely similar. There are, however, some large differences in the effect on individual images. In some images, large areas of low contrast texture are removed by the stylization itself. In these cases, viewer interest differs between the high detail and photo conditions (see Figure 9.6 for an example). Removing prominent but low contrast texture is abstraction, but it is abstraction over which one has no control; rather, it is built implicitly into the system (in this case into the segmentation technique). The opposite effect can also occur: the style can attract attention to less noticeable features. Notice in Figure 9.6 how drawing ripples on the water in black has attracted the eye to them. A method for quantifying when and where these effects occur is a topic for future research.
These style-driven effects appear to be primarily low-level, so work on salience [Itti et al., 1998] may provide a good starting point.

Interestingly, our results also indicate that the number of regions of interest in an image is not primarily driven by detail. It is surprising how much the pattern of interest in the low detail case qualitatively and quantitatively resembles the high detail case. The highly detailed and highly simplified renderings have the same number of clusters, while the mixed detail images, in which the number of regions lies between these extremes, have fewer. This implies that it is locally increased detail that attracts the eye; too much or too little detail everywhere leads to a broader dispersion of interest. Globally low detail in particular appears to produce scattered fixations: the distribution of fixations is similar overall, but groups of fixations are smaller and more scattered (clusters in the low detail case in fact represent lower total dwell times on average than in the other conditions). This pattern is suggestive of a (failed) search for interesting content. Substantiating and quantifying this is an interesting subject for future research and has direct application in designing future NPR and visualization systems. It would, for example, be interesting to see how detail relates to the time course of viewing and how behavior might vary with longer or shorter viewing times.

In summary:

• viewers look at fewer locations in images simplified using eye tracking and salience data,

• these locations tend to be near locations where detail is locally retained,

• neither the NPR style itself nor the application of uniform simplification modifies the number of locations examined, and

• this effect exists for both eye tracking and salience detail points, but there is less interest overall in salience points.

These results might seem like exactly what one would expect, given the use of abstraction in art. This was not the only possible outcome, however. One could imagine that abstraction performed in this semi-automatic manner would simply not work. Simplifying arbitrary areas of an image might confuse viewers, and they might, for example, spend all their time looking at background regions obscured by simplification,
trying to figure out what is there. This bears a certain resemblance to behavior in the low detail images. But when detail is retained in some locations and removed elsewhere, viewers seem to get the point and explore the detailed areas.
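The two measures behind this summary, the number of fixation clusters and the distance from fixations to detail points, can be sketched as follows. The data, the greedy clustering rule, and the matched subsampling scheme are hypothetical stand-ins for our actual filtering and analysis; the point is only the mechanics of the comparison.

```python
import math
import random

def cluster_fixations(fixations, radius=50.0):
    """Greedily group fixations (x, y, dwell_ms): join the first cluster
    whose running-mean center is within `radius`, else start a new one."""
    clusters = []
    for x, y, dwell in fixations:
        for c in clusters:
            cx, cy = c["center"]
            if math.hypot(x - cx, y - cy) <= radius:
                c["points"].append((x, y, dwell))
                n = len(c["points"])
                c["center"] = (cx + (x - cx) / n, cy + (y - cy) / n)
                break
        else:
            clusters.append({"center": (x, y), "points": [(x, y, dwell)]})
    return clusters

def mean_nearest_distance(fixations, detail_points):
    """Average distance from each fixation to its nearest detail point."""
    return sum(min(math.dist((x, y), p) for p in detail_points)
               for x, y, _ in fixations) / len(fixations)

def matched_subsample(points, k, seed=0):
    """The second control: discard detail points at random so that both
    conditions are compared with the same number of points."""
    return random.Random(seed).sample(points, k)

# Hypothetical data: fixation records and detail points for one image.
fixations = [(100, 100, 300), (104, 96, 250), (310, 300, 200),
             (305, 306, 400), (600, 120, 180)]
eye_points = [(102, 98), (307, 303)]
salience_points = [(110, 90), (300, 310), (50, 50), (450, 450)]

n_clusters = len(cluster_fixations(fixations))
d_eye = mean_nearest_distance(fixations, eye_points)
d_sal = mean_nearest_distance(
    fixations, matched_subsample(salience_points, len(eye_points)))
```

A real analysis would of course first filter raw eye tracker samples into fixations; the structure of the comparison is the same.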
These results validate our attempt to focus interest by manipulating image detail using eye tracking data. Results also have broader implications for those designing NPR systems, using salience maps in graphics, and designing future experimental evaluations of NPR systems.

Our results show that meaningful abstraction is important for effective NPR. Abstraction that does not carry any meaning is implicit in many NPR styles; for example, there are no shading cues in a pure line drawing. Uniform control of detail is also common in NPR systems. These are important considerations, but both were tested in this study and produced no change in the number of locations viewers examined. In contrast, meaningful abstraction clearly affected viewers in a way that supports an interpretation of enhanced understanding. Directed, meaningful abstraction should be considered seriously in designing future NPR systems.

Similarly, although low level (salience map) and high level (eye track) detail points behave similarly in their ability to capture increased interest, they differ in their absolute capture of interest. The increased capture of interest seems to be a low level effect; people don't bother looking where there isn't anything informative. However, semantic factors are also active. The locations that interest another person are influenced by image meaning, and are a better predictor than salience of where future viewers will look. This has implications for the use of salience maps in graphics. It would be highly desirable in a number of applications to automatically locate the places viewers will look, not the least of which would be automatic abstraction in NPR and visualization.
Though salience points behave similarly to eye tracking points in part of our analysis, results indicate that on the whole salience is not suitable for this purpose. It can be successful in adaptive rendering applications [Yee et al., 2001], where it is only necessary that people be somewhat more likely to look at selected locations. Salience does provide information about how likely a feature's structural qualities are to attract interest. However, we want to encourage a later viewer to get the same content from an image that an earlier, perhaps more experienced, viewer examined closely. Current salience map algorithms are hardly expert viewers. This kind of application requires better predictions, motivated by semantic information that salience is generally unlikely to provide.

In addition, eye tracking may be a useful technique for evaluations of other NPR systems. It provides an alternative to questionnaires and task performance measures. Even when a task based method is possible, eye tracking can be useful in investigating what features underlie the performance observed. Information is, after all, extracted from imagery by looking, and the large body of available research suggests that the locations examined indeed reveal the information being used to complete a task. The experiments performed by Gooch et al. can serve to illustrate this. If users can perform a task better using an NPR drawing of a face rather than a photograph, it is valuable to see where clusters of visual interest occur in the two conditions. This information may explain performance. It could focus future experiments and inform design choices about rendering faces, without exhaustive experimental testing. Similarly, in evaluation of assembly diagrams [Agrawala et al., 2003], eye tracking can provide very specific information about how people use such instructions.
For example, eye tracking could help further explain the way users interleave actions in the world with examination of the relevant parts of the instructions. Eye tracking records are directly and obviously related to the imagery that evokes them. This makes them very interpretable, a desirable quality in any measurement. In turn, this guards against the danger [Kosara et al., 2003] of performing a user evaluation that ultimately doesn't yield any useful result.
Figure 9.6: Original photo and high detail NPR image with viewers' filtered eye tracking data. Though we found no global effect across these image types, there are sometimes significantly different viewing patterns, as can be seen here.

Our experimental results quantify our intuition that our technique can focus interest on areas highlighted with increased detail in abstracted images. Loosely interpreted, our results could even be looked at as an experimental confirmation of the widely held informal theory [Graham, 1970] that art functions in part by carefully guiding viewers' eyes through an image. Our results also resonate with findings from the literature on the psychology of art [Locher, 1996] suggesting that viewers spend more time in long fixations on the somewhat vaguely defined category of 'well balanced' images, while fewer long fixations occur on 'unbalanced' images. This convergence of work from different fields is encouraging.
Chapter 10 Future Work
We have demonstrated the effectiveness of our approach to artistic abstraction. This success encourages investigation in a number of related areas. These include improvements and extensions to image processing and representation techniques, better perceptual models to control abstraction, and application of these and related techniques to practical problems.
The models of image features used in our work are drawn from the state of the art, but this is constantly changing. As image processing and understanding techniques improve, they can be incorporated into a richer model of image contents. Interesting developments are possible in a number of areas.
Our model of image regions could be extended in a number of ways. Our segmentation technique is limited by a piecewise constant color model of image regions. A segmentation technique that could model a segment as a region of uniform texture or smooth variation would better represent meaningful areas of the image. Once able to capture coherent textured areas, how to abstractly render them becomes an interesting question. Simply rendering them in a mean or median color is possible; more meaningful textural abstraction presents an interesting challenge. A natural question to ask is what features of a texture make it look like what it is. The ability to create NPR versions of textures from images could be applied in 3D as well as image-based NPR.

Figure 10.1: A rendering from our line drawing system (b) can be compared to an alternate locally varying segmentation (c). This segmentation more closely follows the shape of shading contours.

The inability of the segmenter to capture shading on smoothly varying regions is also problematic. Ideally, one would like the boundaries of regions created by shading to be clearly distinguishable from other regions, and to smoothly follow isocontours of the image. Instead, mean shift segmentation tends to produce patchy, jigsaw-puzzle-like regions (see Figure 10.1). If segmentation parameters are changed to create a coarser segmentation, the result collapses an entire area of gradient, leaving a number of small island regions of greater variation dotted throughout.

It also seems that, when available, fixation information itself should be able to provide an extra guide to the segmentation process. A fixation gives a fairly strong clue that some important fine scale feature exists in its vicinity, and this information should be of some benefit to segmentation. We have made some initial experiments addressing these issues. We begin by creating a segmentation using an alternate segmentation technique that tends to follow isocontours [Montanvert et al., 1991]. This method iteratively merges regions based on a class label derived from color information. We also use fixation data to locally control the color threshold used in the segmentation. The contrast threshold used to decide whether two regions should merge is calculated using the same contrast sensitivity model applied in our region and line drawing system. The result is a single segmentation with locally varying resolution: smaller regions are preserved where a viewer looked. This achieves a kind of abstraction very similar to the renderings in our colored line drawing style (see Figures 10.1 and 10.2). Note that shading region boundaries in these images follow much more natural curves.

Figure 10.2: Locally varying segmentation cannot replace a segmentation hierarchy. Another example of a locally varying segmentation controlled by a perceptual model (c), compared to a rendering from our line drawing system. Note fine detail in the brick preserved near the subject's head in (c). This is a consequence of the threshold varying continuously as a function of distance from the fixations on the face.

This technique has limitations as a form of abstraction. Because detail is determined locally and the threshold varies continuously, unimportant detail near important features is also preserved, like the bricks near the figure's head in Figure 10.2. Though currently this technique only creates a single segmentation, it could easily be extended to create a hierarchy that would allow us to modulate detail discontinuously. Even when creating a hierarchy, it may be useful to segment important areas more finely. Indeed, even if our goal is not abstraction but segmentation for its own sake, the locally varying resolution of such segmentations might be useful.

Various additional data could also be added to our segmentation. Items such as cars and people can be identified [Torralba et al., 2004], and these labels could be added to sets of regions in our representation. Knowing what an object is would aid more informed abstraction; this contrasts with our current system, which views everything as a collection of blobs. Such information could also be used in automatic attempts to modify the segmentation tree to better reflect object structure. If some set of regions at the finest level are identified as, say, a car, the segmentation hierarchy can be modified to ensure that these regions form a subtree that does not merge with the background until the whole car is reduced to one region.

This style of abstracted rendering has been extended to video [Wang et al., 2004].
However, abstraction was performed largely by hand selecting groups of 3D spacetime regions. More fully automatic methods would require a 3D analogue of our 2D hierarchical segmentation. Such a blob hierarchy could be created at fairly high computational expense by repeated mean shift with successively larger kernels. A careful
iterative merging of regions could potentially create similar results at much less cost.
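The fixation-modulated merge criterion described above can be sketched simply: near a fixation the contrast threshold for merging is low, so fine regions survive; far away it is high, so regions collapse together. The linear ramp, its constants, and the scalar stand-in for color contrast below are all hypothetical, not the contrast sensitivity model actually used.

```python
import math

def local_threshold(pos, fixations, floor=0.05, base=0.35, falloff=150.0):
    """Merging threshold at image position pos: low near a fixation so fine
    regions survive, rising to `base` with distance. The linear ramp and
    every constant here are hypothetical."""
    d = min(math.dist(pos, f) for f in fixations)
    return floor + (base - floor) * min(d / falloff, 1.0)

def should_merge(region_a, region_b, fixations):
    """Merge two adjacent regions when their contrast (a scalar stand-in
    for color difference) falls below the local threshold at their border."""
    (pos_a, color_a), (pos_b, color_b) = region_a, region_b
    mid = ((pos_a[0] + pos_b[0]) / 2, (pos_a[1] + pos_b[1]) / 2)
    return abs(color_a - color_b) < local_threshold(mid, fixations)

fixations = [(200, 200)]  # the viewer looked here
near = should_merge(((195, 200), 0.40), ((205, 200), 0.50), fixations)
far = should_merge(((600, 600), 0.40), ((605, 600), 0.50), fixations)
# near a fixation the pair stays distinct; far away the same pair merges
```

Iterating such pairwise decisions over an adjacency graph until no pair qualifies yields the single locally varying segmentation; recording the merge order would yield the hierarchy discussed above.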
Our edge representation is another interesting area for improvement. Edges in our current results are a weak point. They are detected at only one scale and tend to be very broken in appearance, with an excessive number of small textural edges scattered about. Very short edges are discarded, and this filtering makes fine detail impossible to capture with lines. In addition, because edges are detected at only one scale, we know nothing about the range of frequencies at which an edge exists, and so are pressed into the questionable decision of using edge length as a size measure.

Detecting edges at multiple scales is an obvious next step. There are several ways this might be done. One would be to create a hierarchical edge representation similar to our region hierarchy. Some work has been done on this problem: edges have been detected at multiple scales and correspondences made across scales to trace coarse scale edges to their fine scale causes [Hong et al., 1982]. Such approaches seek to achieve both of the normally conflicting goals of robust detection of features at all scales and fine localization of feature position. This work could be built on to represent all edges in an image as a collection of tree structures of connected edges.

A more modern and popular approach to multi-scale edges is scale selection. Only a single set of edges is detected, but the scale at which they are detected varies locally [Lindeberg, 1998]. Conceptually, one searches everywhere for edges at a range of scales and picks the scale with the maximal edge response. This approach does not consider the tracing of coarse scale features to fine scale, but this is not particularly necessary: ideally at least, features detected at coarse scales actually exist at a coarse scale, and finer scales in those locations will only contain noise. This approach provides a more complete, continuous set of edges.
Figure 10.3: A rendering from our line drawing system demonstrates how long but unimportant edges can be inappropriately emphasized. Also, prominent lower frequency edges, like creases in clothing, are detected in fragments and filtered out because edges are detected at only one scale.

It also provides important additional information: for each point on each edge there is a corresponding contrast and scale value. The availability of this information suggests the use of more interesting perceptual models in making decisions about edge inclusion.
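The scale selection idea can be sketched in one dimension: compute a gamma-normalized derivative-of-Gaussian response over a range of scales and keep the scale with the maximal response. The gamma value, the scale list, and the synthetic edges below are illustrative choices, not parameters of any particular system.

```python
import math

def edge_response(signal, x, sigma, gamma=0.5):
    """Gamma-normalized derivative-of-Gaussian response at position x.
    With gamma = 0.5 the response peaks when sigma matches the blur of
    the edge, a Lindeberg-style normalization sketched here in 1D."""
    radius = int(4 * sigma)
    resp = 0.0
    for k in range(-radius, radius + 1):
        if 0 <= x + k < len(signal):
            g = -k / (sigma ** 3 * math.sqrt(2 * math.pi))
            g *= math.exp(-k * k / (2.0 * sigma * sigma))
            resp += g * signal[x + k]
    return sigma ** gamma * abs(resp)

def select_scale(signal, x, sigmas):
    """Scale selection: keep the scale with the maximal normalized response."""
    return max(sigmas, key=lambda s: edge_response(signal, x, s))

def blurred_step(n, center, blur):
    """A unit step edge smoothed by a Gaussian of standard deviation `blur`."""
    return [0.5 * (1 + math.erf((i - center) / (blur * math.sqrt(2))))
            for i in range(n)]

sigmas = [1, 2, 3, 4, 5, 6, 8]
soft = blurred_step(101, 50, 4.0)    # a soft, shadow-like edge
sharp = blurred_step(101, 50, 0.5)   # a crisp object boundary
s_soft = select_scale(soft, 50, sigmas)    # selects a coarse scale
s_sharp = select_scale(sharp, 50, sigmas)  # selects the finest scale
```

The selected scale directly gives the frequency estimate discussed below, and the response at that scale gives the corresponding contrast measure.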
Like the representation of edges, decisions about edge inclusion are a weak point in our current approach. Currently, an acuity model uses edge length as a proxy for frequency. This succeeds in removing shorter edges in unimportant regions, but is poorly motivated perceptually and produces some unintuitive artifacts. This can be seen, for example, in Figure 10.3, where unimportant edges in the background are inappropriately included because of their great length. Edges in our system are in fact filtered not once but three times: first by the hysteresis threshold used in the original edge detection scheme, second by a global length threshold; only then are edges judged by our perceptual model.

If scale selection is used, we would have for each point on each edge a frequency estimate as well as a contrast measure at that scale. This would allow us to use a contrast sensitivity model to judge edge inclusion. A decision could be made at each point along each edge, or a single scale and contrast could be assigned to the whole of
each candidate edge, perhaps using the median value. As we currently do with regions, we could then plug frequency into the model and receive a contrast threshold that can be compared to the measured contrast along the edge.

Recall that in applying contrast sensitivity models to regions, some modifications were made to avoid the unintuitive effect of very large scale regions having lower visibilities. This was loosely justified by the properties of square wave gratings. For multi-scale edges the unmodified model makes sense: very coarse scale edges, such as the edge of a soft shadow, are in fact less visually prominent and would correctly be judged less worthy of inclusion than somewhat higher frequency edges of similar contrast. A model like this could take over all judgments about what constitutes a significant edge.

Such an approach could also be used to intuitively filter detected edges outside of NPR. Scale selection detects a very large number of low contrast, high frequency edges, and a variety of strength measures have been used to filter them out. A model like this would provide a perceptually motivated metric, as well as a way of creating locally varying thresholds based on viewer input. A complete approach to edge extraction and filtering would require higher level effects like grouping and completion, but perceptual metrics like this could be an interesting first step.

Similarly, there is room for improvement in perceptual models of region visibility. The next step is less clear here. Better psychophysical models of color contrast sensitivity could be applied if available. Better methods of measuring contrast between regions and their surround would also be useful. Our current approach takes into account only the mean color of each region, and it is easy to construct examples where this provides a poor measure of the distinguishability of two regions. A method that measured contrast using the color histograms of the two regions would likely be an improvement.
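The core judgment in these proposals, plugging a measured frequency into a contrast sensitivity model and comparing the resulting threshold to a measured contrast, can be sketched as follows. The Mannos-Sakrison CSF here is only a stand-in for whatever psychophysical model is actually adopted, and the peak threshold constant is hypothetical.

```python
import math

def csf(f):
    """Mannos-Sakrison contrast sensitivity function, normalized so that
    sensitivity peaks near 1 at roughly 8 cycles/degree. A stand-in for
    whatever psychophysical model the system actually adopts."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))

def contrast_threshold(f, peak_threshold=0.01):
    """Minimum visible contrast at spatial frequency f (cycles/degree).
    peak_threshold, the contrast needed at the most sensitive frequency,
    is a hypothetical constant."""
    return peak_threshold / max(csf(f), 1e-6)

def include_edge(contrast, frequency):
    """Keep an edge only if its measured contrast reaches the threshold
    at the frequency implied by its selected scale."""
    return contrast >= contrast_threshold(frequency)

# Two edges of equal contrast: a crisp boundary and a very soft shadow edge.
crisp = include_edge(0.05, 8.0)        # mid frequency: clearly visible
soft_shadow = include_edge(0.05, 0.2)  # very low frequency: below threshold
```

Because the CSF falls off at low frequencies, the unmodified model correctly penalizes the soft shadow edge while passing the crisp boundary of equal contrast. The same comparison applies to a boundary between two regions, given its contrast and frequency.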
Taking into account both the interior of the regions and the characteristics of their shared border [Lillestaeter, 1993] could also help distinguish object and shading boundaries. Alternatively, we could reduce region visibility to the visibility of the boundary between two regions. This could be done using the contrast and frequency of a best fitting edge along the border between them. Since we are measuring not the size of a region but the frequency of the boundary between two regions, an unmodified contrast sensitivity model is again appropriate: region boundaries due to slow shading changes would have appropriately low contrasts, low frequencies, and therefore low visibility. In another alternative, the whole range of frequencies and corresponding contrasts present on the border could be examined, and visibility could be based on the most visible among them. This kind of perceptually driven scale selection might produce some interesting effects.

All of these additions and modifications could lead to more interesting and expressive imagery. We have shown that the abstraction embodied in these images can communicate what a viewer found important and provide an effective guide to future viewers. A component that remains missing from our argument that this methodology will be useful in visualization is a demonstration of its benefits in a practical task.
The presence of similar abstraction in many technical and practical illustrations encourages us that there are many applications of these techniques in visualization and illustration. A practical problem has been choosing a domain in which to test our method. Our approach lends itself to illustrative rather than exploratory applications, since the methodology requires that someone know what is important, so that their fixations can be used to clarify the point for successive viewers. The domains where this might be most useful present some challenges for our current image analysis. Medical images, for example, tend to be low contrast, noisy, and difficult to segment with general-purpose segmentation techniques. Photographs of technical apparatus, such as a car engine (see Figure 10.4), present their own difficulties. Though clean man-made edges are generally easier to segment, these images are very crowded and often poorly lit. In these circumstances, segmentations fail to respect object structure in a way that can be confusing. Extra sources of information, such as sets of photos taken with flashes in different locations, have been used to ease image analysis in situations like this [Raskar et al., 2004]. Despite these technical challenges, we feel confident that these methods of abstraction will be useful for illustration in a number of domains.

Figure 10.4: Attempting technical illustration of mechanical parts pushes our image analysis techniques close to (if not over) their limits.

These applications are not limited to photo abstraction. Similar kinds of abstraction can be performed in 3D scenes. This removes the difficulties of image analysis, though it presents a number of new challenges. Beyond textural indication in line drawings, abstraction in 3D scenes has received relatively little attention. Perceptual metrics like those we present could provide an interesting basis for a general framework of 3D abstraction.
Chapter 11 Conclusion
Our goal was to create images that capture some of the expressive omission of art. Several kinds of such images have been presented. These methods are motivated by artistic practice and current models of human visual perception. Such images have been experimentally shown to create a difference in the way viewers look at images. This suggests our method has the ability to direct a viewer's gaze, or at least focus interest in particular areas. We therefore believe that these techniques are applicable not only to art but also to wider problems of graphical illustration and visualization.

Rather than just a test of our system, our experiments can be seen as empirical validation, on controlled stimuli, of the general idea that artists direct viewers' gaze through detail modulation. Our success also provides an experimental confirmation of sorts for the hypothesis [Zeki, 1999] that at least part of the appeal of great art lies in the artist's careful control of detail, enticing the viewer with information while not overwhelming them with irrelevant detail. This balance serves to engage the viewer, leaving them free to ponder an image's meaning without the burden of having to decipher its contents.

Detail modulation in illustration and art is a complex topic which we have only begun to investigate. The work presented here has already inspired related approaches from other researchers to problems in cropping [Suh et al., 2003] and fluid visualization [Watanabe et al., 2004]. Detail modulation is only part of visual artistry, one of the many techniques available. Color, contrast, shape, and a host of higher level concerns are manipulated in art and play a part in well designed images. All of these techniques have some cognitive motivation. Understanding this perceptual basis is an important guide in creating effective automatic instantiations of these techniques.
Continuing investigation of the role and functioning of abstraction in its many forms, especially through building new quantitative models, should yield new ways to create
easily understood illustration. The work presented here suggests a number of general insights to guide future investigation:

• An understanding of the cognitive processing involved in human understanding of an image or stimulus is important for effective stylized and abstracted illustration.

• The importance of some user input in our system highlights the fact that current automatic techniques cannot replace the semantic knowledge of a human viewer. In the domain of general images, people can perform abstraction but, so far, computers cannot.

• The fact that eye tracking is sufficient for some level of abstraction in our context makes an interesting point. It suggests that the understanding underlying abstraction, and perhaps other artistic judgments, is not some mysterious ability of a visionary few, but a basic visual competence. Though not everyone can draw, everyone, it seems, can control abstraction in a computer rendering.

• Eye tracking is a useful tool in this context, not only as a minimal form of interaction, but also as a cognitive measure for evaluation and for understanding what features are attended and hence may be critical to processing.

• In a perceptually motivated framework, experimentation is useful not only to evaluate or validate a final system, but also to investigate and, if possible, build quantitative models of perception as it relates to questions of interest. Work in psychology and cognitive science can provide a framework for understanding a problem, as well as general methodologies. Sometimes models applicable to a specific problem are also available. However, these do not always address questions in the way most useful to those building applied systems. This creates a need for a cyclical process of cognitive investigation and system engineering to build more effective systems for visual communication.

These considerations suggest a future path for research in NPR that diverges somewhat from traditional areas of investigation, but holds the promise of a consistent intellectual underpinning for an expanding field, as well, of course, as the promise of more expressive and perhaps ultimately artistic computer generated imagery.
[Adelson, 2001] Adelson, E. H. (2001). On seeing stuff: the perception of materials by humans and machines. Proceedings of the SPIE, 4299:1–12.
[Agrawala et al., 2003] Agrawala, M., Phan, D., Heiser, J., Haymaker, J., Klingner, J., Hanrahan, P., and Tversky, B. (2003). Designing effective step-by-step assembly instructions. In Proceedings of ACM SIGGRAPH 2003, pages 828–837.
[Agrawala and Stolte, 2001] Agrawala, M. and Stolte, C. (2001). Rendering effective route maps: improving usability through generalization. In Proceedings of ACM SIGGRAPH 2001, pages 241–249.
[Ahuja, 1996] Ahuja, N. (1996). A transform for multiscale image segmentation by integrated edge and region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12):1211–1235.
[Arnheim, 1988] Arnheim, R. (1988). The Power of the Center. University of California Press.
[Bangham et al., 1998] Bangham, J., Hidalgo, J. R., Harvey, R., and Cawley, G. (1998). The segmentation of images via scale-space trees. In Proceedings of the British Machine Vision Conference, pages 33–43.
[Baxter et al., 2001] Baxter, B., Scheib, V., and Lin, M. (2001). Dab: interactive haptic painting with 3D virtual brushes. In Proceedings of ACM SIGGRAPH 2001, pages 403–421.
[Campbell and Robson, 1968] Campbell, F. and Robson, J. (1968). Application of Fourier analysis to the visibility of gratings. Journal of Physiology, 197:551–566.
[Cater et al., 2003] Cater, K., Chalmers, A., and Ward, G. (2003). Detail to attention: exploiting visual tasks for selective rendering. In Proceedings of the Eurographics Symposium on Rendering, pages 270–280.
[Chen et al., 2002] Chen, L., Xie, X., Fan, X., Ma, W., Shang, H., and Zhou, H. (2002). A visual attention model for adapting images on small displays. Technical Report MSR-TR-2002-125, Microsoft Research, Redmond, WA.
[Christoudias et al., 2002] Christoudias, C., Georgescu, B., and Meer, P. (2002). Synergism in low level vision. In Proceedings of ICPR 2002, pages 150–155.
[Collomosse and Hall, 2003] Collomosse, J. P. and Hall, P. M. (2003). Genetic painting: a salience adaptive relaxation technique for painterly rendering. Technical Report CSBU2003-02, Dept. of Computer Science, University of Bath.
[Crowe and Narayanan, 2000] Crowe, E. C. and Narayanan, N. H. (2000). Comparing interfaces based on what users watch and do. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2000, pages 29–36.
[Curtis et al., 1997] Curtis, C. J., Anderson, S. E., Seims, J. E., Fleischer, K. W., and Salesin, D. H. (1997). Computer-generated watercolor. In Proceedings of ACM SIGGRAPH 97, pages 421–430.
[DeCarlo et al., 2003] DeCarlo, D., Finkelstein, A., Rusinkiewicz, S., and Santella, A. (2003). Suggestive contours for conveying shape. In Proceedings of ACM SIGGRAPH 2003.
[DeCarlo and Santella, 2002] DeCarlo, D. and Santella, A. (2002). Stylization and abstraction of photographs. In Proceedings of ACM SIGGRAPH 2002, pages 769–776.
[Deussen and Strothotte, 2000] Deussen, O. and Strothotte, T. (2000). Computer-generated pen-and-ink illustration of trees. In Proceedings of ACM SIGGRAPH 2000, pages 13–18.
[Duchowski, 2000] Duchowski, A. (2000). Acuity-matching resolution degradation through wavelet coefficient scaling. IEEE Transactions on Image Processing, 9(8):1437–1440.
[Durand et al., 2001] Durand, F., Ostromoukhov, V., Miller, M., Duranleau, F., and Dorsey, J. (2001). Decoupling strokes and high-level attributes for interactive traditional drawing. In Proceedings of the 12th Eurographics Workshop on Rendering, pages 71–82.
[Fleming et al., 2003] Fleming, R. W., Dror, R. O., and Adelson, E. H. (2003). Real-world illumination and the perception of surface reflectance properties. Journal of Vision, 3:347–368.
[Goldberg et al., 2002] Goldberg, J. H., Stimson, M. J., Lewenstein, M., Scott, N., and Wichansky, A. M. (2002). Eye tracking in web search tasks: design implications. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2002, pages 51–58.
[Gombrich et al., 1970] Gombrich, E. H., Hochberg, J., and Black, M. (1970). Art, Perception, and Reality. Johns Hopkins University Press.
[Gooch and Willemsen, 2002] Gooch, A. A. and Willemsen, P. (2002). Evaluating space perception in NPR immersive environments. In Proceedings of the Second International Symposium on Non-photorealistic Animation and Rendering (NPAR), pages 105–110.
[Gooch and Gooch, 2001] Gooch, B. and Gooch, A. (2001). Non-Photorealistic Rendering. A K Peters.
[Gooch et al., 2004] Gooch, B., Reinhard, E., and Gooch, A. (2004). Human facial illustration: Creation and psychophysical evaluation. ACM Transactions on Graphics, 23:27–44. [Grabli et al., 2004] Grabli, S., Durand, F., and Sillion, F. (2004). Density measure for line-drawing simpliﬁcation. In Proceedings of Paciﬁc Graphics. [Graham, 1970] Graham, D. (1970). Composing Pictures. Van Nostrand Reinhold. [Haeberli, 1990] Haeberli, P. (1990). Paint by numbers: Abstract image representations. In Proceedings of ACM SIGGRAPH 90, pages 207–214. [Hays and Essa, 2004] Hays, J. H. and Essa, I. (2004). Image and video-based painterly animation. In Proceedings of the Third International Symposium on Nonphotorealistic Animation and Rendering (NPAR), pages 113–120. [Heiser et al., 2004] Heiser, J., Phan, D., Agrawala, M., Tversky, B., and Hanrahan, P. (2004). Identiﬁcation and validation of cognitive design principles for automated generation of assembly instructions. In Advanced Visual Interfaces, pages 311–319. [Henderson and Hollingworth, 1998] Henderson, J. M. and Hollingworth, A. (1998). Eye movements during scene viewing: An overview. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, pages 269–293. Elsevier Science Ltd. [Hertzmann, 1998] Hertzmann, A. (1998). Painterly rendering with curved brush strokes of multiple sizes. In Proceedings of ACM SIGGRAPH 98, pages 453–460. [Hertzmann, 2001] Hertzmann, A. (2001). Paint by relaxation. In Computer Graphics International, pages 47–54. [Hong et al., 1982] Hong, T.-H., Shneier, M., and Rosenfeld, A. (1982). Border extraction using linked edge pyramids. IEEE Transactions on Systems, Man and Cybernetics, 12:660–668. [Interrante, 1996] Interrante, V. (1996). Illustrating Transparency: communicating the 3D shape of layered transparent surfaces via texture. PhD thesis, University of North Carolina. [Itti and Koch, 2000] Itti, L. and Koch, C. (2000). 
A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489–1506. [Itti et al., 1998] Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1254–1259. [Jacob, 1993] Jacob, R. J. (1993). Eye-movement-based human-computer interaction techniques: Toward non-command interfaces. In Hartson, H. and Hix, D., editors, Advances in Human-Computer Interaction, Volume 4, pages 151–190. Ablex Publishing.
[Just and Carpenter, 1976] Just, M. A. and Carpenter, P. A. (1976). Eye fixations and cognitive processes. Cognitive Psychology, 8:441–480.
[Kalnins et al., 2002] Kalnins, R. D., Markosian, L., Meier, B. J., Kowalski, M. A., Lee, J. C., Davidson, P. L., Webb, M., Hughes, J. F., and Finkelstein, A. (2002). WYSIWYG NPR: Drawing strokes directly on 3D models. In Proceedings of ACM SIGGRAPH 2002, pages 755–762.
[Kelly, 1984] Kelly, D. (1984). Retinal inhomogeneity: I. Spatiotemporal contrast sensitivity. Journal of the Optical Society of America A, 74(1):107–113.
[Koenderink and van Doorn, 1979] Koenderink, J. and van Doorn, A. (1979). The structure of two dimensional scalar fields with applications to vision. Biological Cybernetics, 30:151–158.
[Koenderink, 1984] Koenderink, J. J. (1984). What does the occluding contour tell us about solid shape? Perception, 13:321–330.
[Koenderink et al., 1978] Koenderink, J. J., Bouman, M. A., Bueno de Mesquita, A. E., and Slappendel, S. (1978). Perimetry of contrast detection thresholds of moving spatial sine wave patterns. II. The far peripheral visual field (eccentricity 0–50). Journal of the Optical Society of America A, 68(6):850–854.
[Koenderink and van Doorn, 1999] Koenderink, J. J. and van Doorn, A. (1999). The structure of locally orderless images. International Journal of Computer Vision, 31(2/3):159–168.
[Kosara et al., 2003] Kosara, R., Healey, C., Interrante, V., Laidlaw, D., and Ware, C. (2003). User studies: Why, how and when? IEEE Computer Graphics and Applications, 23(4):20–25.
[Kowalski et al., 1999] Kowalski, M. A., Markosian, L., Northrup, J. D., Bourdev, L., Barzel, R., Holden, L. S., and Hughes, J. (1999). Art-based rendering of fur, grass, and trees. In Proceedings of ACM SIGGRAPH 99, pages 433–438.
[Kowler, 1990] Kowler, E. (1990). The role of visual and cognitive processes in the control of eye movements. In Kowler, E., editor, Eye Movements and Their Role in Visual and Cognitive Processes, pages 1–70. Elsevier Science Ltd.
[Land et al., 1999] Land, M., Mennie, N., and Rusted, J. (1999). The roles of vision and eye movements in the control of activities of daily living. Perception, 28:1311–1328.
[Leyton, 1992] Leyton, M. (1992). Symmetry, Causality, Mind. MIT Press.
[Lillestaeter, 1993] Lillestaeter, O. (1993). Complex contrast, a definition for structured targets and backgrounds. Journal of the Optical Society of America, 10(12):2453–2457.
[Lindeberg, 1998] Lindeberg, T. (1998). Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision, 30(2):117–154.
[Litwinowicz, 1997] Litwinowicz, P. (1997). Processing images and video for an impressionist effect. In Proceedings of ACM SIGGRAPH 97, pages 407–414.
[Locher, 1996] Locher, P. J. (1996). The contribution of eye-movement research to an understanding of the nature of pictorial balance perception: a review of the literature. Empirical Studies of the Arts, 14(2):146–163.
[Mackworth and Morandi, 1967] Mackworth, N. and Morandi, A. (1967). The gaze selects informative details within pictures. Perception and Psychophysics, 2:547–552.
[Mannos and Sakrison, 1974] Mannos, J. L. and Sakrison, D. J. (1974). The effects of a visual fidelity criterion on the encoding of images. IEEE Transactions on Information Theory, 20(4):525–536.
[Markosian et al., 1997] Markosian, L., Kowalski, M. A., Trychin, S. J., Bourdev, L. D., Goldstein, D., and Hughes, J. F. (1997). Real-time nonphotorealistic rendering. In Proceedings of ACM SIGGRAPH 97, pages 415–420.
[Marr, 1982] Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman, San Francisco.
[Meer and Georgescu, 2001] Meer, P. and Georgescu, B. (2001). Edge detection with embedded confidence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12):1351–1365.
[Mello-Thoms et al., 2002] Mello-Thoms, C., Nodine, C. F., and Kundel, H. L. (2002). What attracts the eye to the location of missed and reported breast cancers? In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2002, pages 111–117.
[Montanvert et al., 1991] Montanvert, A., Meer, P., and Rosenfeld, A. (1991). Hierarchical image analysis using irregular tessellations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):307–316.
[Mulligan, 2002] Mulligan, J. B. (2002). A software-based eye tracking system for the study of air traffic displays. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2002, pages 69–76.
[Niessen, 1997] Niessen, W. (1997). Nonlinear multiscale representations for image segmentation. Computer Vision and Image Understanding, 66(2):233–245.
[Parkhurst et al., 2002] Parkhurst, D., Law, K., and Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42:107–123.
[Perona and Malik, 1990] Perona, P. and Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639.
[Privitera and Stark, 2000] Privitera, C. M. and Stark, L. W. (2000). Algorithms for defining visual regions-of-interest: Comparison with eye fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):970–982.
[Ramachandran and Hirstein, 1999] Ramachandran, V. S. and Hirstein, W. (1999). The science of art. Journal of Consciousness Studies, 6(6-7).
[Raskar et al., 2004] Raskar, R., Tan, K.-H., Feris, R., Yu, J., and Turk, M. (2004). Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. In Proceedings of ACM SIGGRAPH 2004, pages 679–688.
[Reddy, 1997] Reddy, M. (1997). Perceptually Modulated Level of Detail for Virtual Environments. PhD thesis, University of Edinburgh.
[Reddy, 2001] Reddy, M. (2001). Perceptually optimized 3D graphics. IEEE Computer Graphics and Applications, 21(5):68–75.
[Regan, 2000] Regan, D. (2000). Human Perception of Objects: Early Visual Processing of Spatial Form Defined by Luminance, Color, Texture, Motion and Binocular Disparity. Sinauer.
[Rosenholtz, 1999] Rosenholtz, R. (1999). A simple saliency model predicts a number of motion popout phenomena. Vision Research, 39:3157–3163.
[Rosenholtz, 2001] Rosenholtz, R. (2001). Search asymmetries? What search asymmetries? Perception and Psychophysics, 63:476–489.
[Rovamo and Virsu, 1979] Rovamo, J. and Virsu, V. (1979). An estimation and application of the human cortical magnification factor. Experimental Brain Research, 37:495–510.
[Ruskin, 1857] Ruskin, J. (1857). The Elements of Drawing. Smith, Elder and Co.
[Ruskin, 1858] Ruskin, J. (1858). Address at the opening of the Cambridge school of art.
[Ryan and Schwartz, 1956] Ryan, T. A. and Schwartz, C. B. (1956). Speed of perception as a function of mode of representation. American Journal of Psychology, pages 60–69.
[Saito and Takahashi, 1990] Saito, T. and Takahashi, T. (1990). Comprehensible rendering of 3-D shapes. In Proceedings of ACM SIGGRAPH 90, pages 197–206.
[Salisbury et al., 1994] Salisbury, M. P., Anderson, S. E., Barzel, R., and Salesin, D. H. (1994). Interactive pen-and-ink illustration. In Proceedings of ACM SIGGRAPH 94, pages 101–108.
[Salvucci and Anderson, 2001] Salvucci, D. and Anderson, J. (2001). Automated eye-movement protocol analysis. Human-Computer Interaction, 16:39–86.
[Santella and DeCarlo, 2002] Santella, A. and DeCarlo, D. (2002). Abstracted painterly renderings using eye-tracking data. In Proceedings of the Second International Symposium on Non-Photorealistic Animation and Rendering (NPAR), pages 75–82.
[Santella and DeCarlo, 2004a] Santella, A. and DeCarlo, D. (2004a). Eye tracking and visual interest: An evaluation and manifesto. In Proceedings of the Third International Symposium on Non-Photorealistic Animation and Rendering (NPAR), pages 71–78.
[Santella and DeCarlo, 2004b] Santella, A. and DeCarlo, D. (2004b). Robust clustering of eye movement recordings for quantification of visual interest. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2004.
[Schumann et al., 1996] Schumann, J., Strothotte, T., and Laser, S. (1996). Assessing the effect of non-photorealistic rendering images in computer-aided design. In ACM Human Factors in Computing Systems, SIGCHI, pages 35–41.
[Setlur et al., 2004] Setlur, V., Takagi, S., Raskar, R., Gleicher, M., and Gooch, B. (2004).
[Shapiro and Stockman, 2001] Shapiro, L. and Stockman, G. (2001). Computer Vision. Prentice-Hall.
[Shiraishi and Yamaguchi, 2000] Shiraishi, M. and Yamaguchi, Y. (2000). An algorithm for automatic painterly rendering based on local source image approximation. In Proceedings of the First International Symposium on Non-Photorealistic Animation and Rendering (NPAR), pages 53–58.
[Sibert and Jacob, 2000] Sibert, L. E. and Jacob, R. J. K. (2000). Evaluation of eye gaze interaction. In Proceedings CHI 2000, pages 281–288.
[Suh et al., 2003] Suh, B., Ling, H., Bederson, B. B., and Jacobs, D. W. (2003). Automatic thumbnail cropping and its effectiveness. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST 2003), pages 95–104.
[Torralba et al., 2004] Torralba, A., Murphy, K., and Freeman, W. (2004). Contextual models for object detection using boosted random fields. In Advances in Neural Information Processing Systems.
[Tufte, 1990] Tufte, E. R. (1990). Envisioning Information. Graphics Press.
[Turano et al., 2003] Turano, K. A., Geruschat, D. R., and Baker, F. H. (2003). Oculomotor strategies for the direction of gaze tested with a real-world activity. Vision Research, 43:333–346.
[Underwood and Radach, 1998] Underwood, G. and Radach, R. (1998). Eye guidance and visual information processing: Reading, visual search, picture perception and driving. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, pages 1–27. Elsevier Science Ltd.
[Vertegaal, 1999] Vertegaal, R. (1999). The GAZE groupware system: Mediating joint attention in multiparty communication and collaboration. In Proceedings CHI '99, pages 294–301.
[Walker et al., 1998] Walker, K. N., Cootes, T. F., and Taylor, C. J. (1998). Locating salient object features. In Proceedings BMVC, 2:557–567.
[Wandell, 1995] Wandell, B. A. (1995). Foundations of Vision. Sinauer Associates Inc.
[Wang et al., 2004] Wang, J., Xu, Y., Shum, H.-Y., and Cohen, M. (2004). Video tooning. In Proceedings of ACM SIGGRAPH 2004, pages 574–583.
[Watanabe et al., 2004] Watanabe, D., Mao, X., Ono, K., and Imamiya, A. (2004). Gaze-directed streamline seeding. In APGV 2004.
[Winkenbach and Salesin, 1994] Winkenbach, G. and Salesin, D. H. (1994). Computer-generated pen-and-ink illustration. In Proceedings of ACM SIGGRAPH 94, pages 91–100.
[Witkin, 1983] Witkin, A. (1983). Scale-space filtering. pages 1019–1021.
[Wooding, 2002] Wooding, D. S. (2002). Fixation maps: quantifying eye-movement traces. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2002, pages 31–36.
[Yarbus, 1967] Yarbus, A. L. (1967). Eye Movements and Vision. Plenum Press.
[Yee et al., 2001] Yee, H., Pattanaik, S. N., and Greenberg, D. P. (2001). Spatiotemporal sensitivity and visual attention in dynamic environments. ACM Transactions on Graphics, 29:39–65.
[Zeki, 1999] Zeki, S. (1999). Inner Vision: An Exploration of Art and the Brain. Oxford University Press.
2005 Ph.D. in Computer Science, Certificate in Cognitive Science from Rutgers University
1999 B.A. in Computer Science from New York University
2001-2004 Research Assistant, The VILLAGE, Department of Computer Science, Rutgers University
1999-2001 Teaching Assistant, Department of Computer Science, Rutgers University

Publications

A. Santella and D. DeCarlo, "Visual Interest and NPR: an Evaluation and Manifesto". In Proceedings of the Third International Symposium on Non-Photorealistic Animation and Rendering (NPAR) 2004, pp. 71-78
A. Santella and D. DeCarlo, "Robust Clustering of Eye Movement Recordings for Quantification of Visual Interest". In Proceedings of the Third Eye Tracking Research and Applications Symposium (ETRA) 2004, pp. 27-34
D. DeCarlo, A. Finkelstein, S. Rusinkiewicz and A. Santella, "Suggestive Contours for Conveying Shape". In ACM Transactions on Graphics, 22(3) (SIGGRAPH 2003 Proceedings), pp. 848-855
D. DeCarlo and A. Santella, "Stylization and Abstraction of Photographs". In ACM Transactions on Graphics, 21(3) (SIGGRAPH 2002 Proceedings), pp. 769-776
A. Santella and D. DeCarlo, "Abstracted Painterly Renderings Using Eye-tracking Data". In Proceedings of the Second International Symposium on Non-Photorealistic Animation and Rendering (NPAR) 2002, pp. 75-82