is needed. A solution is required that is closer, in concept, to re-editing the movie for the new target display.

Because it is usually impossible to reshoot scenes, change the shooting angle, or add new shots, our challenge is to re-edit the original footage to fit a different display without losing significant content. We employ the three primary editing operations: pans, cuts, and zooms. Panning is used to show different parts of the underlying scene on the screen. A cut allows for a quick shift from one part of the scene to another, for example when two people are talking. Cropping the widescreen footage gives the effect of "zoom", i.e., moving closer to the action. Our method re-edits the given video footage by automatically combining pans, cuts, and zooms. Examples are shown in Figure 1.

This approach of re-editing, rather than re-scaling, has the advantage that it guarantees that there will be no distortion in the resulting video. Because the editing operations only remove contiguous portions of the original frame, all scene structure is preserved: lines remain straight and people are not made tall and skinny. At the same time, because this approach makes a hard commitment about what is and is not included in the frame, it depends on reliably determining which parts of the original widescreen frame are integral to the narrative in the original video. We allow the viewers to reveal what is important to the narrative by tracking their gaze.

Artists employ various devices to highlight the regions of the video that are integral to the narrative, including color (the lady wears a striking red dress), motion (the rebel walks opposite to the crowd), semantics (the witness points to the gun), and audio (dialogue or sound effects). These devices attract the viewer's gaze through a combination of bottom-up influences, such as color, motion, or the presence of a human face, and top-down influences, such as semantics and narrative. Current saliency algorithms for images and videos primarily capture bottom-up influences and, as a result, often misfire. We expect that research will eventually yield new algorithms that incorporate top-down influences to better predict visual saliency. Rather than waiting for these algorithms, we directly use measurements of the viewers' gaze as input data.

We use recorded gaze data from several human participants as the driving input to a RANSAC algorithm that finds pans, cuts, and zooms while retaining the regions attended to by the participants in the original video. These editing operations mimic the approach of studio professionals when a film shot in widescreen is fit by hand to smaller aspect ratios. The algorithm is able to handle noise in the eyetracking data (a result of individual variation in the gaze patterns across participants, and of measurement error). We use RANSAC to search for a path that moves a cropping window through the video cube while maximizing the number of gaze points included. When cuts are required to produce a video with adequate coverage of the eyetracking data, the algorithm simultaneously finds two panning paths and an optimal cut between them. The size of the cropping window is determined from the spread of the gaze samples. The change in size creates the effect of 'zoom' in the resulting video.

To evaluate our algorithm, we eyetrack viewers on our results and compare this data to the gaze data captured on the original widescreen videos. For a re-edit to be faithful, viewers must have attended to the same regions in the result as in the original video. This measure allows the evaluation to be performed behaviorally. In our experiments, we find that viewer eye movements are similar before and after the re-edits. We also perform a subjective evaluation of our results by asking users to provide their preferences.

Fig. 2: Commonly used methods to resize video: (a) Original frame (1.75:1), (b) Scaling (1:1) squeezes the objects and characters, (c) Cropping (1:1) removes content, (d) Letterboxing (1:1) wastes screen space.

Contributions: Our main contribution is a method for re-editing video to fit a non-native aspect ratio using pans, cuts, and zooms on the original widescreen video. We present a RANSAC-based algorithm that fits a curve to viewers' temporal gaze data. This curve represents the center of the output window, guiding the selection and combination of the re-editing operators (pans and cuts). The size of the window guides the zoom operator. We evaluate our method through a subjective two-alternative forced choice test of viewer preferences, and through a comparison of viewer gaze on pre- and post-edited video. Viewer preferences indicate that, amongst re-editing algorithms, our gaze-driven results are preferred to an optimized crop-and-warp method. When viewers are presented with a letterboxed version, they prefer it to a cropped video.

2. RELATED WORK

Resizing a video to fit a non-native aspect ratio is a controversial task. Cinema enthusiasts often prefer letterboxing, which places a black matte around the original video (Figure 2(d)). Though this method preserves the shape of the original video, it wastes precious screen space, especially on small displays. As a result, there have been several proposed solutions to automatically create non-letterboxed versions of widescreen films. Linear scaling (Figure 2(b)) squeezes or stretches the video to match the size of the new display device, thereby changing the shapes of objects in the video. Center-cropping (Figure 2(c)) trims the original video to the desired size, which can result in the 'talking noses' artifact.

As a result of these difficulties with simple automatic solutions, skilled editors are called upon to create a "pan and scan" version of a widescreen video. The editor determines the region of the screen that is important to the story for each frame of the video, and moves the cropping window to that region [Wikipedia]. As the "important region" moves across the screen, the cropping window pans across or zooms in and out. The editor could also [...]
Fig. 3: Viewer gaze is plotted, and an example frame is shown from the [...]
3.2.2 Fitting the cropping window path to data. Recorded gaze data consists of noisy measurements of 'what is important' to a viewer because it is subject to individual idiosyncrasies and measurement error. We fit a nonuniform cubic B-spline robustly to this data with a RANSAC algorithm [Fischler and Bolles 1981]. The trial set τ for each RANSAC iteration comprises four randomly selected gaze points. This selection is done by picking [...]

The consensus set η is the set of gaze points x̃ij that lie within a fraction K2 of the cropping window width w of the window center xi:

η = { x̃ij : |x̃ij − xi| < K2 w }.  (6)

The score of the associated trial is |η|. The parameter K2 is set to 1/3 to compute the number of gaze points enclosed within the thirds guides of the cropping window. The curve with the highest score is selected as the cropping window path. The result of this algorithm for an example video is shown in Figure 5. The green and purple curves are two possible pans.

Fig. 5: The x-coordinate of the recorded gaze data is plotted for each frame of the 'couple at the waterfront' video shown in Figure 1. The green and purple curves are the nonuniform B-splines computed by our algorithm corresponding to the two cropping window paths. The black dotted line is the resulting cropping window path after a cut is introduced. The gray band is the original wide screen, the light green and purple bands are the cropping windows at the given aspect ratio (1:1), and the darker green and purple bands represent the region used.
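To make the procedure concrete, the following is a minimal sketch of this RANSAC loop in Python with NumPy and SciPy. The data layout (parallel arrays of frame indices and gaze x-coordinates) and the function name are our own assumptions rather than the authors' implementation; K2 = 1/3 follows the text.

```python
import numpy as np
from scipy.interpolate import splrep, splev

def fit_window_path(frames, gaze_x, window_w, n_trials=1000, k2=1/3):
    """RANSAC fit of a cubic B-spline to noisy gaze x-coordinates.

    frames   : (n,) frame index of each recorded gaze sample
    gaze_x   : (n,) x-coordinate of each recorded gaze sample
    window_w : width of the cropping window in pixels
    Returns the window-center path evaluated at every frame.
    """
    all_t = np.arange(frames.min(), frames.max() + 1)
    rng = np.random.default_rng()
    best_score, best_path = -1, None
    for _ in range(n_trials):
        # Trial set tau: four random gaze points (a cubic spline needs four).
        idx = rng.choice(len(frames), size=4, replace=False)
        order = np.argsort(frames[idx])
        t, x = frames[idx][order], gaze_x[idx][order]
        if len(np.unique(t)) < 4:
            continue                      # degenerate trial: repeated frames
        tck = splrep(t, x, k=3, s=0)      # interpolating cubic B-spline
        # Consensus set eta (Eq. 6): gaze points within the thirds guides,
        # i.e. within K2 * window width of the window center at their frame.
        center = splev(frames, tck)
        score = np.sum(np.abs(gaze_x - center) < k2 * window_w)
        if score > best_score:
            best_score = score
            best_path = splev(all_t, tck)
    return best_path
```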
3.3 Cuts

When viewer attention shifts quickly across the original widescreen video, a cut may be preferable to a fast-moving cropping window; our method allows the introduction of a cut based on recorded gaze data. We fit two spline curves Ax and Bx to randomly selected trial sets τA and τB. A cut is generated by switching from Ax to Bx based on viewer attention shifts. In practice, we find that cutting more than once in a shot from a professionally edited movie is unnecessary because the shots are tightly edited to begin with.

A shift in viewer attention is computed from the change in median gaze position across consecutive frames. Candidates for cuts are posited when the shift in the median is above a threshold value and the two cropping windows A and B are sufficiently far apart (to avoid a 'jump' cut, a cut that appears jarring to the viewer because the scenes before and after the cut are too similar [Dmytryk 1984]):

|med(x̃i+1) − med(x̃i)| > K3,  (7)
|Axi − Bxi| > K4,  (8)

where the median is taken over the γi gaze points recorded for frame i. Each candidate cut κ is ranked by the number of gaze points enclosed by the resulting cropping window path. Let xi^final denote the cropping window path for the cut κ:

xi^final = Axi  if i ≤ κ,
           Bxi  otherwise.  (9)

Then, the consensus set η is computed as in Equation 6, and the score of the associated cut is |η|. If the candidate cut with the highest score κ* encloses more gaze samples than either one of the individual cropping windows, the corresponding resulting path x^final* is the path selected for this RANSAC iteration. In Figure 5, the final path is shown as a black and white dotted line.
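A sketch of the cut test, under the same assumed data layout as before: per-frame median gaze positions are differenced across consecutive frames (Equation 7), candidates are kept only where the two candidate paths are sufficiently separated (Equation 8), and the final path switches from Ax to Bx at the cut (Equation 9). The threshold values K3 and K4 below are illustrative placeholders; the paper's values are not given in this excerpt.

```python
import numpy as np

def assemble_cut_path(median_x, path_a, path_b, k3=40.0, k4=80.0):
    """Propose cut frames and build the pieced-together window path.

    median_x : (T,) per-frame median of the recorded gaze x-positions
    path_a   : (T,) candidate cropping-window path Ax
    path_b   : (T,) candidate cropping-window path Bx
    """
    shift = np.abs(np.diff(median_x))        # Eq. 7: attention shift per frame
    separation = np.abs(path_a - path_b)     # Eq. 8: distance between windows
    candidates = [i for i in range(len(shift))
                  if shift[i] > k3 and separation[i] > k4]
    paths = []
    for kappa in candidates:
        # Eq. 9: follow Ax up to the cut kappa, Bx afterwards.
        final = np.where(np.arange(len(path_a)) <= kappa, path_a, path_b)
        paths.append((kappa, final))
    # Each candidate path would then be scored by its consensus set (Eq. 6);
    # the cut is kept only if it beats both single-window paths.
    return paths
```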
3.4 Zoom

Reducing the size of the cropping window gives the effect of "zoom" by making the scene look bigger. We compute the change in cropping window size from the spread of the gaze data. Intuitively, when the gaze samples are clustered tightly, a smaller cropping window can capture the region underneath the gaze samples.

We measure the spread of gaze data through the standard deviation σi of the gaze points x̃ij for the ith frame. The ratio ρi is computed as

ρi = σi / w,  (10)

where w is the width of the original widescreen frame.

We fit a nonuniform piecewise B-spline, denoted by q, to the samples ρi with a RANSAC algorithm similar to Section 3.2.2. The consensus set η for each RANSAC trial is the set of sample points that are within a given distance K5 of the spline curve,

η = { ρi : |ρi − qi| < K5 },  (11)

where K5 = 0.2. The score of the associated trial is |η|. Let the curve with the highest score be q*. Then this curve is transformed linearly (scaled and shifted):

qi** = K6 (qi* − maxi qi*) + 1,  (12)

where K6 = 0.5. This transformation allows us to modulate the extent of the zoom effect by changing the parameter K6. Decreasing the value causes the zoom to become less pronounced. The shift up causes the maximum window size to be equal to the maximum possible size that will fit inside the original widescreen frame. The zoom effect is then created by scaling the size of the cropping window by qi** before rendering.
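The zoom computation admits a similarly small sketch. Note two assumptions on our part: we read Equation 10 as the per-frame gaze spread normalized by the original frame width, and Equation 12 as the linear map that scales by K6 and shifts q* up so that its maximum becomes 1 (the largest window that fits the widescreen frame).

```python
import numpy as np

def zoom_profile(sigma_per_frame, frame_w, smooth_spline, k6=0.5):
    """Per-frame scale factor q** for the cropping-window size.

    sigma_per_frame : (T,) std. dev. of the gaze points for each frame
    frame_w         : width of the original widescreen frame
    smooth_spline   : callable; the robust B-spline fit of Section 3.2.2
    """
    rho = sigma_per_frame / frame_w     # Eq. 10 (assumed normalization)
    q_star = smooth_spline(rho)         # best-scoring spline q* (Eq. 11)
    # Eq. 12 (assumed form): scale by K6 and shift up so max(q**) = 1,
    # i.e. the largest window is the largest size fitting the frame.
    q_ss = k6 * (q_star - q_star.max()) + 1.0
    return q_ss   # the window size is scaled by q_ss[i] before rendering
```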
4. RESULTS

We present results on eighteen video clips, selected to have moving and stationary backgrounds, foreground objects, and cameras. Figure 6 shows the number of clips in each category. Eight clips involve a moving camera. The foreground objects and background objects both show significant movement in seven clips.

The same parameter values were used for all the examples. The number of RANSAC trials is N = 1000, and the parameters are [...]

Fig. 6: We ran our algorithm on a variety of clips (12-24 seconds in duration) taken from three Hollywood films: Herbie Rides Again, Who Framed Roger Rabbit, and The Black Hole. The clips are categorized based on the motion of the foreground object, the scene camera, and the background objects. When one clip fits more than one category, it is counted in both categories (for example, if the scene camera was stationary for half the duration and moved in the other half).

Background objects are moving:
  Scene camera stationary — foreground object slight motion: 2 clips; significant motion: 3 clips
  Scene camera moving — foreground object slight motion: none; significant motion: 4 clips

Background objects are stationary:
  Scene camera stationary — foreground object slight motion: 2 clips; significant motion: 7 clips
  Scene camera moving — foreground object slight motion: 1 clip; significant motion: 3 clips

[Figure panels: Original widescreen video (1.75:1); our result (1:1); Wang et al. 2011 (1:1).]
A 30 second sequence takes approximately 40 minutes of computation time because the nonuniform B-spline blending functions need to be computed for each RANSAC trial. The trials could be parallelized to reduce the computation time. The run time for our method is independent of the resolution of the video.

Fig. 8: Two example videos where our method zooms into the scene. Images from Herbie Rides Again courtesy The Walt Disney Company.

5. EVALUATION

5.1 Eyetracking

For the eighteen example clips we tested (12-43 seconds each), the percentage of included gaze samples is shown in Figure 9. The average included percentage is ψ = 81.2%.

The second check examines the extent to which our re-editing algorithm alters viewer eye movements. We compare the eyetracking data of viewers watching our result videos with the eyetracking data collected on the original clips. Let rv = [[x̃1, x̃2, ..., x̃n]T, [ỹ1, ỹ2, ..., ỹn]T] be the gaze data on the original video, and r′v be the recorded gaze data on the retargeted video, where v indexes the video clip. This data is a distribution, with a per-frame mean µ, median η, and standard deviation σ.
Fig. 11: Distance between mean gaze position on pre- and post-edited videos for the aspect ratio 1:1 (averaged over all frames). The red markers show the percentage distance between the median gaze position on our results at the 1:1 aspect ratio and the median position of the included gaze samples for the original widescreen video, i.e., the distance between the median of the blue markers and the median of those red markers that lie on the colored portion of the frame in Figure 10. The average distance over all example clips is 79.3 pixels (σ = 21.8), which is less than 10% of the width of the original widescreen frame.

Because we compare viewer gaze on videos of different sizes, directly computing differences in the pixel locations of the points of regard will not yield the comparison we are looking for: we want
to measure whether viewers could be looking at the same object
in the result video as in the original video. Thus, the gaze data on
the result video are transformed back to the coordinates of the
original video with the inverse operator w.
There are several metrics to compute the distance between two distributions. A first-order metric is the distance between the per-frame mean or median values. A higher-order metric is a statistic such as the chi-squared distance, or the Earth Mover's Distance [Judd et al. 2012; Zhao and Koch 2012].

The gaze distribution on the re-edited video will likely be different from the original distribution even if viewers are looking at the same objects. For example, the standard deviation of the gaze data on the original widescreen video is generally larger than on the result video because the eyes move across a larger region of the screen. Therefore, metrics like the chi-squared distance and the Earth Mover's Distance (after histogram normalization) will return non-zero values.
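Both families of metrics are easy to make concrete. The sketch below is our illustration, not the paper's code; the bin count and value range are arbitrary choices, and gaze coordinates are assumed normalized to [0, 1].

```python
import numpy as np
from scipy.stats import wasserstein_distance

def first_order_distance(gaze_a, gaze_b):
    # Distance between the medians of two gaze distributions.
    return abs(np.median(gaze_a) - np.median(gaze_b))

def chi_squared_distance(gaze_a, gaze_b, bins=10, value_range=(0.0, 1.0)):
    # Chi-squared distance between normalized gaze histograms.
    p, _ = np.histogram(gaze_a, bins=bins, range=value_range)
    q, _ = np.histogram(gaze_b, bins=bins, range=value_range)
    p = p / max(p.sum(), 1)
    q = q / max(q.sum(), 1)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + 1e-12))

def earth_movers_distance(gaze_a, gaze_b):
    # 1-D Earth Mover's (Wasserstein) distance between gaze samples.
    return wasserstein_distance(gaze_a, gaze_b)
```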
In Figure 10, we plot the median gaze positions for each frame of the 'lawyer' video. Then, the distance δ between included gaze data on the original video (i.e., those red markers that are inside the colored portion of the frame in Figure 10) and gaze data on the retargeted video is defined as

δ = (1/V) Σv ∆v,  (15)

where ∆v is the root mean squared distance between the median gaze positions per frame, averaged over all the frames for video v, and V is the number of example clips. For our eighteen example clips, the distance ∆ is plotted in red in Figure 11 and the average distance δ = 79.3 is shown as a blue dotted line. Sample frames are shown in Figure 10.

Fig. 10: The gaze data captured on the original widescreen video is shown in red and the data captured on our result (transformed to widescreen coordinates) is shown in blue. We plot the median instead of the mean to reduce the effect of outlier gaze data on the computed distance values. The thick blue line represents the median gaze data recorded on our result videos and the thick red line represents the median gaze data for the original widescreen videos. The thin blue and red lines represent the standard deviations for each frame. Images from Herbie Rides Again courtesy The Walt Disney Company.
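A sketch of this evaluation metric, assuming per-frame median gaze positions are available as (T, 2) arrays and that the result-video gaze has already been mapped back to widescreen coordinates; Equation 15 is reconstructed here as the per-clip RMS distance averaged over clips:

```python
import numpy as np

def delta_for_clip(median_orig, median_result):
    """Root mean squared distance between per-frame median gaze positions.

    median_orig   : (T, 2) per-frame median gaze on the original video
    median_result : (T, 2) per-frame median gaze on the result video,
                    already mapped back to widescreen coordinates
    """
    d = median_orig - median_result
    return np.sqrt(np.mean(np.sum(d ** 2, axis=1)))   # RMS over frames

def average_delta(clips):
    # Eq. 15 (assumed form): average the per-clip RMS distances.
    return np.mean([delta_for_clip(a, b) for a, b in clips])
```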
We take a closer look at two cuts introduced by our method; the corresponding frames are shown in the first and third rows of Figure 10. Our method selected appropriate locations to introduce the cuts because viewer gaze shifts from left to right and from right to left, respectively, in the original widescreen video. This shift can be seen in the sample frames, and on the graph where we have marked the frames with the label 'New cut'. The sample frames shown in the second row correspond to a cut that was part of the original video (Frames #333 and #334). Our algorithm placed the cropping window on the right side of the original widescreen frame after this cut. It takes some time for the viewers of the original widescreen video to shift their attention to the group of lawyers. [...] early. Thus, we collect eyetracking data on re-edited results with pans and cuts.
The consistency in viewer eye movements on the original video and our re-edited result (as illustrated for the lawyer video in Figure 10) suggests that viewers are absorbing the same information.

5.2 User Preference

We also evaluate our method by asking users to submit their preferences. We conduct a two-alternative forced choice study where fifteen participants compared results generated by our method with results generated by the optimized crop-and-warp method of [Wang et al. 2011], and with the letterboxed version of the same clip.

Fig. 9: Percentage of included gaze samples for each clip (horizontal axis: Clip ID).
REFERENCES
HEALEY, C. G. AND SAWANT, A. P. 2012. On the limits of resolution and visual angle in visualization. ACM Transactions on Applied Perception 9, 4, 20:1–20:21.
JAIN, E., SHEIKH, Y., AND HODGINS, J. 2012. Inferring artistic intention in comic art through viewer gaze. In ACM Symposium on Applied Perception (SAP).
JUDD, T., DURAND, F., AND TORRALBA, A. 2012. A benchmark of computational models of saliency to predict human fixations. Tech. Rep. MIT-CSAIL-TR-2012-001, Massachusetts Institute of Technology. January.
KATTI, H., RAJAGOPAL, A. K., KANKANHALLI, M., AND KALPATHI, R. 2014. Online estimation of evolving human visual interest. ACM Transactions on Multimedia Computing, Communications, and Applications 11, 1.
KATZ, S. D. 1991. Shot by Shot. Michael Wiese Productions, Focal Press.
KNOCHE, H., MCCARTHY, J., AND SASSE, M. 2008. How low can you go? The effect of low resolutions on shot types in mobile TV. Multimedia Tools and Applications 36, 1-2, 145–166.
KOPF, S., HAENSELMANN, T., KIESS, J., GUTHIER, B., AND EFFELSBERG, W. 2011. Algorithms for video retargeting. Multimedia Tools and Applications (MTAP), Special Issue: Hot Research Topics in Multimedia 51, 819–861.
KRÄHENBÜHL, P., LANG, M., HORNUNG, A., AND GROSS, M. 2009. A system for retargeting of streaming video. ACM Transactions on Graphics 28, 126:1–126:10.
LIU, F. AND GLEICHER, M. 2006. Video retargeting: Automating pan and scan. In ACM International Conference on Multimedia. 241–250.
LIU, L., CHEN, R., WOLF, L., AND COHEN-OR, D. 2010. Optimizing photo composition. Computer Graphics Forum (Proceedings of Eurographics) 29, 2, 469–478.
MITAL, P. K., SMITH, T. J., HILL, R. L., AND HENDERSON, J. M. 2010. Clustering of gaze during dynamic scene viewing is predicted by motion. Cognitive Computation, 1–20.
NIU, Y., LIU, F., LI, X., AND GLEICHER, M. 2010. Warp propagation for video resizing. Computer Vision and Pattern Recognition, 537–544.
RUBINSTEIN, M., SHAMIR, A., AND AVIDAN, S. 2008. Improved seam carving for video retargeting. ACM Transactions on Graphics 27, 3, 16:1–16:9.
RUDOY, D., GOLDMAN, D. B., SHECHTMAN, E., AND ZELNIK-MANOR, L. 2012. Crowdsourcing gaze data collection. arXiv, Collective Intelligence.
SANTELLA, A., AGRAWALA, M., DECARLO, D., SALESIN, D., AND COHEN, M. 2006. Gaze-based interaction for semi-automatic photo cropping. In ACM SIGCHI Conference on Human Factors in Computing Systems. 771–780.
SHAMIR, A. AND SORKINE, O. 2009. Visual media retargeting. In ACM SIGGRAPH Asia 2009 Courses. 11:1–11:13.
SMITH, T. J. AND HENDERSON, J. M. 2008. Edit blindness: The relationship between attention and global change blindness in dynamic scenes. Journal of Eye Movement Research, 1–17.
TAO, C., JIA, J., AND SUN, H. 2007. Active window oriented dynamic video retargeting. In Proceedings of the Workshop on Dynamical Vision, International Conference on Computer Vision (ICCV).
WANG, J., REINDERS, M. J. T., LAGENDIJK, R. L., LINDENBERG, J., AND KANKANHALLI, M. S. 2004. Video content representation on tiny devices. In Proc. of IEEE Conference on Multimedia and Expo.
WANG, Y.-S., FU, H., SORKINE, O., LEE, T.-Y., AND SEIDEL, H.-P. 2009. Motion-aware temporal coherence for video resizing. ACM Transactions on Graphics 28, 127:1–127:10.
WANG, Y.-S., HSIAO, J.-H., SORKINE, O., AND LEE, T.-Y. 2011. Scalable and coherent video resizing with per-frame optimization. ACM Transactions on Graphics 30, 4, 88:1–88:8.
WANG, Y.-S., LIN, H.-C., SORKINE, O., AND LEE, T.-Y. 2010. Motion-based video retargeting with optimized crop-and-warp. ACM Transactions on Graphics 29, 90:1–90:9.
WANG, Y.-S., TAI, C.-L., SORKINE, O., AND LEE, T.-Y. 2008. Optimized scale-and-stretch for image resizing. ACM Transactions on Graphics 27, 118:1–118:8.
WIKIPEDIA. Pan and scan. http://en.wikipedia.org/wiki/Pan_and_scan.
XIANG, Y. Y. AND KANKANHALLI, M. S. 2010a. Automated aesthetic enhancement of videos. In ACM International Conference on Multimedia. 218–290.
XIANG, Y.-Y. AND KANKANHALLI, M. S. 2010b. Video retargeting for aesthetic enhancement. In ACM International Conference on Multimedia. 919–922.
YOUNG, J. 2008. Sydney Pollack dies at 73. Variety. [published May 26].
ZHAO, Q. AND KOCH, C. 2012. Learning visual saliency by combining feature maps in a nonlinear manner using adaboost. Journal of Vision 12, 6.