2.2. Temporal Naturality

When considering the temporal information, naturality requires that two temporal neighbors in the retargeted video are similar to some temporal neighbors in the source video. Figure 1 (b) is an illustration of this constraint on two temporally adjacent pixels in the retargeted video. The pixel black (+) in the (t − 1)th frame of the retargeted video is mapped to black (+) in the source video. The temporal neighbor of this pixel is the red (x) in the tth frame of the source video. If this pixel is similar to black (x), which is the mapping of the temporal neighbor of black (+) in the tth frame of the retargeted video, then the black (+) and black (x) in the retargeted video are considered temporally natural. A similar analysis can be applied to the pixel black (x) in the tth frame of the target. If the black (+) in the (t − 1)th frame of the source is similar to the red (+) in the same frame, the two temporal neighbors (marked as black (+) and (x)) in the retargeted video are temporally natural, as in the source video.

We maximize the temporal naturality of the retargeted video by minimizing

Σ_{(p,t)∈R} Σ_{Δt∈{−1,+1}} D(S(p + M_t(p), t + Δt), S(p + M_{t+Δt}(p), t + Δt))    (2)

where the definitions of R, S, M and D are the same as in equation (1). The distance function operates on the source between (i) a pixel that is a temporal neighbor of a mapped pixel and (ii) the mapped pixel of the temporal neighbor of a target pixel. Any suitable distance measure can be used.

3. Hybrid Shift Map

The retargeted video that preserves the spatial-temporal naturality of the source video is modeled as graph(s) where the nodes represent the pixels of the retargeted video. Retargeting is achieved by finding the optimal mapping between the source and the retargeted video. Specifically, we encode the spatial-temporal naturality as well as other constraints into the following form, which can be minimized by a graph cut algorithm, similar to [11]:

E(M) = α Σ_{(p,t)∈R} E_d(M_t(p)) + β Σ_{((p_i,t_i),(p_j,t_j))∈N} E_s(M_{t_i}(p_i), M_{t_j}(p_j))    (3)

where E_d is the data term encoding the unary energy and E_s is the smoothness term encoding the pairwise energy. In this section, we first develop a 3D shift map to retarget video with spatial-temporal naturality. To improve the computational efficiency, an incremental 2D shift map is then introduced to retarget video. Compared with the 3D shift map, this method can only achieve a local optimum. However, its computational complexity is much lower while still preserving spatial-temporal naturality. Finally, a novel solution for video retargeting is provided by combining these two methods in a multi-resolution hierarchy, which is called the Hybrid Shift Map.

3.1. 3D Shift Map

We model the retargeted video as a 3D grid graph where every node is connected to its 4 spatial and 2 temporal neighbors. There are two types of constraints related to video retargeting: pixel preservation during resizing and spatial-temporal naturality for artifact reduction. To find the optimal 3D shift map using graph cut, we encode the pixel preservation in the data term and the spatial-temporal naturality in the smoothness term of equation (3).
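As a concrete illustration, the pixel-preservation constraint just described can be evaluated per pixel as in the following NumPy sketch. This is not the authors' implementation; the array layout, the 0-based indexing, and the function name `data_term` are our own assumptions, for a purely horizontal shift map in which pixel (x, y) of the retargeted frame maps to source column x + shift[y, x].

```python
import numpy as np

def data_term(shift, W_S):
    """Per-pixel pixel-preservation cost for one frame of a horizontal
    shift map.

    shift : (H, W_R) array of horizontal shifts; pixel (x, y) of the
            retargeted frame samples source column x + shift[y, x].
    W_S   : width of the source frame.
    Returns an (H, W_R) array: np.inf where the boundary constraint is
    violated, 0 elsewhere.
    """
    cost = np.zeros(shift.shape)
    W_R = shift.shape[1]
    # Leftmost retargeted column must map to the leftmost source column,
    # i.e. its shift must be 0.
    cost[:, 0] = np.where(shift[:, 0] == 0, 0.0, np.inf)
    # Rightmost retargeted column (index W_R - 1 with 0-based indexing)
    # must map to the rightmost source column, i.e. shift = W_S - W_R.
    cost[:, -1] = np.where(shift[:, -1] == W_S - W_R, 0.0, np.inf)
    return cost
```

Summing this array over all pixels of all frames gives the data-term contribution to the energy; the graph cut then trades it off against the smoothness term.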
Energy Function

In video retargeting, some pixels need to be preserved. For example, in changing the width of the video, the leftmost and rightmost columns of every frame should be preserved in the target video. We use the data term to encode such pixel preservation by assigning

E_d(M_t(p)) = { ∞  if (x = 0) ∩ (t_x ≠ 0)
              { ∞  if (x = W_R) ∩ (t_x ≠ W_S − W_R)
              { 0  otherwise                              (4)

where t_x is the horizontal component of the shift M_t(p), and W_S and W_R are the widths of the video frames in the source and the retargeted videos, respectively.

The spatial and temporal naturalities described in section 2 can be unified into a single pairwise measure. Consider a pair of neighboring pixels (p_i, t_i) and (p_j, t_j), which can be either spatial neighbors (t_i = t_j) or temporal neighbors (p_i = p_j). For (p_j, t_j), the mapping of its neighbor (p_i, t_i) onto the source video is given by

S_i = S(p_j + Δp_ji + M_{t_j+Δt_ji}(p_j + Δp_ji), t_j + Δt_ji),

where Δp_ji = (x_i − x_j, y_i − y_j) and Δt_ji = t_i − t_j. The neighbor of the mapping of (p_j, t_j) that corresponds to (p_i, t_i) is given by Ŝ_i = S(p_j + M_{t_j}(p_j) + Δp_ji, t_j + Δt_ji). Similarly,

S_j = S(p_i + Δp_ij + M_{t_i+Δt_ij}(p_i + Δp_ij), t_i + Δt_ij),
Ŝ_j = S(p_i + M_{t_i}(p_i) + Δp_ij, t_i + Δt_ij),

where Δp_ij = (x_j − x_i, y_j − y_i) and Δt_ij = t_j − t_i. Therefore, the smoothness term in equation (3) can measure spatial-temporal naturality as

E_s(M(p_i, t_i), M(p_j, t_j)) = min(D(Ŝ_i, S_i), D(Ŝ_j, S_j)).

If either D(Ŝ_i, S_i) = 0 or D(Ŝ_j, S_j) = 0, this pair of neighboring pixels in the retargeted video is perfectly natural with respect to the source video.

If we wish to eliminate one row/column from the video, the binary 3D graph cut guarantees a globally optimal solution. However, the computational complexity of this method is very high due to the large size of the 3D graph, resulting in high computational time and memory usage. This precludes the application of this method on large video volumes and hence, a more efficient solution is required.

3.2. Incremental 2D Shift Map

Instead of considering the video as a 3D volume, it can be viewed as a collection of frames, each of which is represented as a 2D grid graph. However, if we simply apply the 2D shift map [11] to retarget individual frames independently, the retargeted video will not be temporally smooth and will contain jitters. The temporal information in the source video must be utilized in order to ensure temporal consistency. We propose an incremental solution to retarget video using temporal information so that the result is temporally consistent (smooth). In this scheme, the first frame of the sequence is retargeted using the 2D shift map [11], and the tth frame is processed based on the retargeted (t − 1)th frame. The temporal consistency is improved by maximizing the temporal naturality between the current retargeted frame and the retargeted result of the previous frame.

Given the shift map of the (t − 1)th frame, the tth frame is retargeted by finding a minimum cut in an augmented 2D grid graph. Each node of this graph is not only associated with the coordinate shift in the current frame, but also associated with the corresponding shift in the previous frame. The shift map of the previous frame is utilized to constrain the retargeting of the current frame. Specifically, we extend the energy function to consider the temporal naturality.

Energy Function

In the data term of the energy function, in addition to the pixel preservation term of equation (4), we encode the temporal naturality with respect to the shift map of the previous frame. The data term containing the measure of temporal naturality is

E_d(M_t(p)) = min(D̂_{t−1}(p), D̂_t(p))    (5)

where

D̂_{t−1}(p) = D(S(p + M_t(p), t − 1), S(p + M_{t−1}(p), t − 1)),
D̂_t(p) = D(S(p + M_t(p), t), S(p + M_{t−1}(p), t)).    (6)

Minimization of the data terms that comprise equations (5) and (4) results in the optimal shift map for the tth frame, which is temporally smooth (natural) with respect to the retargeted (t − 1)th frame.

The spatial naturality between pixel p and its neighbors in the tth frame is encoded as

E_s(M_t(p), M_t(p + e_i)) = D(S(p + M_t(p) + e_i, t), S(p + e_i + M_t(p + e_i), t))    (7)

where e_i represents the four unit vectors corresponding to the four spatial neighbors, similar to [11].

While the 3D shift map achieves a globally optimal solution, the 2D shift map can only obtain a local optimum which is dependent on the initial guess. If the initial solution in the first frame is far away from the global optimum, the retargeting results of the other frames will also be away from the global optimum. However, the incremental 2D approach is much more efficient in terms of computational time and
memory usage, as shown in the experimental results. Since the original 3D problem is simplified into a series of 2D problems, the number of nodes in each graph is much less than that in the 3D graph.

3.3. Hybrid Shift Map as a Hierarchical Solution

An efficient solution for video retargeting is provided by combining the 3D shift map and the 2D shift map in a multi-resolution framework. We build a Gaussian pyramid of the source video. In the lowest resolution, an initial retargeting result for every frame is estimated using the 3D shift map. The global optimum property of the 3D shift map guarantees a good initial solution, which is used to constrain the final retargeting result in the higher resolution not too far away from this initial guess. In the higher resolutions, we iteratively apply the incremental 2D shift map to obtain the retargeted video: the initial solution in the lower resolution is first interpolated to the higher resolution. Starting from the first frame, the shift map is optimized to refine the interpolated solution on a banded 2D graph incrementally, where the band covers the initial solution as shown in Figure 2. For comparison between the hybrid shift map and video seam carving [12], we use a simple multi-resolution banding method on a standard grid graph. Compared to more advanced multi-resolution banding methods, e.g. [10], the banded graph in Figure 2 is still a grid graph although the band is not minimum.

This multi-resolution framework, denoted as Hybrid Shift Map, improves the computational complexity of retargeting in two aspects. First, the 3D shift map is more efficient than 3D seam carving because the proposed graph is much simpler than the seam graph with forward energy. Every node in the graph used in the 3D shift map has only 6 edges while a node in the seam graph has 14 edges. Second, the 3D graph cut at the higher resolution is divided into a series of 2D graph cuts. Its computational time increases only linearly with the length of the video sequence. Using the same banding method, the incremental 2D shift map has a narrower band in every frame since it bands the individual seam in the frame. On the other hand, video seam carving needs to band the whole 2D manifold interpolated from the lowest resolution. When the manifold involves a large variance in the temporal domain, the rectangular band in the 3D graph is larger than that in the hybrid shift map. Figure 2 shows this difference between the hybrid shift map and video seam carving using the same banding method. Note that more advanced banded multi-resolution graph cut techniques [10] can also be applied to the hybrid shift map to further reduce complexity.

Figure 2. Comparison of the bands of hybrid shift-map (a) and video seam carving [12] (b) on the same grid graph. The shaded area in (a) is the band for hybrid shift-map and in (b) is the band for seam carving.

4. Experimental Evaluation

In this section, we evaluate different properties of the proposed method on real video sequences and compare them with other retargeting methods. Some test video sequences are the same as used in [12] and some are web videos downloaded from YouTube, which contain large camera/object motion in a complex scene. For all experiments, we use the following setting: a 3-layer Gaussian pyramid is built, the 3D shift map is estimated in the lowest resolution, and the individual 2D shift maps are refined in the original resolution incrementally. The parameters are set as α = 1 and β = 1. The distance function is D((p_1, t_1), (p_2, t_2)) = |S(p_1, t_1) − S(p_2, t_2)|, which is simply the grayscale difference of two pixels.

4.1. Temporal Consistency

Compared with the 2D shift map [11], our method retargets video by considering the temporal consistency between the source and target videos. Without motion analysis, our method can remove the flickering/waving artifacts and generate temporally smooth retargeted video. We compare our method with the naive solution of applying [11] independently on individual frames. Figure 3 (a) shows 4 consecutive frames of a basketball sequence. The corresponding retargeted frames of the proposed hybrid shift map and the naive solution are shown in Figure 3 (b) and (c), respectively. The red curves in every frame are the next optimal seams to be removed. From the figure, we can see that the seams obtained by the hybrid shift map are temporally smooth while those obtained by the naive solution are not. Consequently, the hybrid shift map generates retargeted video which is temporally consistent with the source video. The naive solution using [11] cannot achieve this temporal consistency. Our method enforces not temporal smoothness, as in [12], but temporal naturality, which is adaptive to the video content. When the content is homogeneous, even non-smooth seams can preserve temporal naturality and generate temporally consistent video. Compared to other retargeting algorithms, e.g. [18], which can preserve temporal consistency, our method does not rely on any motion analysis and is robust enough to be applied on challenging videos where motion analysis may be erroneous.
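To make the temporal-naturality measure concrete, the following is a minimal NumPy sketch, with our own naming and array layout rather than the authors' code, of the per-pixel incremental data term min(D̂_{t−1}(p), D̂_t(p)) of equations (5) and (6), restricted to grayscale frames and horizontal shifts and using the absolute grayscale difference as the distance D, as in our experimental setting.

```python
import numpy as np

def temporal_naturality(src_tm1, src_t, shift_tm1, shift_t):
    """Per-pixel temporal-naturality cost of the incremental data term.

    src_tm1, src_t     : (H, W_S) grayscale source frames (t-1) and t.
    shift_tm1, shift_t : (H, W_R) integer horizontal shift maps of the
                         retargeted frames (t-1) and t.
    Returns an (H, W_R) array: min(D_hat_{t-1}(p), D_hat_t(p)) per pixel,
    with D the absolute grayscale difference.
    """
    H, W_R = shift_t.shape
    ys, xs = np.mgrid[0:H, 0:W_R]
    cols_t = xs + shift_t      # source columns chosen at frame t
    cols_tm1 = xs + shift_tm1  # source columns chosen at frame t-1
    # D_hat_{t-1}: compare both mappings inside the previous source frame.
    d_prev = np.abs(src_tm1[ys, cols_t] - src_tm1[ys, cols_tm1])
    # D_hat_t: compare both mappings inside the current source frame.
    d_curr = np.abs(src_t[ys, cols_t] - src_t[ys, cols_tm1])
    return np.minimum(d_prev, d_curr)
```

A pixel whose shift is unchanged between frames, or whose two candidate source pixels look alike, contributes zero cost, which is exactly the content-adaptive behaviour described above: in homogeneous regions even a non-smooth shift map remains temporally natural.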
Figure 3. (a) Four consecutive frames of a basketball sequence. Retargeted by (b) hybrid shift map and (c) the naive solution of applying the 2D shift map [11] on every frame independently.

Figure 4. Comparison of hybrid shift map and incremental 2D shift map. (a) two frames (1st and 98th frames) of a cycling sequence; corresponding retargeted frames using (b) hybrid shift map and (c) incremental 2D shift map.

4.2. Global vs Local vs Hybrid

Our retargeting framework combines two components: the 3D shift map at the lowest resolution and the incremental 2D shift map at the original resolution. We analyze the retargeting results as well as the computational complexities of our framework and its individual components. Figure 4 shows the retargeted frames obtained from the hybrid shift map and the incremental 2D shift map for two frames of a video sequence. We can see that the retargeted frames output from the incremental 2D shift map (Figure 4 (c)) introduce large distortions within the red box. This is because without the initialization of the 3D shift map, the incremental method does not consider all frames and can only achieve a local optimum. The hybrid shift map instead uses the 3D shift map in the lowest resolution to constrain every 2D shift map to be close to the global optimum. Hence, the retargeted frames from the hybrid shift map (shown in Figure 4 (b)) are more natural than those from the incremental 2D shift map alone.

In terms of computational complexity, we compare the computational time for reducing the width of the whole video by 1 pixel using the different components. For a 320 × 240 video sequence of 110 frames, the computational times of the different components are summarized in Table 1. The 3D shift map and the incremental 2D shift map are applied on the original resolution. For the hybrid shift map, the 3D shift map is applied on the third layer of the Gaussian pyramid and the incremental 2D shift map is applied on the original resolution. We can see that the proposed hybrid shift map significantly improves the efficiency while preserving the effectiveness. The 3D shift map component (9s) improves since it is applied at the lowest resolution, and the incremental 2D shift map component (8s) in the original resolution improves due to the smaller graph for every frame.

4.3. Hybrid Shift Map vs Video Seam Carving

In this section, we compare our proposed method with [12], which improves seam carving for video retargeting. Both methods model the video as graph(s) and use graph cut techniques to iteratively retarget video content. When reducing the width by 1 pixel, each binary shift map actually corresponds to a removed seam. However, the proposed method differs from [12] in several aspects. First, the graph constructions are different. [12] models the source video as a graph while our method represents the retargeted video as
Figure 5. Example of a disconnected seam (marked as red).

initial solution in the lowest resolution and the incremental 2D shift map is designed to refine the initial solution in the original resolution. Compared with related retargeting methods, the proposed hybrid shift map significantly improves the efficiency in terms of both computational time and memory usage while still retargeting video with spatial-temporal naturality.

Acknowledgement

This research was supported by the Media Development Authority (MDA) under grant NRF2008IDM-IDM004-032.

References

[1] S. Avidan and A. Shamir. Seam Carving for Content-Aware Image Resizing. SIGGRAPH, 26(3), July 2007.
[2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing. SIGGRAPH, 28(3), August 2009.
[3] B. Chen and P. Sen. Video Carving. In Eurographics, April 2008.
[4] T. Deselaers, P. Dreuw, and H. Ney. Pan, Zoom, Scan - Time-coherent, Trained Automatic Video Cropping. In CVPR, June 2008.
[5] R. Gal, O. Sorkine, and D. Cohen-Or. Feature-aware Texturing. In Eurographics Symposium on Rendering, June 2006.
[6] J.-W. Han, K.-S. Choi, T.-S. Wang, S.-H. Cheon, and S.-J. Ko. Wavelet Based Seam Carving For Content-Aware Image Resizing. In ICIP, November 2009.
[7] H. Huang, T. Fu, P. L. Rosin, and C. Qi. Real-Time Content-Aware Image Resizing. Science in China Series F: Information Science, 52(2), February 2009.
[8] J.-S. Kim, J.-H. Kim, and C.-S. Kim. Adaptive Image and Video Retargeting Based on Fourier Analysis. In CVPR, June 2009.
[9] F. Liu and M. Gleicher. Video Retargeting: Automating Pan-and-Scan. In ACM Multimedia, October 2006.
[10] H. Lombaert, Y. Sun, L. Grady, and C. Xu. A Multilevel Banded Graph Cuts Method for Fast Image Segmentation. In ICCV, October 2005.
[11] Y. Pritch, E. Kav-Venaki, and S. Peleg. Shift-Map Image Editing. In ICCV, September 2009.
[12] M. Rubinstein, A. Shamir, and S. Avidan. Improved Seam Carving for Video Retargeting. SIGGRAPH, 27(3), December 2008.
[13] M. Rubinstein, A. Shamir, and S. Avidan. Multi-operator Media Retargeting. SIGGRAPH, 28(3), August 2009.
[14] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. Cohen. Gaze-based Interaction for Semi-automatic Photo Cropping. In SIGCHI, April 2006.
[15] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani. Summarizing Visual Data Using Bidirectional Similarity. In CVPR, June 2008.
[16] C. Tao, J. Jia, and H. Sun. Active Window Oriented Dynamic Video Retargeting. In ICCV Workshop on Dynamic Vision, October 2007.
[17] S.-F. Wang and S.-H. Lai. Fast Structure-Preserving Image Retargeting. In ICASSP, April 2009.
[18] Y.-S. Wang, H. Fu, O. Sorkine, T.-Y. Lee, and H.-P. Seidel. Motion-Aware Temporal Coherence for Video Resizing. SIGGRAPH ASIA, 28(5), December 2009.
[19] Y.-S. Wang, C.-L. Tai, O. Sorkine, and T.-Y. Lee. Optimized Scale-and-Stretch for Image Resizing. SIGGRAPH ASIA, 27(5), December 2008.
[20] L. Wolf, M. Guttmann, and D. Cohen-Or. Non-homogeneous Content-driven Video-Retargeting. In ICCV, October 2007.
[21] X. Xie, H. Liu, W.-Y. Ma, and H.-J. Zhang. Browsing Large Pictures under Limited Display Sizes. IEEE Transactions on Multimedia, 8(4), August 2006.