
Hybrid Shift Map for Video Retargeting

Yiqun Hu Deepu Rajan


School of Computer Engineering
Nanyang Technological University, Singapore 639798
{yqhu,asdrajan}@ntu.edu.sg

Abstract

We propose a new method for video retargeting which can generate spatially and temporally consistent video. The new measure, called spatial-temporal naturality, preserves the motion in the source video without any motion analysis, in contrast to other methods that need motion estimation. This advantage prevents the retargeted video from degenerating due to the propagation of errors in motion analysis, and it allows the proposed method to be applied to challenging videos with complex camera and object motion. To improve the efficiency of the retargeting process, we retarget video using a 3D shift map at low resolution and refine it using an incremental 2D shift map at higher resolution. This new hierarchical framework, denoted the hybrid shift map, produces satisfactory retargeting results while significantly improving computational efficiency.

1. Introduction

Media retargeting aims to increase or decrease the 'size' of media according to the inherent content, and not blindly as in scaling and cropping. The 'important' content is preserved during the retargeting process. For images, this may involve simply resizing, while for videos, the resizing could be in the spatial and/or the temporal domain. With the development of diverse terminal devices, e.g. mobile phones, large-screen displays etc., this technique is useful for adapting multimedia content to devices with different screen resolutions.

Image Retargeting

Various retargeting algorithms have been proposed to adapt images to different resolutions and aspect ratios. As opposed to homogeneous resizing methods [21, 14] that crop the most important regions to be included, recent methods focus on nonlinear image retargeting according to image content. For example, seam carving [1] and its variants [7, 6] remove horizontal/vertical seams from the image. Seams are monotonically connected curves with minimum perceptual energy. Adaptive warping methods redistribute the image pixels in a single direction [5] or over several directions [19] according to their importance. Visually important regions are preserved while homogeneous regions are merged. Some image editing techniques solve the retargeting problem by redistributing pixels under completeness and coherence constraints [15, 2]. Other extensions have also been proposed: [8] introduced an image retargeting framework based on Fourier analysis to improve efficiency. Retargeting results are improved by preserving image structures in [17]. In [13], multiple operators are integrated to obtain optimally retargeted images.

Video Retargeting

Some video retargeting methods have been proposed by adapting image retargeting methods to video content. The local temporal consistency between adjacent pixels in a spatial-temporal video cube is enforced. For example, in [20], local temporal consistency was enforced by introducing a penalty for changes in position of temporally adjacent pixels in a least-squares optimization formulation. The seam carving operator was improved to retarget video both spatially [12] and temporally [3] by searching for monotonic and connected manifolds using graph cut. However, such local consistency in the temporal domain is invalid when there is large object/camera motion. Some methods enforce global temporal consistency by estimating motion information. For example, cropping-based methods [9, 16, 4] were extended to find temporally smooth cropping windows for videos. Motion segmentation is used to model the background or to extract moving objects. A scale-and-stretch operator was extended to retarget video in [18]. Consecutive video frames are aligned to estimate inter-frame camera motion, which can then be used to constrain retargeting. Although these methods can handle complex object or camera motion, they require motion estimation, which in itself is a challenging task.

1.1. Motivation

This work is motivated by two major limitations of current video retargeting techniques. First, the effectiveness of
the current methods depends on the performance of motion estimation, which is performed prior to the actual retargeting. Errors in motion estimation degrade the final result, especially for scenes with complex motion and large background clutter. Second, for videos, the computational complexity of retargeting methods based on graph cut is very large. In the multi-resolution framework, the 3D graph cut is inefficient at higher resolution even though the solution at lower resolution can be found efficiently.

We propose a new framework for video retargeting that does not rely on any motion analysis, which results in an efficient algorithm in terms of both computational time and memory usage. Within this framework, a new measure is introduced to estimate temporal consistency (naturality) for video retargeting. This measure does not require motion analysis and easily integrates both spatial and temporal domains into a unified framework. Using this measure in an energy function, we propose a multi-resolution framework for video retargeting. A 3D shift map is applied to find the initial retargeting solution of the video volume at the lowest resolution. An incremental 2D shift map is then applied to refine every seam in each individual frame with temporal consistency with respect to the retargeting result of the previous frame. Compared to the traditional multi-resolution solution, our method solves the 3D retargeting problem as a series of 2D retargeting problems, which is much more computationally efficient, especially at high resolution.

The rest of this paper is organized as follows. A new measure, spatial-temporal naturality, is introduced in Sec. 2 and is used to calculate the energy for graph cut. The proposed retargeting framework, including the 3D shift map, the incremental 2D shift map, and the new multi-resolution scheme, is described in Sec. 3. We evaluate the proposed method and analyze its properties on real video sequences in Sec. 4. Finally, we present conclusions in Sec. 5.

2. Spatial-Temporal Naturality

Most existing retargeting methods resort to minimizing a distortion measure in order to retarget a source video. For example, seam carving techniques [1, 12] try to minimize the distortion due to a new pair of pixels becoming adjacent. Similarly, warping-based methods [20, 18] minimize the distortion resulting from the warping operation. How to model various forms of distortion is still an open question. In this paper, we assume that every pixel in the retargeted video is sampled from some pixel in the source video. The retargeted video is visually pleasing if as few artifacts as possible are generated in both the spatial and the temporal domain. We introduce a measure to quantify the strength of the artifacts introduced in the retargeted video and call it the spatial-temporal naturality; it is computed on every pair of neighboring pixels. As the name implies, both spatial and temporal neighboring pixels are considered to ensure naturality in both domains. Without explicitly modeling the distortion, a retargeted video that looks as natural as the source video can be generated by maximizing this measure over all pairs of neighboring pixels. This measure is used to define the energy function for graph cut. In the rest of this section, we first generalize the naturality measure in the spatial domain, which was used in [11], and then extend it to the temporal domain.

2.1. Spatial Naturality

In the spatial domain, naturality requires that two spatially neighboring pixels in the retargeted video are similar to some spatially neighboring pair in the source video. Figure 1 (a) illustrates this constraint for two adjacent pixels marked as black (x) and black (+) in a frame of the retargeted video. Under our assumption, they are sampled from two pixels that are not necessarily adjacent in the corresponding source frame. Given the mappings of two pixels from the target to the source video, we can measure their naturality by considering each pixel in turn and computing the difference between the neighbor of its mapped pixel in the source and the mapped pixel of its neighbor. For example, consider the black (x) pixel in the target whose neighbor is black (+), which corresponds to the black (+) in the source. Also consider the black (x) pixel in the target, which maps to the black (x) pixel in the source having neighbor red (+). If the black (+) and red (+) pixels in the source are similar, then the neighboring pixels black (x) and black (+) in the target are considered natural with respect to the source. Otherwise, they will introduce artifacts that do not appear in the source frame. Similarly, if the red (x) and black (x) pixels in the source are similar, the black (+) and black (x) in the target are also considered natural. This formulation is the same as the pairwise smoothness of [11], in which the target pixel R(u, v) is derived from the source pixel S(u + t_x, v + t_y) through the shift map M(u, v) = (t_x, t_y).

We extend the shift map along the temporal domain and denote it by M_t(p), indicating the value of the shift map at frame t and location p = (x, y) in the target domain. Note that the mapping is from the target to the source. We maximize the spatial naturality of the retargeted video by minimizing

  \sum_{(p,t) \in R} \sum_{i=1}^{4} D(S(p + M_t(p) + e_i, t), S(p + e_i + M_t(p + e_i), t))    (1)

where R denotes the collection of all pixels in the retargeted video, the e_i are the four unit vectors representing the four spatial neighbors, D(·, ·) is the distance function measuring the similarity between two pixels, and S refers to source pixels. The distance function operates on the source between (i) a pixel that is a spatial neighbor of a mapped pixel and (ii) the mapped pixel of the spatial neighbor of a target pixel.
Figure 1. Illustration of spatial-temporal naturality. (a) Spatial naturality within the same frame; (b) temporal naturality between neighboring frames.
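The spatial term of equation (1) can be sketched as follows (our illustration, not the authors' code), assuming a single grayscale frame, a horizontal-only shift map, and the absolute grayscale difference as the distance D (the choice used in the experiments of Section 4):

```python
import numpy as np

def spatial_naturality_cost(frame, shift):
    """Spatial naturality penalty of equation (1) for one frame.

    frame : (H, Ws) grayscale source frame S.
    shift : (H, Wr) horizontal shifts t_x; target pixel (x, y) samples
            source pixel (x + shift[y, x], y).
    For each target pixel p and each spatial neighbour e_i, compare
    (i) the neighbour of p's mapped source pixel with
    (ii) the mapped source pixel of p's neighbour.
    """
    H, Wr = shift.shape
    Ws = frame.shape[1]
    f = frame.astype(np.float64)
    cost = 0.0
    for y in range(H):
        for x in range(Wr):
            for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < H and 0 <= nx < Wr):
                    continue
                a = f[ny, np.clip(x + shift[y, x] + dx, 0, Ws - 1)]
                b = f[ny, np.clip(nx + shift[ny, nx], 0, Ws - 1)]
                cost += abs(a - b)  # distance D: grayscale difference
    return cost
```

An all-zero shift map (the identity mapping) incurs zero cost; a shift map that tears apart non-homogeneous content pays in proportion to the grayscale discontinuity it creates.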

2.2. Temporal Naturality

When considering temporal information, naturality requires that two temporal neighbors in the retargeted video are similar to some temporal neighbors in the source video. Figure 1 (b) illustrates this constraint for two temporally adjacent pixels in the retargeted video. The pixel black (+) in the (t−1)-th frame of the retargeted video is mapped to black (+) in the source video. The temporal neighbor of this pixel is the red (x) in the t-th frame of the source video. If this pixel is similar to black (x), which is the mapping of the temporal neighbor of black (+) in the t-th frame of the retargeted video, then the black (+) and black (x) in the retargeted video are considered temporally natural. A similar analysis can be applied to the pixel black (x) in the t-th frame of the target. If the black (+) in the (t−1)-th frame of the source is similar to the red (+) in the same frame, the two temporal neighbors (marked as black (+) and (x)) in the retargeted video are temporally natural, as in the source video.

We maximize the temporal naturality of the retargeted video by minimizing

  \sum_{(p,t) \in R} \sum_{\Delta t \in \{-1,+1\}} D(S(p + M_t(p), t + \Delta t), S(p + M_{t+\Delta t}(p), t + \Delta t))    (2)

where the definitions of R, S, M and D are the same as in equation (1). The distance function operates on the source between (i) a pixel that is a temporal neighbor of a mapped pixel and (ii) the mapped pixel of the temporal neighbor of a target pixel. Any suitable distance measure can be used.

3. Hybrid Shift Map

The retargeted video that preserves the spatial-temporal naturality of the source video is modeled as graph(s) whose nodes represent the pixels of the retargeted video. Retargeting is achieved by finding the optimal mapping between the source and the retargeted video. Specifically, we encode the spatial-temporal naturality as well as other constraints into the following form, which can be minimized by a graph cut algorithm, similar to [11]:

  E(M) = \alpha \sum_{(p,t) \in R} E_d(M_t(p)) + \beta \sum_{((p_i,t_i),(p_j,t_j)) \in N} E_s(M_{t_i}(p_i), M_{t_j}(p_j))    (3)

where E_d is the data term encoding the unary energy, E_s is the smoothness term encoding the pairwise energy, and N is the set of neighboring pixel pairs. In this section, we first develop a 3D shift map to retarget video with spatial-temporal naturality. To improve computational efficiency, an incremental 2D shift map is then introduced to retarget video. Compared with the 3D shift map, this method can only achieve a local optimum; however, its computational complexity is much lower while still preserving spatial-temporal naturality. Finally, a novel solution for video retargeting is provided by combining these two methods in a multi-resolution hierarchy, which is called the Hybrid Shift Map.

3.1. 3D Shift Map

We model the retargeted video as a 3D grid graph where every node is connected to its 4 spatial and 2 temporal neighbors. There are two types of constraints related to video retargeting: pixel preservation during resizing and spatial-temporal naturality for artifact reduction. To find the optimal 3D shift map using graph cut, we encode the
pixel preservation in the data term and the spatial-temporal naturality in the smoothness term of equation (3).

Energy Function

In video retargeting, some pixels need to be preserved. For example, when changing the width of the video, the leftmost and rightmost columns of every frame should be preserved in the target video. We use the data term to encode such pixel preservation by assigning

  E_d(M_t(p)) = \begin{cases} \infty & \text{if } (x = 0) \text{ and } (t_x \neq 0) \\ \infty & \text{if } (x = W_R) \text{ and } (t_x \neq W_S - W_R) \\ 0 & \text{otherwise} \end{cases}    (4)

where W_S and W_R are the widths of the video frames in the source and the retargeted videos, respectively.

The spatial and temporal naturalities described in Section 2 can be unified into a single pairwise measure. Consider a pair of neighboring pixels (p_i, t_i) and (p_j, t_j), which can be either spatial neighbors (t_i = t_j) or temporal neighbors (p_i = p_j). For (p_j, t_j), the mapping of its neighbor (p_i, t_i) onto the source video is given by

  S_i = S(p_j + \Delta p_{ji} + M_{t_j + \Delta t_{ji}}(p_j + \Delta p_{ji}), t_j + \Delta t_{ji})

where \Delta p_{ji} = (x_i - x_j, y_i - y_j) and \Delta t_{ji} = t_i - t_j. The neighbor of the mapping of (p_j, t_j) that corresponds to (p_i, t_i) is given by

  \hat{S}_i = S(p_j + M_{t_j}(p_j) + \Delta p_{ji}, t_j + \Delta t_{ji}).

Similarly,

  S_j = S(p_i + \Delta p_{ij} + M_{t_i + \Delta t_{ij}}(p_i + \Delta p_{ij}), t_i + \Delta t_{ij}),
  \hat{S}_j = S(p_i + M_{t_i}(p_i) + \Delta p_{ij}, t_i + \Delta t_{ij}),

where \Delta p_{ij} = (x_j - x_i, y_j - y_i) and \Delta t_{ij} = t_j - t_i. Therefore, the smoothness term in equation (3) can measure spatial-temporal naturality as

  E_s(M(p_i, t_i), M(p_j, t_j)) = \min(D(\hat{S}_i, S_i), D(\hat{S}_j, S_j)).

If either D(\hat{S}_i, S_i) = 0 or D(\hat{S}_j, S_j) = 0, this pair of neighboring pixels in the retargeted video is perfectly natural with respect to the source video.

If we wish to eliminate one row/column from the video, the binary 3D graph cut guarantees a globally optimal solution. However, the computational complexity of this method is very high due to the large size of the 3D graph, resulting in high computational time and memory usage. This precludes the application of this method to large video volumes; hence, a more efficient solution is required.

3.2. Incremental 2D Shift Map

Instead of considering the video as a 3D volume, it can be viewed as a collection of frames, each of which is represented as a 2D grid graph. However, if we simply apply the 2D shift map [11] to retarget individual frames independently, the retargeted video will not be temporally smooth and will contain jitter. The temporal information in the source video must be utilized in order to ensure temporal consistency. We propose an incremental solution that retargets video using temporal information so that the result is temporally consistent (smooth). In this scheme, the first frame of the sequence is retargeted using the 2D shift map [11], and the t-th frame is processed based on the retargeted (t−1)-th frame. The temporal consistency is improved by maximizing the temporal naturality between the current retargeted frame and the retargeted result of the previous frame.

Given the shift map of the (t−1)-th frame, the t-th frame is retargeted by finding a minimum cut in an augmented 2D grid graph. Each node of this graph is associated not only with the coordinate shift in the current frame but also with the corresponding shift in the previous frame. The shift map of the previous frame is utilized to constrain the retargeting of the current frame. Specifically, we extend the energy function to consider the temporal naturality.

Energy Function

In the data term of the energy function, in addition to the pixel preservation term of equation (4), we encode the temporal naturality with respect to the shift map of the previous frame. The data term containing the measure of temporal naturality is

  E_d(M_t(p)) = \min(\hat{D}_{t-1}(p), \hat{D}_t(p))    (5)

where

  \hat{D}_{t-1}(p) = D(S(p + M_t(p), t-1), S(p + M_{t-1}(p), t-1)),
  \hat{D}_t(p) = D(S(p + M_t(p), t), S(p + M_{t-1}(p), t)).    (6)

Minimization of the data terms comprising equations (5) and (4) results in the optimal shift map for the t-th frame, which is temporally smooth (natural) with respect to the retargeted (t−1)-th frame.

The spatial naturality between pixel p and its neighbors in the t-th frame is encoded as

  E_s(M_t(p), M_t(p + e_i)) = D(S(p + M_t(p) + e_i, t), S(p + e_i + M_t(p + e_i), t))    (7)

where the e_i are the four unit vectors representing the four spatial neighbors, similar to [11].

While the 3D shift map achieves a globally optimal solution, the 2D shift map can only obtain a local optimum, which depends on the initial guess. If the initial solution in the first frame is far from the global optimum, the retargeting results of the other frames will also be far from the global optimum. However, the incremental 2D approach is much more efficient in terms of computational time and
memory usage, as shown in the experimental results. Since the original 3D problem is simplified into a series of 2D problems, the number of nodes in each graph is much smaller than in the 3D graph.

Figure 2. Comparison of the bands of hybrid shift map and video seam carving [12] on the same grid graph. The shaded area in (a) is the band for hybrid shift map and in (b) is the band for seam carving.

3.3. Hybrid Shift Map as a Hierarchical Solution

An efficient solution for video retargeting is provided by combining the 3D shift map and the 2D shift map in a multi-resolution framework. We build a Gaussian pyramid of the source video. At the lowest resolution, an initial retargeting result for every frame is estimated using the 3D shift map. The global optimality of the 3D shift map guarantees a good initial solution, which is used to constrain the final retargeting result at higher resolution to stay close to this initial guess. At the higher resolutions, we iteratively apply the incremental 2D shift map to obtain the retargeted video: the initial solution at the lower resolution is first interpolated to the higher resolution. Starting from the first frame, the shift map is optimized to refine the interpolated solution on a banded 2D graph incrementally, where the band covers the initial solution as shown in Figure 2. For the comparison between hybrid shift map and video seam carving [12], we use a simple multi-resolution banding method on a standard grid graph. Compared to more advanced multi-resolution banding methods, e.g. [10], the banded graph in Figure 2 is still a grid graph although the band is not minimal.

This multi-resolution framework, denoted the Hybrid Shift Map, improves the computational complexity of retargeting in two respects. First, the 3D shift map is more efficient than 3D seam carving because the proposed graph is much simpler than the seam graph with forward energy: every node in the graph used in the 3D shift map has only 6 edges, while a node in the seam graph has 14 edges. Second, the 3D graph cut at higher resolution is divided into a series of 2D graph cuts, so the computational time increases only linearly with the length of the video sequence. Using the same banding method, the incremental 2D shift map has a narrower band in every frame since it bands the individual seam in the frame. Video seam carving, on the other hand, needs to band the whole 2D manifold interpolated from the lowest resolution. When the manifold has a large variance in the temporal domain, the rectangular band in the 3D graph is larger than that in the hybrid shift map. Figure 2 shows this difference between hybrid shift map and video seam carving using the same banding method. Note that the more advanced banded multi-resolution graph cut technique [10] can also be applied to the hybrid shift map to further reduce complexity.

4. Experimental Evaluation

In this section, we evaluate different properties of the proposed method on real video sequences and compare them with other retargeting methods. Some test video sequences are the same as those used in [12] and some are web videos downloaded from Youtube, which contain large camera/object motion in complex scenes. For all experiments, we use the following setting: a 3-layer Gaussian pyramid is built; the 3D shift map is estimated at the lowest resolution and the individual 2D shift maps are refined at the original resolution incrementally. The parameters are set as α = 1 and β = 1. The distance function is D((p_1, t_1), (p_2, t_2)) = |S(p_1, t_1) − S(p_2, t_2)|, which is simply the grayscale difference of two pixels.

4.1. Temporal Consistency

Compared with the 2D shift map [11], our method retargets video by considering temporal consistency between the source and target videos. Without motion analysis, our method can remove flickering/waving artifacts and generate temporally smooth retargeted video. We compare our method with the naive solution of applying [11] independently on individual frames. Figure 3 (a) shows 4 consecutive frames of a basketball sequence. The corresponding retargeted frames of the proposed hybrid shift map and the naive solution are shown in Figure 3 (b) and (c), respectively. The red curves in every frame are the next optimal seams to be removed. From the figure, we can see that the seams obtained by the hybrid shift map are temporally smooth while those obtained by the naive solution are not. Consequently, the hybrid shift map generates retargeted video that is temporally consistent with the source video; the naive solution using [11] cannot achieve this temporal consistency. Our method enforces not temporal smoothness, as in [12], but temporal naturality, which is adaptive to the video content: when the content is homogeneous, even non-smooth seams can preserve temporal naturality and generate temporally consistent video. Compared to other retargeting algorithms, e.g. [18], which can preserve temporal consistency, our method does not rely on any motion analysis and is robust enough to be applied to challenging videos where motion analysis may be erroneous.
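The per-seam narrowing used throughout these experiments (one binary labeling per pixel of width removed) can be outlined as follows. This is a structural sketch only; `solve_binary_shift_map`, standing in for the graph-cut seam search, is a hypothetical placeholder:

```python
import numpy as np

def remove_one_seam(frame, seam_cols):
    """Delete one monotonic (not necessarily connected) seam: one
    marked column index per row, as produced by a binary shift map."""
    H, W = frame.shape
    out = np.empty((H, W - 1), dtype=frame.dtype)
    for y in range(H):
        c = seam_cols[y]
        out[y] = np.concatenate([frame[y, :c], frame[y, c + 1:]])
    return out

def retarget_width(frames, target_width, solve_binary_shift_map):
    """Shrink every frame (list of 2D arrays) to target_width by
    repeatedly removing one seam per frame. solve_binary_shift_map is
    a stand-in for the graph-cut optimization; it must return, per
    frame, one column index per row."""
    while frames[0].shape[1] > target_width:
        seams = solve_binary_shift_map(frames)   # hypothetical solver
        frames = [remove_one_seam(f, s) for f, s in zip(frames, seams)]
    return frames
```

Because each iteration solves only a binary-labeling problem, the global-optimality guarantee of the graph cut applies to every single-pixel reduction, as discussed in Section 4.3.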
Figure 3. (a) Four consecutive frames of the basketball sequence. Retargeted by (b) hybrid shift map and (c) the naive solution of applying the 2D shift map [11] on every frame independently.

Figure 4. Comparison of hybrid shift map and incremental 2D shift map. (a) Two frames (1st and 98th) of a cycling sequence; corresponding retargeted frames using (b) hybrid shift map and (c) incremental 2D shift map.

Table 1. Computational time for reducing the width by 1 pixel for a 320 × 240 video of 110 frames.

       3D shift map   2D shift map   Hybrid shift map
Time   178s           20s            (9+8)s

4.2. Global vs Local vs Hybrid

Our retargeting framework combines two components: the 3D shift map at the lowest resolution and the incremental 2D shift map at the original resolution. We analyze the retargeting results as well as the computational complexities of our framework and its individual components. Figure 4 shows the retargeted frames obtained from the hybrid shift map and the incremental 2D shift map for two frames of a video sequence. We can see that the retargeted frames output by the incremental 2D shift map (Figure 4 (c)) introduce large distortions within the red box. This is because, without the initialization of the 3D shift map, the incremental method does not consider all frames and can only achieve a local optimum. The hybrid shift map instead uses the 3D shift map at the lowest resolution to constrain every 2D shift map to be close to the global optimum. Hence, the retargeted frames from the hybrid shift map (shown in Figure 4 (b)) are more natural than those from the incremental 2D shift map alone.

In terms of computational complexity, we compare the computational time for reducing the width of the whole video by 1 pixel using the different components. For a 320 × 240 video sequence of 110 frames, the computational times of the different components are summarized in Table 1. The 3D shift map and the incremental 2D shift map are applied at the original resolution. For the hybrid shift map, the 3D shift map is applied on the third layer of the Gaussian pyramid and the incremental 2D shift map is applied at the original resolution. We can see that the proposed hybrid shift map significantly improves the efficiency while preserving the effectiveness. The 3D shift map component (9s) is faster since it is applied at the lowest resolution, and the incremental 2D shift map component (8s) at the original resolution is faster due to the smaller graph for every frame.

4.3. Hybrid Shift Map vs Video Seam Carving

In this section, we compare our proposed method with [12], which improves seam carving for video retargeting. Both methods model the video as graph(s) and use graph cut techniques to iteratively retarget video content. When reducing the width by 1 pixel, each binary shift map actually corresponds to a removed seam. However, the proposed method differs from [12] in several respects. First, the graph constructions are different: [12] models the source video as a graph while our method represents the retargeted video as
Figure 5. Example of a disconnected seam (marked in red).

Figure 6. Comparison of hybrid shift map and seam carving [12]. (a) 3 consecutive frames of a video; (b) retargeted frames of hybrid shift map; (c) retargeted frames of seam carving. The sequences can be seen in the supplementary material.

Figure 7. Enlarging using hybrid shift map. The sequences can be seen in the supplementary material.

Table 2. Computational complexity of hybrid shift map and seam carving.

         Hybrid shift map   Seam carving
Time     (7+8)s             (22+14+54)s
Memory   572Mb              3550Mb

a graph. By modeling the retargeted video, the forward energy of [12] can be easily derived in our method. Since only the binary-labeling problem is guaranteed to achieve the global optimum, we recursively reduce the width by 1 pixel by solving a binary-labeling problem. Second, the properties of the seam removed from every frame are different. In our method, the removed seam is only required to be monotonic and need not be connected. A vertical/horizontal seam is monotonic if only 1 pixel belongs to it in every row/column, respectively. As shown in Figure 5, the connectivity of a seam is automatically controlled by the content: it can be disconnected in homogeneous areas and will be connected in boundary areas. Hence, our method is more flexible than [12]. Figure 6 compares hybrid shift map and seam carving on a video sequence. We can see that the hybrid shift map produces a retargeted video with less distortion than seam carving.

Table 2 summarizes both the time and memory usage when reducing the width by 1 pixel for a 480 × 272 video of 86 frames. For the computational time of the hybrid shift map, the two numbers are the time spent on initialization and refinement, respectively. For video seam carving, the three numbers indicate the time spent at the 3 different resolutions of the Gaussian pyramid. As analyzed in Section 3.3, the hybrid shift map reduces the time spent on both initialization and refinement. From Table 2, we can also see that the hybrid shift map significantly reduces the memory usage compared to the seam carving method. At the higher resolution, the hybrid shift map only needs to maintain in memory a 2D graph corresponding to a narrow band of a single frame, while the seam carving method has to maintain the whole 3D graph. Note that the same multi-resolution banding method is applied to both methods for a fair comparison. The complexity of video seam carving is higher than that reported in [12] because of the banding method. When applying the more advanced banded method [10], the hybrid shift map can further reduce its complexity and remains more efficient than seam carving.

The hybrid shift map can also change the height of a video in a similar way. As illustrated in Figure 7, it can also be used to increase the width of a video sequence. Figure 8 shows more retargeting results on 5 video sequences. We can see that the retargeted videos from the proposed hybrid shift map are visually more natural than those from other methods. Compared to simple scaling and the warping-based method, the seam carving method generates retargeted videos with less distortion. The proposed hybrid shift map outputs even better results. For example, the head and the leg of the player in the two basketball sequences have less distortion than in the seam carving results. For the fourth sequence, the left person is less distorted than in the other methods.

5. Conclusion

In this paper, we introduce a new method, denoted the hybrid shift map, for video retargeting. Without applying any motion analysis, this method can retarget video by maximizing the spatial-temporal naturality between the source and target video. A novel multi-resolution framework is proposed to break the computational bottleneck of video retargeting. Specifically, the 3D shift map is designed to get the
initial solution at the lowest resolution and the incremental 2D shift map is designed to refine the initial solution at the original resolution. Compared with related retargeting methods, the proposed hybrid shift map significantly improves the efficiency in terms of both computational time and memory usage while still retargeting video with spatial-temporal naturality.

Figure 8. More retargeting results on 5 video sequences. The first column shows a sample frame of the source video. The second to fifth columns are the corresponding retargeted frames using the proposed hybrid shift map, seam carving [12], a recent warping-based retargeting method [8] and simple down-scaling, respectively. The sequences can be seen in the supplementary material.

Acknowledgement

This research was supported by the Media Development Authority (MDA) under grant NRF2008IDM-IDM004-032.

References

[1] S. Avidan and A. Shamir. Seam Carving for Content-Aware Image Resizing. SIGGRAPH, 26(3), July 2007.
[2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing. SIGGRAPH, 28(3), August 2009.
[3] B. Chen and P. Sen. Video Carving. In Eurographics, April 2008.
[4] T. Deselaers, P. Dreuw, and H. Ney. Pan, Zoom, Scan - Time-coherent, Trained Automatic Video Cropping. In CVPR, June 2008.
[5] R. Gal, O. Sorkine, and D. Cohen-Or. Feature-aware Texturing. In Eurographics Symposium on Rendering, June 2006.
[6] J.-W. Han, K.-S. Choi, T.-S. Wang, S.-H. Cheon, and S.-J. Ko. Wavelet Based Seam Carving For Content-Aware Image Resizing. In ICIP, November 2009.
[7] H. Huang, T. Fu, P. L. Rosin, and C. Qi. Real-Time Content-Aware Image Resizing. Science in China Series F: Information Science, 52(2), February 2009.
[8] J.-S. Kim, J.-H. Kim, and C.-S. Kim. Adaptive Image and Video Retargeting Based on Fourier Analysis. In CVPR, June 2009.
[9] F. Liu and M. Gleicher. Video Retargeting: Automating Pan-and-Scan. In ACM Multimedia, October 2006.
[10] H. Lombaert, Y. Sun, L. Grady, and C. Xu. A Multilevel Banded Graph Cuts Method for Fast Image Segmentation. In ICCV, October 2005.
[11] Y. Pritch, E. Kav-Venaki, and S. Peleg. Shift-Map Image Editing. In ICCV, September 2009.
[12] M. Rubinstein, A. Shamir, and S. Avidan. Improved Seam Carving for Video Retargeting. SIGGRAPH, 27(3), August 2008.
[13] M. Rubinstein, A. Shamir, and S. Avidan. Multi-operator Media Retargeting. SIGGRAPH, 28(3), August 2009.
[14] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. Cohen. Gaze-based Interaction for Semi-automatic Photo Cropping. In SIGCHI, April 2006.
[15] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani. Summarizing Visual Data Using Bidirectional Similarity. In CVPR, June 2008.
[16] C. Tao, J. Jia, and H. Sun. Active Window Oriented Dynamic Video Retargeting. In ICCV Workshop on Dynamic Vision, October 2007.
[17] S.-F. Wang and S.-H. Lai. Fast Structure-Preserving Image Retargeting. In ICASSP, April 2009.
[18] Y.-S. Wang, H. Fu, O. Sorkine, T.-Y. Lee, and H.-P. Seidel. Motion-Aware Temporal Coherence for Video Resizing. SIGGRAPH ASIA, 28(5), December 2009.
[19] Y.-S. Wang, C.-L. Tai, O. Sorkine, and T.-Y. Lee. Optimized Scale-and-Stretch for Image Resizing. SIGGRAPH ASIA, 27(5), December 2008.
[20] L. Wolf, M. Guttmann, and D. Cohen-Or. Non-homogeneous Content-driven Video-Retargeting. In ICCV, October 2007.
[21] X. Xie, H. Liu, W.-Y. Ma, and H.-J. Zhang. Browsing Large Pictures under Limited Display Sizes. IEEE Transactions on Multimedia, 8(4), August 2006.
