
Layered Representation for Motion Analysis

John Y. A. Wang
The MIT Media Laboratory
Dept. Elec. Eng. and Comp. Sci.
Massachusetts Institute of Technology
Cambridge, MA 02139

Edward H. Adelson
The MIT Media Laboratory
Dept. Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

Standard approaches to motion analysis assume that the optic flow is smooth; such techniques have trouble dealing with occlusion boundaries. The most popular solution is to allow discontinuities in the flow field, imposing the smoothness constraint in a piecewise fashion. But there is a sense in which the discontinuities in flow are artifactual, resulting from the attempt to capture the motion of multiple overlapping objects in a single flow field. Instead we can decompose the image sequence into a set of overlapping layers, where each layer's motion is described by a smooth flow field. The discontinuities in the description are then attributed to object opacities rather than to the flow itself, mirroring the structure of the scene. We have devised a set of techniques for segmenting images into coherently moving regions using affine motion analysis and clustering techniques. We are able to decompose an image into a set of layers along with information about occlusion and depth ordering. We have applied the techniques to the "flower garden" sequence. We can analyze the scene into four layers, and then represent the entire 90-frame sequence with a single image of each layer, along with associated motion parameters.

1 Introduction

Occlusions represent one of the difficult problems in motion analysis. Smoothing is necessary in order to derive reliable flow fields, but when smoothing occurs across boundaries the result is a flow field that is simply incorrect. Various techniques have been devised to allow for motion discontinuities, but none are entirely satisfactory. In addition, transparency due to various sources (including motion blur) can make it meaningless to assign a single motion vector to a single point. It is helpful to reconsider this problem from a different point of view.

Consider an image that is formed by one opaque object moving in front of a background. In Figure 1, this is illustrated with a moving hand in front of a stationary checkerboard. The first row shows the objects that compose the scene; the second row shows the image sequence that will result. An animation system - whether traditional cel animation or modern digital compositing - can generate this sequence by starting with an image of the background, an image of the hand, an opacity map (known as a "matte" or an "alpha channel") for the hand, motion fields for the hand and the background, and finally the rules of image formation.

The resulting image sequence will pose challenges for standard motion analysis because of the occlusion boundaries. But in principle we should be able to retrieve the same simple description of the sequence that the animator used in generating it: an opaque hand moving smoothly in front of a background. The desired description for the hand is shown in the third row of Figure 1; it involves an intensity map, an opacity map, and a warp map. The background (not shown) would also be extracted. Having accomplished this decomposition we could transmit the information very efficiently and could then resynthesize the original sequence, as shown in the bottom row. In addition, the description could be an important step on the way to a meaningful object-based description of the scene, rather than a mere description of a flow field.

Adelson [1] has described a general framework for "layered image representation," in which image sequences are decomposed into a set of layers ordered in depth along with associated maps defining their motions, opacities, and intensities. Given such a description, it is straightforward to synthesize the image sequence using standard techniques of warping and compositing. The challenge is to achieve the description starting with an image sequence from a natural scene.
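The synthesis direction can be made concrete with a small sketch. This is our own illustration, not code from the paper: each layer is given as an (intensity, alpha, translation) triple, the layers are ordered back to front, and they are combined with the standard "over" compositing rule. A pure translation stands in for the general warp map to keep the example short.

```python
import numpy as np

def synthesize_frame(layers):
    """Composite depth-ordered layers, back to front, with the 'over' rule.

    Each layer is (intensity, alpha, (dy, dx)): an intensity map, an
    opacity map, and, in this sketch, a pure-translation warp.
    """
    out = None
    for intensity, alpha, (dy, dx) in layers:
        img = np.roll(intensity, (dy, dx), axis=(0, 1))  # warp the layer
        a = np.roll(alpha, (dy, dx), axis=(0, 1))        # warp its matte
        out = img if out is None else a * img + (1.0 - a) * out
    return out

# A checkerboard background with an opaque "hand" patch moving over it.
bg = np.indices((8, 8)).sum(axis=0) % 2 * 1.0
hand = np.full((8, 8), 0.5)
alpha = np.zeros((8, 8))
alpha[2:5, 2:5] = 1.0
frame = synthesize_frame([(bg, np.ones((8, 8)), (0, 0)),
                          (hand, alpha, (0, 1))])
```

Where the shifted matte is 1, the frame shows the hand; elsewhere the checkerboard shows through, which is exactly the occlusion behavior the flow-field description struggles to capture.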

1063-6919/93 $03.00 © 1993 IEEE

In other words: rendering is easy, but vision is difficult, as usual. In this paper we describe some techniques that are helpful in accomplishing the vision side of the procedure.

[Figure 1 panels: Moving hand; Background | Intensity map; Opacity map; Warp map | Synthesized: Frame 1, Frame 2, Frame 3]

Figure 1: This figure shows the decomposition of an image sequence consisting of a hand moving in front of a checkerboard background. The conventional method of representing motion is by a dense motion field with motion discontinuities at object boundaries. The layered representation describes these objects with smooth motions and discontinuities in opacity. The apparent motion discontinuities result when the layers are composited according to the occlusion relationship between objects.

2 Image analysis

Analysis of the scene into the layered representation requires grouping the points in the image into multiple regions, where each region undergoes a smooth motion. However, multiple motion estimation and segmentation is a difficult problem that involves a simultaneous estimation of the object boundary and motion. Without knowledge of the object boundaries, motion estimation will incorrectly apply the image constraints across multiple objects. Likewise, object boundaries are difficult to determine without some estimation of motion.

Recent works by [7, 2, 9] have shown that the affine motion model provides a good approximation of 3-D moving objects. Since the motion model used in the analysis will determine the descriptiveness of the representation, we use the affine motion model in our layered representation to describe a wide range of motions commonly encountered in image sequences. These motions include translation, rotation, zoom, and shear. Affine motion is parameterized by six parameters as follows:

Vx(x, y) = ax0 + ax1 x + ax2 y    (1)

Vy(x, y) = ay0 + ay1 x + ay2 y    (2)

where at each point (x, y), Vx(x, y) and Vy(x, y) are the x and y components of velocity respectively, and the ak's are the affine motion parameters.

3 Implementation

Typical methods in multiple affine motion estimation use an iterative motion estimation technique to detect multiple affine motion regions in the scene. At each iteration, these methods assume that a dominant motion region can be detected and eliminated from subsequent analysis. Estimation of these regions involves global estimation using a single motion model, and thus often results in accumulating data from multiple objects.

Our implementation of multiple motion estimation is similar to the robust techniques presented by [3, 4, 5]. We use a gradual migration from a local motion representation to a global object motion representation. By performing optic flow estimation followed by affine estimation, instead of a direct global affine motion estimation, we can minimize the problems of multiple objects within our analysis region. The layer's image and opacity map are obtained by integrating the motion and regions over time. Our analysis of an image sequence into layers consists of three stages: 1) local motion estimation; 2) motion-based segmentation; and 3) object image recovery.
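As an illustration, the six-parameter model of Equations 1 and 2 transcribes directly into code; the function and variable names below are ours, not the paper's.

```python
import numpy as np

def affine_flow(a, h, w):
    """Evaluate the six-parameter affine motion model of Eqs. 1 and 2.

    a = (ax0, ax1, ax2, ay0, ay1, ay2); returns the velocity components
    Vx and Vy on an h x w pixel grid.
    """
    ax0, ax1, ax2, ay0, ay1, ay2 = a
    y, x = np.mgrid[0:h, 0:w].astype(float)
    vx = ax0 + ax1 * x + ax2 * y
    vy = ay0 + ay1 * x + ay2 * y
    return vx, vy

# A small zoom about the origin: velocity grows linearly with x and y.
vx, vy = affine_flow((0.0, 0.1, 0.0, 0.0, 0.0, 0.1), 4, 4)
```

Pure translation uses only ax0 and ay0; the linear terms ax1, ax2, ay1, ay2 carry the rotation, zoom, and shear components.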


3.1 Motion segmentation

Our motion segmentation algorithm is illustrated in Figure 2. The segmentation algorithm is divided into two primary steps: 1) local motion estimation and 2) affine motion segmentation. Multiple affine motions are estimated within subregions of the image, and coherent motion regions are determined based on the estimated affine models. By iteratively updating the affine models and the regions, this architecture minimizes the problem of integrating data across object boundaries.

[Figure 2 panels: initial conditions; local motion estimation; region assignment; region masks and affine motion parameters]

Figure 2: This figure shows the technique used in motion segmentation. Affine motion models are determined by regression on the dense motion fields, and the regions are assigned to minimize the error between the motion expected by the models and the estimated dense motion.

Our local motion estimation is obtained with a multi-scale coarse-to-fine algorithm based on a gradient approach described by [8]. Since only one motion is visible at any point when dealing with opaque objects, the single motion model assumed in the optic flow estimation is acceptable. The multi-scale implementation allows for estimation of large motions. When analyzing scenes exhibiting transparent phenomena, the motion estimation technique described by Shizawa and Mase [10] may be suitable.

Motion segmentation is obtained by iteratively refining the estimates of affine motions and the corresponding regions. We estimate the affine parameters within each subregion of the image by standard regression techniques on the local motion field. This estimation can be seen as a plane fitting algorithm in the velocity space, since the affine model is a linear model of local motion. The regression is applied separately on each velocity component since the components are independent. If we let Hi = [Hxi Hyi] be the ith hypothesis vector in the affine parameter space, with components Hxiᵀ = [ax0i ax1i ax2i] and Hyiᵀ = [ay0i ay1i ay2i] corresponding to the x and y components, and φᵀ = [1 x y] be the regressor, then the affine equations 1 and 2 become:

Vx(x, y) = φᵀ Hxi,    (3)

Vy(x, y) = φᵀ Hyi,    (4)

and a linear least squares estimate of Hi for a given local motion field is as follows:

[Hxi Hyi] = [Σ_Pi φ φᵀ]⁻¹ (Σ_Pi φ [Vx(x, y) Vy(x, y)])    (5)

The summation is taken over Pi, corresponding to the ith subregion in the image.

We avoid estimating motion across object boundaries by initially using small arbitrary subregions within the image to obtain a set of hypotheses of likely affine motions exhibited in the image. Many of these hypotheses will be incorrect because these initial subregions may contain object boundaries. We identify these hypotheses by their large residual error and eliminate them from our analysis.

However, motion estimates from patches that cover the same object will have similar parameters. These are grouped in the affine motion parameter space with a k-means clustering algorithm described in [11]. In the clustering process, we derive a representative model for each group of similar models. The model clustering produces a set of likely affine motion models that are exhibited by objects in the scene.

Next, we use hypothesis testing with the motion models to reassign the regions. We use a simple cost function, C(i(x, y)), that minimizes the velocity errors between the local motion estimates and the expected motion described by the affine models. This cost function is summarized as follows:

C(i(x, y)) = Σ [V(x, y) - V_Hi(x, y)]²    (6)

where i(x, y) indicates the model that location (x, y) is assigned to, V(x, y) is the estimated local motion field, and V_Hi(x, y) is the affine motion field corresponding to the ith hypothesis. Since each location is assigned to only one of the hypotheses, we obtain the minimum total cost by minimizing the cost at each location. We summarize the assignment in the following equation:

i0(x, y) = arg min_i [V(x, y) - V_Hi(x, y)]²    (7)

where i0(x, y) is the minimum cost assignment.
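The regression of Equation 5 and the assignment rule of Equation 7 can be sketched as follows. This is a simplified illustration with our own names; the paper's full procedure additionally prunes high-residual hypotheses and clusters the surviving models with k-means.

```python
import numpy as np

def fit_affine(vx, vy, mask):
    """Least-squares affine parameters (Eq. 5) from a local flow field,
    using only the pixels where mask is True (the subregion Pi)."""
    y, x = np.mgrid[0:vx.shape[0], 0:vx.shape[1]]
    # Regressor phi = [1 x y], one row per pixel in the subregion.
    phi = np.stack([np.ones(mask.sum()), x[mask], y[mask]], axis=1)
    hx, *_ = np.linalg.lstsq(phi, vx[mask], rcond=None)
    hy, *_ = np.linalg.lstsq(phi, vy[mask], rcond=None)
    return np.concatenate([hx, hy])  # (ax0, ax1, ax2, ay0, ay1, ay2)

def assign_regions(vx, vy, models):
    """Per-pixel hypothesis assignment (Eq. 7): pick the affine model
    whose predicted velocity is closest to the estimated local motion."""
    h, w = vx.shape
    y, x = np.mgrid[0:h, 0:w]
    costs = [(vx - (ax0 + ax1 * x + ax2 * y)) ** 2 +
             (vy - (ay0 + ay1 * x + ay2 * y)) ** 2
             for ax0, ax1, ax2, ay0, ay1, ay2 in models]
    return np.argmin(np.stack(costs), axis=0)

# Two half-frames translating in opposite directions.
vx = np.where(np.arange(8) < 4, 1.0, -1.0) * np.ones((4, 8))
vy = np.zeros((4, 8))
labels = assign_regions(vx, vy, [(1, 0, 0, 0, 0, 0), (-1, 0, 0, 0, 0, 0)])
```

Each velocity component is fit independently, mirroring the paper's observation that the x and y regressions decouple.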


Regions that are not easily described by any of the models are unassigned. These regions usually occur at object boundaries because the assumptions used by the optic flow estimation are violated. We assign these regions by warping the images according to the affine motion models and selecting the model that minimizes the error in intensity between the pair of images.

We now define the binary region masks that describe the support regions for each of the affine hypotheses as:

Mi(x, y) = 1 if i0(x, y) = i, and 0 otherwise.    (8)

These region masks allow us to identify the object regions and to refine our affine motion estimates in the subsequent iterations according to Equation 5.

As we perform more iterations, we obtain more accurate motion segmentation because the affine motion estimation is performed within single motion regions. Convergence is obtained when only a few points are reassigned or when the number of iterations reaches the maximum allowed. Models that have small support regions are eliminated because their affine parameters will be inaccurate in these small regions.

We maintain the temporal coherence and stability of the segmentation by using the current motion segmentation results as initial conditions for segmentation on the next pair of frames. Since an object's shape and motion change slowly from frame to frame, the segmentation results between consecutive frames are similar and require fewer iterations for convergence. When the motion segmentation on the entire sequence is completed, each object will have a region mask and an affine motion description for each frame of the sequence.

3.2 Analysis of layers

The images of the corresponding regions in the different frames differ only by an affine transformation. By applying these transformations to all the frames, we align the corresponding regions in the different frames. When the motion parameters are accurately estimated, objects will appear stationary in the motion compensated sequence. The layer images and opacity maps are derived from these motion compensated sequences.

However, some of the images in the compensated sequence may not contain a complete image of the object because of occlusions. Additionally, an image may have small intensity variations due to different lighting conditions. In order to recover the complete representative image and boundary of the object, we collect the data available at each point in the layer and apply a median operation on the data. This operation can be easily seen as a temporal median filtering operation on the motion compensated sequence in regions defined by the region masks. Earlier studies have shown that motion compensated median filtering can enhance noisy images and preserve edge information better than a temporal averaging filter [6].

Finally, we determine the occlusion relationships. For each location of each layer, we tabulate the number of corresponding points used in the median filtering operation. These images are warped to their respective positions in the original sequence according to the estimated affine motions, and the values are compared at each location. A layer that is derived from more points occludes a layer that is derived from fewer points, since an occluded region necessarily has fewer corresponding points in the recovery stage. Thus the statistics from the motion segmentation and temporal median filtering provide the necessary description of the object motion, texture pattern, opacity, and occlusion relationship.

Our modular approach also allows us to easily incorporate other motion estimation and segmentation algorithms into a single robust framework.

4 Experimental results

We implemented the image analysis technique on a SUN workstation and use the first 30 frames of the MPEG "flower garden" sequence to illustrate the analysis, the representation, and the synthesis. Three frames of the sequence, frames 0, 15 and 30, are shown in Figure 3. In this sequence, the tree, flower bed, and row of houses move towards the left but at different velocities. Regions of the flower bed closer to the camera move faster than the regions near the row of houses in the distance.

Optic flow obtained with a multi-scale coarse-to-fine gradient method on a pair of frames is shown on the left in Figure 4. The initial regions used for the segmentation consisted of 215 square regions. Notice the poor motion estimates along the occlusion boundaries of the tree, as shown by the different lengths of the arrows and the arrows pointing upwards. In the same figure, the result of the affine motion segmentation is shown in the middle. The affine motion regions are depicted by different gray levels, and the darkest regions along the edges of the tree correspond to regions where the local motion could not be accurately described by any of the affine models. Region assignment based on warping the images and minimizing intensity error reassigns these regions and is shown on the right.
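The layer recovery of Section 3.2 reduces to a per-pixel temporal median over the motion-compensated frames, together with the count map used for depth ordering. A minimal sketch, assuming the frames have already been warped into layer coordinates and masked by the segmentation (names are ours):

```python
import numpy as np

def recover_layer(aligned_frames, support_masks):
    """Per-pixel temporal median over motion-compensated frames.

    aligned_frames: (T, H, W) frames already warped into layer coordinates.
    support_masks:  (T, H, W) booleans marking pixels assigned to this layer.
    Returns the layer image and the count map used for depth ordering:
    a layer recovered from more points occludes one recovered from fewer.
    """
    samples = np.where(support_masks, aligned_frames, np.nan)
    counts = support_masks.sum(axis=0)
    layer = np.nanmedian(samples, axis=0)  # NaN wherever no frame contributes
    return layer, counts

# Three aligned frames of a 2x2 layer; pixel (0, 0) is occluded in the last.
frames = np.stack([np.full((2, 2), 5.0),
                   np.full((2, 2), 5.0),
                   np.full((2, 2), 9.0)])
masks = np.ones((3, 2, 2), dtype=bool)
masks[2, 0, 0] = False
layer, counts = recover_layer(frames, masks)
```

The median rejects the outlying sample at the unoccluded pixels, and the lower count at the occluded pixel is precisely the statistic used to order the layers in depth.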


Our analysis decomposed the image into four primary regions: tree, house, flower-bed and sky. Affine parameters and the support regions were obtained for the entire sequence, and the layer images for the four objects obtained by motion compensated temporal median filtering are shown in Figure 5. We use frame 15 as the reference frame for the image alignment. The occluding tree has been removed, and the occluded regions are recovered in the flower-bed layer and the house layer. The sky layer is not shown. Regions with no texture, such as the sky, cannot be readily assigned to a layer since they contain no motion information. We assign these regions to a single layer that describes stationary textureless objects.

We can recreate the entire image sequence from the layer images of Figure 5, along with the occlusion information, the affine parameters that describe the object motion, and the stationary layer. Figure 6 shows three synthesized images corresponding to the three images in Figure 3. The objects are placed in their respective positions, and occlusion of the background by the tree is correctly described by the layers. Figure 7 shows the corresponding frames synthesized without the tree layer. Uncovered regions are correctly recovered because our layered representation maintains a description of motion in these regions.

5 Conclusions

We employ a layered image motion representation that provides an accurate description of motion discontinuities and motion occlusion. Each occluding and occluded object is explicitly represented by a layer that describes the object's motion, texture pattern, shape, and opacity. In this representation, we describe motion discontinuities as discontinuities in object surface opacity rather than discontinuities in the actual object motion.

To achieve the layered description, we use a robust motion segmentation algorithm that produces stable image segmentation and accurate affine motion estimation over time. We deal with the many problems in motion segmentation by appropriately applying the image constraints at each step of our algorithm. We initially estimate the local motion within the image, then iteratively refine the estimates of the object's shape and motion. A set of likely affine motion models exhibited by objects in the scene are calculated from the local motion data and used in a hypothesis testing framework to determine the coherent motion regions. Finally, the temporal coherence of object shape and texture pattern allows us to produce a description of the object image, boundary and occlusion relationship. Our approach provides useful tools in image understanding and object tracking, and has potential as an efficient model for image sequence coding.

Acknowledgements

This research was supported in part by a contract with SECOM Co., and Goldstar Co., Ltd.

References

[1] E. H. Adelson, Layered representation for image coding, Technical Report No. 181, Vision and Modeling Group, The MIT Media Lab, December 1991.

[2] J. R. Bergen, P. J. Burt, R. Hingorani, and S. Peleg, Computing two motions from three frames, International Conference on Computer Vision, 1990.

[3] M. J. Black and P. Anandan, Robust dynamic motion estimation over time, Proc. IEEE Computer Vision and Pattern Recognition 91, pp. 296-302, 1991.

[4] T. Darrell and A. Pentland, Robust estimation of a multi-layered motion representation, IEEE Workshop on Visual Motion, pp. 173-178, Princeton, 1991.

[5] R. Depommier and E. Dubois, Motion estimation with detection of occlusion areas, Proc. IEEE ICASSP 92, Vol. 3, pp. 269-273, San Francisco, March 1992.

[6] T. S. Huang and Y. P. Hsu, "Image Sequence Enhancement," Image Sequence Analysis, Editor T. S. Huang, pp. 289-309, Springer-Verlag, 1981.

[7] M. Irani and S. Peleg, Image sequence enhancement using multiple motions analysis, Proc. IEEE Computer Vision and Pattern Recognition 92, pp. 216-221, Champaign, June 1992.

[8] B. Lucas and T. Kanade, An iterative image registration technique with an application to stereo vision, Image Understanding Workshop, pp. 121-130, April 1981.

[9] S. Negahdaripour and S. Lee, Motion recovery from image sequences using first-order optical flow information, Proc. IEEE Workshop on Visual Motion 91, pp. 132-139, Princeton, 1991.

[10] M. Shizawa and K. Mase, A unified computational theory for motion transparency and motion boundaries based on eigenenergy analysis, Proc. IEEE Computer Vision and Pattern Recognition 91, pp. 296-302, 1991.

[11] C. W. Therrien, Decision Estimation and Classification, John Wiley and Sons, New York, 1989.


Figure 3: Frames 0, 15, and 30 of the MPEG "flower garden" sequence.

Figure 4: Affine motion segmentation of optic flow.

Figure 5: Images of the flower bed, houses, and tree. Affine motion fields are also shown here.

Figure 6: Corresponding frames of Figure 3 synthesized from the layer images in Figure 5.

Figure 7: Corresponding frames of Figure 3 synthesized without the tree layer.

