WACV18 Arango ME Spotting Riesz
Carlos Arango Duque Olivier Alata Rémi Emonet Anne-Claire Legrand Hubert Konik
Univ Lyon, UJM-Saint-Etienne, CNRS, IOGS, Laboratoire Hubert Curien UMR 5516
F-42023, Saint-Etienne, France
carlos.arango.duque@univ-st-etienne.fr
the Riesz pyramid was proposed, which produced motion-magnified videos of comparable quality to the previous one, but the videos can be processed in one quarter of the time, making it more suitable for real-time or online processing applications [16, 18].

Although video magnification is a powerful tool to magnify subtle motions in videos, it does not precisely indicate the moment when these motions take place. However, the filtered quaternionic phase difference obtained during the Riesz magnification approach [18] seems to be a good proxy for motion, so it could potentially be used for temporally segmenting subtle motions. In this section, we introduce the Riesz pyramid, its quaternionic representation and quaternionic filtering.

2.1. Riesz Monogenic Signal

The Riesz pyramid is constructed by first breaking the input image into non-oriented subbands using an efficient, invertible replacement for the Laplacian pyramid, and then taking an approximate Riesz transform of each band. The key insight into why this representation can be used for motion analysis is that the Riesz transform is a steerable Hilbert transformer, which allows us to compute a quadrature pair that is 90 degrees out of phase with respect to the dominant orientation at every pixel. This allows us to phase-shift and translate image features only in the direction of the dominant orientation at every pixel.

Following [14], in two dimensions, the Riesz transform is a pair of filters with transfer functions

    −i ωx / ‖ω‖ ,   −i ωy / ‖ω‖    (1)

with ω = [ωx, ωy] being the signal dimensions in the frequency domain. If we filter a given image subband I using Eq. 1, the result is the pair of filter responses (R1, R2). The input I and Riesz transform (R1, R2) together form a triple (the monogenic signal) that can be converted to spherical coordinates to yield the local amplitude A, local orientation θ and local phase φ using the equations

    I  = A cos(φ)
    R1 = A sin(φ) cos(θ)    (2)
    R2 = A sin(φ) sin(θ)

2.2. Quaternion Representation of Riesz Pyramid

The Riesz pyramid coefficient triplet (I, R1, R2) can be represented as a quaternion r, with the original subband I being the real part and the two Riesz transform components (R1, R2) being the imaginary i and j components of the quaternion:

    r = I + iR1 + jR2    (3)

The previous equation can be rewritten using (2) as:

    r = A cos(φ) + iA sin(φ) cos(θ) + jA sin(φ) sin(θ)    (4)

However, the decomposition proposed by (4) is not unique: both (A, φ, θ) and (A, −φ, θ + π) are possible solutions. This can be solved if we consider

    φ cos(θ) ,  φ sin(θ)    (5)

which are invariant to this sign ambiguity. If the Riesz pyramid coefficients are viewed as a quaternion, then Eq. 5 is the quaternion logarithm of the normalized coefficient¹.

¹ An extended review of the quaternionic representation, including complex exponentiation and logarithms, can be found in [18].
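As a concrete illustration, the frequency-domain Riesz transform of Eq. 1 and the spherical conversion of Eq. 2 can be sketched in a few lines of numpy. This is only our own minimal sketch: the paper builds the pyramid with the efficient approximate spatial-domain transform of [16], whereas here the exact transfer functions are applied by FFT to a single subband; the function name and numerical details are assumptions, not the authors' implementation.

```python
import numpy as np

def riesz_monogenic(subband):
    """Apply the Riesz transform of Eq. 1 to one pyramid subband and
    convert the monogenic triple (I, R1, R2) to local amplitude A,
    orientation theta and phase phi as in Eq. 2."""
    h, w = subband.shape
    # Frequency grids (wx, wy) in FFT ordering.
    wy = np.fft.fftfreq(h).reshape(-1, 1)
    wx = np.fft.fftfreq(w).reshape(1, -1)
    norm = np.sqrt(wx**2 + wy**2)
    norm[0, 0] = 1.0                       # avoid division by zero at DC
    F = np.fft.fft2(subband)
    # Transfer functions -i*wx/|w| and -i*wy/|w| (Eq. 1).
    R1 = np.real(np.fft.ifft2(-1j * wx / norm * F))
    R2 = np.real(np.fft.ifft2(-1j * wy / norm * F))
    A = np.sqrt(subband**2 + R1**2 + R2**2)            # local amplitude
    theta = np.arctan2(R2, R1)                         # local orientation
    phi = np.arccos(np.clip(subband / np.maximum(A, 1e-12), -1.0, 1.0))
    return A, theta, phi                               # phi: local phase
```

Note that, by construction, A cos(φ) recovers the input subband exactly, which is the first line of Eq. 2.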
The local amplitude A and quaternionic phase defined in Eq. 5 are computed as:

    A = ‖r‖    (6)
    iφ cos(θ) + jφ sin(θ) = log(r / ‖r‖)    (7)

In previous Eulerian motion amplification papers, motions of interest were isolated and denoised with temporal filters. However, the quaternionic phase cannot be naively filtered, since it is a wrapped quantity. Therefore, a technique based on [4] is used to filter a sequence of unit quaternions by first unwrapping the quaternionic phases in time and then applying a linear time-invariant (LTI) filter [18] to isolate motions of interest in the quaternionic phase. Furthermore, the signal-to-noise ratio (SNR) of the phase signal can be increased by spatially denoising each frame with an amplitude-weighted spatial blur, using a Gaussian kernel with standard deviation ρ on the i and j components of the temporally filtered signal. In the following section, we show how the phase signal can be used to spot MEs.

3. Proposed Framework

Our proposed algorithm goes as follows. First, we detect and track stable facial landmarks through the video. Secondly, we use the Riesz pyramid to calculate the amplitude and quaternionic phase of the stabilized face images, and we implement a spatio-temporal filtering scheme which can enhance motions of interest without producing delays or undesired artifacts. Thirdly, we isolate areas of potential subtle motions based on amplitude and facial landmarks. Finally, we measure the dissimilarities of quaternionic phases over time and transform them into a series of 1-D signals, which are used to estimate the apex frame of the MEs. A graphic representation of our framework can be seen in Fig. 1.

3.1. Face Detection and Registration

We start by detecting the face in the first frame using the cascade object detector of Viola and Jones [15]. Then, we use the active appearance model (AAM) proposed by Tzimiropoulos and Pantic [13] to detect a set of facial landmarks. Next, we select certain facial landmarks which won't move during facial expressions (we selected the inner corners of the eyes and the lower point of the nose between the nostrils). We track these points using the Kanade-Lucas-Tomasi (KLT) algorithm [12] and use them to realign the face over time. Finally, we use these landmarks to crop the aligned facial region of interest for each frame in the video (see Fig. 1(b)).

3.2. Riesz Transform and Filtering

For the sequence of N cropped frames obtained in Sec. 3.1, we perform the process described in Sec. 2.1 and Sec. 2.2 for each frame n ∈ [1, . . . , N]. However, depending on the spatial resolution of the cropped faces and the speed of the image acquisition system, the local phase computed from certain frequency subbands will contribute more to the construction of a certain motion than the ones from other levels. In other words, not all levels of the pyramid are able to provide useful information about the subtle motion. Thus, we must test the quaternionic phase from different subbands (pyramid levels) and select the level which best represents the subtle motion we want to detect.

We then obtain both the local amplitude An and the quaternionic phase (φn cos(θ), φn sin(θ)) from the selected subband, and apply the process described in Sec. 2.2 to obtain the filtered quaternionic phase φ. However, since we aim to detect any significant quaternionic phase shift between frames, and to compensate for the cumulative sum performed during quaternionic phase unwrapping [18], we calculate the difference of two consecutive filtered quaternionic phases:

    ∆φn u = φn u − φn−1 u    (8)

where u = i cos θ + j sin θ.

We must also consider what kind of temporal filter to implement for our application. Previous works in Eulerian motion magnification (see Sec. 2.2) have given their users freedom to choose any available temporal filtering method. However, since we need to pinpoint the exact moment when a subtle motion is detected, we cannot use traditional causal filters, which delay the signal response (for example, Fig. 2c shows the delay caused by the filtering scheme used in [16]). Therefore, we propose to use a digital non-causal zero-phase finite impulse response (FIR) filter:

    Φn u = b0 ∆φn u + Σ_{k=1}^{p} bk (∆φ_{n+k} u + ∆φ_{n−k} u)    (9)

where p is an even number and bk is a coefficient of a FIR filter of length 2p + 1. One limitation of this method is that non-causal filters require the previous and following p frames around the current frame (therefore, for online applications there must be a delay of at least p frames).

Another element to consider is that Eulerian amplification methods are tailored for a particular task. These methods focus on amplifying subtle periodic movements (such as human breathing, the vibration of an engine, the oscillations of a guitar string, etc.) by temporally band-passing some potential movements. However, these methods do not consider subtle non-periodic movements (such as blinking or facial MEs). The latter type of motion, when band-passed, creates large oscillations near the beginning and the end of the subtle motion (Fig. 2d), as predicted by the Gibbs phenomenon. Therefore, we decided to use low-pass filtering for this type of signal (Fig. 2e).
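The per-frame phase difference of Eq. 8 followed by the zero-phase FIR smoothing of Eq. 9 can be sketched as below. This is a hedged illustration: the paper does not fix the coefficients bk, so a normalized Hann window is assumed here, and reflective padding at the sequence ends is our own choice rather than something specified in the text.

```python
import numpy as np

def zero_phase_lowpass(phi, p=4):
    """Sketch of Eqs. 8-9: difference consecutive filtered phases, then
    smooth with a symmetric (zero-phase) FIR filter of length 2p+1.
    `phi` is a 1-D array of the per-frame phase projected on u.
    The b_k taps are a normalized Hann window (an assumption; the paper
    leaves the FIR design open)."""
    dphi = np.diff(phi)                        # Eq. 8: Delta-phi_n
    taps = np.hanning(2 * p + 3)[1:-1]         # 2p+1 positive symmetric taps
    taps /= taps.sum()                         # unit DC gain
    padded = np.pad(dphi, p, mode="reflect")   # define the first/last p frames
    return np.convolve(padded, taps, mode="valid")  # Eq. 9: no phase delay
```

Because the taps are symmetric about the current frame, a phase step is smoothed without being shifted in time, which is what lets the spotting stage localize subtle motions precisely; the price, as noted above, is a latency of at least p frames for online use.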
[Figure: panels (a), (b), (c) — average AUC values over the tested parameter ranges.]
that occur within a small window of time. Considering that some videos in the SMIC-HS dataset contain multiple MEs while all videos in the CASME II dataset contain only one ME, this would explain why increasing the value of ψ affects the results for SMIC-HS but not those for CASME II.

After analyzing our results, we estimate that, since both evaluated datasets have the same spatial resolution (640 × 480 pixels), we can use the same pyramid level for both and can establish a robust value range for β (between 0.05 and 0.25). Furthermore, this range for β is consistent with the values obtained by our cross-validation scheme. We also estimate that a robust value of ψ for both datasets should correspond to at least 50 milliseconds. Moreover, we conclude that the pyramid level and β are the parameters with the most impact on the performance of our system.

4.5. Discussion

Although both databases are similar, SMIC-HS contains longer video clips, making the spotting task more difficult and more prone to false positives. We suspect this might be the reason why a higher AUC is achieved on the CASME II database. Furthermore, the subjects in the CASME II dataset were closer to the camera during video recording, so the captured faces have a higher resolution, which shifts the ME motion to lower frequencies. This might explain why we obtained good results using the fourth level of the Riesz pyramid on the CASME II dataset but not on the SMIC-HS dataset. A scale normalization of the captured faces could allow us to fix the pyramid level selection, regardless of the database.

Upon detailed examination of the spotting results, we found that a large portion of false negatives were caused by our algorithm dismissing true MEs as eye movements. This might happen because our system does not differentiate eye blinks from eye-gaze changes, thus discarding MEs that occur simultaneously with eye-gaze changes.

One of the main challenges in comparing our ME spotting method with the state of the art is that there is no single standard performance metric. For example, [7] evaluate their work using mean absolute error and standard error. [1] also use ROC curves to evaluate their work but do not disclose how they compute their false positives. Although our evaluation method is very similar to the one by [5], the false positive penalties were different. This is why the initial AUCs obtained at the beginning of Sec. 4.3 differ from those in our comparative Table 1: the first tests had higher false positive penalties. These penalties were based on an assumed duration of MEs; however, since micro-expressions have different durations (as discussed in Sec. 3.4), a different approach should be considered for computing false positives.

5. Conclusions

We presented a facial micro-expression spotting method based on the Riesz pyramid. Our method adapts the quaternionic representation of the Riesz monogenic signal by proposing a new filtering scheme. We are also able to mask regions of interest where subtle motion might take place in order to reduce the effect of noise, using facial landmarks and the image amplitude. Additionally, we propose a methodology that separates real micro-expressions from subtle eye movements, decreasing the number of possible false positives. Experiments on two different databases show that our method surpasses other methods from the state of the art. Moreover, the results of our parameter analysis suggest that the method is robust to changes in parameters. Furthermore, the quaternionic representation of phase and orientation from the Riesz monogenic signal could potentially be exploited in the future for a more general subtle motion detection scheme and for facial micro-expression recognition.

Acknowledgements

This work has been supported by a research grant from the Foundation of Jean Monnet University (UJM).
References

[1] A. K. Davison, C. Lansley, C. C. Ng, K. Tan, and M. H. Yap. Objective Micro-Facial Movement Detection Using FACS-Based Regions and Baseline Evaluation. arXiv preprint arXiv:1612.05038, 2016.
[2] P. Ekman and E. L. Rosenberg. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, Apr. 2005.
[3] D. H. Kim, W. J. Baddar, and Y. M. Ro. Micro-Expression Recognition with Expression-State Constrained Spatio-Temporal Feature Representations. In Proceedings of the 2016 ACM on Multimedia Conference, MM '16, pages 382–386, New York, NY, USA, 2016. ACM.
[4] J. Lee and S. Y. Shin. General construction of time-domain filters for orientation data. IEEE Transactions on Visualization and Computer Graphics, 8(2):119–128, Apr. 2002.
[5] X. Li, X. Hong, A. Moilanen, X. Huang, T. Pfister, G. Zhao, and M. Pietikainen. Towards Reading Hidden Emotions: A Comparative Study of Spontaneous Micro-expression Spotting and Recognition Methods. IEEE Transactions on Affective Computing, PP(99):1–1, 2017.
[6] X. Li, T. Pfister, X. Huang, G. Zhao, and M. Pietikainen. A Spontaneous Micro-expression Database: Inducement, Collection and Baseline. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–6, Apr. 2013.
[7] S. T. Liong, J. See, K. Wong, A. C. L. Ngo, Y. H. Oh, and R. Phan. Automatic apex frame spotting in micro-expression database. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 665–669, Nov. 2015.
[8] A. Moilanen, G. Zhao, and M. Pietikainen. Spotting Rapid Facial Movements from Videos Using Appearance-Based Feature Difference Analysis. In 2014 22nd International Conference on Pattern Recognition (ICPR), pages 1722–1727, Aug. 2014.
[9] S. Park and D. Kim. Subtle facial expression recognition using motion magnification. Pattern Recognition Letters, 30(7):708–716, May 2009.
[10] S. Y. Park, S. H. Lee, and Y. M. Ro. Subtle Facial Expression Recognition Using Adaptive Magnification of Discriminative Facial Motion. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 911–914, New York, NY, USA, 2015. ACM.
[11] M. Shreve, S. Godavarthy, D. Goldgof, and S. Sarkar. Macro- and micro-expression spotting in long videos using spatio-temporal strain. In 2011 IEEE International Conference on Automatic Face Gesture Recognition and Workshops (FG 2011), pages 51–56, Mar. 2011.
[12] C. Tomasi and T. Kanade. Detection and Tracking of Point Features. Technical report, International Journal of Computer Vision, 1991.
[13] G. Tzimiropoulos and M. Pantic. Optimization Problems for Fast AAM Fitting in-the-Wild. In 2013 IEEE International Conference on Computer Vision (ICCV), pages 593–600, Dec. 2013.
[14] M. Unser, D. Sage, and D. V. D. Ville. Multiresolution Monogenic Signal Analysis Using the Riesz-Laplace Wavelet Transform. IEEE Transactions on Image Processing, 18(11):2402–2418, Nov. 2009.
[15] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511–I-518, 2001.
[16] N. Wadhwa, M. Rubinstein, F. Durand, and W. Freeman. Riesz pyramids for fast phase-based video magnification. In 2014 IEEE International Conference on Computational Photography (ICCP), pages 1–10, May 2014.
[17] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman. Phase-based Video Motion Processing. ACM Trans. Graph., 32(4):80:1–80:10, July 2013.
[18] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman. Quaternionic representation of the Riesz pyramid for video magnification. Technical report, 2014.
[19] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, and W. Freeman. Eulerian Video Magnification for Revealing Subtle Changes in the World. ACM Trans. Graph., 31(4):65:1–65:8, July 2012.
[20] W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu. CASME II: An Improved Spontaneous Micro-Expression Database and the Baseline Evaluation. PLoS ONE, 9(1), Jan. 2014.
[21] W.-J. Yan, S.-J. Wang, Y.-H. Chen, G. Zhao, and X. Fu. Quantifying Micro-expressions with Constraint Local Model and Local Binary Pattern. In L. Agapito, M. M. Bronstein, and C. Rother, editors, Computer Vision - ECCV 2014 Workshops, number 8925 in Lecture Notes in Computer Science, pages 296–305. Springer International Publishing, Sept. 2014. DOI: 10.1007/978-3-319-16178-5_20.
[22] W.-J. Yan, Q. Wu, J. Liang, Y.-H. Chen, and X. Fu. How Fast are the Leaked Facial Expressions: The Duration of Micro-Expressions. Journal of Nonverbal Behavior, 37(4):217–230, July 2013.