
Semi-Supervised Kernel-Based Temporal Clustering

Rodrigo Araujo
Department of Electrical and Computer Engineering
University of Waterloo
Waterloo, Ontario N2L 3G1
Email: raraujo@uwaterloo.ca

Mohamed S. Kamel
Department of Electrical and Computer Engineering
University of Waterloo
Waterloo, Ontario N2L 3G1
Email: mkamel@uwaterloo.ca

Abstract—In this paper, we adapt two existing methods to perform semi-supervised temporal clustering: Aligned Cluster Analysis (ACA), a temporal clustering algorithm, and Constrained Spectral Clustering, a semi-supervised clustering algorithm. In the first method, we add side information in the form of pairwise constraints to its objective function, and in the second, we add a temporal search to its framework. We also extend both methods by propagating the constraints throughout the whole similarity matrix. To validate the advantage of the proposed semi-supervised methods for temporal clustering, we evaluate them against their original versions, as well as against another semi-supervised temporal clustering method, on three temporal datasets. The results show that the proposed methods are competitive and provide a good improvement over the unsupervised approaches.

Keywords—Semi-supervised clustering; Temporal segmentation; Kernel k-means.

I. INTRODUCTION
Temporal clustering (TC) can be defined as the factorization of multiple time series into a set of non-overlapping segments that belong to k temporal clusters [1]. Temporal clustering is similar to conventional clustering in that it also requires a similarity measure, a clustering algorithm, and an evaluation criterion; however, the temporal nature of the data requires special treatment in one or more of these components. The two major ways to handle time series are either to modify existing static-data clustering algorithms to handle time, or to convert the time series into a form that static algorithms can work on. The former approach relies, in most cases, on replacing the similarity measure with one appropriate for time series, for example, dynamic time warping (DTW). The latter maps the time series into a different representation or domain that embeds the temporal information, such as Wavelet, Fourier, or Haar transforms [2], or into a set of model parameters, and then applies a conventional static-data clustering algorithm.

Clustering time series is a tool that can be applied to many problems, such as data mining, visualization, and segmentation. The segmentation problem is especially important in applications like human motion analysis, audio-visual emotion analysis, and animal behaviour analysis. Solving the segmentation problem with a clustering approach requires modelling the temporal variability of the segments in a time series. As a result, simply applying traditional time series clustering techniques, such as the ones reviewed in [3], may not produce satisfactory results. Most of these techniques assume that the time series are already segmented, and this is the main difference between traditional time series clustering and what is defined here as TC. One technique that uses a clustering approach to segment the data is Aligned Cluster Analysis (ACA) [4], which frames this problem as energy-based temporal clustering and solves it via dynamic programming.

As in the general clustering problem, temporal segmentation using clustering may not produce satisfactory results because it is a totally unsupervised approach. In some situations, prior high-level knowledge about the clusters is available, or some of the data is labeled. This information can be used to aid the clustering algorithm, and this class of learning is known as semi-supervised clustering.

The most common way to add supervisory information is through two types of pairwise constraints: must-link and cannot-link. A must-link constraint states that two points must be in the same cluster, and a cannot-link constraint states that two points must not be in the same cluster. This type of constraint is a more general form of supervisory information than labeled data, and it can be expanded by taking the transitive closure of the must-link relations. Some popular unsupervised algorithms that have been adapted into a semi-supervised framework are Constrained K-means Clustering with Background Knowledge [5], Semi-supervised Kernel Mean Shift Clustering [6], Semi-supervised Kernel K-means [7], and Constrained Spectral Clustering [8]. However, none of these methods is designed to handle temporal data.

In this work, we investigate semi-supervised temporal clustering methods. First, we present Semi-Supervised Aligned Cluster Analysis (SSACA), an adaptation of the successful temporal clustering method Aligned Cluster Analysis (ACA) [4], to which we add side information in the form of pairwise constraints. Second, we turn Constrained Spectral Clustering, a semi-supervised clustering method, into a temporal semi-supervised method using the same framework. We also add an exhaustive constraint propagation algorithm to both methods to improve their performance. We test our methods on both synthetic and real-world datasets.

II. RELATED WORK

Segmentation and clustering of time series arise in various fields, including video segmentation [9], facial expression interpretation [4], animal behavior recognition [10], human-motion analysis [11], and stock market analysis. However, an in-depth understanding of this problem is necessary to differentiate the techniques used to solve it.
It is noteworthy that the TC problem addressed in this paper is different from clustering time series, which refers to clustering time series that are already segmented. The proposed approach performs temporal segmentation and groups similar segments into clusters simultaneously. Another way of doing temporal segmentation is change-point detection [12], [13], which analyzes changes in the distribution of the points within a window of temporal observations in order to locate segment boundaries. However, this technique detects only local boundaries and does not provide a global model of temporal events. HACA [11] and ACA [4] address this by minimizing the error across all segments of each of the k clusters.

In terms of semi-supervised methods, we have identified two approaches that add supervision to temporal clustering. [14] proposes Temporal-Driven Constrained K-Means (TDCK-Means), an extension of k-means that incorporates a temporal-aware dissimilarity measure combining the Euclidean distance in the multidimensional feature space with the distance between timestamps. TDCK-Means assumes that adjacent observations of the same entity should be assigned to the same cluster, and based on this assumption it builds a contiguity penalty function that penalizes contiguous observations assigned to different clusters. External knowledge is therefore embedded into the algorithm through the notion that similar observations should be close in time. Essentially, there is a soft must-link constraint between all pairs of observations belonging to the same entity.

[15] proposes a system that summarizes a user's daily activities, such as sleeping, walking, and working, by combining data from body sensors, GPS, computer monitoring, video, and audio. The paper describes a semi-supervised temporal clustering algorithm to group large amounts of multimodal data into different activities. As in [14], the constraints are used to penalize non-smooth changes (over time) in the assigned clusters. In this method, there is no control over the granularity of the temporal term, and there is no robust metric between time series.

III. MODEL DESCRIPTION
A. Aligned Cluster Analysis (ACA)

Aligned Cluster Analysis (ACA) [4] is an extension of the kernel k-means clustering algorithm. ACA combines kernel k-means with the Dynamic Time Alignment Kernel (DTAK), an alternative to Dynamic Time Warping (DTW).

Kernel k-means is a partitioning clustering algorithm in which n data points are divided into k disjoint partitions. It minimizes an objective function that represents the within-cluster scatter, finding a partition of the data that is a local optimum of the following energy function:

    J_{kkm}(G) = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci} \underbrace{\| \phi(x_i) - z_c \|^2}_{\mathrm{dist}^2_{\phi}(x_i, z_c)} = \| X - ZG \|_F^2, \quad \text{s.t. } G^T 1_k = 1_n,    (1)

where x_i ∈ R^d (see notation¹) is a vector representing the ith data point, φ(·) is a nonlinear mapping function, and z_c ∈ R^d represents the centroid of the data points in cluster c. The matrix G ∈ {0,1}^{k×n} indicates, in binary format, the cluster to which x_i belongs, and Z ∈ R^{d×n} holds the actual values of the means.

¹ Bold capital letters denote a matrix X; bold lower-case letters denote a column vector x.
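To make the objective in Equation 1 concrete, the following is a minimal sketch of Lloyd-style kernel k-means operating only on a precomputed kernel matrix, using the kernel trick to expand the distance to each implicit centroid. It is an illustration under our own choices (function name, random initialization, fixed iteration cap), not the authors' implementation.

```python
import numpy as np

def kernel_kmeans(K, k, iters=100, seed=0):
    """Illustrative kernel k-means (Equation 1) on a precomputed
    n x n kernel matrix K; returns hard cluster labels."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)          # random initial assignment
    for _ in range(iters):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                dist[:, c] = np.inf              # empty cluster: never chosen
                continue
            # ||phi(x_i) - z_c||^2 expanded with the kernel trick
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged
            break
        labels = new_labels
    return labels
```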
ACA considers distances between segments, which can even be of different lengths, instead of distances between points. The distance chosen for ACA is the Dynamic Time Alignment Kernel (DTAK), an extension of DTW. Unlike DTW, DTAK satisfies the triangle inequality. In DTAK, the distance between two sequences, X = [x_1, ..., x_n] ∈ R^{d×n} and Y = [y_1, ..., y_n] ∈ R^{d×n}, can be computed recursively (details in [11]).

The goal of ACA is to decompose a sequence X = [x_1, ..., x_n] ∈ R^{d×n} into m disjoint segments, each of which belongs to one of k clusters. Each segment is constrained by a maximum length n_max, which controls the temporal granularity of the segmentation. The ith segment begins at position s_i and ends at s_{i+1} − 1, so that its length satisfies n_i = s_{i+1} − s_i ≤ n_max. An indicator matrix G ∈ {0,1}^{k×m} assigns each segment to a cluster: g_{ci} = 1 if the ith segment belongs to cluster c.

ACA achieves temporal clustering by minimizing:

    J_{aca}(G, s) = \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci} \underbrace{\| \psi(X_{[s_i, s_{i+1})}) - z_c \|^2}_{\mathrm{dist}^2_{\psi}(Y_i, z_c)} = \| [\psi(Y_1), \dots, \psi(Y_m)] - ZG \|_F^2, \quad \text{s.t. } G^T 1_k = 1_m \text{ and } s_{i+1} - s_i \in [1, n_{max}],    (2)

where G ∈ {0,1}^{k×m} is the cluster indicator matrix, s ∈ R^{m+1} is the segment vector defining the segments Y_i = X_{[s_i, s_{i+1})}, and dist²_ψ(Y_i, z_c) is the squared distance between the ith segment and the center of cluster c in the nonlinear feature space given by the mapping ψ(·).
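The DTAK recursion itself is deferred to [11]; the sketch below gives one minimal rendering of it, assuming a Gaussian frame kernel and the normalization by the summed sequence lengths used in [11]. The function name and looping style are our own.

```python
import numpy as np

def dtak(X, Y, sigma=1.0):
    """Sketch of the DTAK recursion (see [11]).

    X: d x nx array, Y: d x ny array (columns are frames).
    Returns the normalized alignment score tau(X, Y).
    """
    nx, ny = X.shape[1], Y.shape[1]
    # Gaussian frame kernel kappa(x_i, y_j)
    d2 = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    K = np.exp(-d2 / (2 * sigma ** 2))

    U = np.full((nx, ny), -np.inf)
    U[0, 0] = 2 * K[0, 0]
    for i in range(nx):
        for j in range(ny):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append(U[i - 1, j] + K[i, j])      # vertical step
            if j > 0:
                cands.append(U[i, j - 1] + K[i, j])      # horizontal step
            if i > 0 and j > 0:
                cands.append(U[i - 1, j - 1] + 2 * K[i, j])  # diagonal step
            U[i, j] = max(cands)
    return U[-1, -1] / (nx + ny)
```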
B. Kernel-Based Semi-Supervised Clustering

[7] shows that the objective function of semi-supervised clustering based on Hidden Markov Random Fields (HMRF), with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of weighted kernel k-means. The authors also point out the theoretical equivalence between weighted kernel k-means and graph clustering, which unifies several vector-based and graph-based approaches. One implication of this unification is that a single semi-supervised clustering algorithm covers several objective functions as special cases of the same framework.

The semi-supervised clustering objective of HMRF can be expressed as

    J_{HMRF}(Z, G) = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci} \| x_i - z_c \|^2 - \sum_{\substack{x_i, x_j \in \mathcal{M} \\ g_i = g_j}} w_{ij} + \sum_{\substack{x_i, x_j \in \mathcal{C} \\ g_i = g_j}} w_{ij},    (3)

where M is the set of must-link constraints, C is the set of cannot-link constraints, w_ij is the penalty cost for violating a constraint between x_i and x_j, and g_i refers to the cluster label of x_i. There are three terms in this objective function. The first is the unsupervised k-means term. Note that the distance ||x_i − z_c||² can be represented through a matrix of pairwise squared Euclidean distances among the data points (proof in [7]); we refer to this distance matrix as S. The second term is based on the must-link constraints: for every must-link pair x_i, x_j in the same cluster (i.e., satisfying the constraint), the objective function is rewarded by subtracting a pre-specified weight. Similarly, the third term states that for every cannot-link pair x_i, x_j in the same cluster, which has therefore violated the constraint, the objective function is penalized by a pre-specified penalty weight. We refer to the second and third terms of the function together as W.
[7] also shows that, for the equivalence between HMRF k-means and weighted kernel k-means to hold, a certain kernel matrix must be constructed and the weights must be set in a certain way. The kernel matrix K should have two components, K = S + W, where S is the similarity matrix and comes from the unsupervised term, while W is the constraint matrix. The matrix W holds a pre-specified weight w_ij for each must-link pair, −w_ij for each cannot-link pair, and zero otherwise. With this construction, the objective function is mathematically equivalent to the weighted kernel k-means objective; in other words, we can run weighted kernel k-means to decrease the objective function.

It may be the case that K is not positive semidefinite, a requirement for kernel k-means to converge. This problem can be avoided by diagonal shifting [7], which does not change the globally optimal solution. In practice, however, we observe that diagonal shifting is not necessary to guarantee convergence.
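As an illustration of this construction, the sketch below builds W from the two constraint sets and applies an optional diagonal shift. The uniform weight w and the function name are our own simplifications, since [7] allows a per-pair weight w_ij.

```python
import numpy as np

def constraint_kernel(S, must, cannot, w=1.0, shift=None):
    """Sketch of the kernel construction K = S + W from [7].

    S: n x n similarity matrix (unsupervised term).
    must, cannot: lists of (i, j) index pairs.
    w: reward/penalty weight (a single value here for simplicity).
    shift: optional diagonal shift sigma for positive semidefiniteness;
           the paper observes this is often unnecessary in practice.
    """
    n = S.shape[0]
    W = np.zeros((n, n))
    for i, j in must:            # reward must-linked pairs
        W[i, j] = W[j, i] = w
    for i, j in cannot:          # penalize cannot-linked pairs
        W[i, j] = W[j, i] = -w
    K = S + W
    if shift is not None:        # diagonal shifting: K <- K + sigma * I
        K = K + shift * np.eye(n)
    return K
```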
C. Semi-Supervised Aligned Cluster Analysis (SSACA)

In SSACA, we use the equivalence between ACA and weighted kernel k-means to incorporate the semi-supervised kernel k-means framework into ACA. We therefore incorporate the penalty terms derived from the must-link and cannot-link constraints into the objective function, and we can write the SSACA objective function as

    J_{ssaca}(G, s) = \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci} \underbrace{\| \psi(X_{[s_i, s_{i+1})}) - z_c \|^2}_{\mathrm{dist}^2_{\psi}(Y_i, z_c)} - \sum_{\substack{x_i, x_j \in \mathcal{M} \\ g_i = g_j}} w_{ij} + \sum_{\substack{x_i, x_j \in \mathcal{C} \\ g_i = g_j}} w_{ij},    (4)

where M, C, w_ij, and g_i are defined as in Equation 3. As in the HMRF k-means objective, we have the same three terms: S, the must-link term, and the cannot-link term.

Because the constraints are held at the segment level, we have two kernel matrices, K and T. K is the frame kernel matrix, K = φ(X)^T φ(X) ∈ R^{n×n}, whose entries k_ij define the similarity between two frames x_i and x_j. T = [τ_ij] ∈ R^{m×m} is the segment kernel matrix, which represents the similarity of the segments under the DTAK distance. To use the same framework, the kernel matrix should be constructed as T = S + W, where S represents the similarity matrix at the segment level and W is the weight matrix.

The optimization of SSACA's objective function is a non-convex problem. ACA adopts a dynamic programming (DP)-based algorithm, with complexity O(n² n_max), to examine all possible segmentations. In SSACA, we adapt the DP-based search to the semi-supervised setting by incorporating the pairwise constraints into the algorithm. We call the new algorithm SS_DPSearch (Algorithm 1). SS_DPSearch optimizes SSACA w.r.t. G and s, rewarding or penalizing the distance between segments τ(X_[i,v], Ẏ_j) according to the given pairwise constraints.

Algorithm 1 SS_DPSearch
parameter: n_max, k, n_ml, n_cl
input: G ∈ {0,1}^{k×ṁ}, ṡ ∈ R^{ṁ+1}, K ∈ R^{n×n}, T ∈ R^{m×m}, M ∈ Z^{n_ml×2}, C ∈ Z^{n_cl×2}
output: G ∈ {0,1}^{k×m}, s ∈ R^{m+1}
 1: headTail = getHeadTails(M, C)
 2: for v = 1 to n do
 3:   J(v) ← ∞
 4:   if v ≥ headTail(:,1) and v < headTail(:,2) then
 5:     continue
 6:   end if
 7:   if isTail(v) then
 8:     for j = 1 to ṁ do
 9:       retrieve τ(X_[i,v], Ẏ_j) directly from T
10:     end for
11:     c* ← arg min_c dist_ψ(X_[i,v], ż_c); J ← dist_ψ(X_[i,v], ż_{c*})
12:     J([i,v]) ← J; g_[i,v] ← e_{c*}; i*_[i,v] ← i
13:   else
14:     for n_v = 1 to min(n_max, v) do
15:       {same as DPSearch [11]}
16:     end for
17:   end if
18: end for
{Perform backward segmentation [11]}

As parameters, the algorithm receives the length constraint n_max and the number k of clusters. As inputs, it requires an initial indicator matrix G ∈ {0,1}^{k×ṁ} that assigns segments to clusters, an initial segmentation ṡ ∈ R^{ṁ+1}, and the frame kernel matrix K ∈ R^{n×n}. In addition to the inputs of the original approach, there are the pairs of must-links M ∈ Z^{n_ml×2}, the cannot-links C ∈ Z^{n_cl×2}, and the segment kernel matrix T ∈ R^{m×m}.
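For concreteness, this is how the segment kernel matrix T = [τ_ij] could be assembled from a segmentation vector s using the dtak() sketch given earlier; the helper name and the inclusive/exclusive indexing convention are our own assumptions.

```python
import numpy as np

def segment_kernel(X, s, sigma=1.0):
    """Build the m x m segment kernel T = [tau_ij] from a segmentation
    s = [s_1, ..., s_{m+1}], where segment i spans columns
    s[i]..s[i+1]-1 of X. Relies on the dtak() sketch above."""
    m = len(s) - 1
    T = np.zeros((m, m))
    for i in range(m):
        for j in range(i, m):           # T is symmetric
            T[i, j] = T[j, i] = dtak(X[:, s[i]:s[i + 1]],
                                     X[:, s[j]:s[j + 1]], sigma)
    return T
```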
1) Exhaustive and Efficient Constraint Propagation: The pairwise constraints are used to adjust the similarity matrix for the kernel k-means clustering algorithm by penalizing or rewarding constrained segments. This technique, however, affects only the similarities of the constrained segments themselves. To make the propagation of constraints more efficient, we borrow the idea of exhaustive and efficient constraint propagation (E²CP) from Lu and Ip [8] and adapt it to our framework. The rationale behind this method is to spread the effect of the constraints throughout the whole similarity matrix S.

E²CP tackles the constraint propagation problem by decomposing it into sets of label propagation subproblems. Given the data X = [x_1, ..., x_N] ∈ R^{d×N}, a set of must-links M, and a set of cannot-links C, we can represent all the pairwise constraints in a single matrix W = [W_ij]_{N×N}, where W_ij = 1 if (x_i, x_j) ∈ M, W_ij = −1 if (x_i, x_j) ∈ C, and W_ij = 0 otherwise.

Each column W_.j can be seen as a two-class semi-supervised learning problem, where the positive class (W_ij > 0) contains the segments that should be in the same cluster and the negative class (W_ij < 0) contains the segments that should not be in the same cluster; if W_ij = 0, x_i and x_j are unconstrained. Each column and each row are solved by label propagation in parallel [16], ensuring that all the segments are affected by the propagation. The algorithm can be described as follows (a code sketch follows the list):

1) Create the similarity matrix T or a symmetric k-NN graph.
2) Create the matrix L̄ = D^{−1/2} T D^{−1/2}, where D is a diagonal matrix whose (i,i)-element equals the sum of the ith row of T.
3) Iterate F_v(t+1) = α L̄ F_v(t) + (1 − α) W for vertical constraint propagation until convergence, where F_v(t) ∈ F and α is a parameter between 0 and 1.
4) Iterate F_h(t+1) = α F_h(t) L̄ + (1 − α) F_v* for horizontal constraint propagation until convergence, where F_h(t) ∈ F and F_v* is the limit of {F_v(t)}.
5) Output F* = F_h* as the final representation of the pairwise constraints, where F_h* is the limit of {F_h(t)}.
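The following sketch implements the five steps above. It runs the two propagations for a fixed number of iterations rather than to exact convergence; the function name, iteration cap, and default α are our own choices.

```python
import numpy as np

def e2cp_propagate(T, W, alpha=0.4, iters=200):
    """Sketch of exhaustive constraint propagation (steps 1-5).

    T: m x m segment similarity matrix (step 1, assumed given).
    W: m x m constraint matrix (+1 must-link, -1 cannot-link, 0 free).
    alpha in (0, 1): fraction of information taken from neighbours.
    Returns F*, the exhaustive set of soft pairwise constraints.
    """
    d = T.sum(axis=1)
    L = T / np.sqrt(np.outer(d, d))      # step 2: Lbar = D^{-1/2} T D^{-1/2}
    Fv = W.copy()
    for _ in range(iters):               # step 3: vertical propagation
        Fv = alpha * L @ Fv + (1 - alpha) * W
    Fh = Fv.copy()
    for _ in range(iters):               # step 4: horizontal propagation
        Fh = alpha * Fh @ L + (1 - alpha) * Fv
    return Fh                            # step 5: F* = Fh*
```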
Intuitively, each segment receives information from its neighbors at each iteration, and the parameter α controls the relative amount of information passed on from the neighbors. The final label of a segment is the cluster from which it has received the most information during the iteration process.

Without loss of generality, [16] shows that {F(t)} can also be calculated in closed form. The output F* represents an exhaustive set of pairwise constraints with associated confidence scores |F*|. We can then adjust the similarities in T with the output scores of F*, as described in Equation 5. The resulting similarity matrix T̃ has the same properties as the kernel matrix specified in the kernel k-means framework; it is both nonnegative and symmetric [8].

    \tilde{T}_{ij} = \begin{cases} 1 - (1 - F^*_{ij})(1 - T_{ij}), & F^*_{ij} \ge 0 \\ (1 + F^*_{ij})\, T_{ij}, & F^*_{ij} < 0 \end{cases}    (5)

In order to improve the performance of SSACA, we add the exhaustive propagation (EP) to its algorithm, yielding SSACA+EP (Algorithm 2).

Algorithm 2 SSACA + EP (exhaustive propagation)
input: S ∈ R^{n×n}: input frame kernel matrix, T ∈ R^{m×m}: input segment kernel matrix, W: constraint penalty matrix, k: number of clusters, M: set of must-link constraints, C: set of cannot-link constraints, ṡ: initial segmentation.
output: G ∈ {0,1}^{k×m}: final partitioning of the segments, s: final segmentation.
1: Propagate the constraints: F* ← W.
2: Form the matrix T̃ according to Equation 5.
3: Diagonal-shift T̃ by adding σI to guarantee positive definiteness of T̃.
4: Get initial clusters G(0) using the constraints.
5: Return G, s = SS_DPSearch(G(0), ṡ, S, T̃, M, C, k).

D. Semi-Supervised Temporal Spectral Clustering (SSTSC)

Spectral clustering is a graph clustering algorithm that uses the eigenvectors of the similarity matrix to reduce its dimensionality before clustering. As explained in Section III-B, there is an equivalence between weighted kernel k-means and graph clustering. To use the same methodology as for SSACA and comply with the semi-supervised kernel k-means framework, we need a kernel matrix of the form K = S + W. The output T̃ of the exhaustive propagation method satisfies this requirement. To perform spectral clustering, we find the k largest nontrivial eigenvectors of T̃ to derive low-dimensional feature vectors, and we then perform the temporal clustering search, which adds the temporal aspect to this approach. The first steps are described below (a code sketch follows Algorithm 3), and the complete algorithm is given in Algorithm 3.

1) Find the k largest nontrivial eigenvectors v_1, ..., v_k of D̃^{−1/2} T̃ D̃^{−1/2}, where D̃ is a diagonal matrix whose (i,i)-element equals the sum of the ith row of T̃.
2) Form E = [v_1, ..., v_k] and normalize each row to unit length. The rows of E are the low-dimensional features of the segments X_{[s_i, s_{i+1})}.
3) Create a kernel from the new feature vectors E_i (i = 1, ..., m).

Algorithm 3 SS Temporal Spectral Clustering + EP
input: S ∈ R^{n×n}: input frame kernel matrix, T ∈ R^{m×m}: input segment kernel matrix, W: constraint penalty matrix, k: number of clusters, M: set of must-link constraints, C: set of cannot-link constraints, ṡ: initial segmentation.
output: G ∈ {0,1}^{k×m}: final partitioning of the segments, s: final segmentation.
1: Propagate the constraints: F* ← W.
2: Form the matrix T̃ according to Equation 5.
3: Diagonal-shift T̃ by adding σI to guarantee positive definiteness of T̃.
4: Get initial clusters G(0) using the constraints.
5: Create a new kernel T̂ from the new feature vectors E_i (i = 1, ..., m).
6: Return G, s = SS_DPSearch(G(0), ṡ, S, T̂, M, C, k).
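To make the Equation 5 adjustment and the embedding steps 1-3 concrete, here is a small sketch. The eigensolver choice and function names are ours, and we assume the top k eigenvectors of the normalized matrix serve as the k largest nontrivial ones.

```python
import numpy as np

def adjust_similarity(T, F):
    """Equation 5: fold the propagated constraints F* into T."""
    return np.where(F >= 0,
                    1 - (1 - F) * (1 - T),
                    (1 + F) * T)

def spectral_embedding(T_tilde, k):
    """Steps 1-3: normalized spectral embedding of the segments."""
    d = T_tilde.sum(axis=1)
    Lsym = T_tilde / np.sqrt(np.outer(d, d))       # D~^{-1/2} T~ D~^{-1/2}
    vals, vecs = np.linalg.eigh(Lsym)              # ascending eigenvalues
    E = vecs[:, -k:]                               # k largest eigenvectors
    E /= np.linalg.norm(E, axis=1, keepdims=True)  # row-normalize
    return E                                       # rows: segment features

# A new segment kernel can then be built from E, e.g. T_hat = E @ E.T
```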
IV. EXPERIMENTS

We demonstrate the feasibility of the proposed methods on three datasets: a synthetic time series, a motion capture dataset, and an audio-visual emotion dataset. The accuracy evaluation criterion is the same as in [17]. In all experiments, we compare the proposed methods (SSACA, SSACA+EP, and SSTSC+EP) to their unsupervised baseline (ACA), to a version of spectral clustering (SC) modified to handle time as described in [11], and to the recently proposed semi-supervised temporal clustering method, Temporal-Driven Constrained K-means (TDCK) [14]. Compared to other methods such as simple k-means, temporal-driven k-means, constrained k-means, and temporal constrained k-means, TDCK had the best results, as attested by our own experiments in Table I.

TABLE I. AVERAGE ACCURACY RESULTS OF SOME TEMPORAL-BASED K-MEANS METHODS.

Method                 Average Accuracy
TDCK-means [14]        0.38 ± 0.05
TCK-means              0.36 ± 0.05
Constrained K-means    0.34 ± 0.04
Temp-driven K-means    0.30 ± 0.04
Simple K-means         0.28 ± 0.02
A. Synthetic Dataset

The synthetic dataset is a randomly generated time series created by sampling 2D Gaussian distributions. It contains 78 segments (m = 78), four clusters (k = 4), and approximately 4400 frames. For this experiment, we analyze the average accuracy at different numbers of random constraints over 30 runs. We use a Gaussian kernel with σ = 1 for all the kernel-based methods (all methods except TDCK [14]) and n_max = 17, which controls the granularity of the ACA-based methods. For the methods that use exhaustive propagation, i.e., SSACA+EP and SSTSC+EP, we use α = 0.01 and α = 0.4, respectively, as the constraint propagation rate. For the TDCK algorithm, the weights of the time-aware dissimilarity are γ_t = 0.25 and γ_d = 1, and the parameters of the penalty function are β = 0.3 and δ = 3.
[Figure 1: line chart of accuracy (roughly 0.25 to 0.8) versus number of constraint pairs (0 to 50), comparing SSACA+EP, SSTSC+EP, SSACA, TDCK, ACA, and SC on the synthetic dataset.]

Fig. 1. Accuracy of synthetic data using different numbers of constraints.
B. Audio-visual Emotion Dataset

The Audio-Visual Emotion Challenge (AVEC) dataset [18] is an audio-visual emotion recognition dataset created for the AVEC 2012 emotion recognition challenge. The dataset consists of conversations among several participants and four stereotyped characters. Each character has a specific emotion stereotype: sensible, happy, angry, and sad. The dataset is labeled for arousal, valence, power, and expectancy.

We perform the experiments on a combined video of the four characters, so we have a longer conversation and higher variability. We discretize the continuous values of the affective dimensions, which range over [−1, +1], into 6 categories, and we observe the results of the visual and audio features on arousal, valence, power, and expectancy. The audio features consist of 1871 features, including 25 energy- and spectral-related low-level descriptors (LLD) × 42 functionals, 6 voicing-related LLD × 32 functionals, and 10 voiced/unvoiced durational features. Details on the LLD and functionals can be found in [18]. For the visual features, we extract Local Binary Patterns (LBP), based on the approach described in [19]. We apply approximately 5% of the total number of possible constraints.

C. CMU Motion Capture Dataset

The Carnegie Mellon University Motion Capture Dataset (MOCAP) [20] is a human motion dataset. We perform the experiments on fourteen sequences of subject 86, which contain roughly ten actions (e.g., walking, punching, kicking). We use the same setup as in [11]. The range of frames in the experiment is from 160 to 300, n_max is set to 60, and the number of constraints corresponds to 20% of all possible constraints.
V. RESULTS AND DISCUSSIONS

In the first experiment (Figure 1), we evaluate the methods with different numbers of constraints. Note that the accuracy of the proposed semi-supervised methods increases consistently as the number of constraints grows. Furthermore, once the number of constraints exceeds 40 pairs, the exhaustive propagation method SSACA+EP starts to pull away from its counterpart (SSACA). Keeping the constraint set consistent becomes very expensive for large numbers of constraints, so we stopped at 50 pairs.

Granted, the TDCK algorithm was originally created for multi-entity problems, and here we use it in a single-entity scenario, so it could be said that the algorithm is being underused; however, it is one of the few approaches that is both temporal and semi-supervised, so it is still fair to compare it with our approaches. Differently from TDCK, our methods consider distances between segments instead of distances between points, which better captures the temporal dynamics of the data. Also, the pairwise constraints are applied directly to the affected pairs, instead of being used as a single general assumption applied to all the data. These aspects favour the proposed methods over TDCK.

Tables II and III show the results on the complex problem of audio-visual emotion analysis. In both cases, SSTSC+EP and SSACA+EP, which use exhaustive propagation, performed similarly to or better than the regular propagation used in SSACA. The highly complex nature of the visual features makes the spectral-based approach (SSTSC+EP) the more suitable method. For the audio features, the advantage of SSTSC+EP is most evident in the power and expectancy emotion dimensions. In this type of application, emotions are recurrent and may be further apart in time, which causes TDCK to perform poorly due to its assumption favouring temporally close clusters.

TABLE II. AVERAGE ACCURACY RESULTS ON HIGH-VARIANCE SEGMENTS (VIDEO FEATURES).

Method        Arousal       Valence       Power         Expectancy
SSACA+EP      0.75 ± 0.09   0.73 ± 0.09   0.76 ± 0.12   0.83 ± 0.08
SSTSC+EP      0.84 ± 0.10   0.77 ± 0.09   0.82 ± 0.09   0.92 ± 0.07
SSACA         0.75 ± 0.11   0.74 ± 0.09   0.76 ± 0.12   0.83 ± 0.08
TDCK [14]     0.52 ± 0.05   0.53 ± 0.06   0.46 ± 0.09   0.39 ± 0.04
ACA [4]       0.42 ± 0.02   0.54 ± 0.04   0.39 ± 0.04   0.42 ± 0.02
SC            0.37 ± 0.00   0.44 ± 0.00   0.41 ± 0.00   0.40 ± 0.00

TABLE III. AVERAGE ACCURACY RESULTS ON HIGH-VARIANCE SEGMENTS (AUDIO FEATURES).

Method        Arousal       Valence       Power         Expectancy
SSACA+EP      0.78 ± 0.08   0.78 ± 0.08   0.81 ± 0.14   0.92 ± 0.06
SSTSC+EP      0.76 ± 0.12   0.76 ± 0.08   0.83 ± 0.12   0.93 ± 0.08
SSACA         0.74 ± 0.10   0.74 ± 0.10   0.74 ± 0.13   0.84 ± 0.08
TDCK [14]     0.50 ± 0.06   0.50 ± 0.08   0.45 ± 0.11   0.39 ± 0.04
ACA [4]       0.47 ± 0.03   0.42 ± 0.03   0.73 ± 0.11   0.49 ± 0.02
SC            0.44 ± 0.00   0.39 ± 0.00   0.70 ± 0.00   0.45 ± 0.00

On the human motion segmentation dataset, we also observed improvements of the proposed methods over the baselines (Table IV). However, the differences between the semi-supervised methods are not very visible due to the very low number of segments.

TABLE IV. ACCURACY OF 14 SEQUENCES OF SUBJECT 86 FROM THE MOCAP DATASET.

Method        Accuracy (average over 14 sequences)
SSACA+EP      0.92 ± 0.08
SSTSC+EP      0.92 ± 0.07
SSACA         0.92 ± 0.07
TDCK [14]     0.49 ± 0.05
ACA [4]       0.87 ± 0.09
SC            0.74 ± 0.13

In general, the methods show substantial improvements with a minimal amount of side information compared to exclusively unsupervised methods. In addition, some specific problems, such as audio-visual emotion analysis, are better suited to SSTSC due to the spectral nature of the algorithm, especially in the case of the power and expectancy dimensions. Simpler cases, such as the synthetic dataset, are better suited to SSACA. Moreover, exhaustive propagation can enhance the accuracy of the proposed methods in situations with larger numbers of constraints and segments.
VI. CONCLUSION

In this paper, we propose two semi-supervised temporal clustering methods. First, we transform a kernel-based unsupervised temporal clustering method into a semi-supervised one by adding constraints in the form of must-links and cannot-links. Then, we transform a constrained spectral clustering method into a temporal semi-supervised clustering method. We evaluate our methods on a synthetic dataset and on two distinct complex problems. The results show substantial improvements, with minimal supervision, over the original unsupervised methods and another semi-supervised method.

REFERENCES

[1] M. Hoai and F. De la Torre, "Maximum margin temporal clustering," in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), vol. 22, 2012, pp. 520–528.
[2] T. Fu, "A review on time series data mining," Engineering Applications of Artificial Intelligence, vol. 24, no. 1, pp. 164–181, 2011.
[3] T. W. Liao, "Clustering of time series data – a survey," Pattern Recognition, vol. 38, no. 11, pp. 1857–1874, 2005.
[4] F. Zhou, F. De la Torre, and J. F. Cohn, "Unsupervised discovery of facial events," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, June 2010, pp. 2574–2581.
[5] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, "Constrained k-means clustering with background knowledge," in Proceedings of the Eighteenth International Conference on Machine Learning, ser. ICML '01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001, pp. 577–584.
[6] S. Anand, S. Mittal, O. Tuzel, and P. Meer, "Semi-supervised kernel mean shift clustering," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. PP, no. 99, pp. 1–1, 2013.
[7] B. Kulis, S. Basu, I. S. Dhillon, and R. J. Mooney, "Semi-supervised graph clustering: a kernel approach," Machine Learning, vol. 74, no. 1, pp. 1–22, 2009.
[8] Z. Lu and H. H. S. Ip, "Constrained spectral clustering via exhaustive and efficient constraint propagation," in Proceedings of the 11th European Conference on Computer Vision: Part VI, ser. ECCV'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 1–14.
[9] S. Arifin and P. Cheung, "Affective level video segmentation by utilizing the pleasure-arousal-dominance information," Multimedia, IEEE Transactions on, vol. 10, no. 7, pp. 1325–1341, 2008.
[10] S. Oh, J. Rehg, T. Balch, and F. Dellaert, "Learning and inferring motion patterns using parametric segmental switching linear dynamic systems," International Journal of Computer Vision, vol. 77, no. 1–3, pp. 103–124, 2008.
[11] F. Zhou, F. De la Torre, and J. K. Hodgins, "Hierarchical aligned cluster analysis for temporal clustering of human motion," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 3, pp. 582–596, 2013.
[12] X. Xuan and K. Murphy, "Modeling changing dependency structure in multivariate time series," in Proceedings of the 24th International Conference on Machine Learning, ser. ICML '07. New York, NY, USA: ACM, 2007, pp. 1055–1062.
[13] Z. Harchaoui, E. Moulines, and F. R. Bach, "Kernel change-point analysis," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Curran Associates, Inc., 2009, pp. 609–616.
[14] M. Rizoiu, J. Velcin, and S. Lallich, "Structuring typical evolutions using temporal-driven constrained clustering," in Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on, vol. 1, 2012, pp. 610–617.
[15] F. De la Torre and C. Agell, "Multimodal diaries," in Multimedia and Expo, 2007 IEEE International Conference on, July 2007, pp. 839–842.
[16] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems 16. MIT Press, 2004, pp. 321–328.
[17] R. Araujo and M. Kamel, "A semi-supervised temporal clustering method for facial emotion analysis," in Multimedia and Expo Workshops (ICMEW), 2014 IEEE International Conference on, July 2014, to appear.
[18] B. Schuller, M. Valstar, R. Cowie, and M. Pantic, "AVEC 2012: the continuous audio/visual emotion challenge – an introduction," in Proceedings of the 14th ACM International Conference on Multimodal Interaction, ser. ICMI '12. New York, NY, USA: ACM, 2012, pp. 361–362.
[19] A. Sayedelahl, R. Araujo, and M. Kamel, "Audio-visual feature-decision level fusion for spontaneous emotion estimation in speech conversations," in Multimedia and Expo Workshops (ICMEW), 2013 IEEE International Conference on, July 2013, pp. 1–6.
[20] Carnegie Mellon University Motion Capture Database, 2012. [Online]. Available: http://mocap.cs.cmu.edu
