


Nilanjan Dasgupta, Shihao Ji, Lawrence Carin

Department of Electrical and Computer Engineering, Duke University, NC 27708

A semi-supervised hidden Markov tree (HMT) model is developed for texture analysis, incorporating both labeled and unlabeled data for training; the optimal balance between labeled and unlabeled data is estimated via the homotopy method. In traditional EM-based semi-supervised modeling, this balance is dictated by the relative size of labeled and unlabeled data, often leading to poor performance. Semi-supervised modeling may be viewed as a source allocation problem between labeled and unlabeled data, controlled by a parameter λ ∈ [0, 1], where λ = 0 and 1 correspond to the purely supervised HMT model and purely unsupervised HMT-based clustering, respectively. We consider the homotopy method to track a path of fixed points starting from λ = 0, with the optimal source allocation identified as a critical transition point where the solution is unsupported by the initial labeled data. Experimental results on real textures demonstrate the superiority of this method compared to the EM-based semi-supervised HMT training.

1. INTRODUCTION

Semi-supervised learning exploits both labeled and unlabeled data to estimate parameters of an underlying model, yielding natural adaptivity to new unlabeled data. Labeled data are generally expensive to acquire and are often sparse, while unlabeled data are relatively inexpensive to acquire and therefore are often abundant. Presuming the existence of an underlying structure in the data, unlabeled samples may provide information about the data manifold, and they may be used to regularize a purely supervised solution. A conventional approach for parameter optimization with a generative model (e.g., HMTs), incorporating both labeled and unlabeled data, is to use the expectation-maximization (EM) algorithm [1], in which the labels of the unlabeled data are treated as hidden variables, and the optimality criterion is the likelihood maximization of both labeled and unlabeled data [2]-[3]. However, the EM approach for semi-supervised learning is unstable, and Nigam et al. [3] proposed a heuristic way to alleviate the instability by weighting the contribution from the unlabeled data, while the choice of a suitable scaling parameter remains an important issue. Corduneanu et al. [4] proposed the homotopy method for stable estimation of a naive Bayes classifier, where the optimal scaling parameter λ is selected at the point at which a critical transition occurs in the homotopy. We propose the homotopy method, a generalization of continuation [5], as an alternative to the EM-based solution for both supervised and semi-supervised HMTs in the context of texture classification, along with estimating the optimal balance λ specific to semi-supervised modeling. We present an overview of the HMT model and associated EM update equations in Sec. 2. The homotopy method is presented in Sec. 3, followed by its application to HMT parameter estimation in Sec. 4. Quantitative performance analyses of the proposed approach and conclusions are presented in Sec. 5 and 6, respectively.

2. PARAMETER ESTIMATION

Consider an image $I_{cd}$, sampled uniformly in two dimensions. Defining $LL^0_{cd} = I_{cd}$, a sequential ν-level wavelet decomposition of $I_{cd}$ yields four subsampled images: $LL^\nu_{cd}$, $HL^\nu_{cd}$, $LH^\nu_{cd}$, and $HH^\nu_{cd}$. Each point in $HL^\nu_{cd}$, $LH^\nu_{cd}$, and $HH^\nu_{cd}$ corresponds to the root node of a wavelet tree [6], and each node within a tree has four children at the next finer level (hence termed quadtrees). Each quadtree corresponds to a $2^\nu \times 2^\nu$ block in the original image $I_{cd}$ [7], and our objective is to obtain a parametric model that captures the underlying statistics within the wavelet quadtrees.

The HMT is a statistical model that assumes a Markovian relationship between any wavelet node and its parent within a quadtree [6]. For simplicity the HH, HL and LH quadtrees are treated as statistically independent, and each node in a quadtree is modeled by a hidden $M$-state process, with each state represented by a Gaussian distribution parameterized by its mean and variance. Crouse et al. [6] developed an EM algorithm obtaining a most likely estimate of the model parameters $\theta^y = \{\pi^y_m, \epsilon^y_{t,k,l}, \mu^y_{tm}, \sigma^y_{tm}\}$ for texture $y$ with $m, k, l \in \{1, \cdots, M\}$, and $t, t' \in \{1, \cdots, R\}$ representing indices of the wavelet nodes numbered sequentially from the root to the leaves. The term $\pi_m$ denotes the probability of hidden state $m$ associated with the root node, and $\epsilon^{y,k,l}_{t,t'}$ defines the transition probability to hidden state $s_t$ (= $k$) from its parent (= $l$). The terms $\mu^y_{tm}$ and $\sigma_{tm}$ denote the mean and variance of the Gaussian distribution representing the $m$th state of the $t$th wavelet node. We have three such independent models for each texture, one for each subband, but we suppress band-specific notation here for simplicity.

Assume we have $L$ labeled texture blocks $\{(x_1, y_1), \cdots, (x_L, y_L)\}$ and $U$ unlabeled texture blocks $\{x_{L+1}, \cdots, x_{L+U}\}$, where each $x_i$ corresponds to three quadtrees, one for each subband. For both labeled and unlabeled texture blocks the associated wavelet coefficients are observable, whereas the underlying Markov states are hidden. In the context of classification among $C$ textures, we employ distinct HMT models, one for each texture, and we wish to estimate the joint set of model parameters $\Theta = \{w_y, \theta^y\}_{y=1}^{C}$ based on the set of labeled and unlabeled data, where $w_y$ represents the probability of class membership. Assuming that λ represents the balance between labeled and unlabeled data, the semi-supervised EM updates [8] for $\Theta$ can be written as

$$
\begin{aligned}
w_y &= \frac{1-\lambda}{L}\sum_{i=1}^{L}\delta_{y,y_i} + \frac{\lambda}{U}\sum_{j=L+1}^{L+U} p(y|x_j) \\
\tilde{\pi}^y_m &= \frac{1-\lambda}{L}\sum_{i=1}^{L}\gamma^{i,y}_{1,m}\,\delta_{y,y_i} + \frac{\lambda}{U}\sum_{j=L+1}^{L+U}\gamma^{j,y}_{1,m}\,p(y|x_j) \\
\tilde{\epsilon}^y_{t,k,l} &= \frac{1-\lambda}{L}\sum_{i=1}^{L}\xi^{i,y}_{t,k,l}\,\delta_{y,y_i} + \frac{\lambda}{U}\sum_{j=L+1}^{L+U}\xi^{j,y}_{t,k,l}\,p(y|x_j)
\end{aligned}
\tag{1}
$$
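As an illustration of how the balance λ enters the first update in Eq. (1), the following is a minimal sketch of the class-membership update $w_y$ in plain Python. The function name, the integer class labels, and the list-of-lists layout for the posteriors $p(y|x_j)$ are assumptions made for this example; in the actual model the posteriors would come from the E-step of the HMT-EM algorithm, which is not reproduced here.

```python
def semi_supervised_class_prior(labels, posteriors, num_classes, lam):
    """Sketch of the w_y update in Eq. (1).

    labels      : class index y_i for each of the L labeled blocks (assumed 0-based)
    posteriors  : for each of the U unlabeled blocks, a list p(y | x_j) over classes
    num_classes : C
    lam         : balance parameter lambda in [0, 1]
    """
    L, U = len(labels), len(posteriors)
    w = []
    for y in range(num_classes):
        # (1/L) * sum_i delta(y, y_i): fraction of labeled blocks in class y
        labeled_term = sum(1 for y_i in labels if y_i == y) / L
        # (1/U) * sum_j p(y | x_j): average posterior over unlabeled blocks
        unlabeled_term = sum(p[y] for p in posteriors) / U
        w.append((1.0 - lam) * labeled_term + lam * unlabeled_term)
    return w


# Hypothetical toy data: 3 labeled blocks, 2 unlabeled blocks, C = 2 textures.
labels = [0, 0, 1]
posteriors = [[0.5, 0.5], [0.9, 0.1]]
w_supervised = semi_supervised_class_prior(labels, posteriors, 2, 0.0)
w_unsupervised = semi_supervised_class_prior(labels, posteriors, 2, 1.0)
w_em_default = semi_supervised_class_prior(labels, posteriors, 2, 2 / (3 + 2))
```

Setting `lam` to 0 uses only the labeled counts and setting it to 1 uses only the unlabeled posteriors, matching the purely supervised and purely unsupervised limits described in the text. The last call uses the fixed choice λ = U/(L + U).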

k h i lem). We then track the solution while gradually transforming H(Θ̃. ξ and p(y|x) with respect to model parameters θ y and develop the Jacobian ma- in terms of model parameters Θ.m = p(s1 = m|xi . ‚Θ̇. 4.t. As an alternative. µy . λ) = (1 − λ)(θ y − θ 0 ) + λ(θ y − EM0 (θ y )) = 0 (4) p(st = m. xj ).2. This is a special case of the semi-supervised scenario shown = Φjt′ . In a purely ∂γt. ∂γt. xj ) and Φt. where EM0 is the supervised EM solution obtained using » –» – only labeled texture blocks.m. homotopy method for optimizing the HMT parameters. with λ ∈ [0. (1) may be written in a concise form = γt. whereas the E-step involves evaluating γ. y). listed in Eq. λ) = 0 is for λ = 0. with λ fixed at U/(L + U )) is often unreliable 3.k − γtj′ .[8]. The expressions EM0 (Θ̃) and EM1 (Θ̃) represent the H(Θ.k. found by solving the differential equation EM0 . Note that ‘∼’ in the LHS of the first two expressions denote set of parameter updates. The tangent vector dΘ̃ I − λ∇Θ̃ EM1 (Θ̃) EM0 − EM1 (Θ̃) = 0 (6) [Θ̇ λ̇]T lies in the one-dimensional null space of J (J is always full | {z } dλ rank along the path [9]). cor- during the E-step [6]. along with obtaining an optimal balance λ between the labeled and unlabeled 4. Purely supervised fixed. H(Θ.m.n.y i.[9]. (5) to form the Jacobian J) h i» – with initial conditions λ = 0 and Θ = a.k γt. ξt.k i = φj1.k − γ1. In addition. under first-order approximation. Iterative refinement of the model-parameters responding to the purely supervised HMT model. (1). (3)). (2).k − ξtj′ . we propose the semi-supervised HMT modeling.m. The above equations supervised and supervised EM updates respectively. Note that evaluation of ∇Θ̃ EM1 (Θ̃) involves partial derivatives of γ. (1).m. sparent(t) = where θ 0 represents initialized HMT parameters. sparent(t) = n|st′ = k.n ǫt′ . (1) for λ = 0. along with the corresponding HMT model- The theory of the globally-convergent homotopy method involves parameters representing the textures. 
Note that the above expressions here without deliberating on details regarding partial derivatives of assume λ to be known a priori. σ y }. we only present the final expressions here for brevity: The homotopy method may be employed as an alternative to the EM.k = p(st = m|st′ = k.t′ . As an alternative. We obtain the only present the M-step of the EM algorithm [1] for a particular partial derivatives of the fixed-point expressions (θ y = EM0 (θ y )) iteration. A basic idea of based on the E and M step yields a guaranteed convergence (albeit the homotopy method for HMT parameter optimization is presented a local optima) on parameters Θ. Hence the direction [dΘ̃ dλ]T can be obtained. the ho- unnormalized model parameters which are subsequently normalized motopy function tracks the EM solution as it reaches λ = 1.t. whereas the above track the solution of H(Θ. with a final objective of obtaining transformation function is itself a fixed-point EM (supervised EM) (Θ = Θ∗ .m j supervised environment.n.y where γ1. Semi-supervised HMT modeling data. we apply the homotopy method to obtain an optimal balance λ. We approximate the above expression as EM0 (Θ̃) ≈ a trajectory. as the one-dimensional null-space of the Jacobian 4. generally set to U/(L + U ) for the fixed-point EM updates. y). π̃ y . A simple choice of the y transformation function is where Θ̃ = {wy . j j h based parameter estimation for most generative models. λ) = 0. Note that the above set .m. ǫ̃y . which we present subsequently for the semi-supervised EM modeling [8]. λ).k φjt′ . Analytical derivations of these derivatives are elaborate. respectively (purely supervised and n n n unsupervised EM updates). written in a concise form as we start from an ‘easy’ problem G(Θ) = 0 having a known solution (or roots). Analogous to Eq.m. from which we obtain the direction and next tion. λ) = (1 − λ)(Θ − a) + λF (Θ) = 0. 1] being a scalar parameter.t. 
may be Rather than solving an original difficult problem F (Θ) = 0 directly. where θ y and EM0 (θ y ) j ∂γt.n j data from only a single class is present.t′ .t.k j h i point EM expressions of Eq. λ = 0). wy is irrelevant here since ∂ǫt′ .m. the above expression and the generic homotopy form (see Eq.k = H(θ y .1.k − γtj′ .m. ξ and p(y|x) with respect to model parameters 4. (5) the ‘easy’ problem into the original one.m. θ̃ }C y=1 is an unnormalized version of Θ (LHS of Eq. (1) for λ = 0 and 1. semi-supervised HMT (for any arbitrary λ).n in Eq. ∂γt. estimated at the previous itera.k h i (x ′ − µ ′ )2 − σ ′ represent the LHS and RHS of Eq. HOMOTOPY METHOD [4]. λ) = (1−λ)(Θ̃−EM0 (Θ̃))+λ(Θ̃−EM1 (Θ̃)) = 0. The solution of H(Θ. (1) with λ = 0.m t j t m t m . (1) and (4) have different meanings since they are used for semi- ance σ y ) [8] are not shown here for brevity.k φjt′ .1 to obtain EM0 . The fixed-point equations of a finding zeros or fixed points of nonlinear system of equations [5].m (xt′ j − µt′ m )/σt′ m ∂µt′ m as θ y = EM0 (θ y ) (for texture class ‘y’). Note that the only difference between where a ∈ R and F : R → R is the original system of equa- tions we want to solve. HOMOTOPY METHOD FOR HMT MODELING matrix J. λ̇‚2 = 1. and the direction is chosen to maintain an J acute angle with the previous tangent vector (initial tangent direction is chosen to be Θ̇ = 0 and λ̇ = 1). we gradually that the homotopy starts with a ‘trivial’ solution. a generative model is trained exclusively ∂πm πm on data from one particular class (a particular texture in our prob.l = p(st = k. Starting with (λ = 0. (2) RHS of Eq. θ y = θ 0 ).kj j γt. Supervised HMT modeling {wy . (3) ∂Θ ∂λ λ̇ The homotopy path is obtained by solving the differential equation | {z } J (differentiating Eq. Eq. = j γt. λ) H(Θ. The iterative EM approach to semi-supervised learning of HMT (as shown in Eq. λ = 1) (F (Θ∗ ) = 0). trix J (see Eq. (1)). (2)) is Starting from a ‘trivial’ solution (Θ = a. 
Note that λ used in l|xi . One may use either traditional EM or ∂ ∂ Θ̇ ‚ ‚ the homotopy-based solution proposed in Sec. The update for the Gaussian parameters (mean µy and vari. the supervised HMT expressions may be written in terms of the ∂σt′ m 2(σt′ m )2 transformation function H as where φt. i.
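The path tracking behind the homotopy function H(Θ, λ) = (1 − λ)(Θ − a) + λF(Θ) of Eq. (2) can be illustrated on a scalar toy problem. The sketch below is not the HMT implementation: the choice F(θ) = θ − cos(θ), the starting point a = 0, and the Newton corrector applied after each tangent step are all assumptions made for this example. It does follow the ingredients named in the text: a unit tangent vector in the null space of the Jacobian J = [∂H/∂θ, ∂H/∂λ], an initial direction (θ̇, λ̇) = (0, 1), and a sign chosen to keep an acute angle with the previous tangent.

```python
import math


def H(theta, lam, a=0.0):
    # Homotopy of Eq. (2) with the toy choice F(theta) = theta - cos(theta)
    return (1.0 - lam) * (theta - a) + lam * (theta - math.cos(theta))


def dH_dtheta(theta, lam):
    # (1 - lam) * 1 + lam * (1 + sin(theta))
    return 1.0 + lam * math.sin(theta)


def dH_dlam(theta, a=0.0):
    # -(theta - a) + (theta - cos(theta))
    return a - math.cos(theta)


def unit_tangent(theta, lam, prev, a=0.0):
    # For a 1x2 Jacobian [h_t, h_l], the null space is spanned by (-h_l, h_t)
    h_t, h_l = dH_dtheta(theta, lam), dH_dlam(theta, a)
    norm = math.hypot(h_t, h_l)
    t = (-h_l / norm, h_t / norm)
    # Keep an acute angle with the previous tangent vector
    if t[0] * prev[0] + t[1] * prev[1] < 0.0:
        t = (-t[0], -t[1])
    return t


def track_homotopy(a=0.0, step=0.01):
    theta, lam = a, 0.0          # 'trivial' solution at lambda = 0
    prev = (0.0, 1.0)            # initial tangent: theta_dot = 0, lambda_dot = 1
    while lam < 1.0:
        t = unit_tangent(theta, lam, prev, a)
        theta += step * t[0]     # predictor: piecewise-linear step along the tangent
        lam = lam + step * t[1]
        if lam > 1.0 - 1e-12:
            lam = 1.0            # land exactly on the original problem
        for _ in range(20):      # corrector (assumed): Newton on theta at fixed lambda
            delta = H(theta, lam, a) / dH_dtheta(theta, lam)
            theta -= delta
            if abs(delta) < 1e-12:
                break
        prev = t
    return theta


theta_star = track_homotopy()
```

At λ = 1 the tracked point satisfies F(θ) = 0, that is, θ = cos(θ). In the HMT setting Θ is a vector and the tangent would instead come from the null space of the full Jacobian matrix.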

In addition. n) = PM y . The compu- Classification Error and λ −5530 0. sp(t′ ) = n|st = k.m.2 to unnormalized parameters Θ̃.y ′ ′ ′ to our true objective.l ξt.k.k.m.l ptt. ∂σt′ m 2(σt′ m )2 −5490 1 Traditional EM Classitication Error −5500 Continuation via Homotopy λ 0.m and number of texture classes (the computational cost increases expo- ∂µytm nentially with the number of training classes used). which can be shown as method depends on the step size used for piecewise-linear approxi- mation of the homotopy path.9 ′ t .k.m. For supervised HMT modeling. Each two-level quadtree consisting of a root node ∂πm πm j j and four children nodes (hence R = 5) corresponds to a 4 × 4 tex- ∂ξt. this kink represents a good operating point for λ within subsequently as trained semi-supervised HMT parameters.HL. 2.m. which are related to their normalized −5590 0.m. Accord- path of λ.j − ξtj′ .l ptt. C} ventional EM-based HMT training can also be viewed as a verifica- ∂ π̃m tion of accuracy for the homotopy method. ‘grass’ and ‘wool’ from pub- algorithm [6].Fig. ‘sand’. The corresponding Θ (normalized from Θ̃) are treated ing to [4].y ′ y′ ′ = W xtj − µytm γt. We first obtain the HMT parameters trained only on the ized parameters Θ̃. (6)) requires partial derivatives with respect 0. 2 we also plot the probabil- .n − γp(t).k.l h ′ i tural image block. RESULTS of a particular texture class y.k. ′ = W ξt.g.01 along the direction of λ. n) point found by EM and homo.. Each texture of size 512×512 pixels is subjected to a j j two-level wavelet decomposition with Haar wavelets yielding 16384 ∂ξt. i. In our Gaussian mixture model-based = p(y ′ |xj ) [δy.e.m. from which we obtain labeled samples and treat them as a starting point (EM0 at λ = 0) ˙ λ̇]T and update the model parameters Θ̃ for our semi-supervised HMT modeling via homotopy. j i (x ′ − µ ′ )2 − σ ′ ∂ξt.p(t) (m. n) πm = PM m y and ǫyt.j are straightforward but not shown −5550 0. 
Evolution of classifica- n=1 π̃n k=1 ǫ̃t.m.l h ′ i (x ′ − µ ′ ) both the EM and homotopy method. semi-supervised learning of HMTs. textures).7 and pt.m.l .tion error and λ as a function of topy method..m. we obtain the Jacobian J. although the method is applicable to an arbitrary j. xj ).m. For both ∂f ∂f P ∂f P = − n ψn / Mn=1 ψ̃n . sp(t) = l. or LH) 5.k.k.k. evaluated directly using the upward-downward step of the HMT-EM Three different textures. continuity around λ = 0.m.y ′ − p(y|xj )] /wy = W/wy HMT algorithm.j − γtj′ .y ′ j.p(t) (k.k.n ǫyt.l and ptt.5 here due to length constraints. and we therefore proceed ∂p(y|xj ) h i j. Note that the same random pa- j t j t m = ξt.6 t′ . ψ ∈ {π. convergence is guaranteed (albeit to a local optimum).3 −5570 −5580 0. We randomly ∂p(y|xj ) ′ ′ ′ ′ choose L labeled wavelet quadtrees (with known underlying tex- y′ = W [(xtj − µytm )2 − σtm y j.m. One can derive analytical expressions for We present here a quantitative performance analysis of the proposed φ and Φ in terms of training data and HMT parameters.m /σt. e.m y /2(σtm )2 ∂σtm ture) from two textures and U unlabeled ones (each unlabeled texture block or wavelet quadtree is chosen randomly from one of the two Given the partial derivative of EM1 (Θ̃) with respect to unnormal. but these are homotopy method for parameter estimation of both supervised and not shown here due to space constraints.n ∂ǫt′ .j − γtj′ . 2 the directional vector [Θ̃ we plot the evolution of λ starting at 0 and observe a sharp dis- and balance parameter λ.m rameter initialization is used for both methods.n (7) ∂ǫ̃yt.n We present the semi-supervised HMT performance analysis with ∂p(y|xj ) “ ′ ” C = 2 classes. xj ) −5510 0. the semi-supervised classifier. and the derivatives of an arbitrary function f with respect to unnor- malized parameters may be written as models as a function of iteration number (the explicit HMT parame- „ « ters are also very close) for the ‘sand’ image with L = 30. 
we randomly = ptt. ∂ ψ̃m ∂ψm ∂ψn although the homotopy method is approximately 20% slower than its EM counterpart for the chosen example (although both algorithms Now the only partial derivative remains to be specified is p(y|xj ) are very fast).m .n. to be chosen specific to the nonlinear ∂p(y|xj ) ′ ′ system of equations at hand.l j h ′ t j t m t m Figure 1 shows the convergence of expected log-likelihood for both = ξt.n ǫt′ . the rate of convergence for the homotopy with respect to Θ̃.l .y ]γt.j where pt.k.l .1 counterpart as −5600 0 10 20 30 40 50 60 70 80 90 0 0 20 40 60 80 100 120 140 Iteration index Iterations y π̃ y ǫ̃yt.j ′ −5540 tations of pt. y ′ ∈ {1. The terms γ and ξ can be semi-supervised HMT models in the context of texture classification. This comparison to con- y′ = W γ1. Note that computation of the Jaco.n.l = p(st′ = m. ǫ}.m. Similarly one can show that licly available “Brodatz” textural images [10] are chosen for subse- quent analyses.p(t) (m.m wavelet quadtrees. · · · .l ξt.m. algorithms. Fig.l − γ1.m.l h j i = φ1. In Fig. but these symbols are suppressed to avoid cluttering notation.n choose L samples from each texture and train the HMT model using j ∂ξt. 1.7 while training with ‘sand’ and ‘’grass’ erate the above procedure until we reach the first local optima in the image with L = 20 (10 from each texture) and U = 400.l . −5560 0.l = p(st′ = m|st = k. iteration index.p(t).4 bian matrix J (see Eq.j 0. and we observe that ∂µt′ m σt ′ m both algorithms converge to the same estimated HMT parameters. In Fig.m − πm /π̃m y.k. Comparison of the fixed.k.n /ǫ̃yt.of expressions is specific to a particular subband (HH.8 Log−Likelihood of training data −5520 t′ . Starting with Θ̃ ˙ = 0 and λ̇ = 1.k. sp(t) = l.y ′ y′ i y′ step size of 0.k.k. we it. we typically observe smooth convergence using a ∂wy ′ ∂p(y|xj ) h j.n.
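The quadtree counts quoted for the experiments (a 512×512 texture, a two-level decomposition, R = 5 nodes per tree, 4×4 blocks, 16384 quadtrees) follow from simple arithmetic, which the sketch below reproduces. The function name and return layout are assumptions for illustration. It relies only on the statements that each quadtree covers a 2^ν × 2^ν block and that each node has four children.

```python
def quadtree_layout(image_size, levels):
    """Return (block_size, num_blocks, nodes_per_tree) for a square image.

    image_size : side length of the image in pixels
    levels     : number of wavelet decomposition levels (nu)
    """
    block_size = 2 ** levels                 # each quadtree covers a 2^nu x 2^nu block
    if image_size % block_size != 0:
        raise ValueError("image size must be a multiple of the block size")
    num_blocks = (image_size // block_size) ** 2
    # Root plus four children per node at each finer level: 1 + 4 + ... + 4^(nu-1)
    nodes_per_tree = (4 ** levels - 1) // 3
    return block_size, num_blocks, nodes_per_tree


layout = quadtree_layout(512, 2)
```

For the setting in the text, `layout` is `(4, 16384, 5)`: 4×4 blocks, 16384 of them, and five nodes per quadtree.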

Nowak. Using 20 labeled and 400 unlabeled data. ral images from a publicly available image database [10].” Tech. G.. There are sev- . N. 6. Royal Statistical Society B. 761–780. “Theory of globally convergent probability-one The major contributions of this paper are: 1) development of homotopies for nonlinear programming. model as an alternative to the EM algorithm. homotopy based semi-supervised training is much higher than its size that one would not have access to this curve in practice. and T. Comput. 2000. utilizing both labeled and unlabeled image blocks.. Watson. 39. 4. 3. as a function of the eral areas of interest for future research. proper balance between labeled and unlabeled data. 2002. We observe superior classification performance using Machine Learning.” IEEE Trans. J. Note that we perform the classification task only on the U [2] M. Crouse. rameter space until a phase transition is manifested. Baraniuk.. Univer- over ten different runs with fixed number of labeled and unlabeled sity of Edinburgh. Institute for Adaptive and Neural Computation. REFERENCES kink encountered in the homotopy path of λ. Figure 3(c) shows the variation of λ. Thrun.75 Semi−supervised via Continuation Semi−supervised via Continuation 60 60 λ obtained via Continuation Semi−supervised via EM Semi−supervised via EM 0.” IEEE Trans. Seeger. In addition. homotopy-based semi-supervised HMT modeling in comparison to [4] A. and D.85 70 70 0. 1978. Optim. pp.” in Proceedings of the Eigh- significantly superior to the supervised algorithm. respectively using performance of semi-supervised algorithms using both the EM and Matlab on a 2. D. data are shown using errorbars in both Figs. ‘grass’ topy is O(n3 ) (n being the the size of training data).” of a semi-supervised HMT model. Edinburgh. “Context-based graphical model- starts from a purely supervised solution and then tracks in the pa- ing for wavelet domain signal processing. 886–902. 2000. Inoue and N. 
semi- supervised HMM is indeed minimized about the point for which the supervised GHz Pentium machine. vol. However. 2 minutes and 100 minutes of CPU time.8 65 65 0. our experiments that the initial such transition is indicative of a good balance between the labeled and unlabeled data. 485–488. which is too and ‘sand’ vs. Chow. vol. albeit significant performance gain.95 80 80 Percentage of correct classification Percentage of correct classification 0.” in ICASSP. Mallet-Paret. 12. 3. incremental learning procedure is available for new unlabeled data. ture). 32. along with its variation presented sification from labeled and unlabeled documents using em. 1998. Results are presented for real textu. Dempster.65 20 30 40 50 60 70 80 20 30 40 50 60 70 80 20 30 40 50 60 70 80 Number of labeled training data (L) Number of labeled training data (L) Number of labeled training data (L) Fig. T.” in errorbars. “Maximum likelihood U = 400 unlabeled data is presented as a function of the number from incomplete data via the em algorithm. “Finding zeros of maps: Homotopy methods that are constructive with probabil- We apply the homotopy method to HMT model-parameter estima. The variation in classification performance Rep. it is interesting to note that the probability of error for the semi. and R. “Continuation methods for its EM counterpart. uses a fixed balance (λ = U/(L + U )). Jaakkola. 85 85 1 (a) (b) (c) 0. on PAMI. (c) Variation of optimal balance (λ) averaged over 10 runs as a function of number of labeled data for ‘sand’ vs.” Siam J. pp. no classifier scoring obviously requires access to the labels. since EM counterpart. ‘wood’. Mccallum. Yorke. Ueda. based statistical signal processing using hidden markov mod- rithm is focused on determining the proper balance between these els. 25.usc. with automatic estimation of the http://sipi. and the homotopy method take approximately 10 kink in λ is manifested.. A. 3. 1570–1581. Rubin. The homotopy method [7] N. 39.” Math. 
It is important to empha. Laird. In Fig. obtained via the homotopy method [3] K. ‘grass’. 3(a) and 3(b) we summarize the seconds. The complexity of homo- homotopy method using two sets of texture pairs. “Wavelet- tion. the supervised-EM. UK. 1–38. Mitchell.9 Average Balance ( λ ) 75 75 0. no.7 fixed λ =U/(L+U) for EM Supervised via EM Supervised via EM 55 55 0. Carin.(a) performance for ‘sand’ vs ‘grass’ texture. 135–167. tion in the context of wavelet-based texture analysis and classifica. respectively. and J. teenth Annual Conference on Uncertainty in Artificial Intelli- gence (UAI). [9] L. We observe in vol. Computation cost of the same evolution index (iteration number). A. “Exploitation of unlabeled sequences homotopy method as an alternative to the widely used EM algorithm in hidden markov models. 2003. vol. 2003. accounting for the fact that unlabeled data are typi. Dasgupta and L. S. unlabeled image blocks. for supervised HMT modeling. cally far more plentiful than labeled data. Texture classification performance using homotopy. CONCLUSIONS [5] S. The traditional EM algorithm slow for many applications. 46. two data sets. no. [6] M. “Text clas- and averaged over different runs. S. with L/2 samples from each tex. Corduneanu and T. R. whereas we estimate the optimal λ in the homotopy method that corresponds to the first sharp 7. 2000. supervised-EM and semi-supervised EM method for a texture pair as a function of number of labeled data (L). ‘sand’ vs. the homotopy method for purely supervised training of the HMT 11. ity of classification error on the unlabeled data. pp. vol. Proc. no. (b) performance for ‘sand’ vs ‘wood’. “Learning with labeled and unlabeled data. on Sig. vol. vol. We also used the [8] M. 3(a) and 3(b). The algo. 1977. with fixed number (U = 400) of unlabeled data. pp. 887–899. ity one.” Journal of the of labeled training samples (L. pp. In both figures the per- centage classification (averaged over 10 different runs) on a set of [1] A. pp. 
while both of the semi-supervised algorithms are mixing heterogeneous sources. pp. Nigam. N. and 2) development [10] “University of southern california image database.