
MACHINE LEARNING II

UNSUPERVISED AND SEMI-SUPERVISED LEARNING


JUN.-PROF. DR. SEBASTIAN PEITZ
Summer Term 2022

Unsupervised and semi-supervised learning

• Until now, our data has always been labeled:


$$\mathcal{D} = \big(\boldsymbol{x}^{(i)}, y^{(i)}\big)_{i=1}^{N}$$
• Everything until now has been supervised learning, as – during training – we can tell our learning algorithm $\mathcal{A}$ for each sample what the outcome should be and whether the prediction $h\big(\boldsymbol{x}^{(i)}\big)$ was correct or not
• However, in many situations, we do not necessarily have labels…
• … or maybe just for some of the samples
• Think about the effort of an expert having to label a gigantic number of
images / documents / …
• Side note: The ImageNet database for visual recognition (> 14 million images)
has had a massive impact on the advances of modern ML techniques!
• In unsupervised learning, our training data is $\mathcal{D}_U = \big(\boldsymbol{x}^{(i)}\big)_{i=1}^{N}$,
where the index $U$ indicates the absence of labels.
• In semi-supervised learning, our training data consists of both labeled
and unlabeled data:
$$\mathcal{D}_L = \big(\boldsymbol{x}^{(i)}, y^{(i)}\big)_{i=1}^{L}, \qquad \mathcal{D}_U = \big(\boldsymbol{x}^{(i)}\big)_{i=L+1}^{N}$$


Unsupervised learning

• If we have no labels, what can we hope to find / learn?


→ Patterns in the data! E.g., clusters.

• These patterns allow for classification of new samples


• Example: the identification of customer preferences in social networks

• If we have previously labeled some elements from a cluster, then we can easily label new samples as well:
• Categorization/labeling of movies

[Figure: Old Faithful eruption data – $x_1$: time to next eruption [min], $x_2$: duration of the eruption [min]]

• Central question: what are important features in unsupervised learning?



Feature selection

• What distinguishes two classes in a high-dimensional feature space?


• Example: “Cat vs. dog” images ($32 \times 32 = 1024$ pixels → $\boldsymbol{x} \in \mathbb{R}^{1024}$)

• Remember: All real-world data possesses a massive amount of structure!
→ There should be some lower-dimensional latent
variable which allows us to distinguish between
two classes just as well
→ Principal components, Fourier modes, …


Feature selection – Principal Component Analysis / Singular Value Decomposition (1/5)

• In the following data set $\mathcal{D} = \big(\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\big) = \boldsymbol{X}$, how many features do we need
to (approximately) describe the data?
→ If we perform a coordinate transform, one direction is clearly more important in
characterizing the structure of 𝑿 than the second one!
→ This is nothing else but representing the same data in a different coordinate
system: Instead of using the standard Euclidean basis
$$\boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1 \boldsymbol{e}_1 + x_2 \boldsymbol{e}_2 = x_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + x_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix},$$
we can use a basis $\boldsymbol{U}$ tailored to the data:
$$\boldsymbol{x} = a_1 \boldsymbol{u}_1 + a_2 \boldsymbol{u}_2$$
• Which properties should such a new basis 𝑼 have?
• It should be orthonormal: $\boldsymbol{u}_i^\top \boldsymbol{u}_j = \delta_{i,j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{else} \end{cases}$
• For every dimension 𝑟, it should have the smallest approximation error:
$$\boldsymbol{U} = \underset{\hat{\boldsymbol{U}} \text{ s.t. } \operatorname{rank}(\hat{\boldsymbol{U}}) = r}{\arg\min} \; \sum_{i=1}^{N} \bigg\| \boldsymbol{x}_i - \sum_{j=1}^{r} \big(\boldsymbol{u}_j^\top \boldsymbol{x}_i\big)\, \boldsymbol{u}_j \bigg\|^2 \quad\Longleftrightarrow\quad \boldsymbol{X}_r = \underset{\hat{\boldsymbol{X}} \text{ s.t. } \operatorname{rank}(\hat{\boldsymbol{X}}) = r}{\arg\min} \; \big\| \hat{\boldsymbol{X}} - \boldsymbol{X} \big\|_F$$

Feature selection – Principal Component Analysis / Singular Value Decomposition (2/5)

• Due to the famous Eckart-Young theorem, we know that the solution to this optimization problem can be
obtained using a very efficient tool from linear algebra: the Singular Value Decomposition (SVD)
$$\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^*$$
• This is a product of three matrices. If $\boldsymbol{X} \in \mathbb{C}^{n \times N}$ (one data point per column), then $\boldsymbol{U} \in \mathbb{C}^{n \times n}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{n \times N}$ and $\boldsymbol{V} \in \mathbb{C}^{N \times N}$ ($\boldsymbol{V}^*$ is the conjugate transpose of $\boldsymbol{V}$)
• The matrices have many favorable properties:
• $\boldsymbol{U} = (\boldsymbol{u}_1, \dots, \boldsymbol{u}_n)$ and $\boldsymbol{V} = (\boldsymbol{v}_1, \dots, \boldsymbol{v}_N)$ are unitary matrices (column-wise orthonormal):
$\boldsymbol{u}_i^\top \boldsymbol{u}_j = \delta_{i,j}$ and $\boldsymbol{v}_i^\top \boldsymbol{v}_j = \delta_{i,j}$
• $\boldsymbol{\Sigma} = \begin{pmatrix} \hat{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{pmatrix}$ is a diagonal matrix with diagonal entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_N \ge 0$: the singular values
• Since the last $n - N$ rows (assuming that $n > N$) are zero, we have the following economy version:
$$\boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^* = \begin{pmatrix} \hat{\boldsymbol{U}} & \hat{\boldsymbol{U}}_\perp \end{pmatrix} \begin{pmatrix} \hat{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{pmatrix} \boldsymbol{V}^* = \hat{\boldsymbol{U}} \hat{\boldsymbol{\Sigma}} \boldsymbol{V}^*$$
• Since the columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ all have unit length, the relative importance of a particular column of $\boldsymbol{U}$ is encoded
in the singular values:
$$\boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^* = \sum_{i=1}^{N} \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^* = \sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^* + \dots + \sigma_N \boldsymbol{u}_N \boldsymbol{v}_N^*$$
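The following minimal NumPy sketch (not part of the slides; a random matrix stands in for real data) verifies the factorization, the economy version and the rank-1 sum for the case $n > N$ assumed above:

```python
import numpy as np

# Minimal sketch: check the SVD properties on a random "tall" data matrix (n > N).
n, N = 200, 30                      # n features per sample, N samples (columns)
X = np.random.randn(n, N)

U, s, Vh = np.linalg.svd(X)                                   # full SVD: U (n x n), s (N,), Vh (N x N)
U_hat, s_hat, Vh_hat = np.linalg.svd(X, full_matrices=False)  # economy SVD: U_hat (n x N)

Sigma = np.zeros((n, N))
Sigma[:N, :N] = np.diag(s)                    # Sigma = [diag(s); 0]

print(np.allclose(U @ Sigma @ Vh, X))         # X = U Sigma V*
print(np.allclose(U_hat * s_hat @ Vh_hat, X)) # X = U_hat Sigma_hat V*
print(np.allclose(sum(s[i] * np.outer(U[:, i], Vh[i]) for i in range(N)), X))  # rank-1 sum
```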

Feature selection – Principal Component Analysis / Singular Value Decomposition (3/5)

• Do we need all columns of 𝑼 to reconstruct the matrix 𝑿?


• What if we are willing to accept a certain error?
→ Truncate 𝑼 after 𝑟 columns!
$$\tilde{\boldsymbol{X}} = \tilde{\boldsymbol{U}} \tilde{\boldsymbol{\Sigma}} \tilde{\boldsymbol{V}}^* \approx \boldsymbol{X}$$
→ The Eckart-Young theorem says that this is the
best rank-𝑟 matrix we can find
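A minimal sketch of the truncation step, again on synthetic data: the Frobenius error of the rank-$r$ approximation matches the Eckart–Young value $\sqrt{\sum_{i>r} \sigma_i^2}$:

```python
import numpy as np

# Minimal sketch: truncate the SVD after r columns and compare the reconstruction
# error with the smallest achievable Frobenius error (Eckart-Young).
X = np.random.randn(200, 30)
r = 5

U, s, Vh = np.linalg.svd(X, full_matrices=False)
X_r = U[:, :r] @ np.diag(s[:r]) @ Vh[:r]      # best rank-r approximation of X

err = np.linalg.norm(X - X_r, "fro")
print(err, np.sqrt(np.sum(s[r:] ** 2)))       # both values coincide
```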


Feature selection – Principal Component Analysis / Singular Value Decomposition (4/5)

• The same can be done with high-dimensional data


• Example: Yale Faces B database ($192 \times 168 = 32256$ pixels, 2414 images → $\boldsymbol{X} \in \mathbb{R}^{32256 \times 2414}$)

[Figure: The 2414 images are flattened/reshaped into the columns of $\boldsymbol{X}$ (only the first 1000 rows shown); the economy SVD $\boldsymbol{X} = \hat{\boldsymbol{U}}\hat{\boldsymbol{\Sigma}}\boldsymbol{V}^*$ yields the first 16 eigenfaces $\tilde{\boldsymbol{U}} \in \mathbb{R}^{32256 \times 16}$ and a low-rank reconstruction of the faces.]


Feature selection – Principal Component Analysis / Singular Value Decomposition (5/5)

• Now let’s try to distinguish two individuals from the database by projecting onto two modes:
$$\mathrm{PC}_5(\boldsymbol{x}_i) = \boldsymbol{x}_i^\top \boldsymbol{u}_5, \qquad \mathrm{PC}_6(\boldsymbol{x}_i) = \boldsymbol{x}_i^\top \boldsymbol{u}_6$$
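A hedged sketch of this projection with synthetic stand-in data (the Yale images themselves are not included here); the resulting coordinates $\mathrm{PC}_5$ and $\mathrm{PC}_6$ can then be plotted or fed to a classifier:

```python
import numpy as np

# Minimal sketch: project each sample onto the SVD modes u_5 and u_6.
rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 1.0, size=(1024, 60))    # "person A": 60 images with 1024 pixels each
X_b = rng.normal(0.5, 1.2, size=(1024, 60))    # "person B": 60 images with 1024 pixels each
X = np.hstack([X_a, X_b])                      # columns are the flattened images

U, s, Vh = np.linalg.svd(X, full_matrices=False)
PC5 = X.T @ U[:, 4]                            # PC5(x_i) = x_i^T u_5 (0-based column index 4)
PC6 = X.T @ U[:, 5]                            # PC6(x_i) = x_i^T u_6
print(PC5.shape, PC6.shape)                    # one 2D coordinate pair per image
```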


Feature selection – Fourier transform / Fast Fourier Transform (1/3)

• Consider a function 𝑓(𝑥) that is piecewise smooth and 2𝜋-periodic. Any function of this class can be
expressed in terms of its Fourier transform:
$$f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \big( a_k \cos(kx) + b_k \sin(kx) \big) = \sum_{k=-\infty}^{\infty} c_k e^{ikx}, \qquad \text{with } e^{ikx} = \cos(kx) + i \sin(kx)$$
• The (real) coefficients are given by
$$a_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(kx)\, dx, \qquad b_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(kx)\, dx$$
• This is nothing else but representing 𝑓 𝑥 in terms of an
orthogonal basis: the Fourier modes cos 𝑘𝑥 and sin 𝑘𝑥
• Closely related to the SVD basis transform, only that 𝑓 𝑥
is not a vector, but an infinite-dimensional function
• Instead of point-wise data, the Fourier modes contain
global information over the entire domain.


Feature selection – Fourier transform / Fast Fourier Transform (2/3)

• The Fourier transform can be adapted to vectors using the Discrete Fourier Transform (DFT) or its highly
efficient implementation: the Fast Fourier Transform (FFT)
$$\boldsymbol{x} \in \mathbb{R}^n \;\longrightarrow\; \boldsymbol{c} \in \mathbb{C}^n$$
• The entries of $\boldsymbol{c}$ are the complex Fourier coefficients of increasing frequency ($\omega_k = \frac{k\pi}{L}$)
• In 2D: First in one direction, then in the second direction
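A minimal NumPy sketch (assuming a $2\pi$-periodic signal sampled on $n$ points; note that NumPy reports frequencies in cycles per unit length, so the $\omega_k$ convention above has to be adapted to the domain length):

```python
import numpy as np

# Minimal sketch: DFT of a sampled signal via the FFT.
n = 256
x_grid = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
f = np.sin(3 * x_grid) + 0.5 * np.cos(7 * x_grid)

c = np.fft.fft(f) / n                              # complex Fourier coefficients
freqs = np.fft.fftfreq(n, d=x_grid[1] - x_grid[0]) # frequency grid (cycles per unit length)
print(np.abs(c).max(), freqs[np.argmax(np.abs(c))])  # dominant coefficient and its frequency
```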


Feature selection – Fourier transform / Fast Fourier Transform (3/3)

• Very powerful compression technique (the closely related discrete cosine transform is the basis of the classical JPEG standard)
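A minimal sketch of the compression idea with the FFT on a synthetic image (not the actual JPEG pipeline, which additionally uses blocking, quantization and entropy coding): keep only the largest coefficients and transform back.

```python
import numpy as np

# Minimal sketch: compress by keeping only the largest 2% of the 2D Fourier coefficients.
rng = np.random.default_rng(0)
img = rng.random((128, 128))
img = np.cumsum(np.cumsum(img, axis=0), axis=1)   # smooth-ish synthetic test image

C = np.fft.fft2(img)
thresh = np.quantile(np.abs(C), 0.98)             # magnitude threshold for the top 2%
C_sparse = np.where(np.abs(C) >= thresh, C, 0.0)  # discard all smaller coefficients

img_rec = np.real(np.fft.ifft2(C_sparse))
rel_err = np.linalg.norm(img - img_rec) / np.linalg.norm(img)
print(rel_err)                                    # small error despite heavy truncation
```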


Feature selection in latent variables (1/3)

• Let’s consider the cats and dogs example once more:

• These are the first four SVD modes

• Alternative feature identification method: first four modes according to the following procedure (a sketch of this pipeline follows below)
• Transform images $\boldsymbol{X}$ to the wavelet domain $\boldsymbol{C}$ (think of this as a hierarchical version of the Fourier transform)
→ This is the basis of the JPEG 2000 compression standard
• Perform an SVD on the wavelet/Fourier coefficients
→ basis $\boldsymbol{U}_{\boldsymbol{C}}$ for the space of wavelet/Fourier coefficients
• Inverse wavelet/Fourier transform of the basis to the original space → $\boldsymbol{U}_{\boldsymbol{X}}$
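A minimal sketch of this pipeline, assuming the PyWavelets package (`pywt`) is available and using random stand-in images instead of the cats-and-dogs data; a single-level Haar transform replaces the full hierarchical decomposition for brevity:

```python
import numpy as np
import pywt  # PyWavelets (assumption: installed separately)

# Minimal sketch: wavelet transform -> SVD in coefficient space -> inverse transform of a mode.
rng = np.random.default_rng(0)
images = rng.random((50, 32, 32))                     # 50 stand-in images, 32 x 32 pixels

coeff_vecs = []
for img in images:
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")         # single-level 2D Haar transform
    coeff_vecs.append(np.concatenate([c.ravel() for c in (cA, cH, cV, cD)]))
C = np.stack(coeff_vecs, axis=1)                      # wavelet coefficients, one column per image

U_C, s, Vh = np.linalg.svd(C - C.mean(axis=1, keepdims=True), full_matrices=False)

# Map the first coefficient-space mode back to the original (pixel) space.
parts = np.split(U_C[:, 0], 4)                        # cA, cH, cV, cD blocks of equal size
mode_img = pywt.idwt2((parts[0].reshape(16, 16),
                       tuple(p.reshape(16, 16) for p in parts[1:])), "haar")
print(mode_img.shape)                                 # (32, 32): first mode U_X in image space
```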


Feature selection in latent variables (2/3)

[Figure: the first modes shown in the original space and in the wavelet space]


Feature selection in latent variables (3/3)

[Figure: the cat and dog images in the selected latent coordinates (labels: Dogs, Cats)]

Unsupervised learning – K-means clustering (1/4)

• Now let’s assume that we only have unlabeled data $\mathcal{D} = \big(\boldsymbol{x}^{(i)}\big)_{i=1}^{N}$, $\boldsymbol{x}^{(i)} \in \mathbb{R}^n$
• We would like to separate the data into $K$ clusters in an optimal way, represented by a set of $K$ prototype
vectors $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K \in \mathbb{R}^n$
• Which parameters do we have to optimize?
→ The prototypes as well as the assignment of data to the clusters:
$$\min_{\boldsymbol{\mu}, \boldsymbol{r}} \; E = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \, \big\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \big\|^2$$
• $\boldsymbol{r}$ is a matrix of binary variables ($r_{ik} \in \{0,1\}$), where the first index refers to the data point and the second
to the cluster – exactly one entry per row is one: $\sum_{k=1}^{K} r_{ik} = 1$ for all $i \in \{1, \dots, N\}$
→ We assign each data point to precisely one cluster and then seek to minimize the distance of all points
within a cluster 𝑘 to their prototype 𝝁𝑘
• Which norm for the distance? → Depends!
• Euclidean: $\|\boldsymbol{x}_i - \boldsymbol{\mu}_k\|_2$
• Squared Euclidean: $\|\boldsymbol{x}_i - \boldsymbol{\mu}_k\|_2^2$
• Manhattan: $\|\boldsymbol{x}_i - \boldsymbol{\mu}_k\|_1$
• Maximum distance: $\|\boldsymbol{x}_i - \boldsymbol{\mu}_k\|_\infty$
• Mahalanobis distance: $(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}_k)$ with the covariance matrix $\boldsymbol{\Sigma}$

Unsupervised learning – K-means clustering (2/4)

• How do we solve the optimization problem to minimize $E = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \|\boldsymbol{x}_i - \boldsymbol{\mu}_k\|_2^2$?

• Alternate between 𝒓 and 𝝁


• Assignment $\boldsymbol{r}$: with $\boldsymbol{\mu}_k$ fixed, $E$ decomposes into the individual contribution of each data point:
$$r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\boldsymbol{x}_i - \boldsymbol{\mu}_j\|_2^2 \\ 0 & \text{otherwise} \end{cases}$$
• Prototypes $\boldsymbol{\mu}_k$: with $\boldsymbol{r}$ fixed, this is a weighted least squares regression problem:
$$2 \sum_{i=1}^{N} r_{ik} \big( \boldsymbol{x}_i - \boldsymbol{\mu}_k \big) = 0 \quad\Longleftrightarrow\quad \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} r_{ik}\, \boldsymbol{x}_i}{\sum_{i=1}^{N} r_{ik}}$$
→ This is the mean over all 𝒙𝑖 belonging to cluster 𝑘
• Repeat the two steps until there are no re-assignments
• Does this algorithm converge?
→ Yes, because a reduction of the objective function is guaranteed by design

• However, we have to be aware that the solution can be a local minimum
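A minimal NumPy implementation sketch of the two alternating steps (squared Euclidean distance, data points stored as rows); it is not the lecture's reference code:

```python
import numpy as np

# Minimal sketch of K-means: alternate assignment and prototype update until convergence.
def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]               # initial prototypes
    for _ in range(n_iter):
        # Assignment step: each point goes to the closest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # (N, K) squared distances
        labels = d2.argmin(axis=1)
        # Update step: each prototype becomes the mean of its assigned points.
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                                  # no more changes -> converged
            break
        mu = new_mu
    return mu, labels

# Toy example: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
mu, labels = kmeans(X, K=2)
print(mu)
```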


Unsupervised learning – K-means clustering (3/4)

• Example: Old Faithful


Unsupervised learning – K-means clustering (4/4)

• How to choose the number of clusters?


→ Depends on the data! Oftentimes, multiple runs with varying $K$ are required

Example: Image segmentation by clustering the RGB values of the pixels of an image ($N$ pixels/points, $\boldsymbol{x}_i \in [0,1]^3$, $i = 1, \dots, N$); see the sketch below
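A hedged sketch of such a segmentation using scikit-learn's `KMeans` on a synthetic two-color image (any real image reshaped to an $N \times 3$ array of RGB values would work the same way):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

# Minimal sketch: cluster pixel colors and replace every pixel by its cluster prototype.
rng = np.random.default_rng(0)
img = np.zeros((60, 60, 3))
img[:, :30] = [0.9, 0.2, 0.1]                       # "red" region
img[:, 30:] = [0.1, 0.3, 0.8]                       # "blue" region
img += 0.05 * rng.standard_normal(img.shape)        # some noise

pixels = img.reshape(-1, 3)                         # N x 3 data matrix of RGB values
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
segmented = km.cluster_centers_[km.labels_].reshape(img.shape)
print(segmented.shape)                              # same shape as the input image
```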


Unsupervised learning – Dendrogram

• Another approach to identify clusters is via a hierarchical, tree-based approach → the Dendrogram
• A cloud of points is clustered / separated one by one, until some threshold is reached
• Divisive approach (top-down):
• All points are contained in a single cluster
• The data is then recursively split into smaller and smaller clusters
• The splitting continues until the algorithm stops according to a user-specified objective
• The divisive method can split the data until each data point is its own node
• Agglomerative approach (bottom-up):
• Each data point 𝑥𝑗 is its own cluster initially.
• The data is merged in pairs as one creates a hierarchy of clusters.
• The merging of data eventually stops once all the data has been merged into a single cluster
• How can we do this? → Greedy approach!


Unsupervised learning – Dendrogram

• Algorithm:
1. Compute the distance (Euclidean, Manhattan, …) between all points: $d(\boldsymbol{x}_i, \boldsymbol{x}_j)$, $i, j \in \{1, \dots, N\}$
2. Merge the closest two data points into a single new data point midway between their original locations
3. Repeat the calculation with the new 𝑁 − 1 points
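A minimal sketch of this greedy procedure in NumPy (for real use, `scipy.cluster.hierarchy.linkage`/`dendrogram` provide the standard implementation and plotting):

```python
import numpy as np

# Minimal sketch: repeatedly merge the closest pair of points into their midpoint and
# record the merge distances (which are what a dendrogram visualizes).
def greedy_merge(points):
    pts = [p.astype(float) for p in points]
    merge_distances = []
    while len(pts) > 1:
        # Step 1: pairwise distances of the current points.
        best, best_d = None, np.inf
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                d = np.linalg.norm(pts[i] - pts[j])
                if d < best_d:
                    best, best_d = (i, j), d
        # Step 2: merge the closest pair into a single new point midway between them.
        i, j = best
        merged = 0.5 * (pts[i] + pts[j])
        pts = [p for k, p in enumerate(pts) if k not in (i, j)] + [merged]
        merge_distances.append(best_d)
    return merge_distances

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])
print(greedy_merge(list(X))[-3:])   # the last merges bridge the two clusters -> large distances
```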


Unsupervised learning – Dendrogram

• Example: Two Gaussian distributions with 50 points each, Euclidean distance: $d(\boldsymbol{x}_i, \boldsymbol{x}_j) = \|\boldsymbol{x}_i - \boldsymbol{x}_j\|_2$


Unsupervised learning – Mixture models (1/4)

• Can we also try to find a probabilistic model for our data? This seems to be natural, as noise is often
present in measurements.
• Consider the Old Faithful dataset once more.
• Can we model this using, say, a Gaussian distribution?
• What about a superposition of multiple Gaussians?
• This leads to mixture models (or – if we consider Gaussians – Gaussian Mixture Models (GMMs))
$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\big)$$
• These can in general approximate highly complex densities to arbitrary precision
[Figure: Old Faithful data – $x_1$: time to next eruption [min], $x_2$: duration of the eruption [min]]



Unsupervised learning – Mixture models (2/4)

• The coefficients $\pi_k$ are called mixing coefficients. If both $p(\boldsymbol{x})$ and the individual
Gaussians are normalized, then a simple integration yields $\sum_{k=1}^{K} \pi_k = 1$
• In addition, the requirement $p(\boldsymbol{x}) \ge 0$ implies $\pi_k \ge 0$ for all $k$ → $0 \le \pi_k \le 1$

• Using the sum and product rule, we can also write the mixture density as follows:
$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \underbrace{p(k)}_{\pi_k} \, \underbrace{p(\boldsymbol{x} \mid k)}_{\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$$

• Via Bayes’ theorem, we thus get access to the posterior probability of 𝑘 given 𝒙,
a.k.a. responsibility:
$$\gamma_k(\boldsymbol{x}) = p(k \mid \boldsymbol{x}) = \frac{p(k)\, p(\boldsymbol{x} \mid k)}{p(\boldsymbol{x})}$$

• This responsibility $\gamma_k$ can be used to infer a cluster membership: Given a new sample $\boldsymbol{x}$, which cluster has the highest responsibility for this sample?


Unsupervised learning – Mixture models (3/4)

• In an entirely Bayesian approach, we thus need to learn the parameters 𝝁𝑘 and 𝚺𝑘 of the individual
Gaussian distributions as well as the mixing coefficients 𝜋𝑘 .
• As these can themselves be seen as random variables, let us introduce a corresponding $K$-dimensional
latent state $\boldsymbol{z}$ in the form of a 1-of-$K$ representation: $z_k \in \{0,1\}$, $\sum_{k=1}^{K} z_k = 1$.
→ $\boldsymbol{z}$ can be in $K$ different states.
→ $p(z_k = 1) = \pi_k$ and $\sum_{k=1}^{K} p(z_k = 1) = 1$
• The probability of a specific latent variable $\boldsymbol{z}$ and a specific sample $\boldsymbol{x}$ given $\boldsymbol{z}$ is thus
$$p(\boldsymbol{z}) = \prod_{k=1}^{K} \pi_k^{z_k} \qquad \text{and} \qquad p(\boldsymbol{x} \mid \boldsymbol{z}) = \prod_{k=1}^{K} \mathcal{N}\big(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\big)^{z_k}$$
• As a consequence, the Gaussian mixture model can be expressed as before, but using the latent variable 𝒛
(we’ll see in the next chapter how this is beneficial for learning → Expectation Maximization):
$$p(\boldsymbol{x}) = \sum_{\boldsymbol{z}} p(\boldsymbol{z})\, p(\boldsymbol{x} \mid \boldsymbol{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\big)$$
• Responsibility ($\pi_k$ = prior; $\gamma(z_k)$ = posterior):
$$\gamma(z_k) = p(z_k = 1 \mid \boldsymbol{x}) = \frac{p(z_k = 1)\, p(\boldsymbol{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\boldsymbol{x} \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
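A minimal sketch computing these responsibilities with SciPy for a hand-picked two-component mixture (the parameter values are illustrative only):

```python
import numpy as np
from scipy.stats import multivariate_normal  # assumption: SciPy is available

# Minimal sketch: responsibilities gamma(z_k) for a given 2-component Gaussian mixture.
pis = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def responsibilities(x):
    weighted = np.array([pi * multivariate_normal(mu, cov).pdf(x)
                         for pi, mu, cov in zip(pis, mus, covs)])
    return weighted / weighted.sum()          # gamma(z_k) = pi_k N(x|mu_k,Sigma_k) / sum_j ...

x_new = np.array([2.5, 2.8])
print(responsibilities(x_new))                # the second component takes most responsibility
```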

Unsupervised learning – Mixture models (4/4)

• Responsibility in the previous example:

[Figure: $p(\boldsymbol{z})$, $p(\boldsymbol{x} \mid \boldsymbol{z})$, $p(\boldsymbol{x})$, and the data points with colors averaged using $\gamma_k$]

• How can we train this model given a data matrix $\boldsymbol{X} \in \mathbb{R}^{n \times N}$?
→ Likelihood maximization over $\boldsymbol{z}$ and the parameters of the distribution, i.e., $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ (a numerical sketch follows below):
$$\log p(\boldsymbol{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\big)$$
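A hedged sketch using scikit-learn's `GaussianMixture`, which performs exactly this kind of likelihood maximization via EM; synthetic two-cluster data stands in for Old Faithful, and rows are samples (i.e., $\boldsymbol{X}^\top$ in the notation above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumption: scikit-learn is available

# Minimal sketch: fit a 2-component GMM and inspect log-likelihood and responsibilities.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], (100, 2)),
               rng.normal([4.3, 80.0], [0.4, 6.0], (150, 2))])   # rows = samples

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)                    # mixing coefficients pi_k
print(gmm.score_samples(X).sum())      # log p(X | pi, mu, Sigma)
print(gmm.predict_proba(X[:3]))        # responsibilities gamma(z_k) for the first samples
```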


Semi-supervised learning – Self-training

• From now on, let’s again assume that we have both labeled and unlabeled data:
$$\mathcal{D}_L = \big(\boldsymbol{x}^{(i)}, y^{(i)}\big)_{i=1}^{L}, \qquad \mathcal{D}_U = \big(\boldsymbol{x}^{(i)}\big)_{i=L+1}^{N}$$
• Consider the situation where the number of labeled data is much smaller: 𝐿 ≪ 𝑁 − 𝐿, maybe due to the
fact that the labeling has to be done by hand and is very expensive.
• Central goal: improve the learning performance by taking the additional unlabeled (and likely much
cheaper) data into account.
• In some situations, this can help to significantly improve the performance.
• However, this is very hard (or impossible) to prove formally.

• The simplest thing we can do: Self-training


→ Train a classifier $g(\boldsymbol{x})$ on $\mathcal{D}_L$ and then label the samples in $\mathcal{D}_U$ according to the prediction of $g$:
$$y^{(i)} = g\big(\boldsymbol{x}^{(i)}\big), \qquad i \in \{L+1, \dots, N\}$$
• Advantage: easily usable as a wrapper around arbitrary functions (frequently used in natural language
processing)
• Disadvantage: Errors can get amplified
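A minimal sketch of one self-training round with a logistic regression classifier on synthetic data; the confidence threshold of 0.9 is an arbitrary choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumption: scikit-learn is available

# Minimal sketch: fit on labeled data, pseudo-label confident unlabeled samples, refit.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
labeled = rng.choice(len(X), size=10, replace=False)          # only L = 10 labels are "known"
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

X_L, y_L = X[labeled], y[labeled]
X_U = X[unlabeled]

g = LogisticRegression().fit(X_L, y_L)
conf = g.predict_proba(X_U).max(axis=1)                       # confidence of the predictions
take = conf > 0.9                                             # pseudo-label only confident samples
X_L2 = np.vstack([X_L, X_U[take]])
y_L2 = np.concatenate([y_L, g.predict(X_U[take])])            # y_i = g(x_i) as pseudo-labels

g2 = LogisticRegression().fit(X_L2, y_L2)                     # retrain on the enlarged set
print(take.sum(), g2.score(X, y))                             # number of pseudo-labels, accuracy
```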


Semi-supervised learning – Co-training / multi-view learning (1/2)

• The idea of multi-view learning is to look at an object (e.g., a website) from two (or more) different
viewpoints (e.g., the pictures and the text on the website).
• Formally, suppose the instance space 𝒳 to be split into two parts → an instance is represented in the form
$$\boldsymbol{x}^{(i)} = \big(\boldsymbol{x}^{(i,1)}, \boldsymbol{x}^{(i,2)}\big)$$
• Co-training proceeds from the assumption that each view alone is insufficient to train a good classifier and,
moreover, that 𝒙 𝑖,1 and 𝒙 𝑖,2 are conditionally independent given the class.
• Co-training algorithms repeat the following steps (see the sketch at the end of this slide):
• Train two classifiers $h^{(1)}$ and $h^{(2)}$ from $\mathcal{D}_L^{(1)}$ and $\mathcal{D}_L^{(2)}$, respectively.
• Classify $\mathcal{D}_U$ separately with $h^{(1)}$ and $h^{(2)}$.
• Add the $k$ most confident examples of $h^{(1)}$ to the labeled training data of $h^{(2)}$.
• Add the $k$ most confident examples of $h^{(2)}$ to the labeled training data of $h^{(1)}$.
• Advantages:
• Co-training is a simple wrapper method that applies to all existing classifiers.
• Co-training tends to be less sensitive to mistakes than self-training.
• Disadvantages:
• A natural split of the features does not always exist (the feature subsets do not necessarily need to be disjoint).
• Models using both views simultaneously may often perform better.
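A minimal sketch of a single co-training round on synthetic data, using a naive split of the feature vector into two views and Gaussian naive Bayes as the base classifier; for brevity, the pseudo-labeled points are not removed from $\mathcal{D}_U$:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB  # assumption: scikit-learn is available

# Minimal sketch of one co-training round with two feature views.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
lab = rng.choice(len(X), size=8, replace=False)
unl = np.setdiff1d(np.arange(len(X)), lab)
views = [slice(0, 2), slice(2, 4)]                            # view 1: features 0-1, view 2: 2-3

X_lab = [list(X[lab][:, v]) for v in views]                   # per-view labeled data
y_lab = [list(y[lab]), list(y[lab])]
k = 5                                                         # examples moved per round

h = [GaussianNB().fit(X_lab[v], y_lab[v]) for v in range(2)]
for v in range(2):
    other = 1 - v
    proba = h[v].predict_proba(X[unl][:, views[v]])
    top = np.argsort(proba.max(axis=1))[-k:]                  # k most confident examples of h_v ...
    X_lab[other] += list(X[unl][top][:, views[other]])        # ... go to the training set of h_other
    y_lab[other] += list(h[v].predict(X[unl][top][:, views[v]]))

h = [GaussianNB().fit(X_lab[v], y_lab[v]) for v in range(2)]  # retrain both classifiers
print([hv.score(X[:, views[v]], y) for v, hv in enumerate(h)])
```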

Semi-supervised learning – Co-training / multi-view learning (2/2)

• There are many variants of multi-view learning via combination with other techniques (majority voting,
weighting, …)
• Multi-view learning (with $m$ learners) can also be realized via regularization (a sketch of this joint loss follows below):
$$\min_{h \in \mathcal{H}} \; \sum_{v=1}^{m} \left[ \sum_{i=1}^{L} e\big(y_i, h_v(\boldsymbol{x}_i)\big) + \lambda_1 \|h_v\|^2 \right] + \lambda_2 \sum_{u,v=1}^{m} \sum_{j=L+1}^{N} \big( h_u(\boldsymbol{x}_j) - h_v(\boldsymbol{x}_j) \big)^2$$
• Minimizing a (joint) loss function of this kind encourages the learners to agree on the unlabeled data to
some extent.
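A minimal sketch that only evaluates this joint loss for two linear hypotheses on toy data (squared error for $e$; $\lambda_1$, $\lambda_2$ chosen arbitrarily), to make the role of the agreement term explicit:

```python
import numpy as np

# Minimal sketch: the co-regularization objective for m = 2 linear learners h_v(x) = w_v^T x.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
L = 10                                                  # first L samples are labeled
lam1, lam2 = 0.1, 1.0

def joint_loss(W):                                      # W has one row of weights per learner
    h = X @ W.T                                         # h[j, v] = h_v(x_j)
    supervised = ((y[:L, None] - h[:L]) ** 2).sum() + lam1 * (W ** 2).sum()
    agreement = sum(((h[L:, u] - h[L:, v]) ** 2).sum()
                    for u in range(len(W)) for v in range(len(W)))
    return supervised + lam2 * agreement                # encourages agreement on unlabeled data

W = rng.normal(size=(2, 3))                             # m = 2 learners
print(joint_loss(W))
```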


Semi-supervised learning – Generative models

• Generative methods first estimate a joint distribution 𝑃 on 𝒳 × 𝒴


• Predictions can then be derived by conditioning on a given query 𝒙:
$$P(y \mid \boldsymbol{x}) = \frac{P(\boldsymbol{x}, y)}{P(\boldsymbol{x})} = \frac{P(\boldsymbol{x}, y)}{\sum_{y' \in \mathcal{Y}} P(\boldsymbol{x}, y')} \propto P(\boldsymbol{x}, y)$$
• Generative methods can be applied in the semi-supervised context in a quite natural way, because they can
model the probability of observing an instance 𝑥𝑗 as a marginal probability:
$$P(\boldsymbol{x}) = \sum_{y \in \mathcal{Y}} P(\boldsymbol{x}, y)$$
• Suppose the (joint) probability $P$ to be parametrized by $\boldsymbol{\theta} \in \boldsymbol{\Theta}$. Then, assuming i.i.d. observations,
$$P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta}) = \prod_{i=1}^{L} P(\boldsymbol{x}_i, y_i \mid \boldsymbol{\theta}) \cdot \prod_{j=L+1}^{N} P(\boldsymbol{x}_j \mid \boldsymbol{\theta}) = \prod_{i=1}^{L} P(\boldsymbol{x}_i, y_i \mid \boldsymbol{\theta}) \cdot \prod_{j=L+1}^{N} \sum_{y \in \mathcal{Y}} P(\boldsymbol{x}_j, y \mid \boldsymbol{\theta})$$
• Solution of this problem: maximum likelihood estimation (see the sketch below):
$$\boldsymbol{\theta}^* = \underset{\boldsymbol{\theta} \in \boldsymbol{\Theta}}{\operatorname{argmax}} \; P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta})$$
• Advantage: Theoretically well-grounded, often effective
• Disadvantage: Computationally complex, $P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta})$ may have multiple local optima
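A minimal sketch of the semi-supervised log-likelihood for a simple generative model with one Gaussian per class (the model choice and parameter values are illustrative assumptions, not the lecture's example); maximizing this quantity over the parameters, e.g. with EM or a generic optimizer, yields $\boldsymbol{\theta}^*$:

```python
import numpy as np
from scipy.stats import multivariate_normal  # assumption: SciPy is available

# Minimal sketch: log P(D_L, D_U | theta) for P(x, y | theta) = p(y) * N(x | mu_y, Sigma_y).
rng = np.random.default_rng(0)
X_L = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
y_L = np.array([0] * 10 + [1] * 10)
X_U = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(4, 1, (90, 2))])   # labels unknown

def log_likelihood(priors, mus, covs):
    comps = [multivariate_normal(mu, cov) for mu, cov in zip(mus, covs)]
    labeled = sum(np.log(priors[y] * comps[y].pdf(x)) for x, y in zip(X_L, y_L))
    unlabeled = np.sum(np.log(sum(priors[c] * comps[c].pdf(X_U) for c in range(len(priors)))))
    return labeled + unlabeled              # labeled joint terms + unlabeled marginal terms

theta = ([0.5, 0.5], [np.zeros(2), np.full(2, 4.0)], [np.eye(2), np.eye(2)])
print(log_likelihood(*theta))
```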