Semi-Supervised Learning¹
Jacob Eisenstein

¹ With slides borrowed from John Blitzer, Ming-Wei Chang, Hal Daumé III,
Xiaojin Zhu, and everyone they borrowed slides from...
Frameworks for learning
Examples

Phonetic transcription²
• "Switchboard" dataset of telephone conversations
• Annotations from word to phoneme sequence:
  – film → f ih n uh gl n m
  – be all → bcl b iy iy tr ao tr ao l dl
• 400 hours of annotation time per hour of speech!

² Examples from Xiaojin "Jerry" Zhu
Natural language parsing³
• Penn Chinese Treebank
• Annotations from word sequences to parse trees

³ Examples from Xiaojin "Jerry" Zhu
How can we learn with less annotation effort?
• Semi-supervised learning
  – labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell}$
  – unlabeled examples $\{x_i\}_{i=\ell+1}^{\ell+u}$
  – often $u \gg \ell$
• Domain adaptation
  – labeled examples in the source domain: $\{(x_i, y_i)\}_{i=1}^{\ell_S} \sim \mathcal{D}_S$
  – labeled examples in the target domain: $\{(x_i, y_i)\}_{i=1}^{\ell_T} \sim \mathcal{D}_T$
[Figure: scatterplots of height (in.) vs. weight (lbs.), panels (a) and (b). From Xiaojin Zhu (Univ. Wisconsin, Madison), Tutorial on Semi-Supervised Learning, Chicago 2009.]
When does bootstrapping work? "Folk wisdom"

Co-training example
Each example is seen through two views: the entity string itself, x^(1), and its local context, x^(2).

     x^(1)               x^(2)         y
1.   Peachtree Street    located on    LOC
2.   Dr. Walker          said          PER
3.   Zanzibar            located in    ?
4.   Zanzibar            flew to       ?
5.   Dr. Robert          recommended   ?
6.   Oprah               recommended   ?

Algorithm
• Use classifier 1 (string view) to label example 5 as PER.
• Use classifier 2 (context view) to label example 3 as LOC.
• Retrain both classifiers, using the newly labeled data.
• Use classifier 1 to label example 4 as LOC.
• Use classifier 2 to label example 6 as PER.

After these steps, every example is labeled:

     x^(1)               x^(2)         y
1.   Peachtree Street    located on    LOC
2.   Dr. Walker          said          PER
3.   Zanzibar            located in    LOC
4.   Zanzibar            flew to       LOC
5.   Dr. Robert          recommended   PER
6.   Oprah               recommended   PER
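A minimal sketch of this loop in Python; the function cotrain and its interface are illustrative, and any probabilistic classifier with predict_proba would do in place of logistic regression:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def cotrain(X1, X2, y, n_rounds=10, per_round=1):
        # X1, X2: feature matrices for the two views; y: labels, -1 = unlabeled
        y = y.copy()
        for _ in range(n_rounds):
            for X in (X1, X2):                    # alternate between the views
                labeled = y != -1
                unlabeled = np.flatnonzero(y == -1)
                if len(unlabeled) == 0:
                    return y
                clf = LogisticRegression().fit(X[labeled], y[labeled])
                probs = clf.predict_proba(X[unlabeled])
                # promote the most confidently predicted unlabeled examples
                best = np.argsort(-probs.max(axis=1))[:per_round]
                y[unlabeled[best]] = clf.classes_[probs[best].argmax(axis=1)]
        return y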
Building a graph of related instances

Connect instances $i$ and $j$ with edge weight
\[ w_{ij} = \mathrm{sim}(i, j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right) \]

• binary labels: $y_i \in \{0, 1\}$
• fix the labeled values $Y_\ell = \{y_1, y_2, \ldots, y_\ell\}$
• solve for the unlabeled values $Y_u = \{y_{\ell+1}, \ldots, y_{\ell+m}\}$:
\[ \min_{Y_u} \sum_{i,j} w_{ij} (y_i - y_j)^2 \]
Minimum cuts
• With binary $y_i \in \{0, 1\}$, minimizing this objective is a minimum-cut problem on the graph.
• Relaxing to real-valued $y_i \in [0, 1]$ instead gives:
  – a unique global optimum
  – a natural notion of confidence: the distance of $y_i$ from 0.5
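A minimal numpy sketch of this relaxed (harmonic) solution, assuming the $\ell$ labeled instances come first in X; the function name and interface are illustrative:

    import numpy as np

    def harmonic_labels(X, y_l, sigma=1.0):
        # X: all instances, the l labeled ones first; y_l: their {0,1} labels
        l = len(y_l)
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
        W = np.exp(-d2 / (2 * sigma ** 2))                   # sim(i, j) as above
        L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
        # harmonic solution: f_u = -L_uu^{-1} L_ul y_l  (Zhu et al., 2003)
        f_u = np.linalg.solve(L[l:, l:], -L[l:, :l] @ np.asarray(y_l, float))
        return (f_u > 0.5).astype(int), f_u   # hard labels + soft confidences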
Label propagation on the graph Laplacian

Manifold regularization
\[ \operatorname*{argmin}_f \; \frac{1}{\ell} \sum_i \ell(f(x_i), y_i) + \lambda_1 \|f\|^2 + \lambda_2 \sum_{i,j} w_{ij} \left( f(x_i) - f(x_j) \right)^2 \]
• the first term is the average loss on the labeled examples
• the second term is a standard regularizer on the complexity of $f$
• the third term encourages smoothness over the graph: instances connected by large $w_{ij}$ should receive similar predictions
Manifold regularization: synthetic data
Manifold regularization: text classification
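For quick experiments of this kind, scikit-learn ships a transductive learner that optimizes a closely related (normalized-Laplacian) graph objective; a minimal usage sketch on synthetic data:

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 4])  # two clusters
    y = np.full(200, -1)                  # -1 marks unlabeled examples
    y[0], y[100] = 0, 1                   # one label per cluster (u >> l)

    model = LabelSpreading(kernel='rbf', gamma=0.5)
    model.fit(X, y)                       # propagate labels over the graph
    print(model.transduction_)            # inferred labels for all 200 points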
How can we learn with less annotation effort? (recap)
• Semi-supervised learning: labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell}$ plus many unlabeled examples
• Domain adaptation: labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell_S} \sim \mathcal{D}_S$ from a source domain, labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell_T} \sim \mathcal{D}_T$ from the target domain
The current status of NER
[Slides from guest lecturer Ming-Wei Chang, CS 546 (Introduction to Domain Adaptation), Spring 2009]

The problem: domain over-fitting
Domain Adaptation
• Many NLP tasks are cast as classification problems
• Lack of training data in new domains
• Domain adaptation:
  – POS: WSJ → biomedical text
  – NER: news → blog, speech
  – Spam filtering: public email corpus → personal inboxes
• Domain overfitting:

  NER Task                      Train → Test     F1
  find PER, LOC, ORG            NYT → NYT        0.855
  from news text                Reuters → NYT    0.641
  find gene/protein from        mouse → mouse    0.541
  biomedical literature         fly → mouse      0.281
Supervised domain adaptation
In supervised domain adaptation, we have:
• lots of labeled data in a "source" domain, $\{(x_i, y_i)\}_{i=1}^{\ell_S} \sim \mathcal{D}_S$ (e.g., reviews of restaurants)
• a small amount of labeled data in a "target" domain, $\{(x_i, y_i)\}_{i=1}^{\ell_T} \sim \mathcal{D}_T$
[Figures (Hal Daumé III): baseline approaches — train on the Source Data alone (SrcOnly), on the Target Data alone, on the Unioned Data, or on the union with the two datasets weighted by a parameter α.]
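A hedged sketch of these baselines, assuming scikit-learn-style classifiers; the function, its mode strings, and the weighting scheme are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_baseline(Xs, ys, Xt, yt, mode="union", alpha=0.5):
        # Domain-adaptation baselines: source-only, target-only, union,
        # or union with source examples down-weighted by alpha.
        if mode == "src_only":
            return LogisticRegression().fit(Xs, ys)
        if mode == "tgt_only":
            return LogisticRegression().fit(Xt, yt)
        X = np.vstack([Xs, Xt])
        y = np.concatenate([ys, yt])
        if mode == "weighted":
            w = np.concatenate([np.full(len(ys), alpha), np.ones(len(yt))])
            return LogisticRegression().fit(X, y, sample_weight=w)
        return LogisticRegression().fit(X, y)   # plain union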
Less obvious approaches
Key idea:
• Share some features ("the")
• Don't share others ("monitor")
• ...and let the learner decide which are which.
Feature Augmentation

Example sentences for part-of-speech tagging:
  We monitor the traffic   →   N V D N
  The monitor is heavy     →   D N V R

[Figure: the original features of "monitor" in each sentence (current word W, previous word P, next word N, capitalization shape C) are augmented with domain-tagged copies: source copies SW:monitor, SP:we, SN:the, SC:a+ in the first sentence, and target copies TW:monitor, TP:the, TN:is, TC:a+ in the second.]

In feature-vector lingo:
  Φ(x) → ⟨Φ(x), Φ(x), 0⟩   (for source domain)
  Φ(x) → ⟨Φ(x), 0, Φ(x)⟩   (for target domain)

Why should this work? Features that behave the same way in both domains ("the") can receive weight on the shared copy, while features that behave differently ("monitor") receive weight on their domain-specific copies; regularized training lets the learner decide which are which.
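A minimal numpy sketch of this mapping; the helper name augment is illustrative, not from the slides:

    import numpy as np

    def augment(X, domain):
        # Feature augmentation: Phi(x) -> <Phi(x), Phi(x), 0> for source rows,
        # Phi(x) -> <Phi(x), 0, Phi(x)> for target rows.
        Z = np.zeros_like(X)
        return np.hstack([X, X, Z]) if domain == "source" else np.hstack([X, Z, X])

    # Train any linear classifier on the augmented, concatenated data:
    # Xa = np.vstack([augment(Xs, "source"), augment(Xt, "target")])
    # ya = np.concatenate([ys, yt])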
Unsupervised domain adaptation
In unsupervised domain adaptation, we have:
• lots of labeled data in a "source" domain, $\{(x_i, y_i)\}_{i=1}^{\ell_S} \sim \mathcal{D}_S$ (e.g., reviews of restaurants)
• only unlabeled data from the target domain
[Figure: sentiment words by domain. Source (books) only: fascinating, boring, read half, couldn't put it down. Target (kitchen) only: defective, sturdy, leaking, like a charm. Shared across domains: fantastic, highly recommended, waste of money, horrible.]
Predicting pivot word presence
• Pivot features occur frequently in both domains: e.g., "highly recommend", "do not buy".
• The N pivots generate N new prediction problems, one per pivot: "Did 'highly recommend' appear in this document?"
• Train a linear predictor for each pivot from the remaining features; non-pivot words that co-occur with it, such as "great" and "wonderful", receive high weight.
• Sometimes these predictors capture non-sentiment information.

Finding a shared sentiment subspace
• Stack the N pivot-predictor weight vectors into a matrix; its top singular vectors define a projection P onto a low-dimensional shared subspace.
• P projects both source and target instances onto the shared subspace.
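A hedged sketch of the whole pipeline, assuming a dense binary document-term matrix X over both domains and pivot column indices in pivots; the names, the least-squares loss, and k are illustrative (the original method uses a modified Huber loss):

    import numpy as np

    def scl_projection(X, pivots, k=50):
        # Structural correspondence learning sketch (Blitzer et al., 2006):
        # one linear predictor per pivot, trained from the non-pivot features,
        # then the top singular vectors of the stacked weights give the
        # projection onto the shared subspace.
        nonpivots = np.setdiff1d(np.arange(X.shape[1]), pivots)
        W = np.zeros((len(pivots), len(nonpivots)))
        for row, p in enumerate(pivots):
            target = X[:, p]                   # did this pivot appear?
            W[row] = np.linalg.lstsq(X[:, nonpivots], target, rcond=None)[0]
        _, _, Vt = np.linalg.svd(W, full_matrices=False)
        P = Vt[:k]                             # k x n_nonpivot projection
        return P @ X[:, nonpivots].T           # shared features, k x n_docs

    # Augment the original features with these projected shared features
    # before training the sentiment classifier on the source domain.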
Target Accuracy: Kitchen Appliances

[Bar chart, consolidated into a table; the target domain is kitchen appliances:]

  Source domain   No adaptation   With adaptation   In-domain (kitchen)
  Books           74.5%           78.9%             87.7%
  DVDs            74.0%           81.4%             87.7%
  Electronics     84.0%           85.9%             87.7%

Adaptation Error Reduction
• 36% reduction in error due to adaptation
Visualizing (books & kitchen)
[Figure: book-specific words such as "engaging" and "must_read" visualized in the shared subspace alongside kitchen words.]
Summary
• In application settings:
  – you rarely have all the labeled data you want
  – you often have lots of unlabeled data
• Semi-supervised learning learns from unlabeled data too:
  – Bootstrapping (or self-training) works best when you have multiple orthogonal views: for example, string and context.
  – Probabilistic methods impute the labels of unseen data.
  – Graph-based methods encourage similar instances or types to have similar labels.
Alternative frameworks
• Semi-supervised learning
  – learn from labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell}$
  – and unlabeled examples $\{x_i\}_{i=\ell+1}^{\ell+u}$
  – often $u \gg \ell$
• Domain adaptation
  – learn from lots of labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell_S} \sim \mathcal{D}_S$ in a source domain
  – learn from a few (or zero) labeled examples $\{(x_i, y_i)\}_{i=1}^{\ell_T} \sim \mathcal{D}_T$ in a target domain
  – evaluate in the target domain
• Active learning: the model can query the annotator for labels (a minimal query-strategy sketch follows this list)
• Feature labeling
  – Provide prototypes of each label (Haghighi and Klein 2006)
  – Give rough probabilistic constraints, e.g. "Mr." precedes a person name at least 90% of the time (Druck et al. 2008)
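A minimal sketch of one common active-learning query strategy, uncertainty sampling; the names and classifier choice are illustrative, not from the slides:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def query_most_uncertain(clf, X_pool):
        # Return the index of the unlabeled example the model is least sure
        # about, measured by the entropy of its predicted label distribution.
        probs = clf.predict_proba(X_pool)
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        return int(np.argmax(entropy))

    # Loop: fit on the labeled pool, ask the annotator to label the chosen
    # example, move it into the labeled pool, and refit.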