
Case Study: Few-shot Learning with Language

Duke Machine Learning Summer School

Shashank Srivastava
UNC Chapel Hill
Agenda

1. What is NLP?
– Some NLP Applications

2. Few-shot learning with language


– Classifiers from NL explanations
– Sequential decision making from NL explanations
– Prompt-based learning

What is NLP?
• Having computers understand human language and communication
o Deeper understanding of text beyond string matching

• What makes it hard?
o Human language is complex, diverse and ambiguous
NL ambiguities
Ø Word sense ambiguities
• “Kids make nutritious snacks” (are the kids preparing the snacks, or are they the snacks?)

Ø Syntactic ambiguities
• “Complaints about NBA referees growing ugly”
• “Ban on nude dancing on governor’s desk”

Ø Paralinguistics
• “She said that she loved him” (the meaning shifts depending on which word is stressed when spoken)
NLP application: Part-of-speech tagging

Bill directed plays about English kings

Bill (proper noun, noun, verb?) → NNP
directed (verb, adjective?) → VBD
plays (noun, verb?) → NNS
about (prep, particle?) → IN
English (proper noun, noun, adjective?) → JJ
kings (noun, verb?) → NNS

Example borrowed from Noah Smith
NLP applications: Question Answering

NLP applications: Dialog agents

NLP applications: Machine Translation

NLP applications: Summarization

NLP applications: Response generation

NLP applications: Creative Language Generation
Language Models are Unsupervised Multitask Learners. Radford et al., 2019.
NLP applications: VQA
What food on the tray is not inside a plastic cylinder?

GQA: A new dataset for real-world visual reasoning and compositional question answering. Hudson and Manning. CVPR 2019.
NLP applications: Instruction Following

Learning language games through interaction. Wang et al. ACL 2016.
NLP applications: Control

Pick up the rattling object and place it in the tray

Improving Grounded Natural Language Understanding through Human-Robot Dialog. Thomason et al. arXiv 2019.
NLP applications: Navigation

https://bringmeaspoon.org/
NLP applications: Text to Scene Generation
There is a table and there are four chairs in the room. There are four plates with four sandwiches.

Text to 3D Scene Generation with Rich Lexical Grounding. Chang et al. ACL 2015.
Can computers efficiently learn new tasks through human language interactions with their users?
Towards Conversational Learning?
Ø ML currently relies on ‘big data’
Ø Inaccessible to non-experts
Ø Theoretical limits on what can be learned: n ≈ log |H| labeled examples (see the bound sketched below)

Ø Much of human learning uses language

Ø Extend ML to richer forms of input
Ø Explanations, instructions, clarifications …
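For context, the shorthand n ≈ log |H| comes from the standard realizable PAC bound for a finite hypothesis class; this is a sketch of that bound, not something stated on the slides:

```latex
% Realizable PAC bound (finite hypothesis class H): with probability at least 1 - \delta,
% every hypothesis consistent with the training sample has true error at most \epsilon once
\[
  n \;\ge\; \frac{1}{\epsilon}\left( \ln \lvert \mathcal{H} \rvert + \ln \frac{1}{\delta} \right),
\]
% so the number of labeled examples needed grows with \log \lvert \mathcal{H} \rvert.
```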
1. Training classifiers without labels, from NL explanations only
‘Emails from my boss are usually important’ (label set: {important, not-important})
Srivastava et al., ACL 2018: Zero-shot Training of Classifiers from Natural Language Quantification

2. Learning procedures from explanations and a single demonstration
Srivastava et al., ACL 2020: Learning Web-based Procedures by Reasoning over Explanations and Demonstrations in Context

3. Few-shot learning using pretrained language models
Tam et al., arXiv 2021: Improving and Simplifying Pattern Exploiting Training
Training Classifiers without Labels
Is this email important?

‘Emails that I reply to are usually important’
‘Such emails mention a deadline or a meeting’
‘If the subject says urgent …’

NL explanations → executable feature functions
NL as feature functions
Semantic parsing maps NL to formal logical forms

Natural language statement (s) → logical form (l) → evaluation in a context (z = [l]_x):
• ‘three less than twenty times six’ → minus(prod(20, 6), 3) → 117
• ‘What is the longest river that flows through Pittsburgh?’ → argmax(river(x) ∧ traverse(x,y) ∧ const(y, pittsburgh), length) → Ohio
• ‘Emails that I reply to are usually important’ → (email.replied == true) → Yes/No
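To make the last row concrete, here is a minimal sketch of what an “executable feature function” could look like once such an explanation is parsed; the email fields (replied, subject) and helper names are illustrative, not the paper’s actual representation:

```python
# Hypothetical parsed form of "Emails that I reply to are usually important":
#   feature x: email.replied == true
#   label   y: important = true
#   quantifier: "usually"

def feature_replied(email: dict) -> bool:
    """Executable feature function for the logical form (email.replied == true)."""
    return email.get("replied", False)

def feature_subject_mentions_urgent(email: dict) -> bool:
    """Illustrative feature for 'If the subject says urgent ...'."""
    return "urgent" in email.get("subject", "").lower()

example = {"subject": "URGENT: budget deadline", "replied": True}
print(feature_replied(example))                  # True
print(feature_subject_mentions_urgent(example))  # True
```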
NL as model constraints

Ø Leverage quantifier expressions in language

Sequential Approach

‘Emails that I reply to are usually important’

Step 1: Semantic parser maps language to quantitative constraints
x: (email.replied == true)
y: important:true
E_{y|x}[φ(x, y)] = b_usually

Step 2: Posterior regularization incorporates the constraints into model training
Classifier f_θ : x → y, trained on unlabeled data
Training classifiers from declarative NL
Ø Explanations encode multiple properties that can aid statistical learning

‘Emails that I reply to are usually important’

1. Features important for a learning problem


ü x : repliedTo:true
2. Class labels
ü y : Important
3. Type of relationship between features and labels
ü P(y|x)
4. Strength of Relationship
ü Specified by quantifier?
Constraint types
Ø Constraint types:
i. About a third of the emails that I get are important : P(y)
ii. Emails that I reply to are usually important : P(y|x)
iii. I almost always reply to important emails : P(x|y)

Ø Novelty largely in identifying the type of the assertion


Ø Primarily depends on syntactic features
ü Features based on dependency paths
ü Presence/absence of negation words
ü Identifying active/passive voice
ü Order of occurrence of triggers for x and y

‘Emails that I reply to are usually important’

P(important | replied:true) ≈ p_usually
Semantics of quantifiers
Ø Leverage semantics of linguistic quantifiers
Ø Associate point probability estimates with frequency adverbs and determiners (a code sketch of this mapping follows the table)

Frequency quantifier                        Probability value
always, certainly, definitely, all          0.95
usually, normally, generally, likely        0.70
most, majority                              0.60
often, half                                 0.50
many                                        0.40
sometimes, frequently, some                 0.30
few, occasionally                           0.20
rarely, seldom                              0.10
never                                       0.05

Ø Purely subjective beliefs, not calibrated on any data
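One direct way to encode this table is a lookup from quantifier to its point probability estimate; a minimal sketch, with values copied from the table above and the synonym grouping as shown there:

```python
# Point probability estimates for frequency quantifiers (values from the table above).
QUANTIFIER_PROB = {
    "always": 0.95, "certainly": 0.95, "definitely": 0.95, "all": 0.95,
    "usually": 0.70, "normally": 0.70, "generally": 0.70, "likely": 0.70,
    "most": 0.60, "majority": 0.60,
    "often": 0.50, "half": 0.50,
    "many": 0.40,
    "sometimes": 0.30, "frequently": 0.30, "some": 0.30,
    "few": 0.20, "occasionally": 0.20,
    "rarely": 0.10, "seldom": 0.10,
    "never": 0.05,
}

def quantifier_to_prob(word: str, default: float = 0.5) -> float:
    """Map a frequency adverb/determiner to its assumed point probability."""
    return QUANTIFIER_PROB.get(word.lower(), default)

# 'Emails that I reply to are usually important'  ->  P(important | replied) ≈ 0.70
print(quantifier_to_prob("usually"))
```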
Posterior Regularization
Ø Use the posterior regularization (PR) principle to incorporate human-provided advice into learned models
Ø Unobserved class labels are treated as latent variables

Ø PR optimizes a latent variable model subject to a set of constraints on the posterior distribution p_θ(y | x)

Ø Modified EM alternation:
• E-step: infer label assignments q_X(Y) for the unlabeled data, regularized to lie in the constraint set Q given by the NL constraints
• M-step: update classifier parameters using the inferred labels
Posterior Regularization
Ø Train with modified EM to maximize the PR objective:

J_Q(θ) = L(θ) − min_{q∈Q} KL( q(Y) ‖ p_θ(Y|X) )

Ø The likelihood term L(θ) improves fit to the data; the KL term pulls the posterior toward the constraint set Q, emulating the human advice.
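For concreteness, here is a minimal runnable sketch of that modified EM loop for a logistic-regression classifier and a single “usually”-style expectation constraint; the data, feature layout, and hyperparameters are made up for illustration, and this is not the paper’s implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def e_step(p_pos, mask, b, lam_range=(-20.0, 20.0), iters=50):
    """E-step: KL-project the model posterior onto the constraint set Q.
    Single constraint: among examples with mask==True, the average q(y=1|x)
    should equal b. For one expectation constraint the projection simply
    rescales the odds by exp(lam); lam is found by bisection."""
    lo, hi = lam_range
    q_pos = p_pos.copy()
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        odds = p_pos[mask] / (1.0 - p_pos[mask] + 1e-12) * np.exp(lam)
        q_pos = p_pos.copy()
        q_pos[mask] = odds / (1.0 + odds)
        if q_pos[mask].mean() > b:
            hi = lam
        else:
            lo = lam
    return q_pos

def m_step(X, q_pos, w, lr=0.5, steps=200):
    """M-step: fit logistic-regression parameters to the soft labels q."""
    for _ in range(steps):
        p = sigmoid(X @ w)
        w = w + lr * X.T @ (q_pos - p) / len(X)
    return w

def train_pr(X, mask, b, em_iters=10):
    """Modified EM maximizing J_Q(θ) = L(θ) − min_{q∈Q} KL(q ‖ p_θ(Y|X))."""
    w = np.zeros(X.shape[1])
    for _ in range(em_iters):
        p_pos = sigmoid(X @ w)           # current posterior p_θ(y=1 | x)
        q_pos = e_step(p_pos, mask, b)   # pull posterior toward the NL constraint
        w = m_step(X, q_pos, w)          # update θ on the regularized posterior
    return w

# Toy usage for 'Emails that I reply to are usually important' (b_usually = 0.70):
# column 0 of X is the repliedTo feature; no labels are used anywhere.
X = np.array([[1.0, 0.9], [1.0, 0.2], [1.0, 0.6], [0.0, 0.7], [0.0, 0.1]])
w = train_pr(X, mask=(X[:, 0] == 1.0), b=0.70)
print(sigmoid(X @ w))  # predicted P(important | x) after constraint-guided training
```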
Synthetic shape classification
Ø Turkers observe samples of shapes from synthetically generated datasets, and describe them through statements.

ü 50 datasets
ü 30 workers
ü 4.3 statements per task on average

Example statements:
1. Selected shapes are almost always a square
2. Other shapes rarely have a blue border
3. If a shape has a red fill, it is most likely not a selected shape …
Results (scatter plot): LNQ accuracy vs. Bayes-optimal accuracy, with tasks ranging from harder to easier
Each dot represents a dataset (and corresponding classification task) generated from a known distribution
Average Classification Accuracy (shapes data)

Approach                       Avg accuracy   Access to labels   Access to statements
LNQ                            0.751          no                 yes
Bayes Optimal                  0.831          --                 --
Logistic Regression            0.737          yes                no
Random                         0.524          --                 --
LNQ (no quantification)        0.545          no                 yes
LNQ (coarse quantification)    0.679          no                 yes
Human                          0.734          no                 yes
Real tasks: Bird species detection

ü 10 species from the CUB-200 dataset
ü 60 examples per species
ü 53 pre-specified attributes
ü 6.1 statements per task on average

Example explanations:
• A specimen that has a striped crown is likely to be a selected bird
• Birds in the other category rarely ever have dagger-shaped beaks
Real tasks: Email foldering
Ø Emails representing common email categories collected through AMT
Ø Reminders, meeting invitations, requests from boss, internet humor, going out with friends, policy announcements, etc.

ü 1100 emails in all
ü 7 categories
ü 30 statements per category

Example explanations:
Most reminders mention a date and a time in the message of the email
The sender of the email is the same as the recipient
These emails usually close with a name or title
These emails sometimes have jpg attachments
The email likely has words like "policy" or "announcement" in the subject
Emails from a public domain are not office requests
Results
(Figures: results on Bird Species Identification and Email Categorization)
Empirical distributions of probability values

rarely:        μ = 0.06, σ = 0.05
sometimes:     μ = 0.29, σ = 0.18
often:         μ = 0.46, σ = 0.15
majority:      μ = 0.73, σ = 0.16
most:          μ = 0.81, σ = 0.15
(most) likely: μ = 0.86, σ = 0.09
2. Learning procedures from explanations and a single demonstration
Srivastava et al., ACL 2020: Learning Web-based Procedures by Reasoning over Explanations and Demonstrations in Context
Beyond Concept Learning
A lot of routine, repetitive type/click tasks:
Ø Book flight tickets
Ø Order pizza
Ø Process reimbursements
Ø Like social media posts
Ø …

Research question: Can such tasks be taught from a single demonstration paired with NL explanations?
User: Send me NLP related news every day at 8am

Agent: Can you show me how?

User: Let me teach you …
First, type ‘news.com’ in the URL bar at the top of the browser, and press enter
Then, type ‘NLP’ in the search bar at the top-right, and press enter
Finally, email me the link to the three most recent articles
Why is this interesting for NLP?

Ø Language grounding, pragmatics, script induction


Ø Environment rich in textual, structural and spatial features
Ø Reasoning on noisy and semi-structured data
Ø Limited world-semantics

Framework
Use the Mini World-of-Bits framework (Shi et al, ICML’17)
Ø Interactive interfaces for web-like tasks
Ø Example tasks: clicking specified buttons, forwarding emails, liking social-media posts, etc.

Originally developed as a testbed for RL/IL; here we:
Ø add NL explanations
Ø restrict to single demonstrations
Explained Demonstration Dataset
Ø 520 demonstrations & stepwise explanations (AMT)
Ø 3.3 explanations/demonstration
Ø Most explanations (97%) follow the sequence of actions in the demonstration

Task: Forward an email
Click on the segment that mentions Maureen
Click on the button named “Forward” at the bottom of the page
Type in the word ‘Amata’ in front of the row tagged ‘to’
Click on the arrow button at the top of the page

Task: Select a radio button
Focus on the word sequence after Select
Click on the radio button to the left of the word sequence
Press submit
Web DSL
Ø DSL operators for:
Ø Click/Type actions on web-elements
Ø Filter web-elements with specific features & relations
Ø Filter strings based on features & relations

Ø Extends constraint language in (Liu et al, ICLR’18)

LED Approach

Explanation: ‘Click the square to the right of the triangle’
Candidate logical form: click(tag=square & rightOf(triangle))
Demonstrated action: click(elem3)

Infer latent programs (l) in the DSL that are (1) consistent with the demonstration (d), and (2) relevant to the NL explanation (x):

P(x | d, c) = Σ_l P(x | l) · P(l | d, c)
              relevance     consistency
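To make the factorization concrete, here is a toy sketch of scoring candidate programs by relevance × consistency; the candidate enumeration and the lexical-overlap relevance model are placeholders, not the paper’s learned components:

```python
import re

def tokens(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def candidate_programs(demo_action, context):
    """Stand-in for inverse semantics: enumerate DSL programs that reproduce the
    demonstrated action when executed in `context`. Toy hand-written candidates here."""
    return ["click(tag=square & rightOf(triangle))",
            "click(tag=square)",
            "click(id=elem3)"]

def relevance(explanation, program):
    """Stand-in for P(x | l): crude lexical overlap between explanation and program."""
    return len(tokens(explanation) & tokens(program)) / max(len(tokens(explanation)), 1)

def score(explanation, demo_action, context):
    """P(x | d, c) ≈ Σ_l relevance(x, l) · consistency(l | d, c); consistency is
    uniform here because every enumerated candidate already replays the demo."""
    programs = candidate_programs(demo_action, context)
    consistency = 1.0 / len(programs)
    marginal = sum(relevance(explanation, l) * consistency for l in programs)
    best = max(programs, key=lambda l: relevance(explanation, l))
    return marginal, best

marginal, best = score("Click the square to the right of the triangle",
                       demo_action="click(elem3)", context={})
print(best)  # click(tag=square & rightOf(triangle))
```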
Two key ideas
1. Represent the logical form (l) for any step in a context (c) with a set of latent variables denoting:
Ø Action to perform: click or type
Ø Web-element to act on
Ø Attributes of the web-element relevant for the action
Ø Relations between the web-element and other elements relevant for the action

2. Use inverse semantics of DSL operators to enumerate candidate logical forms
Ø Reason backwards from observed demonstrations
Ø Guarantees consistency/executability of logical forms in any context
Model Training
Ø Optimization with variational EM
Ø E-step: infer latent-variable assignments (l) for demonstrations (d)
Ø M-step: update parameters of the semantic generation model, P(x | l)

Ø Testing: choose the action that best models the explanation for a step
Ø Since we’re Bayesian, the chosen action may not correspond to any single logical form
Evaluation: Task Completion Rates

Ø Guarantees executability of logical forms
Ø Performance comparable to semantic parsing with full supervision
Ø Explanations can significantly reduce the sample complexity of RL/IL
Evaluation: Heatmap for Language Mappings
3. Few-shot learning using pretrained language models
Tam et al., arXiv 2021: Improving and Simplifying Pattern Exploiting Training
Context: Few-shot Learning on NL tasks
Ø GPT-3 (2020) has surprising few-shot performance on NL tasks
Ø ~SOTA with 32 labeled examples on SuperGLUE tasks
Ø NLI, sentiment prediction, etc.

Ø Schick & Schütze (2020) show similar performance with much smaller models using PET
Ø Leveraging manually specified patterns
Ø And lots of unlabeled task-specific data

Ø We show how to do this with no unlabeled data
Ø 0.1% of GPT-3’s parameters
Ø And 0.3% of PET’s data!
PET’s Main Ideas
1. Patterns and verbalizers

E.g., consider sentiment analysis for movie reviews:
“The acting was bad and the script was boring.”

Pattern: converts an example to a cloze-type question with a manually defined template:
“The acting was bad and the script was boring. All in all, the movie was _____”

Verbalizer: maps some predefined token values to labels, e.g. filling the blank with “terrible” maps to the label negative

2. Train smaller LMs for each pattern on the 32 labeled examples, and ensemble with model distillation using unlabeled data
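As a small illustration of the pattern + verbalizer idea, here is a sketch that scores the cloze blank with an off-the-shelf masked LM; the template, the verbalizer words, and the model choice (BERT rather than the ALBERT models PET uses) are illustrative, and PET additionally fine-tunes the LM on the 32 labeled examples:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

VERBALIZER = {"positive": "great", "negative": "terrible"}  # label -> single token

def pattern(review: str) -> str:
    """Convert an example into a cloze question with a manually defined template."""
    return f"{review} All in all, the movie was {tokenizer.mask_token}."

def classify(review: str) -> str:
    inputs = tokenizer(pattern(review), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                     # [1, seq_len, vocab]
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    scores = {}
    for label, word in VERBALIZER.items():
        token_id = tokenizer.convert_tokens_to_ids(word)    # assumes a single-token word
        scores[label] = logits[0, mask_pos, token_id].item()
    return max(scores, key=scores.get)

print(classify("The acting was bad and the script was boring."))
```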
ADAPET: Improve & Simplify PET
Ø Task-specific unlabeled data is often unrealistic for low-data scenarios (e.g. pairs of sentences for NLI tasks)

Ø Fine-tuning LMs on small labeled data can be unstable

Ø ADAPET alleviates these issues through more strongly supervised multi-task training

Ø Two core ideas:
1) Training with non-label tokens
2) Label conditioning
ADAPET: Improve & Simplify PET
1. Training with non-label tokens
“The acting was bad and the script was boring. All in all, the movie was _____”
PET: make gradient updates to improve the likelihood of the correct tokens (from the verbalizer), e.g. “terrible”
ADAPET: also down-weight the probabilities of all other words in the vocabulary (e.g. “bogus”, “movie”, “gorilla”, “boy”, “boating”, “pink”)

2. Label conditioning
“The acting was bad and the script was <MASK>. All in all, the movie was terrible”
ADAPET: given the right label, what is the context? Predict randomly masked tokens in the context given the label
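Here is a compact sketch of how those two objectives could be written as losses; the tensor shapes and variable names are illustrative rather than the released ADAPET code:

```python
import torch
import torch.nn.functional as F

def decoupled_label_loss(mask_logits, correct_token_id):
    """(1) Training with non-label tokens: treat the [MASK] position as one binary
    decision per vocabulary entry; push the correct verbalizer token toward 1 and
    every other vocabulary token toward 0."""
    probs = torch.sigmoid(mask_logits)                 # [vocab_size]
    target = torch.zeros_like(probs)
    target[correct_token_id] = 1.0
    return F.binary_cross_entropy(probs, target)

def label_conditioning_loss(context_logits, original_ids, masked_positions):
    """(2) Label conditioning: with the correct verbalizer token written into the
    pattern, mask random context tokens and train the LM to recover them."""
    return F.cross_entropy(context_logits[masked_positions],
                           original_ids[masked_positions])

# Toy shapes only: vocab of 10, a 6-token context with positions 1 and 4 masked.
vocab, seq = 10, 6
mask_logits = torch.randn(vocab, requires_grad=True)
context_logits = torch.randn(seq, vocab, requires_grad=True)
loss = (decoupled_label_loss(mask_logits, correct_token_id=3)
        + label_conditioning_loss(context_logits,
                                  torch.randint(vocab, (seq,)),
                                  torch.tensor([1, 4])))
loss.backward()
```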
ADAPET Results

Approach   Unlabeled data?   Ensemble?   Gradient updates?   SuperGLUE avg
GPT-3      –                 –           –                   71.8
PET        ✓                 ✓           ✓                   74.0
iPET       ✓                 ✓           ✓                   75.4
ADAPET     –                 –           ✓                   76.0

Ø Comparable performance with PET/iPET ensembles that use multiple patterns and unlabeled data

Ø Label conditioning leads to most of the gains
Other directions
Ø Learning with mixed-initiative dialog
Ø Allow the learner to ask questions?
Ø Ground neural conversational models for NLP in downstream applications

Ø Learning from a crowd / the Web
Ø Leverage multiple teachers
Ø Learn from contradictory advice?
Other directions
Ø Learn complex tasks from a mix of supervision:
Ø Demonstrations, explanations, experimentation, observation

Ø Characterizing learning from language
Ø Which learning tasks are better learned through language?
Questions?

Learning from fewer examples

Ø LNL consistently outperforms BoW, especially with fewer examples

