
Case Study: Few-shot Learning with Language

Duke Machine Learning Summer School

Shashank Srivastava
UNC Chapel Hill
Agenda

1. What is NLP?
– Some NLP Applications

2. Few-shot learning with language


– Classifiers from NL explanations
– Sequential decision making from NL explanations
– Prompt-based learning

What is NLP?
• Having computers understand human language and communication
o Deeper understanding of text beyond string matching

• What makes it hard?
o Human language is complex, diverse and ambiguous
NL ambiguities
Ø Word sense ambiguities
• “Kids make nutritious snacks” (are the kids preparing the snacks, or are they the snacks?)

Ø Syntactic ambiguities
• “Complaints about NBA referees growing ugly”
• “Ban on nude dancing on governor’s desk”

Ø Paralinguistics
• “She said that she loved him” (the meaning shifts depending on which word is stressed when spoken)
NLP application: Part-of-speech tagging

Bill directed plays about English kings

Bill (proper noun, noun, verb?) → NNP
directed (verb, adjective?) → VBD
plays (noun, verb?) → NNS
about (prep, particle?) → IN
English (proper noun, noun, adjective?) → JJ
kings (noun, verb?) → NNS

Example borrowed from Noah Smith
NLP applications: Question Answering

NLP applications: Dialog agents

NLP applications: Machine Translation

NLP applications: Summarization

NLP applications: Response generation

NLP applications: Creative Language Generation
Language Models are Unsupervised Multitask Learners. Radford et al., 2019.
NLP applications: VQA
What food on the tray is not inside a plastic cylinder?

GQA: A new dataset for real-world visual reasoning and compositional question answering. Hudson and Manning. CVPR 2019.
NLP applications: Instruction Following

Learning language games through interaction. Wang et al. ACL 2016.
NLP applications: Control

Pick up the rattling object and place it in the tray

Improving Grounded Natural Language Understanding through Human-Robot Dialog. Thomason et al. arXiv 2019.
NLP applications: Navigation

https://bringmeaspoon.org/
NLP applications: Text to Scene Generation
There is a table and there are four chairs in the room. There are four plates with four sandwiches.

Text to 3D Scene Generation with Rich Lexical Grounding. Chang et al. ACL 2015.
Can computers efficiently learn new tasks through human language interactions with their users?
Towards Conversational Learning?
Ø ML currently relies on ‘big data’
Ø Inaccessible to non-experts
Ø Theoretical limits on what can be learned: n ≈ log |H| labeled examples (see the bound sketched below)

Ø Much of human learning uses language

Ø Extend ML to richer forms of input
Ø Explanations, instructions, clarifications …
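For context, the shorthand n ≈ log |H| comes from the standard realizable PAC bound for a finite hypothesis class; this is a sketch of that bound, not something stated on the slides:

```latex
% Realizable PAC bound (finite hypothesis class H): with probability at least 1 - \delta,
% every hypothesis consistent with the training sample has true error at most \epsilon once
\[
  n \;\ge\; \frac{1}{\epsilon}\left( \ln \lvert \mathcal{H} \rvert + \ln \frac{1}{\delta} \right),
\]
% so the number of labeled examples needed grows with \log \lvert \mathcal{H} \rvert.
```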
1. Training classifiers without labels, from NL explanations only
‘Emails from my boss are usually important’ (label set: {important, not-important})
Srivastava et al., ACL 2018: Zero-shot Training of Classifiers from Natural Language Quantification

2. Learning procedures from explanations and a single demonstration
Srivastava et al., ACL 2020: Learning Web-based Procedures by Reasoning over Explanations and Demonstrations in Context

3. Few-shot learning using pretrained language models
Tam et al., arXiv 2021: Improving and Simplifying Pattern Exploiting Training
Training Classifiers without Labels
Is this email important?

‘Emails that I reply to are usually important’
‘Such emails mention a deadline or a meeting’
‘If the subject says urgent …’

NL explanations → executable feature functions
NL as feature functions
Semantic parsing maps NL to formal logical forms

Natural language statement (s) → logical form (l) → evaluation in a context (z = [l]_x):
• ‘three less than twenty times six’ → minus(prod(20, 6), 3) → 117
• ‘What is the longest river that flows through Pittsburgh?’ → argmax(river(x) ∧ traverse(x,y) ∧ const(y, pittsburgh), length) → Ohio
• ‘Emails that I reply to are usually important’ → (email.replied == true) → Yes/No
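To make the last row concrete, here is a minimal sketch of what an “executable feature function” could look like once such an explanation is parsed; the email fields (replied, subject) and helper names are illustrative, not the paper’s actual representation:

```python
# Hypothetical parsed form of "Emails that I reply to are usually important":
#   feature x: email.replied == true
#   label   y: important = true
#   quantifier: "usually"

def feature_replied(email: dict) -> bool:
    """Executable feature function for the logical form (email.replied == true)."""
    return email.get("replied", False)

def feature_subject_mentions_urgent(email: dict) -> bool:
    """Illustrative feature for 'If the subject says urgent ...'."""
    return "urgent" in email.get("subject", "").lower()

example = {"subject": "URGENT: budget deadline", "replied": True}
print(feature_replied(example))                  # True
print(feature_subject_mentions_urgent(example))  # True
```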
NL as model constraints

Ø Leverage quantifier expressions in language

Sequential Approach

‘Emails that I reply to are usually important’

Step 1: Semantic parser maps language to quantitative constraints
x: (email.replied == true)
y: important:true
E_{y|x}[φ(x, y)] = b_usually

Step 2: Posterior regularization incorporates the constraints into model training
Classifier f_θ : x → y, trained on unlabeled data
Training classifiers from declarative NL
Ø Explanations encode multiple properties that can aid statistical learning

‘Emails that I reply to are usually important’

1. Features important for a learning problem


ü x : repliedTo:true
2. Class labels
ü y : Important
3. Type of relationship between features and labels
ü P(y|x)
4. Strength of Relationship
ü Specified by quantifier?
Constraint types
Ø Constraint types:
i. About a third of the emails that I get are important : P(y)
ii. Emails that I reply to are usually important : P(y|x)
iii. I almost always reply to important emails : P(x|y)

Ø Novelty largely in identifying the type of the assertion


Ø Primarily depends on syntactic features
ü Features based on dependency paths
ü Presence/absence of negation words
ü Identifying active/passive voice
ü Order of occurrence of triggers for x and y

‘Emails that I reply to are usually important’

P(important | replied:true) ≈ p_usually
Semantics of quantifiers
Ø Leverage semantics of linguistic quantifiers
Ø Associate point probability estimates with frequency adverbs and determiners (a code sketch of this mapping follows the table)

Frequency quantifier                        Probability value
always, certainly, definitely, all          0.95
usually, normally, generally, likely        0.70
most, majority                              0.60
often, half                                 0.50
many                                        0.40
sometimes, frequently, some                 0.30
few, occasionally                           0.20
rarely, seldom                              0.10
never                                       0.05

Ø Purely subjective beliefs, not calibrated on any data
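One direct way to encode this table is a lookup from quantifier to its point probability estimate; a minimal sketch, with values copied from the table above and the synonym grouping as shown there:

```python
# Point probability estimates for frequency quantifiers (values from the table above).
QUANTIFIER_PROB = {
    "always": 0.95, "certainly": 0.95, "definitely": 0.95, "all": 0.95,
    "usually": 0.70, "normally": 0.70, "generally": 0.70, "likely": 0.70,
    "most": 0.60, "majority": 0.60,
    "often": 0.50, "half": 0.50,
    "many": 0.40,
    "sometimes": 0.30, "frequently": 0.30, "some": 0.30,
    "few": 0.20, "occasionally": 0.20,
    "rarely": 0.10, "seldom": 0.10,
    "never": 0.05,
}

def quantifier_to_prob(word: str, default: float = 0.5) -> float:
    """Map a frequency adverb/determiner to its assumed point probability."""
    return QUANTIFIER_PROB.get(word.lower(), default)

# 'Emails that I reply to are usually important'  ->  P(important | replied) ≈ 0.70
print(quantifier_to_prob("usually"))
```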
Posterior Regularization
Ø Use the posterior regularization (PR) principle to incorporate human-provided advice into learned models
Ø Unobserved class labels are treated as latent variables

Ø PR optimizes a latent variable model subject to a set of constraints on the posterior distribution p_θ(y | x)

Ø Modified EM alternation:
• E-step: infer label assignments q_X(Y) for the unlabeled data, regularized to lie in the constraint set Q given by the NL constraints
• M-step: update classifier parameters using the inferred labels
Posterior Regularization
Ø Train with modified EM to maximize the PR objective:

J_Q(θ) = L(θ) − min_{q∈Q} KL( q(Y) ‖ p_θ(Y|X) )

Ø The likelihood term L(θ) improves fit to the data; the KL term pulls the posterior toward the constraint set Q, emulating the human advice.
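For concreteness, here is a minimal runnable sketch of that modified EM loop for a logistic-regression classifier and a single “usually”-style expectation constraint; the data, feature layout, and hyperparameters are made up for illustration, and this is not the paper’s implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def e_step(p_pos, mask, b, lam_range=(-20.0, 20.0), iters=50):
    """E-step: KL-project the model posterior onto the constraint set Q.
    Single constraint: among examples with mask==True, the average q(y=1|x)
    should equal b. For one expectation constraint the projection simply
    rescales the odds by exp(lam); lam is found by bisection."""
    lo, hi = lam_range
    q_pos = p_pos.copy()
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        odds = p_pos[mask] / (1.0 - p_pos[mask] + 1e-12) * np.exp(lam)
        q_pos = p_pos.copy()
        q_pos[mask] = odds / (1.0 + odds)
        if q_pos[mask].mean() > b:
            hi = lam
        else:
            lo = lam
    return q_pos

def m_step(X, q_pos, w, lr=0.5, steps=200):
    """M-step: fit logistic-regression parameters to the soft labels q."""
    for _ in range(steps):
        p = sigmoid(X @ w)
        w = w + lr * X.T @ (q_pos - p) / len(X)
    return w

def train_pr(X, mask, b, em_iters=10):
    """Modified EM maximizing J_Q(θ) = L(θ) − min_{q∈Q} KL(q ‖ p_θ(Y|X))."""
    w = np.zeros(X.shape[1])
    for _ in range(em_iters):
        p_pos = sigmoid(X @ w)           # current posterior p_θ(y=1 | x)
        q_pos = e_step(p_pos, mask, b)   # pull posterior toward the NL constraint
        w = m_step(X, q_pos, w)          # update θ on the regularized posterior
    return w

# Toy usage for 'Emails that I reply to are usually important' (b_usually = 0.70):
# column 0 of X is the repliedTo feature; no labels are used anywhere.
X = np.array([[1.0, 0.9], [1.0, 0.2], [1.0, 0.6], [0.0, 0.7], [0.0, 0.1]])
w = train_pr(X, mask=(X[:, 0] == 1.0), b=0.70)
print(sigmoid(X @ w))  # predicted P(important | x) after constraint-guided training
```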
Synthetic shape classification
Ø Turkers observe samples of shapes from synthetically generated datasets, and describe them through statements.

ü 50 datasets
ü 30 workers
ü 4.3 statements per task on average

Example statements:
1. Selected shapes are almost always a square
2. Other shapes rarely have a blue border
3. If a shape has a red fill, it is most likely not a selected shape …
Results (scatter plot): LNQ accuracy vs. Bayes-optimal accuracy, with tasks ranging from harder to easier
Each dot represents a dataset (and corresponding classification task) generated from a known distribution
Average Classification Accuracy (shapes data)

Approach                       Avg accuracy   Access to labels   Access to statements
LNQ                            0.751          no                 yes
Bayes Optimal                  0.831          --                 --
Logistic Regression            0.737          yes                no
Random                         0.524          --                 --
LNQ (no quantification)        0.545          no                 yes
LNQ (coarse quantification)    0.679          no                 yes
Human                          0.734          no                 yes
Real tasks: Bird species detection

ü 10 species from the CUB-200 dataset
ü 60 examples per species
ü 53 pre-specified attributes
ü 6.1 statements per task on average

Example explanations:
• A specimen that has a striped crown is likely to be a selected bird
• Birds in the other category rarely ever have dagger-shaped beaks
Real tasks: Email foldering
Ø Emails representing common email categories collected through AMT
Ø Reminders, meeting invitations, requests from boss, internet humor, going out with friends, policy announcements, etc.

ü 1100 emails in all
ü 7 categories
ü 30 statements per category

Example explanations:
Most reminders mention a date and a time in the message of the email
The sender of the email is the same as the recipient
These emails usually close with a name or title
These emails sometimes have jpg attachments
The email likely has words like "policy" or "announcement" in the subject
Emails from a public domain are not office requests
Results
(Figures: results on Bird Species Identification and Email Categorization)
Empirical distributions of probability values

rarely:        μ = 0.06, σ = 0.05
sometimes:     μ = 0.29, σ = 0.18
often:         μ = 0.46, σ = 0.15
majority:      μ = 0.73, σ = 0.16
most:          μ = 0.81, σ = 0.15
(most) likely: μ = 0.86, σ = 0.09
2. Learning procedures from explanations and a single demonstration
Srivastava et al., ACL 2020: Learning Web-based Procedures by Reasoning over Explanations and Demonstrations in Context
Beyond Concept Learning
A lot of routine, repetitive type/click tasks:
Ø Book flight tickets
Ø Order pizza
Ø Process reimbursements
Ø Like social media posts
Ø …

Research question: Can such tasks be taught from a single demonstration paired with NL explanations?
User: Send me NLP related news every day at 8am

Agent: Can you show me how?

User: Let me teach you …
First, type ‘news.com’ in the URL bar at the top of the browser, and press enter
Then, type ‘NLP’ in the search bar at the top-right, and press enter
Finally, email me the link to the three most recent articles
Why is this interesting for NLP?

Ø Language grounding, pragmatics, script induction


Ø Environment rich in textual, structural and spatial features
Ø Reasoning on noisy and semi-structured data
Ø Limited world-semantics

Framework
Use the Mini World-of-Bits framework (Shi et al, ICML’17)
Ø Interactive interfaces for web-like tasks
Ø Example tasks: clicking specified buttons, forwarding emails, liking social-media posts, etc.

Originally developed as a testbed for RL/IL; here we:
Ø add NL explanations
Ø restrict to single demonstrations
Explained Demonstration Dataset
Ø 520 demonstrations & stepwise explanations (AMT)
Ø 3.3 explanations/demonstration
Ø Most explanations (97%) follow the sequence of actions in the demonstration

Task: Forward an email
Click on the segment that mentions Maureen
Click on the button named “Forward” at the bottom of the page
Type in the word ‘Amata’ in front of the row tagged ‘to’
Click on the arrow button at the top of the page

Task: Select a radio button
Focus on the word sequence after Select
Click on the radio button to the left of the word sequence
Press submit
Web DSL
Ø DSL operators for:
Ø Click/Type actions on web-elements
Ø Filter web-elements with specific features & relations
Ø Filter strings based on features & relations

Ø Extends constraint language in (Liu et al, ICLR’18)

LED Approach

Explanation: ‘Click the square to the right of the triangle’
Candidate logical form: click(tag=square & rightOf(triangle))
Demonstrated action: click(elem3)

Infer latent programs (l) in the DSL that are (1) consistent with the demonstration (d), and (2) relevant to the NL explanation (x):

P(x | d, c) = Σ_l P(x | l) · P(l | d, c)
              relevance     consistency
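To make the factorization concrete, here is a toy sketch of scoring candidate programs by relevance × consistency; the candidate enumeration and the lexical-overlap relevance model are placeholders, not the paper’s learned components:

```python
import re

def tokens(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def candidate_programs(demo_action, context):
    """Stand-in for inverse semantics: enumerate DSL programs that reproduce the
    demonstrated action when executed in `context`. Toy hand-written candidates here."""
    return ["click(tag=square & rightOf(triangle))",
            "click(tag=square)",
            "click(id=elem3)"]

def relevance(explanation, program):
    """Stand-in for P(x | l): crude lexical overlap between explanation and program."""
    return len(tokens(explanation) & tokens(program)) / max(len(tokens(explanation)), 1)

def score(explanation, demo_action, context):
    """P(x | d, c) ≈ Σ_l relevance(x, l) · consistency(l | d, c); consistency is
    uniform here because every enumerated candidate already replays the demo."""
    programs = candidate_programs(demo_action, context)
    consistency = 1.0 / len(programs)
    marginal = sum(relevance(explanation, l) * consistency for l in programs)
    best = max(programs, key=lambda l: relevance(explanation, l))
    return marginal, best

marginal, best = score("Click the square to the right of the triangle",
                       demo_action="click(elem3)", context={})
print(best)  # click(tag=square & rightOf(triangle))
```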
Two key ideas
1. Represent the logical form (l) for any step in a context (c) with a set of latent variables denoting:
Ø Action to perform: click or type
Ø Web-element to act on
Ø Attributes of the web-element relevant for the action
Ø Relations between the web-element and other elements relevant for the action

2. Use inverse semantics of DSL operators to enumerate candidate logical forms
Ø Reason backwards from observed demonstrations
Ø Guarantees consistency/executability of logical forms in any context
Model Training
Ø Optimization with variational EM
Ø E-step: infer latent-variable assignments (l) for demonstrations (d)
Ø M-step: update parameters of the semantic generation model, P(x | l)

Ø Testing: choose the action that best models the explanation for a step
Ø Since we’re Bayesian, the chosen action may not correspond to any single logical form
Evaluation: Task Completion Rates

Ø Guarantees executability of logical forms
Ø Performance comparable to semantic parsing with full supervision
Ø Explanations can significantly reduce the sample complexity of RL/IL
Evaluation: Heatmap for Language Mappings
3. Few-shot learning using pretrained language models
Tam et al., arXiv 2021: Improving and Simplifying Pattern Exploiting Training
Context: Few-shot Learning on NL tasks
Ø GPT-3 (2020) has surprising few-shot performance on NL tasks
Ø ~SOTA with 32 labeled examples on SuperGLUE tasks
Ø NLI, sentiment prediction, etc.

Ø Schick & Schütze (2020) show similar performance with much smaller models using PET
Ø Leveraging manually specified patterns
Ø And lots of unlabeled task-specific data

Ø We show how to do this with no unlabeled data
Ø 0.1% of GPT-3’s parameters
Ø And 0.3% of PET’s data!
PET’s Main Ideas
1. Patterns and verbalizers

E.g., consider sentiment analysis for movie reviews:
“The acting was bad and the script was boring.”

Pattern: converts an example to a cloze-type question with a manually defined template:
“The acting was bad and the script was boring. All in all, the movie was _____”

Verbalizer: maps some predefined token values to labels, e.g. filling the blank with “terrible” maps to the label negative

2. Train smaller LMs for each pattern on the 32 labeled examples, and ensemble with model distillation using unlabeled data
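As a small illustration of the pattern + verbalizer idea, here is a sketch that scores the cloze blank with an off-the-shelf masked LM; the template, the verbalizer words, and the model choice (BERT rather than the ALBERT models PET uses) are illustrative, and PET additionally fine-tunes the LM on the 32 labeled examples:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

VERBALIZER = {"positive": "great", "negative": "terrible"}  # label -> single token

def pattern(review: str) -> str:
    """Convert an example into a cloze question with a manually defined template."""
    return f"{review} All in all, the movie was {tokenizer.mask_token}."

def classify(review: str) -> str:
    inputs = tokenizer(pattern(review), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                     # [1, seq_len, vocab]
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    scores = {}
    for label, word in VERBALIZER.items():
        token_id = tokenizer.convert_tokens_to_ids(word)    # assumes a single-token word
        scores[label] = logits[0, mask_pos, token_id].item()
    return max(scores, key=scores.get)

print(classify("The acting was bad and the script was boring."))
```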
ADAPET: Improve & Simplify PET
Ø Task-specific unlabeled data is often unrealistic for low-data scenarios (e.g. pairs of sentences for NLI tasks)

Ø Fine-tuning LMs on small labeled data can be unstable

Ø ADAPET alleviates these issues through more strongly supervised multi-task training

Ø Two core ideas:
1) Training with non-label tokens
2) Label conditioning
ADAPET: Improve & Simplify PET
1. Training with non-label tokens
“The acting was bad and the script was boring. All in all, the movie was _____”
PET: make gradient updates to improve the likelihood of the correct tokens (from the verbalizer), e.g. “terrible”
ADAPET: also down-weight the probabilities of all other words in the vocabulary (e.g. “bogus”, “movie”, “gorilla”, “boy”, “boating”, “pink”)

2. Label conditioning
“The acting was bad and the script was <MASK>. All in all, the movie was terrible”
ADAPET: given the right label, what is the context? Predict randomly masked tokens in the context given the label
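Here is a compact sketch of how those two objectives could be written as losses; the tensor shapes and variable names are illustrative rather than the released ADAPET code:

```python
import torch
import torch.nn.functional as F

def decoupled_label_loss(mask_logits, correct_token_id):
    """(1) Training with non-label tokens: treat the [MASK] position as one binary
    decision per vocabulary entry; push the correct verbalizer token toward 1 and
    every other vocabulary token toward 0."""
    probs = torch.sigmoid(mask_logits)                 # [vocab_size]
    target = torch.zeros_like(probs)
    target[correct_token_id] = 1.0
    return F.binary_cross_entropy(probs, target)

def label_conditioning_loss(context_logits, original_ids, masked_positions):
    """(2) Label conditioning: with the correct verbalizer token written into the
    pattern, mask random context tokens and train the LM to recover them."""
    return F.cross_entropy(context_logits[masked_positions],
                           original_ids[masked_positions])

# Toy shapes only: vocab of 10, a 6-token context with positions 1 and 4 masked.
vocab, seq = 10, 6
mask_logits = torch.randn(vocab, requires_grad=True)
context_logits = torch.randn(seq, vocab, requires_grad=True)
loss = (decoupled_label_loss(mask_logits, correct_token_id=3)
        + label_conditioning_loss(context_logits,
                                  torch.randint(vocab, (seq,)),
                                  torch.tensor([1, 4])))
loss.backward()
```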
ADAPET Results

Approach   Unlabeled data?   Ensemble?   Gradient updates?   SuperGLUE avg
GPT-3      –                 –           –                   71.8
PET        ✓                 ✓           ✓                   74.0
iPET       ✓                 ✓           ✓                   75.4
ADAPET     –                 –           ✓                   76.0

Ø Comparable performance with PET/iPET ensembles that use multiple patterns and unlabeled data

Ø Label conditioning leads to most of the gains
Other directions
Ø Learning with mixed-initiative dialog
Ø Allow the learner to ask questions?
Ø Ground neural conversational models for NLP in downstream applications

Ø Learning from a crowd / the Web
Ø Leverage multiple teachers
Ø Learn from contradictory advice?
Other directions
Ø Learn complex tasks from a mix of supervision:
Ø Demonstrations, explanations, experimentation, observation

Ø Characterizing learning from language
Ø Which learning tasks are better learned through language?
Questions?

Learning from fewer examples

Ø LNL consistently outperforms BoW, especially with fewer examples

