
Natural Language Processing (NLP) Research
at Boston University

Derry Wijaya
Boston University
About Me
• BSc and MSc in Computer Science from National University of Singapore

• PhD in Computer Science, Language Technologies Institute, Carnegie Mellon University

• Postdoc in Computer Science at University of Pennsylvania

• Asst. Professor in Computer Science at Boston University

• Developed an interest in the field of Information Extraction as an undergraduate doing an Undergraduate Research Opportunity Project
Research Focus
• Natural Language Processing (NLP) and Machine Learning

• Media Framing: how different media "frame" news stories differently

• Low Resource Machine Translation: automatic translation for languages that have little translation data for training
Introduction to NLP
• Natural Language Processing

• techniques to give computers the ability to process, learn, and understand human languages

• other names: speech and language processing, human language technologies, computational linguistics, speech recognition and synthesis, …
Goal of NLP
get computers to perform tasks involving human language

• conversational agent

• machine translation (MT)

• question answering (QA)

• information extraction (IE)

• …
NLP for Interdisciplinary Research
• Psychology: measure stress, anxiety, depression based on patients' tweets or social media posts
• Medicine: extract information from doctors' or physicians' notes
• Public Health: measure epidemiological risk factors like diabetes and obesity from users' tweets and Instagram posts
• Economics: predict the stock performance of a company based on news articles about the company
• Accounting: automatically create balance sheets from financial reports
• Education: automated tutoring, grading
• Communication: detect media frames, the angles of a story
Goal of NLP
Computers solving tasks involving human language

(Diagram: knowledge of language, captured by a model, combines with an algorithm that solves the task and resolves ambiguity. Image source: medium.com)


Models
• NLP systems rely on models to capture knowledge of language, e.g., formal rule systems to capture morphology and syntax/grammar

• context-free grammar: a set of production rules that describe all possible strings, with rules that can be applied regardless of context

  "Hamid Ansari was nominated for Vice President"
  "Vice President was nominated for Hamid Ansari"
  "Hamid Ansari was nominated"
  "Vice President was nominated"
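As a toy illustration (the grammar below is invented for this sketch, not taken from any real system), production rules like these can be enumerated in a few lines of Python. Note how the grammar happily derives sentences that are grammatical but semantically odd, precisely because each rule applies regardless of context:

```python
import itertools

# A toy context-free grammar written as production rules.
# Nonterminals map to lists of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Hamid Ansari"], ["Vice President"]],
    "VP": [["was nominated"], ["was nominated", "for", "NP"]],
}

def generate(symbol="S"):
    """Yield every string the grammar can derive from `symbol`."""
    if symbol not in GRAMMAR:          # terminal: yield it as-is
        yield symbol
        return
    for rhs in GRAMMAR[symbol]:        # try each production rule
        # expand every symbol on the right-hand side, then combine
        expansions = [list(generate(s)) for s in rhs]
        for combo in itertools.product(*expansions):
            yield " ".join(combo)

for sentence in generate():
    print(sentence)   # includes all four example sentences above
```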
Models
• e.g., vector space models to capture word meanings: "you shall know a word by the company it keeps" (Firth, J. R. 1957:11)

(Figure: word2vec clusters such as pasta, lamb, cheese, mushroom; citrus, apple, orange, lime; aromatic, nose, scent, perfume)
http://methodmatters.blogspot.com/2017/11/using-word2vec-to-analyze-word.html
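The distributional idea behind those clusters can be sketched numerically. The co-occurrence counts below are invented for illustration (contexts "pasta", "citrus", "scent"): words that keep similar company end up with similar vectors, and cosine similarity recovers the clusters:

```python
import math

# Toy co-occurrence vectors (hypothetical counts) over the contexts
# ("pasta", "citrus", "scent"): a word is known by the company it keeps.
vectors = {
    "cheese":   [9, 1, 0],
    "mushroom": [8, 2, 0],
    "lime":     [0, 9, 2],
    "orange":   [1, 9, 1],
    "perfume":  [0, 2, 9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(word):
    """Most similar other word under cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest("cheese"))  # -> mushroom: food words cluster together
```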
Algorithms for Solving Tasks
• Given the models, search through a space of hypotheses about an input

• e.g., a classifier can be trained to compute the sentiment polarity of a word, i.e., whether it conveys a positive/negative sentiment, based on the word's vector

• e.g., a machine translation algorithm searches through a space of translation hypotheses for the correct translation of a sentence into another language
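A minimal sketch of the first example, with toy 2-d word vectors and a nearest-centroid rule standing in for a trained classifier (all vectors and words here are invented for illustration):

```python
# Toy 2-d "word vectors" (hypothetical values) and a nearest-centroid
# classifier for sentiment polarity, standing in for a trained model.
train = {
    "great":     ([0.9, 0.8], "pos"),
    "wonderful": ([0.8, 0.9], "pos"),
    "awful":     ([-0.9, -0.7], "neg"),
    "terrible":  ([-0.8, -0.9], "neg"),
}

def centroid(label):
    vecs = [v for v, y in train.values() if y == label]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def polarity(vec):
    """Label a word vector by its closer class centroid (squared distance)."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(("pos", "neg"), key=lambda y: dist2(vec, centroid(y)))

print(polarity([0.7, 0.6]))  # -> pos: a vector near the positive words
```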
NLP and the Measure of Intelligence
using language as humans do == truly intelligent machines?

Turing Test
by responding as a person to the examiner's questions, the machine wins if it can convince the examiner that it is a person

https://xkcd.com/329/
NLP and the Measure of Intelligence
ELIZA program (Weizenbaum, 1966)
NLP system that imitates a psychotherapist

ELIZA uses pattern matching and knows nothing of the world, but many people thought that it really understood them and their problems!

says more about the people than about the machine
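The flavor of ELIZA's pattern matching can be sketched in a few lines. The rules below are simplified illustrations in the spirit of the program, not Weizenbaum's actual script (which, among other things, also swapped pronouns like "my" to "your"):

```python
import re

# A few ELIZA-style rules (simplified, illustrative): pattern -> template.
# The fallback "(.*)" rule means the program always has something to say.
RULES = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"my (.*)", "Tell me more about your {0}."),
    (r"(.*)", "Please go on."),
]

def respond(utterance):
    text = utterance.lower().strip(".!?")
    for pattern, template in RULES:
        m = re.fullmatch(pattern, text)
        if m:
            return template.format(*m.groups())

print(respond("I need a break"))  # -> Why do you need a break?
```

No knowledge of the world is involved anywhere: the response simply echoes back whatever matched the wildcard.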
NLP and the Measure of Intelligence
using language as humans do == truly intelligent machines?

regardless, people talk about computers and interact with them as social entities, expecting computers to understand their needs and to interact naturally (Reeves and Nass 1996)

The importance of NLP!

An Exciting Time for NLP!
• Increase in computing resources

• Increase in the amount of data and information available in digital form

• Development of highly successful machine learning methods and competitive evaluations (SemEval, NIST, CoNLL shared tasks, Kaggle)

• Richer understanding of the structure of human language and its deployment in social contexts
State of the Art
• Simple methods often work very well when trained on large quantities of data

• e.g., many text and sentiment classifiers still rely on sets of words ("bag of words") without regard to sentence and discourse structure or meaning
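A bag-of-words sentiment classifier really can be this simple. The sketch below uses invented word lists and ignores word order, syntax, and discourse structure entirely:

```python
from collections import Counter

# A minimal bag-of-words sentiment classifier (illustrative toy lexicon):
# score a text by counting known positive vs. negative words.
POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def classify(text):
    counts = Counter(text.lower().split())
    score = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    return "positive" if score >= 0 else "negative"

print(classify("a great movie with an excellent cast"))  # -> positive
print(classify("boring plot and terrible acting"))       # -> negative
```

Of course, word order does matter ("not good" defeats this classifier), which is exactly the limitation the slide alludes to.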
State of the Art
• However, most NLP resources and systems are available only for high resource languages

• Many low resource languages are spoken by millions of people, e.g., Bengali, Javanese, Swahili, …

• The challenge is how to develop resources and tools for thousands of languages, not just a few
The State and Fate of Linguistic Diversity in the NLP World
• However, most NLP resources and systems are available only for high resource languages

(Figure: languages grouped into classes: the left-behinds, the scraping-bys, the hopefuls, the underdogs, the rising stars, the winners)
https://microsoft.github.io/linguisticdiversity/
The State and Fate of Linguistic Diversity in the NLP World
• Bahasa Indonesia is one of the rising stars!

Class | Example Languages                                 | #Langs | #Speakers | % of Total Langs
0     | Dahalo, Warlpiri, Popoloca, Wallisian, Bora       | 2191   | 1.2B      | 88.38%
1     | Cherokee, Fijian, Greenlandic, Bhojpuri, Navajo   | 222    | 30M       | 5.49%
2     | Zulu, Konkani, Lao, Maltese, Irish                | 19     | 5.7M      | 0.36%
3     | Indonesian, Ukrainian, Cebuano, Afrikaans, Hebrew | 28     | 1.8B      | 4.42%
4     | Russian, Hungarian, Vietnamese, Dutch, Korean     | 18     | 2.2B      | 1.07%
5     | English, Spanish, German, Japanese, French        | 7      | 2.5B      | 0.28%

Table 1: Number of languages, number of speakers, and percentage of total languages for each language class.
The State and Fate of Linguistic Diversity in the NLP World
• But Javanese, Sundanese, and Minangkabau are scraping-bys
• And… Bugis, Ternate, and Manadonese are left-behinds!

(Table 1 again, annotated: Bugis, Ternate, Manadonese, … fall in class 0; Javanese, Sundanese, Minangkabau, … fall in class 1.)
The State and Fate of Linguistic Diversity in the NLP World
• We need to collect more data on Indonesian languages!

(Figure: Bugis, Ternate, Manadonese, … among the left-behinds; Javanese, Sundanese, Minangkabau, … among the scraping-bys)
https://microsoft.github.io/linguisticdiversity/
Machine Translation
• Challenging because correct translations require:

• the ability to analyze and generate sentences in human languages

• an understanding of world knowledge and context to resolve the ambiguities of language

• e.g., the straightforward translation of the French word "bordel" is "brothel", but what if someone says "My room is un bordel" (i.e., a mess)?
Machine Translation
• Started with hand-built grammar-based systems (limited success)

Direct Translation
Figure from "A history of machine translation from the Cold War to deep learning" by Ilya Pestov
Machine Translation
Transfer-based Machine Translation

Figure from "A history of machine translation from the Cold War to deep learning" by Ilya Pestov
Machine Translation
Interlingual MT

Figure from "A history of machine translation from the Cold War to deep learning" by Ilya Pestov
Machine Translation

Figure from "A history of machine translation from the Cold War to deep learning" by Ilya Pestov
Machine Translation
• Transformed by the availability of parallel sentences, used to collect statistics of word translations and word sequences (circa 2006)

• small word groups often have distinctive translations: phrase-based MT, which formed the basis of Google Translate

Figure from "A history of machine translation from the Cold War to deep learning" by Ilya Pestov
Machine Translation
Neural Machine Translation (NMT), EMNLP 2014, Kyunghyun Cho et al.
• Another transformation using deep learning-based sequence models (circa 2014)

"The source text is encoded by one neural network, and then another neural network decodes it back into text, but in another language. The two networks know nothing about each other; each knows only its own language. Interlingua is back."
Machine Translation
Neural Machine Translation (NMT)
• Another transformation using deep learning-based sequence models (circa 2014)

The idea is close to style transfer: the language is the style, and the essence of the text stays the same. Interlingua is BACK!
Figure from "A history of machine translation from the Cold War to deep learning" by Ilya Pestov
Machine Translation
Neural Machine Translation (NMT)
• Another transformation using deep learning-based sequence models (circa 2014)

Let the deep NN figure out the specific features (i.e., the interlingua)
Figure from "A history of machine translation from the Cold War to deep learning" by Ilya Pestov
Machine Translation
• Another transformation using deep learning-based sequence models (circa 2014)

• train a deep neural network model with several representational levels to optimize translation quality

• the model learns intermediate representations that are useful for translation
Machine Translation
• Long Short-Term Memory (LSTM) networks

• maintain contextual information from early until late in a sentence
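The gating idea behind LSTMs can be sketched with a single scalar cell whose weights are hand-picked rather than trained (a toy, not a real model): the forget gate stays near 1 so the cell "latches" onto the token "not" and carries that bit of context to the end of the sentence:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One scalar LSTM cell with hand-picked weights (illustrative only):
# it writes to the cell state only when it sees "not", and otherwise
# preserves the state, so early context survives to the sentence end.
def lstm_step(x, h, c):
    f = sigmoid(10.0)                            # forget gate ~1: keep old state
    i = sigmoid(10.0 if x == "not" else -10.0)   # input gate: write on "not" only
    o = sigmoid(10.0)                            # output gate ~1: expose the state
    c = f * c + i * math.tanh(10.0)              # candidate value ~1
    return o * math.tanh(c), c                   # new hidden state, cell state

h = c = 0.0
for token in "this movie is not very good at all".split():
    h, c = lstm_step(token, h, c)
print(round(h, 2))  # -> 0.76: the early "not" is still visible at the end
```

Without the "not", the hidden state stays near zero, which is the contrast a real trained LSTM exploits.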
Neural Machine Translation (NMT)
Single system to translate between multiple languages

Figure from "https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html"

Neural Machine Translation (NMT)
"Interlingua": a 3-d representation of the internal network data of multilingual Google NMT

Figure from "https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html"
BU NLP Research Overview
• Low Resource Machine Translation: automatic translation for languages that have little translation data for training

• Media Framing: how different media "frame" news stories differently
Low Resource Machine Translation
• Can we learn to translate without parallel data?

Few/no parallel data, but plentiful monolingual data

EMNLP 2017, ACL 2018
Low Resource Machine Translation
• Word vector representations in different languages might have similar geometric arrangements (Mikolov, T., Le, Q.V. and Sutskever, I., 2013)

(Figure: English and Spanish word vector spaces)
Low Resource Machine Translation
• Starting with some "anchors", we can learn a mapping (W) that aligns words in the different languages' vector spaces
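The mapping can be sketched with toy 2-d vectors. All vectors below are invented: the "Spanish" space is constructed as an exact rotation of the English one (echoing Mikolov et al.'s observation that the spaces share geometric structure), W is fit by least squares over the anchor pairs, and a held-out word is then translated by nearest neighbor in the mapped space:

```python
import math

# Toy 2-d word vectors; the "Spanish" space is, by construction here,
# a rotation of the English one.
en = {"one": (1, 0), "two": (0, 1), "three": (1, 1), "cat": (0.9, 0.2)}
es = {"uno": (0, 1), "dos": (-1, 0), "tres": (-1, 1), "gato": (-0.2, 0.9)}
anchors = [("one", "uno"), ("two", "dos"), ("three", "tres")]

X = [en[a] for a, _ in anchors]   # rows: English anchor vectors
Y = [es[b] for _, b in anchors]   # rows: Spanish anchor vectors

def mat_t_mat(A, B):  # A^T B for row-lists of 2-d points
    return [[sum(a[i] * b[j] for a, b in zip(A, B)) for j in range(2)]
            for i in range(2)]

# Least squares in row-vector convention: W = (X^T X)^-1 X^T Y, so y ~ x @ W.
XtX, XtY = mat_t_mat(X, X), mat_t_mat(X, Y)
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
inv = [[XtX[1][1] / det, -XtX[0][1] / det],
       [-XtX[1][0] / det, XtX[0][0] / det]]
W = [[sum(inv[i][k] * XtY[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

def translate(word):
    """Map an English vector into the Spanish space; return nearest word."""
    x = en[word]
    mapped = [x[0] * W[0][j] + x[1] * W[1][j] for j in range(2)]
    return min(es, key=lambda w: math.dist(mapped, es[w]))

print(translate("cat"))  # -> gato
```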
Low Resource Machine Translation
• Then, we can use the learned bilingual spaces to find translations

(Bar chart: translation accuracy, 0-100%, of five methods (BPR_W, BPR_W+C, BPR_LINEAR, BPR_NN, BPR_WE) across languages: so, uz, vi, ne, gu, ta, te, az, bn, lv, hi, cy, hu, bs, sq, uk, sk, id, sv, tr, nl, bg, it, fr, sr, ro, es)
Low Resource Machine Translation
• We can also use images to find translations in different languages

• since images of a word (e.g., "horse") are the same no matter the language

(Figure 1, from work with Chris Callison-Burch, University of Pennsylvania: identify translations via images by finding words in different languages, e.g., Indonesian "kucing" and English "cat", whose associated images have a high degree of visual similarity)
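The image-based idea can be sketched with invented 2-d "image feature" vectors (standing in for features from a real vision model): average the features of each word's images, then propose as translation the cross-lingual word whose averaged features are most similar:

```python
import math

# Sketch of image-based translation (toy numbers): each word is
# represented by the average feature vector of its associated images;
# words whose image sets look alike are proposed as translations.
features = {
    ("id", "kucing"): [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]],
    ("en", "cat"):    [[0.88, 0.12], [0.9, 0.1]],
    ("en", "pet"):    [[0.7, 0.3], [0.6, 0.4]],
    ("en", "orange"): [[0.1, 0.9], [0.2, 0.8]],
}

def avg(vectors):
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def translate_by_images(lang_word, target_lang="en"):
    """Rank target-language words by visual similarity of their image sets."""
    src = avg(features[lang_word])
    candidates = [k for k in features if k[0] == target_lang]
    return max(candidates, key=lambda k: cosine(src, avg(features[k])))[1]

print(translate_by_images(("id", "kucing")))  # -> cat
```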
Low Resource Machine Translation
(Figure 2: five images for the abstract Indonesian word "konsep", along with its top 4 ranked translations)

• the concreteness of the word matters
Massively Multilingual Image Dataset (MMID)
• 100 languages
• 10,000 words per language
• 250,000 English word translations
• 100 images per word
• 35,000,000 total images
• 20TB of data

Hosted by Amazon Public Datasets: http://multilingual-images.org/
Low Resource Machine Translation
Summary

• We can translate even when we don't have parallel training data

• But we need other data, like monolingual data: news articles, web pages, books, images, captions, …

• Some parallel data can further help
BU NLP Research Overview
• Low Resource Machine Translation: automatic translation for languages that have little translation data for training

• Media Framing: how different media "frame" news stories differently
Media Framing
• To Frame: to select some aspects of a perceived reality and make them more salient

CoNLL 2019, ACL 2020
Media Framing
To select some aspects of a perceived reality and make them more salient

Frame: Race/Ethnicity
Frame: Mental Health
Media Framing
The political climate in the U.S. is increasingly polarized
• Main reason: news media of varied political orientations have been depicting two distinct versions of social reality (Mitchell et al., 2014; Stroud, 2011)

We need to assess the ways in which news reporters frame important public affairs
Media Framing
Gun violence is one of the most polarized issues in the country (Pew Research Center, 2018b)

Why? One potential explanation: Framing!
Media Framing
• 2,990 news headlines from 21 media outlets
• 1,300 annotated as relevant, with up to 2 frames each
• 319 have 2 frames (e.g., Frame A: Public Opinion; Frame B: 2nd Amendment)

Gun Violence Frame Corpus: codebook and dataset are publicly available at https://derrywijaya.github.io/GVFC.html
Media Framing
• BERT (Devlin et al., 2018) applied to the Gun Violence Frame Corpus
Media Framing
Some peaks represent the deadliest mass shootings in the U.S. since 1949
Media Framing
Left: 16% Society/Culture, 8% Mental Health
Neutral/Mainstream: 27% Mental Health, 9% Society/Culture
Right: 22% Mental Health, 5% Society/Culture
Media Framing
Figure 2: Comparison of frame association networks in (a) the U.S. and (b) Germany
Media Framing
Summary

• Frame analysis can be used to gain a deeper understanding of issues of public affairs that may ultimately determine public perception of those issues
What's Next?
• More collaborations needed
• To bring visibility to languages in Indonesia
• Javanese, spoken by ~70 million people, has 57 thousand Wikipedia pages
• Swedish, spoken by ~10 million people, has 3.7 million Wikipedia pages
• We need ~2 to 10 million monolingual sentences to obtain a reasonable unsupervised MT system
• Collaborate to collect data
What's Next?
• More collaborations needed
• To bring visibility to languages in Indonesia
• Research collaborations
• Co-advising (Garuda Ilmu Komputer)
• More programs like visiting research at UI
• Consulting and training
• Programming (Python, PyTorch), data science, machine learning, deep learning (https://www.ilmudata.com/)
Thank You!

• wijaya@bu.edu
