Derry Wijaya
Boston University
About Me
• BSc and MSc in Computer Science from
National University of Singapore
• conversational agent
• …
NLP for Interdisciplinary
Research
• Psychology: measure stress, anxiety, depression based on
patients’ tweets or social media posts
• Medicine: extract information from doctors’ or physicians’ notes
• Public Health: measure epidemiological risk factors, such as those for
diabetes and obesity, from users’ tweets and Instagram posts
• Economy: predict stock performance of a company based on
news articles about the company
• Accounting: automatically create balance sheets from financial
reports
• Education: automated tutoring, grading
• Communication: detect media frames, the angles from which a story is told
Goal of NLP
Computers solving tasks involving human language
(Diagram: algorithm, task; knowledge of language, solve ambiguity, captured by model)
Example word clusters: (citrus, apple, orange, lime); (aromatic, nose, scent, perfume)
http://methodmatters.blogspot.com/2017/11/using-word2vec-to-analyze-word.html
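One way a model can capture this kind of word knowledge is to place each word at a point in a vector space, so that related words end up close together. The sketch below is only an illustration: the three-dimensional vectors are invented by hand, whereas real word2vec vectors are learned from text, and the names `VECTORS`, `cosine`, and `nearest` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only;
# real word2vec vectors are learned from text and are much larger.
VECTORS = {
    "citrus": [0.9, 0.1, 0.0],
    "lime": [0.8, 0.2, 0.1],
    "orange": [0.85, 0.15, 0.05],
    "scent": [0.1, 0.9, 0.2],
    "perfume": [0.05, 0.95, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors=VECTORS):
    """Return the other word whose vector is most similar to `word`."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```

With these toy vectors, `nearest("scent")` picks a word from the fragrance cluster and `nearest("citrus")` picks one from the fruit cluster.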
Algorithms for Solving
Task
• Given the models, search through a space of
hypotheses about an input
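A minimal sketch of this idea, assuming a toy task (splitting an unspaced string into words) and a toy unigram model whose probabilities are invented for illustration. The hypothesis space is every possible split of the input; the algorithm scores each hypothesis with the model and keeps the best one. Exhaustive search is used only because the example is tiny; practical systems usually rely on dynamic programming or beam search.

```python
import math

# Toy unigram probabilities, invented for illustration; a real model
# would be estimated from a corpus.
UNIGRAM = {"the": 0.05, "table": 0.002, "tab": 0.0005, "le": 0.0001,
           "he": 0.01, "t": 0.0001, "able": 0.001}
UNKNOWN_LOGPROB = math.log(1e-12)  # penalty for strings outside the vocabulary

def hypotheses(text):
    """Enumerate every way of splitting text into contiguous pieces."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        for rest in hypotheses(text[i:]):
            yield [text[:i]] + rest

def score(segmentation):
    """Log-probability of a segmentation under the unigram model."""
    return sum(math.log(UNIGRAM[w]) if w in UNIGRAM else UNKNOWN_LOGPROB
               for w in segmentation)

def best_segmentation(text):
    """Search the hypothesis space and return the highest-scoring split."""
    return max(hypotheses(text), key=score)
```

For example, `best_segmentation("thetable")` compares 128 candidate splits and prefers `["the", "table"]` over alternatives such as `["the", "tab", "le"]`.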
Turing Test
The machine responds as a person to the examiner’s questions;
it wins if it can convince the examiner that it is a person
https://xkcd.com/329/
NLP and the Measure of
Intelligence
ELIZA program (Weizenbaum, 1966)
NLP system that imitates a psychotherapist
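ELIZA worked by matching the user's input against hand-written patterns and filling a canned response template. The sketch below illustrates that style of rule; the patterns and replies are invented for this example and are not Weizenbaum's original script.

```python
import re

# A few hand-written rules in the spirit of ELIZA; these patterns and
# responses are illustrative, not the original script.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE),
     "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE),
     "How long have you felt {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."

def respond(utterance):
    """Return the response of the first matching rule, or a default reply."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Reuse the user's own words, minus trailing punctuation.
            return template.format(match.group(1).rstrip(".!?"))
    return DEFAULT_REPLY
```

The program has no understanding of the conversation: `respond("I am unhappy.")` simply echoes the captured text back inside a question.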
the winners
https://microsoft.github.io/linguisticdiversity/
The State and Fate of Linguistic
Diversity in the NLP World
• Bahasa Indonesia is one of the rising stars!
Table 1: Number of languages, number of speakers, and percentage of total languages for each language class.
• But Javanese, Sundanese, and Minangkabau are
scraping-bys
• And… Bugis, Ternate, and Manadonese are left-behinds!
• We need to collect more data on Indonesian
languages!
Machine Translation
• Challenging because correct translations require:
Figure from "A history of machine translation from the Cold War to deep learning" by Ilya Pestov
Transfer-based Machine Translation
Interlingual MT
• Transformed with the availability of parallel
sentences to collect statistics of word translations
and word sequences (circa 2006)
• Another transformation using deep learning-based
sequence models (circa 2014)
• Neural Machine Translation: Kyunghyun Cho et al., EMNLP 2014
• Plentiful monolingual data: EMNLP 2017, ACL 2018
Low Resource Machine
Translation
• Word vector representations in different languages
might have similar geometric arrangements
(Mikolov, T., Le, Q.V. and Sutskever, I., 2013)
(Figure: word vectors in English and in Spanish)
• Starting with some "anchors" (word pairs known to be translations),
we can learn a mapping (W) that aligns words across the different
language spaces
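A minimal sketch of learning such a mapping, assuming a linear W fitted by least squares on the anchor pairs and a row-vector convention (source vector times W approximates the target vector). The data is synthetic: the "target" space is just the source space rotated by 30 degrees, so the example only shows the mechanics, not real embeddings. The function name `learn_mapping` is hypothetical.

```python
import numpy as np

def learn_mapping(source_anchors, target_anchors):
    """Least-squares W minimizing ||source_anchors @ W - target_anchors||^2.

    Each row of source_anchors is the vector of an anchor word in the
    source language; the same row of target_anchors is its translation.
    """
    W, *_ = np.linalg.lstsq(source_anchors, target_anchors, rcond=None)
    return W

# Toy data, invented for illustration: the target space is the source
# space rotated by 30 degrees, so a linear map can align them exactly.
rng = np.random.default_rng(0)
source = rng.normal(size=(20, 2))
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
target = source @ rotation

W = learn_mapping(source[:10], target[:10])   # fit on 10 anchor pairs
# Check the mapping on the 10 words that were not used as anchors.
held_out_error = np.abs(source[10:] @ W - target[10:]).max()
```

Here the mapping learned from 10 anchors also aligns the 10 held-out words, which is the property that makes a small seed dictionary useful. Real embedding spaces are only approximately related by a linear map, so the held-out error would not be near zero.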
• Then, we can use the learned bilingual spaces to
find translations
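Once a mapping is available, finding a translation can be sketched as a nearest-neighbour lookup: map the source word into the target space and return the closest target word. Everything below is a toy assumption: the two-dimensional vectors, the small vocabularies, and the matrix `W`, which stands in for a mapping learned from anchors.

```python
import numpy as np

def translate(word, source_vecs, target_vecs, W):
    """Map a source word into the target space and return the target
    word with the highest cosine similarity."""
    mapped = source_vecs[word] @ W

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    return max(target_vecs, key=lambda t: cosine(mapped, target_vecs[t]))

# Toy 2-D vectors and mapping, invented for illustration only.
english = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
spanish = {"gato": np.array([0.0, 1.0]), "perro": np.array([-1.0, 0.0]),
           "casa": np.array([-0.7, -0.7])}
W = np.array([[0.0, 1.0],
              [-1.0, 0.0]])   # assumed to have been learned from anchors
```

With this toy setup, `translate("cat", english, spanish, W)` returns `"gato"` and `translate("dog", english, spanish, W)` returns `"perro"`.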
(Figure: per-language results, on a 0% to 100% axis, for BPR_W, BPR_W+C, BPR_LINEAR, BPR_NN, and BPR_WE; languages: so, uz, vi, ne, gu, ta, te, az, bn, lv, hi, cy, hu, bs, sq, uk, sk, id, sv, tr, nl, bg, it, fr, sr, ro, es)
• We can also use images to find translations in
different languages
• 100 images per word
• 35,000,000 total images
• 20TB of data
• Hosted by Amazon Public Datasets: http://multilingual-images.org/
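One possible way to use images for translation, sketched under strong assumptions: each word is represented by feature vectors of its images, and two words are scored by how well their image sets match. The scoring rule below (average of each image's best cosine match) is one common choice, not necessarily the one used for this dataset, and the "image features" are random toy vectors rather than the output of a real image model.

```python
import numpy as np

def avg_max_similarity(images_a, images_b):
    """For each image of word A, take its best cosine match among the
    images of word B, then average those best matches."""
    a = images_a / np.linalg.norm(images_a, axis=1, keepdims=True)
    b = images_b / np.linalg.norm(images_b, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).mean())

def translate_by_images(source_images, candidates):
    """Pick the candidate word whose images look most like the source word's."""
    return max(candidates,
               key=lambda w: avg_max_similarity(source_images, candidates[w]))

# Toy "image features", invented for illustration; in practice these
# would come from an image model applied to the images for each word.
rng = np.random.default_rng(0)
cat_like = np.array([1.0, 0.0, 0.0])
house_like = np.array([0.0, 1.0, 0.0])
kucing = cat_like + 0.05 * rng.normal(size=(5, 3))   # Indonesian word
candidates = {
    "cat": cat_like + 0.05 * rng.normal(size=(5, 3)),
    "house": house_like + 0.05 * rng.normal(size=(5, 3)),
}
```

In this toy setup, `translate_by_images(kucing, candidates)` selects `"cat"`, because the images for the two words share the same underlying features.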
Summary
Media Framing
• Gun Violence Frame Corpus (GVFC)
• Frame: Race/Ethnicity
• Codebook and dataset are publicly available:
https://derrywijaya.github.io/GVFC.html
• wijaya@bu.edu