Lecture1 PreprocessingAndResources

Lecture 1 – NLP Preprocessing and Resources
 
Chloé Clavel
Institut Mines-Télécom
Introduction
NLP tasks
!2 23/04/2019 Institut Mines-Télécom Modèle de présentation Télécom ParisTech

2 kinds of tasks:
■ Classify documents by themes, opinions etc...
• Supervised learning
─ Ex : SVM (support vector machines), Naive Bayes, … ?
• Unsupervised learning
─ Ex: Clustering

2 kinds of tasks:
■ Detect particular expressions

─ Ex: Named Entities
From http://www.tal.univ-paris3.fr/plurital/travaux-2009-2010/bao-2009-2010/MarjorieSeizou-AxelCourt/webservices.html

Classification
■ Phase 1 – learning
• Training corpus = set of labelled documents
─ Manual labeling : each document is assigned to a class :
• Ex1. Movie reviews: the score attributed by a user (1 to 5)
• Ex2. the topic of the document (sport, politics)
• Goal : Learn from this corpus the specific features of
each class

Phase 1 – learning
■ Learning the classes
• Pipeline: Document 1, Document 2, … -> NL Pre-processing -> Convert documents into a Matrix -> Learn the models corresponding to each class
• Example of pre-processing output: This/PN movie/N is/VB really/RB good/JJ.
• The same text as a raw character sequence:
|’T’|’h’|’i’|’s’|’ ‘|’m’|’o’|’v’|’i’|’e’|’ ‘|’i’|’s’|’ ‘|’r’|’e’|’a’|’l’|’l’|’y’|’ ‘|’g’|’o’|’o’|’d’|’.’|

Phase 2 – classification
■ Phase 2 – classification
• Using the learned features, the system is able to assign
a class to a new document
• Pipeline: New Document -> NL Pre-processing -> Convert document into a Vector -> Use the models previously learned to predict the class

Objective of the lecture
■ Focus on Natural Language pre-processing

• for NLP tasks
■ Get familiar with:
• classical linguistic issues :
─ semantic disambiguation, morphology, syntactic analysis,
etc.
• existing linguistic resources

Example of pre-processing steps for textual
data analysis
Ex: "I saw a bat."
1. Segmentation / Tokenizing: words and sentences
• Ex: I/saw/a/bat/./
2. Lexical processing: determine the lexical information associated with
each word in isolation (morphological rules and dictionary)
• Ex: bat : a noun referring to a flying mammal or a wooden club
3. Syntactic parsing: disambiguate according to the syntactic context, extract the
grammatical relations that words and groups of words maintain between them
• Ex: bat : object of the verb saw
4. Semantic parsing: word-sense disambiguation based on the context
• Ex: bat : a flying mammal

Motivations for pre-processing
■ Speech synthesis
• Syntactic analysis
─ to define the pronunciation
• couvent (verb: sit on eggs) or couvent (noun: convent)
─ to handle the prosody of the voice
• Define where to put silent pauses
Tests under Acapella : les poules couvent au couvent.

■ Pre-processing for the building of syntactic patterns for information extraction
• Ex : Patterns which refer to
─ syntactic categories (ex: #PREP_DE, #NEG)
(manque|~negation-patt|
(il/#NEG/y/avoir/~negation-patt))/(#PREP_DE)?/ (conseil|contact|~services-lex)*
Examples of patterns will be given in Lecture « Opinion Analysis »

■ Pre-processing for text classification

• Reduce the representation space
─ Group inflected forms of words around lemmas
• (ex: infinitive for a verb, masculine singular for a noun)
Details later in this lecture

Tokenization

Tokenization
■ Given a character sequence, tokenization is the task of chopping it up into pieces, called tokens
• Ex: "I saw a bat." -> I/saw/a/bat/./

Different options :
■ Option 1: keep all the tokens without distinction (punctuation included)

■ Output : (I, saw, a, bat, .)

Different options :
■ Option 2: consider that token = word/term

■ throw away certain characters (such as punctuation)
■ Output : (I, saw, a, bat)

■ throw away words coming from a list of stop words (common words
which would appear to be of little value for the NLP task)
■ Output : (saw, bat)
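The options above can be sketched in a few lines of Python. This is a minimal illustration using regular expressions from the standard library; the stop-word list `STOP_WORDS` and the function names are assumptions made up for this example, not part of the lecture.

```python
import re

# Hypothetical stop-word list, chosen only for this example.
STOP_WORDS = {"i", "a", "the", "to", "and"}

def tokenize_all(text):
    """Option 1: keep every token, punctuation included."""
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_words(text):
    """Option 2: keep only word tokens (punctuation is thrown away)."""
    return re.findall(r"\w+", text)

def remove_stop_words(tokens):
    """Option 2, continued: also drop tokens found in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

sentence = "I saw a bat."
print(tokenize_all(sentence))                       # ['I', 'saw', 'a', 'bat', '.']
print(tokenize_words(sentence))                     # ['I', 'saw', 'a', 'bat']
print(remove_stop_words(tokenize_words(sentence)))  # ['saw', 'bat']
```

Which option is appropriate depends on the NLP task: punctuation and stop words may carry useful information for some tasks.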

Tokenization
■ Methods for tokenization

• Simple tokenization rule : chop on whitespace and throw away
punctuation characters
─ Ex: "I saw a bat." -> (I, saw, a, bat)
• Tricky cases for this rule:
─ 04.02.1985: I left San Francisco and went to Viêt-Nam.
─ You weren’t born.

■ Tricky cases
─ Markers: dash (« - »), comma (« , »), tab characters
─ White space in « San Francisco »
─ End of sentence detection (find the dot (« . »)) : beware of
acronyms (E.N.S.T.), numbers (3.14), and dates (02.05.2018)
─ Uses of the apostrophe for possession and contractions
Tests under Acapella :
Nous sommes le 02.05.2018. Il y a quelques années le nom de l’école
était l’ENST ou mieux l’E.N.S.T.
(English: "Today is 02.05.2018. A few years ago the name of the school was ENST, or better, E.N.S.T.")

Tokenization
■ More involved methods for word segmentation :

• Heuristic-based :
─ Use of a large vocabulary
─ Take the longest vocabulary match
• Ex : I went to San Francisco -> (I, went, to, San Francisco)
─ Use some heuristics for unknown words
Such methods sometimes make mistakes, so you are never
guaranteed a consistent, unique tokenization.
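The longest-vocabulary-match heuristic can be sketched as a greedy left-to-right scan. This is a minimal illustration, not a production segmenter: the vocabulary `vocab`, the `max_words` limit and the fallback for unknown words (accept any single word) are assumptions made for this example.

```python
def longest_match_tokenize(text, vocabulary, max_words=3):
    """Greedy left-to-right segmentation: at each position, take the longest
    run of whitespace-separated words that is a vocabulary entry."""
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # Heuristic for unknown words: a single word is always accepted.
            if candidate in vocabulary or n == 1:
                tokens.append(candidate)
                i += n
                break
    return tokens

# Toy vocabulary containing one multi-word entry.
vocab = {"I", "went", "to", "San Francisco"}
print(longest_match_tokenize("I went to San Francisco", vocab))
# ['I', 'went', 'to', 'San Francisco']
```

Because the scan is greedy, an early wrong match is never revisited, which is one way such methods end up making mistakes.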

Tokenization
■ More involved methods for word segmentation :

• Machine learning sequence models
─ trained over hand-segmented words
─ Ex: hidden Markov models, Conditional Random Fields,
recurrent neural networks
• see Lecture Hidden Markov Models - Laurence Likforman
• see Lecture Recurrent neural networks – Geoffroy Peters
Such methods sometimes make mistakes, so you are never
guaranteed a consistent, unique tokenization.

Syntactic analysis – Part of
Speech tagging and chunking

Part of Speech tagging
■ Task: assign a grammatical category to each token

Chunking task
■ Ex: [John] [talked] [to the children] [about drugs]
■ Detection of syntactic components :
• Noun phrase (groupe nominal), verb phrase (noyau verbal), etc.
• Boundary detection
• Phrase labelling

Dependency parsing task

Methods and challenges
■ 2 types of methods
• Based on linguistic expertise

• Machine learning:
─ From a BIG labelled corpus/database
─ Learn the probabilities for the different transitions between
syntactic categories

Modelling linguistic expertise for POS
tagging/chunking/dependency parsing
■ Example :
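• DET/PRO V -> PRO V (a word that is ambiguous between determiner and pronoun is tagged as a pronoun when a verb follows)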
• NP (Noun Phrase) : DET ADJ* NN ADJ*
■ Strengths :
• Readable rules,
• Errors are easier to understand
■ Weaknesses
• Not robust to noisy inputs and out of vocabulary words
• Time-consuming
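As an illustration of a hand-written rule, the noun-phrase pattern NP : DET ADJ* NN ADJ* can be applied to a sequence of already tagged tokens. The sketch below is a minimal example; the function name and the toy tagged sentence are assumptions, and real rule-based chunkers use many more rules.

```python
def chunk_noun_phrases(tagged):
    """Return the noun phrases matching the rule NP : DET ADJ* NN ADJ*.
    tagged: list of (word, tag) pairs."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":                        # DET
            j += 1
            while j < n and tagged[j][1] == "ADJ":       # ADJ*
                j += 1
            if j < n and tagged[j][1] == "NN":           # NN
                j += 1
                while j < n and tagged[j][1] == "ADJ":   # ADJ*
                    j += 1
                chunks.append(" ".join(w for w, _ in tagged[i:j]))
                i = j
                continue
        i += 1
    return chunks

tagged = [("the", "DET"), ("little", "ADJ"), ("cat", "NN"),
          ("saw", "V"), ("a", "DET"), ("bat", "NN")]
print(chunk_noun_phrases(tagged))  # ['the little cat', 'a bat']
```

The rule is readable and its errors are easy to trace, but it silently fails as soon as a tag is wrong or a word is unknown, which illustrates the weaknesses listed above.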

Machine learning for POS tagging
■ Problem formulation for Hidden Markov Models
• Labelled corpus :
─ sequences of pairs (word, syntactic category)
• Graphical model
• Training :
─ learn the probabilities for the different transitions between
the nodes of the graph (e.g. between syntactic categories)
See Lecture Hidden Markov Models - Laurence Likforman

N: number of distinct observations (vocabulary size)

C: number of grammatical categories
• Training : Learning the model λ = (A, B, Π) from a labeled corpus

• A : CxC state transition matrix
o e.g. probability to have a VERB after a DET
• B : NxC matrix of the probabilities of the observation i in state j
o e.g. probability to generate « like » if the state is a verb
• Distribution Π of the initial state : vector of length C
o e.g. probability to begin with a DET
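One simple way to learn λ = (A, B, Π) from a labelled corpus is to count and normalize (relative frequencies). The sketch below assumes this estimator, without smoothing; the function name `train_hmm` and the two-sentence toy corpus are invented for the example. For convenience, B is stored per state (a dictionary from state to word probabilities) rather than as an N×C array.

```python
from collections import Counter, defaultdict

def train_hmm(corpus):
    """Estimate (A, B, Pi) by relative frequencies.
    corpus: list of non-empty sentences, each a list of (word, tag) pairs."""
    trans = defaultdict(Counter)   # trans[prev_tag][tag] -> counts for A
    emit = defaultdict(Counter)    # emit[tag][word]      -> counts for B
    init = Counter()               # init[tag]            -> counts for Pi
    for sentence in corpus:
        tags = [t for _, t in sentence]
        init[tags[0]] += 1
        for word, tag in sentence:
            emit[tag][word] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[prev][cur] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    A = {s: normalize(c) for s, c in trans.items()}
    B = {s: normalize(c) for s, c in emit.items()}
    Pi = normalize(init)
    return A, B, Pi

corpus = [
    [("the", "DET"), ("dogs", "NOUN"), ("like", "VERB"), ("the", "DET"), ("cat", "NOUN")],
    [("I", "PRO"), ("like", "VERB"), ("the", "DET"), ("dogs", "NOUN")],
]
A, B, Pi = train_hmm(corpus)
print(A["DET"]["NOUN"])   # probability of a NOUN after a DET: 1.0
print(B["VERB"]["like"])  # probability of generating "like" in state VERB: 1.0
print(Pi["DET"])          # probability of beginning with a DET: 0.5
```

With such a small corpus most probabilities are 0 or 1; in practice a large labelled corpus and some form of smoothing are needed.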

• Training :
─ Learning the model λ = (A, B, Π) from a labeled corpus
• Decision:
─ find the sequence of labels E that is most probable under the model for the
sequence of words M
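The decision step can be written as a maximisation. The notation below is a sketch: E is the sequence of labels and M the sequence of words, as above, while the hat on E is an added convention for the selected sequence. The second equality follows from Bayes' rule, since P(M) does not depend on E.

```latex
\hat{E} = \arg\max_{E} P(E \mid M) = \arg\max_{E} P(M \mid E)\, P(E)
```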

• Simplifying hypotheses:
─ Markov assumption : first-order Markov chain (the
probability of a particular state depends only on the
previous state)
─ Conditionally on the labels, words are independent (each word depends only on its own label)
See Lecture 6 Hidden Markov Models - Laurence Likforman
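Under these two hypotheses, the best label sequence can be found by dynamic programming with the Viterbi algorithm. The sketch below is a minimal version: the toy probabilities are invented for the example, missing entries receive a small `floor` value (an assumption standing in for proper smoothing), and plain products are used instead of log-probabilities, which is only safe for short sentences.

```python
def viterbi(words, states, A, B, Pi, floor=1e-12):
    """Most probable state sequence for `words` under the HMM (A, B, Pi).
    A[s][t]: transition s -> t; B[s][w]: probability of word w in state s;
    Pi[s]: initial probability. Missing entries get the value `floor`."""
    # best[t][s]: probability of the best path ending in state s at position t
    best = [{s: Pi.get(s, floor) * B[s].get(words[0], floor) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * A[p].get(s, floor))
            best[t][s] = best[t - 1][prev] * A[prev].get(s, floor) * B[s].get(words[t], floor)
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy model with invented probabilities.
states = ["DET", "NOUN", "VERB"]
Pi = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
A = {"DET": {"NOUN": 0.9, "VERB": 0.1},
     "NOUN": {"VERB": 0.6, "NOUN": 0.3, "DET": 0.1},
     "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}}
B = {"DET": {"the": 0.9, "a": 0.1},
     "NOUN": {"dogs": 0.5, "like": 0.1, "bat": 0.4},
     "VERB": {"like": 0.7, "saw": 0.3}}

print(viterbi(["the", "dogs", "like", "the", "bat"], states, A, B, Pi))
# ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']
```

Here « like » could be a NOUN or a VERB according to B; the transition probabilities make VERB the better choice after a NOUN.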

Conditional Random Fields
■ Generalization of Hidden Markov models

• Markovian graphical model :
• Undirected
• Each Yi is linked to all the Xj in the graph
• Non-generative model : a model of the conditional probability of the
target Y, given an observation x
• P(y|x) depends on
─ the structure of the Y graph
─ potential functions that depend on
• feature functions given by the linguistic expert
• weights of these feature functions in the model
■ Feature functions
• To integrate more or less linguistic knowledge
• The model identifies the knowledge that plays a role in the labelling
task
■ Linear CRF
• Y graph is a first order linear chain
• Each Yi is still linked to all the Xj in the graph
• Non-generative model : a model of the conditional probability of the
target Y, given an observation x,
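For reference, the usual textbook form of the linear-chain CRF is given below. The notation is an assumption, not taken from the slides: f_k are the feature functions, w_k their weights, T the length of the sequence, and Z(x) the normalisation term summing over all possible label sequences y'.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Each feature function sees the whole observation x, which reflects the fact that each Yi is linked to all the Xj, while only consecutive labels interact (first-order linear chain).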
Syntactic analysis - challenges
■ Challenges
• Disambiguation : « La petite brise la glace » (either "the little girl breaks the ice" or "the light breeze chills her")
• Robustness : handling mistakes and typos

Filtering words
To reduce vocabulary size for NLP tasks

Reduce vocabulary size
■ According to the NLP task, filtering …

• Punctuation (??,!!, .)
─ NB: useful for opinion mining
• Dates
─ NB: useful for Named Entity Recognition
• stop words using a predefined list (e.g. linking words)
─ NB: linking words are useful for argument mining
• Hapax
─ Marginal terms (occurring once or twice) in the corpus
• often correspond to misspelled words
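A minimal sketch of such filtering on already tokenized documents is given below. The function name, the `min_count` threshold and the toy documents are assumptions made for the example; whether to filter at all depends on the task, as noted above.

```python
from collections import Counter

def filter_vocabulary(documents, stop_words, min_count=2):
    """Drop stop words and rare terms (fewer than `min_count` occurrences in the corpus)."""
    counts = Counter(tok for doc in documents for tok in doc)
    keep = {t for t, c in counts.items() if c >= min_count and t not in stop_words}
    return [[t for t in doc if t in keep] for doc in documents]

docs = [
    ["the", "movie", "is", "good"],
    ["the", "movie", "is", "realy", "good"],   # "realy" is a misspelling seen only once
]
print(filter_vocabulary(docs, stop_words={"the", "is"}))
# [['movie', 'good'], ['movie', 'good']]
```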

■ Gather inflectional forms and derivationally related forms
• Inflectional forms : a change in or addition to the form of
a word that shows a change in the way it is used in
sentences
• Morphological derivation : the process of forming a
new word from an existing word, often by adding a
prefix or suffix, such as -ness or un-
PRACTICE :
ENGLISH :
propose inflectional forms of « dog », « sit »
propose derivational forms of « happy »
FRENCH : give the inflected forms of the verb « jouer »

■ Gather inflectional forms and derivationally related forms of a word around …
• Their stems -> stemming
─ « cherchons » -> « cherch »
• Their lemmas -> lemmatization
• am, are, is => be
• car, cars, car's, cars' => car
■ Stemming and lemmatization are based on
• a morphological analysis of the words
■ What is morphological analysis?
─ An analysis of word internal structure
─ Morpheme : minimal linguistic unit carrying a sense (abstract unit)
─ Morphological processes : inflection, declension, conjugation, derivation (anti-constitu-tionn-elle-ment)


• Stemming :
─ a crude heuristic process that chops off the ends of words
(removal of derivational affixes)
─ How? Ex: Porter’s algorithm based on morphological rules
[Porter, 1980]
PRACTICE:
What is the stem of the word « frontal » in French?

─ Example of stemmer outputs

“the boy's cars are different colors” => “the boy car be differ color”
■ Lemmatization :
• use of a vocabulary and morphological analysis of words,

• to return the lemma:
─ the base or dictionary form of a word:
• am, are, is => be
• car, cars, car's, cars' => car
• NB: syntactic analysis can help to disambiguate some cases

• Ex: « les poules du couvent couvent » ("the hens of the convent are brooding")
o couvent -> couvent (noun) or couver (verb)
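The difference between the two approaches can be illustrated with a toy example. The suffix list and the lemma dictionary below are invented for this sketch: the stemmer is a crude suffix stripper, not Porter's algorithm, and the lemmatizer is a plain dictionary lookup keyed by the word form and its part of speech.

```python
SUFFIXES = ("ent", "ing", "ed", "es", "s")   # toy list, tried in this order

def toy_stem(word, min_stem=3):
    """Chop off the first matching suffix, provided at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary keyed by (word form, part of speech).
LEMMAS = {
    ("am", "VERB"): "be", ("are", "VERB"): "be", ("is", "VERB"): "be",
    ("cars", "NOUN"): "car",
    ("couvent", "NOUN"): "couvent", ("couvent", "VERB"): "couver",
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself if unknown."""
    return LEMMAS.get((word, pos), word)

print([toy_stem(w) for w in ["cars", "are", "different", "colors"]])
# ['car', 'are', 'differ', 'color']
print(toy_lemmatize("are", "VERB"))       # be
print(toy_lemmatize("couvent", "VERB"))   # couver
print(toy_lemmatize("couvent", "NOUN"))   # couvent
```

The stemmer needs no vocabulary but cannot map « are » to « be »; the lemmatizer can, but it needs a dictionary and, for ambiguous forms such as « couvent », the part of speech.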
PRACTICE:
What are the stem and the lemma of the word « saw » in English?

■ Lemmatization – existing tools
• For French
─ Treetagger
─ Xerox, Brill [Brill, 1995]
─ LIA_Tag, macaon http://macaon.lif.univ-mrs.fr/index.php?page=home-en
• For English:
─ NLTK : http://www.nltk.org/
─ Treetagger
[Figure: example of tagger output with Lemma and Part of Speech columns]

■ Stemming vs. Lemmatization

• What is the best choice?
─ It depends on the language
• Ex: stemming works well in German

Resources

Resources
■ WordNet : lexical database

• Retrieve information on word meaning/sense
• Core idea :
─ A word can have several meanings (ex: « bat »)
─ WordNet groups English words into synsets
─ Synsets : sets of synonyms
PRACTICE :
Let’s search for the word « estimable » on the WordNet website for English
http://wordnetweb.princeton.edu/perl/webwn
Q1 : how many senses exist for this word?

Q2 : what is the size of the synset of estimable#2?

Resources

─ Synsets : set of synonyms
• Synonyms : words that are interchangeable in some context
without changing the truth value of the proposition
• Synsets include simplex words as well as collocations like
"eat out" and "car pool."
• The meaning of a synset is further clarified with a short
definition and one or more usage examples
Example :
good, right, ripe – (most suitable or right for a particular
purpose; "a good time to plant tomatoes"; "the right time to
act"; "the time is ripe for great sociological changes")

Resources

─ All synsets are connected to other synsets by means of
semantic relations:
• Ex: canine is a hypernym of dog
• window is a meronym of building
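The organisation into synsets and semantic relations can be pictured with a small hand-written data structure. Everything below is toy data written for this sketch: the identifiers (in the word#number style used above) and the contents are assumptions, not actual WordNet entries, and the code does not query WordNet.

```python
# Toy synsets: identifier -> list of synonymous words.
SYNSETS = {
    "bat#1": ["bat"],                      # sense: flying mammal
    "bat#2": ["bat"],                      # sense: wooden club
    "good#1": ["good", "right", "ripe"],   # sense: most suitable for a particular purpose
    "dog#1": ["dog"],
    "canine#1": ["canine"],
    "window#1": ["window"],
    "building#1": ["building"],
}
HYPERNYM = {"dog#1": "canine#1"}       # canine is a hypernym of dog
PART_OF = {"window#1": "building#1"}   # window is a meronym of building

def senses(word):
    """All synsets that contain `word` (one per meaning)."""
    return sorted(s for s, words in SYNSETS.items() if word in words)

def hypernym_chain(synset):
    """Follow the hypernym relation upwards from `synset`."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain

print(senses("bat"))             # ['bat#1', 'bat#2']
print(len(SYNSETS["good#1"]))    # size of the synset: 3
print(hypernym_chain("dog#1"))   # ['canine#1']
print(PART_OF["window#1"])       # building#1
```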
PRACTICE
On the WordNet website, click on the S to see the semantic relations
French version : Wordnet Libre du Français (WOLF) : http://alpage.inria.fr/~sagot/wolf.html

References
■ https://nlp.stanford.edu/IR-book
■ [PORTER, 1980]
• M.F. Porter, 1980, An algorithm for suffix
stripping, Program, 14(3) pp 130−137.
