Professional Documents
Culture Documents
resources
Chloé Clavel
Institut Mines-Télécom
Introduction
NLP tasks
• Supervised learning
─ Ex : SVM (support vector machines), Naive Bayes, … ?
• Unsupervised learning
─ Ex: Clustering
From http://www.tal.univ-paris3.fr/plurital/travaux-2009-2010/bao-2009-2010/MarjorieSeizou-AxelCourt/webservices.html
■ Phase 1 – learning
• Training corpus = set of labelled documents
─ Manual labeling : each document is assigned to a class :
• Ex1. Movie reviews: the score attributed by a user (1 to 5)
• Ex2. the topic of the document (sport, politics)
• Goal : Learn from this corpus the specific features of
each class
Convert
Document 1 NL Pre-processing documents into a
Matrix
Document 2
■ Phase 2 – classification
• Using the learned features, the system is able to assign
a class to a new document
Convert document
New into a Vector
Document NL Pre-processing
■ Speech synthesis
• Syntactic analysis
─ to define the pronunication
• Couvent (sit on eggs) ou Couvent (convent)
─ to handle the prosody of the voice
• Define where to put silent pauses
(manque|~negation-patt|
(il/#NEG/y/avoir/~negation-patt))/(#PREP_DE)?/ (conseil|contact|~services-lex)*
Different options :
Different options :
I saw a bat.
■ Tricky cases
─ Markers: dash (« - »), coma (« , »), tabulations (« »),
─ White space in « San Francisco »
─ End of sentence detection (find the dot (« . »)) : beware of
acronyms E.N.S.T., numbers (3.14), and dates (02.05.2018)
─ uses of the apostrophe for possession and contractions
■ 2 types of methods
■ Example :
• DET/PRO V -> PRO V
• NP (Noun Phrase) : DET ADJ* NN ADJ*
■ Strengths :
• Readable rules,
• Errors are easier to understand
■ Weaknesses
• Not robust to noisy inputs and out of vocabulary words
• Time-consuming
• Labelled corpus :
─ sequences of pairs (word, syntactic category)
• Graphical model
• Training :
─ learn the probabilities for the different transitions between
the nodes of the graph (e.g. between syntactic categories)
• Training :
─ Learning the model λ = (A, B, П) from a labeled corpus
• Decision:
─ find the best sequence E that maximizes the model for the
sequence of words M
• Simplifying hypothesis:
─ Markov assumption : first-order markov chain (the
probability of a particular state depends only on the
previous state)
Analyse de
!31 données Institut Mines-Télécom Les données textuelles
sociales
Conditional Random Fields
■ Feature functions
• To integrate more or less linguistic knowledge
• The model identifies the knowledge that plays a role in the labelling
task
Analyse de
!32 données Institut Mines-Télécom Les données textuelles
sociales
Conditional Random Fields
■ Linear CRF
• Y graph is a first order linear chain
• Each Yi is still linked to all the Xj in the graph
• Non-generative model : a model of the conditional probability of the
target Y, given an observation x,
Analyse de
!33 données Institut Mines-Télécom Les données textuelles
sociales
Syntactic analysis - challenges
■ Challenges
• Disambiguation : « La petite brise la glace »
• Capable of handling mistakes and typos
PRACTICE :
ENGLISH :
propose inflectional forms of « dog », « sit »
propose derivational form of happy
FRENCH : donner les flexions du verbe « jouer »
• Stemming :
─ a crude heuristic process that chops off the ends of words
(removal of derivational affixes)
─ How? Ex: Porter’s algorithm based on morphological rules
[Porter, 1980]
PRACTICE:
What is the stem of the word « frontal » in French?
PRACTICE:
What are the stem and the lemma of the word « saw » in English?
■ Lemmatization – existing
tools
• For French
─ Treetagger
─ Xerox, Brill [Brill, 1995]
─ LIA_Tag, macaon http://
macaon.lif.univ-mrs.fr/
index.php?page=home-en
• For English:
─ NLTK : http://www.nltk.org/
─ Treetagger Lemma Part of Speech
PRACTICE :
Let’s search the word « estimable » on Wordnet website for English
http://wordnetweb.princeton.edu/perl/webwn
Example :
good, right, ripe – (most suitable or right for a particular
purpose; "a good time to plant tomatoes"; "the right time to
act"; "the time is ripe for great sociological changes")
PRACTICE
On wordnet
To see the semantic relation click on the S
■ https://nlp.stanford.edu/IR-book
■ [PORTER, 1980]
• M.F. Porter, 1980, An algorithm for suffix
stripping, Program, 14(3) pp 130−137.