Professional Documents
Culture Documents
• Information Extraction
• Machine Translation
• Phonetics
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse
NLP © Sakthi Balan M
Progress
• Parsing
• Paraphrase
• PoS - 97%
• Machine Translation • Summarisation
• NER - 97%
• Information Extraction • Dialogue
NLP © Sakthi Balan M
Crash Blossoms and Garden Path Sentences
• “I went to bank”
• segmentation problems
• idioms
• neologisms
• world knowledge
• Automaton
• FSM:
• Empiricism:
• Linguistics Data Consortium (LDC) — large amounts of spoken and written materials
available
• At last works of Brown et al (1990), Och and Ney (2003) [Machine translation] and Biel
(2003) [Topic modeling] showed that we can even work with unannotated text data
NLP © Sakthi Balan M
Regular Expression
• [^A-Za-z][Th]he[^A-Za-z]
NLP © Sakthi Balan M
Basic Text Processing
• Tokenisation
• Cat, Cats, cat, cats, I’m, They’re, India’s capital, Ph.D and so on
• |V| = 13 million
• Morphemes are the small meaning full units that make words
• Operations: * E X E C U T I O N
• Insertion
1 delete 5 operations
• Deletion 3 substitution
1 insertion
8 operations
• Substitution
Levenshtein Distance
• Minimise the number of operations
• Named entity extraction and entity coreference — IBM and IBM Ltd
• Searching for a sequence of edits / paths from the starting string to the
final string
D(0,j) = j (Insert)
D(i,0) = i (delete)
• Every time we enter a new cell in the table we note down from
where we came from (minimum one)
I N T E * N T I O N
• Backtrace: O(m+n)