Session 11-12
Have you heard of Theodore Kaczynski? Heard of the infamous terrorist "Unabomber", No. 1 on the FBI's Most Wanted list?
[Figure: Word cloud of Unabomber's Manifesto]
[Figure: Text associations of Unabomber's Manifesto with Kaczynski's letters and thesis]
Natural Language Processing (NLP)
• Study of the interaction between computers and human languages
• NLP = Computer Science + AI + Computational Linguistics
• NLP is the science of using machine learning and artificial intelligence to facilitate interactions between computers and human (or natural) languages.
• In particular, it concerns how computers process and analyze large amounts of natural language data.
Why is NLP difficult to achieve?
• Human speech and text are full of phrases, connotations and hidden sentiments
• Humans speak more than 7,000 languages; although only 23% of them are spoken by half the world's population, comprehending them is still a huge challenge for machines
• For example, the Mandarin language has more than 50,000 characters, and Japanese combines three different writing systems
• All a computer understands is 0s and 1s
What's in Text Mining?
Some breakthroughs…
• Using machine learning and deep learning algorithms, computers have successfully created… CAN YOU GUESS????
• Shakespearean text!!!! (https://medium.com/towards-artificial-intelligence/ai-writes-shakespearean-plays-e0d5f30c16b2 - "Paul: This is insane guys!!")
• And an original Beatles song
Stemming
• Stemming is the process of reducing words of similar origin to one stem, for example "communication", "communicates" and "communicate".
• Stemming helps us increase accuracy in our mined text by removing suffixes and reducing words to their base forms.
• Some useful algorithms for stemming are Porter's algorithm and the Snowball stemmer.
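As a rough illustration of the idea, here is a toy suffix-stripping stemmer. This is a hand-made heuristic sketch, not Porter's actual algorithm; for real work you would use a library implementation such as NLTK's PorterStemmer or SnowballStemmer.

```python
# Toy suffix-stripping stemmer: a heuristic sketch, NOT Porter's algorithm.
SUFFIXES = ("ion", "ing", "es", "e", "s")

def simple_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["communication", "communicates", "communicate"]
print([simple_stem(w) for w in words])  # all three reduce to "communicat"
```

Note that the stem "communicat" is not a dictionary word: a stemmer only needs related forms to collapse to the same token, which is what lemmatization (next) improves on.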
Lemmatization
• Lemmatization involves resolving words to their dictionary form (lemma). To resolve a word to its lemma, a lemmatizer needs to know its part of speech. That requires extra computational-linguistics machinery such as a part-of-speech tagger, which allows it to make better resolutions.
• It is harder to create a lemmatizer for a new language than a stemming algorithm, because lemmatizers require much more knowledge about the structure of a language; it is a far more intensive process than setting up a heuristic stemming algorithm.
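The POS-dependence can be sketched with a tiny hand-made lookup table. Real lemmatizers (e.g. NLTK's WordNetLemmatizer) use full lexicons; the entries below are illustrative assumptions, but the mechanism is the same.

```python
# Minimal dictionary-based lemmatizer sketch: (word, POS) -> lemma.
LEMMAS = {
    ("saw", "VERB"): "see",    # "I saw a bird"
    ("saw", "NOUN"): "saw",    # "a rusty saw"
    ("better", "ADJ"): "good",
    ("mice", "NOUN"): "mouse",
}

def lemmatize(word, pos):
    """Look up the lemma for (word, POS); fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(lemmatize("saw", "VERB"))  # "see" - the POS tag changes the answer
print(lemmatize("saw", "NOUN"))  # "saw"
```

The same surface form maps to different lemmas depending on its part of speech, which is exactly why a lemmatizer needs a POS tagger upstream.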
Term Document Matrix
• After the cleaning process, we are left with independent terms that occur throughout the documents.
• These are stored in a matrix that records each term's occurrences. This matrix logs the number of times each term appears in our cleaned data set, hence the name term-document matrix.
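The construction above can be sketched with the standard library alone (real pipelines often use a vectorizer such as scikit-learn's CountVectorizer instead; the three toy documents are made up for illustration).

```python
# Build a term-document matrix: one row per term, one column per document,
# matrix[i][j] = count of vocab[i] in document j.
from collections import Counter

docs = ["the cat sat", "the cat ate", "the dog sat"]
tokenized = [d.split() for d in docs]
vocab = sorted(set(t for doc in tokenized for t in doc))

matrix = [[Counter(doc)[term] for doc in tokenized] for term in vocab]

for term, row in zip(vocab, matrix):
    print(f"{term:>4}: {row}")
```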
Word Cloud (visualization of the frequency of words appearing in a document)
POS Tagging
• Part-of-speech (POS) tagging finds the corresponding POS for each word.
• It supports word-sense disambiguation.
E.g.: "Dog likes bananas" vs. "dog gone bananas"
E.g.: "I think something fishy is going on" vs. "I am going to have some fish"
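A trivial lexicon-lookup sketch shows what a tag assignment looks like for the "fishy"/"fish" example. The tiny tag table below is a made-up assumption; real taggers (e.g. NLTK's averaged perceptron tagger, covered later) learn tags and their contexts from corpora.

```python
# Toy tag-lookup sketch: map each token to a (token, tag) pair,
# using Penn Treebank-style tags (JJ = adjective, NN = noun, VBG = gerund).
TAG_LEXICON = {"fishy": "JJ", "fish": "NN", "going": "VBG", "something": "NN"}

def tag(tokens):
    """Assign each token its lexicon tag, or UNK if unseen."""
    return [(t, TAG_LEXICON.get(t, "UNK")) for t in tokens]

print(tag("something fishy is going on".split()))
```

Note that "fishy" (adjective) and "fish" (noun) receive different tags, which is the information a downstream lemmatizer or parser needs.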
[Figure: a list of POS tags]
Bag of Words (BOW) Approach
• The bag-of-words model is a way of extracting features from text data when modeling text with machine learning algorithms.
• A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
  - A vocabulary of known words.
  - A measure of the presence of known words.
• It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
• The intuition is that documents are similar if they have similar content, and that from the content alone we can learn something about the meaning of the document.
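The two ingredients above (a vocabulary, plus a per-document measure of presence) can be sketched in a few lines of plain Python; the two sample documents are made up for illustration.

```python
# Bag-of-words sketch: a fixed vocabulary plus per-document word counts.
# Word order is discarded, exactly as described above.
from collections import Counter

docs = ["it was the best of times", "it was the worst of times"]
vocab = sorted(set(w for d in docs for w in d.split()))

def bow_vector(doc, vocab):
    """Count vector over the vocabulary for one document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(vocab)
for d in docs:
    print(bow_vector(d, vocab))
```

The two vectors differ only in the "best"/"worst" positions, matching the intuition that similar content yields similar vectors.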
Term Frequency
• TF is a measure of how frequently a term t appears in a document d. It can be represented by the formula:
  TF(t, d) = n(t, d) / (total number of terms in d)
• Here, in the numerator, n(t, d) is the number of times the term t appears in the document d. Thus, each document-term pair has its own TF value.
What is Inverse Document Frequency (IDF)?
• IDF measures how rare a term is across the whole corpus. A common formulation is IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t.
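The definitions can be computed directly on a toy corpus (made up for illustration). Note that libraries such as scikit-learn use smoothed variants of these formulas, so their numbers will differ slightly.

```python
# TF and IDF sketch:
#   TF(t, d)  = n(t, d) / (total terms in d)
#   IDF(t)    = log(N / df(t))
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "cat", "ate"], ["the", "dog", "sat"]]
N = len(docs)

def tf(term, doc):
    return Counter(doc)[term] / len(doc)

def idf(term):
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

print(tf("cat", docs[0]))               # 1/3
print(idf("the"))                       # log(3/3) = 0: appears everywhere
print(tf("cat", docs[0]) * idf("cat"))  # a tf-idf weight
```

A term appearing in every document gets IDF 0, so ubiquitous words like "the" are weighted down regardless of their TF.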
Corpus widget - for importing locally stored data
Text Preprocess widget - for cleaning the data
Functionalities of each section of the Preprocess widget
B. Tokenization - breaking the text into smaller components (words, sentences, bigrams).
• Word & Punctuation will split the text by words and keep punctuation symbols. This example. → (This), (example), (.)
• Whitespace will split the text by whitespace only. This example. → (This), (example.)
• Sentence will split the text by full stop, retaining only full sentences. This example. Another example. → (This example.), (Another example.)
• Regexp will split the text by the provided regex. By default it splits by words only (omits punctuation). Some useful regular expressions for quick filtering:
  - \bword\b: matches the exact word
  - \w+: matches only words, no punctuation
  - \b(B|b)\w+\b: matches words beginning with the letter b
  - \w{4,}: matches words of 4 or more characters
  - \b\w+(Y|y)\b: matches words ending with the letter y
• Tweet will split the text with a pre-trained Twitter model, which keeps hashtags, emoticons and other special symbols. This example. :-) #simple → (This), (example), (.), (:-)), (#simple)
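The filtering regexes listed above can be tried directly with Python's re module (the sample sentence is made up). One caveat: with re.findall, a capturing group like (B|b) returns only the captured letter, so the sketch uses the non-capturing form (?:B|b) to get whole words back.

```python
# Trying the quick-filtering regexes from the slide on a small sentence.
import re

text = "Bob and his berry bush stay busy"

print(re.findall(r"\w+", text))             # word-only tokenization
print(re.findall(r"\b(?:B|b)\w+\b", text))  # words beginning with b
print(re.findall(r"\w{4,}", text))          # words of 4 or more characters
```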
Functionalities of each section of the Preprocess widget (contd…)
• Stopwords removes stopwords from the text (e.g. removes 'and', 'or', 'in'…). Select the language to filter by; English is set as the default. You can also load your own list of stopwords provided in a simple *.txt file with one stopword per line.
• Click the 'browse' icon to select the file containing stopwords. If the file was loaded properly, its name will be displayed next to the pre-loaded stopwords. Change 'English' to 'None' if you wish to filter out only the provided stopwords. Click the 'reload' icon to reload the list of stopwords.
• Lexicon keeps only the words provided in the file. Load a *.txt file with one word per line to use as the lexicon. Click the 'reload' icon to reload the lexicon.
• Regexp removes words that match the regular expression. Default is set to remove
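Stopword removal itself is simple set filtering. The tiny stopword list below is a hand-made assumption for illustration; the widget ships full per-language lists, and NLTK's stopwords corpus is another common source.

```python
# Stopword filtering sketch: drop tokens found in a stopword set.
STOPWORDS = {"and", "or", "in", "the", "a", "of", "is"}

tokens = ["the", "cat", "and", "the", "dog", "play", "in", "the", "garden"]
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)  # the content words survive
```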
Document Frequency
• Document frequency keeps tokens that appear in not less than and not more than the specified number/percentage of documents.
• Absolute keeps only tokens that appear in the specified number of documents. E.g. DF = (3, 5) keeps only tokens that appear in 3 or more and 5 or fewer documents.
• Relative keeps only tokens that appear in the specified percentage of documents. E.g. DF = (0.3, 0.5) keeps only tokens that appear in 30% to 50% of documents.
• Most frequent tokens keeps only the specified number of most frequent tokens. Default is the 100 most frequent tokens.
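The Absolute mode can be sketched as follows (toy documents made up for illustration): count how many documents each token appears in, then keep tokens whose count falls in the [lo, hi] range.

```python
# Document-frequency filtering sketch (Absolute mode):
# keep tokens with lo <= df(token) <= hi.
docs = [["cat", "sat"], ["cat", "ate"], ["dog", "sat"], ["dog", "ran"], ["cat"]]

df = {}
for doc in docs:
    for token in set(doc):  # count each token once per document
        df[token] = df.get(token, 0) + 1

lo, hi = 2, 3
kept = sorted(t for t, n in df.items() if lo <= n <= hi)
print(kept)  # very rare tokens ("ate", "ran") are dropped
```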
N-grams and POS Tagger
• N-grams Range creates n-grams from tokens. The numbers specify the range of n; the default returns one-grams and two-grams.
• POS Tagger runs part-of-speech tagging on the tokens.
  - Averaged Perceptron Tagger runs POS tagging with Matthew Honnibal's averaged perceptron tagger.
  - Treebank POS Tagger (MaxEnt) runs POS tagging with a trained Penn Treebank model.
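N-gram generation over a token list is a short sliding-window loop; the sketch below mirrors the default range (one-grams and two-grams).

```python
# N-gram sketch: all contiguous token windows of length n_min..n_max.
def ngrams(tokens, n_min=1, n_max=2):
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(tuple(tokens[i : i + n]))
    return out

print(ngrams(["natural", "language", "processing"]))
```

For the three tokens above this yields three one-grams plus the two bigrams ("natural", "language") and ("language", "processing").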
Word Cloud widget - for data visualization
[Figure: Bag of words]
[Figure: Text classification workflow]
Thank You