
TEXT MINING

Session 11-12
Delivered by

Dr. Pratyush Banerjee


Session Objectives
• An overview of Text Mining
• Basics of the Text Mining Process
• Overview of Orange as a text mining tool
• Hands-on exercises with text data
Have you heard of Theodore Kaczynski? Have you heard of the infamous terrorist Unabomber?
• Faculty of Mathematics at the University of California, Berkeley
• No. 1 on the FBI’s most wanted list
[Figure: Word cloud of Unabomber’s Manifesto]
[Figure: Text associations of Unabomber’s Manifesto with Kaczynski’s letters and thesis]
Natural Language Processing (NLP)
• Study of interaction between computers and human languages
• NLP = Computer Science + AI + Computational Linguistics
• NLP is the science of using machine learning and artificial intelligence to
facilitate interactions between computers and human (or natural)
languages.
• In particular, it concerns how computers process and analyze large
amounts of natural language data.
Why is NLP difficult to achieve?
• Human speech and text are full of phrases, connotations and hidden sentiments
• We speak more than 7,000 languages; although just 23 of them are spoken by half the world’s population, comprehending them is still a huge challenge for machines
• For example, Mandarin has more than 50,000 characters, and Japanese has more than 100 characters and three different writing systems
• All computers understand is 0 and 1
What’s in Text Mining?

Some breakthroughs…
• Using machine learning and deep learning algorithms, computers have been found to successfully create… CAN YOU GUESS?
• Shakespearean plays!
  https://medium.com/towards-artificial-intelligence/ai-writes-shakespearean-plays-e0d5f30c16b2
• And an original Beatles song (“Paul: This is insane guys!!”)
• First ever song created by AI: https://www.youtube.com/watch?v=LSHZ_b05W7o


Basic Text Mining Process
Step 1: Text Preprocessing – tokenization, stemming / lemmatization and stop word removal
Step 2: Feature Generation – bag of words
Step 3: Feature Selection – simple counting, statistics
Step 4: Data Mining – classification (supervised) / clustering (unsupervised)
Step 5: Analyzing results
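A minimal end-to-end sketch of these five steps in Python, assuming scikit-learn and a made-up four-document corpus (both are assumptions; the slides don’t prescribe a tool):

```python
# Toy run of the five-step process with scikit-learn (assumed library).
from sklearn.feature_extraction.text import CountVectorizer  # Steps 1-3
from sklearn.naive_bayes import MultinomialNB                # Step 4

docs = ["great product, loved it", "terrible product, hated it",
        "loved the service", "hated the service"]
labels = [1, 0, 1, 0]  # made-up sentiment labels: 1 = positive, 0 = negative

# Steps 1-3: tokenize, lowercase, remove stop words, count into a bag of words
bow = CountVectorizer(lowercase=True, stop_words="english")
X = bow.fit_transform(docs)

# Step 4: supervised classification
clf = MultinomialNB().fit(X, labels)

# Step 5: analyze results on an unseen document
print(clf.predict(bow.transform(["loved this product"])))  # expect [1]
```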
Tokenization
• Tokenization is the process of breaking a text document apart into small pieces. In text analytics, tokens are most frequently words.
• A sentence of 10 words, then, would contain 10 tokens.
• Tokenizers can be of many types:
• Sentence tokenizers break text paragraphs down into individual sentences
• Word tokenizers break sentences down into individual words
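A quick illustration of both tokenizer types, using NLTK as one possible library (an assumption; the slides don’t name a tool):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (newer NLTK may need "punkt_tab")
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text mining is fun. Tokens are its building blocks."

print(sent_tokenize(text))  # sentence tokenizer -> 2 sentences
print(word_tokenize(text))  # word tokenizer -> individual words (and punctuation)
```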

Stemming
• Stemming is the process of reducing words of similar origin to one base form, for example “communication”, “communicates”, “communicate”.
• Stemming helps us increase accuracy in our mined text by removing suffixes and reducing words to their basic forms.
• Some useful algorithms for stemming are Porter’s algorithm and the Snowball stemmer.
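Both algorithms named above are available in NLTK; a small sketch using the example words from the slide:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # "Porter2", an improved Porter

for word in ["communication", "communicates", "communicate"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))
# all three collapse to one stem, which need not be a dictionary word
```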
Lemmatization
• Lemmatization involves resolving words to their dictionary form (lemma). For lemmatization to resolve a word to its lemma, it needs to know the word’s part of speech. That requires extra computational-linguistics machinery, such as a part-of-speech tagger, and it allows the lemmatizer to make better resolutions.
• It is harder to create a lemmatizer for a new language than a stemming algorithm: because lemmatizers require much more knowledge about the structure of a language, building one is a far more intensive process than setting up a heuristic stemming algorithm.
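A short illustration with NLTK’s WordNet Lemmatizer, showing why the part of speech matters (the words are illustrative, not from the slides):

```python
import nltk
nltk.download("wordnet", quiet=True)  # lexical database the lemmatizer relies on
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a part-of-speech hint the word is treated as a noun and left as-is;
# with pos="v" (verb) it resolves to the dictionary form "love".
print(lemmatizer.lemmatize("loved"))           # loved
print(lemmatizer.lemmatize("loved", pos="v"))  # love
print(lemmatizer.lemmatize("cats"))            # cat
```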
Term-Document Matrix
• After the cleaning process, we are left with the independent terms that exist throughout the document collection.
• These are stored in a matrix that records how many times each term appears in each document of our clean data set; hence it is called a term-document matrix.
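A minimal sketch that builds such a matrix with scikit-learn’s CountVectorizer (the three-document corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["the cat sat", "the cat ran", "the dog sat"]

vec = CountVectorizer()
tdm = vec.fit_transform(docs)  # rows = documents, columns = terms

# Show the occurrence count of each term in each document
print(pd.DataFrame(tdm.toarray(), columns=vec.get_feature_names_out()))
```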

Word Cloud
(Visualization of the frequency of words appearing in a document)
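One way to generate such a visualization is the third-party wordcloud package; a hedged sketch (the input file name is hypothetical):

```python
# Requires the third-party package: pip install wordcloud
from wordcloud import WordCloud

text = open("manifesto.txt").read()  # hypothetical input document

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("wordcloud.png")  # more frequent words are drawn larger
```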
POS Tagging
• Part-of-Speech (POS) tagging: find the corresponding POS for each word.
• Useful for word sense disambiguation.
E.g.: “Dog likes bananas” vs. “dog gone bananas”
E.g.: “I think something fishy is going on” vs. “I am going to have some fish”
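A small illustration with NLTK’s default POS tagger, run on the slide’s example sentences:

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # NLTK's default tagger
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("Dog likes bananas")))
print(pos_tag(word_tokenize("I am going to have some fish")))
# each word is paired with a Penn Treebank tag, e.g. ('likes', 'VBZ')
```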

[Figure: A list of POS tags]
Bag of Words (BOW) Approach
• The bag-of-words model is a way of extracting features from text data for modeling text with machine learning algorithms.
• A bag of words is a representation of text that describes the occurrence of words within a document. It involves two things:
  - A vocabulary of known words.
  - A measure of the presence of known words.
• It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
• The intuition is that documents are similar if they have similar content, and that from the content alone we can learn something about the meaning of the document.
Term Frequency (TF)
TF is a measure of how frequently a term, t, appears in a document, d. It can be represented by the formula:

TF(t, d) = n(t, d) / (total number of terms in d)

Here, in the numerator, n(t, d) is the number of times the term t appears in the document d. Thus, each document-term pair has its own TF value.
What is Inverse Document Frequency (IDF)?
• In simple terms, it is a measure of the rareness of a term.
• Term frequency measures commonness; IDF’s focus is to measure rareness. A common formulation is IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents that contain t.
• In the slide’s example, “Mobilegeddon” is a unique yet important word in the document.
What is TF-IDF?
Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is the product of the two measures above: TF-IDF(t, d) = TF(t, d) × IDF(t).
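A hedged sketch using scikit-learn’s TfidfVectorizer on a toy corpus (the corpus is an assumption for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = ["mobile update news", "mobile phones on sale",
        "mobilegeddon hits search rankings"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# "mobilegeddon" occurs in only one document, so its IDF is high and it
# gets a large weight there; "mobile", being common, is down-weighted.
print(pd.DataFrame(X.toarray().round(2), columns=vec.get_feature_names_out()))
```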
Bag of Words
How does Bag-of-Words analysis work?
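A minimal sketch of the two ingredients named earlier, a vocabulary of known words and a measure of their presence, in plain Python (the two sample documents are made up):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog ate my homework"]

# 1. The vocabulary of known words
vocabulary = sorted({word for doc in docs for word in doc.split()})

# 2. A measure of the presence of known words (word order is discarded)
for doc in docs:
    counts = Counter(doc.split())
    print([counts[word] for word in vocabulary])
```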


Conducting Text Analytics with Orange
• Orange has a text mining extension
• This extension can be downloaded from the add-on option
• The text mining add-on is very robust and can handle complex text analytics applications without the hassle of coding
Conducting Text Analytics in Orange: the Text Widget Tab
• This widget tab includes all the necessary applications of text mining
• We will be covering sentiment analysis with it

Corpus widget: for importing locally stored data

Text Preprocess widget: for cleaning the data

Functionalities of each section of the Preprocess widget

A. Transformation: transforms input data.
• Lowercase will turn all text to lowercase.
• Remove accents will remove all diacritics/accents in text. naïve → naive
• Parse html will detect html tags and parse out text only. <a href…>Some text</a> → Some text
• Remove urls will remove urls from text. This is a http://orange.biolab.si/ url. → This is a url.
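These four transformations can be approximated in a few lines of plain Python; this is an illustrative sketch, not Orange’s actual implementation:

```python
import re
import unicodedata

def transform(text):
    text = text.lower()                                    # Lowercase
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))        # Remove accents
    text = re.sub(r"<[^>]+>", "", text)                    # Parse html (strip tags)
    text = re.sub(r"https?://\S+", "", text)               # Remove urls
    return text

print(transform("Naïve <a href='x'>Some text</a> at http://orange.biolab.si/"))
```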
Functionalities of each section of the Preprocess widget (contd…)

B. Tokenization: breaking the text into smaller components (words, sentences, bigrams).
• Word & Punctuation will split the text by words and keep punctuation symbols. This example. → (This), (example), (.)
• Whitespace will split the text by whitespace only. This example. → (This), (example.)
• Sentence will split the text by full stop, retaining only full sentences. This example. Another example. → (This example.), (Another example.)
• Regexp will split the text by the provided regex. By default it splits by words only (omits punctuation). Some useful regular expressions for quick filtering (see the sketch after this list):
  - \bword\b: matches the exact word
  - \w+: matches only words, no punctuation
  - \b(B|b)\w+\b: matches words beginning with the letter b
  - \w{4,}: matches words at least 4 characters long
  - \b\w+(Y|y)\b: matches words ending with the letter y
• Tweet will split the text by a pre-trained Twitter model, which keeps hashtags, emoticons and other special symbols. This example. :-) #simple → (This), (example), (.), (:-)), (#simple)
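A quick check of some of the regular expressions above with Python’s re module (the sample sentence is adapted from the slide):

```python
import re

text = "This big example. :-) #simple"

print(re.findall(r"\w+", text))          # words only, punctuation omitted
print(re.findall(r"\b[Bb]\w+\b", text))  # words beginning with b -> ['big']
print(re.findall(r"\w{4,}", text))       # words of at least 4 characters
```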
Functionalities of each section of the Preprocess widget (contd…)

C. Normalization applies stemming and lemmatization to words. (E.g.: I’ve always loved cats. → I have alway love cat.)
For languages other than English, use the Snowball Stemmer (offers the languages available in its NLTK implementation) or the UDPipe Lemmatizer.
• Porter Stemmer applies the original Porter stemmer.
• Snowball Stemmer applies an improved version of the Porter stemmer (Porter2). Set the language for normalization; the default is English.
• WordNet Lemmatizer applies a network of cognitive synonyms to tokens, based on a large lexical database of English.
• UDPipe applies a pre-trained model for normalizing data.
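A short sketch of the Snowball stemmer on a non-English language via NLTK (Spanish is chosen arbitrarily for illustration):

```python
from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)  # languages offered by the NLTK implementation

stemmer = SnowballStemmer("spanish")
for word in ["corriendo", "corren", "correr"]:
    print(word, "->", stemmer.stem(word))
```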
Filtering Stop Words

• Stopwords removes stopwords from text (e.g. removes ‘and’, ‘or’, ‘in’…). Select the language to filter by; English is set as default. You can also load your own list of stopwords provided in a simple *.txt file with one stopword per line.
• Click the ‘browse’ icon to select the file containing stopwords. If the file was properly loaded, its name will be displayed next to the pre-loaded stopwords. Change ‘English’ to ‘None’ if you wish to filter out only the provided stopwords. Click the ‘reload’ icon to reload the list of stopwords.
• Lexicon keeps only the words provided in the file. Load a *.txt file with one word per line to use as the lexicon. Click the ‘reload’ icon to reload the lexicon.
• Regexp removes words that match the regular expression. The default is set to remove punctuation.
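Filtering with such a list in code, using NLTK’s bundled English stopwords as one possible source (an assumption; the widget’s own list may differ):

```python
import nltk
nltk.download("stopwords", quiet=True)  # NLTK's bundled stopword lists
from nltk.corpus import stopwords

tokens = ["this", "movie", "was", "not", "good", "or", "bad"]

stops = set(stopwords.words("english"))
print([t for t in tokens if t not in stops])  # ['movie', 'good', 'bad']
```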
Document Frequency
• Document frequency keeps tokens that appear in not less than and not more than the specified number / percentage of documents.
• Absolute keeps only tokens that appear in the specified number of documents. E.g. DF = (3, 5) keeps only tokens that appear in 3 or more and 5 or fewer documents.
• Relative keeps only tokens that appear in the specified percentage of documents. E.g. DF = (0.3, 0.5) keeps only tokens that appear in 30% to 50% of documents.
• Most frequent tokens keeps only the specified number of most frequent tokens. The default is the 100 most frequent tokens.
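scikit-learn’s min_df / max_df parameters implement an analogous document-frequency filter; a tiny sketch with a made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat sat", "cat ran", "dog ran", "bird flew"]

# Absolute filter: keep only tokens appearing in at least 2 documents
# (a float such as min_df=0.3 or max_df=0.5 would be a relative filter).
vec = CountVectorizer(min_df=2)
vec.fit(docs)
print(vec.get_feature_names_out())  # ['cat' 'ran']
```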
N-grams and POS Tagger
• N-grams Range creates n-grams from tokens. The numbers specify the range of n-grams. The default returns one-grams and two-grams.
• POS Tagger runs part-of-speech tagging on tokens.
  - Averaged Perceptron Tagger runs POS tagging with Matthew Honnibal’s averaged perceptron tagger.
  - Treebank POS Tagger (MaxEnt) runs POS tagging with a trained Penn Treebank model.
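A minimal n-gram illustration with NLTK’s ngrams helper (the token list is made up):

```python
from nltk.util import ngrams

tokens = ["text", "mining", "is", "fun"]

# Range (1, 2): emit one-grams and two-grams, mirroring the widget default
for n in (1, 2):
    print(list(ngrams(tokens, n)))
```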
Word Cloud widget: for data visualization

Bag of Words widget
Text Classification Workflow
Thank You
