Seminar 12:
Text Analytics
2/13/2020
Source:
https://www.datamation.com/big‐data/structured‐vs‐unstructured‐data.html
• Crawling
– Software programs crawl across web pages, visiting new and old websites. These programs are called crawlers, spiders, or bots.
– A crawler reads the websites and passes what it finds to the next step, indexing.
• Indexing
– Indexing is the next step; it allows quick lookup later.
– Google builds a large database of the web pages its crawlers found and generates a corresponding index, analogous to the index at the end of a book.
• Ranking
– With the web pages indexed, different algorithms are used to rank them by relevance. The signals include how many web pages link to a page, whether its content is high quality, and when it was last updated.
– Highly ranked web pages are displayed before lower-ranked ones.
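The three steps above can be sketched on a toy scale. This is only an illustration, not how any real search engine works: the pages in PAGES, the example URLs, and the single ranking signal (inbound link count) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical "crawled" pages: URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("text analytics with python", ["b.example", "c.example"]),
    "b.example": ("python text processing", ["c.example"]),
    "c.example": ("python tutorial", []),
}

def build_index(pages):
    """Indexing: map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, (text, _links) in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def count_inbound_links(pages):
    """Ranking signal: how many crawled pages link to each page."""
    inbound = {url: 0 for url in pages}
    for _url, (_text, links) in pages.items():
        for target in links:
            if target in inbound:
                inbound[target] += 1
    return inbound

def search(query, pages):
    """Return pages containing the query word, most-linked-to first."""
    index = build_index(pages)
    inbound = count_inbound_links(pages)
    hits = index.get(query.lower(), set())
    # Sort by inbound links (descending), then by URL for a stable order.
    return sorted(hits, key=lambda url: (-inbound[url], url))

results = search("python", PAGES)
print(results)
```

Here c.example is listed first because two other pages link to it. Real ranking combines many more signals, such as content quality and last update time.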
Natural Language
• In English, words are combined to form larger constituent units such as phrases, clauses, and sentences.
• A sentence is a structured collection of words that follows syntactic rules such as grammar.
Figure: words without structure versus a structured sentence following hierarchical syntax.
Picture src: Dipanjan Sarkar, Text Analytics with Python, 2016
NLP Frameworks
• Two components of NLP:
– Natural Language Understanding (NLU)
• Mapping natural-language input to useful representations
• Analyzing relevant aspects of the language
– Natural Language Generation (NLG)
• Producing meaningful text sequences in natural language
• Involves text and sentence construction
• Steps in NLP:
– Lexical analysis
– Syntactic analysis
– Semantic analysis
– Discourse integration
– Pragmatic analysis
Src: https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_natural_language_processing.htm
Using NLTK
If you run into problems importing nltk, run the NLTK downloader (nltk.download()) from the editor or IPython. Then select the ‘all’ option and click Download.
Text Normalization
• Normalization in text analytics refers to the process of cleaning, wrangling, and standardizing text into a recognizable form.
• Data cleaning:
– Removing unwanted special characters.
• import string
• string.punctuation # a string constant holding the ASCII punctuation characters; match characters against it and remove them
– Removing unwanted spacing.
• sentence.strip() # removes leading and trailing whitespace from the string sentence
– Standardise case
• string_var.lower() # convert the string variable to lower case
• string_var.upper() # convert to upper case
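The cleaning steps above can be combined into one helper. This is a minimal sketch using only the standard library; the function name normalize and the extra step of collapsing internal whitespace are assumptions added for illustration.

```python
import string

def normalize(sentence):
    """Strip surrounding whitespace, remove punctuation, and lower-case."""
    # str.strip() removes leading and trailing whitespace.
    cleaned = sentence.strip()
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    cleaned = cleaned.translate(table)
    # Collapse runs of internal whitespace into single spaces (added step).
    cleaned = " ".join(cleaned.split())
    return cleaned.lower()

print(normalize("  Hello, World!   Text   Analytics...  "))
# prints: hello world text analytics
```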
Text tokenization
• Segmenting the original text into smaller “tokens”.
• Tokenization uses delimiters: specify a delimiter on which to split the text.
Code
First import word_tokenize from the nltk module, then call word_tokenize() to split the text into word tokens (punctuation marks become separate tokens).
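To show how the choice of delimiter changes the tokens, here is a small sketch using only the standard library. It is not NLTK's word_tokenize(); the regular expressions are simplified assumptions chosen for the example.

```python
import re

text = "Text analytics is fun. Tokens are small units!"

# Delimiter = whitespace: punctuation stays attached to the words.
space_tokens = text.split()

# Delimiter = whitespace that follows sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Words and punctuation marks as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

print(space_tokens)
print(sentences)
print(word_tokens)
```

Splitting on whitespace leaves tokens such as "fun." with the full stop attached, while the last pattern separates it out.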
Stemming
• Many English words carry a prefix or a suffix such as ‘s’, ‘ed’, or ‘en’.
• The stem is the base form of a word, without prefix or suffix.
• Inflection is the process of adding a prefix or suffix to a word.
• Stemming is the process of removing the prefix or suffix from a word to return it to its base form.
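The idea of stripping an inflection can be sketched with a naive suffix remover. This is a toy illustration only, not the Porter algorithm: the suffix list, the minimum stem length, and the name naive_stem are assumptions, and prefixes are not handled.

```python
# Hypothetical, hand-picked suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "en", "er", "s")

def naive_stem(word):
    """Remove the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["market", "markets", "marketed", "marketing", "marketer"]
stems = [naive_stem(w) for w in words]
print(stems)
```

All five forms reduce to "market", but the approach breaks easily: naive_stem("running") gives "runn". Real stemmers add rules for such cases.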
Stemming
• Word forms: market, markets, marketed, marketing, marketer.
• The stem is “market”. ‘marketing’ is the stem plus a suffix of ‘ing’.
• To do stemming, we can utilize ‘PorterStemmer’.
• PorterStemmer is a popular stemming algorithm, in use since 1979.
Stemming
• Stemming a list of words: get the stem of each word in the list.
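A minimal sketch of stemming a list of words with NLTK's PorterStemmer, assuming nltk is installed. The stemmer needs no extra data download; the word list is taken from the "market" example and is not the original slide's code.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms from the "market" example.
words = ["market", "markets", "marketed", "marketing", "marketer"]

# Apply the stemmer to each word in the list.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```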
Lemmatization
• Lemmatization is a process similar to stemming: it returns a word to its base form.
• Stemming cannot recover certain base words, and the stem it produces is sometimes not a valid English word. The result of lemmatization is generally a valid word.
• The root word is referred to as the “lemma”.
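One way to picture lemmatization is as a dictionary lookup rather than suffix stripping. The sketch below uses a tiny hand-written table as a stand-in for a real dictionary such as WordNet. The table contents, the part-of-speech labels, and the name lookup_lemma are all hypothetical.

```python
# Hypothetical hand-written lemma table keyed by (word, part of speech).
# A real lemmatizer consults a full dictionary such as WordNet instead.
LEMMAS = {
    ("beaten", "verb"): "beat",
    ("happiest", "adjective"): "happy",
    ("markets", "noun"): "market",
}

def lookup_lemma(word, pos):
    """Return the dictionary form of word, or the word itself if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(lookup_lemma("Beaten", "verb"))
print(lookup_lemma("happiest", "adjective"))
print(lookup_lemma("unknownword", "noun"))
```

Because the result comes from a dictionary, it is a valid word ("happy"). The cost is that the lookup needs the part of speech and only covers words the dictionary knows.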
Lemmatization
• Use NLTK's WordNetLemmatizer to do lemmatization.
• Examples: ‘beaten’ becomes ‘beat’; ‘happiest’ becomes ‘happy’.
Text Classification
• Processed words can be classified into pre-defined categories.
• With the classified words, meaningful insights can then be understood or discovered.
• Classification can be done through different types of machine learning, such as supervised learning, unsupervised learning, and reinforcement learning.
Text Classification
• Typical supervised text classification contains two phases:
– Training: train the classification system on sample data.
– Prediction: predict which class new data belongs to.
Figure: blueprint for a text classification system.
Src: Dipanjan Sarkar, Text Analytics with Python, 2016
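The two phases can be sketched with a small classifier written from scratch. The choice of a multinomial Naive Bayes model with add-one smoothing is an assumption made for this example; the source does not name a specific algorithm. The training sentences and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Training phase: count words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def predict(model, text):
    """Prediction phase: return the class with the highest log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        # Log prior: share of training documents carrying this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
TRAINING = [
    ("great product love it", "positive"),
    ("excellent service great value", "positive"),
    ("terrible product hate it", "negative"),
    ("poor service bad value", "negative"),
]

model = train(TRAINING)
print(predict(model, "great service"))  # prints: positive
print(predict(model, "bad product"))    # prints: negative
```

In practice the text would first go through normalization, tokenization, and stemming or lemmatization before training, and a library implementation would replace this hand-rolled model.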