S12 Text Analytics

2/13/2020
Seminar 12: Text Analytics
Text Analytics / Text Mining

• These two terms are used interchangeably.
• Both refer to applying techniques or algorithms to find patterns, connections, trends, or high-quality information in text data.
• Text data are everywhere: tweets on Twitter, Facebook posts, Word documents, PowerPoint slides, webpages, emails, mobile chat data, and reviews.
Text Analytics Techniques

• Text classification
• Text clustering
• Text summarization
• Similarity analysis
• Sentiment analysis
Structured / Unstructured Data

• Structured data are organized: displayed in ordered rows and columns, or organised into tables in a relational database.
• Records in structured data can be identified and labelled.
• Unstructured data are anything else, the opposite of structured data.
• Unstructured data can be text based or non-text based. Non-text-based unstructured data include satellite images, sensor data, and video data.
• The majority of the data that poses challenges today is in unstructured form.
[Figure: structured vs. unstructured data]
Source: https://www.datamation.com/big-data/structured-vs-unstructured-data.html
Information Retrieval (IR)

• Information retrieval means retrieving relevant documents in response to a query or requirement.
• An Information Retrieval System (IRS) is the platform that facilitates the retrieval, e.g. Google, Yahoo, or other search engines.
• The fundamental technique for information retrieval is the similarity measure: how similar a document is to the search string or query.
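A minimal sketch of similarity-based retrieval, assuming a simple bag-of-words representation and cosine similarity as the measure. The slides do not name a specific measure, so this choice, the helper names, and the sample documents are illustrative assumptions, not how any particular search engine works.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_counts(query)
    scored = [(cosine_similarity(q, term_counts(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up documents for illustration.
documents = [
    "Text mining finds patterns in text data",
    "Satellite images are unstructured data",
    "Search engines retrieve relevant documents",
]
ranking = rank_documents("text data patterns", documents)
for score, doc in ranking:
    print(f"{score:.2f}  {doc}")
```

Documents that share more terms with the query score closer to 1; a document with no terms in common scores 0.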
How does Gooooooogle search?
• Crawling
– Software programs crawl across web pages, visiting new and old websites. These programs are called crawlers, spiders, or bots.
– A crawler reads each website and updates the index used in the next step.
• Indexing
– Indexing is the next step; it allows quick reference in the future.
– Google builds a large database of the webpages its crawlers found and generates a corresponding index, analogous to the index at the end of a book.
• Ranking
– With the webpages indexed, different algorithms rank them by relevance, using information such as how many webpages link to a page, whether its content is high quality, and when it was last updated.
– Highly ranked webpages are displayed before lower-ranked webpages.
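The indexing step can be sketched as an inverted index, which maps each word to the pages that contain it, much like the index at the back of a book. This is a toy illustration only: the page URLs, their text, and the function names are made up, and a real search engine's index is far more elaborate.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text (made up for illustration).
pages = {
    "example.com/a": "Text analytics finds patterns in text",
    "example.com/b": "Search engines index web pages",
    "example.com/c": "Crawlers visit web pages and read text",
}


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(pages)
hits = lookup(index, "web pages")
print(sorted(hits))
```

Because the index is built once, each query only needs a few set lookups instead of re-reading every page.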
Natural Language
• In English, words are combined to form larger constituent units such as phrases, clauses, and sentences.
• A sentence is a structured format for representing a collection of words that follows syntactic rules such as grammar.
[Figure: words without structure vs. a structured sentence following hierarchical syntax. Source: Dipanjan Sarkar, Text Analytics with Python, 2016]
Natural Language Processing (NLP)

• Natural language: a language that humans developed and evolved naturally to communicate with one another, as opposed to a programming language, which is designed for computers.
• Examples of human languages: English, Chinese, etc.
• Natural language processing bridges the gap between human language and computer language. Its goal is to help computers understand human language.
• NLP is challenging because natural language is intrinsically unstructured.
Natural Language Processing

• Natural language can be in the form of text or speech.
• We focus primarily on text.
• We use statistical or machine learning approaches to make inferences from data.
• With an appropriate NLP approach, we can organize, structure, and analyse data.
• Examples of NLP:
– Summarizer: https://algorithmia.com/algorithms/nlp/Summarizer
– Sentiment Analysis: https://algorithmia.com/algorithms/nlp/SentimentAnalysis
NLP Frameworks
• Two components of NLP:
– Natural Language Understanding (NLU)
• Mapping the natural language input to useful representations
• Analyzing relevant aspects of the language
– Natural Language Generation (NLG)
• Producing meaningful text sequences in natural language
• It involves text and sentence construction.
• Steps in NLP:
– Lexical analysis
– Syntactic analysis
– Semantic analysis
– Discourse integration
– Pragmatic analysis
Src: https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_natural_language_processing.htm
Natural Language ToolKit (NLTK)

• A Python module for working with natural language.
• It provides access to corpora and lexical resources, together with text processing abilities.
• Corpus: a large collection of written or spoken text stored on a computer; the material on which linguistic analysis is based.
• Corpora: plural form of corpus.
• Examples of English corpora:
– BNC British National Corpus
– BOE Bank of English
– Opus Open Source Parallel Corpus
– COCA Corpus of Contemporary American English
– Brown Corpus
– WordNet
• Lexical: relating to vocabulary.
Using NLTK
• If you run into problems with nltk after importing it, run the NLTK downloader (nltk.download()) from the editor or IPython.
• Then select the ‘all’ option and click Download.
Pre-processing for Text Data

• Text data can be standardized through pre-
processing or cleaning up of data.
• Examples of pre-processing:
– Removing trailing white spaces
– Convert to all lower case, or other standardized case
to make comparison easier.
– Tokenization, breaking the text into words
– Chunking
– Stemming
– Lemmatization
Text Normalization
• Normalization in text analytics refers to cleaning, wrangling, and standardizing text data into a recognizable form.
• Data cleaning:
– Removing unwanted special characters.
• import string
• string.punctuation # a string of all ASCII punctuation characters; match against it to find and remove punctuation
– Removing unwanted spacing.
• sentence.strip() # removes leading and trailing whitespace from the string sentence
– Standardising case.
• string_var.lower() # converts the string variable to lower case
• string_var.upper() # converts it to upper case
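The three cleaning steps above can be combined into one small helper. This is a sketch using only the standard library; the function name normalize and the sample sentence are made up for illustration.

```python
import string


def normalize(sentence):
    """Strip surrounding whitespace, drop punctuation, and lower-case."""
    stripped = sentence.strip()
    # str.maketrans with a third argument maps those characters to None,
    # so translate() deletes every ASCII punctuation character.
    no_punct = stripped.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()


raw = "   Hello, World! Text Analytics is FUN...   "
clean = normalize(raw)
print(repr(clean))
```

Note that string.punctuation covers ASCII punctuation only, so typographic characters such as curly quotes would need separate handling.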
Text tokenization
• Tokenization segments the original text into smaller “tokens”.
• Tokenization relies on delimiters: specify a delimiter on which to split the sentences.
• Code: first import word_tokenize from nltk, then call word_tokenize() to split the text into word and punctuation tokens.
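Delimiter-based tokenization can be sketched with the standard library alone. The two helpers below are hypothetical stand-ins written for illustration; they are not NLTK's word_tokenize(), which applies more elaborate rules, but they show the difference between splitting on a single delimiter and separating punctuation into its own tokens.

```python
import re


def split_on_delimiter(text, delimiter=" "):
    """Split on one delimiter and drop empty pieces."""
    return [token for token in text.split(delimiter) if token]


def simple_word_tokenize(text):
    """Return words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


sentence = "Text mining is fun, isn't it?"
print(split_on_delimiter(sentence))
print(simple_word_tokenize(sentence))
```

Splitting on spaces leaves punctuation attached to words ("fun," and "it?"), which is why word-level tokenizers treat punctuation as separate tokens.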
Stemming
• Many English words carry affixes: prefixes, or suffixes such as ‘s’, ‘ed’, or ‘en’.
• The stem is the base form of an English word, without prefix or suffix.
• Inflection is the process of adding a prefix or suffix to a word.
• Stemming is the process of removing the prefix or suffix from an English word to return it to its base form.
Stemming
• The stem is “market”.
• ‘marketing’ is the stem plus the suffix ‘ing’.
• To do stemming, we can use ‘PorterStemmer’.
• PorterStemmer implements the Porter algorithm, a popular stemming algorithm since 1979.
[Diagram: marketed, marketer, marketing, and markets all reduce to the stem market.]
Stemming
• Stemming a list of words: get the stem of each word in the list.
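Stemming a list of words might look like the sketch below, which uses NLTK's PorterStemmer. It assumes nltk is installed (the stemmer needs no extra data download), and the word list is taken from the “market” example in these slides; the original code on the slide is not available, so this is a reconstruction rather than the author's exact code.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words from the "market" example.
words = ["marketed", "marketing", "markets", "marketer"]

# stem() returns the stem of a single word; apply it to each list item.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```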
Lemmatization
• Lemmatization is a process similar to stemming: it returns a word to its base form.
• Certain base words cannot be recovered by stemming, and the stem that stemming produces is sometimes not a valid English word, whereas the words produced by lemmatization are generally valid.
• The root word is referred to as the “lemma”.
Lemmatization
• Use WordNetLemmatizer to do lemmatization:
– From ‘beaten’ to ‘beat’
– From ‘happiest’ to ‘happy’
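The difference between stemming and lemmatization can be illustrated with a toy sketch that needs no corpus download. Both helpers are hypothetical: naive_stem is crude suffix stripping (not the Porter algorithm), and lookup_lemma uses a tiny hand-made table standing in for the full WordNet dictionary that WordNetLemmatizer consults.

```python
# Hypothetical lemma table for illustration only; a real lemmatizer such as
# NLTK's WordNetLemmatizer looks words up in a full dictionary (WordNet).
LEMMA_TABLE = {
    "beaten": "beat",
    "happiest": "happy",
    "geese": "goose",
}

SUFFIXES = ("iest", "ing", "ed", "en", "s")


def naive_stem(word):
    """Crude suffix stripping: cut the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Dictionary lookup: return the lemma if known, else the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ("beaten", "happiest", "geese"):
    print(f"{w}: stem={naive_stem(w)}, lemma={lookup_lemma(w)}")
```

Suffix stripping turns ‘happiest’ into ‘happ’, which is not a valid word, and leaves ‘geese’ unchanged, while the lookup returns ‘happy’ and ‘goose’. With NLTK itself, WordNetLemmatizer().lemmatize(word, pos=...) plays the role of lookup_lemma once the WordNet corpus has been downloaded.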
Part-Of-Speech (POS) Tagging

• POS tagging classifies words into parts of speech and labels them according to a given tagset.
• A tagset is a collection of tags used for tagging.
• pos_tag() from nltk processes a sentence and tags it.
• If the tagger raises an error, run the NLTK download from IPython.

• Now let’s examine the tagged output.
• POS tagging is also called grammatical tagging; it identifies each word as a noun, verb, etc.
• The first output shows that the word ‘Everyday’ is tagged ‘NNP’, a singular proper noun.
• Other interpretations are:
– VBZ for verb (third-person singular present)
– DT for determiner
– JJ for adjective
• Search Google for POS tagging to complete your understanding of the rest of the tags.

• Alternatively, we could use the help in IPython to look up the tags.
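A tag lookup can also be sketched as a small dictionary. The glossary below is a partial, hand-written list assuming the Penn Treebank tags that NLTK's default English tagger uses. The (word, tag) pairs are made up in the shape that pos_tag() returns and are not real tagger output, apart from ‘Everyday’/‘NNP’, which the slides report.

```python
# Partial glossary of Penn Treebank POS tags (not the full tagset).
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBZ": "verb, third-person singular present",
    "DT": "determiner",
    "JJ": "adjective",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
}


def describe_tags(tagged_words):
    """Attach a plain-English description to each (word, tag) pair."""
    return [
        (word, tag, PENN_TAGS.get(tag, "unknown tag"))
        for word, tag in tagged_words
    ]


# Hand-written example pairs, in the list-of-tuples shape pos_tag() returns.
tagged = [("Everyday", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("good", "JJ"), ("day", "NN")]

described = describe_tags(tagged)
for word, tag, meaning in described:
    print(f"{word:10} {tag:4} {meaning}")
```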
Text Classification
• Processed words can be classified into pre-defined categories.
• With the words classified, meaningful insights can then be understood or discovered.
• Classification can draw on different types of machine learning, such as supervised learning, unsupervised learning, and reinforcement learning.
Text Classification
• Typical supervised text classification has two phases:
– Training: train the classification system on sample data.
– Prediction: predict which class new data belongs to.
[Figure: blueprint for a text classification system. Source: Dipanjan Sarkar, Text Analytics with Python, 2016]
Text Classification Example: predict whether a given name is male or female
• The example first loads data from the corpus’ names module and constructs a list of names.
• It then constructs a feature set pairing each name with its group. Two classes/groups are created: male and female.
• A Naive Bayes classifier is then used for classification. Naive Bayes is a probabilistic model that makes predictions from Bayes’ theorem, combining the prior probability of each class with the likelihood of the observed features.
• The last line of code tests which class the name ‘Frank’ belongs to; the result is ‘male’.
Src: https://pythonspot.com/natural-language-processing-prediction/
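The training and prediction phases can be sketched with NLTK's NaiveBayesClassifier. To keep the sketch self-contained, it trains on a tiny hand-made list of names instead of the names corpus, and it uses the last letter of the name as the only feature. Both the name lists and the gender_features helper are assumptions for illustration, not necessarily what the cited example does.

```python
from nltk.classify import NaiveBayesClassifier

# Hypothetical training data standing in for the NLTK names corpus.
MALE_NAMES = ["Frank", "John", "Peter", "Mark", "David", "Robert"]
FEMALE_NAMES = ["Anna", "Maria", "Julia", "Emma", "Sophia", "Laura"]


def gender_features(name):
    """Feature dictionary for a name: here, only its last letter."""
    return {"last_letter": name[-1].lower()}


# Training phase: pair each feature dictionary with its class label.
labeled_names = ([(name, "male") for name in MALE_NAMES]
                 + [(name, "female") for name in FEMALE_NAMES])
train_set = [(gender_features(name), label) for name, label in labeled_names]
classifier = NaiveBayesClassifier.train(train_set)

# Prediction phase: classify a name by its features.
prediction = classifier.classify(gender_features("Frank"))
print("Frank ->", prediction)
```

With so few training names the classifier is only a demonstration; training on the full names corpus, as the example on the slide does, gives it far more evidence per letter.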
You have learnt...

1. The concept of web scraping through the use of Beautiful Soup module.
2. About the difference between structured and unstructured data.
3. About text analytics and natural language processing.
4. To use natural language toolkit to process text data.
5. About text pre‐processing techniques like stemming, lemmatization and classification.