
2/13/2020

Seminar 12:
Text Analytics

Text Analytics / Text Mining


• These two terms are used interchangeably.
• They refer to applying techniques or algorithms to text data to find patterns, connections, trends, or other high-quality information.
• Text data are everywhere: Twitter tweets, Facebook posts, Word documents, PowerPoint slides, web pages, emails, mobile chat data, and reviews.

Text Analytics Techniques


• Text classification
• Text clustering
• Text summarization
• Similarity analysis
• Sentiment analysis

Structured / Unstructured Data


• Structured data are organized: displayed in ordered rows and columns, or organised into tables in a relational database.
• Each record in structured data can be identified and labelled.
• Unstructured data is everything else, i.e. data without such a predefined organization.
• Unstructured data can be text-based or non-text-based. Non-text-based unstructured data include satellite images, sensor data, and videos.
• The majority of the data that poses challenges today is in unstructured form.


Structured / Unstructured Data

Source: https://www.datamation.com/big-data/structured-vs-unstructured-data.html

Information Retrieval (IR)


• IR retrieves relevant documents in response to a query or requirement.
• An Information Retrieval System (IRS) is the platform that facilitates the retrieval, e.g. Google, Yahoo, or other search engines.
• The fundamental technique for information retrieval is based on similarity measures: documents are scored by their similarity to the search string or query.
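
The slides do not name a specific similarity measure. As one common choice, the sketch below assumes cosine similarity over simple word counts; the documents, the query, and the function name cosine_similarity are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only words shared by both texts contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical document collection and query.
documents = [
    "text mining finds patterns in text data",
    "satellite images are unstructured data",
    "search engines retrieve relevant documents",
]
query = "text data patterns"

# Rank documents from most to least similar to the query.
ranked = sorted(documents, key=lambda d: cosine_similarity(query, d), reverse=True)
for doc in ranked:
    print(round(cosine_similarity(query, doc), 3), doc)
```

Real search engines use far richer scoring, but the principle is the same: the document whose vocabulary overlaps most with the query comes first.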

How does Gooooooogle search?

• Crawling
– Software programs crawl the web, visiting new and old websites. These programs are called crawlers, spiders, or bots.
– A crawler reads the websites and passes what it finds on to the indexing step.
• Indexing
– Indexing is the next step; it allows for quick reference in future.
– Google builds a large database of the web pages the crawlers found and generates a corresponding index, analogous to the index at the end of a book.
• Ranking
– With the web pages indexed, different algorithms are used to rank them by relevance. The ranking takes into account information such as how many web pages link to a page, whether its content is of high quality, and when it was last updated.
– Highly ranked web pages are displayed before lower-ranked ones.
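
The indexing step can be illustrated with a toy inverted index, which maps each word to the pages containing it, much like the index at the back of a book. This is only a sketch: the page URLs and texts are hypothetical, and it makes no claim about how Google's index is actually built.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
pages = {
    "example.com/a": "text mining and text analytics",
    "example.com/b": "web crawlers visit web pages",
    "example.com/c": "analytics for web data",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

index = build_index(pages)

def search(query):
    """Return the pages that contain every word of the query."""
    results = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*results) if results else set()

print(sorted(search("web analytics")))
```

Because the index is built once, each search is a few dictionary lookups rather than a scan of every page.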

Natural Language
• In English, words are combined to form larger constituent units such as phrases, clauses, and sentences.
• A sentence is a structured way of representing a collection of words that follows syntactic rules such as grammar.

[Figure: words without structure (left) versus a structured sentence following hierarchical syntax (right).]
Picture source: Dipanjan Sarkar, Text Analytics with Python, 2016


Natural Language Processing (NLP)
• Natural language: a language that humans have developed and evolved naturally to communicate with one another, as opposed to a programming language, which is designed for computers.
• Examples of human languages: English, Chinese, etc.
• Natural language processing bridges the gap between human language and computer language. Its goal is to help computers understand human language.
• NLP is challenging because natural language is intrinsically unstructured.

Natural Language Processing


• Natural language can be in the form of text or speech.
• We focus primarily on text.
• NLP uses statistical or machine learning approaches to make inferences from data.
• Through an appropriate NLP approach, we can organize, structure, and analyse data.
• Examples of NLP:
– Summarizer: https://algorithmia.com/algorithms/nlp/Summarizer
– Sentiment Analysis: https://algorithmia.com/algorithms/nlp/SentimentAnalysis


NLP Frameworks
• Two components of NLP:
– Natural Language Understanding (NLU)
• Mapping the input in natural language into representations
• Analyzing the relevant aspects of the language
– Natural Language Generation (NLG)
• Producing meaningful text sequences in natural language
• It involves text and sentence construction.
• Steps in NLP:
– Lexical analysis
– Syntactic analysis
– Semantic analysis
– Discourse integration
– Pragmatic analysis

Src: https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_natural_language_processing.htm

Natural Language ToolKit (NLTK)


• A Python module for working with natural language.
• It gives access to corpora and lexical resources, together with text processing capabilities.
• Corpus: a large collection of text, written or transcribed from speech, stored on a computer. It is the material on which linguistic analysis is based.
• Corpora: plural form of corpus.
• Examples of English corpora:
– BNC: British National Corpus
– BOE: Bank of English
– OPUS: Open Source Parallel Corpus
– COCA: Corpus of Contemporary American English
– Brown Corpus
– WordNet
• Lexical: relating to vocabulary.


Using NLTK

If you run into problems using nltk because a resource is missing, run the corresponding download, nltk.download(), from the editor or IPython.
Then select the ‘all’ option and click Download.

Pre-processing for Text Data


• Text data can be standardized through pre-processing, i.e. cleaning up the data.
• Examples of pre-processing:
– Removing leading and trailing white space
– Converting to all lower case, or another standardized case, to make comparison easier
– Tokenization: breaking the text into words
– Chunking
– Stemming
– Lemmatization


Text Normalization
• Normalization in text analytics refers to cleaning, wrangling, and standardizing text data into a consistent form.
• Data cleaning:
– Removing unwanted special characters.
• import string
• string.punctuation # a string containing all ASCII punctuation characters; find matching punctuation in the text and remove it
– Removing unwanted spacing.
• sentence.strip() # removes leading and trailing whitespace from the string sentence
– Standardising case
• string_var.lower() # converts the string variable to lower case
• string_var.upper() # converts it to upper case
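
The cleaning steps above can be combined into one small helper. This is a minimal sketch using only the standard library; the function name normalize and the sample sentence are illustrative, not from the slides.

```python
import string

def normalize(sentence):
    """Strip outer whitespace, convert to lower case, and drop ASCII punctuation."""
    cleaned = sentence.strip().lower()
    # str.maketrans with a third argument maps each listed character to None,
    # so translate() deletes every punctuation character.
    table = str.maketrans("", "", string.punctuation)
    return cleaned.translate(table)

print(normalize("  Hello, World!  "))
```

Whether punctuation should be removed at all depends on the task; sentiment analysis, for example, may want to keep exclamation marks.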


Text tokenization
• Segmenting the original text into smaller “tokens”.
• Tokenization relies on delimiters: specify a delimiter at which to split the sentences.

Code
First import word_tokenize from the nltk module, then use word_tokenize() to split the text into word and punctuation tokens.
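
The code and output shown on the slide are not preserved in this text. As a stand-in, the sketch below shows delimiter-based tokenization with the standard library only; it is not NLTK's word_tokenize(), which applies more elaborate rules and needs NLTK's tokenizer data to be downloaded. The sample sentence is made up.

```python
import re

text = "Text data are everywhere: tweets, emails, reviews."

# Splitting on whitespace keeps punctuation attached to the words.
space_tokens = text.split()

# A regular expression can treat punctuation marks as separate tokens:
# \w+ matches a run of word characters, [^\w\s] matches one punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(space_tokens)
print(regex_tokens)
```

The second form is closer in spirit to what word_tokenize() returns, since punctuation becomes its own token.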


Stemming
• Some English words carry affixes such as ‘s’, ‘ed’, or ‘en’, added as a prefix or suffix.
• The stem is the base form of an English word, without prefix or suffix.
• Inflection is the process of adding a prefix or suffix to a word.
• Stemming is the process of removing the prefix or suffix from an English word to return it to its base form.

Stemming

[Diagram: the stem “market” at the centre, surrounded by “marketed”, “marketer”, “marketing”, and “markets”.]
• The stem is “market”.
• ‘marketing’ is the stem plus the suffix ‘ing’.


Stemming

• To do stemming, we can use ‘PorterStemmer’.
• PorterStemmer implements the Porter stemming algorithm, which has been popular since 1979.


Stemming
• Stemming a list of words: get the stem of each word in the list.
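
The slide's code and output are not preserved in this text. A minimal sketch of stemming a list of words with NLTK's PorterStemmer might look like the following; the word list is an assumption, built from the "market" example plus one extra word.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["marketed", "marketer", "marketing", "markets", "studies"]

# stem() works on one word at a time, so apply it to each item in the list.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

The four "market" variants all reduce to the same stem, while "studies" becomes "studi", which shows that a stem is not always a valid English word.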


Lemmatization
• Lemmatization is a process similar to stemming: it returns a word to its base form.
• Certain base words cannot be retrieved by stemming. Also, the stem produced by stemming is sometimes not a valid English word, whereas the words produced by lemmatization are generally valid.
• The root word is referred to as the “lemma”.

Lemmatization
• Use WordNetLemmatizer to do lemmatization. Examples:
– From ‘beaten’ to ‘beat’
– From ‘happiest’ to ‘happy’


Part-Of-Speech (POS) Tagging


• POS tagging classifies words into parts of speech and labels them according to a given tagset.
• A tagset is a collection of tags used for tagging.
• pos_tag() from nltk processes the sentence and tags it.
• If the tagger raises an error, run the corresponding download from IPython.


Part-Of-Speech (POS) Tagging


• Now let’s examine the tagged output.
• POS tagging is also called grammatical tagging; it identifies each word as a noun, verb, etc.
• The first output shows that the word ‘Everyday’ is tagged ‘NNP’, a singular proper noun.
• Other interpretations are:
– VBZ for verb (third person singular, present tense)
– DT for determiner
– JJ for adjective
• Search online for POS tagging to complete your understanding of the rest of the tags.
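
To make the tag meanings easier to read, a small lookup table can translate tags into plain descriptions. The sketch below assumes the Penn Treebank tagset (the one pos_tag() uses by default) and lists only a handful of its tags. The (word, tag) pairs are hand-written for illustration and are not actual tagger output.

```python
# A few Penn Treebank tags and their meanings (a partial, hand-picked list).
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VB": "verb, base form",
    "VBZ": "verb, 3rd person singular present",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "JJ": "adjective",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
}

# Hand-written (word, tag) pairs in the shape pos_tag() returns.
tagged = [("Everyday", "NNP"), ("brings", "VBZ"), ("a", "DT"),
          ("new", "JJ"), ("lesson", "NN")]

def explain(tagged_words):
    """Attach a plain-English description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag"))
            for word, tag in tagged_words]

for word, tag, meaning in explain(tagged):
    print(f"{word:10} {tag:4} {meaning}")
```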


Part-Of-Speech (POS) Tagging


• Instead of searching online for the meaning of the tags, we can use the help in IPython to look them up, for example with nltk.help.upenn_tagset().


Text Classification
• Processed words can be classified into pre-defined categories.
• With the classified words, meaningful insights can then be understood or discovered.
• Classification can be done through different types of machine learning, such as supervised learning, unsupervised learning, and reinforcement learning.

Text Classification

• Typical supervised text classification contains two phases:
– Training: train the classification system on sample data
– Prediction: predict which class new data belongs to
[Figure: blueprint for a text classification system.]
Src: Dipanjan Sarkar, Text Analytics with Python, 2016


Text Classification Example – Predict whether a given name is male or female
• See the following example, which first loads data from the names module of the NLTK corpus package and constructs a list of names.
• It then constructs a feature set pairing each name with its corresponding group. Two classes/groups are created: male and female.
• A Naive Bayes classifier is then used for classification. Naive Bayes is a probabilistic model that predicts the most probable class by combining each class's prior probability with the likelihood of the observed features.
• The last line of code tests which class the name ‘Frank’ belongs to; the result is ‘male’.
Src: https://pythonspot.com/natural-language-processing-prediction/
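
The code listing from the slide is not preserved in this text. The sketch below follows the same steps under two assumptions: it uses a tiny hand-made list of names instead of NLTK's names corpus, so no corpus download is needed, and it describes each name by its last letter only. It is an illustration, not the original listing.

```python
from nltk.classify import NaiveBayesClassifier

def gender_features(name):
    """Describe a name by a single feature: its last letter."""
    return {"last_letter": name[-1].lower()}

# Tiny hand-made sample (an assumption; the slide loads NLTK's names corpus).
male = ["John", "Mark", "Peter", "David", "Jack", "Erik", "Patrick", "Derek"]
female = ["Anna", "Maria", "Sophia", "Emma", "Olivia", "Julia", "Laura", "Linda"]
labelled = [(n, "male") for n in male] + [(n, "female") for n in female]

# Training phase: turn each name into features and fit the classifier.
train_set = [(gender_features(n), label) for n, label in labelled]
classifier = NaiveBayesClassifier.train(train_set)

# Prediction phase: classify a name that was not in the training sample.
print(classifier.classify(gender_features("Frank")))
```

With so little training data the classifier only learns that names ending in "k" were male and names ending in "a" were female in this sample; the full names corpus gives a more meaningful model.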


You have learnt...


1. The concept of web scraping through the use of the Beautiful Soup module.
2. About the difference between structured and unstructured data.
3. About text analytics and natural language processing.
4. To use the Natural Language Toolkit to process text data.
5. About text pre-processing techniques like stemming and lemmatization, and about text classification.
