
Understanding & Examining Unstructured

Data (Text Mining & Sentiment Analysis)


Dr. Rishi Dwesar

Prepared By: Dr. Rishi Dwesar for Marketing & Technology Course
What is unstructured data?

Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so it cannot be used easily by a computer program.

In other words, unstructured data is not organized in a pre-defined manner and does not have a pre-defined data model, so it is not a good fit for a mainstream relational database.

Forms of Unstructured Data

Unstructured data comes in several forms: textual unstructured data and non-textual unstructured data, which includes images, colors, sounds, and shapes.

Unstructured textual data is textual data found in web pages, emails, reports, documents, medical records, surveys, reviews and spreadsheets.

Characteristics of Unstructured Data

• Data neither conforms to a data model nor has any structure.
• Data cannot be stored in the form of rows and columns, as in databases.
• Data does not follow any semantic rules (rules relating to meaning in language or logic).
• Data lacks any particular format or sequence.
• Data has no easily identifiable structure.
• Due to this lack of identifiable structure, it cannot be used easily by computer programs.

Advantages of Unstructured Data:

• It supports data that lacks a proper format or sequence.
• The data is not constrained by a fixed schema.
• It is very flexible due to the absence of a schema.
• Data is portable.
• It is very scalable.
• It can deal easily with the heterogeneity of sources.
• This type of data has a variety of business intelligence and analytics applications.

Text Mining
Text Mining Definition
Text mining, also referred to as text data mining and similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."

Text mining usually involves the process of structuring the input text (usually parsing,
along with the addition of some derived linguistic features and the removal of others, and
subsequent insertion into a database), deriving patterns within the structured data, and
finally evaluation and interpretation of the output. 'High quality' in text mining usually
refers to some combination of relevance, novelty, and interest. Typical text mining tasks
include text categorization, text clustering, concept/entity extraction, production of
granular taxonomies, sentiment analysis, document summarization, and entity relation
modeling (i.e., learning relations between named entities).
Motivation for Text Mining

1. Approximately 90% of the World’s data is held in unstructured formats:


• Web pages
• Emails
• Technical documents
• Corporate documents
• Books
• Digital libraries
• Customer complaints and online reviews

2. Growing rapidly in size and importance

Why should marketers bother about
Text Mining?
Text Mining has various applications and use cases in Marketing. Some of these are:

1. Understanding product attributes liked and disliked by consumers.


2. Online Reputation Analysis and Management
3. Understanding Customer Experience, Satisfaction and Dissatisfaction
4. Automation of Service, Customer Support
5. Identifying buying intent
6. Identifying customer segment etc.

Read More About it at:


https://medium.com/@mbondar/text-mining-the-marketing-game-changer-ba9e7021da8d
https://blog.infegy.com/how-brands-can-use-text-analytics-for-consumer-insights
https://brandequity.economictimes.indiatimes.com/news/marketing/harnessing-conversations-the-use-of-text-analytics-for-better-marketing/67175357

Applications of Text Mining: https://www.lexalytics.com/applications

Detailed Process: https://www.lexalytics.com/technology/text-analytics


Mining Text Data

Data mining / knowledge discovery works on several kinds of data: structured data, multimedia, free text, and hypertext (HTML). The same home-loan information, for example, can appear as:

• Structured data: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years) -> Loans($200K, [map], ...)
• Free text: "Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial."
• Hypertext (HTML): <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...
Typical Text Mining Workflow

Common text preprocessing procedures:

• Tokenization
• Lower casing
• Stop word removal
• N-grams / unigrams
• Most frequent tokens
• Part-of-speech tagging
• Stemming / lemmatization

Read more at: https://towardsdatascience.com/text-preprocessing-in-natural-language-processing-using-python-6113ff5decd8


Text Preprocessing Procedures (Cont.)

1. Tokenization: splitting the sentence into words.

2. Lower casing: converting each word to lower case (NLP -> nlp). Words like "Book" and "book" mean the same, but when not converted to lower case they are represented as two different words in the vector space model (resulting in more dimensions).

3. Stop word removal: stop words are very commonly used words (a, an, the, etc.) in documents. These words do not really signify any importance, as they do not help in distinguishing two documents.

4. Stemming: the process of transforming a word to its root form.

5. Lemmatization: unlike stemming, lemmatization reduces a word to a form that exists in the language.
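The sketch below walks through steps 1-5 using Python and the NLTK library (an assumed choice; the slides themselves do not prescribe a toolkit). The sample sentence and printed outputs are illustrative only.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The dogs are chasing the boys on the playgrounds"

# 1. Tokenization: split the sentence into words
tokens = nltk.word_tokenize(text)

# 2. Lower casing
tokens = [t.lower() for t in tokens]

# 3. Stop word removal (a, an, the, are, on, ...)
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 4. Stemming: crude reduction to a root form (may not be a real word)
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# 5. Lemmatization: reduces each word to a form that exists in the language
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(stems)   # e.g. ['dog', 'chase', 'boy', 'playground']
print(lemmas)  # e.g. ['dog', 'chasing', 'boy', 'playground']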
N-Gram, POS Tagging

• N-grams and unigrams: an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items are usually words, but they can also be syllables, letters or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus. An n-gram of size 1 is called a unigram; when the items are words, n-grams may also be called shingles.

• Part-of-speech tagging (POS tagging): part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text as corresponding to a particular part of speech (e.g. noun, adjective, verb), based on both its definition and its context.
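A minimal sketch of both ideas, again assuming NLTK; the example sentence is illustrative.

import nltk
from nltk import ngrams, pos_tag, word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = word_tokenize("A dog is chasing a boy on the playground")

# Unigrams are the tokens themselves; bigrams are contiguous pairs of tokens
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
print(bigrams[:3])  # e.g. [('A', 'dog'), ('dog', 'is'), ('is', 'chasing')]

# Part-of-speech tagging: mark each token with its grammatical role
print(pos_tag(tokens))  # e.g. [('A', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ...]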
Parsing & Syntax

Parsing: resolving a sentence into its component parts and describing their syntactic roles.

Syntax: the arrangement of words and phrases to create well-formed sentences in a language.

In linguistics, syntax is the set of rules, principles, and processes that govern the structure of sentences in a given language, usually including word order. The term syntax is also used to refer to the study of such principles and processes.
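As a small illustration of parsing, the sketch below performs shallow parsing (chunking) of noun phrases with NLTK's RegexpParser. The toy grammar and sentence are assumptions for demonstration, not a full syntactic parse.

import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "A dog is chasing a boy on the playground"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Toy grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)  # noun phrases such as (NP A/DT dog/NN) appear as subtrees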

Bag-of-Words

• A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
• A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
  - A vocabulary of known words.
  - A measure of the presence of known words.
• It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

Source: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
Bag-of-Tokens/Words

Document (the opening of the Gettysburg Address):
"Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or ..."

Feature extraction turns the document into a set of token/word counts:
nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1, ...

Loses all order-specific information. Severely limits context.
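A minimal bag-of-words sketch, assuming scikit-learn's CountVectorizer (any equivalent tool would do); the two example documents reuse the earlier home-loan text.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Frank Rizzo bought his home from Lake View Real Estate in 1992.",
    "He paid $200,000 under a 15-year loan from MW Financial.",
]

vectorizer = CountVectorizer()           # builds the vocabulary of known words
counts = vectorizer.fit_transform(docs)  # measures the presence of each word

print(vectorizer.get_feature_names_out())  # the vocabulary
print(counts.toarray())                    # word counts per document (word order is lost)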


Term Frequency-Inverse Document Frequency
(TF-IDF)

• Term frequency–inverse document frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

• Term frequency (TF) is a measure of how frequently a term t appears in a document d:

  TF(t, d) = n / (total number of terms in document d)

Here, in the numerator, n is the number of times the term t appears in the document d. Thus, each document and term has its own TF value.
Term Frequency-Inverse Document Frequency
(TF-IDF) (Cont.)

Inverse Document Frequency (IDF)

IDF is a measure of how important a term is. We need the IDF value because computing the TF alone is not sufficient to understand the importance of a word:

  IDF(t) = log(total number of documents in the corpus / number of documents containing the term t)

The TF-IDF score of each word in the corpus is computed by multiplying its TF and IDF. Words with a higher score are more important, and those with a lower score are less important:

  TF-IDF(t, d) = TF(t, d) × IDF(t)

Learn more at: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
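A quick worked example with made-up numbers: if a term appears 3 times in a 100-word document, TF = 3/100 = 0.03; if the corpus has 1,000 documents and 10 of them contain the term, IDF = log10(1000/10) = 2, so TF-IDF = 0.03 × 2 = 0.06. The sketch below computes TF-IDF with scikit-learn's TfidfVectorizer (an assumed choice; scikit-learn uses a smoothed IDF and length normalization, so its values differ slightly from the plain formulas above).

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the loan was approved by the bank",
    "the bank rejected the loan application",
    "the weather is sunny today",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # rows = documents, columns = terms

# Rare terms (e.g. 'weather') get a higher IDF weight than terms
# that appear in every document (e.g. 'the').
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))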


Natural Language Processing
(Application of POS, Syntactic Analysis)

Example sentence: "A dog is chasing a boy on the playground"

• Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
• Syntactic analysis (parsing): the tagged words are grouped into noun phrases, a complex verb, a prepositional phrase and a verb phrase, which together form the sentence.
• Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1, b1, p1). Adding the rule Scared(x) if Chasing(_, x, _), inference yields Scared(b1).
• Pragmatic analysis: a person saying this may be reminding another person to get the dog back.

(Taken from ChengXiang Zhai, CS 397CXZ, Fall 2003)
The layers for processing/transforming
unstructured text data into structured data.

Source: https://www.lexalytics.com/technology/text-analytics
Sentiment Analysis

Sentiment Analysis Definition

The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

It is a machine learning technique that detects polarity (e.g. a positive or negative opinion) within text, whether a whole document, paragraph, sentence, or clause.

Source: https://monkeylearn.com/sentiment-analysis/
Types of Sentiment Analysis Algorithms

• Rule-based systems perform sentiment analysis based on a set of manually crafted rules.
• Automatic systems rely on machine learning techniques to learn from data.
• Hybrid systems combine both rule-based and automatic approaches.

Source: https://monkeylearn.com/sentiment-analysis/
Sentiment Analysis Process
Popular Sentiment Analysis Algorithms

1. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER uses a sentiment lexicon: a list of lexical features (e.g., words) that are generally labeled according to their semantic orientation as either positive or negative. VADER not only gives positivity and negativity scores but also tells us how positive or negative a sentiment is.

2. The Liu and Hu algorithm is based on an opinion lexicon which contains around 6,800 positive and negative opinion words (sentiment words) for the English language. This list was composed over many years.
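A minimal VADER sketch using NLTK's SentimentIntensityAnalyzer (one common implementation; the standalone vaderSentiment package exposes the same polarity_scores method). The review text is made up.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()
review = "The delivery was quick and the product works great, but the packaging was awful."

scores = sia.polarity_scores(review)
print(scores)  # a dict with 'neg', 'neu', 'pos' proportions and a 'compound' score in [-1, 1]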

Read more at: http://wiki.socialisingaroundmedia.com/index.php/Sentiment_Analysis


Read about more algorithms: https://medium.com/@datamonsters/sentiment-analysis-tools-overview-part-1-positive-and-negative-words-databases-ae35431a470c
End of Slides…
