Prepared By: Dr. Rishi Dwesar for Marketing & Technology Course
What is unstructured data?
Unstructured data is data that does not conform to a data model and has no
easily identifiable structure, so it cannot be used easily by a computer
program.
Forms of Unstructured Data
Characteristics of Unstructured Data
Advantages of Unstructured Data:
Text Mining
Text Mining Definition
Text mining, also referred to as text data mining and similar to text analytics, is the
process of deriving high-quality information from text. It involves "the discovery by
computer of new, previously unknown information, by automatically extracting
information from different written resources."
Text mining usually involves the process of structuring the input text (usually parsing,
along with the addition of some derived linguistic features and the removal of others, and
subsequent insertion into a database), deriving patterns within the structured data, and
finally evaluation and interpretation of the output. 'High quality' in text mining usually
refers to some combination of relevance, novelty, and interest. Typical text mining tasks
include text categorization, text clustering, concept/entity extraction, production of
granular taxonomies, sentiment analysis, document summarization, and entity relation
modeling (i.e., learning relations between named entities).
Motivation for Text Mining
Why should marketers bother about
Text Mining?
Text Mining has various applications and use cases in Marketing. Some of these are:
Tokenization Ngram/Unigrams
Lower case
Most Frequent Tokens
Stop word Removal
3. Stop word removal: Stop words are very common words (a, an, the, etc.) that appear in almost
every document. They carry little importance because they do not help distinguish one document
from another.
5. Lemmatization: Unlike stemming, lemmatization reduces each word to its base form (lemma),
which is an actual word in the language.
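The preprocessing steps named above (tokenization into unigrams, lowercasing, stop word removal, most frequent tokens, and the contrast between stemming and lemmatization) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the stop word list, the lemma lookup table, and all function names are assumptions made for this example, and a real project would normally rely on a dedicated NLP library.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {"ponies": "pony", "cameras": "camera", "better": "good"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (unigrams)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def most_frequent(tokens, n=3):
    """Return the n most common tokens with their counts."""
    return Counter(tokens).most_common(n)


def naive_stem(token):
    """Crude suffix stripping; the result need not be a real word."""
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Map a token to its dictionary form if the lookup table knows it."""
    return LEMMAS.get(token, token)


review = "The phone is great and the camera is great"
tokens = tokenize(review)
content_tokens = remove_stop_words(tokens)
top_tokens = most_frequent(content_tokens, n=2)

print(content_tokens)        # tokens left after stop word removal
print(top_tokens)            # most frequent remaining tokens
print(naive_stem("ponies"))  # stemming: a truncated string
print(lemmatize("ponies"))   # lemmatization: a word that exists in the language
```

The last two lines show the difference described in the text: the naive stemmer simply cuts off a suffix, while the lemmatizer returns a dictionary word.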
N-Gram, POS Tagging
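As a rough illustration of n-grams, the sketch below builds unigrams and bigrams from a token list by sliding a window of size n over it. The function name `ngrams` and the sample sentence are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "text mining helps marketers".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so the four-token example produces four unigrams and three bigrams.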
Parsing: resolve (a sentence) into its component parts and describe their syntactic
roles.
In linguistics, syntax is the set of rules, principles, and processes that govern the
structure of sentences in a given language, usually including word order. The term
syntax is also used to refer to the study of such principles and processes.
Bag of Words
A bag-of-words is a representation of text that describes the occurrence of words
within a document. It involves two things: A vocabulary of known words. A measure
of the presence of known words.
Source: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
Part of Speech Tagging
Bag-of-Words
• A bag-of-words model, or BoW for short, is a way of extracting features from text for use in
modeling, such as with machine learning algorithms. The approach is very simple and flexible,
and can be used in a myriad of ways for extracting features from documents.
• A bag-of-words is a representation of text that describes the occurrence of words within a
document. It involves two things:
A vocabulary of known words.
A measure of the presence of known words.
• It is called a “bag” of words, because any information about the order or structure of words in
the document is discarded. The model is only concerned with whether known words occur in the
document, not where in the document.
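The two ingredients of a bag-of-words (a vocabulary of known words and a measure of their presence) can be sketched as below. This is a minimal example that uses raw word counts as the measure and whitespace splitting as the tokenizer; the function names and the sample documents are assumptions made for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the sorted set of distinct words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["good phone good price", "bad phone"]
vocabulary = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

# Word order is discarded: a reshuffled document gets the same vector.
shuffled = bag_of_words("price good phone good", vocabulary)

print(vocabulary)
print(vectors)
print(shuffled == vectors[0])
```

The final comparison illustrates why it is called a "bag": two documents with the same words in a different order are indistinguishable under this representation.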
Bag-of-Tokens/Words
The TF-IDF score can be computed for each word in each document of the corpus by multiplying
TF (term frequency) and IDF (inverse document frequency). Words with a higher score are more
important, and those with a lower score are less important.
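A minimal sketch of the TF-IDF computation follows. Several variants of the formula are in use; this example assumes TF is the word count divided by the document length and IDF is the natural logarithm of (number of documents / number of documents containing the word). The function name and sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


docs = ["good phone good price", "bad phone"]
scores = tf_idf(docs)

for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

Under these assumptions, "phone" appears in every document, so its IDF is log(1) = 0 and its score is zero, while "good" scores highest in the first document because it is frequent there and absent elsewhere.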
[Figure: levels of text analysis (Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)]
Syntactic analysis (Parsing): Noun Phrase, Complex Verb, Noun Phrase, Prep Phrase, Verb Phrase, Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: Chasing(d1,b1,p1) + "Scared(x) if Chasing(_,x,_)." gives Scared(b1).
Pragmatic analysis: A person saying this may be reminding another person to get the dog back…
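The semantic analysis and inference steps in the figure can be mimicked with a few lines of code. The sketch below stores the facts as tuples and applies the single rule "Scared(x) if Chasing(_,x,_)"; the tuple encoding and the function name are assumptions made for this example and do not represent a general logic engine.

```python
# Facts from the semantic analysis, encoded as (predicate, arguments...).
facts = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),
}


def infer_scared(known_facts):
    """Apply the rule: Scared(x) if Chasing(_, x, _)."""
    return {
        ("Scared", fact[2])  # second argument of Chasing is the one chased
        for fact in known_facts
        if fact[0] == "Chasing"
    }


derived = infer_scared(facts)
print(derived)
```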
The layers for processing/transforming
unstructured text data into structured data.
Source: https://www.lexalytics.com/technology/text-analytics
Sentiment Analysis
Text Mining
Sentiment Analysis Definition
Source: https://monkeylearn.com/sentiment-analysis/
Types of Sentiment Analysis Algorithms
Types of Algorithms
• Rule-Based
• Automatic Systems
• Hybrid Systems
Source: https://monkeylearn.com/sentiment-analysis/
Sentiment Analysis Process
Popular Sentiment Analysis Algorithms
1. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based
sentiment analysis tool that is specifically attuned to sentiments expressed in
social media. VADER uses a sentiment lexicon, which is a list of lexical …
2. The Liu and Hu algorithm is based on an opinion lexicon that contains around 6800 positive
and negative opinion words (sentiment words) for the English language. This list was
compiled over many years.
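The general idea behind lexicon-based approaches can be sketched as counting positive and negative words. The example below is not VADER and not the Liu and Hu implementation: the two word sets are tiny hypothetical stand-ins for a real opinion lexicon, and the scoring rule (positive count minus negative count) is an assumption chosen for simplicity.

```python
import re

# Hypothetical mini-lexicons; a real opinion lexicon has thousands of entries.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def lexicon_sentiment(text):
    """Return (score, label) where score = positive hits - negative hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    score = positive_hits - negative_hits
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label


print(lexicon_sentiment("Great camera, but terrible battery and poor screen"))
print(lexicon_sentiment("I love this excellent phone"))
```

A plain count like this ignores negation, intensity, and context, which is part of what rule-based tools such as VADER add on top of a lexicon.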