TEXT ANALYSIS
WHAT IS TEXT ANALYSIS?
❖Text analysis aims to derive high-quality information from unstructured text.
❖It parses text in order to extract machine-readable data from it.
❖It involves detecting and interpreting trends in order to gain insight from the
data.
Why do we need text analysis?
Data is being generated at an ever-increasing velocity.
Roughly 90% of data is unstructured, so it needs some form of exploratory
data analysis, data cleaning, and data wrangling before it can be used.
Structured Data: Structured data is data created using a predefined (fixed)
schema and is typically organized in a tabular format.
            Science   Math   IT
Student A   80        91     67
Student B   67        45     56
Unstructured Data: Unstructured data is information that either does not have a
predefined data model or is not organized in a predefined manner. It is
typically text-heavy, but may also contain data such as dates, numbers, and
facts.
Applications
•Chatbots:
Today, many companies use chatbots in their apps and websites to answer
customers' basic queries. This not only makes the process easier for the
companies but also spares customers the frustration of waiting to speak
with a customer service agent.
•Marketing:
Targeted advertising is a type of online advertising in which ads are shown to
users based on their online activity.
•Language detection:
For instance, if you go to Google Translate, the box you type in says
“Detect language”.
Word Tokenization
text = "This is a cat."
text.split()
>> ['This', 'is', 'a', 'cat.']
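Splitting on whitespace leaves punctuation attached to the neighbouring word ('cat.'). One possible refinement is sketched below, assuming a simple regular expression is good enough as a tokenizer. The function name tokenize_words and the pattern are illustrative choices, not part of the original slides.

```python
import re

def tokenize_words(text):
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches a single punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize_words("This is a cat.")
print(tokens)  # ['This', 'is', 'a', 'cat', '.']
```

Here the final period becomes its own token, which is usually what later steps such as tagging expect.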
Sentence Tokenization
text = "This product is very good and I recommend you all should try this."
# Naive split on the string 'and'; note that the surrounding spaces are kept
text = text.split('and')
>> ['This product is very good ', ' I recommend you all should try this.']
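Sentence tokenizers more commonly split on end-of-sentence punctuation than on a fixed word. A minimal sketch of that idea follows, assuming sentences end with ".", "!" or "?" followed by whitespace. The helper name split_sentences is hypothetical, and the sample text is a two-sentence variant of the example above.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows '.', '!' or '?'.
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences(
    "This product is very good. I recommend you all should try this."
)
print(sentences)
# ['This product is very good.', 'I recommend you all should try this.']
```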
Stemming and Lemmatization
Stemming and lemmatization are text normalization techniques in the field of
NLP.
Stemming: a process that chops off a suffix or prefix from a word in order to
get its base (root) form.
Example: removing the suffix ’s turns Rahul’s into Rahul.
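The suffix-chopping idea can be sketched in a few lines. This is a deliberately naive illustration, not an established stemming algorithm: the suffix list, the minimum stem length of three characters, and the name naive_stem are all assumptions made for the example.

```python
# Illustrative suffix list, checked in this order
SUFFIXES = ("'s", "ing", "ed", "es", "s")

def naive_stem(word):
    # Lowercase and normalize a typographic apostrophe to a plain one
    word = word.lower().replace("\u2019", "'")
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["Rahul's", "walking", "jumped", "ponies"]])
# ['rahul', 'walk', 'jump', 'poni']
```

The last result, 'poni', is not a real word, which shows the kind of error that blind suffix removal produces.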
Lemmatization: usually refers to doing things properly, using a vocabulary
and morphological analysis of words to return the dictionary form (lemma).
It addresses the main weakness of stemming, which can produce non-words.
Ponies → Pony and Privatization → Private
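A toy version of the vocabulary-plus-rules approach is sketched below. The tiny IRREGULAR dictionary and the two plural rules are assumptions made for illustration. A real lemmatizer relies on a full lexicon and part-of-speech information.

```python
# Tiny hand-written vocabulary of irregular forms (illustrative only)
IRREGULAR = {"mice": "mouse", "geese": "goose", "went": "go"}

def naive_lemmatize(word):
    w = word.lower()
    if w in IRREGULAR:  # vocabulary lookup comes first
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:  # ponies -> pony
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:  # cats -> cat
        return w[:-1]
    return w

print([naive_lemmatize(w) for w in ["Ponies", "mice", "went", "cats", "glass"]])
# ['pony', 'mouse', 'go', 'cat', 'glass']
```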
Sentence Boundary Disambiguation
Sentence boundary disambiguation (SBD) is the problem in natural language
processing of deciding where sentences begin and end.
NLP tools often require their input to be divided into sentences; however,
sentence boundary identification can be challenging due to the potential
ambiguity of punctuation marks. A period, for example, may end a sentence or
mark an abbreviation or a decimal point.
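One simple way to handle the ambiguous period is to consult a list of known abbreviations before declaring a boundary. The sketch below assumes whitespace-separated tokens and a very small, hand-picked abbreviation set. Both the set and the name find_sentences are illustrative.

```python
# Illustrative abbreviation list; a real system would need a much larger one
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

def find_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_stop = token[-1] in ".!?"
        # A period that belongs to a known abbreviation is not a boundary
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(find_sentences("Dr. Smith arrived late. Mr. Jones was waiting!"))
# ['Dr. Smith arrived late.', 'Mr. Jones was waiting!']
```

Abbreviations missing from the list, or abbreviations that happen to end a sentence, would still be mishandled, which is why the problem is described as a disambiguation task.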
POS tags (examples):
VB   verb (ask)
RB   adverb
•Chunking works on top of POS tagging: it takes POS tags as input and produces
chunks as output.
•Chunking is very important when you want to extract information from text, such
as locations and person names. In NLP this is called named entity extraction.
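import nltk  # the grammar below uses NLTK's RegexpParser syntax
sentence = "the little yellow dog barked at the cat"
# Define your grammar using regular expressions over POS tags
grammar = "NP: {<DT>?<JJ>*<NN>}"  # NP: optional determiner, adjectives, noun
chunker = nltk.RegexpParser(grammar)  # apply with chunker.parse(tagged_tokens)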
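To make the effect of the NP rule concrete, the sketch below imitates it with only the standard-library re module instead of a chunking library. The POS tags are assigned by hand (a real pipeline would get them from a tagger), and chunk_noun_phrases is a hypothetical helper name.

```python
import re

# Hand-assigned POS tags (assumption: normally produced by a POS tagger)
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Same rule as the grammar above: optional determiner, any adjectives, one noun
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

def chunk_noun_phrases(tagged_tokens):
    # Encode the tag sequence as one string, e.g. "<DT><JJ><JJ><NN>..."
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    # Character offset at which each token's tag starts in tag_string
    starts, pos = [], 0
    for _, tag in tagged_tokens:
        starts.append(pos)
        pos += len(tag) + 2
    chunks = []
    for m in NP_PATTERN.finditer(tag_string):
        # Map the matched character span back to token indices
        idx = [i for i, s in enumerate(starts) if m.start() <= s < m.end()]
        chunks.append(" ".join(tagged_tokens[i][0] for i in idx))
    return chunks

print(chunk_noun_phrases(tagged))  # ['the little yellow dog', 'the cat']
```

The verb and preposition ("barked", "at") fall outside every chunk because their tags do not fit the NP rule.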
SYNTACTIC & SEMANTIC ANALYSIS
SYNTACTIC ANALYSIS:
This step decodes the syntactic structure of the given sentence to understand the
grammar and the relationships between the words.
SEMANTIC ANALYSIS:
Semantic analysis is the process of drawing meaning from text. It allows computers
to understand and interpret sentences, paragraphs, or whole documents by
analyzing their grammatical structure and identifying relationships between
individual words in a particular context.
Relationship Extraction
This task consists of detecting the semantic relationships present in a text.
Relationships usually involve two or more entities (which can be names of
people, places, company names, etc.). These entities are connected through a
semantic category, such as “works at,” “lives in,” “is the CEO of,”
“headquartered at.”
For example, the phrase “Steve Jobs is one of the founders of Apple, which is
headquartered in California” contains two different relationships:
Steve Jobs [founder of] Apple
Apple [headquartered in] California
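A very simple way to approximate this task is with hand-written surface patterns. The sketch below assumes entities are runs of capitalized words and uses one regular expression per relation. The ENTITY pattern, the relation labels, and extract_relations are illustrative. Practical systems typically combine a named-entity recognizer with a trained model.

```python
import re

# Crude stand-in for a named-entity recognizer: one or more capitalized words
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# One hand-written surface pattern per relation label (illustrative only)
RELATION_PATTERNS = {
    "founder_of": re.compile(
        rf"(?P<subj>{ENTITY}) is one of the founders of (?P<obj>{ENTITY})"),
    "headquartered_in": re.compile(
        rf"(?P<subj>{ENTITY}), which is headquartered in (?P<obj>{ENTITY})"),
}

def extract_relations(sentence):
    triples = []
    for label, pattern in RELATION_PATTERNS.items():
        for m in pattern.finditer(sentence):
            triples.append((m.group("subj"), label, m.group("obj")))
    return triples

text = ("Steve Jobs is one of the founders of Apple, "
        "which is headquartered in California")
for triple in extract_relations(text):
    print(triple)
# ('Steve Jobs', 'founder_of', 'Apple')
# ('Apple', 'headquartered_in', 'California')
```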
Lexical chaining: a lexical chain is a sequence of related words in a text.
For example, identifying <hit, home, run> as a lexical chain makes the text
more likely to be understood as being about sports rather than domestic
violence.