FUNDAMENTALS OF TEXT ANALYSIS
WHAT IS TEXT ANALYSIS?
❖ Text analysis aims to derive high-quality information from unstructured text.

❖ Text analysis is about parsing text in order to extract machine-readable
data from it.

❖ It involves detecting and interpreting trends in order to gain insight from
the data.
Why do we need text analysis?
Data is being generated at an exponentially growing rate.

90% of data is unstructured, so it needs some form of exploratory data
analysis, data cleaning, and data wrangling before it can be used.
Structured Data: Data created using a predefined (fixed) schema, typically
organized in a tabular format.

            Science   Math   IT
Student A   80        91     67
Student B   67        45     56

Unstructured Data: Information that either does not have a predefined data
model or is not organized in a predefined manner. Unstructured information is
typically text-heavy, but may also contain data such as dates, numbers, and
facts.
Applications
• Chatbots:
Today, many companies use chatbots in their apps and websites to answer
customers' basic queries. This makes the process easier for the companies and
also saves customers the frustration of waiting to reach a call-center agent.

• Search Autocorrect and Autocomplete:
Whenever you search for something on Google, it suggests possible search terms
after you type 2-3 letters. If your query contains typos, it corrects them and
still finds relevant results for you.

• Marketing:
Targeted advertising is a type of online advertising in which ads are shown to
users based on their online activity.

• Hiring and Recruitment:
Recruiters need to review hundreds or sometimes thousands of resumes for a
single position. Filtering resumes and short-listing candidates by hand can
take hours.
Language identification
Language identification can be an important step in an NLP problem.
It involves predicting the natural language of a piece of text.

This is a life. ------ English

Yahee jeevan hai. ------ Hindi

For instance, if you go to Google Translate, the box you type in says
“Detect language”.

Abbreviation-based approach: the abbreviation of a language (such as 'en' for
English) is used to identify the language.
spacy.load('en')  # shortcut name; spaCy 3 and later expect a full model name such as 'en_core_web_sm'
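As a rough illustration of predicting the language of a piece of text, the sketch below counts how many words of the input appear in a small word list per language and picks the language with the most hits. The word lists and the function name identify_language are hypothetical and far too small for real use; production systems typically rely on character n-gram statistics or trained models.

```python
import re

# Hypothetical, deliberately tiny word lists (assumption, for illustration only).
COMMON_WORDS = {
    "English": {"this", "is", "a", "the", "and", "of", "to"},
    "Hindi": {"yahee", "hai", "aur", "ka", "ki", "mein", "nahin"},  # romanized
}

def identify_language(text):
    """Return the language whose word list overlaps most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {lang: len(words & vocab) for lang, vocab in COMMON_WORDS.items()}
    best = max(scores, key=scores.get)
    # If no listed word was seen at all, refuse to guess.
    return best if scores[best] > 0 else "unknown"

print(identify_language("This is a life."))
print(identify_language("Yahee jeevan hai."))
```

With these lists, the two example sentences from the slide are labelled English and Hindi respectively; any text that shares no word with either list is reported as "unknown".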
What is Tokenization in NLP?
Tokenization is essentially splitting a phrase, sentence, paragraph, or an
entire text document into smaller units, such as individual words or terms.
Each of these smaller units is called a token.

Word Tokenization
text = "This is a cat."
text.split()
>> ['This', 'is', 'a', 'cat.']

Sentence Tokenization
text = "This product is very good and I recommend you all should try this."
text.split('and')
>> ['This product is very good ', ' I recommend you all should try this.']
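Splitting on whitespace leaves punctuation attached to the neighbouring word (the token 'cat.' above). A minimal sketch of one common alternative, assuming a regular expression is acceptable: treat each run of word characters as a token and each punctuation mark as its own token. The function name tokenize is illustrative.

```python
import re

def tokenize(text):
    # \w+      matches a run of letters, digits or underscores (a word)
    # [^\w\s]  matches a single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("This is a cat.")
print(tokens)  # ['This', 'is', 'a', 'cat', '.']
```

This keeps the final period as a separate token, which is usually what later steps such as POS tagging expect.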
Stemming and Lemmatization
Stemming and lemmatization are text normalization techniques in the field of
NLP.
Stemming: a process that chops off the suffix or prefix of a word in order to
get its base (root) form.

Rule    Example         Root word
ies     ponies          poni
's      Rahul's         Rahul
ex-     ex-president    president

Lemmatization: usually refers to doing things properly, with the use of a
vocabulary and morphological analysis of words.
It addresses the problems of stemming, such as stems that are not real words.
Ponies → Pony and Privatization → Private
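The three rules in the table can be sketched as a toy stemmer, contrasted with a dictionary lookup standing in for lemmatization. This is an illustration under stated assumptions, not a real stemming algorithm such as Porter's: the function names and the one-entry LEMMAS dictionary are hypothetical.

```python
def stem(word):
    """Apply the three suffix/prefix rules from the table, nothing more."""
    if word.startswith("ex-"):
        word = word[len("ex-"):]      # ex-president -> president
    if word.endswith("'s"):
        word = word[:-2]              # Rahul's -> Rahul
    elif word.endswith("ies"):
        word = word[:-3] + "i"        # ponies -> poni
    return word

# Hypothetical mini vocabulary; a real lemmatizer uses a full dictionary
# plus morphological analysis.
LEMMAS = {"ponies": "pony"}

def lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMAS.get(word.lower(), word)

print(stem("ponies"), lemmatize("ponies"))  # poni pony
```

The stemmer produces "poni", which is not a word, while the vocabulary lookup returns "pony".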
Sentence Boundary Disambiguation
Sentence boundary disambiguation (SBD) is the problem in natural language
processing of deciding where sentences begin and end.
NLP tools often require their input to be divided into sentences; however,
sentence boundary identification can be challenging due to the potential
ambiguity of punctuation marks.

A period (.) may indicate the end of a sentence, or may denote an
abbreviation, a decimal point, or part of an email address, among other
possibilities.

Sentence

'It was due Friday by 5 p.m. Saturday would be too late.'

'My name is Jonas E. Smith. Please turn to p. 55.'
>> ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
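A minimal rule-based sketch of how the ambiguity of the period might be handled. It assumes two hand-written abbreviation lists (both hypothetical and far from complete): abbreviations that never end a sentence, and abbreviations that end a sentence only when the next word is capitalized. Single capital letters followed by a period are treated as initials.

```python
# Hypothetical abbreviation lists (assumption); real systems use much larger ones.
NEVER_END = {"mr.", "mrs.", "dr.", "p."}   # titles, page references
MAYBE_END = {"a.m.", "p.m.", "etc."}       # boundary only before a capital

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        lowered = token.lower()
        # "E." in "Jonas E. Smith" is an initial, not a sentence end.
        is_initial = len(token) == 2 and token[0].isupper() and token[1] == "."
        if lowered in NEVER_END or is_initial:
            continue
        if lowered in MAYBE_END and not next_token[:1].isupper():
            continue
        sentences.append(" ".join(current))
        current = []
    if current:  # keep any trailing words that did not end with punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("It was due Friday by 5 p.m. Saturday would be too late."))
print(split_sentences("My name is Jonas E. Smith. Please turn to p. 55."))
```

These rules split both example texts into two sentences each, but they are easy to fool (for instance, a sentence that genuinely ends in a single capital letter), which is why practical tools combine larger lists with statistical models.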


Parts of Speech (POS)
Part-of-Speech (POS) tagging is the process of assigning labels, known as POS
tags, to the words in a sentence; each tag tells us the part of speech of the
word.

Example:

Why   not   tell   someone   ?
WRB   RB    VB     NN

Tags    Lexical terms
NN      noun, singular (cat, tree)
NNS     noun, plural (desks)
NNP     proper noun, singular (Sachin)
NNPS    proper noun, plural (Indians or Americans)
PRP     personal pronoun (hers, herself, him, himself)
PRP$    possessive pronoun (her, his, mine, my, our)
VB      verb (ask)
RB      adverb
WRB     wh-adverb (how)
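The simplest possible tagger looks each word up in a lexicon and falls back to a default tag. The sketch below assumes a hypothetical hand-made LEXICON built from the words in the table; real taggers learn from annotated corpora and use context.

```python
# Hypothetical mini lexicon (assumption), using the tags from the table.
LEXICON = {
    "why": "WRB", "how": "WRB",
    "not": "RB",
    "tell": "VB", "ask": "VB",
    "someone": "NN", "cat": "NN", "tree": "NN",
    "desks": "NNS",
}

def tag(tokens, default="NN"):
    """Pair each token with its lexicon tag, or the default if unknown."""
    return [(token, LEXICON.get(token.lower(), default)) for token in tokens]

tagged = tag("Why not tell someone".split())
print(tagged)
```

Defaulting unknown words to NN is a common baseline choice, since nouns are the most frequent open word class, but it is only a starting point.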


Markov Chains
• Consider the sentence “Why not tell someone?”. Imagine the sentence is
truncated to “Why not tell …” and we want to determine whether the next word
in the sentence is a noun, verb, adverb, or some other part of speech.
• If you are familiar with English, you would instantly identify the verb
“tell” and assume that it is more likely to be followed by a noun than by
another verb.

In this model, the future state x + 1 is predicted on the basis of the
current state x alone.
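The idea of predicting state x + 1 from state x can be sketched by counting tag-to-tag transitions in tagged sentences and normalizing the counts into probabilities. The four tag sequences below are invented for illustration (assumption); only the first corresponds to the example sentence.

```python
from collections import Counter, defaultdict

# Hypothetical tag sequences (assumption); the first is "Why not tell someone".
TAG_SEQUENCES = [
    ["WRB", "RB", "VB", "NN"],
    ["PRP", "VB", "NN"],
    ["PRP", "VB", "RB"],
    ["NNP", "VB", "NN"],
]

def transition_probs(sequences):
    """Estimate P(next tag | current tag) from adjacent tag pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {tag: n / sum(followers.values()) for tag, n in followers.items()}
        for current, followers in counts.items()
    }

probs = transition_probs(TAG_SEQUENCES)
# Most likely tag after a verb, using only the current state.
next_tag = max(probs["VB"], key=probs["VB"].get)
print(probs["VB"], next_tag)
```

In this toy data a verb is followed by a noun in three of four cases, so the chain predicts NN after “Why not tell …”.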
Chunking
• Chunk extraction, or partial parsing, is the process of extracting short,
meaningful phrases from a sentence that has been tagged with parts of speech.

• It extracts phrases from unstructured text. Simple tokens may not represent
the actual meaning of the text, so it is advisable to treat a phrase such as
“South Africa” as a single unit instead of the separate words ‘South’ and
‘Africa’.

• Chunking works on top of POS tagging: it takes POS tags as input and
produces chunks as output.

• Chunking is very important when you want to extract information such as
locations and person names from text. In NLP this is called Named Entity
Extraction.
sentence = "the little yellow dog barked at the cat"
# Define your grammar using regular expressions
grammar = ('''NP: {<DT>?<JJ>*<NN>}  # NP''')
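To show what the NP rule does without depending on a tagger or a chunking library, the sketch below applies the same pattern (optional determiner, any number of adjectives, then a noun) to a hand-tagged version of the sentence using Python's re module. The tags are written by hand (assumption), and np_chunks is an illustrative name, not a library function.

```python
import re

# Hand-assigned POS tags (assumption) so that no tagger model is needed.
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

def np_chunks(tagged_tokens):
    """Return noun phrases matching <DT>?<JJ>*<NN> over the tag sequence."""
    # Encode the tags as one string, e.g. "<DT><JJ><JJ><NN>...".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Each "<" marks one token, so counting them maps back to word indices.
        start = tag_string[:match.start()].count("<")
        length = match.group().count("<")
        words = [word for word, _ in tagged_tokens[start:start + length]]
        chunks.append(" ".join(words))
    return chunks

print(np_chunks(tagged))  # ['the little yellow dog', 'the cat']
```

The verb and preposition fall outside every match, so only the two noun phrases are returned as chunks.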
SYNTACTIC & SEMANTIC ANALYSIS
SYNTACTIC ANALYSIS:
This step decodes the syntactic structure of a given sentence to understand
its grammar and the relationships between the words.

“The cat (noun phrase) went away (verb phrase).”

Cat → went away

SEMANTIC ANALYSIS:
Semantic analysis is the process of drawing meaning from text. It allows
computers to understand and interpret sentences, paragraphs, or whole
documents by analyzing their grammatical structure and identifying
relationships between individual words in a particular context.

It can understand that a text is about “politics” and “economics” even if it
does not contain those actual words but only related concepts such as
“election”, “budget”, “tax”, or “inflation”.
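A very rough stand-in for the topic example above: map related concepts to a topic and report every topic whose concepts occur in the text. The concept-to-topic assignment and the extra terms are assumptions made for illustration; real semantic analysis relies on learned representations rather than fixed keyword lists.

```python
import re

# Hypothetical concept lists (assumption); the slide names the concepts but
# does not say which topic each one belongs to.
TOPIC_TERMS = {
    "politics": {"election", "vote", "parliament"},
    "economics": {"budget", "tax", "inflation"},
}

def detect_topics(text):
    """Return the topics that have at least one related concept in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {topic for topic, terms in TOPIC_TERMS.items() if words & terms}

print(detect_topics("The election result may change the budget and the tax rate."))
```

The sentence never mentions "politics" or "economics", yet both topics are detected through the related concepts.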
Word Sense Disambiguation
Word sense disambiguation is the automated process of identifying the sense
in which a word is used, according to its context.
Natural language is sometimes ambiguous: the same word can have different
meanings depending on how it is used.
The word “orange,” for example, can refer to a color or a fruit, among other
senses.
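One classic family of approaches picks the sense whose description shares the most words with the surrounding context (the idea behind the Lesk algorithm). The sketch below is a heavily simplified version for the word “orange”; the cue words per sense are hypothetical and hand-picked.

```python
import re

# Hypothetical cue words per sense of "orange" (assumption).
SENSES = {
    "color": {"color", "paint", "shade", "bright", "red", "yellow"},
    "fruit": {"fruit", "eat", "ate", "juice", "peel", "peeled", "sweet"},
}

def disambiguate(sentence):
    """Choose the sense whose cue words overlap most with the sentence."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    # On a tie, max() returns the first sense listed in SENSES.
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate("I peeled the orange and ate it"))              # fruit
print(disambiguate("She painted the wall a bright orange shade"))  # color
```

The result depends entirely on the cue lists, so a context with no cue words falls back to the first listed sense.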

Relationship Extraction
This task consists of detecting the semantic relationships present in a text.
Relationships usually involve two or more entities (which can be names of
people, places, companies, etc.). These entities are connected through a
semantic category, such as “works at,” “lives in,” “is the CEO of,” or
“headquartered at.”

For example, the phrase “Steve Jobs is one of the founders of Apple, which is
headquartered in California” contains two different relationships:

Steve Jobs   founder of         Apple        Person  --> Company
Apple        headquartered in   California   Company --> Place
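The two relationships in the example can be pulled out with hand-written patterns, as sketched below. The patterns are tailored to this one sentence and are an assumption for illustration; practical systems combine named entity recognition with learned relation classifiers.

```python
import re

sentence = ("Steve Jobs is one of the founders of Apple, "
            "which is headquartered in California")

# Hypothetical patterns (assumption): capitalized words stand in for entities.
PATTERNS = [
    ("founder of",
     re.compile(r"(?P<a>[A-Z]\w+(?: [A-Z]\w+)*) is one of the founders of (?P<b>[A-Z]\w+)")),
    ("headquartered in",
     re.compile(r"(?P<a>[A-Z]\w+), which is headquartered in (?P<b>[A-Z]\w+)")),
]

def extract_relations(text):
    """Return (entity, relation, entity) triples found by the patterns."""
    triples = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("a"), label, match.group("b")))
    return triples

relations = extract_relations(sentence)
print(relations)
```

Each triple corresponds to one row of the table above: a person linked to a company, and a company linked to a place.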
Sentence Chaining / Lexical Chain
A lexical chain is a sequence of related words in a text. You can think of
lexical chains as n-grams that are chosen much more carefully, so that they
represent fragments of a concept.
It has been shown that using them as features lets you summarize a document
more accurately, or better decide whether two documents discuss the same
topic.

Consider the sentence

John hit a home run.

Lexical chaining might identify <hit, home, run> as a lexical chain, making
the sentence more likely to be understood as being about sports concepts and
not domestic violence.
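A toy sketch of the idea: group the words of a text by the concept they relate to and keep only groups with more than one word, since a single word does not form a chain. The RELATED lexicon is hypothetical; real lexical chaining typically draws word relations from a thesaurus such as WordNet.

```python
import re

# Hypothetical concept lexicon (assumption).
RELATED = {
    "sports": {"hit", "home", "run", "bat", "pitch", "inning"},
    "violence": {"hit", "punch", "hurt", "attack"},
}

def lexical_chains(text):
    """Return, per concept, the in-order words of the text related to it."""
    words = re.findall(r"[a-z]+", text.lower())
    chains = {}
    for concept, vocabulary in RELATED.items():
        chain = [word for word in words if word in vocabulary]
        if len(chain) > 1:  # one isolated word is not a chain
            chains[concept] = chain
    return chains

print(lexical_chains("John hit a home run."))  # {'sports': ['hit', 'home', 'run']}
```

Although “hit” also appears in the violence list, only the sports concept gathers enough related words to form a chain.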
