Text Mining & Web Mining: N.P. Singh Professor


Text Mining & Web Mining

N.P. Singh
Web Mining & Text Mining
Text & web mining are two sub-areas of data mining
both being focused on less structured data.
Reason for growth:
Size of unstructured data: about 85% of an organization's data is text or otherwise less structured.
Text Mining
Motivation for text Mining
Approximately 90% of the world's data is held in unstructured formats:
Web pages
Technical documents
Digital Libraries
Customer complaint letters
Transcripts of phone calls with customers
Growing rapidly in size and importance
Challenges of Text Mining
Information is in unstructured textual form
Not well structured text
Email/Chat rooms
- r u available ?
- Hey whazzzzzz up
Large textual data base.
Very high number of possible dimensions (but sparse):
all possible word and phrase types in the language.
Complex and subtle relationships between concepts in text
AOL merges with Time-Warner
Time-Warner is bought by AOL
Challenges of Text Mining
Word ambiguity and context sensitivity
automobile = car = vehicle = Toyota
Apple (the company) or apple (the fruit).
Word ambiguity
- Pronouns (he, she )
- Synonyms (buy, purchase)
- Words with multiple meanings (bat is related to baseball or
Semantic ambiguity
The king saw the rabbit with his glasses (multiple meanings)
Noisy data
Example: Spelling mistakes
Business Opportunity

ref: www.stanford.edu/class/cs276b/handouts/lecture10.ppt
Why Biology Text Mining?
Strong motivations from biology side
Difficulty for biologists to access literature
No theory in biology, so we must keep all literature alive
Observations about the same biology mechanism may be described
in different terms (e.g., due to different perspectives of study)
Many unanswered research questions
Text mining may help better organize, link biology literature,
and answer simple questions (e.g., what do we know about
this gene? )
Why Biology Text Mining? (cont.)
Potentially high impact from CS side
Any discovery from biology text could be potentially
Biology text is relatively easy for mining
Literature is cleaner (compared with web data)
Biology text often has many annotations
Many other kinds of biology data can be exploited (e.g.,
DNA/Protein sequences, gene expression information, metabolic
Simple techniques may work
Characteristics of Biology Text
Large number of entities (e.g., genes, proteins) that
have well-defined semantics
No standard for terminology (inconsistencies)
Ambiguities (e.g., many acronyms)
High complexity in phrases and sentence structures
Research Topics
General goal: Applying known text mining techniques to help biology
Problem 1: Data/Information Integration
How can we integrate text information (discovering terminology linkages)
How can we link text with databases (semantic interpretations of text on top of
entities/relations in DB, e.g., entity extraction)
How can we integrate biology DBs (many fields are text)
Problem 2: Functional annotations
How can we annotate a biological entity (e.g., a gene) with functional
information extracted from literature
How can we annotate a set of related genes with functional information
How can we exploit the ontologies/thesauri in biology?
Research Topics (cont.)
Problem 3: Data/Information Cleanup & Curation
How can we detect suspicious data/information in existing databases?
How can we automate many manual tasks of database curation?
Problem 4: Research question answering
How can we answer simple research questions? (e.g., what functional
connections are there between these two genes?)
How can we support exploratory access and digest of literature
information? (e.g., a biology research workbench)
Swanson Example

[Figure: two bodies of literature, "All Migraine Research" and "All Nutrition Research", linked through calcium channel blockers, platelet aggregability, spreading cortical depression and magnesium.]

Ref 1: www.stanford.edu/class/cs276b/handouts/lecture10.ppt
Ref 2: Swanson, D. R. (1988). Migraine and magnesium: Eleven neglected connections.
Perspectives in Biology and Medicine, 31, 526-557
Swanson Example

Stress can lead to a loss of magnesium

Magnesium is a natural calcium channel blocker

High levels of magnesium inhibit SCD

Magnesium can suppress platelet aggregability

Stress is associated with migraines

Calcium channel blockers prevent some migraines
SCD is implicated in some migraines
Migraine patients have high platelet aggregability
Ref: Swanson, D. R. (1988). Migraine and magnesium: Eleven neglected connections.
Perspectives in Biology and Medicine, 31, 526-557
Swanson Example

Hypothesis: Magnesium Deficiency related to Migraine

Found by extracting features from medical literature on

Migraines and Nutrition

Three of his hypotheses received experimental


Information that not even the writer knows

Literature might be full of such undiscovered connections
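The linking idea behind the Swanson example can be sketched in a few lines of Python. This is only an illustration: the two dictionaries paraphrase the statements on the slides above, the concept labels are chosen here for the sketch, and the function name `linking_concepts` is hypothetical. It is not Swanson's actual procedure.

```python
# Statements about magnesium and about migraine, keyed by the intermediate
# concept they mention. The keys are labels assumed for this sketch.
magnesium_facts = {
    "stress": "Stress can lead to a loss of magnesium",
    "calcium channel blockers": "Magnesium is a natural calcium channel blocker",
    "SCD": "High levels of magnesium inhibit SCD",
    "platelet aggregability": "Magnesium can suppress platelet aggregability",
}
migraine_facts = {
    "stress": "Stress is associated with migraines",
    "calcium channel blockers": "Calcium channel blockers prevent some migraines",
    "SCD": "SCD is implicated in some migraines",
    "platelet aggregability": "Migraine patients have high platelet aggregability",
}


def linking_concepts(a_facts, c_facts):
    """Return the intermediate concepts mentioned in both sets of facts."""
    return sorted(set(a_facts) & set(c_facts))


for concept in linking_concepts(magnesium_facts, migraine_facts):
    print(concept, "|", magnesium_facts[concept], "|", migraine_facts[concept])
```

Each shared concept is a candidate bridge between two literatures that do not cite each other directly, which is what makes the magnesium and migraine hypothesis discoverable from text alone.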

2nd Case
Social media competitive analysis and text mining: A
case study in the pizza industry
Continued. Objectives
This study examined the social media sites of the three
largest pizza chains and applied text mining to analyze
unstructured text content on their Facebook and Twitter
sites. Specifically, the study attempts to answer the
following questions:
What patterns can be found from their Facebook sites
What patterns can be found from their Twitter sites
What are the main differences in terms of their Facebook
and Twitter patterns?
Continued Methodology
U.S. Pizza industry is one of the first industries that has
entered the social media arena for business purposes
and has a large social media user base
This is the reason social media competitive analysis is

Three largest pizza chains in our case study: Pizza Hut, Domino's Pizza and Papa John's Pizza

Text mining process for social media
Analysis: Market share
SN  Pizza chain   Market share
1   Pizza Hut     11.65%
2   Domino's      7.60%
3   Papa John's   4.23%
4   Others
    Total         100%
Continued Promotion of Business
Pizza businesses promote sales to customers through
various marketing channels such as direct mail,
newspaper, magazines, print coupon, TV advertising.
Due to the rapid development of the Internet and the

widespread use of Facebook, Twitter and YouTube by

customers, more and more pizza stores are promoting
their pizza business via social media.
More facts
People actively look for specific media outlets and
information for gratification purposes.
As social media becomes an increasingly popular media

outlet among consumers.

Nearly half of the survey participants had looked for a restaurant recommendation by reading online reviews and information posted on blogs, Facebook and Twitter.
85% of pizza-chain sales are now tied to promotions and discounts, mostly acquired through social media sites.
Many pizza restaurants such as Pizza Hut and Domino's have assigned specific staff members with responsibilities to engage customers and build an online community.
Continued More facts
By using these social media applications, customers
can engage in activities such as customizing pizzas,
discussing pizza quality, tastes and deal information
with peer customers, giving praise and complaints,
providing feedback to pizza seller.
On the other hand, many pizza restaurants are using

social media as a customer service tool to listen to

customers and address their concerns.
Currently, large pizza chains are focusing their social

media use on Facebook and Twitter.

Process of Analysis
To answer the research questions, we conducted a social media
competitive analysis for the Facebook and Twitter sites of the Big Three
by following two phases.
First, quantitative data was collected manually from their individual social media sites, such as number of fans/followers, number of postings, comments, shares and likes, and frequency of posting.
Second, text mining was applied to analyze the text messages posted on their Facebook and Twitter sites in order to discover new knowledge and patterns, and to acquire a deeper understanding of how the three pizza chains are using social media in practice.
As October is the busiest month of the year in the pizza industry, the study used the posts collected between October 1, 2011 and October 31, 2011 as the sample for text mining.
The posts were saved into Excel spreadsheets for analysis.
Continued.. Tools used
Raw data was transformed into a usable format, mainly
by cleaning, assigning attributes, and integrating data.
Subsequently, data mining and text mining techniques were applied to examine the data sets in order to gain insights about participants' social media activities.
Two leading tools in textual data analysis and mining, SPSS Clementine text mining tool and NVivo 9, were used to facilitate the mining and analysis.
Trend of tweet numbers in October for the Big Three.
Pizza Hut's customer engagement trend in October 2011.
Domino's customer engagement trend in October 2011.
Papa John's customer engagement trend in October 2011.
Examples related to ordering and
Examples related to the quality of
their pizzas.
A summary of the six main themes on
Facebook sites.
Rule Mining
Link Analysis & Content Analysis
Enron Fraud Case- Analysis of E-mail Data set.
Link analysis can extract dynamic movies of the evolution of social networks, identify gatekeepers and other central actors, and generate temporally correlated cluster maps of e-mail content.
E-mail dataset: 517,431 messages belonging to 150 mailboxes.
Enron Case Three Approaches to get
Some Meaningful Information
Filtering out messages with potentially suspicious
contents, and then focusing on the social network
created by those messages,
Doing a large-scale social network analysis, judging

actors by their closeness to suspicious people, and

Searching for clusters of suspicious activity, by looking

for what we term collaborative innovation networks

Enron Case Content Analysis
Objective: To reduce the size of data. To keep only the data of
There is a common language of evil e-mails. It is not consistent, but one can make sense of it through patterns.

Package v/s Bombs etc

Following words or combinations were used to filter the e-mails.

Affairs (Criminals do not use clear words)

Devastating (What is coming up they know)
Investigation (Dangerous things)
Disclosures (Dangerous things)
Bonus (Most important thing)
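The filtering step above can be sketched as a simple keyword match. The sketch below assumes the five words on the slide are the whole filter list; the sample messages and the names `matched_terms` and `filter_messages` are invented for illustration and are not taken from the Enron study.

```python
import re

# Filter words taken from the slide; treating them as the complete list is an assumption.
SUSPICIOUS_TERMS = ["affairs", "devastating", "investigation", "disclosures", "bonus"]
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUSPICIOUS_TERMS)) + r")\b",
    re.IGNORECASE,
)


def matched_terms(message):
    """Return the distinct filter words found in a message, lower-cased and sorted."""
    return sorted({match.lower() for match in PATTERN.findall(message)})


def filter_messages(messages):
    """Keep only messages that contain at least one filter word."""
    hits = []
    for message in messages:
        terms = matched_terms(message)
        if terms:
            hits.append((message, terms))
    return hits


# Invented sample messages, for illustration only.
sample_messages = [
    "Lunch at noon?",
    "The investigation could be devastating for the bonus pool.",
    "Please review the quarterly disclosures.",
]

for message, terms in filter_messages(sample_messages):
    print(terms, "<-", message)
```

The surviving messages would then feed the social network analysis, which is how the content filter reduces the size of the data.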
View of these communications
Content Combination
Concept Map of Interest
Few Actors of Enron
Birds of a Feather Flock Together
What is relation between Actors
Establishing Relations using Data
Mining Tools
Common Ways to establish Relation
A relationship between two actors exists if there is direct communication between them.
The more they communicate, the more intensive is their relationship.

Enron Case:
Kenneth Lay and Tim Belden: both actors played a central role at Enron during the 2001 Californian energy crisis.
But there is no direct communication between them.
They do have 13 common communication partners:
David Forster, David Oxley, Jeff Skilling, Karen Denne, Liz Taylor, Mark Palmer, Philipp Allen, Richard Shapiro, Sally Beck, Sarah Novosel, and Steven Kean,
out of which 6, namely David Delainey, Greg Whalley, Jeff Skilling, Richard Shapiro, Sally Beck, and Steven Kean, appear in our Enron main actor list.
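Finding actors who never write to each other but share communication partners can be sketched as a set intersection over a contact graph. The message pairs and actor labels below ("X", "Y", "P1" and so on) are placeholders invented for this sketch, and the function names are hypothetical; no real Enron data is used.

```python
from collections import defaultdict


def build_contacts(messages):
    """Map each actor to the set of actors they exchanged mail with (either direction)."""
    contacts = defaultdict(set)
    for sender, recipient in messages:
        contacts[sender].add(recipient)
        contacts[recipient].add(sender)
    return contacts


def common_partners(contacts, a, b):
    """Return the actors that communicate with both a and b."""
    return sorted(contacts[a] & contacts[b])


# Placeholder (sender, recipient) pairs, for illustration only.
toy_messages = [
    ("X", "P1"),
    ("P1", "Y"),
    ("X", "P2"),
    ("P2", "Y"),
    ("X", "P3"),
]

contacts = build_contacts(toy_messages)
print("direct link:", "Y" in contacts["X"])
print("common partners:", common_partners(contacts, "X", "Y"))
```

The common partners are the candidate gatekeepers between the two actors.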
Continued Gatekeepers Between Lay &
Searching Innovation
Three Types of communities work together to form an
ecosystem of interconnected communities
COINs (Collaborative Innovation Networks)
CLNs (Collaborative Learning Networks)
CINs (Collaborative Interest Networks)
COINs (Collaborative Innovation
COINs (Collaborative
Innovation Networks) develop
around a small core group of
people over time.
Around the core team, there
are people linked to only one
or two of the core team
A COIN has high density and
relatively low group between
A typical visualisation of a CLN
In a CLN (Collaborative Learning
Network), a small group of subject
matter experts talking among
themselves is developing around the
coordinator in the center of the
graph, who builds a learning
The communication activities are
arranged around them,
communicating with a large group
of other community members, who
are not communicating among
In a CIN (Collaborative
Interest Network) there are
different small teams,
operating as isolated islands.
Over time, the structural holes
are filling up, until the
network is almost fully
There is no clear center in this
graph, different people are
acting as local hubs.
Three Communities with Jeffrey
Skilling at Center
As innovation can be done for good or for bad, discovering potential COINs is most interesting for intelligence gathering through e-mail data analysis.
The next figure shows two COINs in the Enron e-mail data.
Almost all the suspicious people appear in the bigger of the two COINs.

Two COINS of Enron E- Mails
Further Analysis
Actors can have certain roles within their respective
communities, e.g. as a gatekeeper, who connects at least two
communities, or a leader, or a knowledge expert.
Besides identifying those roles visually in the social network

graph, they can also be found by calculating their contribution

Roles of different actors can be obtained by measuring differences in their contribution frequency (measured in the number of messages sent) and the extent to which their communication is balanced between sending and receiving messages, which we measured via a simple contribution index:
(messages sent - messages received) / (messages sent + messages received)
This index is -1 for somebody who only receives messages, 0 for somebody who sends and receives the same number of messages, and +1 for somebody who only sends messages.
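The contribution index is easy to compute once sent and received counts are known. A minimal sketch follows; the function name `contribution_index` and the handling of actors with no messages are assumptions made here, not part of the original study.

```python
def contribution_index(sent, received):
    """(sent - received) / (sent + received), ranging from -1 to +1."""
    total = sent + received
    if total == 0:
        # Assumption: the index is undefined for an actor with no traffic.
        raise ValueError("actor has neither sent nor received messages")
    return (sent - received) / total


print(contribution_index(0, 10))   # only receives
print(contribution_index(5, 5))    # balanced
print(contribution_index(10, 0))   # only sends
```

Plotting this index against the number of messages sent is one way to separate heavy senders from balanced communicators.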
Most suspects involved in the COIN structure above are now
located in the blue circle. It is straightforward to recognize
communities and their leaders.
It is anticipated that a similar approach can be applied to correlate suspicious activity with temporal communication structure.
This way, compliance analysts and members of the legal community can track emerging trends in e-mail conversations: who is writing to whom about what, when, and where they are writing from.
Government, intelligence and law enforcement professionals will be able to analyze, in near-real time, conversations and links between e-mail traffic, blogs, message boards and other network-based communication, discovering hidden links and emerging trends.
Web Mining
Web Structure Mining:
Web structure mining is concerned with discovering the model underlying the link structure of the web.
It is used to study the topology of the hyperlinks, with or without the description of the links.

Model can be used to categorize web pages and is useful to
generate information similarity & relationships between web
Algorithm of Web Structure Mining
Page Rank Algorithm
HITS (Hyperlink- Induced Topic Search)
Web Mining
Web Content mining
Web content consists of several types of data, e.g. textual, image, audio,
video, metadata, as well as hyperlinks
Tools based on Two approaches of Mining
Agent-Based Approach
Database Approach
Agent-Based Approach
The agent-based approach to Web mining involves
the development of sophisticated AI systems that can
act autonomously or semi-autonomously on behalf of
a particular user, to discover and organize Web-based
Intelligent Search Agents : Several intelligent Web agents
have been developed that search for relevant information
using characteristics of a particular domain (and possibly a
user profile) to organize and interpret the discovered
Harvest, FAQ-Finder, Information Manifold, Parasite
Information Filtering/Categorization
A number of Web agents use various information retrieval
techniques and characteristics of open hypertext Web
documents to automatically retrieve, filter, and categorize
them. HyPursuit
Personalized Web Agents
This category of Web agents includes those that obtain or learn user
preferences and discover Web information sources that
correspond to these preferences, and possibly those of other
individuals with similar interests (using collaborative
filtering). WebWatcher
Database Approach
The database approaches to Web mining have
generally focused on techniques for integrating and
organizing the heterogeneous and semi-structured
data on the Web into more structured and high-level
collections of resources, such as in relational
databases, and using standard database querying
mechanisms and data mining techniques to access
and analyze this information.
Multilevel Databases: ARANEUS system
Web Query Systems:W3QL, UnQL
Web Mining
Three Types
Web Structure Mining
Web Content Mining
Web Page Content Mining
Search Result Mining
Web Usage Mining
General Access Pattern Tracking
Customized Usage Tracking
Web Mining
Web usage mining
Deals with studying the data generated by the web surfers
sessions or behavior.
Web content & Structure mining utilizes the real or
primary data on the web. On the contrary web usage
mining mines the secondary data such as access logs,
proxy server logs, browser logs, user profiles, registration
data, user sessions, or transactions, cookies, user queries,
book mark data, mouse clicks, scrolls etc.
Data can be accumulated by the web server
Two approaches
General Access Pattern Tracking and
Customized usage Mining.
Web Usage Mining

General Access Pattern Tracking

To learn user navigation patterns (Impersonalized).
To understand access patterns & Trends.
Analyses can shed better light on the structure and grouping of the
resource providers.
Customized usage Mining
To learn user profile or user modeling in adaptive interfaces
Customized usage tracking analyzes individual trends
Help in customizing of the web sites to the users.
Web Usage Mining
Mining Techniques
The first approach maps the usage data of the web server into
relational tables before data mining techniques are applied
(Classification & Clustering)
The second approach uses the log data directly. It can be
represented with graphs.
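The first approach, mapping server log lines into a relational table before mining, can be sketched with Python's standard `re` and `sqlite3` modules. The sketch assumes the server writes Common Log Format lines; the sample lines, the table layout and the name `load_hits` are invented for illustration.

```python
import re
import sqlite3

# Regular expression for Common Log Format lines (an assumed log format).
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Invented sample log lines, for illustration only.
sample_log = [
    '10.0.0.1 - - [01/Oct/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Oct/2011:10:00:05 +0000] "GET /menu.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Oct/2011:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 512',
]


def load_hits(lines):
    """Parse log lines and store them in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (host TEXT, time TEXT, path TEXT, status INTEGER)")
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that do not fit the assumed format
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?)",
                (match["host"], match["time"], match["path"], int(match["status"])),
            )
    return conn


conn = load_hits(sample_log)
# General access pattern: number of requests per page.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_counts)
```

Once the data sits in a table, grouping by `host` instead of `path` gives a starting point for customized (per-user) usage tracking.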
Web structure mining

The web, by its very nature as hypertext, shows a

structure that can be described by means of graphs. The
visualization of said structure, of crucial importance to
understanding the results of web mining, is reduced to
that of graph drawing. This is, nevertheless a vast field
with many possibilities, some of which are presented in
this issue.
Page Rank Algorithm
PageRank is used in search engine optimization.
PageRank is one of the methods Google uses to determine a page's relevance or importance.
Questions for discussion:
How is PageRank calculated?
How is PageRank used?
PR (PageRank): the PageRank calculated for each page. Varies from 0.15 to billions.
Toolbar PR: the PageRank displayed in the Google toolbar in your browser. This rank ranges from 0 to 10.
Back link: if page A links out to page B, then page B is said to have a back link from page A.
Page Rank
It is a vote, by all other pages on the web, about how important a page is.
A link to a page counts as a vote of support.
If there is no link there is no support (but it is an abstention from voting rather than a vote against the page).
Page Rank Algorithm
We assume page A has pages T1...Tn which point to it
(i.e., are citations). The parameter d is a damping
factor which can be set between 0 and 1. We usually set
d to 0.85. Also C(A) is defined as the number of links
going out of page A. The PageRank of a page A is given
as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PR(Tn) - Each page has a notion of its own self-importance. That's PR(T1) for the first page in the web all the way up to PR(Tn) for the last page.
C(Tn) - Each page spreads its vote out evenly
amongst all of its outgoing links. The count, or
number, of outgoing links for page 1 is C(T1),
C(Tn) for page n, and so on for all pages.
PR(Tn)/C(Tn) - so if our page (page A) has a
backlink from page n the share of the vote page A
will get is PR(Tn)/C(Tn)
Page Rank Algorithm
d(... - All these fractions of votes are added together
but, to stop the other pages having too much
influence, this total vote is damped down by
multiplying it by 0.85 (the factor d)
(1 - d) - The (1 - d) bit at the beginning is a bit of
probability math magic so the sum of all web pages'
PageRanks will be one: it adds in the bit lost by the
d(.... It also means that if a page has no links to it (no
backlinks) even then it will still get a small PR of
0.15 (i.e. 1 - 0.85). (Aside: the Google paper says
the sum of all pages but they mean the
normalised sum, otherwise known as the average
to you and me.)
Each page has one outgoing link (the outgoing count is 1, i.e.
C(A) = 1 and C(B) = 1).
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
d= 0.85
PR(A) = (1 - d) + d(PR(B)/1)
PR(B) = (1 - d) + d(PR(A)/1)
i.e., starting from a guess of 1 for each page:
PR(A) = 0.15 + 0.85 * 1 = 1
PR(B) = 0.15 + 0.85 * 1 = 1
Guess 2
Let's start the guess at 0 instead and re-calculate:
PR(A) = 0.15 + 0.85 * 0 = 0.15
PR(B) = 0.15 + 0.85 * 0.15 = 0.2775
NB. we've already calculated a next best guess at PR(A), so we use it here.
And again:
PR(A)= 0.15 + 0.85 * 0.2775 = 0.385875
PR(B)= 0.15 + 0.85 * 0.385875 = 0.47799375
And again
PR(A)= 0.15 + 0.85 * 0.47799375 =.5562946875
PR(B)= 0.15 + 0.85 * 0.5562946875 = 0.622850484375
and so on.
The numbers just keep going up.
But will the numbers stop increasing when they get to 1.0?
What if a calculation over-shoots and goes above 1.0?
Guess 3
Let's start the guess at 40 each and do a few cycles:
PR(A) = 40
PR(B) = 40
First calculation
PR(A) = 0.15 + 0.85 * 40 = 34.15
PR(B) = 0.15 + 0.85 * 34.15 = 29.1775
And again
PR(A)= 0.15 + 0.85 * 29.1775 = 24.950875
PR(B)= 0.15 + 0.85 * 24.950875 =21.35824375
Numbers are heading down alright!
It sure looks like the numbers will get to 1.0 and stop.
Here's the code used to calculate this example starting the guess at 0:
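A minimal Python sketch of that iteration is given here; it is not the original listing. It assumes the two-page example above, where A and B each have a single outgoing link to the other, and the function name `iterate_two_pages` is chosen for this sketch.

```python
D = 0.85  # damping factor d from the slides


def iterate_two_pages(start, cycles):
    """Repeat the PR(A), PR(B) update for two pages that link to each other."""
    pr_a = pr_b = start
    for _ in range(cycles):
        pr_a = (1 - D) + D * (pr_b / 1)  # C(B) = 1
        pr_b = (1 - D) + D * (pr_a / 1)  # uses the PR(A) just calculated
    return pr_a, pr_b


for cycles in (1, 2, 3, 50):
    print(cycles, iterate_two_pages(0.0, cycles))
```

Calling `iterate_two_pages(40.0, 50)` instead shows the values coming down toward 1.0 from above, matching Guess 3.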
It doesn't matter where you start your guess: once the PageRank
calculations have settled down, the normalized probability distribution
(the average PageRank for all pages) will be 1.0

Calculate C(A), C(B), C(C), C(D) (the number of outgoing links of each page), then apply
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
starting from PR(A)=0, PR(B)=0, PR(C)=0, PR(D)=0.
# forward links
# a -> b, c - 2 outgoing links
# b -> c - 1 outgoing link
# c -> a - 1 outgoing link
# d -> c - 1 outgoing link
# "backward" links (what's pointing to me?)
# a <= c
# b <= a
# c <= a, b, d
# d - nothing
PR(A) = (1-d) + d * PR(C)/C(C) = 0.15 + 0.85 * 0 = 0.15
PR(B) = (1-d) + d * PR(A)/C(A) = 0.15 + 0.85 * (0.15/2) = 0.21375
PR(C) = (1-d) + d * (PR(A)/C(A) + PR(B)/C(B) + PR(D)/C(D))
= 0.15 + 0.85 * (0.15/2 + 0.21375/1 + 0/1) = 0.3954375
PR(D) = 1 - d = 0.15
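The same calculation can be repeated until the values settle. The sketch below is one possible implementation under stated assumptions: the link structure is the one implied by the "backward" links above (a -> b, c; b -> c; c -> a; d -> c), pages are updated in the order A, B, C, D using the newest values, and the function name `pagerank` is chosen for this sketch.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(X) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to X."""
    pr = {page: 0.0 for page in out_links}  # start every guess at 0
    for _ in range(iterations):
        for page in pr:
            incoming = sum(
                pr[source] / len(targets)      # PR(T) / C(T)
                for source, targets in out_links.items()
                if page in targets
            )
            pr[page] = (1 - d) + d * incoming  # newest values are reused at once
    return pr


# Forward links of the four-page example.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

print(pagerank(links, iterations=1))  # first pass, as in the hand calculation
print(pagerank(links))                # after the values have settled
```

After convergence the four values sum to about 4, i.e. the average PageRank is 1.0, and page D keeps the minimum of 0.15 because nothing links to it.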


Observation: a hierarchy concentrates votes and PR

into one page

Hierarchical but with a link in and one out.

Site A contributed 0.85 PR to us, but the raised PR in the About,
Product and More pages has had a lovely feedback effect,
pushing up the home page's PR even further!
Principle: a well structured site will amplify the effect of any
contributed PR


The vote of the

Product page has
been split evenly
between it and the
external site. We now
value the external
Site B equally with
our More page.
The More page is
getting only half the
vote it had before
this is good for Site
B but very bad for
A case study of Pizza Industry
Question to be answered
Data is taken from the three largest pizza chains:
Pizza Hut (11.65%)
Domino's Pizza (7.60%)
Papa John's Pizza (4.23%)
Total around 23%
What patterns can be found from their Facebook sites?
What patterns can be found from their Twitter sites?
What are the main differences in terms of their Facebook and Twitter patterns?

Social media use as of October 2011
Trend in tweet numbers for October for the Big Three
Pizza Hut's customer engagement trends in October 2011
Domino's customer engagement trends in October 2011
Papa John's customer engagement trends in October 2011
Five themes
Ordering & Delivery
Percentage of customers sharing their experience , feelings
and emotions
Pizza Quality
Feed back on Customer Purchase

Casual Socialization Tweets

Marketing Tweets
