
Social Data

Toby Segaran
Author, Programming Collective Intelligence
Data Magnate, Metaweb Technologies
Data mining?

“Sorting through data* to identify
patterns and establish relationships”

* usually a lot of data
Where and why?

Methods and examples
Where and why?
• Targeted Advertising
• Recommendations
• Search Results
• Group Discovery
• Filtering of Documents
• Theme Extraction
Google ad
Facebook ad
This is strange...

• Google just has text
• Facebook knows more about me
• But it’s taking a few cues...
Status: “engaged”
Where and why?
• Recommendations
Real Amazon Products
Netflix Prize
Strands Contest
Custom News
Where and why?
• Search Results
Ranking algorithms

The now-incredibly-famous paper
Learning behavior

• Google begins tracking clicks in 2005
• MSN search claims neural network
• AOL Data Scandal
Where and why?
• Group Discovery
In Biology
Page grouping
News stories
Where and why?
• Filtering of Documents
The obvious: spam
SpamBayes
Other email uses
Web documents

“As you add information to Twine, it is automatically tagged
so that you and others can find it more easily”
Where and why?
• Theme Extraction
What is the buzz?
Customer Community
Where and why?

Methods and examples
Methods and Examples
• Bayesian Filtering
• Distance Metrics
• Clustering
• Decision Trees
• Network Analysis
• Feature Extraction
Bayesian Filtering
school, work, algorithm
v1agra, trades, associate
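A minimal sketch of how word-based Bayesian filtering can work, assuming the first word group on the slide stands for wanted mail and the second for spam. The training messages, the labels "ham" and "spam", and the function names are all hypothetical; the scoring is a simple naive Bayes with add-one smoothing, not necessarily what any particular spam filter does.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Count, per label, how many documents contain each word."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(set(text.lower().split()))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest naive Bayes log score."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in set(text.lower().split()):
            # Share of this label's documents containing the word, add-one smoothed
            p = (word_counts[label][word] + 1) / (doc_counts[label] + 2)
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages built around the words on the slide
training = [
    ("homework for school is due", "ham"),
    ("the algorithm we use at work", "ham"),
    ("cheap v1agra online", "spam"),
    ("hot trades for every associate", "spam"),
]
word_counts, doc_counts = train(training)
print(classify("school algorithm", word_counts, doc_counts))
print(classify("v1agra trades", word_counts, doc_counts))
```

Working in log space avoids underflow when a message contains many words.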
Craigslist personals
Analysis
Five Cities

W4M Personal Ads
Results
New York     Boston          Chicago
Mets         Pink            Cubs
Lounges      Sox             Burbs
Offense      Poetry          Bears
Desires      Intellectually  Girlie
Musical      Punk            Insecure
Submissive   Appreciation    Cheat
Create       Exercise        Importance
Song         Winter          Blunt
Oral         Education       Mouth
Results
Los Angeles    San Francisco
Excellent      Tee
Vegas          Employment
Meaningful     Picnic
Star           STD
Lame           Tasting
Industry       Hikes
Heat           French
Fitness        .com
Entertainment  Kayaking
Latino         Cycling
Methods and Examples
• Distance Metrics
Preference distance

Sarah Marshall 3 2 1 2
Leatherheads 3 3 5 5
Preference distance
[Figure: scatter plot of the ratings, both axes running 1 to 5]
Preference distance
[Figure: the same plot with distances of 1 and 2.23 marked between points]
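The distances on the plot are consistent with plain Euclidean distance between rating pairs. A small sketch, assuming each column of the ratings table is one person and using hypothetical names p1 to p4 for them:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (Sarah Marshall, Leatherheads) ratings, one tuple per column of the table.
# The names p1..p4 are hypothetical; the slide does not name the raters.
ratings = {"p1": (3, 3), "p2": (2, 3), "p3": (1, 5), "p4": (2, 5)}

print(euclidean(ratings["p1"], ratings["p2"]))  # 1.0
print(euclidean(ratings["p1"], ratings["p4"]))  # square root of 5, about 2.236
```

Which pairs the figure actually marks is not recoverable from the text; the square root of 5 is the value the slide shows truncated to 2.23.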
For recommendations
[Figure: the same plot with distances 1 and 2.23 marked; labels read "Prom Night: 5", "?", and "Prom Night: 2"]
For recommendations
[Figure: the same plot; labels read "Prom Night: 5", "4.1", and "Prom Night: 2"]
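The slide does not say how the 4.1 is computed. One weighting that reproduces it is an inverse-distance weighted average, assuming the person at distance 1 rated Prom Night 5 and the person at distance 2.23 rated it 2. Both the weighting scheme and that pairing are assumptions; `predict` is a hypothetical helper.

```python
def predict(neighbors):
    """Weighted average rating; neighbors is a list of (distance, rating), distances > 0."""
    weights = [1.0 / distance for distance, _ in neighbors]
    weighted_sum = sum(w * rating for w, (_, rating) in zip(weights, neighbors))
    return weighted_sum / sum(weights)

# Closer neighbor (distance 1) rated 5, farther neighbor (distance 2.23) rated 2
estimate = predict([(1.0, 5), (2.23, 2)])
print(round(estimate, 1))  # 4.1
```

Closer neighbors count for more, so the estimate lands nearer to 5 than to 2.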
Linguistic distance
The
Six
Degrees
Hypothesis
Experienced
It
Is
When
You
Travel
Linguistic distance
The
Six
Degrees
Hypothesis
Experienced
It
Is
When
You
Travel

Six          3
Degrees      3
Hypothesis   1
Experienced  5
Travel       6
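A sketch of turning text into word counts like the ones above. The stop list is an assumption: it contains only the words the slide appears to drop. The slide's counts (3, 3, 1, 5, 6) come from a full text that is not shown, so this example counts only the ten words listed.

```python
import re
from collections import Counter

# Assumed stop list: just the words that are missing from the slide's count table
STOP_WORDS = {"the", "it", "is", "when", "you"}

def word_counts(text):
    """Lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

counts = word_counts("The Six Degrees Hypothesis Experienced It Is When You Travel")
print(sorted(counts))
```

Each document then becomes a vector of counts over a shared vocabulary, which is what the distance calculation needs.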
Linguistic distance
                   “china”  “kids”  “music”  “travel”  “yahoo”
Gothamist             0        3       3        3         0
GigaOM                6        0       1        4         2
Quick Online Tips     0        2       2        0        12
O’Reilly Radar        1        0       3        6         4
Linguistic distance
                   “china”  “kids”  “music”  “yahoo”
Gothamist             0        3       3        0
GigaOM                6        0       1        2
Quick Online Tips     0        2       2       12

Euclidean “as the crow flies” = 12 (approx)
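The "as the crow flies" figure can be checked by treating each blog's row as a vector and taking Euclidean distances. The slide does not say which pair gives 12; in this sketch both pairs involving Quick Online Tips come out close to it.

```python
import math
from itertools import combinations

# Word counts for "china", "kids", "music", "yahoo", copied from the table
counts = {
    "Gothamist": [0, 3, 3, 0],
    "GigaOM": [6, 0, 1, 2],
    "Quick Online Tips": [0, 2, 2, 12],
}

def euclidean(a, b):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

distances = {
    (a, b): euclidean(counts[a], counts[b]) for a, b in combinations(counts, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 2))
```

Gothamist and GigaOM end up noticeably closer to each other than either is to Quick Online Tips, driven mostly by the “yahoo” count.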
Article/blog similarity

Valleywag - Huffington > Slashdot - Wired
Methods and Examples
• Clustering
Hierarchical Clustering
[Figures: a series of plots on two axes, each running 1 to 5]
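A minimal sketch of agglomerative hierarchical clustering: start with every point as its own cluster and repeatedly merge the two closest clusters. Merging by centroid distance is one common choice and is an assumption here, as are the sample points and the function names.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hcluster(points):
    """Return a nested tuple of point indices describing the merge order."""
    # Each cluster is (centroid, tree, size); a leaf's tree is its point index
    clusters = [(tuple(p), i, 1) for i, p in enumerate(points)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ca, ta, na), (cb, tb, nb) = clusters[i], clusters[j]
        # New centroid is the size-weighted average of the two merged centroids
        centroid = tuple((x * na + y * nb) / (na + nb) for x, y in zip(ca, cb))
        merged = (centroid, (ta, tb), na + nb)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]

# Hypothetical points on a 1-to-5 grid: two near the bottom left, three near the top right
points = [(1, 1), (1.5, 1), (4, 4), (4, 4.8), (5, 4)]
tree = hcluster(points)
print(tree)
```

The nested tuple is the hierarchy: inner pairs were merged first, and the outermost pair is the final merge.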
Grouping bloggers
Grouping articles
Methods and Examples
• Decision Trees
Decision Trees
CART Algorithm
Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

From any dataset...
CART Algorithm
Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

Type is C?
  No:  Avg=2.1
  Yes: Avg=4.5

... find the best split ...
CART Algorithm
Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

Type is C?
  No:  Duracell?
    No:  2.2
    Yes: 2
  Yes: Duracell?
    No:  5
    Yes: 4

... and repeat.
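A sketch of the "find the best split" step on the battery table. It scores each candidate split by the size-weighted variance of battery life in the two branches, which is a common choice for numeric targets but an assumption about what the slide intends. The function names are hypothetical.

```python
rows = [
    ("Duracell", "C", 4.0),
    ("Energizer", "C", 5.0),
    ("Duracell", "AA", 2.0),
    ("Energizer", "AA", 2.2),
]

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def best_split(rows):
    """Return (score, column, value) for the split with the lowest weighted variance."""
    best = None
    for col in range(len(rows[0]) - 1):  # last column is the target
        for value in sorted({r[col] for r in rows}):
            yes = [r[-1] for r in rows if r[col] == value]
            no = [r[-1] for r in rows if r[col] != value]
            if not yes or not no:
                continue
            score = (len(yes) * variance(yes) + len(no) * variance(no)) / len(rows)
            if best is None or score < best[0]:
                best = (score, col, value)
    return best

score, column, value = best_split(rows)
print(column)  # 1, the Type column
avg_c = mean([r[2] for r in rows if r[1] == "C"])
avg_aa = mean([r[2] for r in rows if r[1] == "AA"])
print(avg_c, round(avg_aa, 2))
```

Splitting on Type separates long-lived from short-lived batteries far better than splitting on Brand, which is why it becomes the root; the same search is then repeated inside each branch.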
Hot or Not
Methods and Examples
• Network Analysis
A network
[Figure: a network of six nodes, A through F]
PageRank
[Figure: the network with every node's score set to 1.0]
D = 0.15 + 0.85*E/1 + 0.85*F/2 + 0.85*B/1 = 2.275
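PageRank
[Figure: the network with updated scores: 1.0, 0.58, 1.0, 2.275, 0.58, 0.15]
PageRank
[Figure: the network with updated scores: 0.3, 0.58, 2.08, 1.56, 0.58, 0.15]
PageRank
[Figure: the network with updated scores: 0.3, 1.03, 1.48, 1.56, 1.03, 0.15]
PageRank
[Figure: the network with updated scores: 0.3, 0.78, 1.48, 1.34, 0.78, 0.15]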
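A sketch of the update rule from the formula for D: each node gets 0.15 plus 0.85 times the sum of each linking node's score divided by that node's number of outgoing links. Only D's incoming links (from E, B, and F, where F has two outgoing links) come from the slide; every other edge in `links` is hypothetical, since the figure's edges are not recoverable. This version updates all nodes at once per pass, whereas the slides appear to update one node at a time.

```python
DAMPING = 0.85

def pagerank(links, iterations=20):
    """links maps each node to the list of nodes it links to."""
    scores = {node: 1.0 for node in links}
    for _ in range(iterations):
        new_scores = {}
        for node in links:
            # Each linking page passes on its score divided by its out-link count
            incoming = sum(
                scores[src] / len(targets)
                for src, targets in links.items()
                if node in targets
            )
            new_scores[node] = (1 - DAMPING) + DAMPING * incoming
        scores = new_scores
    return scores

# Hypothetical graph; only the links into D follow the slide's formula
links = {
    "A": ["C"],
    "B": ["D"],
    "C": ["A"],
    "D": ["C"],
    "E": ["D"],
    "F": ["D", "B"],
}
first_pass = pagerank(links, iterations=1)
print(round(first_pass["D"], 3))  # 2.275
```

A node with no incoming links, such as F here, settles at the 0.15 floor.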
CI FOO participants
Science papers
Bringing PageRank to the citation analysis
The paper attempts to provide an alternative method for
measuring the importance of scientific papers based on
Google's PageRank. The method is a meaningful extension of
the common integer counting of citations and is then
tested by bringing PageRank to citation analysis
in a large citation network. It offers a more integrated picture
of the publications' influence in a specific field.
Clustering coefficient

“How many of each person’s friends
are friends with each other?”
Clustering coefficient
[Figure: a network of six nodes, A through F]
Low clustering coefficient
Clustering coefficient
[Figure: a network of six nodes, A through F]
High clustering coefficient
“small world graph”
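The question "how many of each person's friends are friends with each other?" translates directly into code: for one person, count the pairs of friends that are themselves connected and divide by the number of possible pairs. The two small graphs below are hypothetical, since the edges in the figures are not recoverable.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """graph maps each node to the set of its friends (undirected)."""
    friends = graph[node]
    if len(friends) < 2:
        return 0.0
    pairs = list(combinations(friends, 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)

# Hypothetical star: A's friends do not know each other
star = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}

# Hypothetical clique: everyone knows everyone
clique = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
}

print(clustering_coefficient(star, "A"))    # 0.0
print(clustering_coefficient(clique, "A"))  # 1.0
```

Averaging this value over every node gives a single clustering coefficient for the whole network.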
Twitter!
Methods and Examples
• Feature Extraction
Independent Features
Message boards
Matrix Factorization
Features Matrix
           F1  F2  F3
Gym         0   1   2
Calorie     2   0   1
Weigh       2   2   1
Carbs       1   0   3
Treadmill   0   1   2

x

Weight Matrix
    Msg1  Msg2  Msg3  Msg4  Msg5
F1     1     0     2     3     0
F2     0     2     1     1     3
F3     1     0     2     0     0

Current Guess
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           1     3     3     0     1
Calorie       0     2     4     1     3
Weigh         2     3     1     0     1
Carbs         0     1     1     0     2
Treadmill     3     2     0     2     2
Matrix Factorization
Features Matrix
           F1  F2  F3
Gym         0   1   2
Calorie     2   0   1
Weigh       2   2   1
Carbs       1   0   3
Treadmill   0   1   2

x

Weight Matrix
    Msg1  Msg2  Msg3  Msg4  Msg5
F1     1     0     2     3     0
F2     0     2     1     1     3
F3     1     0     2     0     0

Current Guess
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           1     3     3     0     1
Calorie       0     2     4     1     3
Weigh         2     3     1     0     1
Carbs         0     1     1     0     2
Treadmill     3     2     0     2     2

Target Result
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0
Matrix Factorization
Features Matrix
           F1  F2  F3
Gym         1   0   0
Calorie     0   1   1
Weigh       0   0   2
Carbs       0   1   0
Treadmill   1   0   0

x

Weight Matrix
    Msg1  Msg2  Msg3  Msg4  Msg5
F1     2     0     0     1     0
F2     0     2     0     1     3
F3     1     0     1     0     0

Current Guess
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0

Target Result
           Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0
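A sketch of how the guess can be moved toward the target. It uses the standard multiplicative update rules for non-negative matrix factorization, which is an assumption: the slides do not name the update method. The target is the word-by-message table above; the random starting matrices, the seed, and all function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(a, b):
    """Sum of squared differences between two same-shaped matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def scale(m, num, den, eps=1e-9):
    """Element-wise m * num / den, the multiplicative update step."""
    return [[v * n / (d + eps) for v, n, d in zip(mr, nr, dr)]
            for mr, nr, dr in zip(m, num, den)]

def factorize(target, n_features, iterations=200, seed=0):
    """Return (features, weights) whose product approximates target."""
    rng = random.Random(seed)
    features = [[rng.random() for _ in range(n_features)] for _ in target]
    weights = [[rng.random() for _ in target[0]] for _ in range(n_features)]
    for _ in range(iterations):
        ft = transpose(features)
        weights = scale(weights, matmul(ft, target),
                        matmul(matmul(ft, features), weights))
        wt = transpose(weights)
        features = scale(features, matmul(target, wt),
                         matmul(features, matmul(weights, wt)))
    return features, weights

# Rows: Gym, Calorie, Weigh, Carbs, Treadmill; columns: Msg1..Msg5
target = [
    [2, 0, 0, 3, 0],
    [0, 2, 1, 1, 3],
    [1, 0, 2, 0, 0],
    [0, 3, 0, 0, 2],
    [1, 0, 0, 2, 0],
]
features, weights = factorize(target, n_features=3)
print(round(cost(matmul(features, weights), target), 3))
```

Because every update multiplies by a non-negative ratio, both matrices stay non-negative, which is what lets the features be read as additive themes.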
Interpreting Features
Features Matrix
           F1  F2  F3
Gym         1   0   0
Calorie     0   1   1
Weigh       0   0   2
Carbs       0   1   0
Treadmill   1   0   0

Theme 1: Gym, Treadmill
Theme 2: Calorie, Carbs
Theme 3: Weigh, Calorie

Weight Matrix
    Msg1  Msg2  Msg3  Msg4  Msg5
F1     2     0     0     1     0
F2     0     2     0     1     3
F3     1     0     1     0     0

Msg1: Theme 1, Theme 3
Msg2: Theme 2
Msg3: Theme 3
etc.
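The reading of the two matrices can be sketched in a few lines: a theme's words are the highest-scoring rows in its column of the features matrix, and a message's themes are the features with non-zero weight in its column of the weight matrix. The matrices are copied from this slide; the helper names are hypothetical.

```python
words = ["Gym", "Calorie", "Weigh", "Carbs", "Treadmill"]
features = [  # rows follow `words`, columns are F1..F3
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 2],
    [0, 1, 0],
    [1, 0, 0],
]
weights = [  # rows are F1..F3, columns are Msg1..Msg5
    [2, 0, 0, 1, 0],
    [0, 2, 0, 1, 3],
    [1, 0, 1, 0, 0],
]

def top_words(features, words, theme, n=2):
    """The n highest-scoring words for one theme, ignoring zero scores."""
    scored = zip(words, (row[theme] for row in features))
    ranked = sorted(scored, key=lambda pair: -pair[1])
    return [word for word, score in ranked[:n] if score > 0]

def themes_for_message(weights, message):
    """1-based theme numbers that have non-zero weight in the given message."""
    return [t + 1 for t in range(len(weights)) if weights[t][message] > 0]

for theme in range(3):
    print("Theme", theme + 1, top_words(features, words, theme))
for message in range(3):
    print("Msg", message + 1, themes_for_message(weights, message))
```

With real data the scores are rarely exact zeros, so a threshold or a fixed top-n cutoff would usually replace the `> 0` test.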
Diet & Body themes
Atkins, Induction, South, Beach, Carbs
Calories, Weight, Fats, Protein, Cholesterol
Chocolate, Black, Coffee, Olive, Broccoli
Gym, Weights, Exercise, Running, Injured
Cook, Recipe, Fried, Home, Organic
Money, Want, Best
Wikipedia people
she series league olympics university
her television major competed professor
after show baseball won received
when which season summer science
father radio played medal research
women bbc with athelete born
We’re just getting started...
Homepage http://kiwitobes.com

Freebase http://freebase.com
Questions?