
Efficient Parallel Set-Similarity Joins Using MapReduce


Self-Join case
R-S Join case
Handling insufficient memory
Experimental evaluation
Conclusions

Vast amount of data:
Google N-gram database: ~1 trillion records
GenBank: 100 million records, size = 416 GB
Facebook: 400 million active users

Detecting similar pairs of records becomes a challenging problem

Detecting near-duplicate web pages in web crawling
Document clustering
Plagiarism detection
Master data management
"John W. Smith", "Smith, John", "John William Smith"

Making recommendations to users based on their similarity to other users in query refinement
Mining in social networking sites
Users [1,0,0,1,1,0,1,0,0,1] and [1,0,0,0,1,0,1,0,1,1] have similar interests

Identifying coalitions of click fraudsters in online advertising

Problem Statement: Given two collections of objects/items/records, a similarity metric sim(o1,o2) and a threshold τ, find the pairs of objects/items/records satisfying sim(o1,o2) >= τ

Set-similarity functions

Jaccard or Tanimoto coefficient
Jaccard(x, y) = |x ∩ y| / |x ∪ y|

"I will call back" = [I, will, call, back]
"I will call you soon" = [I, will, call, you, soon]
Jaccard similarity = 3/6 = 0.5
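The worked example above can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the helper names `tokenize` and `jaccard` are illustrative and not part of the original slides.

```python
def tokenize(text):
    """Split a string into a set of whitespace-separated word tokens."""
    return set(text.split())


def jaccard(x, y):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| of two token sets."""
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets (assumption)
    return len(x & y) / len(x | y)


a = tokenize("I will call back")
b = tokenize("I will call you soon")

# 3 shared tokens (I, will, call) out of 6 distinct tokens.
print(jaccard(a, b))  # 0.5
```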

Why Hadoop ?

Set-similarity with MapReduce

Large amounts of data, shared-nothing architecture

map (k1, v1) -> list(k2, v2)
reduce (k2, list(v2)) -> list(k3, v3)

Problem:
Too much data to transfer
Too many pairs to verify (two similar sets share at least 1 token)

Set-Similarity Filtering
Efficient set-similarity join algorithms rely on effective filters
String s = "I will call back"
Global token ordering: {back, call, will, I}
Prefix of length 2 of s = [back, call]
The prefix filtering principle states that similar strings need to share at least one common token in their prefixes.
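As a rough illustration of the prefix computation, the sketch below sorts the tokens of s by their position in the global ordering and keeps the first two. The function name `prefix` and its parameters are assumptions made for this example.

```python
def prefix(tokens, global_order, length):
    """Return the first `length` tokens of a record under the global ordering."""
    rank = {token: position for position, token in enumerate(global_order)}
    ordered = sorted(set(tokens), key=lambda token: rank[token])
    return ordered[:length]


global_order = ["back", "call", "will", "I"]
s = "I will call back".split()

print(prefix(s, global_order, 2))  # ['back', 'call']
```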

Prefix filtering: example

Record 1 and Record 2:
Each set has 5 tokens
Similar: they share at least 4 tokens
Prefix length: 2 (= 5 - 4 + 1)
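The prefix length in this example follows from a pigeonhole argument: a set of size n that must share at least k tokens with another set needs a prefix of n - k + 1 tokens. The sketch below checks this by brute force on a small universe. The 7-token universe and the alphabetical global ordering are assumptions chosen only to keep the check small.

```python
from itertools import combinations


def prefix_length(set_size, min_overlap):
    """Prefix length needed so that sets sharing min_overlap tokens share a prefix token."""
    return set_size - min_overlap + 1


def prefixes_intersect(x, y, min_overlap):
    """True if the prefixes of x and y (alphabetical global order) share a token."""
    px = sorted(x)[:prefix_length(len(x), min_overlap)]
    py = sorted(y)[:prefix_length(len(y), min_overlap)]
    return bool(set(px) & set(py))


universe = "ABCDEFG"  # alphabetical order stands in for the global ordering
records = [set(c) for c in combinations(universe, 5)]

# Pairs that share at least 4 tokens but whose 2-token prefixes are disjoint.
violations = [
    (x, y)
    for x in records
    for y in records
    if len(x & y) >= 4 and not prefixes_intersect(x, y, 4)
]

print(prefix_length(5, 4))  # 2
print(len(violations))      # 0
```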

Parallel Set-Similarity Joins

Stage I: Token Ordering
Compute data statistics for good signatures
Stage II: RID-Pair Generation
Stage III: Record Join
Generate actual pairs of joined records

Input Data
RID = Row ID
a: join column
"A B C" is a string, e.g.:
Address: 14th Saarbruecker Strasse
Name: John W. Smith

Stage I: Token Ordering

Basic Token Ordering (BTO)
One Phase Token Ordering (OPTO)

Token Ordering

Creates a global ordering of the tokens in the join column, based on their frequency
[Figure: example input table with columns RID, a, b, c and the resulting global ordering of tokens, based on frequency]

Basic Token Ordering(BTO)

2 MapReduce cycles:
1st: compute token frequencies
2nd: sort the tokens by their frequencies

Basic Token Ordering 1st MapReduce cycle


map: tokenize the join value of each record; emit each token with no. of occurrences 1

reduce: for each token, compute total count (frequency)
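The first cycle is essentially a word count over the join column. The sketch below simulates it in memory rather than on Hadoop; `run_cycle` stands in for the shuffle between map and reduce, and all function names and sample records are assumptions for illustration.

```python
from collections import defaultdict


def map_tokens(record):
    """map: tokenize the join value and emit (token, 1) for each token."""
    _rid, join_value = record
    for token in join_value.split():
        yield token, 1


def reduce_counts(token, counts):
    """reduce: sum the occurrences of one token."""
    yield token, sum(counts)


def run_cycle(records):
    """Simulate one MapReduce cycle in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_tokens(record):
            groups[key].append(value)
    frequencies = {}
    for key, values in groups.items():
        for token, total in reduce_counts(key, values):
            frequencies[token] = total
    return frequencies


records = [(1, "A B C"), (2, "B C D"), (3, "C D")]
print(run_cycle(records))  # {'A': 1, 'B': 2, 'C': 3, 'D': 2}
```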

Basic Token Ordering 2nd MapReduce cycle

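map: interchange key with value, so that the frequency becomes the key and the framework sorts by it

reduce (use only 1 reducer): emits the value, i.e. the tokens in order of frequency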
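A minimal in-memory sketch of the second cycle follows. The call to `sort()` stands in for the sort that the framework performs on keys before the single reducer runs; breaking ties by token name is an assumption of this sketch, as are the function names.

```python
def map_swap(token, frequency):
    """map: interchange key and value so that the frequency becomes the key."""
    return frequency, token


def token_order(frequencies):
    """Emit the tokens in increasing order of frequency."""
    swapped = [map_swap(token, freq) for token, freq in frequencies.items()]
    swapped.sort()  # stands in for the framework's sort by key
    # reduce (single reducer): emit only the value, i.e. the token
    return [token for _freq, token in swapped]


print(token_order({"A": 1, "B": 2, "C": 3, "D": 2}))  # ['A', 'B', 'D', 'C']
```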

One Phase Token Ordering (OPTO)

Alternative to Basic Token Ordering (BTO):

Uses only one MapReduce cycle (less I/O)
In-memory token sorting, instead of using a reducer

OPTO Details
map: tokenize the join value of each record; emit each token with no. of occurrences 1

reduce: for each token, compute total count (frequency)

Use tear_down method to order the tokens in memory
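The idea can be sketched as a reducer object that accumulates counts and sorts them once at the end. This is a plain-Python simulation, not Hadoop code; the class `TokenCountReducer`, the driver function, and the use of a single reducer instance are assumptions for illustration.

```python
from collections import Counter, defaultdict


class TokenCountReducer:
    """Hypothetical single reducer that keeps all token counts in memory."""

    def __init__(self):
        self.frequencies = Counter()

    def reduce(self, token, counts):
        # Compute the total count for one token; nothing is emitted yet.
        self.frequencies[token] += sum(counts)

    def tear_down(self):
        # Called once after the last reduce(): sort the tokens in memory.
        return sorted(self.frequencies, key=lambda t: (self.frequencies[t], t))


def one_phase_token_ordering(records):
    """Simulate the single cycle: map, group by token, reduce, tear down."""
    grouped = defaultdict(list)
    for _rid, join_value in records:
        for token in join_value.split():
            grouped[token].append(1)  # map: emit (token, 1)
    reducer = TokenCountReducer()
    for token, counts in grouped.items():
        reducer.reduce(token, counts)
    return reducer.tear_down()


print(one_phase_token_ordering([(1, "A B C"), (2, "B C D"), (3, "C D")]))
# ['A', 'B', 'D', 'C']
```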

Stage II: RID-Pair Generation

Basic Kernel (BK)
Indexed Kernel (PK)

RID-Pair Generation
Scans the original input data (records)
Outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
Consists of only one MapReduce cycle
Uses the global ordering of tokens obtained in the previous stage

RID-Pair Generation: Map Phase

scan input records and for each record:
project it on RID & join attribute
tokenize it
extract prefix according to global ordering of tokens obtained in the Token Ordering stage
route tokens to appropriate reducer

Grouping/Routing Strategies

Goal: distribute candidates to the right reducers to minimize the reducers' workload
Like hashing (projected) records to the corresponding candidate buckets
Each reducer handles one or more candidate buckets
2 routing strategies:
Using Individual Tokens
Using Grouped Tokens

Routing: using individual tokens

Treat each token as a key
For each record, generate a (key, value) pair for each of its prefix tokens
Example: given the global ordering:
Token  Frequency
A      10
B      10
E      22
D      23
G      23
C      40
F      48

A B C => prefix of length 2: A, B
=> generate/emit 2 (key, value) pairs:
(A, (1, A B C))
(B, (1, A B C))
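The routing step for this example could look roughly like the sketch below. The function name `map_individual` and the fixed prefix length passed as a parameter are assumptions; the global ordering is the one from the example.

```python
GLOBAL_ORDER = ["A", "B", "E", "D", "G", "C", "F"]  # increasing frequency
RANK = {token: position for position, token in enumerate(GLOBAL_ORDER)}


def map_individual(rid, join_value, prefix_len):
    """Emit one (token, (RID, join value)) pair per prefix token."""
    tokens = sorted(set(join_value.split()), key=RANK.__getitem__)
    return [(token, (rid, join_value)) for token in tokens[:prefix_len]]


print(map_individual(1, "A B C", 2))
# [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```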

Grouping/Routing: using individual tokens

High quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)

High replication of data (the same pair of records might be checked for similarity in multiple reducers, i.e. redundant work)

Routing: Using Grouped Tokens

Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
For each record, generate a (key, value) pair for each of the groups of its prefix tokens
Example: given the global ordering:
Token  Frequency
A      10
B      10
E      22
D      23
G      23
C      40
F      48

A B C => prefix of length 2: A, B
Suppose A, B belong to group X and C belongs to group Y
=> both prefix tokens map to the same key, so only 1 (key, value) pair is emitted: (X, (1, A B C))
(C is outside the prefix, so nothing is emitted for group Y)

Grouping/Routing: Using Grouped Tokens

The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner
Token  Frequency  Group
A      10         1
B      10         2
E      22         3
D      23         1
G      23         2
C      40         3
F      48         1

Resulting groups: {A, D, F}, {B, G}, {E, C}
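A small sketch of round-robin grouping and the resulting routing is shown below. It assumes 3 groups, numbered from 0, and takes the prefix length as a parameter; the function names are illustrative only.

```python
GLOBAL_ORDER = ["A", "B", "E", "D", "G", "C", "F"]  # increasing frequency
RANK = {token: position for position, token in enumerate(GLOBAL_ORDER)}


def round_robin_groups(order, num_groups):
    """Assign tokens to groups in turn, following the global ordering."""
    return {token: i % num_groups for i, token in enumerate(order)}


def map_grouped(rid, join_value, prefix_len, groups):
    """Emit one (group, (RID, join value)) pair per distinct prefix-token group."""
    tokens = sorted(set(join_value.split()), key=RANK.__getitem__)
    keys = sorted({groups[token] for token in tokens[:prefix_len]})
    return [(key, (rid, join_value)) for key in keys]


groups = round_robin_groups(GLOBAL_ORDER, 3)
print(groups)
# {'A': 0, 'B': 1, 'E': 2, 'D': 0, 'G': 1, 'C': 2, 'F': 0}

# Prefix tokens A and B fall into different groups: two pairs are emitted.
print(map_grouped(1, "A B C", 2, groups))
# [(0, (1, 'A B C')), (1, (1, 'A B C'))]

# Prefix tokens A and D share a group: only one pair is emitted.
print(map_grouped(2, "A D F", 2, groups))
# [(0, (2, 'A D F'))]
```

With individual-token routing the second record would be emitted twice, so grouping trades some replication for coarser buckets.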


Grouping/Routing: Using Grouped Tokens

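Less replication of record projections

Quality of grouping is not so high (records having no chance of being similar are sent to the same reducer, which checks their similarity)
Record A B C D with prefix tokens A, B, C (A, B belong to group X; C belongs to group Y) => output (X, _) and (Y, _)
Record E F G with prefix token E (E belongs to group Y) => output (Y, _)
Both records reach the reducer for group Y although they share no token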

RID-Pair Generation: Reduce Phase

This is the core of the entire method
Each reducer processes one or more buckets
In each bucket, the reducer looks for pairs of join attribute values satisfying the join predicate
If the similarity of the 2 candidates >= threshold => output their IDs and also their similarity

Bucket of candidates

RID-Pair Generation: Reduce Phase

Computing similarity of the candidates in a bucket comes in 2 flavors:

Basic Kernel : uses 2 nested loops to verify each pair of candidates in the bucket

Indexed Kernel : uses a PPJoin+ index

RID-Pair Generation: Basic Kernel

Straightforward method for finding candidates satisfying the join predicate
Quadratic complexity: O(#candidates^2)
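A minimal sketch of the nested-loop verification inside one bucket, assuming Jaccard as the similarity function and whitespace tokenization. The bucket contents and the 0.7 threshold are made-up values for illustration.

```python
def jaccard(x, y):
    """Jaccard coefficient of two token sets (0.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def basic_kernel(bucket, threshold):
    """Verify every pair of candidates in a bucket with two nested loops."""
    results = []
    for i in range(len(bucket)):
        rid1, value1 = bucket[i]
        for j in range(i + 1, len(bucket)):
            rid2, value2 = bucket[j]
            sim = jaccard(set(value1.split()), set(value2.split()))
            if sim >= threshold:
                results.append((rid1, rid2, sim))  # output the IDs and similarity
    return results


bucket = [(1, "A B C"), (2, "A B D"), (3, "A B C E")]
print(basic_kernel(bucket, 0.7))  # [(1, 3, 0.75)]
```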

RID-Pair Generation: PPJoin+ Indexed Kernel

Uses a special index data structure
Not so straightforward to implement
map() is the same as in the BK algorithm
Much more efficient

Stage III: Record Join

Until now we have only pairs of RIDs, but we need actual records
Use the RID pairs generated in the previous stage to join the actual records
Main idea:
bring in the rest of each record (everything except the RID, which we already have)

2 approaches:
Basic Record Join (BRJ)
One-Phase Record Join (OPRJ)

Record Join: Basic Record Join

Uses 2 MapReduce cycles
1st cycle: fills in the record information for each half of each pair
2nd cycle: brings together the previously filled in records
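The two cycles can be simulated in memory as below. This is only a sketch of the idea: the function names, the record layout (RID first, then the remaining fields), and the sample data are assumptions, and the dictionaries stand in for the grouping that the framework performs between map and reduce.

```python
from collections import defaultdict


def fill_halves(records, rid_pairs):
    """1st cycle: group by RID and attach the full record to each pair it appears in."""
    by_rid = defaultdict(lambda: {"record": None, "pairs": []})
    for record in records:
        by_rid[record[0]]["record"] = record
    for r1, r2, sim in rid_pairs:
        by_rid[r1]["pairs"].append((r1, r2, sim))
        by_rid[r2]["pairs"].append((r1, r2, sim))
    halves = []
    for group in by_rid.values():
        for r1, r2, sim in group["pairs"]:
            halves.append(((r1, r2), (group["record"], sim)))
    return halves


def join_halves(halves):
    """2nd cycle: group by RID pair and bring the two filled-in halves together."""
    by_pair = defaultdict(list)
    for pair_key, half in halves:
        by_pair[pair_key].append(half)
    joined = []
    for (r1, r2), filled in by_pair.items():
        records = {record[0]: record for record, _sim in filled}
        joined.append((records[r1], records[r2], filled[0][1]))
    return joined


records = [(1, "A B C", "John"), (2, "A B D", "Mary"), (3, "A B C E", "Jon")]
rid_pairs = [(1, 3, 0.75)]
print(join_halves(fill_halves(records, rid_pairs)))
# [((1, 'A B C', 'John'), (3, 'A B C E', 'Jon'), 0.75)]
```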

Record Join: One Phase Record Join

Uses only one MapReduce cycle

R-S Join
Challenge:
We now have 2 different record sources => 2 different input streams
MapReduce can work on only 1 input stream
2nd and 3rd stages are affected
Solution: extend the (key, value) pairs so that they include a relation tag for each record
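One way to picture the relation tag is sketched below: every record carries "R" or "S", both relations flow through the same stream, and the reducer only verifies pairs that cross relations. The tag values, function names, and sample data are assumptions for illustration.

```python
def jaccard(x, y):
    """Jaccard coefficient of two token sets (0.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def tag(relation, records):
    """Prepend a relation tag so that both inputs can share one stream."""
    return [(relation, rid, value) for rid, value in records]


def rs_kernel(bucket, threshold):
    """Verify only pairs made of one R record and one S record."""
    r_side = [rec for rec in bucket if rec[0] == "R"]
    s_side = [rec for rec in bucket if rec[0] == "S"]
    results = []
    for _tag_r, r_rid, r_value in r_side:
        for _tag_s, s_rid, s_value in s_side:
            sim = jaccard(set(r_value.split()), set(s_value.split()))
            if sim >= threshold:
                results.append((r_rid, s_rid, sim))
    return results


bucket = tag("R", [(1, "A B C"), (2, "A B C D")]) + tag("S", [(1, "A B C D"), (2, "A X Y")])
print(rs_kernel(bucket, 0.7))  # [(1, 1, 0.75), (2, 1, 1.0)]
```

The two R records are themselves similar (Jaccard 0.75) but are not reported, because both come from the same relation.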

Handling Insufficient Memory

Map-Based Block Processing. Reduce-Based Block Processing

Cluster: 10-node IBM x3650, running Hadoop
Data sets:
DBLP: 1.2M publications
CITESEERX: 1.3M publications
Consider only the header of each paper (i.e. author, title, date of publication, etc.)
Data size synthetically increased (by various factors)
Measures:
Absolute running time
Speedup
Scaleup

Self-Join running time

Best algorithm: BTO-PK-OPRJ
Most expensive stage: the RID-pair generation

Self-Join Speedup
Fixed data size, vary the cluster size
Best time: BTO-PK-OPRJ

Self-Join Scaleup
Increase data size and cluster size together by the same factor
Best time: BTO-PK-OPRJ

Self-Join Summary
Stage I: BTO was the best choice.
Stage II: PK was the best choice.
Stage III: the best choice depends on the amount of data and the size of the cluster.
OPRJ was somewhat faster, but the cost of loading the similar-RID pairs in memory stayed constant as the cluster size increased and grew as the data size increased. For these reasons, we recommend BRJ as a good alternative.

Best scaleup was achieved by BTO-PK-BRJ

R-S Join Performance

Speed Up
Stage I: R-S join performance was identical to the first stage in the self-join case
Stage II: a similar (almost perfect) speedup was observed as for the self-join case
Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach

For both self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method.
Useful in many data cleaning scenarios
SSJoin and MapReduce: one solution for huge datasets
Very efficient when based on prefix filtering and PPJoin+
Scales up nicely

Thank You!