Performance Analysis of Hadoop Link Prediction

Yuxiao Dong ydong1@nd.edu

Casey Robinson crobins9@nd.edu

Jian Xu jxu5@nd.edu

Introduction

[Figure: Motivating examples from Facebook and Twitter, with potential links marked as unknown (?), not formed (✗), or formed (✓).]

Problem Statement
In a network G = (V, E, X), for a particular user v_s and a set of candidate users C to which v_s may create links, find a predictive function f: (V, E, X, v_s, C) → Y, where Y = {y_1, y_2, ..., y_|C|} is the set of inferred results indicating whether v_s will create a link to each user in C.
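The slides give no code for f; purely as an illustration, its signature could be expressed in Java roughly as follows (every name here is hypothetical and not from the slides):

// Hypothetical illustration of the predictive function f from the problem
// statement; none of these names appear in the slides.
import java.util.Map;
import java.util.Set;

interface LinkPredictor {
    /**
     * Given the network (adjacency over V and E, plus attributes X),
     * a source user vs, and a candidate set C, return one inferred
     * result per candidate: true if vs is predicted to link to it.
     */
    Map<Long, Boolean> predict(Map<Long, Set<Long>> adjacency,
                               Map<Long, String> attributes,
                               long vs,
                               Set<Long> candidates);
}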

Challenges
• Real networks are large
  • > 1 billion users on Facebook (Oct. 2012)
  • > 500 million users on Twitter (Jul. 2012)
  • > 175 million users on LinkedIn (Jun. 2012)
• Big data makes prediction even slower

Our Solution
• Divide → Smaller Problems
• Adjacency list → Map Reduce
• Distributed computing → Hadoop on the Data Intensive Science Cluster

[Figure: MapReduce data flow: input splits feed map tasks, map outputs are sorted and merged, and reduce tasks write the output parts.]

Link Prediction Framework
[Figure: Link prediction framework. A Prepare stage (vertex count, data split, probe edge count, degree statistics) produces the adjacency list (AdjList); from it the link prediction scores (LP Score) are computed, and probe-edge scores (Probe Score) and non-existent-edge scores (Non-Exist Score) are then combined to compute the AUC.]
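The framework appears in the slides only as a diagram. As a rough sketch (not the authors' code), the stages could be chained as standard Hadoop jobs; in this minimal driver the mapper and reducer classes are omitted, so each job defaults to identity map/reduce, and the paths and job names are assumptions:

// Hypothetical driver sketch: chain the AdjList stage into the LP Score stage.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LinkPredictionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String edges = args[0];          // raw edge list (assumed input)
        String adjListDir = "adjlist";   // intermediate output (assumed path)
        String scoreDir = args[1];       // final LP scores

        // Stage 1: build adjacency lists (AdjList job).
        Job adjList = Job.getInstance(conf, "AdjList");
        adjList.setJarByClass(LinkPredictionDriver.class);
        // adjList.setMapperClass(...); adjList.setReducerClass(...);  // omitted
        FileInputFormat.addInputPath(adjList, new Path(edges));
        FileOutputFormat.setOutputPath(adjList, new Path(adjListDir));
        if (!adjList.waitForCompletion(true)) System.exit(1);

        // Stage 2: compute link prediction scores (LP Score job) from AdjList.
        Job lpScore = Job.getInstance(conf, "LPScore");
        lpScore.setJarByClass(LinkPredictionDriver.class);
        // lpScore.setMapperClass(...); lpScore.setReducerClass(...);  // omitted
        FileInputFormat.addInputPath(lpScore, new Path(adjListDir));
        FileOutputFormat.setOutputPath(lpScore, new Path(scoreDir));
        System.exit(lpScore.waitForCompletion(true) ? 0 : 1);
    }
}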

Algorithm Design
[Figure: Worked example on a 7-vertex graph. The first Mapper/Reducer round builds each vertex's adjacency list (e.g. 1 → 2,3,4; 2 → 3,4,6; 5 → 6,7). The second round's mapper emits every pair of neighbors tagged with the shared vertex, e.g. (2,3,1), (2,4,1), (3,4,1) from vertex 1, and the reducer merges duplicates, so the pair (3,4) collects common neighbors {1,2}.]
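The worked example suggests a common-neighbors style computation: each vertex's adjacency list yields all pairs of its neighbors, and the reducer merges the vertices shared by each pair. The following is a minimal sketch of that round, assuming tab-separated adjacency-list input with sorted, comma-separated neighbor lists; the class names and record formats are our assumptions, not the authors' implementation:

// Sketch of the pair-generation round suggested by the worked example.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CommonNeighbors {

    // Input line (assumed): "<vertex>\t<neighbor1>,<neighbor2>,..."
    // Neighbor lists are assumed sorted so pairs are emitted consistently.
    public static class PairMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length < 2) return;
            String vertex = parts[0];
            String[] neighbors = parts[1].split(",");
            // Emit every unordered pair of neighbors, tagged with the vertex
            // they share, e.g. (2,3) -> 1 from vertex 1's list 2,3,4.
            for (int i = 0; i < neighbors.length; i++) {
                for (int j = i + 1; j < neighbors.length; j++) {
                    context.write(new Text(neighbors[i] + "," + neighbors[j]),
                                  new Text(vertex));
                }
            }
        }
    }

    // Merge the shared vertices for each candidate pair; the number of
    // merged values is the common-neighbor score for that pair.
    public static class MergeReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text pair, Iterable<Text> sharedVertices,
                              Context context)
                throws IOException, InterruptedException {
            StringBuilder merged = new StringBuilder();
            int count = 0;
            for (Text v : sharedVertices) {
                if (count++ > 0) merged.append(",");
                merged.append(v.toString());
            }
            context.write(pair, new Text(merged.toString()));
        }
    }
}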

Data Sets
Name           Nodes        Edges        Relative Size
HepPh          12,008       237,010      1x
ND Web         325,729      1,497,134    7.14x
Live Journal   4,847,571    68,993,773   357.78x

Approach

Black Box

• Number of Reducers
• Data Size

Time Breakdown

• Which step(s)?

[Figure: Time breakdown (% of total) by step for HepPh, ND Web, and Live Journal.]

Resource Monitoring

• Bottlenecks

Machine Specifications
• 26 nodes
• 32 GB RAM
• 12x2 TB SATA disks (4 dedicated to Hadoop storage)
• 2x 8-core Intel Xeon E5620 CPUs @ 2.40 GHz
• Gigabit Ethernet

Monitoring Tools
Resource   Command
CPU        iostat -c 1
Disk       iostat -d 1
Network    netstat -c -I

Monitoring Implementation
# Start iostat on each of the 26 cluster nodes, logging to /tmp/cpu.out
for q in $(seq -w 1 26); do
    ./ssh.exp disc$q.crc.nd.edu crobins9 $p
    date >> /tmp/cpu.out
    (iostat -c 1 >> /tmp/cpu.out) &
done

# submit and wait for link prediction

# Stop the iostat processes on every node
for q in $(seq -w 1 26); do
    ./ssh.exp disc$q.crc.nd.edu crobins9 $p
    ps aux | grep iostat | awk '{print $2}' | xargs kill -9
done

# Collect the logs from each node
for q in $(seq -w 1 26); do
    ./scp.exp disc$q.crc.nd.edu crobins9 $p
done

CPU

[Figure: CPU usage (%) over time (0-7000 s) during the link prediction run.]

Disk

[Figure: Disk blocks read (in 1k blocks) over time (0-7000 s), annotated with the LP Score and AUC phases.]

[Figure: Disk blocks written (in 1k blocks) over time (0-7000 s), annotated with the LP Score and AUC phases.]

Network

[Figure: Network data received (Mb/s) over time (0-7000 s), annotated with the LP Score and AUC phases.]

[Figure: Network data sent (Mb/s) over time (0-7000 s), annotated with the LP Score and AUC phases.]

Conclusions and Future Improvements

// Sampling-based AUC estimate: compare m = 3n randomly drawn
// (probe-edge score, non-existent-edge score) pairs and count wins (n1)
// and ties (n2). The arrays hold scores from the Probe Score and
// Non-Exist Score steps of the framework.
int n = 13000000;
double[] left = new double[n];    // scores of probe (held-out) edges
double[] right = new double[n];   // scores of non-existent edges
int n1 = 0, n2 = 0;
int m = 3 * n;
java.util.Random rand = new java.util.Random();
java.util.Random rand1 = new java.util.Random();
for (int i = 0; i < m; i++) {
    int index1 = rand.nextInt(n);
    int index2 = rand1.nextInt(n);
    double leftScore = left[index1];
    double rightScore = right[index2];
    if (leftScore > rightScore) {
        n1++;
    } else if (Math.abs(leftScore - rightScore) < 1E-6) {
        n2++;
    }
}
double AUC = (n1 + 0.5 * n2) / m;

Some Conclusions
• Data ≥ 1 GB → Hadoop useful
• 6 reducers (see the configuration sketch below)
• Multiple jobs with fewer reducers
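As an illustration of the reducer-count point above (a sketch, not the authors' driver; the job name and paths are placeholders), the number of reduce tasks can be fixed in a Hadoop job like this:

// Minimal sketch: fixing the number of reducers for a Hadoop job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReducerConfigExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "LPScore");
        job.setJarByClass(ReducerConfigExample.class);
        job.setNumReduceTasks(6);   // the reducer count found useful above
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a driver that parses Hadoop's generic options, the same setting can also be passed on the command line of older releases as -D mapred.reduce.tasks=6.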
