2009.Scalable Phoenix.iiswc.slides
Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System
Richard Yoo, Anthony Romano, Christos Kozyrakis
Stanford University
http://mapreduce.stanford.edu

Yoo, Phoenix2 October 6, 2009
Talk in a Nutshell
Scaling a shared-memory MapReduce system on a 256-thread
machine with NUMA characteristics
Major challenges & solutions
• Memory mgmt and locality => locality-aware task distribution
• Data structure design => mechanisms to tolerate NUMA latencies
• Interactions with the OS => thread pool and concurrent allocators
Results & lessons learnt
• Improved speedup by up to 19x (average 2.5x)
• Scalability of the OS still the major bottleneck
Background

MapReduce and Phoenix
MapReduce
• A functional parallel programming framework for large clusters
• Users only provide map / reduce functions
Map: processes input data to generate intermediate key / value pairs
Reduce: merges intermediate pairs with the same key
• Runtime for MapReduce
Automatically parallelizes computation
Manages data distribution / result collection
Phoenix: shared-memory implementation of MapReduce
• An efficient programming model for both CMPs and SMPs [HPCA’07]

Phoenix on a 256-Thread System
4 UltraSPARC T2+ chips connected by a single hub chip
1. Large number of threads (256 HW threads)
2. Non-uniform memory access (NUMA) characteristics
300 cycles to access local memory, +100 cycles for remote memory
[Diagram: chips 0–3, each with local memory (mem 0–3), connected through a central hub]

The Problem: Application Scalability
Baseline Phoenix scales well on a single socket machine
Performance plummets with multiple sockets & large thread counts
[Figure: Speedup on a Single-Socket UltraSPARC T2 (1–64 threads) and on the 4-Socket UltraSPARC T2+ (1–256 threads) — speedup vs. #threads for histogram, kmeans, linear_regression, pca, string_match, and word_count]

The Problem: OS Scalability
OS / libraries exhibit NUMA effects as well
• Latency increases rapidly when crossing chip boundary
• Similar behavior on a 32-core Opteron running Linux
[Figure: Synchronization Primitive Performance on the 4-Socket Machine — execution time (usec, 0–800) vs. #threads (1–127) for pthread_mutex and pthread_semaphore]
Optimizing the Phoenix Runtime
on a Large-Scale NUMA System

Optimization Approach
Focus on the unique position of runtimes in a software stack
• Runtimes exhibit complex interactions with user code & OS
Optimization approach should be multi-layered as well
• Algorithm should be NUMA aware
• Implementation should be optimized around NUMA challenges
• OS interaction should be minimized as much as possible
[Diagram: software stack App / Phoenix Runtime / OS / HW, annotated with the three optimization levels — Algorithmic, Implementation, OS Interaction]

Algorithmic Optimizations
Algorithmic Optimizations (contd.)
Runtime algorithm itself should be NUMA-aware
Problem: original Phoenix did not distinguish local vs. remote threads
• On Solaris, the physical frames for mmap()ed data are spread across multiple locality groups (a chip + a dedicated memory channel)
• Blind task assignment can have local threads work on remote data
[Diagram: mmap()ed pages land in mem 0–3 across the machine, so threads on each chip make remote accesses through the hub for data homed on other chips]

Algorithmic Optimizations (contd.)
Solution: locality-aware task distribution
• Utilize per-locality group task queues
• Distribute tasks according to their locality group
• Threads work on their local task queue first, then perform task stealing
[Diagram: one task queue per locality group; tasks are enqueued on the chip whose local memory (mem 0–3) holds their data]

Implementation Optimizations
Implementation Optimizations (contd.)
Runtime implementation should handle large data sets efficiently
Problem: the Phoenix core data structure is not efficient at handling large-scale data
Map Phase
• Each column of pointers amounts to a fixed-size hash table
• keys_array and vals_array all thread-local
[Diagram: intermediate data is a 2-D array of pointers, num_map_threads columns × num_reduce_tasks rows. Each column serves as a fixed-size hash table for one map thread, with hash("orange") selecting the row; each cell points to a thread-local keys_array and per-key vals_array. Problems: too many keys per bucket, and repeated buffer reallocations]
Implementation Optimizations (contd.)
Reduce Phase
• Each row amounts to one reduce task
• Mismatch in access pattern results in remote accesses
[Diagram: each row of the 2-D pointer array is one reduce task. The reduce thread walks the row across all map columns, copying each thread-local vals_array (e.g. {2, 4, 1, 1} and {5, 3, 1, 1} for "orange") into one large chunk of contiguous memory before passing it to the user reduce function; the mismatch between column-major writes and row-major reads causes remote accesses]
Implementation Optimizations (contd.)
Solution 1: make the hash bucket count user-tunable
• Adjust the bucket count to get few keys per bucket
[Diagram: the same 2-D pointer array, now with enough buckets that each holds only a few keys]
Implementation Optimizations (contd.)
Solution 2: implement iterator interface to vals_array
• Removed copying / allocating the large value array
• Buffer implemented as distributed chunks of memory
• Implemented prefetch mechanism behind the interface
[Diagram: the user reduce function now receives &vals_array references through the iterator instead of one copied contiguous buffer; the runtime prefetches the next chunk behind the interface]
Other Optimizations Tried
Replace hash table with more sophisticated data structures
• Large amount of access traffic
• Simple changes negated the performance improvement
E.g., excessive pointer indirection
Combiners
• Only work for commutative and associative reduce functions
• Perform local reduction at the end of the map phase
• Little difference once the prefetcher was in place
Could be good for energy
See paper for details

OS Interaction Optimizations
OS Interaction Optimizations (contd.)
Runtimes should deliberately manage OS interactions
1. Memory management => memory allocator performance
• Problem: large, unpredictable amount of intermediate / final data
• Solution
Sensitivity study on various memory allocators
At high thread count, allocator performance limited by sbrk()
2. Thread creation => mmap()
• Problem: stack deallocation (munmap()) in thread join
• Solution
Implement thread pool
Reuse threads over various MapReduce phases and instances
Results

Experiment Settings
4-Socket UltraSPARC T2+
Workloads released in the original Phoenix
• Input set significantly increased to stress the large-scale machine
Solaris 5.10, GCC 4.2.1 -O3
Similar performance improvements and challenges on a 32-thread Opteron system (8 sockets, quad-core chips) running Linux

Scalability Summary
Significant scalability improvement
Scalability of the Original Phoenix vs. the Optimized Version
• Optimized version: workloads (histogram, kmeans, linear_regression, matrix_multiply, pca, string_match, word_count) scale up to 256 threads
• Remaining limits come from OS scalability issues
[Figure: speedup (0–120) vs. #threads (1–256) for each workload, original and optimized]

Execution Time Improvement
Optimizations more effective for NUMA
Relative Speedup over the Original Phoenix
• Little variation at small thread counts: average 1.5x, max 2.8x
• Significant improvement at large thread counts, where NUMA effects dominate: average 2.53x, max 19x

Analysis: Thread Pool
kmeans performs a sequence of MapReduces
• 160 iterations, 163,840 threads
Thread pool effectively reduces the number of calls to munmap()
Number of Calls to munmap() on kmeans:

  threads    before    after
        8        20       10
       16     1,947       13
       32     4,499       18
       64     9,956       33
      128    14,661       44
      256    14,697      102

kmeans Performance Improvement due to Thread Pool: 3.47x improvement

Analysis: Locality-Aware Task Distribution
Locality group hit rate (% of tasks supplied from local memory)
Significant locality group hit rate improvement in the NUMA environment
• Improved hit rates = improved performance
• Forced misses result in similar hit rate
[Figure: Locality Group Hit Rate on string_match — hit rate and relative speedup vs. #threads, baseline vs. locality-aware]

Analysis: Hash Table Size
No single hash table size worked for all the workloads
• Some workloads generated only a small / fixed number of unique keys
• For those that did benefit, the improvement was not consistent
Recommended values provided for each application
word_count Sensitivity to Hash Table Size
• Comparing 256 vs. 8k buckets: the trend reverses with more threads
[Figure: normalized execution time vs. #threads (32–256)]
kmeans Sensitivity to Hash Table Size
• Comparing 256 vs. 8k buckets: no thread count leads to a speedup
[Figure: normalized execution time vs. #threads (32–256)]
Why Are Some Applications Not Scaling?

Non-Scalable Workloads
Non-scalable workloads shared two common trends
1. Significant idle time increase
2. Increased portion of kernel time over total useful computation
Execution Time Breakdown on histogram
• Idle time increases with thread count
• Kernel (sys) time increases significantly relative to useful computation
[Figure: sys / user / idle execution time (sec) and sys-to-effective-time ratio vs. #threads (32–256)]

Profiler Analysis
histogram
• 64% of execution time spent idling on data page faults
linear_regression
• 63% of execution time spent idling on data page faults
word_count
• 28% of execution time spent in sbrk(), called inside the memory allocator
• 27% of execution time spent idling on data pages
Memory allocator and mmap() turned out to be the bottleneck
Not a physical I/O problem
• OS buffer cache warmed up by repeating the same experiment with the same input

Memory Allocator Scalability
[Figure: Memory Allocator Scalability Comparison on word_count — speedup (0–30) vs. #threads (1–256) for malloc, mtmalloc, hoard, libumem, and libumem_mmap]
sbrk() scalability a major issue
• A single user-level lock serialized accesses
• Per-address space locks protected in-kernel virtual memory objects
mmap() even worse

mmap() Scalability
Microbenchmark: mmap() a user file and compute the sum by streaming through data chunks
[Figure: mmap() Microbenchmark Scalability — speedup vs. #threads (1–256) for 8k, 64k, 4m, and 256m chunk sizes]
mmap() alone does not scale
Kernel lock serialization on the per-process page table

Conclusion
Multi-layered optimization approach proved to be effective
• Average 2.5x speedup, maximum 19x
OS scalability issues need to be addressed for further scalability
• Memory management and I/O
• Opens up a new research opportunity

Questions?
The Phoenix System for MapReduce Programming, v2.0
• Publicly available at http://mapreduce.stanford.edu
