Professional Documents
Culture Documents
hub
Speedup
Speedupon
onaaSingle
4-Socket
Socket
UltraSPARC
UltraSPARC
T2+ T2
Synchronization Primitive Performance on the 4-Socket Machine
App
Algorithmic Level
OS Interaction Level
OS
HW
App
Algorithmic Level
OS Interaction Level
OS
HW
Problem: original Phoenix did not distinguish local vs. remote threads
• On Solaris, the physical frames for mmap()ed data spread out across
multiple locality groups (a chip + a dedicated memory channel)
• Blind task assignment can have local threads work on remote data
mem 0
remote chip 0 chip 1 mem 1
access
hub
remote remote
mem 2
access chip 2 chip 3 mem 3
access
hub
App
Algorithmic Level
OS Interaction Level
OS
HW
Problem: Phoenix core data structure not efficient at handling large-scale data
Map Phase
• Each column of pointers amounts to a fixed-size hash table
• keys_array and vals_array all thread-local
map thread id
num_map_threads
“apple”
num_reduce_tasks
“banana”
hash(“orange”) too many
“orange”
keys 2 buffer
4 1
reallocations
“pear” vals_array
large chunk of
“orange” 2 4 1 1 5
contiguous 3 1 1
keys_array memory
remote Copy and pass to
2 4 access
1 user reduce function
vals_array
“apple”
“banana”
“orange” 2 4
“pear” vals_array
1 5 3 1 1
vals_array prefetch!
2-D array of pointers
“orange” 2&vals_array
4 1 1 &vals_array
5 3 1 1
keys_array
Copy and
Expose iterator
pass toto
2 4 1 user reduce function
vals_array
Combiners
• Only works for commutative and associative reduce functions
• Perform local reduction at the end of the map phase
• Little difference once the prefetcher was in place
Could be good for energy
App
Algorithmic Level
OS Interaction Level
OS
HW
#
"
limited
by
OS
# scalability
" " issues
# !"
significant improvement
average: 2.53x, max: 19x
Relative Speedup over the Original Phoenix
improved hit rates
= improved performance
Locality Group Hit Rate on string_match
threads
kmeans Sensitivity
word_count Sensitivity
to Hash
to Hash
Table
Table
SizeSize
kernel time increases
significantly
Execution Time Breakdown on histogram
Memory Allocator Scalability Comparison on word_count
mmap() Microbenchmark Scalability