
How NOT To Write A Microbenchmark
or Lies, Damn Lies, and Microbenchmarks

Dr. Cliff Click, Senior Staff Engineer, Sun Microsystems

Microbenchmarks are a sharp knife

• Microbenchmarks are like a microscope
  – Magnification is high, but what the heck are you looking at?
• Like "truths, half-truths, and statistics", microbenchmarks can be very misleading


Session # 1816

Learning Objectives
• As a result of this presentation, you will be able to:
  – Recognize when a benchmark lies
  – Recognize what a benchmark can tell you (as opposed to what it purports to tell you)
  – Understand how some popular benchmarks are flawed
  – Write a microbenchmark that doesn't lie (to you)


Speaker's Qualifications
• Dr. Click is a Senior Staff Engineer at Sun Microsystems
• Dr. Click wrote his first compiler at age 16 and has been writing:
  – runtime compilers for 15 years, and
  – optimizing compilers for 10 years
• Dr. Click architected the HotSpot™ Server Compiler

Your JVM is Lousy Because it Doesn't Do... "X"
• I routinely get handed a microbenchmark and told "HotSpot doesn't do X"
• 49% chance it doesn't matter
• 49% chance they got fooled
• 1% chance HotSpot didn't do "X" but should
• 1% chance HotSpot didn't do "Y" but should

Agenda
• Recent real-world benchmark disaster
• Popular microbenchmarks and their flaws
• When to disbelieve a benchmark
• How to write your own

What is a Microbenchmark?
• Small program
  – Datasets may be large
• All time spent in a few lines of code
• Performance depends on how those few lines are compiled
• Goal: Discover some particular fact
• Remove all other variables

Why Run Microbenchmarks?
• Discover some targeted fact
  – Such as the cost of 'turned off' Asserts, or
  – Will this inline?
  – Will another layer of abstraction hurt performance?
  – My JIT is faster than your JIT
  – My Java is faster than your C
• Fun
• But dangerous!

How HotSpot Works
• HotSpot is a mixed-mode system
• Code first runs interpreted
  – Profiles gathered
• Hot code gets compiled
• Same code "just runs faster" after awhile
• Bail out to interpreter for rare events
  – Never-taken-before code
  – Class loading, initializing

Example from Magazine Editor
• What do (turned off) Asserts cost?
• Tiny 5-line function r/w's global variable
• Run in a loop 100,000,000 times
• With explicit check – 5 sec
• With Assert (off) – 0.2 sec
• With no test at all – 5 sec
• What did he really measure?

Assert Example

static int sval;            // Global variable

static int test_assert(int val) {
  assert (val >= 0) : "should be positive";
  sval = val * 6;           // Read/write global
  sval += 3;
  sval /= 2;
  return val + 2;
}

static void main(String args[]) {
  int REPEAT = Integer.parseInt(args[0]);
  int v = 0;
  for( int i = 0; i < REPEAT; i++ )
    v = test_assert(v);
}

Assert Example

static int sval;            // Global variable

static int test_explicit(int val) {
  if( val < 0 ) throw ...;
  sval = val * 6;           // Read/write global
  sval += 3;
  sval /= 2;
  return val + 2;
}

static void main(String args[]) {
  int REPEAT = Integer.parseInt(args[0]);
  int v = 0;
  for( int i = 0; i < REPEAT; i++ )
    v = test_explicit(v);
}

Assert Example (cont.)
• No synchronization ⇒ hoist static global into register, only do final write
• Small hot static function ⇒ inline into loop
• Asserts turned off ⇒ no test in code

static int sval;            // Global variable

static void main(String args[]) {
  int REPEAT = Integer.parseInt(args[0]);
  int v = 0;
  for( int i = 0; i < REPEAT; i++ ) {
    sval = (v*6+3)/2;
    v = v+2;
  }
}

Assert Example (cont.)
• Small loop ⇒ unroll it a lot
  – Need pre-loop to 'align' loop
• Again remove redundant writes to global

static int sval;            // Global variable

static void main(String args[]) {
  int REPEAT = Integer.parseInt(args[0]);
  int v = 0;
  // pre-loop goes here...
  for( int i = 0; i < REPEAT; i += 16 ) {
    sval = ((v+2*14)*6+3)/2;
    v = v+2*16;
  }
}

Benchmark is 'spammed'
• Loop body is basically 'dead'
• Unrolling speeds up by arbitrary factor
• Note: 0.2 sec / 100 Million iters * 450 MHz clock ⇒ loop runs too fast: < 1 clock/iter
• But yet... different loops run 20x faster
• What's really happening?
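The 'dead' loop problem above can be sketched in a few lines. This is my own illustration (class and method names are not from the talk): if the loop's result is never used, the JIT may delete the body entirely and the loop times at an arbitrary speed; writing the result to a volatile field, or printing it, keeps the work observable.

```java
// Hypothetical sketch: keeping a benchmark loop 'live' so it cannot be
// optimized away. The math mirrors the shape of the assert example.
public class DeadLoop {
    static volatile int sink;                 // volatile write defeats dead-code elimination

    static int body(int v) {
        return (v * 6 + 3) / 2;
    }

    public static void main(String[] args) {
        int v = 0;
        for (int i = 0; i < 1_000_000; i++)
            v = body(v);
        sink = v;                             // consume the result so the loop stays live
        System.out.println("final v = " + v);
    }
}
```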

A Tale of Four Loops

static void main(String args[]) {
  int REPEAT = Integer.parseInt(args[0]);
  int v = 0;
  for( int i=0; i<REPEAT; i++ ) v = test_no_check(v);   // hidden warmup loop: 0.2 sec
  long time1 = System.currentTimeMillis();
  for( int i=0; i<REPEAT; i++ ) v = test_explicit(v);   // Loop 1: 5 sec
  long time2 = System.currentTimeMillis();
  System.out.println("explicit check time="+...);
  for( int i=0; i<REPEAT; i++ ) v = test_assert(v);     // Loop 2: 0.2 sec
  long time3 = System.currentTimeMillis();
  System.out.println("assert time="+...);
  for( int i=0; i<REPEAT; i++ ) v = test_no_check(v);   // Loop 3: 5 sec
  long time4 = System.currentTimeMillis();
  System.out.println("no check time="+...);
}

On-Stack Replacement
• HotSpot is mixed mode: interp + compiled
• Hidden warmup loop starts interpreted
• Gets OSR'd: generate code for middle of loop
• 'explicit check' loop never yet executed, so...
  – Don't inline test_explicit(), no unroll
  – 'println' causes class loading ⇒ drop down to interpreter
• Hidden loop runs fast
• 'explicit check' loop runs slow

On-Stack Replacement (cont.)
• Continue 'assert' loop in interpreter
• Gets OSR'd: generate code for middle of loop
• 'no check' loop never yet executed, so...
  – Don't inline test_no_check(), no unroll
• Assert loop runs fast
• 'no check' loop runs slow

Dangers of Microbenchmarks
• Looking for cost of Asserts...
• But found OSR Policy Bug (fixed in 1.4.1)
• Why not found before?
  – Only impacts microbenchmarks
  – Not found in any major benchmark
  – Not found in any major customer app
• Note: test_explicit and test_no_check: 5 sec
  – Check is compare & predicted branch
  – Too cheap... hidden in other math

Self-Calibrated Benchmarks
• Want a robust microbenchmark that runs on a wide variety of machines
• How long should loop run?
  – Depends on machine!
• Time a few iterations to get a sense of speed
  – Then time enough runs to reach steady state
• Report iterations/sec
• Basic idea is sound, but devil is in the details
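The self-calibration idea can be sketched as below. All names, thresholds, and the work function are my assumptions, not code from any benchmark discussed here: time a small count, double it until a run takes long enough to be well above clock jitter, then report iterations per second.

```java
// Hypothetical self-calibrating harness sketch.
public class SelfCalibrate {
    static long workUnit(long iters) {        // stand-in for the code under test
        long v = 0;
        for (long i = 0; i < iters; i++)
            v += i ^ (v << 1);                // cheap, non-dead arithmetic
        return v;
    }

    static long calibrate(long minMillis) {   // find a count that runs >= minMillis
        long iters = 1;
        while (true) {
            long t0 = System.currentTimeMillis();
            workUnit(iters);
            if (System.currentTimeMillis() - t0 >= minMillis) return iters;
            iters *= 2;                       // double until the run is long enough
        }
    }

    public static void main(String[] args) {
        long iters = calibrate(100);          // aim well above millisecond jitter
        long t0 = System.currentTimeMillis();
        long result = workUnit(iters);
        long elapsed = Math.max(1, System.currentTimeMillis() - t0);
        System.out.println("iters/sec = " + (iters * 1000L / elapsed)
                           + " (result " + result + ")");
    }
}
```

Note that this sketch still has the flaws the following slides describe: the first calibration runs may be interpreted, and a short target time leaves the result inside the clock jitter.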

Self-Calibrated Benchmarks
• First timing loop runs interpreted
  – Also pays compile-time cost
• Count needed to run reasonable time is low
• Main timed run uses fast compiled code
• Runs too fast to get result above clock jitter
• Divide small count by clock jitter ⇒ random result
• Report random score

jBYTEmark Methodology
• Count iterations to run for 10 msec
  – First run takes 104 msec
  – Way too short, always = 1 iteration
• Then use 1st run time to compute iterations needed to run 100 msec
  – Way too short, always = 1 iteration
• Then average 5 runs
  – Next 5 runs take 80, 20, 60, 20, 20 msec
  – Clock jitter = 1 msec

Non-constant work per iter
• Non-sync'd String speed test
• Run with & without some synchronization
• Use concat, continuously add Strings
• Benchmark requires copy before add
• Self-timed loop has some jitter, so...
  – Each iteration takes more work than before
  – 10000 main-loop iters with sync
  – 10030 main-loop iters withOUT sync

Non-constant work per iter (cont.)
• Work for Loop1:
  – 10000² copies, 10000 concats, 10000 syncs
• Work for Loop2:
  – 10030² copies, 10000 concats, no syncs
• Loop2 avoids 10000 syncs
• But pays 600,900 more copies
• Benchmark compares times, BUT
• Extra work swamps sync cost!
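The copy-before-add trap above can be sketched as follows (class and method names are mine, not the benchmark's): appending to a String copies the whole string each time, so work per iteration grows as the loop runs, and two loops with different iteration counts are not doing comparable work. A StringBuilder keeps the per-iteration cost roughly constant.

```java
// Hypothetical sketch: growing vs. constant work per iteration.
public class ConstantWork {
    static String growingWork(int iters) {    // O(n^2): copy-before-append each time
        String s = "";
        for (int i = 0; i < iters; i++)
            s = s + "x";
        return s;
    }

    static String constantWork(int iters) {   // amortized O(n): appends in place
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < iters; i++)
            sb.append("x");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same result, very different cost curve as iters grows.
        System.out.println(growingWork(5).equals(constantWork(5)));
    }
}
```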

Beware 'hidden' Cache Blowout
• Work appears to be linear with time
• More iterations should report same work/sec
• Grow dataset to run longer
• Blow out caches!
• Self-calibrated loop jitter can make your dataset fit in L1 cache or not!
• Jitter strongly affects apparent work/sec rate

Beware 'hidden' Cache Blowout (cont.)
• jBYTEmark StringSort with time raised to 1 sec
• Each iteration adds a 4K array
• Needs 1 or 2 iterations to hit 100 msec
• Scale by 10 to reach 1 sec
• Uses either 10 or 20 4K arrays
• Working set is either 40K or 80K
• 64K L1 cache
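One way to avoid this trap, sketched here with sizes and names that are purely illustrative: keep the dataset a fixed size so cache behavior is identical on every run, and scale the repetition count, not the data, to make the benchmark run longer.

```java
// Hypothetical sketch: fixed working set, variable repetition count.
public class FixedDataset {
    static final int N = 4 * 1024;            // fixed working set: 16KB of ints
    static final int[] data = new int[N];

    static long pass() {                      // one pass over the fixed dataset
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += data[i];
        return sum;
    }

    public static void main(String[] args) {
        int reps = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        long sum = 0;
        for (int r = 0; r < reps; r++)        // run longer via more reps, same data
            sum += pass();
        System.out.println(sum);
    }
}
```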

Explicitly Handle GC
• GC pauses occur at unpredictable times
• GC throughput is predictable
• Either don't allocate in your main loops
  – So no GC pauses
• OR run long enough to reach GC steady state
  – Each run spends roughly same time doing GC
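The first option above can be sketched like this (the pattern and names are my own, not from the talk): keep the timed loop allocation-free so no GC pause can land inside it, and hint a collection beforehand so each run starts from a similar heap state. System.gc() is only a hint to the JVM, so this is best-effort.

```java
// Hypothetical sketch: allocation-free timed loop with a pre-timing GC hint.
public class GcAware {
    static long sumTo(long n) {               // allocation-free work
        long v = 0;
        for (long i = 0; i < n; i++)
            v += i;
        return v;
    }

    public static void main(String[] args) {
        System.gc();                          // settle the heap before timing (hint only)
        long t0 = System.currentTimeMillis();
        long result = sumTo(50_000_000L);
        long elapsed = System.currentTimeMillis() - t0;
        System.out.println("time=" + elapsed + "ms result=" + result);
    }
}
```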

CaffeineMark3 Logic Benchmark
• Purports to test logical operation speed
• With unrolling OR constant propagation, whole function is dead (oops!)
• First timing loop: 900K iterations to run 1 sec
• Second loop runs 2.7M iterations in 0.685 sec
• Score reports as negative (overflow) or huge
• Overall score dominated by Logic score

Know what you test
• SpecJVM98 209_db
  – 85% of time spent in shell_sort
• Scimark2 MonteCarlo simulation
  – 80% of time spent in sync'd Random.next
• jBYTEmark StringSort
  – 75% of time spent in byte copy
• CaffeineMark Logic test
  – "He's dead, Jim"

How To Write a Microbenchmark (general – not just Java)
• Pick suitable goals!
  – Can't judge web server performance by jBYTEmarks
  – Only answer very narrow questions
  – Make sure you actually care about the answer!
• Be aware of system load, clock granularity
• Self-calibrated loops always have jitter
  – Sanity check the time on the main run!
• Run to steady state, at least 10 sec

How To Write a Microbenchmark (general – not just Java)
• Fixed size datasets (cache effects)
  – Large datasets have page coloring issues
  – Or results not comparable
• Constant amount of work per iteration
• Avoid the 'eqntott Syndrome'
  – Capture 1st run result (interpreted)
  – Compare to last result

How To Write a Microbenchmark (general – not just Java)
• Avoid 'dead' loops
  – These can have infinite speedups!
  – Print final answer
  – Make computation non-trivial
  – Report when speedup is unreasonable
  – CaffeineMark3 & JavaGrande Section 1 mix Infinities and reasonable scores
• Or handle dead loops

How To Write a Microbenchmark (Java-specific)
• Be explicit about GC
• Thread scheduling is not deterministic
• JIT performance may change over time
• Warmup loop + test code before ANY timing
  – (HotSpot specific)
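The warmup advice above can be sketched minimally (names and iteration counts are assumptions): run the code under test once untimed so the JIT has a chance to compile it, then take the real measurement.

```java
// Hypothetical sketch: discard a warmup measurement before the timed run.
public class Warmup {
    static volatile int sink;                 // keeps the loop result live

    static int testMe(int v) {
        return (v * 6 + 3) / 2;
    }

    static long timeLoop(int iters) {
        long t0 = System.nanoTime();
        int v = 0;
        for (int i = 0; i < iters; i++)
            v = testMe(v);
        sink = v;                             // consume result; loop cannot be deleted
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        timeLoop(1_000_000);                  // warmup: discard this measurement
        System.out.println("timed run: " + timeLoop(10_000_000) + " ns");
    }
}
```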

OSR Bug Avoidance
• Fixed in 1.4.1
• Write 1 loop per method for microbenchmarks
• Doesn't matter for 'real' apps
  – Don't write 1-loop-per-method code to avoid bug
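The one-loop-per-method workaround can be sketched as below (method names are mine): giving each timed loop its own method means the OSR compilation triggered by one loop cannot leave the next loop running unoptimized inside the same frame.

```java
// Hypothetical sketch: each timed loop lives in its own method.
public class OneLoopPerMethod {
    static int body(int v) { return v + 2; }

    static int loopA(int reps) {              // first timed loop, own method
        int v = 0;
        for (int i = 0; i < reps; i++) v = body(v);
        return v;
    }

    static int loopB(int reps) {              // second timed loop, own method
        int v = 0;
        for (int i = 0; i < reps; i++) v = body(v);
        return v;
    }

    public static void main(String[] args) {
        System.out.println(loopA(1000) + " " + loopB(1000));
    }
}
```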

Summary
• Microbenchmarks can be easy, fun, informative
• They can be very misleading! Beware!
• Pick suitable goals
• Warmup code before timing
• Run reasonable length of time
• Fixed work / fixed datasets per iteration
• Sanity-check final times and results
• Handle 'dead' loops, GC issues

If You Only Remember One Thing…
Put Microtrust in a Microbenchmark
