
DDR Benchmarking Tools (LMBench)
Agenda -

We will look at the following LMBench tools in this session -

● LMBench introduction
● bw_mem
● bw_pipe
● lat_mem_rd
● lat_ctx
● lat_syscall
● stream
What is LMBench?

LMBench is a suite of micro-benchmarks intended to measure basic operating system and hardware system metrics.

The benchmarks fall into three general classes: bandwidth, latency, and "other".
bw_mem

● Times memory bandwidth.
● Usage - bw_mem [-P <parallelism>] [-W <warmup>] [-N <repetitions>] <size> <what>
● The first three options are similar for most of the benchmarks.
● The next slides explain the options in detail.
bw_mem : options
○ [-P <parallelism>]
■ Number of benchmark processes to run in parallel.
■ Useful for measuring the performance of parallel/distributed systems.
■ Used to evaluate a system's performance scalability.
○ [-W <warmup>]
■ Minimum number of microseconds the benchmark should execute the benchmarked capability before it begins measuring performance.
○ [-N <repetitions>]
■ Number of measurements the benchmark should take.
■ The default number of repetitions is 11.


bw_mem : options (contd…)
● size
○ Number of bytes to be transferred.
○ "k" means kilobytes and "m" means megabytes.
○ If no suffix is given, the size is assumed to be in bytes.
● what - specifies the benchmark variant, as follows -
○ rd
■ Measures the time to read data into the processor.
■ Allocates the specified amount of memory.
■ Zeros it.
■ Times the reading of that memory as a series of integer loads and adds.
■ Each four-byte integer is loaded and added to an accumulator.
bw_mem : options (contd…)
○ wr
■ Measures the time to write data to memory.
■ Allocates the specified amount of memory.
■ Zeros it.
■ Assigns a constant value to each member of an array of integer values.
■ Accesses every fourth word.
○ cp
■ Measures the time to copy data from one location to another.
■ Does an array copy: dest[i] = source[i].
■ Accesses every fourth word.
■ Allocates twice the specified amount of memory.
■ Zeros it.
■ Times the copying of the first half to the second half.
bw_mem : options (contd…)
○ rdwr
■ Measures the time to read data into memory and then write data to the same memory location.
■ For each array element it adds the current value to a running sum before assigning a new (constant) value to the element.
■ Accesses every fourth word.
○ frd
■ Measures the time to read data into the processor.
■ Computes the sum of an array of integer values.
○ fwr
■ Measures the time to write data to memory.
■ Assigns a constant value to each member of an array of integer values.


bw_mem : options (contd…)
○ fcp
■ Measures the time to copy data from one location to another.
■ Does an array copy: dest[i] = source[i].
○ bzero
■ Measures how fast the system can bzero memory.
○ bcopy
■ Measures how fast the system can bcopy data.


bw_mem : output
● The output format is "%0.2f %.2f\n", megabytes, megabytes_per_second.
● The first value is the amount of data transferred (in MB) and the second is the bandwidth (in MB/sec).
bw_pipe
● Times data movement through pipes.
● Creates a Unix pipe between two processes.
● Moves 50MB through the pipe in 64KB chunks.
● Output -
○ Output format is "Pipe bandwidth: %0.2f MB/sec\n", megabytes_per_second.
bw_pipe (contd…)
● MEMORY UTILIZATION
○ Can move up to six times the requested memory per process.
○ There are two processes - a sender and a receiver.
○ Each transfer is one read from the source and one write to the destination.
○ A write usually results in a cache line read, followed by a write-back of the cache line at some later point.
lat_mem_rd
● Memory read latency benchmark.
● Usage - lat_mem_rd [-P <parallelism>] [-W <warmup>] [-N <repetitions>] <len> [<stride>...]
● The first three options are the same as for bw_mem.
● len - size of the array used by the benchmark (in megabytes).
● stride - the step, in bytes, between successive memory accesses.
lat_mem_rd : explanation
● Measures memory read latency for varying memory sizes and strides.
● Results are reported in nanoseconds per load.
● The entire memory hierarchy is measured.
● Only data accesses are measured; the instruction cache is not measured.
● The benchmark runs as two nested loops: the outer loop is over stride sizes, the inner loop is over array sizes.
● For each array size, the benchmark creates a ring of pointers, each pointing forward one stride.
● The loop stops after doing a million loads.
● The size of the array varies from 512 bytes to eight megabytes.
● For the small sizes the cache will have an effect, and the loads will be much faster.
● The output gives the array size in megabytes and the load latency over all points in that array.
● The output is best examined in a graph.
lat_mem_rd : output
lat_mem_rd : stride effects
● A stride smaller than the cache line size improves apparent performance, since several consecutive accesses can hit the same cache line. A stride equal to or larger than the cache line size touches a new line on every access, exposing the full memory latency.
lat_ctx
● Context switching benchmark.
● Usage - lat_ctx [-P <parallelism>] [-W <warmup>] [-N <repetitions>] [-s <kbytes>] <processes ...>
● The first three options are the same as before.
● [-s kbytes] - size of each process in kbytes.
● [processes...] - the number of processes.
● For example, lat_ctx -s 50 5 runs 5 processes of 50 kbytes each.
lat_ctx (contd…)
● Measures context switching time for any number of processes of any size.
● The processes are connected in a ring of Unix pipes.
● Each process reads a token from its pipe, possibly does some work, and then writes the token to the next process.
● The processes may vary in number.
● Smaller numbers of processes result in faster context switches.
● More than 20 processes is not supported.
● The processes may vary in size.
● A size of zero is the baseline process, which does nothing except pass the token on to the next process.
● A process size greater than zero means that the process does some work before passing on the token.
● The work is simulated as the summing of an array of the specified size.
● Both the data cache and the instruction cache get polluted by some amount before the token is passed on.
lat_ctx (contd...)
● The data cache gets polluted by approximately the process ``size''.
● The instruction cache gets polluted by a constant amount.
● Pollution of the caches results in larger context switching times for the larger processes.
● As the number and size of the processes increase, the caches become more and more polluted, until the set of processes no longer fits.
● The context switch times then go up because

context switch time = switch time + time to restore process state (+ cache state).

● This means that the reported switch time includes the cache misses incurred by larger processes.
● The output reports the size and the non-context-switching overhead of the test.
● The overhead and the context switch times are in microseconds.
lat_syscall
● Times simple entry into the operating system.
● Usage - lat_syscall [-P <parallelism>] [-W <warmup>] [-N <repetitions>] null|read|write|stat|fstat|open [<file>]
● After the first three options it takes one of null, read, write, stat, fstat, or open to select the operation to time.
● The last (optional) argument is the path to a file.
lat_syscall (contd…)
● Explaining the different options -
○ null - measures how long it takes to do getppid().
○ read - measures how long it takes to read one byte from /dev/zero.
○ write - measures how long it takes to write one byte to /dev/null.
○ stat - measures how long it takes to stat() a file whose inode is already cached.
○ fstat - measures how long it takes to fstat() an open file whose inode is already cached.
○ open - measures how long it takes to open() and then close() a file.
● The output reports the latency of the chosen operation in microseconds.
stream

● A synthetic benchmark.
● Measures sustainable memory bandwidth (in MB/s).
● Computer CPUs have been getting faster much more quickly than computer memory systems.
● More and more programs will be limited in performance by the memory bandwidth of the system, rather than by the computational performance of the CPU.
● Estimates from the time STREAM was introduced suggest that the speed of the fastest available microprocessors was increasing at approximately 80% per year, while the speed of memory devices was growing at only about 7% per year.
● An increasing fraction of computer workloads will therefore be dominated by memory access and transfer times rather than by compute time.
● This suggests adopting memory latency and sustainable memory bandwidth as figures of merit for modern high-performance computers.
stream (contd..)
● Each of the four tests adds independent information to the results:
○ ``Copy'' measures transfer rates in the absence of arithmetic.
○ ``Scale'' adds a simple arithmetic operation.
○ ``Sum'' adds a third operand to allow multiple load/store ports on vector machines to be tested.
○ ``Triad'' allows chained/overlapped/fused multiply/add operations.
stream (contd..)
● Output - reports the best sustained bandwidth (in MB/s) achieved for each of the four kernels.
References
● Manual pages available at -
○ http://bitmover.com/lmbench/man_lmbench.html
● For lat_mem_rd -
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Untangling%20memory%20access%20measurements%20-%20memory%20latency
● For stream benchmark -
○ http://www.cs.virginia.edu/stream/ref.html
● LMBench download link -
○ http://bitmover.com/lmbench/get_lmbench.html
THANK YOU !!!!
Explaining the benchmp function
● It is used in all the benchmarks explained above.
● It is defined in lib_timing.c.
● Its definition is as follows -

void benchmp(initialize, benchmark, cleanup, enough, parallel, warmup, repetitions, cookie)

● It measures the performance of benchmark repeatedly and reports the median result.
● benchmp creates parallel sub-processes which run benchmark in parallel.
● This allows lmbench to measure the system's ability to scale as the number of client processes increases.
Explaining the arguments of benchmp
● Each sub-process executes initialize before starting the benchmarking cycle, with iterations set to 0.
● It then calls initialize, benchmark, and cleanup, with iterations set to the number of iterations in the timing loop, several times in order to collect repetitions results.
● The calls to benchmark are surrounded by start and stop calls to time how long it takes to do the benchmarked operation iterations times.
● After all the benchmark results have been collected, cleanup is called with iterations set to 0 to clean up any resources that may have been allocated by initialize or benchmark.
● cookie is a void pointer to a chunk of memory that can be used to store any parameters or state needed by the benchmark.
