
DDR Benchmarking Tools (LMBench)
Agenda -

We will look at the following LMBench tools in this session -

● LMBench introduction
● bw_mem
● bw_pipe
● lat_mem_rd
● lat_ctx
● lat_syscall
● stream
What is LMBench?

LMBench is a suite of micro-benchmarks intended to measure basic operating system and hardware system metrics.

The benchmarks fall into three general classes: bandwidth, latency, and "other".
bw_mem

● Times memory bandwidth.
● Usage - bw_mem [-P <parallelism>] [-W <warmup>] [-N <repetitions>] <size> <what>
● The first three options are similar for most of the benchmarks.
● The next slides explain the options in detail.
bw_mem : options
○ [-P <parallelism>]
■ Number of benchmark processes to run in parallel.
■ Useful for measuring the performance of parallel/distributed systems.
■ Used to evaluate a system's performance scalability.
○ [-W <warmup>]
■ Minimum number of microseconds the benchmark should execute the benchmarked capability before it begins measuring performance.
○ [-N <repetitions>]
■ Number of measurements the benchmark should take.
■ The default number of repetitions is 11.


bw_mem : options (contd…)
● size
○ Number of bytes to be transferred.
○ "k" means kilobytes and "m" means megabytes.
○ If no suffix is given, the size is assumed to be in bytes.
● what - specifies the benchmark variant, as follows -
○ rd
■ Measures the time to read data into the processor.
■ Allocates the specified amount of memory.
■ Zeros it.
■ Times the reading of that memory as a series of integer loads and adds.
■ Each four-byte integer is loaded and added to an accumulator.
bw_mem : options (contd…)
○ wr
■ Measures the time to write data to memory.
■ Allocates the specified amount of memory.
■ Zeros it.
■ Assigns a constant value to each member of an array of integer values.
■ Accesses every fourth word.
○ cp
■ Measures the time to copy data from one location to another.
■ Does an array copy: dest[i] = source[i].
■ Accesses every fourth word.
■ Allocates twice the specified amount of memory.
■ Zeros it.
■ Times the copying of the first half to the second half.
bw_mem : options (contd…)
○ rdwr
■ Measures the time to read data into memory and then write data to the same memory location.
■ For each array element it adds the current value to a running sum before assigning a new (constant) value to the element.
■ Accesses every fourth word.
○ frd
■ Measures the time to read data into the processor.
■ Computes the sum of an array of integer values.
○ fwr
■ Measures the time to write data to memory.
■ Assigns a constant value to each member of an array of integer values.


bw_mem : options (contd…)
○ fcp
■ Measures the time to copy data from one location to another.
■ Does an array copy: dest[i] = source[i].
○ bzero
■ Measures how fast the system can bzero memory.
○ bcopy
■ Measures how fast the system can bcopy data.


bw_mem : output
● The output format is "%0.2f %.2f\n", megabytes, megabytes_per_second.
● The first value is the amount of data transferred (in MB) and the second is the bandwidth (in MB/sec).
bw_pipe
● Times data movement through pipes.
● Creates a Unix pipe between two processes.
● Moves 50MB through the pipe in 64KB chunks.
● Output -
○ Output format is "Pipe bandwidth: %0.2f MB/sec\n", megabytes_per_second.
bw_pipe (contd…)
● MEMORY UTILIZATION
○ Can move up to six times the requested memory per process.
○ There are two processes - a sender and a receiver.
○ Each transfer is one read from the source and one write to the destination.
○ A write usually results in a cache line read, followed by a write-back of the cache line at some later point.
lat_mem_rd
● Memory read latency benchmark.
● Usage - lat_mem_rd [-P <parallelism>] [-W <warmup>] [-N <repetitions>] <len> [<stride>...]
● The first three options are the same as for bw_mem.
● len - size of the array used by the benchmark (in megabytes).
● stride - the step, in bytes, between successive memory accesses.
lat_mem_rd : explanation
● Measures memory read latency for varying memory sizes and strides.
● Results are reported in nanoseconds per load.
● The entire memory hierarchy is measured.
● Only data accesses are measured; the instruction cache is not measured.
● The benchmark runs as two nested loops: the outer loop is over stride sizes, the inner loop is over array sizes.
● For each array size, the benchmark creates a ring of pointers, each pointing forward one stride.
● The loop stops after doing a million loads.
● The size of the array varies from 512 bytes to eight megabytes.
● For the small sizes the cache will have an effect, and the loads will be much faster.
● The output gives the array size in megabytes and the load latency over all points in that array.
● The output is best examined in a graph.
lat_mem_rd : output
lat_mem_rd : stride effects
● A stride smaller than the cache line size improves apparent performance, since several consecutive accesses can hit the same cache line. A stride equal to or larger than the cache line size touches a new line on every access, exposing the full memory latency.
lat_ctx
● Context switching benchmark.
● Usage - lat_ctx [-P <parallelism>] [-W <warmup>] [-N <repetitions>] [-s <kbytes>] <processes ...>
● The first three options are the same as before.
● [-s kbytes] - size of each process in kbytes.
● [processes...] - the number of processes.
● For example, lat_ctx -s 50 5 runs 5 processes of 50 kbytes each.
lat_ctx (contd…)
● Measures context switching time for any number of processes of any size.
● The processes are connected in a ring of Unix pipes.
● Each process reads a token from its pipe, possibly does some work, and then writes the token to the next process.
● The processes may vary in number.
● Smaller numbers of processes result in faster context switches.
● More than 20 processes is not supported.
● The processes may vary in size.
● A size of zero is the baseline process, which does nothing except pass the token on to the next process.
● A process size greater than zero means that the process does some work before passing on the token.
● The work is simulated as the summing of an array of the specified size.
● Both the data cache and the instruction cache get polluted by some amount before the token is passed on.
lat_ctx (contd...)
● The data cache gets polluted by approximately the process ``size''.
● The instruction cache gets polluted by a constant amount.
● Pollution of the caches results in larger context switching times for the larger processes.
● As the number and size of the processes increase, the caches become more and more polluted, until the set of processes no longer fits.
● The context switch times then go up because

context switch time = switch time + time to restore process state (+ cache state).

● This means that the reported switch time includes the cache misses incurred by larger processes.
● The output reports the size and the non-context-switching overhead of the test.
● The overhead and the context switch times are in microseconds.
lat_syscall
● Times simple entry into the operating system.
● Usage - lat_syscall [-P <parallelism>] [-W <warmup>] [-N <repetitions>] null|read|write|stat|fstat|open [<file>]
● After the first three options it takes one of null, read, write, stat, fstat, or open to select the operation to time.
● The last (optional) argument is the path to a file.
lat_syscall (contd…)
● Explaining the different options -
○ null - measures how long it takes to do getppid().
○ read - measures how long it takes to read one byte from /dev/zero.
○ write - measures how long it takes to write one byte to /dev/null.
○ stat - measures how long it takes to stat() a file whose inode is already cached.
○ fstat - measures how long it takes to fstat() an open file whose inode is already cached.
○ open - measures how long it takes to open() and then close() a file.
● The output reports the latency of the chosen operation in microseconds.
stream

● A synthetic benchmark.
● Measures sustainable memory bandwidth (in MB/s).
● Computer CPUs have been getting faster much more quickly than computer memory systems.
● More and more programs will be limited in performance by the memory bandwidth of the system, rather than by the computational performance of the CPU.
● Estimates from the time STREAM was introduced suggest that the speed of the fastest available microprocessors was increasing at approximately 80% per year, while the speed of memory devices was growing at only about 7% per year.
● An increasing fraction of computer workloads will therefore be dominated by memory access and transfer times rather than by compute time.
● This suggests adopting memory latency and sustainable memory bandwidth as figures of merit for modern high-performance computers.
stream (contd..)
● Each of the four tests adds independent information to the results:
○ ``Copy'' measures transfer rates in the absence of arithmetic.
○ ``Scale'' adds a simple arithmetic operation.
○ ``Sum'' adds a third operand to allow multiple load/store ports on vector machines to be tested.
○ ``Triad'' allows chained/overlapped/fused multiply/add operations.
stream (contd..)
● Output - reports the best sustained bandwidth (in MB/s) achieved for each of the four kernels.
References
● Manual pages available at -
○ http://bitmover.com/lmbench/man_lmbench.html
● For lat_mem_rd -
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Untangling%20memory%20access%20measurements%20-%20memory%20latency
● For stream benchmark -
○ http://www.cs.virginia.edu/stream/ref.html
● LMBench download link -
○ http://bitmover.com/lmbench/get_lmbench.html
THANK YOU !!!!
Explaining the benchmp function
● It is used in all the benchmarks explained above.
● It is defined in lib_timing.c.
● Its definition is as follows -

void benchmp(initialize, benchmark, cleanup, enough, parallel, warmup, repetitions, cookie)

● It measures the performance of benchmark repeatedly and reports the median result.
● benchmp creates parallel sub-processes which run benchmark in parallel.
● This allows lmbench to measure the system's ability to scale as the number of client processes increases.
Explaining the arguments of benchmp
● Each sub-process executes initialize before starting the benchmarking cycle, with iterations set to 0.
● It then calls initialize, benchmark, and cleanup, with iterations set to the number of iterations in the timing loop, several times in order to collect repetitions results.
● The calls to benchmark are surrounded by start and stop calls to time how long it takes to do the benchmarked operation iterations times.
● After all the benchmark results have been collected, cleanup is called with iterations set to 0 to clean up any resources that may have been allocated by initialize or benchmark.
● cookie is a void pointer to a chunk of memory that can be used to store any parameters or state needed by the benchmark.
