Profiling Tools
By Vitaly Kroivets
for Software Design Seminar
Contents
Introduction
Software optimization process, optimization traps and pitfalls
Benchmark
Performance tools overview
Optimizing compilers
System Performance monitors
Profiling tools
GNU gprof
Intel VTune
Valgrind
What does it mean to use the system efficiently
The Problem
PC speed increased 500 times since 1981, but
today’s software is more complex and still
hungry for more resources
How to run faster on the same hardware and OS architecture?
Highly optimized applications run tens of times faster than poorly written ones.
Using efficient algorithms and well-designed implementations leads to high-performance applications
The Software Optimization Process
Hotspots are areas in your code that take a long time to execute
Create benchmark
Find hotspots
Investigate causes
Modify application
Retest using benchmark
Extreme Optimization Pitfalls
Large application’s performance cannot be
improved before it runs
Build the application then see what machine it
runs on
Runs great on my computer…
Debug versus release builds
Performance requires assembly language
programming
Code features first, then optimize if there is time left over
Key Point:
The Benchmark
A benchmark is a program that is used to
Objectively evaluate the performance of an application
Provide repeatable application behavior for use with performance analysis tools
Industry-standard benchmarks:
TPC-C, 3D-Winbench
http://www.specbench.com/
Enterprise Services
Graphics/Applications
HPC/OMP
Java Client/Server
Mail Servers
Network File System
Web Servers
Attributes of a good benchmark
Benchmark attributes (cont.)
Easy to run
Verifiable
the benchmark itself needs QA!
Measures elapsed time rather than some other number
Use benchmark to test functionality
Algorithmic tricks to gain performance may
break the application…
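The attributes above (easy to run, verifiable, based on elapsed time) can be sketched as a tiny harness. This is only an illustration in Python, not something from the original slides: `workload`, `run_benchmark`, the problem size and the closed-form expected value are all hypothetical stand-ins for a real application kernel and its known-good result.

```python
import time

def workload(n):
    """Hypothetical application kernel: sum of squares of 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_benchmark(func, arg, expected, repeats=5):
    """Run func(arg) several times, verify each result, return the best elapsed seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(arg)
        elapsed = time.perf_counter() - start
        # Verifiable: a run that is fast but wrong must fail the benchmark.
        if result != expected:
            raise ValueError(f"wrong result: {result} != {expected}")
        best = min(best, elapsed)
    return best

N = 100_000
EXPECTED = (N - 1) * N * (2 * N - 1) // 6  # closed form for the sum of squares 0..N-1
best_seconds = run_benchmark(workload, N, EXPECTED)
print(f"best of 5 runs: {best_seconds:.6f} s")
```

Taking the best of several runs is one common way to reduce noise from other system activity; a real benchmark might report the median or all samples instead.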
How to find performance bottlenecks
Determine how your system resources, such as
memory and processor, are being utilized to identify
system-level bottlenecks
Measure the execution time for each module and
function in your application
Determine how the various modules running on your
system affect the performance of each other
Identify the most time-consuming function calls and call
sequences within your application
Determine how your application is executing at the
processor level to identify microarchitecture-level
performance problems
Performance Tools Overview
Timing mechanisms
Stopwatch: UNIX time tool
Optimizing compiler (easy way)
System load monitors
vmstat, iostat, perfmon.exe, VTune Counter Monitor
Software profiler
gprof, VTune, Visual C++ Profiler, IBM Quantify
Memory debugger/profiler
Valgrind, IBM Purify, Parasoft Insure++
Using Optimizing Compilers
Optimizing Compiler: choosing a combination of optimization flags
Optimizing Compiler’s effect
Optimizing Compilers: Conclusions
Performance Monitor Counters
Profilers
Sampling vs. Instrumentation
                       Sampling                                        Instrumentation
Overhead               Typically about 1%                              High, may be 500%!
System-wide profiling  Yes, profiles all apps, drivers, OS functions   Just application and instrumented DLLs
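To make the instrumentation column concrete, here is a minimal sketch in Python, assuming a decorator is an acceptable stand-in for the code that an instrumenting profiler injects around each function. The names `instrumented`, `helper` and `work` are hypothetical. The bookkeeping runs on every single call, which is where the high overhead comes from.

```python
import functools
import time

call_counts = {}
total_seconds = {}

def instrumented(func):
    """Wrap func so that every call is counted and timed (instrumentation)."""
    name = func.__name__
    call_counts[name] = 0
    total_seconds[name] = 0.0

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # This bookkeeping executes on every call: the source of the overhead.
            call_counts[name] += 1
            total_seconds[name] += time.perf_counter() - start
    return wrapper

@instrumented
def helper(x):
    return x * 2

@instrumented
def work(n):
    return sum(helper(i) for i in range(n))

result = work(1000)
for name in call_counts:
    # Times are inclusive: the time for work() includes its calls to helper().
    print(name, call_counts[name], f"{total_seconds[name]:.6f} s")
```

Only decorated functions are measured, which mirrors the table: instrumentation sees just the instrumented code, while sampling sees everything running on the system.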
Profiling Tools
gprof: old, buggy and inaccurate
Intel VTune: $700, unstable
Valgrind: is not really a profiler…
GNU gprof
Using gprof GNU profiler
Compile and link your program with profiling
enabled
cc -g -c myprog.c utils.c -pg
cc -o myprog myprog.o utils.o -pg
Execute your program to generate a profile data file
The program will run normally (but slower) and will write the profile data into a file called gmon.out just before exiting
The program must exit normally, by returning from main() or by calling exit(), or gmon.out is not written
Run gprof to analyze the profile data
gprof myprog gmon.out
Example Program
Understanding Flat Profile
Flat profile : %time
Flat profile: Cumulative seconds
The cumulative total number of seconds spent in this function, plus the time spent in all the functions listed above this one
Flat profile: Self seconds
Flat profile: Calls
Number of times the function was invoked
Flat profile: Self seconds per call
Average number of seconds per call spent in this function alone
Flat profile: Total seconds per call
Average number of seconds spent in this function and its descendants per call
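The flat-profile columns described on the preceding slides are simple arithmetic over per-function measurements. The sketch below, in Python, derives them from made-up numbers; the measurements are hypothetical and are not the output of the slides' example program, and `flat_profile` is an illustrative helper, not gprof's implementation.

```python
# Hypothetical measurements:
# (function name, self seconds, number of calls, seconds spent in descendants)
measurements = [
    ("main", 1.0, 1, 9.0),
    ("g", 3.0, 1, 6.0),
    ("doit", 6.0, 3, 0.0),
]

def flat_profile(rows):
    """Derive the flat-profile columns, sorted by decreasing self time."""
    total_self = sum(self_s for _, self_s, _, _ in rows)
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    cumulative = 0.0
    table = []
    for name, self_s, calls, child_s in ordered:
        cumulative += self_s  # this function plus all functions listed above it
        table.append({
            "name": name,
            "pct_time": 100.0 * self_s / total_self,
            "cumulative_s": cumulative,
            "self_s": self_s,
            "calls": calls,
            "self_s_per_call": self_s / calls,
            "total_s_per_call": (self_s + child_s) / calls,
        })
    return table

table = flat_profile(measurements)
for row in table:
    print(f'{row["pct_time"]:6.1f} {row["cumulative_s"]:6.1f} {row["self_s"]:6.1f} '
          f'{row["calls"]:5d} {row["self_s_per_call"]:6.2f} '
          f'{row["total_s_per_call"]:6.2f}  {row["name"]}')
```

In this made-up data `g` spends 3 seconds in its own code but 9 seconds per call in total, because its descendant `doit` accounts for the other 6.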
Call Graph : call tree of the program
Called by: main()
Current function: g()
Descendants: doit()
Call Graph : understanding each line
Call Graph : parents numbers
Call Graph : “children” numbers
Current function: g()
VTune performance analyzer
VTune Modes/Features
Time- and Event-Based, System-Wide
Sampling provides developers with the most
accurate representation of their software's
actual performance with negligible overhead
Call Graph Profiling provides developers with a
pictorial view of program flow to quickly identify
critical functions and call sequences
Counter Monitor allows developers to readily
track system activity during runtime which helps
them identify system level performance issues
Sampling mode
Sampling Mode Benefits
Low-overhead, system-wide profiling helps you identify
which modules and functions are consuming the most
time, giving you a detailed look at your operating system
and application
Benefits of sampling:
Profiling to find hotspots. Find the module, functions, lines
of source code and assembly instructions that are
consuming the most time
Low overhead. Overhead incurred by sampling is typically
about one percent
No need to instrument code. You do not need to make any
changes to code to profile with sampling
How does sampling work?
Sampling interrupts the processor after a certain number of events and records the execution information in a buffer area. When the buffer is full, the information is copied to a file. After saving the information, the program resumes operation. In this way, the VTune™ analyzer maintains very low overhead (about one percent) while sampling
Time-based sampling: collects samples of active instruction addresses at regular time-based intervals (1 ms by default)
Event-based sampling: collects samples of active instruction addresses after a specified number of processor events
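Time-based sampling can be illustrated with a small deterministic simulation. This Python sketch is an assumption-laden toy, not how VTune is implemented: the `timeline` of which function is running and for how long is invented, and `sample` simply records the active function at every tick.

```python
from collections import Counter

# Hypothetical execution timeline: (function executing, duration in ms)
timeline = [("main", 5), ("test_if", 60), ("main", 5), ("io_wait", 30)]

def sample(timeline, interval_ms=1):
    """Record which function is active at t = 0, interval_ms, 2 * interval_ms, ..."""
    end_times = []
    t_end = 0
    for name, duration in timeline:
        t_end += duration
        end_times.append((t_end, name))

    hits = Counter()
    t = 0
    segment = 0
    while t < t_end:
        # Advance to the timeline segment that contains time t.
        while t >= end_times[segment][0]:
            segment += 1
        hits[end_times[segment][1]] += 1
        t += interval_ms
    return hits

fine = sample(timeline, interval_ms=1)
coarse = sample(timeline, interval_ms=10)
for name, count in fine.most_common():
    print(f"{name}: {100 * count / sum(fine.values()):.0f}% of samples")
```

The profiled code is never modified; the cost is one small record per tick, so a longer interval lowers overhead further at the price of a coarser picture (here 10 samples instead of 100).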
Starting the Sampling Wizard
Hardware prevents sampling of many counters simultaneously
Unsupported CPU? Ha-ha-ha…
EBS (event-based sampling): choosing events
Events counted by VTune
Sampling …
Viewing Sampling Results
Process view: all the processes that ran on the system during data collection
Thread view: the threads that ran within the processes you select in Process view
Module view: the modules that ran within the selected processes and threads
Hotspot view: the functions within the modules you select in Module view
Different events collected – modules view
System-wide look at software running on the system
Our program
CPI (cycles per instruction): a good average indication
Hotspot Graph
Click on a hotspot bar: VTune displays the source code view
Each bar represents one of the functions of our program
Source View
Test_if function
Annotated Source View (% of module)
See how much time is spent on each line
Check this “for” loop!
10% of CPU time is spent in a few statements
VTune Tuning assistant
In a few clicks we reached the performance problem!
Now, how do we solve it?
Tuning Assistant highlights performance problems
Provides the approximate time lost to each performance problem
Database contains performance metrics based on Intel’s experience of tuning hundreds of applications
Analyzes the data gathered from our application
Generates tuning recommendations for each “hotspot”
Gives the user an idea of what might be done to fix the problem
Tuning Assistance Report
Hotspot Assistant Report : Penalties
Hotspot Assistant Report
Call Graph Mode
Provides a pictorial view of program flow to quickly identify critical functions and call sequences
Call graph profiling reveals:
Structure of your program on a function level
Number of times a function is called from a
particular location
The time spent in each function
Functions on a critical path.
Profiling Tools 57
Call Graph Screenshot
The function summary pane
Call Graph – Call List View
Caller Functions are the functions that called the Focus Function
Callee Functions are the functions that are called by the Focus Function
Counter Monitor
Use the Counter Monitor feature of the VTune™ analyzer to collect and display performance counter data. Counter Monitor selectively polls performance counters, which are grouped categorically into performance objects.
With the VTune analyzer, you can:
Monitor selected counters in performance objects.
Correlate performance counter data with data
collected by other features in the VTune analyzer,
such as sampling.
Trigger the collection of counter data on events other
than a periodic timer.
Getting Help
VTune Summary
Valgrind
Valgrind Toolkit
Memcheck is a memory debugger
detects memory-management problems
Cachegrind is a cache profiler
performs detailed simulation of the I1, D1 and L2 caches in your CPU
Massif is a heap profiler
performs detailed heap profiling by taking regular snapshots of a program's heap
Helgrind is a thread debugger
finds data races in multithreaded programs
Memcheck Features
When a program is run under Memcheck's supervision, all reads
and writes of memory are checked, and calls to
malloc/new/free/delete are intercepted
Memcheck Example
Access of unallocated memory
Use of an uninitialized value
Memcheck report
Memcheck report (cont.)
Leaks detected: (stack trace)
Cachegrind
Detailed cache profiling can be very useful for improving
the performance of the program
On a modern x86 machine, an L1 miss will cost around 10
cycles, and an L2 miss can cost as much as 200 cycles
Cachegrind performs detailed simulation of the I1, D1
and L2 caches in your CPU
Can accurately pinpoint the sources of cache misses in
your code
Identifies number of cache misses, memory references
and instructions executed for each line of source code,
with per-function, per-module and whole-program
summaries
Cachegrind runs programs about 20-100x slower than normal
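The miss costs quoted above give a quick way to rank hotspots by estimated stall time. The Python sketch below is a back-of-envelope model, not Cachegrind's own output: the per-function miss counts are invented, and it assumes every L2 miss is also counted as an L1 miss, so each L1 miss costs either about 10 cycles (served by L2) or up to 200 cycles (served by main memory).

```python
L1_MISS_CYCLES = 10    # "around 10 cycles", as quoted above
L2_MISS_CYCLES = 200   # "as much as 200 cycles", as quoted above

def estimated_stall_cycles(l1_misses, l2_misses):
    """Rough estimate of cycles lost to misses.

    Assumes l2_misses is a subset of l1_misses: an access that misses L1
    either hits L2 (about 10 cycles) or misses L2 as well (up to 200 cycles).
    """
    l2_hits = l1_misses - l2_misses
    return l2_hits * L1_MISS_CYCLES + l2_misses * L2_MISS_CYCLES

# Hypothetical per-function miss counts: (L1 misses, L2 misses)
counts = {
    "test_if": (1_000_000, 50_000),
    "init": (200_000, 1_000),
}

ranking = sorted(
    ((estimated_stall_cycles(l1, l2), name) for name, (l1, l2) in counts.items()),
    reverse=True,
)
for cycles, name in ranking:
    print(f"{name}: ~{cycles:,} stall cycles")
```

Even though L2 misses are 20 times rarer than L1 misses in the first entry, they account for more than half of its estimated stall cycles, which is why the L2 miss columns deserve the closest look.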
How to run
L2-cache instruction read misses
Cachegrind Summary output
Data caches READ performance
D-cache reads (memory reads)
D1 cache read misses
L2-cache data read misses
Cachegrind Summary output
Data caches WRITE performance
D-cache writes (memory writes)
D1 cache write misses
L2-cache data write misses
Cachegrind Accuracy
Valgrind's cache profiling has a number of
shortcomings:
It doesn't account for kernel activity -- the effect of
system calls on the cache contents is ignored
It doesn't account for other process activity
(although this is probably desirable when
considering a single program)
It doesn't account for virtual-to-physical address
mappings; hence the entire simulation is not a true
representation of what's happening in the cache
Massif tool
Massif is a heap profiler - it measures how much heap
memory programs use. It can give information about:
Heap blocks
Heap administration blocks
Stack sizes
Helps to reduce the amount of memory the program uses
A smaller program interacts better with caches and avoids paging
Detects leaks that aren't detected by traditional leak-checkers, such as Memcheck
That's because the memory isn't ever actually lost (a pointer to it remains), but it's not in use anymore
Executing Massif
Run valgrind --tool=massif prog
Produces the following:
Summary
Graph picture
Report
Summary will look like this:
Total spacetime: 2,258,106 ms.B
Heap: 24.0%
Heap admin: 2.2%
Stack(s): 73.7%
Spacetime: space (in bytes) multiplied by time (in milliseconds)
Heap: number of words allocated on the heap via malloc(), new and new[]
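Spacetime, as defined above, is space in bytes multiplied by time in milliseconds, so the summary percentages are shares of an area under the memory-versus-time curve. The Python sketch below computes it from made-up census snapshots; it assumes memory stays constant between one census and the next (step integration), which is a simplification for illustration and not necessarily how Massif itself integrates.

```python
# Hypothetical census snapshots: (time in ms, heap bytes, heap-admin bytes, stack bytes)
snapshots = [
    (0, 0, 0, 1000),
    (10, 4000, 80, 1000),
    (30, 4000, 80, 3000),
    (40, 0, 0, 1000),
]

def spacetime(snapshots):
    """Integrate bytes over time, holding each census value until the next census."""
    totals = {"heap": 0, "heap_admin": 0, "stack": 0}
    for (t0, heap, admin, stack), (t1, _, _, _) in zip(snapshots, snapshots[1:]):
        dt = t1 - t0
        totals["heap"] += heap * dt
        totals["heap_admin"] += admin * dt
        totals["stack"] += stack * dt
    return totals

totals = spacetime(snapshots)
grand_total = sum(totals.values())
print(f"Total spacetime: {grand_total:,} ms.B")
for name, value in totals.items():
    print(f"{name}: {100 * value / grand_total:.1f}%")
```

A large block that lives briefly and a small block that lives for the whole run can contribute the same spacetime, which is why the measure is useful for finding memory that is held longer than needed.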
Spacetime Graphs
Spacetime Graph (Cont.)
Each band represents a single line of source code
It's the height of a band that's important
Triangles on the x-axis show each point at which a memory census was taken
Not necessarily evenly spread; Massif only takes a census when memory is allocated or de-allocated
The time on the x-axis is wall-clock time
Not ideal, because different executions of the same program can produce different graphs due to random OS delays
Text/HTML Report example
Shows places in the program where most memory was allocated
Valgrind – how it works
Valgrind is compiled into a shared object, valgrind.so. The shell
script valgrind sets the LD_PRELOAD environment variable to
point to valgrind.so. This causes the .so to be loaded as an extra
library to any subsequently executed dynamically-linked ELF
binary
The dynamic linker allows each .so in the process image to have
an initialization function which is run before main(). It also allows
each .so to have a finalization function run after main() exits
Valgrind Summary
Valgrind will save hours of debugging time
Valgrind can help speed up your programs
Valgrind runs on x86-Linux
Valgrind works with programs written in any language
Valgrind is actively maintained
Valgrind can be used with other tools (gdb)
Valgrind is easy to use
Uses dynamic binary translation, so there is no need to modify, recompile or re-link applications. Just prefix the command line with valgrind and everything works
Valgrind is not a toy
Used by large projects: 25 million lines of code
Valgrind is free
Other Tools
KCachegrind
Oprofile
Writing Fast Programs
Select the right algorithm
Implement it efficiently
Detect hotspots using a profiler and fix them
Understanding of the target system architecture, such as cache structure, is often required
Use platform-specific compiler extensions: memory prefetching, cache-control instructions, branch prediction, SIMD instructions
Write multithreaded applications (“Hyper-Threading Technology”)
CPU Architecture (Pentium 4)
Branch prediction
Out-of-order execution!
Execution units
Memory
Instruction Execution
Execution Units
Integer
Integer
Memory Load
Memory Save
Keeping CPU Busy
Processors are limited by data dependencies and speed of instructions
Keep data dependencies low
A good blend of instructions keeps all execution units busy at the same time
Waiting for memory with nothing else to execute is the most common reason for slow applications
Goals: ready instructions, a good mix of instructions and predictable branches
Remove branches if possible
Reduce randomness of branches; avoid function pointers and jump tables
Memory Overview (Pentium 4)
L1 cache (data only): 8 KB
Execution Trace Cache that stores up to 12K decoded micro-ops
L2 Advanced Transfer Cache (data + instructions): 256 KB, 3 times slower than L1
L3 cache (optional): 4 MB
Main RAM (usually 64 MB … 4 GB): 10 times slower than L1
Fixing memory problems
Use less memory to reduce compulsory cache misses
Increase cache efficiency (place items used at the same time near each other)
Read sooner with prefetch
Write memory faster without using the cache
Avoid conflicts
Avoid capacity issues
Add more work for the CPU (execute non-dependent instructions while waiting)
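The advice to place items used at the same time near each other can be checked with a toy cache model. The Python sketch below simulates a direct-mapped cache sized like the 8 KB L1 data cache mentioned earlier. The direct mapping, the 64-byte line, the 256 x 256 array of 4-byte elements and the function names are all assumptions made to keep the example short; a real cache is set-associative and a real program would be measured with a tool such as Cachegrind.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_LINES = 128   # 128 lines * 64 bytes = 8 KB

def count_misses(addresses):
    """Count misses in a toy direct-mapped cache for a stream of byte addresses."""
    cached_line = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        slot = line % NUM_LINES
        if cached_line[slot] != line:
            cached_line[slot] = line  # evict whatever was there and load the new line
            misses += 1
    return misses

ROWS, COLS, ELEM_BYTES = 256, 256, 4  # hypothetical 2D array stored row by row

def row_major():
    """Visit elements in the order they are laid out in memory."""
    for r in range(ROWS):
        for c in range(COLS):
            yield (r * COLS + c) * ELEM_BYTES

def column_major():
    """Visit elements column by column, jumping a whole row (1024 bytes) each step."""
    for c in range(COLS):
        for r in range(ROWS):
            yield (r * COLS + c) * ELEM_BYTES

row_misses = count_misses(row_major())
col_misses = count_misses(column_major())
print(f"row-major misses:    {row_misses}")
print(f"column-major misses: {col_misses}")
```

Both loops touch exactly the same 65,536 elements; only the order differs. In this model the row-major walk misses once per cache line, while the column-major walk misses on every access because each line is evicted before its neighbours are used.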
References
SPEC website: http://www.specbench.org
Richard Gerber, The Software Optimization Cookbook: High-Performance Recipes for the Intel® Architecture
GCC optimization flags: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Valgrind homepage: http://valgrind.kde.org
Scott Robert Ladd, An Evolutionary Analysis of GNU C Optimizations: Using Natural Selection to Investigate Software Complexities
Intel VTune Performance Analyzer webpage: http://www.intel.com/software/products/vtune/
Questions?