
Profiling tools

By Vitaly Kroivets
for Software Design Seminar

Profiling Tools 1
Contents
• Introduction
• Software optimization process, optimization traps and pitfalls
• Benchmark
• Performance tools overview
  - Optimizing compilers
  - System performance monitors
  - Profiling tools
    • GNU gprof
    • Intel VTune
    • Valgrind
• What does it mean to use the system efficiently?

The Problem
• PC speed has increased 500 times since 1981, but today's software is more complex and still hungry for more resources
• How can we run faster on the same hardware and OS architecture?
  - Highly optimized applications run tens of times faster than poorly written ones
  - Efficient algorithms and well-designed implementations lead to high-performance applications

The Software Optimization Process
Hotspots are areas in your code that take a long time to execute.

1. Create benchmark
2. Find hotspots
3. Investigate causes
4. Modify application
5. Retest using benchmark, then return to step 2

Extreme Optimization Pitfalls
• A large application's performance cannot be improved before it runs
• Build the application, then see what machine it runs on
• "Runs great on my computer…"
• Debug versus release builds
• Performance requires assembly language programming
• Code features first, then optimize if there is time left over

Key Point:
Software optimization doesn't begin where coding ends. It is an ongoing process that starts at the design stage and continues all the way through development.

The Benchmark
• A benchmark is a program used to
  - Objectively evaluate the performance of an application
  - Provide repeatable application behavior for use with performance analysis tools
• Industry-standard benchmarks:
  - TPC-C, 3D-Winbench
  - http://www.specbench.com/
    • Enterprise Services
    • Graphics/Applications
    • HPC/OMP
    • Java Client/Server
    • Mail Servers
    • Network File System
    • Web Servers

Attributes of a good benchmark
• Repeatable (consistent measurements)
  - Remember system tasks and caching issues
  - The "incoming fax" problem: use the minimum performance number
• Representative
  - Executes a typical code path and mimics how the customer uses the application
  - Poor benchmark: using QA tests
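
The "use the minimum" advice can be sketched as a small timing harness. This is only an illustration under stated assumptions: it reads "minimum performance number" as the fastest elapsed time over several runs, and the names min_elapsed_seconds and sample_workload are hypothetical, not taken from the presentation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature of the workload under test (hypothetical, for illustration). */
typedef void (*workload_fn)(void);

/* Seconds on a monotonic clock, so wall-clock adjustments do not skew results. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Runs the workload `runs` times and returns the fastest elapsed time.
   A run disturbed by an unrelated system task can only be slower, so the
   minimum is the least disturbed measurement. */
double min_elapsed_seconds(workload_fn fn, int runs)
{
    assert(fn != NULL && runs > 0);
    double best = -1.0;
    for (int i = 0; i < runs; i++) {
        double start = now_seconds();
        fn();
        double elapsed = now_seconds() - start;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Example workload: a fixed, repeatable amount of arithmetic. */
static volatile unsigned long sink;

void sample_workload(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        acc += i * i;
    sink = acc;   /* keeps the loop from being optimized away */
}
```

Calling min_elapsed_seconds(sample_workload, 5) times five runs and reports the fastest. A real benchmark would replace sample_workload with a representative code path of the application.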

Benchmark attributes (cont.)
• Easy to run
• Verifiable
  - The benchmark itself needs QA!
• Measure elapsed time vs. another number
• Use the benchmark to test functionality
  - Algorithmic tricks to gain performance may break the application…

How to find performance bottlenecks
• Determine how your system resources, such as memory and processor, are being utilized to identify system-level bottlenecks
• Measure the execution time of each module and function in your application
• Determine how the various modules running on your system affect each other's performance
• Identify the most time-consuming function calls and call sequences within your application
• Determine how your application executes at the processor level to identify microarchitecture-level performance problems

Performance Tools Overview
• Timing mechanisms
  - Stopwatch: UNIX time tool
• Optimizing compiler (the easy way)
• System load monitors
  - vmstat, iostat, perfmon.exe, VTune Counter Monitor
• Software profilers
  - gprof, VTune, Visual C++ Profiler, IBM Quantify
• Memory debuggers/profilers
  - Valgrind, IBM Purify, Parasoft Insure++

Using Optimizing Compilers
• Always use compiler optimization settings to build an application for use with performance tools
• Understanding and using all the features of an optimizing compiler is required for maximum performance with the least effort

Optimizing Compiler: choosing a combination of optimization flags

Optimizing Compiler's effect

Optimizing Compilers: Conclusions
• Some processor-specific options still do not appear to be a major factor in producing fast code
• More optimizations do not guarantee faster code
• Different algorithms are most effective with different optimizations
• Idea: use statistics gathered by a profiler as input for the compiler/linker

Windows Performance Monitor
• Sampling "profiler"
• Uses the OS timer interrupt to wake up and record the values of software counters (disk reads, free memory)
• Maximum resolution: 1 sec
• Cannot identify the piece of code that caused an event to occur
• Good for finding system issues
• Unix tools: vmstat, iostat, xos, top, oprofile, etc.

Performance Monitor Counters

Profilers
• A profiler may show the time elapsed in each function and its descendants
  - number of calls, call graph (some profilers)
• Profilers use either instrumentation or sampling to identify performance issues

Sampling vs. Instrumentation

                           | Sampling                                               | Instrumentation
Overhead                   | Typically about 1%                                     | High, may be 500%!
System-wide profiling      | Yes; profiles all applications, drivers, OS functions  | Just the application and instrumented DLLs
Detects unexpected events  | Yes; can detect other programs using OS resources      | No
Setup                      | None                                                   | Automatic insertion of data collection stubs required
Data collected             | Counters, processor and OS state                       | Call graph, call times, critical path
Data granularity           | Assembly-level instructions, with source line          | Functions, sometimes statements
Detects algorithmic issues | No; limited to processes, threads                      | Yes; can see that an algorithm or call path is expensive

Profiling Tools
• gprof: old, buggy and inaccurate
• Intel VTune: $700, unstable
• Valgrind: not really a profiler…

GNU gprof
Instrumenting profiler for every UNIX-like system

Using the GNU gprof profiler
• Compile and link your program with profiling enabled
    cc -g -c myprog.c utils.c -pg
    cc -o myprog myprog.o utils.o -pg
• Execute your program to generate a profile data file
  - The program will run normally (but slower) and will write the profile data into a file called gmon.out just before exiting
  - The program should exit normally, by returning from main() or by calling exit()
• Run gprof to analyze the profile data
  - gprof myprog gmon.out
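
A minimal sketch of code worth profiling this way is shown here. The function names g and doit are borrowed from the call-graph slides later in the deck; their bodies, parameters and the loop counts are assumptions made for illustration only.

```c
#include <assert.h>

/* Innermost worker: burns CPU so it accumulates self time in the flat profile. */
unsigned long doit(unsigned long n)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i++)
        acc += i % 7;
    return acc;
}

/* Calls doit() several times, so most of the time charged to g() is
   inherited from its child rather than spent in g() itself. */
unsigned long g(unsigned long n, int repeats)
{
    assert(repeats >= 0);
    unsigned long total = 0;
    for (int r = 0; r < repeats; r++)
        total += doit(n);
    return total;
}
```

A main() in myprog.c that calls, say, g(50000000UL, 10), prints the result and returns normally would complete the program. Built with -pg and without aggressive inlining, the report should show doit with most of the self time and g as its only caller.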

Example Program

Understanding the Flat Profile
• The flat profile shows the total amount of time your program spent executing each function
• If a function was not compiled for profiling and didn't run long enough to show up in the program counter histogram, it will be indistinguishable from a function that was never called

Flat profile: % time
Percentage of the total execution time your program spent in this function. These should all add up to 100%.

Flat profile: Cumulative seconds
The cumulative total number of seconds spent in this function, plus the time spent in all the functions above this one in the table

Flat profile: Self seconds
Number of seconds accounted for by this function alone

Flat profile: Calls
Number of times the function was invoked

Flat profile: Self seconds per call
Average number of seconds per call spent in this function alone

Flat profile: Total seconds per call
Average number of seconds spent in this function and its descendants per call
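
The relationship between the flat-profile columns can be sketched as plain arithmetic. The struct and function names below are hypothetical; the sketch assumes the rows are already in table order and that self time, children time and call counts are known.

```c
#include <assert.h>
#include <stddef.h>

/* One flat-profile row. Inputs: self_seconds, children_seconds, calls.
   The remaining fields are derived by fill_flat_profile(). */
struct flat_row {
    double self_seconds;        /* time in the function alone */
    double children_seconds;    /* time in its descendants */
    unsigned long calls;        /* number of invocations */
    double percent_time;        /* derived: share of the total self time */
    double cumulative_seconds;  /* derived: running sum down the table */
    double self_per_call;       /* derived: self_seconds / calls */
    double total_per_call;      /* derived: (self + children) / calls */
};

void fill_flat_profile(struct flat_row *rows, size_t n)
{
    assert(rows != NULL || n == 0);

    /* Total run time is the sum of all self times. */
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += rows[i].self_seconds;

    double running = 0.0;
    for (size_t i = 0; i < n; i++) {
        struct flat_row *r = &rows[i];
        running += r->self_seconds;
        r->cumulative_seconds = running;
        r->percent_time = total > 0.0 ? 100.0 * r->self_seconds / total : 0.0;
        r->self_per_call = r->calls ? r->self_seconds / (double)r->calls : 0.0;
        r->total_per_call = r->calls
            ? (r->self_seconds + r->children_seconds) / (double)r->calls
            : 0.0;
    }
}
```

For example, a row with 0.75 s of self time over 3 calls, followed by a row with 0.25 s of self time and 0.75 s in children over 1 call, yields 75% and 25% in the % time column, cumulative seconds of 0.75 and 1.0, and a total of 1.0 s per call for the second row.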

Call Graph: call tree of the program
• Called by: main()
• Current function: g()
• Descendants: doit()

Call Graph: understanding each line
Fields of the line for the current function, g():
• index: unique index of this function
• % time: percentage of the total time spent in this function and its children
• self: total amount of time spent in this function
• children: total time propagated into this function by its children
• called: number of times the function was called

Call Graph: parents' numbers
For each parent line of the current function, g():
• self: time that was propagated directly from the function into this parent
• children: time that was propagated from the function's children into this parent
• called: number of times this parent called the function "/" total number of times the function was called

Call Graph: "children" numbers
For each child line of the current function, g():
• self: amount of time that was propagated directly from the child into the function
• children: amount of time that was propagated from the child's children to the function
• called: number of times this function called the child "/" total number of times this child was called

How gprof works
• Instruments the program to count calls
• Watches the program running and samples the PC every 0.01 sec
  - Statistical inaccuracy: a fast function may get 0 or 1 samples
  - The run should be long compared with the sampling period
  - Several gmon.out files can be combined into a single report
• The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth, because samples of the program counter are taken at fixed intervals of run time
• Number-of-calls figures are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic
• Profiling with inlining and other optimizations needs care
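
The "0 or 1 samples" effect can be sketched with integer arithmetic. This is a simplified model, not gprof's implementation: it assumes sampling ticks fall exactly at multiples of the period, and count_samples is a hypothetical helper name.

```c
#include <assert.h>

/* Counts the sampling ticks (at 0, period, 2*period, ...) that land inside
   the half-open interval [start_us, start_us + duration_us). */
long count_samples(long start_us, long duration_us, long period_us)
{
    assert(start_us >= 0 && duration_us >= 0 && period_us > 0);
    long end_us = start_us + duration_us;
    /* Number of ticks strictly before time t is ceil(t / period) for t >= 0. */
    long ticks_before_end = (end_us + period_us - 1) / period_us;
    long ticks_before_start = (start_us + period_us - 1) / period_us;
    return ticks_before_end - ticks_before_start;
}
```

With a 10,000 µs period, a 3,000 µs call starting at 1,000 µs catches no tick, while the same call starting at 9,000 µs catches the tick at 10,000 µs and is charged a full sampling period. A one-second interval catches 100 ticks, which is why longer runs give steadier numbers.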

VTune performance analyzer
To squeeze every bit of power out of Intel architecture!

VTune Modes/Features
• Time- and event-based, system-wide sampling gives developers the most accurate representation of their software's actual performance with negligible overhead
• Call graph profiling gives developers a pictorial view of program flow to quickly identify critical functions and call sequences
• Counter Monitor lets developers track system activity at run time, which helps them identify system-level performance issues

Sampling mode
• Monitors all active software on your system
  - including your application, the OS, JIT-compiled Java* class files, Microsoft* .NET files, 16-bit applications, 32-bit applications, device drivers
• Application performance is not noticeably impacted during data collection

Sampling Mode Benefits
• Low-overhead, system-wide profiling helps you identify which modules and functions consume the most time, giving you a detailed look at your operating system and application
• Benefits of sampling:
  - Profiling to find hotspots. Find the modules, functions, lines of source code and assembly instructions that consume the most time
  - Low overhead. Overhead incurred by sampling is typically about one percent
  - No need to instrument code. You do not need to make any changes to code to profile with sampling

How does sampling work?
• Sampling interrupts the processor after a certain number of events and records the execution information in a buffer area. When the buffer is full, the information is copied to a file. After saving the information, the program resumes operation. In this way, the VTune™ analyzer maintains very low overhead (about one percent) while sampling
  - Time-based sampling: collects samples of active instruction addresses at regular time-based intervals (1 ms by default)
  - Event-based sampling: collects samples of active instruction addresses after a specified number of processor events
• After the program finishes, the samples are mapped to modules and stored in a database within the analyzer program
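
The buffer-and-flush scheme described above can be modeled in a few lines. This is a toy simulation under stated assumptions, not VTune's implementation: the struct, the function names and the tiny buffer capacity are all hypothetical, and the flush only counts samples instead of writing a file.

```c
#include <assert.h>
#include <stddef.h>

#define SAMPLE_BUFFER_CAPACITY 4   /* tiny on purpose, to make flushes visible */

struct sampler {
    unsigned long sample_after;    /* events between two samples */
    unsigned long events_pending;  /* events seen since the last sample */
    unsigned long buffer[SAMPLE_BUFFER_CAPACITY]; /* recorded instruction addresses */
    size_t used;                   /* entries currently in the buffer */
    unsigned long flushed;         /* samples already "copied to a file" */
};

void sampler_init(struct sampler *s, unsigned long sample_after)
{
    assert(s != NULL && sample_after > 0);
    s->sample_after = sample_after;
    s->events_pending = 0;
    s->used = 0;
    s->flushed = 0;
}

/* Stand-in for writing the buffer to disk: only counts what would be saved. */
static void sampler_flush(struct sampler *s)
{
    s->flushed += s->used;
    s->used = 0;
}

/* Called once per processor event; `address` is the active instruction address. */
void sampler_on_event(struct sampler *s, unsigned long address)
{
    if (++s->events_pending < s->sample_after)
        return;                    /* counter has not reached the threshold yet */
    s->events_pending = 0;
    s->buffer[s->used++] = address;
    if (s->used == SAMPLE_BUFFER_CAPACITY)
        sampler_flush(s);
}
```

Feeding 95 events into a sampler initialized with sample_after = 10 records 9 samples: 8 are flushed in two batches of 4, and 1 is still waiting in the buffer. Only one event in ten does any work beyond a counter increment, which is the intuition behind the low overhead.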

Starting the Sampling Wizard
• The hardware prevents sampling of many counters simultaneously
• Unsupported CPU? Ha-ha-ha…

EBS: choosing events

Events counted by VTune
• Basic Events: clock cycles, retired instructions
• Instruction Execution: instruction decode, issue and execution, data and control speculation, and memory operations
• Cycle Accounting Events: stall cycle breakdowns
• Branch Events: branch prediction
• Memory Hierarchy: instruction prefetch, instruction and data caches
• System Events: operating system monitors, instruction and data TLBs

About 130 different events in the Pentium 4 architecture!

Sampling …

Viewing Sampling Results
• Process view
  - all the processes that ran on the system during data collection
• Thread view
  - the threads that ran within the processes you select in Process view
• Module view
  - the modules that ran within the selected processes and threads
• Hotspot view
  - the functions within the modules you select in Module view

Different events collected: modules view
• System-wide look at software running on the system
• Screenshot callout: our program
• CPI (cycles per instruction): a good average indication

Hotspot Graph
• Each bar represents one of the functions of our program
• Click on a hotspot bar and VTune displays the source code view

Source View
• Screenshot callout: Test_if function

Annotated Source View (% of module)
• See how much time is spent on each line
• Check this "for" loop! 10% of CPU time is spent in a few statements

VTune Tuning Assistant
• In a few clicks we reached the performance problem!
• Now, how do we solve it?
• The Tuning Assistant highlights performance problems
  - Provides the approximate time lost to each performance problem
  - Its database contains performance metrics based on Intel's experience of tuning hundreds of applications
  - Analyzes the data gathered from our application
  - Generates tuning recommendations for each "hotspot"
  - Gives the user an idea of what might be done to fix the problem

Tuning Assistance Report

Hotspot Assistant Report: Penalties

Hotspot Assistant Report

Call Graph Mode
• Provides a pictorial view of program flow to quickly identify critical functions and call sequences
• Call graph profiling reveals:
  - The structure of your program at the function level
  - The number of times a function is called from a particular location
  - The time spent in each function
  - Functions on a critical path

Call Graph Screenshot
• The function summary pane
• Critical path displayed as red lines: the call sequence in the application that took the most time to execute
• Switch to Call List view

Call Graph (Cont.)
• Wait time: how much time was spent waiting for an event to occur
• Additional info is available by hovering the mouse over the functions

Jump to Source view

Call Graph – Call List View
• Caller functions are the functions that called the focus function
• Callee functions are the functions called by the focus function

Counter Monitor
• Use the Counter Monitor feature of the VTune™ analyzer to collect and display performance counter data. Counter Monitor selectively polls performance counters, which are grouped categorically into performance objects.
• With the VTune analyzer, you can:
  - Monitor selected counters in performance objects.
  - Correlate performance counter data with data collected by other features in the VTune analyzer, such as sampling.
  - Trigger the collection of counter data on events other than a periodic timer.

Getting Help
• Context-sensitive help
• Online help repository

VTune Summary
• Pros: gets the best possible performance out of the Intel architecture
• Cons: extreme tuning requires a deep understanding of processor and OS internals

Valgrind
Multi-purpose Linux x86 profiling tool

Valgrind Toolkit
• Memcheck is a memory debugger
  - detects memory-management problems
• Cachegrind is a cache profiler
  - performs detailed simulation of the I1, D1 and L2 caches in your CPU
• Massif is a heap profiler
  - performs detailed heap profiling by taking regular snapshots of a program's heap
• Helgrind is a thread debugger
  - finds data races in multithreaded programs

Memcheck Features
• When a program is run under Memcheck's supervision, all reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted
• Memcheck can detect:
  - Use of uninitialised memory
  - Reading/writing memory after it has been free'd
  - Reading/writing off the end of malloc'd blocks
  - Reading/writing inappropriate areas on the stack
  - Memory leaks, where pointers to malloc'd blocks are lost forever
  - Passing of uninitialised and/or unaddressable memory to system calls
  - Mismatched use of malloc/new/new [] vs free/delete/delete []
  - Overlapping src and dst pointers in memcpy() and related functions
  - Some misuses of the POSIX pthreads API
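
As a rough illustration of a few items on this list, the sketch below shows code that should run clean under Memcheck, with comments marking the small changes that would trigger a report. The function names are hypothetical and the example is not taken from the presentation.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns a heap array of n ints, all set to `value`; the caller owns it. */
int *make_filled(size_t n, int value)
{
    assert(n > 0);
    int *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    /* Skipping this loop and later branching on a[i] would be reported as
       use of uninitialised memory. */
    for (size_t i = 0; i < n; i++)   /* `i <= n` would write off the end of the block */
        a[i] = value;
    return a;
}

/* Sums the array and releases it. */
long sum_and_release(int *a, size_t n)
{
    assert(a != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    free(a);   /* omitting this loses the last pointer to the block: a leak */
    /* Reading a[0] at this point would be a read of free'd memory. */
    return total;
}
```

Introducing any one of the commented variations and rerunning the program under valgrind --tool=memcheck is a quick way to see what each class of report looks like.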

Memcheck Example
• Access of unallocated memory
• Using a non-initialized value
• Memory leak
• Using "free" on memory allocated by "new"
Memcheck Example (Cont.)
• Compile the program with the -g flag (debug information):
  - g++ a.cc -g -o a.out
• Execute valgrind (--leak-check=yes enables leak detection; a.out is the executable name):
  - valgrind --tool=memcheck --leak-check=yes ./a.out 2> log
• View the log

Memcheck report

Memcheck report (cont.)
• Leaks detected
• Screenshot callout: STACK (the call stack shown with each report)

Cachegrind
• Detailed cache profiling can be very useful for improving the performance of a program
  - On a modern x86 machine, an L1 miss will cost around 10 cycles, and an L2 miss can cost as much as 200 cycles
• Cachegrind performs detailed simulation of the I1, D1 and L2 caches in your CPU
• Can accurately pinpoint the sources of cache misses in your code
• Identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries
• Cachegrind runs programs about 20-100x slower than normal

How to run
• Put valgrind --tool=cachegrind in front of the normal command-line invocation
  - Example: valgrind --tool=cachegrind ls -l
• When the program finishes, Cachegrind prints summary cache statistics. It also collects line-by-line information in a file cachegrind.out.pid
• Execute cg_annotate to get an annotated source file (7618 is the PID, a.cc is the source file):
  - cg_annotate --7618 a.cc > a.cc.annotated
Cachegrind Summary output
Instruction cache performance:
• I-cache reads (instructions executed)
• I1 cache read misses
• L2-cache instruction read misses

Data cache READ performance:
• D-cache reads (memory reads)
• D1 cache read misses
• L2-cache data read misses

Data cache WRITE performance:
• D-cache writes (memory writes)
• D1 cache write misses
• L2-cache data write misses

Cachegrind Accuracy
• Valgrind's cache profiling has a number of shortcomings:
  - It doesn't account for kernel activity: the effect of system calls on the cache contents is ignored
  - It doesn't account for other process activity (although this is probably desirable when considering a single program)
  - It doesn't account for virtual-to-physical address mappings; hence the entire simulation is not a true representation of what's happening in the cache

Massif tool
• Massif is a heap profiler: it measures how much heap memory programs use. It can give information about:
  - Heap blocks
  - Heap administration blocks
  - Stack sizes
• Helps reduce the amount of memory the program uses
  - a smaller program interacts better with caches and avoids paging
• Detects leaks that aren't detected by traditional leak checkers, such as Memcheck
  - That's because the memory isn't ever actually lost (a pointer to it remains), but it's not in use anymore

Executing Massif
• Run valgrind --tool=massif prog
• Produces the following:
  - Summary
  - Graph picture
  - Report
• The summary will look like this:
  - Total spacetime: 2,258,106 ms.B
  - Heap: 24.0%
  - Heap admin: 2.2%
  - Stack(s): 73.7%
• Spacetime is space (in bytes) multiplied by time (in milliseconds)
• Heap is the number of words allocated on the heap via malloc(), new and new[]

Spacetime Graphs

Spacetime Graph (Cont.)
• Each band represents a single line of source code
• It's the height of a band that's important
• Triangles on the x-axis show each point at which a memory census was taken
  - Not necessarily evenly spread; Massif only takes a census when memory is allocated or de-allocated
• The time on the x-axis is wall-clock time
  - not ideal, because different executions of the same program can produce different graphs due to random OS delays

Text/HTML Report example
• Contains a lot of extra information about heap allocations that you don't see in the graph
• Shows the places in the program where most memory was allocated

Valgrind – how it works
• Valgrind is compiled into a shared object, valgrind.so. The shell script valgrind sets the LD_PRELOAD environment variable to point to valgrind.so. This causes the .so to be loaded as an extra library into any subsequently executed dynamically-linked ELF binary
• The dynamic linker allows each .so in the process image to have an initialization function which is run before main(). It also allows each .so to have a finalization function run after main() exits
• When valgrind.so's initialization function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked in valgrind.so until the end of the run
• System calls are intercepted; signal handlers are monitored

Valgrind Summary
• Valgrind will save hours of debugging time
• Valgrind can help speed up your programs
• Valgrind runs on x86-Linux
• Valgrind works with programs written in any language
• Valgrind is actively maintained
• Valgrind can be used with other tools (gdb)
• Valgrind is easy to use
  - uses dynamic binary translation, so there is no need to modify, recompile or re-link applications. Just prefix the command line with valgrind and everything works
• Valgrind is not a toy
  - Used by large projects: 25 million lines of code
• Valgrind is free

Other Tools
• Tools not included in this presentation:
  - IBM Purify
  - Parasoft Insure
  - KCachegrind
  - Oprofile
  - GCC's and GLIBC's debugging hooks

Writing Fast Programs
• Select the right algorithm
• Implement it efficiently
• Detect hotspots using a profiler and fix them
• Understanding of the target system architecture, such as cache structure, is often required
• Use platform-specific compiler extensions: memory prefetching, cache-control instructions, branch prediction, SIMD instructions
• Write multithreaded applications ("Hyper-Threading Technology")

CPU Architecture (Pentium 4)
Diagram blocks: branch prediction, instruction fetch, instruction decode, instruction pool, execution units, instruction retirement, memory.
Callout: out-of-order execution!

Instruction Execution
Diagram: the instruction pool feeds a dispatch unit, which issues to the execution units: Integer, Integer, Floating point, Floating point, Memory Load, Memory Save.

Keeping CPU Busy
• Processors are limited by data dependencies and the speed of instructions
• Keep data dependencies low
• A good blend of instructions keeps all execution units busy at the same time
• Waiting for memory with nothing else to execute is the most common reason for slow applications
• Goals: ready instructions, a good mix of instructions and predictable branches
  - Remove branches if possible
  - Reduce the randomness of branches; avoid function pointers and jump tables
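
One way to picture "keep data dependencies low" is to split a single dependency chain into several independent ones. The sketch below is an illustration with hypothetical function names; whether it is actually faster depends on the compiler and CPU, so it should be measured with a benchmark rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every addition depends on the result of the previous one. */
unsigned long sum_single(const unsigned int *v, size_t n)
{
    assert(v != NULL || n == 0);
    unsigned long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Four independent accumulators: the four additions in one iteration do not
   depend on each other, so several execution units can work on them at once. */
unsigned long sum_four_chains(const unsigned int *v, size_t n)
{
    assert(v != NULL || n == 0);
    unsigned long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        a0 += v[i];
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same sum. An optimizing compiler may perform a similar transformation on its own, which is one more reason to profile the optimized build before restructuring code by hand.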

Memory Overview (Pentium 4)
• L1 cache (data only): 8 KB
• Execution Trace Cache that stores up to 12K decoded micro-ops
• L2 Advanced Transfer Cache (data + instructions): 256 KB, 3 times slower than L1
• L3: 4 MB cache (optional)
• Main RAM (usually 64 MB … 4 GB): 10 times slower than L1

Fixing memory problems
• Use less memory to reduce compulsory cache misses
• Increase cache efficiency (place items used at the same time near each other)
• Read sooner with prefetch
• Write memory faster without using the cache
• Avoid conflicts
• Avoid capacity issues
• Add more work for the CPU (execute non-dependent instructions while waiting)
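
The point about placing items used together near each other can be illustrated by traversal order over a two-dimensional array. The sketch below uses hypothetical names and an arbitrary 512 x 512 grid; it computes the same sum twice, once in memory order and once with a large stride between consecutive accesses.

```c
#include <assert.h>
#include <stddef.h>

#define GRID_ROWS 512
#define GRID_COLS 512

/* Fills the grid with a repeatable pattern. */
void fill_grid(int grid[GRID_ROWS][GRID_COLS])
{
    assert(grid != NULL);
    for (size_t r = 0; r < GRID_ROWS; r++)
        for (size_t c = 0; c < GRID_COLS; c++)
            grid[r][c] = (int)((r + c) % 10);
}

/* Visits elements in the order they are laid out in memory
   (C arrays are row-major), so each cache line is fully used. */
long sum_row_major(int grid[GRID_ROWS][GRID_COLS])
{
    assert(grid != NULL);
    long total = 0;
    for (size_t r = 0; r < GRID_ROWS; r++)
        for (size_t c = 0; c < GRID_COLS; c++)
            total += grid[r][c];
    return total;
}

/* Same result, but consecutive accesses are GRID_COLS * sizeof(int) bytes
   apart, so each access tends to touch a different cache line. */
long sum_col_major(int grid[GRID_ROWS][GRID_COLS])
{
    assert(grid != NULL);
    long total = 0;
    for (size_t c = 0; c < GRID_COLS; c++)
        for (size_t r = 0; r < GRID_ROWS; r++)
            total += grid[r][c];
    return total;
}
```

Calling each function from a small driver and running it under valgrind --tool=cachegrind should show a difference in D1 read misses between the two loops, although the exact numbers depend on the simulated cache sizes.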

References
• SPEC website: http://www.specbench.org
• The Software Optimization Cookbook: High-Performance Recipes for the Intel® Architecture, by Richard Gerber
• GCC optimization flags: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
• Valgrind homepage: http://valgrind.kde.org
• An Evolutionary Analysis of GNU C Optimizations: Using Natural Selection to Investigate Software Complexities, by Scott Robert Ladd
• Intel VTune Performance Analyzer webpage: http://www.intel.com/software/products/vtune/
• Gprof man page: http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_mono/gprof.html

Questions?
