Profiling tools

By Vitaly Kroivets
for Software Design Seminar

Profiling Tools



• Introduction
  • Software optimization process, optimization traps and pitfalls
  • Benchmark
• Performance tools overview
  • Optimizing compilers
  • System performance monitors
• Profiling tools
  • GNU gprof
  • Intel VTune
  • Valgrind
• What does it mean to use the system efficiently


The Problem
• PC speed has increased 500 times since 1981, but today’s software is more complex and still hungry for more resources
• How to run faster on the same hardware and OS architecture?
  • Highly optimized applications run tens of times faster than poorly written ones
  • Using efficient algorithms and well-designed implementations leads to high-performance applications

The Software Optimization Process
1. Create benchmark
2. Find hotspots (hotspots are areas in your code that take a long time to execute)
3. Investigate causes
4. Modify application
5. Retest using benchmark, then look for hotspots again


Extreme Optimization Pitfalls
• A large application’s performance cannot be improved before it runs
• Build the application, then see what machine it runs on
• Runs great on my computer…
• Debug versus release builds
• Performance requires assembly language programming
• Code features first, then optimize if there is time left over

Key Point:
Software optimization doesn’t begin where coding ends –
It is an ongoing process that starts at the design stage and continues all the way through development


The Benchmark

• A benchmark is a program that is used to:
  • Objectively evaluate the performance of an application
  • Provide repeatable application behavior for use with performance analysis tools
• Industry-standard benchmarks:
  • TPC-C
  • 3D-Winbench
  • Enterprise Services
  • Graphics/Applications
  • HPC/OMP
  • Java Client/Server
  • Mail Servers
  • Network File System
  • Web Servers

Attributes of a good benchmark
• Repeatable (consistent measurements)
  • Remember system tasks, caching issues
  • “Incoming fax” problem: use the minimum performance number
• Representative
  • Execution of a typical code path; mimic how the customer uses the application
  • Poor benchmarks: using QA tests
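One way to act on the "incoming fax" advice is to run the workload several times and keep the smallest elapsed time, since the fastest run is the one least disturbed by other system activity. The sketch below assumes a C11 compiler (for timespec_get); sample_workload and min_elapsed_seconds are hypothetical names introduced for illustration, not part of any tool in this presentation.

```c
#include <time.h>

/* Hypothetical workload standing in for the application's typical code path. */
void sample_workload(void)
{
    volatile unsigned long sum = 0;   /* volatile keeps the loop from being optimized away */
    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i;
}

/* Wall-clock time in seconds, read with C11 timespec_get(). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Runs the workload `runs` times and returns the smallest elapsed time
   in seconds, or -1.0 if runs < 1. */
double min_elapsed_seconds(void (*workload)(void), int runs)
{
    double best = -1.0;
    for (int i = 0; i < runs; i++) {
        double start = now_seconds();
        workload();
        double elapsed = now_seconds() - start;
        if (i == 0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

A driver could call min_elapsed_seconds(sample_workload, 5) and report the result; the number of runs is a judgment call that depends on how noisy the machine is.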

Benchmark attributes (cont.)
• Easy to run
• Verifiable
  • Need QA for the benchmark!
• Measure elapsed time vs. other number
• Use benchmark to test functionality
  • Algorithmic tricks to gain performance may break the application…


How to find performance bottlenecks
• Determine how your system resources, such as memory and processor, are being utilized to identify system-level bottlenecks
• Measure the execution time for each module and function in your application
• Determine how the various modules running on your system affect the performance of each other
• Identify the most time-consuming function calls and call sequences within your application
• Determine how your application is executing at the processor level to identify microarchitecture-level performance problems

Performance Tools Overview
• Timing
  • Stopwatch: UNIX time tool
• Optimizing compiler (easy way)
• System load monitors
  • vmstat, iostat, perfmon.exe, VTune Counter Monitor
• Software
  • Gprof, VTune, Visual C++ Profiler, IBM Quantify
• Memory
  • Valgrind, IBM Purify, Parasoft Insure++

Using Optimizing Compilers
• Always use compiler optimization settings to build an application for use with performance tools
• Understanding and using all the features of an optimizing compiler is required for maximum performance with the least effort


Optimizing Compiler : choosing optimization flags combination



Optimizing Compiler’s effect



Optimizing Compilers: Conclusions
• Some processor-specific options still do not appear to be a major factor in producing fast code
• More optimizations do not guarantee faster code
• Different algorithms are most effective with different optimizations
• Idea: use statistics gathered by a profiler as input for the compiler/linker

Windows Performance Monitor
• Sampling “profiler”
• Uses the OS timer interrupt to wake up and record the value of software counters (disk reads, free memory)
• Maximum resolution: 1 sec
• Cannot identify the piece of code that caused an event to occur
• Good for finding system issues
• Unix tools: vmstat, iostat, xos, top, oprofile, etc.

Performance Monitor Counters

• A profiler may show the time elapsed in each function and its descendants
  • Number of calls, call graph (some)
• Profilers use either instrumentation or sampling to identify performance issues


Sampling vs. Instrumentation

Attribute | Sampling | Instrumentation
Overhead | Typically about 1% | High, may be 500%!
System-wide profiling | Yes, profiles all apps, drivers, OS functions | Just application and instrumented DLLs
Detect unexpected events | Yes, can detect other programs using OS resources | No
Setup | None | Automatic insertion of data collection stubs required
Data collected | Counters, processor and OS state | Call graph, call times, critical path
Data granularity | Assembly-level instructions, with source line | Functions, sometimes statements
Detects algorithmic issues | No, limited to processes, threads | Yes, can see algorithm, call path is expensive


Profiling Tools
• Gprof: old, buggy and inaccurate
• Intel VTune: $700, unstable
• Valgrind: is not really a profiler…


GNU gprof
Instrumenting profiler for every UNIX-like system



Using gprof GNU profiler
• Compile and link your program with profiling
    cc -g -c myprog.c utils.c -pg
    cc -o myprog myprog.o utils.o -pg
• Execute your program to generate a profile data file
  • The program will run normally (but slower) and will write the profile data into a file called gmon.out just before exiting
  • The program should exit normally, by returning from main() or calling exit()
• Run gprof to analyze the profile data
    gprof myprog gmon.out

Example Program
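The program shown on the original slide did not survive extraction. As a stand-in only, here is a minimal sketch with the same call shape the later call-graph slides describe (main() calls g(), which calls doit()); the function bodies and parameters are assumptions made for illustration.

```c
/* Innermost worker: sums i * 0.5 for i in [0, n). */
double doit(int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += (double)i * 0.5;
    return acc;
}

/* Calls doit() `calls` times, so most of g()'s time is spent in its child. */
double g(int calls, int n)
{
    double total = 0.0;
    for (int c = 0; c < calls; c++)
        total += doit(n);
    return total;
}
```

A one-line driver such as int main(void) { return g(2000, 100000) > 0.0 ? 0 : 1; } completes the program. Building it without optimization keeps g() and doit() from being inlined, so both should appear as separate entries in the profile.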



Understanding Flat Profile
• The flat profile shows the total amount of time your program spent executing each function
• If a function was not compiled for profiling, and didn't run long enough to show up on the program counter histogram, it will be indistinguishable from a function that was never called

Flat profile : %time

Percentage of the total execution time your program spent in this function. These should all add up to 100%.



Flat profile: Cumulative seconds
The cumulative total number of seconds spent in this function, plus the time spent in all the functions above this one in the table


Flat profile: Self seconds
Number of seconds accounted for by this function alone


Flat profile: Calls
Number of times the function was invoked


Flat profile: Self seconds per call
Average number of seconds per call spent in this function alone


Flat profile: Total seconds per call
Average number of seconds spent in this function and its descendents per call
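The flat-profile columns are related by simple arithmetic. The sketch below derives %time, cumulative seconds and self seconds per call from per-function self time and call counts; the struct and function names are hypothetical and do not come from gprof itself.

```c
#include <stddef.h>

/* One flat-profile row; field names are illustrative, not gprof's own. */
struct flat_row {
    const char *name;
    double self_seconds;        /* input: time spent in this function alone */
    unsigned long calls;        /* input: number of invocations */
    double percent_time;        /* derived */
    double cumulative_seconds;  /* derived */
    double self_per_call;       /* derived, in seconds */
};

/* Fills the derived columns. Rows are assumed to be sorted by
   decreasing self_seconds, as in a flat profile. */
void fill_flat_profile(struct flat_row *rows, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += rows[i].self_seconds;

    double running = 0.0;
    for (size_t i = 0; i < n; i++) {
        running += rows[i].self_seconds;
        rows[i].cumulative_seconds = running;
        rows[i].percent_time =
            total > 0.0 ? 100.0 * rows[i].self_seconds / total : 0.0;
        rows[i].self_per_call =
            rows[i].calls ? rows[i].self_seconds / (double)rows[i].calls : 0.0;
    }
}
```

With two rows of 3.0 s over 6 calls and 1.0 s over 4 calls, the derived values are 75% and 25%, cumulative 3.0 s and 4.0 s, and 0.5 s and 0.25 s per call.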



Call Graph : call tree of the program
• Called by: main()
• Current function: g()
• Descendants: doit()


Call Graph : understanding each line
Current function: g()
• index: unique index of this function
• % time: percentage of the total time spent in this function and its children
• self: total amount of time spent in this function
• children: total time propagated into this function by its children
• called: number of times the function was called

Call Graph : parents numbers
Current function: g()
• self: time that was propagated directly from the function into this parent
• children: time that was propagated from the function's children into this parent
• called: number of times this parent called the function '/' total number of times the function was called


Call Graph : “children” numbers
Current function: g()
• self: amount of time that was propagated directly from the child into the function
• children: amount of time that was propagated from the child's children to the function
• called: number of times this function called the child '/' total number of times this child was called


How gprof works
• Instruments the program to count calls
• Watches the program running, samples the PC every 0.01 sec
  • Statistical inaccuracy: a fast function may take 0 or 1 samples
  • The run should be long enough compared with the sampling period
  • Combine several gmon.out files into a single report
• The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of run time
• Number-of-calls figures are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic
• Profiling with inlining and other optimizations needs care
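The "0 or 1 samples" effect can be seen with a little arithmetic. The sketch below is a simplified model, not gprof's implementation: it assumes sampling ticks fall at exact multiples of the period and counts how many land inside one function invocation. The function name is hypothetical.

```c
/* Counts sampling ticks k * period_us (k = 0, 1, 2, ...) that fall in
   the half-open interval [start_us, end_us). All times in microseconds. */
unsigned long samples_in_interval(unsigned long start_us,
                                  unsigned long end_us,
                                  unsigned long period_us)
{
    if (period_us == 0 || end_us <= start_us)
        return 0;
    /* ceil(x / period) computed with integer arithmetic */
    unsigned long first = (start_us + period_us - 1) / period_us;
    unsigned long past_last = (end_us + period_us - 1) / period_us;
    return past_last - first;
}
```

With a 10,000 µs period (0.01 sec), a 3 ms call running from 2,000 to 5,000 µs gets no samples, the same call running from 8,000 to 11,000 µs gets one sample (and is charged a full 10 ms), while a call lasting one second gets 100 samples. This is why the run needs to be long compared with the sampling period.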


VTune performance analyzer
To squeeze every bit of power out of Intel architecture !



VTune Modes/Features
• Time- and Event-Based, System-Wide Sampling provides developers with the most accurate representation of their software's actual performance with negligible overhead
• Call Graph Profiling provides developers with a pictorial view of program flow to quickly identify critical functions and call sequences
• Counter Monitor allows developers to readily track system activity during runtime, which helps them identify system-level performance issues

Sampling mode
• Monitors all active software on your system
  • Including your application, the OS, JIT-compiled Java* class files, Microsoft* .NET files, 16-bit applications, 32-bit applications, device drivers
• Application performance is not impacted during data collection

Sampling Mode Benefits

Low-overhead, system-wide profiling helps you identify which modules and functions are consuming the most time, giving you a detailed look at your operating system and application

Benefits of sampling:
• Profiling to find hotspots. Find the module, functions, lines of source code and assembly instructions that are consuming the most time
• Low overhead. Overhead incurred by sampling is typically about one percent
• No need to instrument code. You do not need to make any changes to code to profile with sampling

How does sampling work?

Sampling interrupts the processor after a certain number of events and records the execution information in a buffer area. When the buffer is full, the information is copied to a file. After saving the information, the program resumes operation. In this way, the VTune™ analyzer maintains very low overhead (about one percent) while sampling
• Time-based sampling: collects samples of active instruction addresses at regular time-based intervals (1 ms by default)
• Event-based sampling: collects samples of active instruction addresses after a specified number of processor events

After the program finishes, the samples are mapped to modules and stored in a database within the analyzer program.

Starting the Sampling Wizard
• Hardware prevents sampling many counters simultaneously
• Unsupported CPU? Ha-ha-ha…


EBS : choosing events



Events counted by VTune
• Basic Events: clock cycles, retired instructions
• Instruction Execution: instruction decode, issue and execution, data and control speculation, and memory operations
• Cycle Accounting Events: stall cycle breakdowns
• Branch Events: branch prediction
• Memory Hierarchy: instruction prefetch, instruction and data caches
• System Events: operating system monitors, instruction and data TLBs

About 130 different events in Pentium 4 architecture!


Sampling …



Viewing Sampling Results
• Process view: all the processes that ran on the system during data collection
• Thread view: the threads that ran within the processes you select in Process view
• Module view: the modules that ran within the selected processes and threads
• Hotspot view: the functions within the modules you select in Module view

Different events collected – modules view
• System-wide look at software running on the system
• Our program
• CPI (cycles per instruction): a good average indication


Hotspot Graph
• Each bar represents one of the functions of our program
• Click on a hotspot bar and VTune displays the source code view


Source View

Test_if function


Annotated Source View (% of module)
• See how much time is spent on each line
• Check this “for” loop!
• 10% of CPU time is spent in a few statements


VTune Tuning assistant

In a few clicks we reached the performance problem! Now, how to solve it?
• Tuning Assistant highlights performance problems
• Provides approximate time lost by each performance problem
• Database contains performance metrics based on Intel’s experience of tuning hundreds of applications
• Analyzes the data gathered from our application
• Generates tuning recommendations for each “hotspot”
• Gives the user an idea of what might be done to fix the problem


Tuning Assistance Report



Hotspot Assistant Report : Penalties



Hotspot Assistant Report



Call Graph Mode
• Provides developers with a pictorial view of program flow to quickly identify critical functions and call sequences
• Call graph profiling reveals:
  • Structure of your program on a function level
  • Number of times a function is called from a particular location
  • The time spent in each function
  • Functions on a critical path


Call Graph Screenshot
• The function summary pane
• Critical Path displayed as red lines: the call sequence in an application that took the most time to execute
• Switch to Call List view

Call Graph (Cont.)
• Wait time: how much time was spent waiting for an event to occur
• Additional info is available by hovering the mouse over the functions
• Jump to Source view


Call Graph – Call List View
• Caller Functions are the functions that called the Focus Function
• Callee Functions are the functions that are called by the Focus Function

Counter Monitor

Use the Counter Monitor feature of the VTune™ analyzer to collect and display performance counter data. Counter Monitor selectively polls performance counters, which are grouped categorically into performance objects. With the VTune analyzer, you can:
• Monitor selected counters in performance objects
• Correlate performance counter data with data collected by other features in the VTune analyzer, such as sampling
• Trigger the collection of counter data on events other than a periodic timer




Getting Help
• Context-sensitive help
• Online Help repository


VTune Summary
• Pros: allows you to get the best possible performance out of Intel architecture
• Cons: extreme tuning requires deep understanding of processor and OS internals


Valgrind
Multi-purpose Linux x86 profiling tool


Valgrind Toolkit
• Memcheck is a memory debugger
  • Detects memory-management problems
• Cachegrind is a cache profiler
  • Performs detailed simulation of the I1, D1 and L2 caches in your CPU
• Massif is a heap profiler
  • Performs detailed heap profiling by taking regular snapshots of a program's heap
• Helgrind is a thread debugger
  • Finds data races in multithreaded programs

Memcheck Features

When a program is run under Memcheck's supervision, all reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted. Memcheck can detect:
• Use of uninitialised memory
• Reading/writing memory after it has been free'd
• Reading/writing off the end of malloc'd blocks
• Reading/writing inappropriate areas on the stack
• Memory leaks, where pointers to malloc'd blocks are lost forever
• Passing of uninitialised and/or unaddressable memory to system calls
• Mismatched use of malloc/new/new [] vs free/delete/delete []
• Overlapping src and dst pointers in memcpy() and related functions
• Some misuses of the POSIX pthreads API


Memcheck Example
• Access of unallocated memory
• Using a non-initialized value
• Memory leak
• Using “free” on memory allocated by “new”


Memcheck Example (Cont.)
• Compile the program with the -g flag:
    g++ -g <source file> -o a.out
• Execute valgrind:
    valgrind --tool=memcheck --leak-check=yes a.out > log 2>&1
  • --leak-check=yes: debug leaks
  • a.out: executable name
  • Valgrind writes its messages to stderr, hence the 2>&1
• View the log


Memcheck report



Memcheck report (cont.)
Leaks detected:



Cachegrind

Detailed cache profiling can be very useful for improving the performance of the program
• On a modern x86 machine, an L1 miss will cost around 10 cycles, and an L2 miss can cost as much as 200 cycles
• Cachegrind performs detailed simulation of the I1, D1 and L2 caches in your CPU
• Can accurately pinpoint the sources of cache misses in your code
• Identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries
• Cachegrind runs programs about 20--100x slower than normal
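The per-miss costs quoted above give a rough way to turn miss counts into cycles. The sketch below is a back-of-the-envelope estimate only: it uses the 10- and 200-cycle figures from the slide as constants, and the function names are hypothetical rather than part of Cachegrind.

```c
/* Per-miss costs quoted on the slide; real values depend on the CPU. */
#define L1_MISS_CYCLES 10ULL
#define L2_MISS_CYCLES 200ULL

/* Rough number of cycles lost to cache misses. */
unsigned long long estimated_miss_cycles(unsigned long long l1_misses,
                                         unsigned long long l2_misses)
{
    return l1_misses * L1_MISS_CYCLES + l2_misses * L2_MISS_CYCLES;
}

/* Miss rate as a percentage of references; 0.0 when there are none. */
double miss_rate_percent(unsigned long long misses, unsigned long long refs)
{
    return refs ? 100.0 * (double)misses / (double)refs : 0.0;
}
```

For example, 1,000 L1 misses and 50 L2 misses come to about 20,000 cycles under these assumptions, and 25 misses in 1,000 references is a 2.5% miss rate.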

How to run
• Run valgrind --tool=cachegrind in front of the normal command line invocation
    Example: valgrind --tool=cachegrind ls -l
• When the program finishes, Cachegrind will print summary cache statistics. It also collects line-by-line information in a file
• Execute cg_annotate to get an annotated source file:
    cg_annotate --7618 <source files> > <output file>

Cachegrind Summary output
• Instruction cache performance
  • I-cache reads (instructions executed)
  • I1 cache read misses
  • L2-cache instruction read misses
• Data cache READ performance
  • D-cache reads (memory reads)
  • D1 cache read misses
  • L2-cache data read misses
• Data cache WRITE performance
  • D-cache writes (memory writes)
  • D1 cache write misses
  • L2-cache data write misses


Cachegrind Accuracy
• Valgrind's cache profiling has a number of shortcomings:
  • It doesn't account for kernel activity: the effect of system calls on the cache contents is ignored
  • It doesn't account for other process activity (although this is probably desirable when considering a single program)
  • It doesn't account for virtual-to-physical address mappings; hence the entire simulation is not a true representation of what's happening in the cache


Massif tool
• Massif is a heap profiler: it measures how much heap memory programs use. It can give information about:
  • Heap blocks
  • Heap administration blocks
  • Stack sizes
• Helps to reduce the amount of memory the program uses
  • Smaller programs interact better with caches and avoid paging
• Detects leaks that aren't detected by traditional leak checkers, such as Memcheck
  • That's because the memory isn't ever actually lost (a pointer remains to it) but it's not in use anymore

Executing Massif
• Run valgrind --tool=massif prog
• Produces the following:
  • Summary
  • Graph
  • Report
• Summary will look like this:
    Total spacetime: 2,258,106 ms.B
    Heap: 24.0%
    Heap admin: 2.2%
    Stack(s): 73.7%
  • Spacetime: space (in bytes) multiplied by time (in milliseconds)
  • Heap: number of words allocated on the heap, via malloc(), new and new[]
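Spacetime, bytes multiplied by milliseconds, can be pictured as the area under the heap-size curve. The sketch below computes that area from a list of censuses; it assumes the heap size stays constant between one census and the next and that times are non-decreasing. The types and names are illustrative, not Massif's.

```c
#include <stddef.h>

/* One heap census: when it was taken and how large the heap was. */
struct census {
    unsigned long time_ms;
    unsigned long heap_bytes;
};

/* Sums heap_bytes * elapsed_ms over consecutive censuses, giving a
   total in the ms.B unit used by the summary. */
unsigned long long spacetime_ms_bytes(const struct census *c, size_t n)
{
    unsigned long long total = 0;
    for (size_t i = 0; i + 1 < n; i++)
        total += (unsigned long long)c[i].heap_bytes *
                 (c[i + 1].time_ms - c[i].time_ms);
    return total;
}
```

For censuses of 100 bytes at 0 ms, 300 bytes at 10 ms and 0 bytes at 30 ms, the result is 100 × 10 + 300 × 20 = 7,000 ms.B.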

Spacetime Graphs



Spacetime Graph (Cont.)
• Each band represents a single line of source code
• It's the height of a band that's important
• Triangles on the x-axis show each point at which a memory census was taken
  • Not necessarily evenly spread; Massif only takes a census when memory is allocated or de-allocated
• The time on the x-axis is wall-clock time
  • Not ideal, because you can get different graphs for different executions of the same program, due to random OS delays

Text/HTML Report example
Contains a lot of extra information about heap allocations that you don't see in the graph.

Shows places in the program where most memory was allocated

Valgrind – how it works

• Valgrind is compiled into a shared object, valgrind.so. The shell script valgrind sets the LD_PRELOAD environment variable to point to valgrind.so. This causes the .so to be loaded as an extra library to any subsequently executed dynamically-linked ELF binary
• The dynamic linker allows each .so in the process image to have an initialization function which is run before main(). It also allows each .so to have a finalization function run after main() exits
• When valgrind.so's initialization function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked in valgrind.so until the end of the run
• System calls are intercepted; signal handlers are monitored
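The "initialization function run before main()" mechanism can be tried with the constructor and destructor function attributes, which are a GCC/Clang extension. This is only an illustration of the hook itself, not Valgrind's code; the function names are made up for the example.

```c
static int initialized = 0;

/* GCC/Clang extension: runs when the object is loaded, before main(). */
__attribute__((constructor))
static void on_load(void)
{
    initialized = 1;
}

/* Runs after main() returns or exit() is called. */
__attribute__((destructor))
static void on_unload(void)
{
    initialized = 0;
}

/* Reports whether the load-time hook has already run. */
int was_initialized_before_main(void)
{
    return initialized;
}
```

Called from main(), was_initialized_before_main() returns 1 because the hook has already fired. Built as a shared object (for example with -shared -fPIC) and named in LD_PRELOAD, the same hooks would run inside any dynamically linked program started afterwards.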



Valgrind Summary
• Valgrind will save hours of debugging time
• Valgrind can help speed up your programs
• Valgrind runs on x86-Linux
• Valgrind works with programs written in any language
• Valgrind is actively maintained
• Valgrind can be used with other tools (gdb)
• Valgrind is easy to use
  • Uses dynamic binary translation, so no need to modify, recompile or re-link applications. Just prefix the command line with valgrind and everything works
• Valgrind is not a toy
  • Used by large projects: 25 million lines of code
• Valgrind is free

Other Tools
• Tools not included in this presentation:
  • Purify
  • Parasoft Insure
  • KCachegrind
  • Oprofile
  • GCC’s and GLIBC’s debugging hooks


Writing Fast Programs
• Select the right algorithm
• Implement it efficiently
  • Detect hotspots using a profiler and fix them
• Understanding of the target system architecture, such as cache structure, is often required
• Use platform-specific compiler extensions: memory pre-fetching, cache-control instructions, branch prediction, SIMD instructions
• Write multithreaded applications (“Hyper-Threading Technology”)


CPU Architecture (Pentium 4)
[Diagram: Branch prediction, Instruction fetch, Instruction decode, Instruction pool, Execution units, Memory. Callout: Out-of-order execution!]

Instruction Execution
[Diagram: Instruction pool, Dispatch unit, and Execution units: Integer, Integer, Floating point, Floating point, Memory Load, Memory Save]


Keeping CPU Busy
• Processors are limited by data dependencies and speed of instructions
• Keep data dependencies low
• A good blend of instructions keeps all execution units busy at the same time
• Waiting for memory with nothing else to execute is the most common reason for slow applications
• Goals: ready instructions, a good mix of instructions and predictable branches
  • Remove branches if possible
  • Reduce randomness of branches, avoid function pointers and jump tables
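"Keep data dependencies low" can be illustrated with a summation loop. In the first version below every addition waits for the previous one; the second splits the work across four independent accumulators so that several additions can be in flight at once. This is a sketch of the idea only: compilers may already apply it, and whether it helps depends on the CPU, so it should be measured with a profiler. The function names are hypothetical.

```c
#include <stddef.h>

/* Single accumulator: each addition depends on the one before it. */
unsigned long sum_chained(const unsigned int *v, size_t n)
{
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: the additions inside one iteration are independent. */
unsigned long sum_four_way(const unsigned int *v, size_t n)
{
    unsigned long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same value for the same input; only the shape of the dependency chain differs.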

Memory Overview (Pentium 4)
• L1 cache (data only): 8 kbytes
• Execution Trace Cache that stores up to 12K decoded micro-ops
• L2 Advanced Transfer Cache (data + instructions): 256 kbytes, 3 times slower than L1
• L3: 4 MB cache (optional)
• Main RAM (usually 64M … 4G): 10 times slower than L1

Fixing memory problems
• Use less memory to reduce compulsory cache misses
• Increase cache efficiency (place items used at the same time near each other)
• Read sooner with prefetch
• Write memory faster without using the cache
• Avoid conflicts
• Avoid capacity issues
• Add more work for the CPU (execute non-dependent instructions while waiting)
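"Place items used at the same time near each other" is easiest to see in how a two-dimensional array is walked. The two functions below compute the same sum over a matrix stored row by row in a flat array; the first visits neighbouring addresses in order, the second jumps a whole row ahead on every step. This is a sketch for experimenting under a cache profiler, with hypothetical function names.

```c
#include <stddef.h>

/* Walks memory in storage order: consecutive accesses are adjacent. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long s = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];
    return s;
}

/* Walks column by column: consecutive accesses are `cols` elements apart. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long s = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];
    return s;
}
```

The results are identical; on a matrix much larger than the cache, the column-wise walk would be expected to show far more data-cache misses.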

 

  

