Published by: ymsgrtest on Mar 11, 2011
Copyright:Attribution Non-commercial








  • Contents
  • The Problem
  • Extreme Optimization Pitfalls
  • Key Point:
  • The Benchmark
  • Attributes of good benchmark
  • Benchmark attributes (cont.)
  • Performance Tools Overview
  • Using Optimizing Compilers
  • Optimizing Compiler's effect
  • Windows Performance Monitor
  • Performance Monitor Counters
  • Profilers
  • Sampling vs. Instrumentation
  • Profiling Tools
  • Using gprof GNU profiler
  • Example Program
  • Flat profile : %time
  • Flat profile: Self seconds
  • Flat profile: Calls
  • Flat profile: Self seconds per call
  • Flat profile: Total seconds per call
  • How gprof works
  • VTune Modes/Features
  • Sampling mode
  • Sampling Mode Benefits
  • EBS : choosing events
  • Sampling ...
  • Viewing Sampling Results
  • Hotspot Graph
  • Source View
  • VTune Tuning assistant
  • Tuning Assistance Report
  • Hotspot Assistant Report
  • Call Graph Mode
  • Call Graph Screenshot
  • Call Graph (Cont.)
  • Jump to Source view
  • Call Graph - Call List View
  • Getting Help
  • VTune Summary
  • Valgrind Toolkit
  • Memcheck Features
  • Memcheck Example
  • Memcheck Example (Cont.)
  • Memcheck report
  • Cachegrind
  • Cachegrind Summary output
  • Cachegrind Accuracy
  • Massif tool
  • Executing Massif
  • Spacetime Graphs
  • Spacetime Graph (Cont.)
  • Text/HTML Report example
  • Valgrind - how it works
  • Valgrind Summary
  • Other Tools
  • Writing Fast Programs
  • CPU Architecture (Pentium 4)
  • Instruction Execution
  • Keeping CPU Busy
  • Memory Overview (Pentium 4)
  • Fixing memory problems
  • References
  • Questions?

Profiling tools

By Vitaly Kroivets
for Software Design Seminar

Profiling Tools



Contents
  • Introduction
      • Software optimization process, optimization traps and pitfalls
      • Benchmark
  • Performance tools overview
      • Optimizing compilers
      • System performance monitors
  • Profiling tools
      • GNU gprof
      • Intel VTune
      • Valgrind
  • What does it mean to use the system efficiently



The Problem
  • PC speed has increased about 500 times since 1981, but today's software is more complex and still hungry for more resources
  • How can we run faster on the same hardware and OS architecture?
  • Highly optimized applications run tens of times faster than poorly written ones
  • Using efficient algorithms and well-designed implementations leads to high-performance applications

The Software Optimization Process
  • Create benchmark
  • Find hotspots
      • Hotspots are areas in your code that take a long time to execute
  • Investigate causes
  • Modify application
  • Retest using benchmark

Extreme Optimization Pitfalls
  • A large application's performance cannot be improved before it runs
  • Build the application, then see what machine it runs on
  • Runs great on my computer...
  • Debug versus release builds
  • Performance requires assembly language programming
  • Code features first, then optimize if there is time left over

Key Point: Software optimization doesn't begin where coding ends. It is an ongoing process that starts at the design stage and continues all the way through development.

The Benchmark
  • A benchmark is a program that is used to:
      • Objectively evaluate the performance of an application
      • Provide repeatable application behavior for use with performance analysis tools
  • Industry-standard benchmarks:
      • TPC-C
      • 3D-Winbench
      • http://www.specbench.com/ : Enterprise Services, Graphics/Applications, HPC/OMP, Java Client/Server, Mail Servers, Network File System, Web Servers

Attributes of good benchmark
  • Repeatable (consistent measurements)
      • Remember system tasks, caching issues
      • "Incoming fax" problem: use the minimum performance number
  • Representative
      • Execution of a typical code path; mimic how the customer uses the application
      • Poor benchmarks: using QA tests

Benchmark attributes (cont.)
  • Easy to run
  • Verifiable
      • Need QA for the benchmark!
  • Measure elapsed time vs. other number
  • Use the benchmark to test functionality
      • Algorithmic tricks to gain performance may break the application...
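A minimal sketch of a benchmark harness in C that follows the attributes above: it repeats the run, reports the minimum elapsed time (one reading of "use the minimum performance number", so that a disturbance such as a background task only affects the runs it hits), and verifies that every run produced the same result. The names `workload`, `time_once` and `benchmark_min` are hypothetical, and `clock_gettime` assumes a POSIX system.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime (POSIX assumption) */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical workload standing in for a typical code path of the application. */
static unsigned long workload(void)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i % 7;
    return sum;
}

/* Elapsed wall-clock seconds for one call of fn; its result is stored in *out. */
static double time_once(unsigned long (*fn)(void), unsigned long *out)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    *out = fn();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/*
 * Runs fn `runs` times and returns the minimum elapsed time in seconds.
 * Returns -1.0 if two runs disagree on the result (benchmark not verifiable).
 */
static double benchmark_min(unsigned long (*fn)(void), int runs,
                            unsigned long *result)
{
    assert(runs > 0);
    unsigned long first = 0;
    double best = time_once(fn, &first);
    for (int i = 1; i < runs; i++) {
        unsigned long current = 0;
        double elapsed = time_once(fn, &current);
        if (current != first)
            return -1.0;          /* functionality check failed */
        if (elapsed < best)
            best = elapsed;       /* keep the least-disturbed run */
    }
    *result = first;
    return best;
}
```

A caller would invoke something like `benchmark_min(workload, 5, &r)` from `main()` and print the returned time; the number of runs is an arbitrary choice here.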

How to find performance bottlenecks
  • Determine how your system resources, such as memory and processor, are being utilized to identify system-level bottlenecks
  • Measure the execution time for each module and function in your application
  • Determine how the various modules running on your system affect the performance of each other
  • Identify the most time-consuming function calls and call sequences within your application
  • Determine how your application is executing at the processor level to identify microarchitecture-level performance problems

Performance Tools Overview
  • Timing mechanisms
      • Stopwatch: UNIX time tool
  • Optimizing compiler (easy way)
  • System load monitors
      • vmstat, iostat, perfmon.exe, VTune Counter
  • Software profiler
      • Gprof, VTune, Visual C++ Profiler, IBM Quantify
  • Memory debugger/profiler
      • Valgrind, IBM Purify, Parasoft Insure++

Using Optimizing Compilers
  • Always use compiler optimization settings to build an application for use with performance tools
  • Understanding and using all the features of an optimizing compiler is required for maximum performance with the least effort

Optimizing Compiler: choosing optimization flags combination

Optimizing Compiler's effect

Optimizing Compilers: Conclusions
  • Some processor-specific options still do not appear to be a major factor in producing fast code
  • More optimizations do not guarantee faster code
  • Different algorithms are most effective with different optimizations
  • Idea: use statistics gathered by the profiler as input for the compiler/linker

Windows Performance Monitor
  • Sampling "profiler"
  • Uses the OS timer interrupt to wake up and record the value of software counters: disk reads, free memory
  • Maximum resolution: 1 sec
  • Cannot identify the piece of code that caused an event to occur
  • Good for finding system issues
  • Unix tools: vmstat, iostat, xos, top, oprofile, etc.

Performance Monitor Counters

Profilers
  • A profiler may show the time elapsed in each function and its descendants, the number of calls and, in some profilers, the call graph
  • Profilers use either instrumentation or sampling to identify performance issues

Sampling vs. Instrumentation

                           | Sampling                                           | Instrumentation
Overhead                   | Typically about 1%                                 | High, may be 500%!
System-wide profiling      | Yes, profiles all apps, drivers, OS functions      | Just application and instrumented DLLs
Detect unexpected events   | Yes, can detect other programs using OS resources  | No
Setup                      | None                                               | Automatic insertion of data collection stubs required
Data collected             | Counters, processor and OS state                   | Call graph, call times, critical path
Data granularity           | Assembly-level instructions, with source line      | Functions, sometimes statements
Detects algorithmic issues | No. Limited to processes, threads                  | Yes, can see algorithm, call path is expensive

Profiling Tools Old. Unstable Profiling Tools 20 . buggy and inaccurate Gprof VTune Valgrind Intel Is not profiler really « $700.

GNU gprof
Instrumenting profiler for every UNIX-like system

Using gprof GNU profiler
  • Compile and link your program with profiling enabled
        cc -g -c myprog.c utils.c -pg
        cc -o myprog myprog.o utils.o -pg
  • Execute your program to generate a profile data file
      • The program will run normally (but slower) and will write the profile data into a file called gmon.out just before exiting
      • The program must exit normally, by returning from main() or calling exit(); otherwise gmon.out is not written
  • Run gprof to analyze the profile data
        gprof myprog gmon.out

Example Program
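The program shown on this slide is not preserved in the text. As a stand-in, here is a small hypothetical C workload whose function names `g` and `doit` mirror the ones that appear in the call-graph slides; the loop counts and the arithmetic are invented for illustration only.

```c
#include <assert.h>

/* Leaf function: the deliberate hotspot where almost all time is spent. */
static unsigned long doit(unsigned long n)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i++)
        acc += (i * i) % 3;
    return acc;
}

/* Middle function: little self time, most of its time comes from doit(). */
static unsigned long g(int calls, unsigned long n)
{
    assert(calls >= 0);
    unsigned long total = 0;
    for (int c = 0; c < calls; c++)
        total += doit(n);
    return total;
}

/* Hypothetical entry point to be called from main(): 100 calls of doit(). */
unsigned long run_example(void)
{
    return g(100, 1000000UL);
}
```

If `main()` calls `run_example()` and the program is built with `-pg`, the profile would be expected to attribute most self time to `doit` and to report 100 calls to it, provided the compiler has not inlined the functions; building without optimization is the simplest way to keep them visible.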

Understanding Flat Profile
  • The flat profile shows the total amount of time your program spent executing each function.
  • If a function was not compiled for profiling, and didn't run long enough to show up on the program counter histogram, it will be indistinguishable from a function that was never called

Flat profile: %time
Percentage of the total execution time your program spent in this function. These should all add up to 100%.

Flat profile: Cumulative seconds
This is the cumulative total number of seconds spent in this function, plus the time spent in all the functions above this one in the table.

Flat profile: Self seconds
Number of seconds accounted for by this function alone.

Flat profile: Calls
Number of times the function was invoked.

Flat profile: Self seconds per call
Average number of seconds per call spent in this function alone.

Flat profile: Total seconds per call
Average number of seconds spent in this function and its descendants per call.

Call Graph: call tree of the program
  • Called by: main( )
  • Current function: g( )
  • Descendants: doit( )

Call Graph: understanding each line (current function: g( ))
  • Unique index of this function
  • Percentage of the "total" time spent in this function and its children
  • Total amount of time spent in this function
  • Total time propagated into this function by its children
  • Number of times the function was called

Call Graph: parents' numbers (current function: g( ))
  • Time that was propagated directly from the function into this parent
  • Time that was propagated from the function's children into this parent
  • Number of times this parent called the function "/" total number of times the function was called

Call Graph: "children" numbers (current function: g( ))
  • Amount of time that was propagated directly from the child into the function
  • Amount of time that was propagated from the child's children to the function
  • Number of times this function called the child "/" total number of times this child was called

How gprof works
  • Instruments the program to count calls
  • Watches the program running, samples the PC every 0.01 sec
      • Statistical inaccuracy: a fast function may take 0 or 1 samples
      • The run should be long enough compared with the sampling period
      • Combine several gmon.out files into a single report
  • The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of run time
  • Number-of-calls figures are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic
  • Profiling with inlining and other optimizations needs care

VTune performance analyzer
To squeeze every bit of power out of Intel architecture!

VTune Modes/Features
  • Time- and Event-Based, System-Wide Sampling provides developers with the most accurate representation of their software's actual performance with negligible overhead
  • Call Graph Profiling provides developers with a pictorial view of program flow to quickly identify critical functions and call sequences
  • Counter Monitor allows developers to readily track system activity during runtime, which helps them identify system-level performance issues

Sampling mode
  • Monitors all active software on your system
      • including your application, the OS, JIT-compiled Java* class files, Microsoft* .NET files, 16-bit applications, 32-bit applications, device drivers
  • Application performance is not impacted during data collection

Sampling Mode Benefits
  • Low-overhead, system-wide profiling helps you identify which modules and functions are consuming the most time, giving you a detailed look at your operating system and application
  • Benefits of sampling:
      • Profiling to find hotspots. Find the module, functions, lines of source code and assembly instructions that are consuming the most time
      • Low overhead. Overhead incurred by sampling is typically about one percent
      • No need to instrument code. You do not need to make any changes to code to profile with sampling

How does sampling work?
  • Sampling interrupts the processor after a certain number of events and records the execution information in a buffer area. When the buffer is full, the information is copied to a file. After saving the information, the program resumes operation. In this way, VTune™ maintains very low overhead (about one percent) while sampling
  • Time-based sampling: collects samples of active instruction addresses at regular time-based intervals (1 ms by default)
  • Event-based sampling: collects samples of active instruction addresses after a specified number of processor events
  • After the program finishes, the samples are mapped to modules and stored in a database within the analyzer program

Starting the Sampling Wizard
  • Hardware prevents sampling of many counters simultaneously
  • Unsupported CPU? Ha-ha-ha...

EBS: choosing events

Events counted by VTune
  • Basic Events: clock cycles, retired instructions
  • Instruction Execution: instruction decode, issue and execution, data and control speculation, and memory operations
  • Cycle Accounting Events: stall cycle breakdowns
  • Branch Events: branch prediction
  • Memory Hierarchy: instruction prefetch, instruction and data caches
  • System Events: operating system monitors, instruction and data TLBs

Sampling ...

Viewing Sampling Results
  • Process view: all the processes that ran on the system during data collection
  • Thread view: the threads that ran within the processes you select in Process view
  • Module view: the modules that ran within the selected processes and threads
  • Hotspot view: the functions within the modules you select in Module view

Different events collected - modules view
  • System-wide look at software running on the system
  • Our program
  • CPI: good average indication

Hotspot Graph
  • Each bar represents one of the functions of our program
  • Click on a hotspot bar and VTune displays the source code view

Source View
  • Test_if function

Annotated Source View (% of module)
  • See how much time is spent on each line
  • Check this "for" loop! 10% of CPU time is spent in a few statements

VTune Tuning assistant
  • In a few clicks we reached the performance problem! Now, how to solve it?
  • Tuning Assistant highlights performance problems
      • Provides approximate time lost by each performance problem
      • Database contains performance metrics based on Intel's experience of tuning hundreds of applications
  • Analyzes the data gathered by our application
  • Generates tuning recommendations for each "hotspot"
  • Gives the user an idea of what might be done to fix the problem

Tuning Assistance Report

Hotspot Assistant Report: Penalties

Hotspot Assistant Report

Call Graph Mode
  • Provides a pictorial view of program flow to quickly identify critical functions and call sequences
  • Call graph profiling reveals:
      • Structure of your program on a function level
      • Number of times a function is called from a particular location
      • The time spent in each function
      • Functions on a critical path

Call Graph Screenshot
  • The function summary pane
  • Critical Path displayed as red lines: the call sequence in an application that took the most time to execute
  • Switch to Call List View

Call Graph (Cont.)
  • Wait time: how much time was spent waiting for an event to occur
  • Additional info is available by hovering the mouse over the functions

Jump to Source view

Call Graph - Call List View
  • Caller Functions are the functions that called the Focus Function
  • Callee Functions are the functions that are called by the Focus Function

Counter Monitor
  • Use the Counter Monitor feature of VTune™ to collect and display performance counter data. Counter Monitor selectively polls performance counters, which are grouped categorically into performance objects.
  • With the VTune analyzer, you can:
      • Monitor selected counters in performance objects.
      • Correlate performance counter data with data collected by other features in the VTune analyzer, such as sampling.
      • Trigger the collection of counter data on events other than a periodic timer.

Counter Monitor



Getting Help
  • Context-sensitive help
  • Online Help repository



VTune Summary
  • Pros: allows getting the best possible performance out of Intel architecture
  • Cons: extreme tuning requires deep understanding of processor and OS internals



Valgrind
Multi-purpose Linux x86 profiling tool

Valgrind Toolkit
  • Memcheck is a memory debugger
      • detects memory-management problems
  • Cachegrind is a cache profiler
      • performs detailed simulation of the I1, D1 and L2 caches in your CPU
  • Massif is a heap profiler
      • performs detailed heap profiling by taking regular snapshots of a program's heap
  • Helgrind is a thread debugger
      • finds data races in multithreaded programs

Memcheck Features
  • When a program is run under Memcheck's supervision, all reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted
  • Memcheck can detect:
      • Use of uninitialised memory
      • Reading/writing memory after it has been free'd
      • Reading/writing off the end of malloc'd blocks
      • Reading/writing inappropriate areas on the stack
      • Memory leaks, where pointers to malloc'd blocks are lost forever
      • Passing of uninitialised and/or unaddressable memory to system calls
      • Mismatched use of malloc/new/new [] vs free/delete/delete []
      • Overlapping src and dst pointers in memcpy() and related functions
      • Some misuses of the POSIX pthreads API

Memcheck Example
  • Access of unallocated memory
  • Using a non-initialized value
  • Memory leak
  • Using "free" on memory allocated by "new"

Memcheck Example (Cont.)
  • Compile the program with the -g flag:
        g++ -g a.cc -o a.out
  • Execute (--leak-check=yes debugs leaks; a.out is the executable name; Valgrind reports go to stderr, so redirect it as well):
        valgrind --tool=memcheck --leak-check=yes ./a.out > log 2>&1
  • View log
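A minimal sketch in C of the memory-leak category only, since it is the one error class in the list that can be reproduced without undefined behavior. The function names `sum_leaky` and `sum_fixed` are hypothetical and are not the code from the original slide.

```c
#include <assert.h>
#include <stdlib.h>

/* Fills a heap block, returns the sum of its elements, and never frees it. */
static long sum_leaky(int n)
{
    assert(n > 0);
    long *block = malloc((size_t)n * sizeof *block);
    if (block == NULL)
        return -1;
    long sum = 0;
    for (int i = 0; i < n; i++) {
        block[i] = i;
        sum += block[i];
    }
    return sum;   /* intentional bug: the only pointer to block is lost here */
}

/* Same computation with the block released before returning. */
static long sum_fixed(int n)
{
    assert(n > 0);
    long *block = malloc((size_t)n * sizeof *block);
    if (block == NULL)
        return -1;
    long sum = 0;
    for (int i = 0; i < n; i++) {
        block[i] = i;
        sum += block[i];
    }
    free(block);
    return sum;
}
```

Called from `main()` in a program built with `-g` and without optimization, `sum_leaky` would be expected to show up in a `--leak-check=yes` report as a lost block with its allocation stack, while `sum_fixed` should not.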

Memcheck report

Memcheck report (cont.)
  • Leaks detected, each reported with its stack

Cachegrind
  • Detailed cache profiling can be very useful for improving the performance of the program
      • On a modern x86 machine, an L1 miss will cost around 10 cycles, and an L2 miss can cost as much as 200 cycles
  • Cachegrind performs detailed simulation of the I1, D1 and L2 caches in your CPU
  • Can accurately pinpoint the sources of cache misses in your code
  • Identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries
  • Cachegrind runs programs about 20--100x slower than normal
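To make "sources of cache misses" concrete, here is a hedged C sketch of a classic case: the same matrix summed in row-major and in column-major order. The names and the 512x512 size are assumptions chosen for illustration; the actual miss counts depend on the simulated cache configuration.

```c
#include <assert.h>

#define GRID_MAX 512
static int grid[GRID_MAX][GRID_MAX];

/* Fills the top-left n x n part of the grid with small values. */
static void fill_grid(int n)
{
    assert(n > 0 && n <= GRID_MAX);
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            grid[r][c] = (r + c) & 0xFF;
}

/* Walks memory sequentially: consecutive accesses share cache lines. */
static long sum_row_major(int n)
{
    assert(n > 0 && n <= GRID_MAX);
    long sum = 0;
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            sum += grid[r][c];
    return sum;
}

/* Jumps a whole row between accesses: each access tends to touch a new line. */
static long sum_col_major(int n)
{
    assert(n > 0 && n <= GRID_MAX);
    long sum = 0;
    for (int c = 0; c < n; c++)
        for (int r = 0; r < n; r++)
            sum += grid[r][c];
    return sum;
}
```

Both functions return the same sum; under a cache profiler the column-major loop would be expected to show noticeably more D1 read misses on its `sum += grid[r][c]` line.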

How to run
  • Put valgrind --tool=cachegrind in front of the normal command line invocation
      • Example: valgrind --tool=cachegrind ls -l
  • When the program finishes, Cachegrind will print summary cache statistics. It also collects line-by-line information in a file cachegrind.out.pid
  • Execute cg_annotate to get an annotated source file (7618 is the PID, a.cc is the source file):
        cg_annotate --7618 a.cc > a.cc.annotated

Cachegrind Summary output: instruction cache performance
  • I-cache reads (instructions executed)
  • I1 cache read misses
  • L2-cache instruction read misses

Cachegrind Summary output: data cache READ performance
  • D-cache reads (memory reads)
  • D1 cache read misses
  • L2-cache data read misses

Cachegrind Summary output: data cache WRITE performance
  • D-cache writes (memory writes)
  • D1 cache write misses
  • L2-cache data write misses

Cachegrind Accuracy
  • Valgrind's cache profiling has a number of shortcomings:
      • It doesn't account for kernel activity: the effect of system calls on the cache contents is ignored
      • It doesn't account for other process activity (although this is probably desirable when considering a single program)
      • It doesn't account for virtual-to-physical address mappings, hence the entire simulation is not a true representation of what's happening in the cache

Massif tool
  • Massif is a heap profiler: it measures how much heap memory programs use. It can give information about:
      • Heap blocks
      • Heap administration blocks
      • Stack sizes
  • Helps to reduce the amount of memory the program uses
      • Smaller programs interact better with caches and avoid paging
  • Detects leaks that aren't detected by traditional leak checkers, such as Memcheck
      • That's because the memory isn't ever actually lost: a pointer remains to it, but it's not in use anymore

Executing Massif
  • Run valgrind --tool=massif prog
  • Produces the following:
      • Summary
      • Graph picture
      • Report
  • Summary will look like this:
      • Total spacetime: 2,258,106 ms.B (space in bytes multiplied by time in milliseconds)
      • Heap: 24.0% (number of words allocated on the heap, via malloc(), new and new[])
      • Heap admin: 2.2%
      • Stack(s): 73.7%

Spacetime Graphs

Spacetime Graph (Cont.)
  • Each band represents a single line of source code
  • It's the height of a band that's important
  • Triangles on the x-axis show each point at which a memory census was taken
      • Not necessarily evenly spread: Massif only takes a census when memory is allocated or de-allocated
  • The time on the x-axis is wall-clock time
      • Not ideal, because you can get different graphs for different executions of the same program, due to random OS delays

Text/HTML Report example
  • Contains a lot of extra information about heap allocations that you don't see in the graph
  • Shows places in the program where most memory was allocated

Valgrind - how it works
  • Valgrind is compiled into a shared object, valgrind.so. The shell script valgrind sets the LD_PRELOAD environment variable to point to valgrind.so. This causes the .so to be loaded as an extra library to any subsequently executed dynamically-linked ELF binary
  • The dynamic linker allows each .so in the process image to have an initialization function which is run before main(). It also allows each .so to have a finalization function run after main() exits
  • When valgrind.so's initialization function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked in valgrind.so until the end of the run
  • System calls are intercepted; signal handlers are monitored

Valgrind Summary
  • Valgrind will save hours of debugging time
  • Valgrind can help speed up your programs
  • Valgrind runs on x86-Linux
  • Valgrind works with programs written in any language
  • Valgrind is actively maintained
  • Valgrind can be used with other tools (gdb)
  • Valgrind is easy to use
      • Uses dynamic binary translation, so no need to modify, recompile or re-link applications. Just prefix the command line with valgrind and everything works
  • Valgrind is not a toy
      • Used by large projects: 25 million lines of code
  • Valgrind is free

Other Tools
Tools not included in this presentation:
  • IBM Purify
  • Parasoft Insure
  • KCachegrind
  • Oprofile
  • GCC's and GLIBC's debugging hooks



Writing Fast Programs
  • Select the right algorithm
  • Implement it efficiently
  • Detect hotspots using a profiler and fix them
  • Understanding of the target system architecture is often required, such as cache structure
  • Use platform-specific compiler extensions: memory pre-fetching, cache control instructions, branch prediction, SIMD instructions
  • Write multithreaded applications ("Hyper-Threading Technology")



CPU Architecture (Pentium 4)
Branch prediction

Instruction fetch

Instruction decode

Instruction pool


Execution Units


Instruction Execution
  • Instruction pool
  • Dispatch unit
  • Execution units: Integer, Integer, Floating point, Floating point, Memory Load, Memory Save

Keeping CPU Busy
  • Processors are limited by data dependencies and speed of instructions
      • Keep data dependencies low
      • A good blend of instructions keeps all execution units busy at the same time
  • Waiting for memory with nothing else to execute is the most common reason for slow applications
  • Goals: ready instructions, a good mix of instructions and predictable branches
      • Remove branches if possible
      • Reduce randomness of branches, avoid function pointers and jump tables
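Two of the points above, removing branches and keeping data dependencies low, can be sketched in C as follows. The function names are hypothetical, and an optimizing compiler may already perform either transformation on its own, so any gain has to be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Counts elements >= t with a data-dependent branch in the loop body. */
static size_t count_ge_branchy(const int *a, size_t n, int t)
{
    assert(a != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= t)
            count++;
    }
    return count;
}

/* Same count without the branch: the comparison result (0 or 1) is added. */
static size_t count_ge_branchless(const int *a, size_t n, int t)
{
    assert(a != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(a[i] >= t);
    return count;
}

/* Sums with two independent accumulators to shorten the dependency chain. */
static long sum_two_accumulators(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];       /* these two additions do not depend on each other */
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];       /* odd element left over */
    return s0 + s1;
}
```

The branchless variant trades an unpredictable jump for an always-executed add; the two-accumulator sum lets two additions proceed in parallel instead of each waiting for the previous one.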

Memory Overview (Pentium 4)
  • L1 cache (data only): 8 kbytes
  • Execution Trace Cache that stores up to 12K of decoded micro-ops
  • L2 Advanced Transfer Cache (data + instructions): 256 kbytes, 3 times slower than L1
  • L3: 4 MB cache (optional)
  • Main RAM (usually 64M ... 4G): 10 times slower than L1

Fixing memory problems
  • Use less memory to reduce compulsory cache misses
  • Increase cache efficiency (place items used at the same time near each other)
  • Read sooner with prefetch
  • Write memory faster without using cache
  • Avoid conflicts
  • Avoid capacity issues
  • Add more work for the CPU (execute non-dependent instructions while waiting)
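As one illustration of "place items used at the same time near each other", the C sketch below compares an array of structures, where each hot field is surrounded by data the loop never reads, with a separate packed array of the hot field. The type `particle_aos`, its 56-byte `name` field and the function names are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures layout: 8 hot bytes followed by 56 cold bytes. */
struct particle_aos {
    double x;        /* read by the summing loop */
    char name[56];   /* not touched by the loop, but occupies cache lines */
};

/* Reads one double per 64-byte element, so most fetched bytes are unused. */
static double sum_x_aos(const struct particle_aos *p, size_t n)
{
    assert(p != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;
    return sum;
}

/* Hot field stored contiguously: every byte fetched into the cache is used. */
static double sum_x_soa(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

With 64-byte cache lines the packed version would touch roughly one eighth as many lines for the same number of elements, which is the kind of difference a cache profiler makes visible.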

References
  • SPEC website http://www.specbench.org
  • The Software Optimization Cookbook: High-Performance Recipes for the Intel® Architecture by Richard Gerber
  • GCC Optimization flags http://gcc.gnu.org/onlinedocs/gcc/OptimizeOptions.html
  • An Evolutionary Analysis of GNU C Optimizations: Using Natural Selection to Investigate Software Complexities by Scott Robert Ladd
  • Intel VTune Performance Analyzer webpage http://www.intel.com/software/products/vtune/
  • Gprof man page http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_mono/gprof.html
  • Valgrind Homepage http://valgrind.kde.org

Questions?
