You are on page 1of 33

PTU – Intel® Performance Tuning Utility

Intel® PTU Sample View
Basic block grouping and execution flow analysis



Core 2 Architecture - Basic Blocks
32 bytes/clk

Next IP Instruction Fetch
16 bytes/clk

DPL Router L2Q/XQ





Length Decoding
6 inst/clk

FB (8 entries)

18 entries

32 Kb


32 LB + 20 SB



Decoders 4:1:1:1
7 µops/clk

Port 0 ALU/SSE FMul Port 1 ALU/SSE FAdd Port 5 ALU/SSE Branch Port 2 LOADS Port 3 STA Port 4 STD

11 entries


Alloc & Rename
4 µops/clk

Reservation Station
32 entries Result Bus dispatch 6 µops/clk


ReOrder Buffer
4 µops/clk

96 entries


Execution Reservation station (32 entries) Port 0 Port 1 Port 5 Port 2 Port 3 Port 4 ALU1 ALU2 IMUL ALU3 SHIFT2 JEU LOAD AGU STORE AGU INT STORE DATA (MIU) SHIFT1 LEA MOB Result Bus SIMD SIALU1 SISHIFT FMUL SIMUL SIALU2 SISHUF 32 entries LB 20 entries SB DTLB PMH FADD FPROM FP FDIV FSHUF DCU L2 / FSB ROB 4 10/23/2009 .

however. typically 2-3 only – This is important to remember when running VTune/PTU Sampling 5 10/23/2009 .ANY.CORE and CPU_CLK_UNHALTED. CPU_CLK_UNHALTED.REF have dedicated counter registers (1:1 mapping) – Some events are only counted by PMC0 ( precise and atretirement events not belonging to the three events above ) – All other events are counted either by PMC0 or PMC1 Counter structure allows collection of up to 5 events in one run .Core™ Architecture PMU The PMU of Intel® Core™ architecture offers some 700 events ( including sub-events ) compared to – Netburst™ Architecture: Some 140 Event counters (5 in total) are not identical – Three frequently used (“basic”) events INST_RETIRED. depending on the event selection.

Linux* Sampling drivers ) A free tool but requires Intel® VTune™ Performance Analyzer license A tool for experienced developers – Intel internally and externally Less focus on user guidance and help functionality as VTune Focus is on getting results fast. pragmatic view No registration/RPM/registry – simply unpack and start ! Same look and feel / same GUI .but an alternative A lot of components are shared ( e. make extraordinary possible 6 10/23/2009 .PTU – Background and Motivation A replacement for Intel® VTune™ Performance Analyzer ? • • • • • • • • No ! .g.on all platforms ( Eclipse-based ) Has been Intel-internal tool only but now made available to everybody ! A test environment for sophisticated new features/prototypes available in VTune (potentially) only in future releases • Includes pre-launch platform support Make simple things easy.

Intel® PTU Analysis Methods at a Glance Event Based Sampling (PMU hardware Events) • Optional event Multiplexing – more events collected in one run • Includes sophisticated Data Access Profiling • Includes features to compare multiple runs Statistical Call Graph Sampling (with loop detection) • No instrumentation / very low overhead • Statistical (pragmatical) approach – thus not 100% precise Instrumentation-based Call Graph/Count Analysis • Instrumentation similar to VTune Callgraph but using PIN technology • Intrusive (factor 2 or more) but 100% precise Heap / Memory Access Profiling 7 10/23/2009 .

Pre-built Eclipse package part of distribution * Other names and brands may be claimed as the property of others.Structural Components Data collectors – command utilities .Symbol resolver .Data converters . Comparing and Displaying Eclipsed-based GUI (“a plug-in for Eclipse”) .Windows* and Linux* .Collection controllers Data Viewers – command utilities .Driver based and non-driver based collectors . 8 10/23/2009 .single look and feel .Filtering.

vtsadiff. vtssview SEP Data conversion (symbols. 9 10/23/2009 PMU Hardware . tb5 -> database) * Other names andKernel Driver brands may be claimed as the property of others.PTU Architecture – High Level Overview Eclipse* GUI Configuration Collection Command Line Queries Data Command Line Collectors vtsarun. vtssrun Experiment Directory Command Line Viewers vtsaview.

Command Line utilities Collectors Viewers vtsarun – Sampling collector vtsaview – Sampling viewer vtdpview – Data profiler viewer vtsadiff – Hotspot difference viewer vtssrun – Statistical Call graph collector vtcgrun – Exact Call graph collector vthprun – Heap profiler collector vtssrun – Statistical Call graph viewer vtcgview – Exact Call graph viewer vthpview – Heap profiler viewer Command line utilities can be used independent of graphical interface GUI functionality is completely based on these “batch applications” 10 10/23/2009 .

Disassembly and Control Flow graph view for a procedure • • • All three (can be) displayed simultaneously Results difference (text mode. csv export. export/import Sampling Hotspot View expansion to show events per CPU Event over IP chart Assembler display choice ( Intel versus AT&T ) 11 10/23/2009 . and for identical sources Ability to modify CPU frequency via GUI CPU Frequency Changing Configuration per PMU. GUI mode) GUI for different binaries.Some Interesting Usability Features Basic blocks as underlying data representation • Aggregation. manipulation per basic block Drill-down to Source.

12 10/23/2009 .PTU Workspace & Projects Main concepts Eclipse* Workspace – Project • Experiment – Modules (for experiment sharing) – Sources (for experiment sharing) • • Project defines the application and its workload (parameters) Simply zipping up an experiment creates a self contained test case PTU Project is workload specification * Other names and brands may be claimed as the property of others.

Basic Sampling and others You can create your own configurations as well 5 predefined configurations come in PTU distribution Project + Profile Configuration produces the Experiment * Other names and brands may be claimed as the property of others. Project.PTU Profile Configurations • • Project is definition of the workload Profile Configuration is definition of collection method – You apply a Profile Configuration to a Project to collect Experiment – GUI generates a command line for collector and invokes it • • Workspace. 13 10/23/2009 . Profile Configuration are in GUI world only – – – Command line tools know only about Experiments Basic Statistical Callgraph.

Project vs. Profiling Configuration Configurations Configuration 1 Projects Configuration 2 Project 1 Project 2 Project 3 Experiment Result Experiment Result • Profiling Configuration is orthogonal to the Project concept • Profiling Configuration created by user can be applied to any Project 14 10/23/2009 .

module. basic block or instruction (RVA) Standard drill down to source/asm by row in spread sheet 15 10/23/2009 .Event based Sampling Use Basic Sampling predefined configuration or create your own with events you need in the experiment Granularity in base display is function.

Add the Control Flow Graph with the Tool bar Button 16 10/23/2009 .

All Three Views at once 17 10/23/2009 .

Events over IP histogram In Hotspot view right click on row of interest. appropriate item invokes • Events over IP histogram for selected module • Drill-down to highlighted region of histogram to source and/or disassembly view 18 10/23/2009 .

the addresses of load and store instructions can be reconstructed 19 10/23/2009 .L2_LINE_MISS Further the values of all registers are also captured in a buffer • • At completion of the instruction on Intel® Core™2 Processor Family Prior to execution of the instruction on Pentium® 4 Processors Coupled with the disassembly. an L2 load miss that retrieves a cacheline can be identified with MEM_LOAD_RETIRED. Intel® Core™2 processor family and Itanium® Processor family have Precise Event Based Sampling (PEBS) Precise events have hardware to capture the Instruction Pointer of the instruction that caused the event • Ex: On Intel® Core™2 Processor family.Data Access Profiling Intel® Pentium® processors.

Basic Data Access Profiling – Easy to Use Pre-Defined Event Set 20 10/23/2009 .

Data Access Result: Here Heavy Contention on marked Line Multiple Threads Accessing Different Offsets Indicate False Sharing 21 10/23/2009 .

From one Hotspot View to two of them… This was the model that we used so far… Aggregate over time But the reality is like this … Aggregate over time + over one more dimension 22 10/23/2009 .

Easy Navigation between both “Dimensions” Filtering .projecting slices onto another dimension Sorting – repositioning segments of the axes Applying granularity – changing scale of the axis Intel® PTU Data Access profiling operates with two hotspots views: Code and Data hotspots 23 10/23/2009 .

with data taken on different machines – They can be different but built from the same source • Allowing differences to be analyzed down to source view – They can be completely different (sources and binaries) • PTU will compare functions with the same names for modules with the same names Single click on first experiment Ctrl single click on second experiment Right click and select compare 24 10/23/2009 .x supports an analysis of differences of experiments This requires • Event names must be the same • Load Modules have the same names – They can be the same.Differences of EBS Measurments Intel® PTU 3.

Data blocked 2X2 unrolled Matrix Multiply compiled at -O2 25 10/23/2009 .

Data blocked 2X2 unrolled Matrix Multiply compiled at -O3 –QxT 26 10/23/2009 .

Only Significant Difference is Cycle Count Create Difference Display Control click to select 2 experiments Right click to select “Compare Experiments” 27 10/23/2009 .

Differences of Samples Differences in Cycles Shown in msec to Correct for Comparison of Machines at Different Frequencies 28 10/23/2009 .

Same Source can Display Difference per Source Line 29 10/23/2009 .

Statistical Call Graph Along with standard features supports multithreaded apps and 32 bit binaries on 64 bit OS’s Easy to identify path to the hotspot 30 10/23/2009 .

Exact Call Graph • • • Reconstructs the application’s Call graph by instrumenting function entry and exit points using the PIN dynamic instrumentation framework Records a caller-callee pair together with timing and call count information Text viewer provides the following information: – List of functions with their parents. children.8-2x 31 10/23/2009 . and timing information – Flat data: list of functions with timing and call information with no relationships – Per-thread view for multithreaded applications • Collection overhead for computationally-intensive console applications 1.

module name.Call Count GUI View • Displays a limited set of exact call graph data: a function name. and call count. which is a number of times the function was called 32 10/23/2009 .

com • Using it and providing feedback /submitting bugs Call to Action: • performance debugging tool • Focus on getting results quickly • The tool is easy to use but the performance methodology supported perfectly by this tool requires some experience – it is not a tool for the beginner !! you help make it better for you and contribute to future Performance Suite – User Forum at whatif.Summary • Intel® PTU . install and use Intel® PTU ! 33 10/23/2009 .