This sample shows the first two and last two pages of a paper.

No page numbers

A Comparison of Path Profiling and Edge Profiling in C++ Applications

Brian A. Malloy^a

Clemson University, U.S.A.

Abstract

We investigate the notion of phases as they occur in object-oriented programs. The focus of our investigation is on C++ programs and our test suite includes various kinds of object-oriented programs including scientific and general-purpose applications. We focus on individual phased branch behavior and attempt to capture information and generalize about the frequency of phased behavior. We use profiling to determine phase changes in various program executions.

Keywords: Profiling, edge profiling, path profiling, performance, control flow graph.

1. Introduction

The tremendous interest and growth in the internet has brought with it a demand for mobile code that is compact and efficient. The idea behind mobile code is that applications can be distributed quickly across computer networks and automatically executed upon arrival. For these applications to run efficiently, they must adapt to the run-time environment of the target architecture. In view of the demand for speed, there is evidence that traditional static optimizations may not provide the efficiency required for modern mobile applications. Moreover, there is increased use of object technology in these mobile applications with a concurrent belief that object-oriented code, with its frequent use of dynamic binding, is less amenable to traditional, static optimizations.

One optimization in common use today is feedback-directed optimization (FDO), which uses program characteristics, obtained at runtime, to attempt to improve the performance of the application. Most FDO approaches use profile-guided compilation, which facilitates determination of those parts of a program to modify in order to maximize performance [26]. Profiling entails instrumenting an application to obtain a cost metric such as time spent in a method or the number of times a method executes. This metric can indicate those parts of a program where optimization is likely to achieve a

If the input to a program changes, the optimizations performed on a previous execution may no longer be beneficial [27]. In fact, the performance of a statically optimized program may degrade appreciably, so that performance is worse than the un-optimized version of the program [26].

A second drawback with FDO relates to the determination of the instrumentation strategy that is likely to provide the most information at a tolerable cost. Traditional approaches to profile-guided optimization construct a graphical representation of the program that describes the flow of control, and then place counters at vertices or edges in the graph. Edge profiles record an aggregate count of the frequency of edge executions that provide general information about the behavior of a program.

However, reference [26] argues for the collection of more detailed information than aggregate edge information to describe a program's run-time characteristics, introducing the notion that the run-time behavior of a program cycles through a series of phases. For example, the branch prediction and cache miss rate might be more accurately described over the execution of a program using phases [23]. Phase profiling might address the shortcomings of edge profiling if one could detect changes in phase and apply an optimization more suited for a particular phase and later apply a different one for a different phase.

^a Computer Science Department, Clemson, SC 29634, USA, malloy@cs.clemson.edu

Figure 1. Phased Behavior. This is a branch taken from the lcom benchmark which exhibits phased behavior. The pink line at y = 55% represents the average bias for the branch. The blue line represents the sampled taken bias for the branch. The sampled bias seems to interleave between periods of being strongly taken and being weakly taken. An optimization based purely on the
aggregate bias would adversely affect performance during the period of very weak taken bias.

To illustrate the impact that phased behavior might have on optimization, consider the graph in Figure 1, which models the run-time branching behavior of an application. The branch shown in the figure is biased towards “taken” 55% of the time; that is, aggregate information would lead the profiler to conclude that the branch is likely to be “taken” more often than not. However, the sampled taken bias interleaves between periods of strong bias towards “taken” and periods of “not taken”. An optimization based strictly on the 55% average bias would perform poorly during the phases with significant bias towards “not taken”.

In this paper, we investigate the notion of phases as they might occur in object-oriented programs. The focus of our investigation is on C++ programs and our test suite includes various kinds of object-oriented programs including scientific and general-purpose applications. We focus on individual phased branch behavior and attempt to capture information and generalize about the frequency of phased behavior. We provide guidance about the advantages of using path profiling over edge profiling for improving the performance of programs that exhibit phased branch behavior.

The remainder of this paper is organized as follows. In the next section we provide background about various approaches to program profiling, including the notion of superblock scheduling. In Section 3 we describe the simulation tool that we exploit to investigate phased behavior and in Section 4 we describe our case study that forms the basis of our conclusions about phased behavior. In Section 5 we present the results obtained from analyzing phased behavior for individual branches and discuss these results in Section 6. Section 7 overviews related work in the area of program optimizations, phases and the use of profiling. Finally, in Section 8 we summarize and conclude.

2. Background

In this section, we provide definitions of terms and background about profiling. We discuss both edge and path profiling including a greedy algorithm for computing edge profile information. We conclude this section with a discussion of phases and phased behavior, including the impact of phased behavior on superblock scheduling.

2.1 Profiling

One of the common uses of profiles is to derive paths taken during program execution for use in path-based optimizations [7]. Edge profiling records the execution count of transitions between basic blocks [5] or edges. Since edge profiles summarize path execution using aggregate edge counts [26, 6], program paths must be derived from a profile. Generally, path-based optimizations are performed on hot paths, that is, heavily executed paths, in order to obtain the best performance gain with limited resources.

One technique for deriving heavily executed paths from an edge profile is to use a greedy algorithm that follows the edge with the maximum execution frequency out of a basic block [6]. This greedy algorithm does not always produce accurate results; more specifically, the greedy algorithm may misidentify which path significantly contributes to the overall control flow of a program. For example, given an edge profile for a CFG such as Figure 2a, a greedy algorithm would incorrectly identify ACDEF instead of ABDEF as the most frequently executing path.

Path profiling addresses the shortcomings of edge profiles by recording the actual paths taken within an application [6]. Looking at Figure 2a, path profiling can disambiguate between the two paths and determine which path actually contributed significantly to the control flow of the program. Since the frequency counts for actual paths are recorded instead of aggregate edge counts, the correct hot path ABDEF is found for the CFG in Figure 2a.

The additional accuracy provided by path profiling can be beneficial for certain optimizations based on run-time profile information. Young and Smith demonstrated improvements that can be made to superblock schedulers using path profiling [29, 26]. A superblock scheduler attempts to improve program performance by increasing the amount of instruction-level parallelism (ILP) extractable

7. Related Work
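The contrast drawn in Section 2.1 between greedy derivation from an edge profile and direct path profiling can be made concrete with a minimal Python sketch. The path counts, the extra path ACDF, and all function names below are assumptions chosen for illustration; they are not the data behind Figure 2a, which this sample does not reproduce.

```python
from collections import Counter

# Hypothetical path profile: number of executions of each acyclic path.
# These counts are assumed for illustration only.
PATH_COUNTS = {
    "ABDEF": 40,
    "ACDEF": 30,
    "ACDF": 30,
}


def edge_profile(path_counts):
    """Collapse a path profile into aggregate edge counts."""
    edges = Counter()
    for path, count in path_counts.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return edges


def greedy_hot_path(edges, entry, exit_node):
    """Follow the most frequently executed outgoing edge of each block."""
    path = entry
    node = entry
    while node != exit_node:
        successors = {dst: n for (src, dst), n in edges.items() if src == node}
        node = max(successors, key=successors.get)
        path += node
    return path


def hottest_recorded_path(path_counts):
    """With a path profile, the hot path is the path with the largest counter."""
    return max(path_counts, key=path_counts.get)


edges = edge_profile(PATH_COUNTS)
greedy = greedy_hot_path(edges, "A", "F")
actual = hottest_recorded_path(PATH_COUNTS)
print(greedy, actual)  # prints: ACDEF ABDEF
```

With these assumed counts, edge A to C (60 executions) outweighs A to B (40), so the greedy walk commits to C even though ABDEF (40) executes more often than ACDEF (30). The aggregate edge counts alone cannot tell the two cases apart, which is the ambiguity that recording whole paths removes.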
In this section, we overview the research that relates to program optimizations, phases and the use of profiling.

7.1 The Deco Project at Harvard and Hewlett Packard’s Dynamo

The Deco project [14] at Harvard University and Hewlett Packard's Dynamo [4] are feedback-directed optimization systems which attempt to create executables which can adapt to their run-time environment. Each system uses a different mechanism for detecting changes in phase. The current prototype of Deco records Ball and Larus acyclic paths [6] and stores them in a hash table (the profile table). An execution count is maintained for each entry in the hash table. Deco uses an interrupt mechanism to examine the hash table and determine which path to optimize based on a threshold execution count. Each optimized piece of code is put into a fixed-size optimization cache. After the hash table is examined, each entry's execution count is zeroed. If
the cache is full, a random entry is evicted in order to deal with program phases.

Dynamo uses a heuristic called MRET (Most Recently Executed Tail) [4] for identifying hot traces. Within Dynamo, backward taken branches represent the end of a trace. The target addresses of these branches identify the start of new traces. An execution count is associated with each of these target addresses. If this execution count exceeds some threshold value, Dynamo begins to record the instruction stream until another backward taken branch is encountered. The trace is optimized and placed into a cache indexed by the target address that starts the trace. Subsequent encounters of the target address result in hits to the cache. The target address is replaced by the address of the optimized trace. Once the trace ends, control is returned to the Dynamo interpreter, which starts the tracing process again. To adapt to changes in phase, Dynamo flushes the trace cache whenever the rate of trace creation surpasses some threshold value. This flushing strategy attempts to readjust optimizations to changes in branch phase.

7.2 Time Varying Behavior

Sherwood and Calder looked at phases of program behavior that vary over time [23]. They looked at how the instructions executed per cycle, branch prediction rates, address prediction rates, cache miss rates, and value prediction rates influence each other and change over the lifetime of the SPEC95 benchmark suite. They used the SimpleScalar [9] simulator to record statistics for every 100 million committed instructions. Each point (for each 100 million instructions committed) obtained was graphed to view trends over time and to look for cyclic behavior. The cyclic behavior of the SPEC95 benchmarks was used to reduce the amount of time needed to get an accurate picture of program behavior. The graph was used to determine the length of a program cycle.

7.3 Adaptive Parallelism

Even though adaptive loop transformations for parallel computations can provide significant performance speedups, existing adaptive techniques waste processor resources. For a particular parallel computation, speedups gained by adding more processors could level off or possibly decrease. Hall and Martonosi showed that the behavior of some loops from the SPECfp95 and NAS benchmark suites may go through phases where there may be insufficient levels of parallelism, so additional processors may not help improve performance [15]. In some cases a serialized version of a loop might perform better.

8. Conclusions

We investigated the impact of phased behavior on various C++ benchmarks. We showed that the conditional branches within the C++ benchmarks exhibit a significant amount of phased behavior. We observed a mean percentage of phased behavior of 13% at a granularity of 1000, 13.7% at a granularity of 500, and 17% at a granularity of 100. We believe this indicates a great deal of opportunity for an optimizer that can exploit path profiling. We have described techniques to improve the accuracy of our work, as well as approaches to extend the work for analyzing phased behavior in conditional branches.

References

[1] G. Aigner and U. Hölzle, “Eliminating Virtual Function Calls in C++ Programs”, in Proc. of the European Conference on Object Oriented Programming, 1996.
[2] G. Albert, “A Transparent Method for Correlating Profiles with Source Programs”, in Proc. of the Second Workshop on Feedback-Directed Optimization (FDO), Nov. 1999.
[3] J. Anderson, “Continuous Profiling: Where Have All the Cycles Gone”, in Proc. of the Sixteenth ACM Symposium on Operating System Principles, Oct. 1997, pp. 1-14.
[4] V. Bala, E. Duesterwald and S. Banerjia, “Dynamo: A Transparent Dynamic Optimization System”, in Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2000.
[5] T. Ball and J. R. Larus, “Optimally Profiling and Tracing Programs”, ACM Trans. on Programming Languages and Systems, July 1994.
[6] T. Ball and J. R. Larus, “Efficient Path Profiling”, in Proc. of MICRO-29, Dec. 1996.
[7] T. Ball, P. Mataga and M. Sagiv, “Edge Profiling versus Path Profiling: The Showdown”, in Symposium on Principles of Programming Languages, Jan. 1998.
[8] F.-B. Bjorn and J. Maloney, “The deltablue algorithm: An incremental constraint hierarchy solver”, in Proc. of the Eighth Annual IEEE Phoenix Conference on Computers and Communications, 1989.
[9] D. Burger and T. M. Austin, “The SimpleScalar Toolset, Version 2.0”, in University of Wisconsin-Madison Technical Report, CS-TR-1997-1342, June 1997, pp. 128-137.
[10] B. Calder, D. Grunwald and B. Zorn, “Quantifying Behavioral Differences Between C and C++ Programs”, in Journal of Programming Languages, 1994.
[11] R. F. Cmelik and D. Keppel, “Shade: A Fast Instruction-Set Simulator for Execution Profiling”, in Proc. ACM SIGMETRICS Conference on the Measurement and Modeling of Computer Systems, May 1994, pp. 128-137.
[12] E. Cox, “Fuzzy Fundamentals”, IEEE Spectrum,
Oct. 1992, pp. 58-61.
[13] E. Duesterwald and V. Bala, “Software Profiling for Hot Path Prediction: Less is More”, in Proc. Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, Nov. 2000.
[14] E. Feigin, “A Case for Automatic Run-time Code Optimization”, Bachelor of Arts Thesis, Harvard College, April 1999.
[15] M. W. Hall and M. Martonosi, “Adaptive Parallelism in Compiler Parallelized Code”, in Concurrency: Practice and Experience, vol. 10, no. 14, pp. 1235-1250, 1998.
[16] W. Hwu, “The Superblock: An Effective Technique for VLIW and Superscalar Compilation”, in Journal of Supercomputing, vol. 7, pp. 229-248, Jan. 1993.
[17] G. J. Klir and B. Yuan, “Fuzzy Sets and Fuzzy Logic - Theory and Applications”, Upper Saddle River, NJ: Prentice Hall PTR, 1995.
[18] J. R. Larus and E. Schnarr, “EEL: Machine-Independent Executable Editing”, in Proc. of the SIGPLAN Conference on PLDI, vol. 30, no. 6, pp. 291-300, 1995.
[19] D. C. Lee, P. J. Crowley, J.-L. Baer, T. E. Anderson and B. N. Bershad, “Execution Characteristics of Desktop Applications on Windows NT”, in ISCA, 1998, pp. 27-38.
[20] M. A. Linton, J. M. Vlissides and P. R. Calder, “Composing User Interfaces with InterViews”, IEEE Computer, vol. 22, no. 2, pp. 8-22, 1989.
[21] T. Romer, G. Voelker, D. Lee, A. Wolman, W. Wong, H. Levy and B. Bershad, “Instrumentation and Optimization of Win32/Intel Executables Using Etch”, USENIX Windows NT Workshop, Seattle, WA, August 1997.
[22] S. Savari and C. Young, “Comparing and Combining Profiles”, in Proc. Second Workshop on Feedback-Directed Optimization (FDO), Nov. 1999.
[23] T. Sherwood and B. Calder, “Time Varying Behavior of Programs”, in UC San Diego Technical Report UCSD-CS99-630, Aug. 1999.
[24] L. Smith and C. Laird, “Android, Open Source Scripting for Testing & Automation”, in Dr. Dobb's Journal, no. 326, pp. 58-61, July 2001.
[25] M. D. Smith, “Extending SUIF for Machine-dependent Optimizations”, in Proc. of the First SUIF Compiler Workshop, 1996, pp. 14-25.
[26] M. D. Smith, “Overcoming the Challenges to Feedback-Directed Optimization”, in Proc. of the ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and Optimization (Dynamo'00), Jan. 2000.
[27] D. W. Wall, “Predicting Program Behavior Using Real or Estimated Profiles”, in Proc. of the ACM SIGPLAN '91 Conference on PLDI, vol. 26, pp. 59-70, June 1991.
[28] Z. Wang, K. Pierce and S. McFarling, “BMAT -- A Binary Matching Tool”, in Proc. of the Second Workshop on Feedback-Directed Optimization, 1999.
[29] C. Young and M. D. Smith, “Better Global Scheduling Using Path Profiles”, in Proc. 30th Annual IEEE/ACM Intl. Symp. on Microarchitecture, Nov. 1998.

Brian A. Malloy is an associate professor in the Department of Computer Science at Clemson University. Dr. Malloy's research focus is software engineering and compiler technology. He has investigated issues in software validation, testing and program representations to facilitate validation and testing. He has applied software engineering to parser development, especially applied to the construction of a parser front-end for C++. Dr. Malloy has given presentations at national and international conferences and workshops. He is an active member of the Association for Computing Machinery (ACM) and the IEEE Computer Society. Dr. Malloy has reviewed papers for IEEE Transactions on Parallel and Distributed Systems, Journal of Software, Practice and Experience, Journal of Parallelism and IEEE Transactions on Software Engineering.
