Exploiting Parallelism in Multicore Processors through Dynamic Optimizations

Abhimanyu Khosla, M.Tech CSE

What is TLS?

TLS (Thread-Level Speculation) executes potentially dependent threads, called speculative threads, in parallel.

Why Dynamic?

- Static thread management analyses extensive profile information (probability-based data and control dependence profiling).
- It can extract coarse-grained parallelism (several thousand instructions).
- However, it is difficult to statically estimate the performance impact even when extensive information is available.

Problems with Static Thread Management

- Profiling information cannot accurately predict the costs of speculation, synchronization, and other overheads.
- Speculative threads experience phase behavior.
- The performance impact of speculative threads depends on the underlying hardware configuration.
- Speculative thread behavior is input-dependent.

Experimental Infrastructure for Dynamic Optimizations

- Speculative Thread Execution Model
- Architectural Support
- Compilation Infrastructure
- Simulation Infrastructure

Speculative Thread Execution Model

- Allows the compiler to parallelize a sequential program without first proving the independence among the extracted threads.
- The underlying hardware keeps track of each memory access.
- TLS empowers the compiler to parallelize programs that were previously non-parallelizable.

- The compiler is forced to give up parallelizing code whose memory addresses are unknown at compile time.
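As a hypothetical illustration (the slide's original code is not shown), consider a loop whose store target depends on runtime data. A compiler cannot prove the iterations independent and gives up; under TLS, each iteration could run speculatively and be squashed only when a real conflict occurs:

```c
#include <assert.h>

/* Hypothetical example: the store target a[idx[i]] is unknown at compile
 * time, so the compiler cannot prove that iterations are independent and
 * must give up parallelizing this loop. Under TLS, each iteration could
 * run as a speculative thread and be squashed only when two iterations
 * actually touch the same element. */
void scatter_add(double *a, const int *idx, double v, int n) {
    for (int i = 0; i < n; i++) {
        a[idx[i]] += v;  /* may alias a[idx[j]] for some j != i */
    }
}
```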

Architectural Support for Speculation (CMP)

- The STAMPede approach is used to support TLS.
- Each core has a private first-level cache; the L2 cache is shared.
- The cache coherence protocol is extended with two new states, and with transitions to and from these states: SpS (Speculatively Shared) and SpE (Speculatively Exclusive).
- A speculatively loaded cache line enters the SpS or SpE state.
- All speculative threads are assigned a unique ID, and the sender's thread ID piggybacks on all invalidation messages.
- If an invalidation message arrives from a logically earlier thread for a cache line in the SpS or SpE state, the thread is squashed and re-executed.
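A minimal sketch of that squash check. The state encoding and the function are illustrative, not the actual STAMPede protocol implementation, and it assumes thread IDs increase in logical order:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative line states: the two speculative states extend the
 * ordinary coherence states. */
typedef enum { INVALID, SHARED, EXCLUSIVE, SP_SHARED, SP_EXCLUSIVE } LineState;

/* An invalidation carrying the sender's thread ID hits a cache line.
 * If the line was speculatively loaded (SpS/SpE) and the sender is
 * logically earlier, the speculative load may have read stale data,
 * so the thread must be squashed and re-executed. */
bool must_squash(LineState s, unsigned sender_id, unsigned my_id) {
    bool speculative = (s == SP_SHARED || s == SP_EXCLUSIVE);
    return speculative && sender_id < my_id;  /* lower ID = logically earlier */
}
```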

Compilation Infrastructure

- Built on the Open64 compiler.
- Extended to extract speculative threads from loops.
- To dynamically optimize where speculative threads should be spawned, the compiler is forced to create a different executable in which every loop is parallelized.

Simulation Infrastructure

- Based on a trace-driven, out-of-order superscalar processor simulator with two parts: Trace Generation (TG) and Architectural Simulation (AS).
- TG is based on the PIN instrumentation tool; it instruments all instructions to extract the instruction address, opcode, registers used, etc.
- AS is based on SimpleScalar, and so is its pipeline; AS reads the trace file and translates the code generated by the compiler into Alpha-like code.
- Wattch model: processor power consumption. Orion model: interconnection power consumption. Cacti model: cache.


Performance Estimation

- Cycles for TLS are broken down into 6 segments:
  - Busy: cycles spent graduating non-TLS instructions.
  - Exe Stall: cycles stalled due to lack of ILP.
  - dCache: cycles stalled due to data cache misses.
  - iFetch: cycles stalled due to fetch penalty.
  - Squash: cycles stalled due to speculation failures.
  - Others: cycles spent on various TLS overheads.
- The sequential execution time is derived from the TLS cycle breakdown: PSEQ = TLS - Squash - Others.
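The estimate can be computed directly from the cycle breakdown; a sketch (the function name is an assumption, the formula is from the slide):

```c
#include <assert.h>

/* Predicted sequential execution time: the measured TLS cycle count minus
 * the cycles that would not occur in a sequential run, i.e. squash cycles
 * and the TLS-specific overheads (PSEQ = TLS - Squash - Others). */
long estimate_seq_cycles(long tls_total, long squash, long others) {
    return tls_total - squash - others;
}
```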

Runtime Support

- Performance profiling with hardware performance monitors.
- Decision making for TLS.

Performance Profile with Hardware Monitors

- Hardware performance monitors are programmed to attribute execution cycles to the following categories:
  - Busy: cycles spent graduating instructions.
  - ExeStall: cycles stalled due to instruction execution delays; examining the instruction at the head of the stall gives some clue to the cause of the stall.
  - dCache: cycles stalled due to data cache misses. For each data cache miss, dCacheServe also counts the number of cycles needed to serve the miss.
  - iFetch: cycles stalled due to instruction fetch penalty.
  - Total: cycles elapsed since the beginning of the TLS invocation.
  - UsefulInstruction: number of non-TLS instructions committed.
  - ThreadCount: number of threads committed.
- Counters are maintained per core. A counter is "aggregated" if its value is aggregated from all the cores.


Counters

- Total is incremented on every clock cycle.
- At a given cycle, if the ROB is empty, the iFetch counter is incremented.
- If the instruction at the head of the ROB is able to graduate, the Busy counter is incremented; when a non-TLS-management instruction commits, the UsefulInstruction counter is incremented.
- If the instruction stalled at the head of the ROB is a memory operation, the dCacheServe counter is incremented.
- If the stalled instruction is a TLS management instruction, such as a thread creation/commit instruction or a synchronization instruction, no counter is incremented.
- Otherwise, the ExeStall counter is incremented.
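The per-cycle update rules can be sketched as follows. The struct layout and the ROB-head encoding are assumptions for illustration; the increment rules follow the slide:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative subset of the per-core hardware counters. */
typedef struct {
    long total, busy, useful, iFetch, dCacheServe, exeStall;
} Counters;

/* Assumed encoding of what the ROB head looks like this cycle. */
typedef enum {
    ROB_EMPTY,          /* nothing to graduate: fetch penalty */
    HEAD_GRADUATES,     /* head instruction retires this cycle */
    HEAD_STALL_MEMORY,  /* head is a stalled memory operation */
    HEAD_STALL_TLS,     /* head is a stalled TLS-management instruction */
    HEAD_STALL_OTHER    /* any other stall (lack of ILP) */
} RobHead;

void tick(Counters *c, RobHead head, bool committed_non_tls_insn) {
    c->total++;                              /* Total: every clock cycle */
    switch (head) {
    case ROB_EMPTY:          c->iFetch++;      break;
    case HEAD_GRADUATES:
        c->busy++;
        if (committed_non_tls_insn) c->useful++;
        break;
    case HEAD_STALL_MEMORY:  c->dCacheServe++; break;
    case HEAD_STALL_TLS:     /* no counter incremented */ break;
    case HEAD_STALL_OTHER:   c->exeStall++;    break;
    }
}
```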

Aggregating the Counters

- When a thread is spawned to a core, the counters on that core are reset.
- When a thread commits (only the non-speculative thread is allowed to commit), all the aggregated counters are forwarded to the next non-speculative thread, which aggregates the forwarded counters with its own counters.
- ThreadCount is incremented when a speculative thread becomes non-speculative.
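A sketch of that aggregation flow with a reduced counter set (illustrative; the real monitors track more fields than shown here):

```c
#include <assert.h>
#include <string.h>

/* Reduced counter set for illustration. */
typedef struct { long total; long threadCount; } AggCounters;

/* A thread spawned onto a core starts from zeroed counters. */
void on_spawn(AggCounters *core) { memset(core, 0, sizeof *core); }

/* Commit: the non-speculative thread forwards its aggregated counters to
 * the next thread, which folds them into its own counters as it becomes
 * non-speculative, bumping ThreadCount. */
void on_commit(const AggCounters *committing, AggCounters *next) {
    next->total       += committing->total;
    next->threadCount += committing->threadCount + 1;
}
```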

Counting Cycles for Data Cache Misses

- Estimating the cache performance of sequential execution from parallel execution is a complex task. Consider the following scenarios:
  - A data item in the L1 cache is invalidated by a message from a speculative thread.
  - A data item is needed by two threads running on two different cores, causing two cache misses.
  - A data item used by a thread is actually brought into the cache by another speculative thread which fails.

Decision Making

- To decide which loops to parallelize speculatively, a performance table is maintained for the candidate loops.
- After a candidate loop is executed in TLS mode, the main thread updates the table by adding the difference between the TLS execution time and the predicted sequential execution time to the performance summary.
- Each entry in the table contains two fields:
  - a saturation counter, which is incremented if the TLS execution outperforms the predicted sequential execution and decremented otherwise;
  - a performance profile summary, which contains the cumulative difference in execution time between the TLS execution and the estimated sequential execution.
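A sketch of one table entry and its update rule. The saturation width (2 bits) is an assumption; the slides do not give the counter's width:

```c
#include <assert.h>

/* One performance-table entry for a candidate loop. */
typedef struct {
    int  sat;       /* saturation counter */
    long cum_diff;  /* cumulative (TLS - estimated sequential) cycles */
} LoopEntry;

#define SAT_MAX 3  /* assumed 2-bit saturation counter */
#define SAT_MIN 0

/* After a TLS run of the loop: accumulate the difference, and saturate
 * the counter toward "parallelize" when TLS beat the predicted
 * sequential time (negative difference), toward "don't" otherwise. */
void update_entry(LoopEntry *e, long tls_cycles, long pred_seq_cycles) {
    long diff = tls_cycles - pred_seq_cycles;
    e->cum_diff += diff;
    if (diff < 0) {
        if (e->sat < SAT_MAX) e->sat++;
    } else {
        if (e->sat > SAT_MIN) e->sat--;
    }
}
```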

Performance Evaluation

- A dynamic performance tuning method is required to:
  - identify the loops that can take maximum advantage of TLS, and
  - select the right level of loop to be parallelized.

Dynamic Performance Tuning Policies

- Simple
- Quantitative
- Quantitative + Static
- Quantitative + StaticHint