
A Study on Hyper-Threading

Vimal Reddy, Ambarish Sule, Aravindh Anantaraman

Microarchitectural trends

• Higher degrees of instruction-level parallelism
• Different generations:
  I. Serial Processors – Fetch and execute each instruction back to back
  II. Pipelined Processors – Overlap different phases of instruction processing for higher throughput
  III. Superscalar Processors – Overlap different phases of instruction processing, and issue and execute multiple instructions in parallel for IPC > 1
  IV. ???

Superscalar limits
Limitations with superscalar approach:
- Amount of ILP in most programs is limited
- Nature of ILP in programs can be bursty
- Bottom line: Resources can be utilized better

Simultaneous Multithreading

• Finds parallelism at thread level
• Executes multiple instructions from multiple threads each cycle
• No significant increase in chip area over a superscalar processor
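As a rough illustration of why issuing from multiple threads each cycle can fill idle slots, the sketch below uses a toy model in which each thread is described by the number of instructions it has ready in each cycle. The function names, the issue width of 4 and the per-cycle counts are all assumptions made for illustration, not measurements. The model also ignores the fact that unissued instructions would normally carry over to later cycles.

```python
def utilization(ready_per_cycle, issue_width):
    """Fraction of issue slots used when at most issue_width
    instructions can issue in each cycle (toy model)."""
    issued = sum(min(ready, issue_width) for ready in ready_per_cycle)
    return issued / (issue_width * len(ready_per_cycle))


def smt_utilization(thread_a, thread_b, issue_width):
    """Utilization when ready instructions of two threads
    compete for the same issue slots in each cycle."""
    combined = [a + b for a, b in zip(thread_a, thread_b)]
    return utilization(combined, issue_width)


# Hypothetical bursty threads on a 4-wide machine.
thread_a = [4, 0, 1, 0]
thread_b = [0, 3, 0, 2]

print(utilization(thread_a, 4))                 # 0.3125
print(utilization(thread_b, 4))                 # 0.3125
print(smt_utilization(thread_a, thread_b, 4))   # 0.625
```

In this made-up case the bursts of the two threads do not overlap, so sharing the issue slots doubles utilization. Threads whose bursts coincide would gain less.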

[Figure: block diagram of a superscalar pipeline annotated with the changes needed for SMT]

Pipeline blocks: Fetch Unit, Instruction Cache, Decode, Register Renaming, Int. queue, FP queue, Int. Registers, FP Registers, Int. + load/store units, FP units, Data Cache

SMT annotations on the diagram:
• Multiple PCs; thread selection; replicate RAS; BTB thread ids
• Multiple rename map tables; multiple arch. map tables; multiple active lists
• Selective squash (shown twice)
• Replicate architectural state (shown twice)
• Per-thread disambiguation

From ece721 notes, Prof. Eric Rotenberg, NCSU


Hyper-Threading

• Brings goodness of Simultaneous MultiThreading (SMT) to Intel Architecture
• Motivation (same as that for SMT):
  – High processor utilization
  – Better throughput (by exploiting thread-level parallelism, TLP)
• Power efficient due to smaller processor cores compared to CMP

Hyper-Threading – Contd.

• 2 logical processors (2 threads in SMT terminology)
• Shared instruction trace cache and L1 D-cache
• 2 PCs and 2 register renamers
• Other resources partitioned equally between 2 threads
• Recombines shared resources when single threaded (no degradation of single thread performance)

Intel® NetBurst™ Microarchitecture Pipeline With Hyper-Threading Technology

Project Goal

Measure performance of micro-benchmarks (kernels) on Pentium-4. Form workloads to utilize different processor resources and study behavior.

Pentium4 Functional Units

– 3 integer ALU units (2 double speed)
– 1 unit for floating point computation
– Separate address generator units for loads and stores


Micro-benchmarks

Created 3 types of kernels:
– Floating point intensive kernel (flt)
  • Performs FP Add, Sub, Multiply, Divide operations a large number of times
  • Targets single FP unit
– Integer intensive kernel (int)
  • Performs integer Add, Subtract and Shift a large number of times
  • Targets integer units (2 double speed and 1 slow)
– Memory intensive kernel (mem, mem_s)
  • Dynamically allocates a linked list larger than L1 D$ and traverses it
  • Targets shared data cache and memory hierarchy as such
Micro-benchmarks (contd.)

Integer kernel

Floating Point kernel

Memory intensive kernel

• Machine: Pentium4 “Northwood” 2.53-2.66 GHz with Hyper-Threading
• Operating system: Linux 2.4.18-SMP kernel. OS views each thread as a processor
  – BIOS setting to turn HT On/Off
  – Perl script to fork processes at the same time
  – “top” (Linux utility) to monitor processes (processor and memory utilization)
• “time” utility to get timing statistics for each program
• Ran each experiment 10 times and took the average execution time
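The measurement procedure above (start the programs of a workload at the same time, time them, repeat 10 times and average) could look roughly like the following sketch. It is written in Python rather than Perl, it measures wall-clock time for the whole workload instead of calling the “time” utility per program, and the function names and the placeholder commands are assumptions, not the scripts used in the study.

```python
import subprocess
import sys
import time
from statistics import mean


def run_together(commands):
    """Start all commands at (nearly) the same time and return the
    wall-clock seconds until the last one finishes."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


def average_time(commands, repeats=10):
    """Average wall-clock time of the workload over several runs."""
    return mean(run_together(commands) for _ in range(repeats))


# Placeholder workload: two trivial processes standing in for two kernels.
# Real kernel binaries would be listed here instead (hypothetical).
EXAMPLE_WORKLOAD = [
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "pass"],
]
```

Calling `average_time(EXAMPLE_WORKLOAD)` returns the mean over 10 runs. A back-to-back baseline can be obtained by timing each command on its own and adding the results.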


• Run different workload combinations:
  – fltflt: 2 floating point kernels
  – mem_smem_s: 2 small memory intensive kernels
  – intflt: 1 integer and 1 float kernel
  – and so on
• Run in 3 modes:
  1. Back-to-back: Run each program individually
  2. HT Off: No Hyper-Threading, but OS context switching
  3. HT On: Hyper-Threading on and OS context switching
• Find “contending” workloads: Compete for resources and degrade performance (increase execution time with HT on)
• Find “complementary” workloads: Utilize idle resources and increase performance (decrease execution time with HT on)
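The contending/complementary distinction can be expressed as a simple comparison of execution times. The sketch below is one possible reading: it assumes the HT On time is compared against a single baseline time (for example the HT Off time), and the `tolerance` parameter, the "neutral" label and the example timings are additions for illustration that do not come from the experiments.

```python
def classify(baseline_time, ht_on_time, tolerance=0.0):
    """Label a workload by how its execution time changes with HT on.

    baseline_time: execution time without Hyper-Threading (assumed baseline)
    ht_on_time:    execution time with Hyper-Threading on
    tolerance:     relative band treated as "no change" (hypothetical knob)
    """
    if ht_on_time > baseline_time * (1 + tolerance):
        return "contending"      # HT on increased execution time
    if ht_on_time < baseline_time * (1 - tolerance):
        return "complementary"   # HT on decreased execution time
    return "neutral"


# Made-up timings in seconds, only to show the call.
print(classify(10.0, 12.0))  # contending
print(classify(10.0, 8.0))   # complementary
```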

Experiments: Single thread performance

Hyper-Threading does not degrade single thread performance

Experiments (Contd.)

• Contention for single FP unit increases execution time
• Contention for data cache can lead to thrashing
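Cache thrashing under sharing can be reproduced with a very small model. The sketch below assumes a fully associative LRU cache of 8 lines and two threads that each loop over 6 private lines. The real L1 data cache is set associative and much larger, so the sizes, the replacement policy and the access pattern here are illustrative assumptions only.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Hit rate of a fully associative LRU cache with `capacity` lines."""
    cache = OrderedDict()
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used line
            cache[line] = True
    return hits / len(accesses)


def cyclic(thread_id, working_set, rounds):
    """A thread that loops over its own `working_set` lines."""
    return [(thread_id, i) for _ in range(rounds) for i in range(working_set)]


def interleave(a, b):
    """Alternate the accesses of two threads, one access each."""
    return [x for pair in zip(a, b) for x in pair]


alone = lru_hit_rate(cyclic(0, 6, 10), capacity=8)
shared = lru_hit_rate(interleave(cyclic(0, 6, 10), cyclic(1, 6, 10)), capacity=8)
print(alone)   # 0.9
print(shared)  # 0.0
```

Each working set fits in the cache on its own, but the combined 12 lines do not, and with LRU replacement every line is evicted just before it is reused.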

Experiments (Contd.)

• Integer workloads perform well: 3 integer units (2 double speed) are well utilized
• Workloads with complementary resource requirements perform well (intflt, memint)
• OS plays important role when number of programs > number of hardware contexts available

Experiments (Contd.)

Experiments (contd.)

• Execution time with 3 kernel workload is less than that for 2!
• Scheduling important!
  – intfltflt: int kernel has 100% of 1 thread, 50:50 between flt and flt
  – fltfltint: flt kernel has 100% of 1 thread, 50:50 between int and flt. Has higher execution time!

Project Goal

Model Hyper-Threading on a simulator. Vary key parameters and study first order effects

Simulator details
• Execution driven, cycle accurate simulator based on SimpleScalar toolset
• Extended the simulator to model SMT and Hyper-Threading:
  – Resource sharing by tagging thread id (I$, D$)
  – Resource replication through multiple instantiation (PC, map tables, branch history, RAS)
  – Resource partitioning by having separate instances but imposing a global limit on entries (active list, load/store buffers, IQs)
• Stop simulation after completion of all threads
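The partitioning scheme (separate instances per thread, with a global limit on entries) might be modeled along the following lines. This is a minimal sketch, not the simulator's code: the class name, the method names and the rule that the per-thread limit is the total divided by the number of active threads are assumptions. The size of 24 entries echoes the Pentium 4 store buffer count.

```python
class PartitionedQueue:
    """Per-thread queues that together may hold at most total_entries."""

    def __init__(self, total_entries, num_threads=2):
        self.total_entries = total_entries
        self.queues = {t: [] for t in range(num_threads)}
        self.active = set(range(num_threads))

    def set_active(self, threads):
        """Declare which threads are running (e.g. only one in single-thread mode)."""
        self.active = set(threads)

    def limit(self):
        """Entries each active thread may hold: an equal share of the total."""
        return self.total_entries // max(1, len(self.active))

    def occupancy(self):
        return sum(len(q) for q in self.queues.values())

    def try_insert(self, thread, entry):
        """Insert if the thread's share and the global limit both allow it."""
        if thread not in self.active:
            return False
        if len(self.queues[thread]) >= self.limit():
            return False
        if self.occupancy() >= self.total_entries:
            return False
        self.queues[thread].append(entry)
        return True


buf = PartitionedQueue(total_entries=24)
accepted = sum(buf.try_insert(0, i) for i in range(20))
print(accepted)  # 12: thread 0 is capped at half of the 24 entries
```

With both threads active, each is limited to 12 entries. After `buf.set_active([0])` the single remaining thread may use the full 24, which mirrors recombining the partitions in single-thread mode.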

Simulator details
Feature | Pentium 4 | Simulator
------- | --------- | ---------
ISA | x86 | SimpleScalar (MIPS like)
Branch misprediction pipeline | 20 stage | 20 stage (used dummy stages)
Bandwidths | Fetch=3, Dispatch=3, Issue=6 | Same
Rename map table | Replicated | Same
Architecture map table | Replicated | Same
MEM IQ and ALU IQ | Partitioned | Same
Store buffers (24) | Partitioned (12+12) | Same
Load buffers (48) | Partitioned (24+24) | Same
Unified L2 cache | 8-way set assoc., 128 Byte lines, 256 KB | No L2 cache
Fetch unit | Shared, RR.1.3 | Same (RR.2.3, ICOUNT, BRCOUNT, MISSCOUNT)
Instruction cache | Trace cache (12K micro-ops, 6 per trace line) | Shared L1 I$
Branch history register | Replicated | Same
Branch predictor table | Shared (algorithm unknown) | Shared (Gshare, 2K entries)
Program counters | Replicated | Same
Return address stack | Replicated | Same
ROB (126) | Partitioned (63+63) | Shared (126)
L1 data cache | Shared, 4-way set assoc., 64 Byte lines, 8 KB | Same
Double speed ALU/functional units | Yes | No
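The fetch policies named for the simulator (round-robin, ICOUNT, BRCOUNT, MISSCOUNT) differ only in which per-thread counter decides who fetches next. The sketch below picks a single thread per cycle and follows the usual description of these policies in the SMT literature: ICOUNT favors the thread with the fewest instructions in the front end and queues, BRCOUNT the fewest unresolved branches, MISSCOUNT the fewest outstanding data cache misses. The dictionary keys, the function name and the round-robin tie-break are assumptions, not the simulator's interface.

```python
COUNTER_FOR_POLICY = {
    "ICOUNT": "in_flight",               # instructions in decode/rename/queues
    "BRCOUNT": "unresolved_branches",    # branches not yet resolved
    "MISSCOUNT": "outstanding_misses",   # pending data cache misses
}


def select_thread(policy, cycle, stats):
    """Return the id of the thread that fetches in this cycle.

    stats maps thread id -> dict of per-thread counters (hypothetical layout).
    """
    threads = sorted(stats)
    start = cycle % len(threads)
    if policy == "RR":
        return threads[start]
    counter = COUNTER_FOR_POLICY[policy]
    # Rotate the candidate order so that ties are broken round-robin.
    order = threads[start:] + threads[:start]
    return min(order, key=lambda t: stats[t][counter])


stats = {
    0: {"in_flight": 10, "unresolved_branches": 1, "outstanding_misses": 0},
    1: {"in_flight": 4, "unresolved_branches": 3, "outstanding_misses": 2},
}
print(select_thread("ICOUNT", 0, stats))     # 1
print(select_thread("BRCOUNT", 0, stats))    # 0
print(select_thread("MISSCOUNT", 0, stats))  # 0
```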

Simulator SMT/HT validation

Experiment: Modeling L1 data cache interference

Experiment: Modeling issue queue partitioning

Experiment: Modeling total issue queue size with partitioning

Experiment: Varying Load/Store buffer sizes (Pentium4: 48 Load, 24 Store)

Experiment: Comparison of fetch policies

[1] Prof. Eric Rotenberg, Course Notes, ECE 792E Advanced Microarchitecture, Fall 2002, NC State University.
[2] Deborah T. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture,” Intel Technology Journal, 1st Qtr 2002, Vol. 6, Issue 1.
[3] Vimal Reddy, Ambarish Sule, Aravindh Anantaraman, “Hyperthreading on the Pentium 4,” ECE792E Project, Fall 2002.
[4] D. M. Tullsen et al., “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” 23rd Annual ISCA, pp. 191-202, May 1996.