Simultaneous and heterogeneous multithreading
Simultaneous and heterogeneous multithreading (SHMT) is a software framework that takes
advantage of heterogeneous computing systems that contain a mixture of central processing units (CPUs),
graphics processing units (GPUs), and special-purpose machine learning hardware, such as Tensor
Processing Units (TPUs).[1][2]
Each component processes information differently. Often data has to move among processors, which can
create bottlenecks, with one processor starving while waiting on another to finish.[1]
Architecture
The system defines virtual processors and virtual operations (VOPs). Each VOP decomposes into one or
more high-level operations (HLOPs), which the system distributes across the processors. The runtime
system dynamically maps virtual processors to physical processors, assessing resource availability in
order to keep all the processors busy. The scheduler employs a lightweight, quality-aware work-stealing
(QAWS) policy.[1]
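The following is a minimal sketch of how a quality-aware work-stealing scheduler could behave. It is not the QAWS policy from the paper. The `Hlop` and `Device` classes, the `quality_critical` and `exact` flags, and the rule that an idle device steals from the tail of the longest queue are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hlop:
    name: str
    quality_critical: bool  # hypothetical flag: needs an exact device


@dataclass
class Device:
    name: str
    exact: bool  # hypothetical flag: False for reduced-precision hardware
    queue: deque = field(default_factory=deque)

    def can_run(self, hlop: Hlop) -> bool:
        # An inexact device may only take work that tolerates lower quality.
        return self.exact or not hlop.quality_critical


def steal(thief: Device, devices: list) -> Optional[Hlop]:
    """Take one permitted HLOP from the tail of the longest other queue."""
    victims = sorted(
        (d for d in devices if d is not thief and d.queue),
        key=lambda d: len(d.queue),
        reverse=True,
    )
    for victim in victims:
        # Scan from the tail so the victim keeps consuming from its head.
        for i in range(len(victim.queue) - 1, -1, -1):
            if thief.can_run(victim.queue[i]):
                hlop = victim.queue[i]
                del victim.queue[i]
                return hlop
    return None


def run(devices: list) -> dict:
    """Simulate rounds in which each device executes at most one HLOP."""
    log = {d.name: [] for d in devices}
    while any(d.queue for d in devices):
        for d in devices:
            hlop = d.queue.popleft() if d.queue else steal(d, devices)
            if hlop is not None:
                log[d.name].append(hlop.name)
    return log
```

In this toy model an idle device never waits for a queue of its own to fill. It takes work from a busier device, except where doing so would put quality-critical work on reduced-precision hardware.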
Conventional runtimes assign one processor (or set of processors) to each subtask, leaving other types
of processors idle. In other words, the CPU(s) run the first subtask (possibly in parallel); when it
completes, the next subtask is handed to the GPU(s). When they finish, the next subtask is handed to the TPU(s).[2]
Adding software pipelining allows the second subtask to run using partial results from the first subtask,
which improves resource utilization.[2]
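The difference between running stages back to back and pipelining them can be illustrated with a small timing model. This is a sketch under simplifying assumptions: the work is split into equal chunks, each stage has a fixed per-chunk time, and data transfer costs are ignored. The stage times used are hypothetical and are not taken from the sources.

```python
def sequential_makespan(stage_times, chunks):
    """Each stage processes every chunk before the next stage starts."""
    return sum(t * chunks for t in stage_times)


def pipelined_makespan(stage_times, chunks):
    """A stage starts a chunk once the previous stage has finished it."""
    finish = [0.0] * chunks  # finish time of each chunk in the previous stage
    for t in stage_times:
        stage_free = 0.0  # time at which this stage's processor is next idle
        for c in range(chunks):
            start = max(finish[c], stage_free)
            stage_free = start + t
            finish[c] = stage_free
    return finish[-1]


# Hypothetical per-chunk times for a CPU, GPU, and TPU stage.
times = [2.0, 3.0, 1.0]
print(sequential_makespan(times, chunks=4))  # 24.0
print(pipelined_makespan(times, chunks=4))   # 15.0
```

Even when pipelined, each processor in this model works only on its own stage, so the faster stages still spend time waiting on the slowest one.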
SHMT takes things a step further, identifying subtasks that can run independently of others and assigning
them to the appropriate processor type, allowing even better parallelism. Some subtasks can be performed
on multiple processor types. SHMT can divide a single subtask across such processor types. Thus the
fundamental breakthrough is to keep more processors working simultaneously, reducing time and energy costs.[2]
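One simple way to picture dividing a single subtask across processor types is to split its items in proportion to each device's throughput. The sketch below does that. It is not SHMT's actual partitioning method, and the function names and throughput figures are hypothetical.

```python
def split_work(total_items, throughput):
    """Divide items among devices in proportion to throughput (items/s)."""
    total_rate = sum(throughput.values())
    shares = {
        dev: int(total_items * rate / total_rate)
        for dev, rate in throughput.items()
    }
    # Rounding down may leave a few items over; give them to the fastest devices.
    leftover = total_items - sum(shares.values())
    fastest_first = sorted(throughput, key=throughput.get, reverse=True)
    for dev in fastest_first[:leftover]:
        shares[dev] += 1
    return shares


def makespan(shares, throughput):
    """The subtask finishes when the slowest device finishes its share."""
    return max(shares[dev] / throughput[dev] for dev in shares)


# Hypothetical throughputs in items per second.
rates = {"cpu": 1.0, "gpu": 4.0, "tpu": 5.0}
shares = split_work(1000, rates)
print(shares)                   # {'cpu': 100, 'gpu': 400, 'tpu': 500}
print(makespan(shares, rates))  # 100.0
```

With these made-up figures the fastest device alone would need 1000 / 5 = 200 seconds, so sharing the subtask across all three halves the completion time in this idealized model.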
Benchmark
Researchers tested the concept using a typical smartphone configuration tweaked so that it resembled a
data center server.[1]
The hardware was Nvidia's Jetson Nano module containing a quad-core ARM Cortex-A57 processor
(CPU) and 128 Maxwell architecture GPU cores. A Google Edge TPU was connected via its M.2 Key E
slot. The processors communicated via an onboard PCI Express (PCIe) interface. Shared data was hosted
in 4 GB of 64-bit LPDDR4 memory. The Edge TPU adds 8 MB of device memory. Ubuntu Linux 18.04 was the
operating system.[1]
Compared to a conventional system, performance increased by 1.95×, while energy consumption
was reduced by 51%, on a range of benchmarks, including Black–Scholes, DCT8X8, DWT, FFT,
Histogram, Hotspot, Laplacian, MF, Sobel, SRAD, and GMEAN.[1]
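As a rough reading of these two figures, the arithmetic below converts the speedup into a runtime ratio and combines it with the energy ratio to estimate the average power ratio. This assumes both figures describe the same runs. Because they are aggregates over several benchmarks, the result is only indicative and is not a value reported by the sources.

```python
speedup = 1.95           # reported performance increase
energy_reduction = 0.51  # reported reduction in energy consumption

time_ratio = 1 / speedup                 # runtime relative to the baseline
energy_ratio = 1 - energy_reduction      # energy relative to the baseline
power_ratio = energy_ratio / time_ratio  # average power = energy / time

print(round(time_ratio, 3))   # 0.513
print(round(power_ratio, 3))  # 0.956
```

Under that assumption, most of the energy saving would come from finishing sooner rather than from drawing less power while running.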
See also
Asymmetric multiprocessing
Instruction-level parallelism (ILP)
Parallel computing
Simultaneous multithreading
Superscalar processor
Symmetric multiprocessing (SMP)
Variable SMP
Thread (computing)
References
1. McClure, Paul (February 22, 2024). "Software tweak doubles computer processing speed,
halves energy use" (https://newatlas.com/computers/smht-parallel-processing/). New Atlas.
Retrieved 2024-02-25.
2. Hsu, Kuan-Chieh; Tseng, Hung-Wei (2023-12-08). "Simultaneous and Heterogenous
Multithreading". 56th Annual IEEE/ACM International Symposium on Microarchitecture.
MICRO '23. New York, NY, USA: Association for Computing Machinery. pp. 137–152.
doi:10.1145/3613424.3614285 (https://doi.org/10.1145%2F3613424.3614285).
ISBN 979-8-4007-0329-4.