This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2016.2618880, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. X, MONTH 20XX

0278-0070 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
[Figure: Avg. IPC of the benchmarks mcf, bwaves, calculix, sift, and svm]

scheduling was proposed in the form of distributed algorithms. The authors in [1] and [13] introduced best-effort heuristic Multi-Agent Systems (MAS) for run-time many-core scheduling with malleable tasks. We ourselves designed a MAS in [15]
4) Reposition Task j according to its updated IC_j in the sorted queue using binary-search insertion.
5) Repeat Steps 3 and 4 until all cores are allocated.
6) Readjust the real core allocations from the last scheduling epoch to reflect the new optimal core allocations.
7) Execute the tasks with the optimal core allocations.

The proposed greedy scheduler, like all greedy algorithms, is minimalistic in its approach and is quite easy to implement. Nevertheless, its real strength comes from its ability to provide optimal results. We now proceed to prove the theoretical optimality of our algorithm.

Theorem 1. The greedy core allocations are optimal.

Proof. Let ⟨C_1, C_2, ..., C_k⟩ be the core allocations chosen by our proposed greedy algorithm. We prove our theorem by induction.

Base Case: Suppose ⟨C_1, C_2, C_x − 1, ..., C_y + 1, ..., C_k⟩ were instead the optimal core allocations, in which Task x has one less core allocated to it, that core being allocated to Task y instead. For these optimal core allocations to be better than the greedy core allocations, the following must hold:

    IC_{y+1} − IC_y > IC_x − IC_{x−1}    (2)

Equation (2) says that the benefit (in terms of increased IPC) of assigning an additional core to Task y must outweigh the loss (in terms of decreased IPC) of taking that core away from Task x under the optimal core allocations.

Our greedy algorithm does not reconsider core-allocation decisions. So the suboptimal decision of allocating an additional core to Task x when it was already allocated C_x − 1 cores happens when Task y had either C_y cores or C_y′ < C_y cores allocated. We consider both cases below.

If the suboptimal core allocation happens when Task y had C_y cores, then it implies the following relation:

    IC_x − IC_{x−1} ≥ IC_{y+1} − IC_y

The above relation contradicts Equation (2).

On the other hand, if the suboptimal core allocation happens when Task y had C_y′ < C_y cores, then

    IC_x − IC_{x−1} ≥ IC_{y′+1} − IC_{y′}

But we know from concavity, Equation (1), that

    IC_{y′+1} − IC_{y′} ≥ IC_{y+1} − IC_y
    ⟹ IC_x − IC_{x−1} ≥ IC_{y+1} − IC_y    (3)

Again a contradiction to Equation (2). Hence, the base case is optimal.

Step Case: Suppose ⟨C_1, C_2, C_x − 2, ..., C_y + 1, C_z + 1, ..., C_k⟩ were instead the optimal core allocations, in which Task x has two fewer cores allocated to it, one each allocated instead to Task y and Task z. Without loss of generality, the proof also holds if both cores from Task x were allocated exclusively to Task y or to Task z. For the above optimal core allocations to be better than the greedy core allocations, the following must hold:

    IC_{y+1} − IC_y + IC_{z+1} − IC_z > IC_x − IC_{x−2}    (4)

As argued in the base case, the suboptimal decision of allocating the first additional core to Task x when it was allocated C_x − 2 cores happens when C_y′ ≤ C_y, or C_z′ ≤ C_z, or both. Therefore, by design of the greedy algorithm, the following relations hold:

    IC_{x−1} − IC_{x−2} ≥ IC_{y′+1} − IC_{y′},  IC_{x−1} − IC_{x−2} ≥ IC_{z′+1} − IC_{z′}    (5)

Similarly, the suboptimal decision of allocating the second additional core to Task x with C_x − 1 cores already allocated happens when C_y′ ≤ C_y, or C_z′ ≤ C_z, or both. Therefore, by design, the following relations also hold:

    IC_x − IC_{x−1} ≥ IC_{y′+1} − IC_{y′},  IC_x − IC_{x−1} ≥ IC_{z′+1} − IC_{z′}    (6)

Adding Equations (5) and (6), taking the Task y′ bound from one and the Task z′ bound from the other, we get

    IC_x − IC_{x−2} ≥ IC_{y′+1} − IC_{y′} + IC_{z′+1} − IC_{z′}

But we know from concavity, Equation (1), that

    IC_{y′+1} − IC_{y′} ≥ IC_{y+1} − IC_y
    IC_{z′+1} − IC_{z′} ≥ IC_{z+1} − IC_z

Therefore,

    IC_x − IC_{x−2} ≥ IC_{y+1} − IC_y + IC_{z+1} − IC_z

The above relation contradicts Equation (4). Hence, the step case is optimal.

Assumption Case: We assume the greedy allocations remain optimal when up to n cores are removed from Task x and distributed among the remaining tasks in any combination. Mathematically, the following relationship is assumed to be true:

    IC_x − IC_{x−n} ≥ IC_{y+α_y} − IC_y + ... + IC_{z+α_z} − IC_z    (7)

where α_y + ... + α_z = n.

Induction Case: Now assume that in the optimal allocation the (n + 1)-th core is removed from Task x and, without loss of generality, given to Task y, while the previously removed n cores are distributed in the same combination as in the assumption case. For the optimal core allocations to be better than the greedy core allocations, the following must hold:

    IC_{y+α_y+1} − IC_y + ... + IC_{z+α_z} − IC_z > IC_x − IC_{x−n−1}    (8)

Since the greedy algorithm chose to allocate a core to Task x with C_x − n − 1 cores allocated instead of to Task y with C_y + α_y cores allocated, by design of the greedy algorithm the following relationship holds:

    IC_{x−n} − IC_{x−n−1} ≥ IC_{y+α_y+1} − IC_{y+α_y}

Adding Equation (7) to the above, we get

    IC_x − IC_{x−n−1} ≥ IC_{y+α_y+1} − IC_y + ... + IC_{z+α_z} − IC_z

The above relation contradicts Equation (8), proving the induction step optimal; hence proved. ∎

Complexity: Given that dynamic programming already provides the optimal solution to the problem of run-time many-core scheduling, the primary reason for developing a greedy algorithm is to reduce the scheduling overheads.
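As a concrete sketch of Steps 2-5 and an empirical sanity check of Theorem 1, the following Python snippet is our own illustration, not the authors' code: the IPC tables, function names, and starting allocation of one core per task are assumptions. It maintains the marginal-gain queue with binary-search reinsertion and compares the greedy result against an exhaustive search over small concave IPC curves.

```python
import bisect
from itertools import product

def greedy_allocate(ipc, total_cores):
    """Greedy core allocation (sketch of Steps 2-5).

    ipc[t][c] is the predicted IPC of task t on c cores (ipc[t][0] == 0),
    assumed concave in c as Equation (1) requires."""
    T = len(ipc)
    alloc = [1] * T                          # assume each task starts with one core

    def gain(t):                             # marginal IPC of one more core (IC)
        return ipc[t][alloc[t] + 1] - ipc[t][alloc[t]]

    # Step 2: one-time sort by marginal gain, best first -> O(T lg T)
    queue = sorted((-gain(t), t) for t in range(T))

    for _ in range(total_cores - T):         # Step 5: until all cores are allocated
        _, t = queue.pop(0)                  # Step 3: task with the largest gain
        alloc[t] += 1
        bisect.insort(queue, (-gain(t), t))  # Step 4: binary-search reinsertion
    return alloc

def best_allocate(ipc, total_cores):
    """Exhaustive reference: maximizes total IPC over all valid allocations."""
    T = len(ipc)
    best = max((a for a in product(range(1, total_cores + 1), repeat=T)
                if sum(a) == total_cores),
               key=lambda a: sum(ipc[t][a[t]] for t in range(T)))
    return list(best)

# Two tasks with concave (diminishing-returns) IPC curves.
ipc = [[0, 2.0, 3.0, 3.5, 3.75, 3.875],
       [0, 1.2, 1.8, 2.1, 2.25, 2.3]]
assert greedy_allocate(ipc, 4) == best_allocate(ipc, 4)  # greedy matches optimum
```

Note that `queue.pop(0)` shifts the underlying list, so a production implementation would use a heap or index pointer; the complexity analysis in the paper counts only the O(lg T) binary-search comparisons of Step 4.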
Fig. 3: Experimental Setup. [Figure: Benchmarks → Cycle-Accurate Multi-Core Simulator (gem5 [3]) → Traces (1-8 Cores) → Trace Many-Core Simulator → Result Trace (64-512 Cores)]

TABLE I: Benchmarks used for experiments.

Suite           | Benchmark Name
SPEC [10], [11] | art, astar, bwaves, bzip2, calculix, equake, gemsfdtd, gobmk, h264ref, hmmer, lbm, mcf, namd, omnetpp, perlbench, povray, sjeng, tonto, twolf, vortex
SD-VBS [19]     | disparity, mser, sift, svm, texture, tracking
PARSEC [2]      | blackscholes, fluidanimate, swaptions, streamcluster
SPLASH [20]     | cholesky, fmm, lu, radix, radiosity, water-sp

Fig. 4: Performance comparison between schedulers employing dynamic programming and greedy algorithm on a closed 256-core many-core system under different size workloads. [Figure: Throughput [IPC] vs. Number of Tasks (64, 128, 256, 512); bar pairs (Dynamic Programming, Greedy): (394.39, 394.09), (332.14, 331.47), (234.52, 231.6), (129.11, 129.05)]
[Figure: Throughput [IPC]; bar pairs (Dynamic Programming, Greedy): (499.47, 494.3), (332.14, 331.47), (197.19, 197.05), (119.42, 119.19)]

In the proposed algorithm, the one-time sorting in Step 2 has a processing overhead of O(T lg T). Additionally, the binary search in Step 4 has a processing overhead of O(lg T). Since Step 4 is repeated C times, the total processing overhead of
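Summing the two costs just identified (our own arithmetic, since the surrounding sentence is truncated in this copy; T denotes the number of tasks and C the number of cores, as above):

```latex
\underbrace{O(T \lg T)}_{\text{Step 2, once}} \;+\; \underbrace{C \cdot O(\lg T)}_{\text{Step 4, repeated } C \text{ times}} \;=\; O\big((T + C)\lg T\big)
```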
TABLE II: Comparison of the time taken by the greedy scheduler and the dynamic programming based scheduler to solve run-time many-core scheduling problems of different sizes.

Cores | Tasks | Dynamic Programming (ms) | Greedy (ms) | Improvement
64    | 32    | 7.985                    | 0.061       | 130.90x
64    | 64    | 16.203                   | 0.077       | 210.42x
64    | 128   | 32.910                   | 0.108       | 304.72x
128   | 64    | 71.005                   | 0.113       | 628.36x
128   | 128   | 144.214                  | 0.144       | 1,001.48x
128   | 256   | 285.617                  | 0.223       | 1,280.79x
256   | 128   | 586.276                  | 0.250       | 2,345.11x
256   | 256   | 1,163.108                | 0.318       | 3,657.57x
256   | 512   | 2,314.798                | 0.555       | 4,170.80x
512   | 256   | 4,553.491                | 0.635       | 7,170.85x
512   | 512   | 9,056.819                | 0.866       | 10,458.22x
512   | 1024  | 18,031.729               | 1.578       | 11,426.95x

…representative varisized inputs. We then recorded in Table II the simulated system-time it took both schedulers to reach a solution, averaged over a scheduling epoch.

Empirical evaluations show that the dynamic programming based scheduler requires 7.985 ms to solve the run-time many-core scheduling problem for a 64-core system running 32 tasks. For the 10 ms scheduling epoch at which a multi-core OS operates, this results in an impractical overhead of 79.85%. The greedy scheduler, in comparison, has an overhead of only 0.61% for the same problem size.

Furthermore, the overhead of the dynamic programming based scheduler grows rapidly with both the number of cores and the number of tasks executing on them. The overhead of the greedy scheduler, on the other hand, grows much more slowly. Even on a 512-core system running 1024 tasks, the overhead of the greedy algorithm stands at a somewhat acceptable 15.78%. To reduce the overhead further, a many-core OS will need to operate with a much longer scheduling epoch.

It is important to note that, besides the direct reduction in processing overhead, a reduction in space overhead also plays a critical role in reducing the total problem-solving time. When operating with large data structures, a scheduler is forced to perform page swaps with main memory once the last-level cache is saturated. Since main-memory accesses have several times the latency of cache accesses, performance suffers enormously. Given that the space overhead of our greedy scheduler is much smaller than that of the dynamic programming based scheduler, cache saturation occurs in the former only at much larger problem sizes than in the latter.

V. CONCLUSION

In this work, we proposed a theoretically optimal greedy algorithm for the problem of run-time many-core scheduling. Given the large optimization search-space, the problem requires light-weight algorithms that can be applied online. A dynamic programming based scheduler can solve the problem optimally but has high scheduling overheads associated with it. As an alternative, we proposed a scheduler based on a greedy algorithm. The proposed greedy scheduler exploits the concavity in IPC extraction inherent in many-core workloads to maintain theoretical optimality. In practice, it requires 10,000x less time than a dynamic programming based scheduler to reach a solution, while providing near-optimal performance.

ACKNOWLEDGMENT

This work was supported in parts by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre "Invasive Computing" (SFB/TR 89) and in parts by Huawei International Pte. Ltd. in Singapore.

REFERENCES

[1] I. Anagnostopoulos, V. Tsoutsouras, A. Bartzas, and D. Soudris. Distributed Run-time Resource Management for Malleable Applications on Many-core Platforms. In Design Automation Conference (DAC), ACM.
[2] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Parallel Architectures and Compilation Techniques (PACT), 2008.
[3] N. Binkert et al. The gem5 Simulator. SIGARCH Computer Architecture News, 2011.
[4] J. Blazewicz, M. Machowiak, J. Węglarz, M. Y. Kovalyov, and D. Trystram. Scheduling Malleable Tasks on Parallel Processors to Minimize the Makespan. Annals of Operations Research, 2004.
[5] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, M. F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y.-h. Dai, et al. Corey: An Operating System for Many Cores. In Operating Systems Design and Implementation (OSDI), 2008.
[6] S. Buchwald, M. Mohr, and A. Zwinkau. Malleable Invasive Applications. In Software Engineering, 2015.
[7] D. G. Feitelson and L. Rudolph. Metrics and Benchmarking for Parallel Job Scheduling. In Job Scheduling Strategies for Parallel Processing, 1998.
[8] D. P. Gulati, C. Kim, S. Sethumadhavan, S. W. Keckler, and D. Burger. Multitasking Workload Scheduling on Flexible-core Chip Multiprocessors. In International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
[9] J. Henkel, A. Herkersdorf, L. Bauer, T. Wild, M. Hübner, R. K. Pujari, A. Grudnitsky, J. Heisswolf, A. Zaib, B. Vogel, V. Lari, and S. Kobbe. Invasive Manycore Architectures. In Asia and South Pacific Design Automation Conference (ASP-DAC), 2012.
[10] J. L. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium. Computer, 2000.
[11] J. L. Henning. SPEC CPU2006 Benchmark Descriptions. SIGARCH Computer Architecture News, 2006.
[12] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core Fusion: Accommodating Software Diversity in Chip Multiprocessors. In ACM SIGARCH Computer Architecture News, 2007.
[13] S. Kobbe, L. Bauer, D. Lohmann, W. Schröder-Preikschat, and J. Henkel. DistRM: Distributed Resource Management for On-chip Many-core Systems. In Conference on Hardware/Software Codesign and System Synthesis (CODES), 2011.
[14] V. Pallipadi and A. Starikovskiy. The Ondemand Governor. In The Linux Symposium, 2006.
[15] A. Pathania, V. Venkataramani, M. Shafique, T. Mitra, and J. Henkel. Distributed Scheduling for Many-Cores Using Cooperative Game Theory. In Design Automation Conference (DAC), 2016.
[16] M. Pricopi and T. Mitra. Bahurupi: A Polymorphic Heterogeneous Multi-core Architecture. Transactions on Architecture and Code Optimization (TACO), 2012.
[17] A. K. Singh, M. Shafique, A. Kumar, and J. Henkel. Mapping on Multi/Many-core Systems: Survey of Current and Emerging Trends. In Design Automation Conference (DAC), 2013.
[18] V. Vanchinathan. Performance Modeling of Adaptive Multi-core Architecture. http://www.comp.nus.edu.sg/~vvenkata/publications/thesis.pdf, 2015.
[19] S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B. Taylor. SD-VBS: The San Diego Vision Benchmark Suite. In International Symposium on Workload Characterization (IISWC), 2009.
[20] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In SIGARCH Computer Architecture News, 1995.