
Optimal Greedy Algorithm for Many-Core Scheduling

Anuj Pathania, Vanchinathan Venkatramani, Muhammad Shafique, Tulika Mitra, Jörg Henkel

Abstract—In this work, we propose an optimal greedy algorithm for the problem of run-time many-core scheduling. The previously best known centralized optimal algorithm proposed for the problem is based on dynamic programming. A dynamic programming based scheduler has high overheads which grow fast with an increase in both the number of cores in the many-core and the number of tasks executing on them. We show in this work that the inherent concavity of the extractable Instructions Per Cycle (IPC) in tasks with an increase in the number of allocated cores allows for an alternative greedy algorithm. The proposed algorithm significantly reduces the run-time scheduling overheads while maintaining optimality. In practice, it reduces the problem solving time by 10,000x while providing near-optimal solutions.

Index Terms—Scheduling, Throughput Maximization, Many-Cores, Greedy Algorithm.

I. INTRODUCTION

Many-cores are hundred-core (or even thousand-core) processors designed to execute several multi-threaded tasks in parallel [9]. One of the most crucial scheduling decisions to be made on a many-core is the distribution of the limited processing cores amongst the executing tasks in such a manner that the processor always operates at its peak performance. In this work, we use adaptive many-cores [12], [16], which allow acceleration not just of multi-threaded tasks by exploitation of Thread Level Parallelism (TLP) but also of single-threaded tasks by exploitation of Instruction Level Parallelism (ILP).

Fig. 1: Observed variance in 2-core speedup for different benchmarks sampled at every ten million instructions over their entire execution.

To reduce context-switching overheads, many-cores prefer to operate with a one-thread-per-core model [5], keeping the problem of many-core scheduling mathematically discrete. A task goes through different phases during its execution. These phases result in a high variance in the speedup a task can derive from a many-core, as shown in Figure 1. The N-core speedup of a task is defined as the ratio of its Instructions Per Cycle (IPC) when assigned N cores to its IPC when assigned only a single core.
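Written out explicitly (our restatement of the definition above, with $\mathit{IPC}_i(N)$ denoting the IPC of Task $i$ on $N$ cores), the $N$-core speedup is

\[
\mathrm{Speedup}_i(N) \;=\; \frac{\mathit{IPC}_i(N)}{\mathit{IPC}_i(1)},
\]

so a value of $N$ corresponds to perfectly linear scaling, and the occasional super-linear samples visible in Figure 1 are points where this ratio exceeds $N$ (i.e., exceeds 2 for the 2-core measurements shown there).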
Malleable tasks [6] by design allow their core allocations to increase or decrease in size at any time during their execution with negligible performance penalties. When operating with malleable tasks, schedulers can exploit their speedup variance by performing run-time scheduling (core redistributions) to achieve maximum throughput (total IPC) from a many-core at all times [15].

The problem of run-time scheduling on a many-core (throughput maximization) can be solved optimally in polynomial time using dynamic programming [8]. A scheduler based on dynamic programming has high overheads which grow fast as both the number of cores in a many-core and the number of tasks they execute in parallel increase. Given its overheads and the frequent invocations required, dynamic programming is not suitable for scheduling at run-time. Therefore, a search for alternative low-overhead algorithms for run-time many-core scheduling is mandated. In the past, researchers have resorted to proposing heuristics with no guarantees on the obtained results [17]. We instead propose a low-overhead solution in this work, which preserves the theoretical optimality of the results.

A. Pathania, M. Shafique, and J. Henkel are with the Chair of Embedded Systems, Karlsruhe Institute of Technology (KIT). V. Venkatramani and T. Mitra are with the School of Computing, National University of Singapore (NUS). Corresponding Author: anuj.pathania@kit.edu.
No part of this work has been previously published. This work has been submitted as a short/brief paper (5 pages) for review.
Manuscript received Month XX, 20XX; revised Month XX, 20XX; accepted Month XX, 20XX; published Month XX, 20XX.


Fig. 2: Observed average IPC of different benchmarks (mcf, bwaves, calculix, sift, svm) when allocated different numbers of cores (1, 2, 4, and 8).

This search is aided by the fact that the extractable IPC of tasks executing on a many-core is, in general, concave with an increase in the number of allocated cores, as shown in Figure 2. The observed concavity is due to the saturation of exploitable ILP/TLP in the tasks.

Our Novel Contributions: In this work, we present a greedy algorithm for the problem of run-time many-core scheduling based on the inherent concavity of the extractable IPC of executing tasks. We show that a scheduler based on the proposed greedy algorithm significantly reduces the processing and space overheads in comparison to a scheduler based on dynamic programming, without sacrificing theoretical optimality. In practice, where the concavity property does not always hold, the algorithm still produces near-optimal solutions.

II. RELATED WORK

Multi-/many-core scheduling is a well-studied research problem [17]. However, the major focus of past performance-oriented schedulers has been on design-time makespan minimization rather than run-time throughput maximization.

Makespan minimization involves determining a fixed schedule that finishes a given workload on a multi-/many-core processor in the shortest time [7]. Makespan minimizing schedulers often assume that all information about the workload, such as the execution times and arrival times of the constituent tasks, is known beforehand. They also often assume that executing tasks lack any phases and thereby always progress at a fixed rate. For malleable tasks with concave speedups/IPCs, the authors of [4] proposed a scheduler that can optimally minimize makespan with O(T max{C, T log2 C}) processing overhead, where T is the number of tasks and C is the number of processing cores.

Throughput maximization, on the other hand, involves determining a schedule at discrete intervals for a workload where often neither the arrival time nor the departure time of its constituent tasks is known to the scheduler [7]. A run-time scheduler running unprofiled tasks can rely only on information available from online observations or on predictions made from them.

The authors of [8] proposed a dynamic programming based algorithm with O(CT^2) processing overhead for optimal throughput maximization. Given the high overhead of obtaining optimal solutions through dynamic programming, alternative approaches to reduce the overhead of run-time many-core scheduling were proposed in the form of distributed algorithms. The authors of [1] and [13] introduced best-effort heuristic Multi-Agent Systems (MAS) for run-time many-core scheduling with malleable tasks. We ourselves designed a MAS in [15] that solved the problem optimally using a distributed consensus algorithm based on cooperative game theory. Though promising in reducing the per-core processing overhead by distributing the scheduling calculations across multiple or all cores, this reduction is accompanied by a concurrent increase in the communication overhead.

In this work, we instead improve upon the best-known centralized algorithm for throughput maximization [8] by introducing another optimal centralized algorithm that decreases both the processing and the space overhead without changing the communication overhead. When operating with malleable tasks, the proposed algorithm brings the efficiency of throughput maximizing schedulers to the same level as that of makespan minimizing schedulers.

III. GREEDY ALGORITHM FOR RUN-TIME MANY-CORE SCHEDULING

System Model: We begin by presenting the notation used in this work to model a many-core system. Let $T$ and $C$ be the number of tasks and cores in the system, respectively. Let $C_i$ be the number of cores currently allocated to Task $i$, and let $I_{C_i}$ denote the IPC of Task $i$ when allocated $C_i$ cores.

Using the above notation, the problem of throughput maximization on a many-core can be stated mathematically as follows:

\[
\text{Maximize } \sum_{i=1}^{T} I_{C_i}, \quad \text{under constraint } \sum_{i=1}^{T} |C_i| \le C
\]

In the context of run-time many-core scheduling, the above equation needs to be solved by the scheduler in nearly every scheduling epoch. The scheduling epoch is the interval at which a scheduler is invoked by an Operating System (OS). In this work, its value is assumed to be 10 ms, the same as the default scheduling epoch used by the Linux kernel [14].

Furthermore, to operate, a scheduler also needs to know the IPC $I_{C_i'}$ of Task $i$ for a possible core allocation $C_i'$, given its currently observable IPC $I_{C_i}$, when $C_i' \neq C_i$. For this, we employ the regression-based performance-prediction models for adaptive many-cores developed in [18].

The concavity in IPC extraction shown in Figure 2 is mathematically captured by the following equation:

\[
I_{C_i'+n} - I_{C_i'} \;\ge\; I_{C_i+n} - I_{C_i} \quad \text{if } C_i' \le C_i \tag{1}
\]
Greedy Algorithm: We now propose a greedy algorithm for the problem of run-time many-core scheduling. The proposed algorithm is composed of the following sequential steps, performed by the greedy scheduler before every scheduling epoch (a code sketch of these steps follows the list):

1) Assume $C_i = 0$ for every Task $i$.
2) Sort all tasks in descending order of the marginal gain $[I_{C_i+1} - I_{C_i}]$ and store them in a queue.
3) Virtually assign a core to Task $j$ at the front of the queue and update the corresponding $I_{C_j}$ using the performance-prediction models from [18].

4) Reposition Task $j$ in the sorted queue according to its updated $I_{C_j}$ using binary-search insertion.
5) Repeat Step 3 and Step 4 until all cores are allocated.
6) Readjust the real core allocations from the last scheduling epoch to reflect the new optimal core allocations.
7) Execute the tasks with the optimal core allocations.
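The sketch below (our own illustrative rendering under the stated assumptions, not the production scheduler) implements Steps 1-5. The function predict_ipc(i, c) is a hypothetical stand-in for the regression-based performance-prediction models of [18], and a max-heap keyed on the marginal IPC gain $I_{C_i+1} - I_{C_i}$ replaces the explicitly sorted queue with binary-search insertion; either structure keeps the per-epoch cost within the O(max{C, T} lg T) bound derived later under Complexity.

```python
import heapq

def greedy_allocate(num_tasks, num_cores, predict_ipc):
    """Steps 1-5 of the greedy scheduler (illustrative sketch).

    predict_ipc(i, c) -> predicted IPC of Task i when allocated c cores
    (hypothetical stand-in for the regression models of [18]).
    Returns the per-task core allocations C_i.
    """
    alloc = [0] * num_tasks                       # Step 1: C_i = 0 for every task
    # Max-heap (negated keys) ordered by the marginal gain of the next core.
    heap = [(-predict_ipc(i, 1), i) for i in range(num_tasks)]
    heapq.heapify(heap)                           # Step 2: order tasks by marginal gain
    for _ in range(num_cores):                    # Step 5: repeat until no cores remain
        _neg_gain, i = heapq.heappop(heap)
        alloc[i] += 1                             # Step 3: virtually assign one core
        gain = predict_ipc(i, alloc[i] + 1) - predict_ipc(i, alloc[i])
        heapq.heappush(heap, (-gain, i))          # Step 4: reinsert with updated gain
    return alloc

# Usage with a toy concave IPC model (purely illustrative numbers):
if __name__ == "__main__":
    base_ipc = [0.8, 1.2, 1.0, 0.6]                    # hypothetical 1-core IPCs
    toy_model = lambda i, c: base_ipc[i] * (c ** 0.5)  # concave in the core count
    print(greedy_allocate(num_tasks=4, num_cores=8, predict_ipc=toy_model))
```

Steps 6 and 7 then amount to diffing the returned allocation against the allocation of the previous epoch and migrating or resuming tasks accordingly.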
The proposed greedy scheduler, like all greedy algorithms, is minimalistic in its approach and quite easy to implement. Nevertheless, its real strength comes from its ability to provide optimal results. We now proceed to prove the theoretical optimality of our algorithm.

Theorem 1. The greedy core allocations are optimal.

Proof. Let $\langle C_1, C_2, ..., C_k \rangle$ be the core allocations chosen by our proposed greedy algorithm. We prove our theorem by induction.

Base Case: Suppose $\langle C_1, C_2, C_x - 1, ..., C_y + 1, ..., C_k \rangle$ were instead the optimal core allocations, in which Task $x$ has one less core allocated to it, that core instead being allocated to Task $y$. For the optimal core allocations to be better than the greedy core allocations, the following must hold:

\[
I_{C_y+1} - I_{C_y} > I_{C_x} - I_{C_x-1} \tag{2}
\]

The above equation says that the benefit (in terms of increased IPC) of assigning an additional core to Task $y$ must outweigh the loss (in terms of decreased IPC) of taking that core away from Task $x$ under the optimal core allocations.

Our greedy algorithm does not reconsider core allocation decisions. So, the supposedly suboptimal decision of allocating an additional core to Task $x$ when it already had $C_x - 1$ cores happened when Task $y$ had either $C_y$ cores or $C_y' < C_y$ cores allocated to it. We consider both cases below.

If the suboptimal core allocation happened when Task $y$ had $C_y$ cores, then it implies the following relation:

\[
I_{C_x} - I_{C_x-1} \ge I_{C_y+1} - I_{C_y}
\]

The above equation contradicts Equation (2).

On the other hand, if the suboptimal core allocation happened when Task $y$ had $C_y' < C_y$ cores, then

\[
I_{C_x} - I_{C_x-1} \ge I_{C_y'+1} - I_{C_y'}
\]

But we know from the concavity Equation (1) that

\[
I_{C_y'+1} - I_{C_y'} \ge I_{C_y+1} - I_{C_y}
\implies I_{C_x} - I_{C_x-1} \ge I_{C_y+1} - I_{C_y} \tag{3}
\]

Again a contradiction to Equation (2). Hence, we prove our base case to be optimal.

Step Case: Suppose $\langle C_1, C_2, C_x - 2, ..., C_y + 1, C_z + 1, ..., C_k \rangle$ were instead the optimal core allocations, in which Task $x$ has two fewer cores allocated to it, one each being allocated instead to Task $y$ and Task $z$. Without loss of generality, the proof also holds if both cores from Task $x$ were allocated exclusively to Task $y$ or Task $z$ instead. For the above optimal core allocations to be better than the greedy core allocations, the following must hold:

\[
I_{C_y+1} - I_{C_y} + I_{C_z+1} - I_{C_z} > I_{C_x} - I_{C_x-2} \tag{4}
\]

As argued in the base case, the suboptimal decision of allocating the first additional core to Task $x$ when it had $C_x - 2$ cores happened when either $C_y' \le C_y$ or $C_z' \le C_z$, or both. Therefore, by design of the greedy algorithm, the following relations hold:

\[
I_{C_x-1} - I_{C_x-2} \ge I_{C_y'+1} - I_{C_y'}, \qquad
I_{C_x-1} - I_{C_x-2} \ge I_{C_z'+1} - I_{C_z'} \tag{5}
\]

Similarly, the suboptimal decision of allocating the second additional core to Task $x$ with $C_x - 1$ cores already allocated happened when either $C_y' \le C_y$ or $C_z' \le C_z$, or both. Therefore, by design, the following relations also hold:

\[
I_{C_x} - I_{C_x-1} \ge I_{C_y'+1} - I_{C_y'}, \qquad
I_{C_x} - I_{C_x-1} \ge I_{C_z'+1} - I_{C_z'} \tag{6}
\]

Adding Equations (5) and (6), we get

\[
I_{C_x} - I_{C_x-2} \ge I_{C_y'+1} - I_{C_y'} + I_{C_z'+1} - I_{C_z'}
\]

But we know from the concavity Equation (1) that

\[
I_{C_y'+1} - I_{C_y'} \ge I_{C_y+1} - I_{C_y}, \qquad
I_{C_z'+1} - I_{C_z'} \ge I_{C_z+1} - I_{C_z}
\]

Therefore,

\[
I_{C_x} - I_{C_x-2} \ge I_{C_y+1} - I_{C_y} + I_{C_z+1} - I_{C_z}
\]

The above equation contradicts Equation (4). Hence, we prove our step case to be optimal.

Assumption Case: We assume the greedy allocations remain optimal when up to $n$ cores are removed from Task $x$ and distributed among the remaining tasks in any combination. Mathematically, the following relationship is assumed to be true:

\[
I_{C_x} - I_{C_x-n} \ge I_{C_y+\alpha_y} - I_{C_y} + \dots + I_{C_z+\alpha_z} - I_{C_z} \tag{7}
\]

where $\alpha_y + \dots + \alpha_z = n$.

Induction Case: Now assume that in the optimal allocation the $(n+1)$-th core is removed from Task $x$ and, without loss of generality, given to Task $y$, while the previously removed $n$ cores are distributed in the same combination as in the assumption case. For the optimal core allocations to be better than the greedy core allocations, the following must hold:

\[
I_{C_y+\alpha_y+1} - I_{C_y} + \dots + I_{C_z+\alpha_z} - I_{C_z} \ge I_{C_x} - I_{C_x-n-1} \tag{8}
\]

Since the greedy algorithm chose to allocate a core to Task $x$ with $C_x - n - 1$ cores allocated instead of to Task $y$ with $C_y + \alpha_y$ cores allocated, by the design of the greedy algorithm the following relationship holds:

\[
I_{C_x-n} - I_{C_x-n-1} \ge I_{C_y+\alpha_y+1} - I_{C_y+\alpha_y}
\]

Adding Equation (7) to the above equation, we get

\[
I_{C_x} - I_{C_x-n-1} \ge I_{C_y+\alpha_y+1} - I_{C_y} + \dots + I_{C_z+\alpha_z} - I_{C_z}
\]

The above equation contradicts Equation (8), proving our induction step to be optimal; hence proved.

Complexity: Given that dynamic programming already provides the optimal solution for the problem of run-time many-core scheduling, the primary reason for developing a greedy algorithm is to reduce the scheduling overheads.

These reductions directly translate into performance improvements in real-world systems.

In the proposed algorithm, the one-time sorting in Step 2 has a processing overhead of O(T lg T). Additionally, the binary-search insertion in Step 4 has a processing overhead of O(lg T). Since Step 4 is repeated C times, the total processing overhead of our algorithm is O(T lg T + C lg T), or O(max{C, T} lg T). This overhead is substantially less than the O(CT^2) processing overhead of dynamic programming.

Furthermore, the proposed algorithm requires the maintenance of only one queue data structure, with a space overhead of O(T). On the other hand, dynamic programming has a significantly higher space overhead of O(CT).
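As a rough back-of-envelope illustration of this asymptotic gap (our own arithmetic, ignoring constant factors and not a measurement), consider C = 512 cores and T = 1024 tasks:

\[
\max\{C, T\}\,\lg T = 1024 \times 10 = 10{,}240
\qquad\text{versus}\qquad
C\,T^2 = 512 \times 1024^2 \approx 5.4 \times 10^{8},
\]

a gap of more than four orders of magnitude, which is broadly consistent in order of magnitude with the empirical improvements reported later in the evaluation.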
IV. EXPERIMENTAL EVALUATIONS

Setup: We use the two-stage adaptive many-core simulator shown in Figure 3 for our proof-of-concept empirical evaluations. The first stage is based on the gem5 [3] cycle-accurate simulator with Bahurupi [16] adaptive cores implementing the ARM v7 ISA and running at 1 GHz. Cores have a 2-way out-of-order pipeline and separate 4-way associative L1 instruction and data caches of 64 KB each. An 8-way associative 2 MB unified L2 cache is shared by all cores. Unfortunately, cycle-accurate many-core simulations are not feasible time-wise, and hence our first stage is limited to a maximum of eight cores. To bypass this limitation, we take isolated execution traces of the different tasks from the cycle-accurate simulator with up to eight cores allocated and extrapolate them using an in-house trace simulator for many-core simulations. Our extrapolated trace simulations lack the modeling depth of cycle-accurate simulations, but we believe them to be sufficient as an initial testbed for many-core algorithms.

Fig. 3: Experimental Setup. Benchmark traces (1-8 cores) are collected on a cycle-accurate multi-core simulator (gem5 [3]) and fed to a trace-based many-core simulator, which produces the result trace (64-512 cores).

On the software side, we took 36 benchmarks from the SPEC [10], [11], SD-VBS [19], PARSEC [2], and SPLASH [20] benchmark suites, as listed in Table I. All benchmarks were executed in Syscall Emulation (SE) mode. SPEC, SD-VBS, and PARSEC/SPLASH benchmarks were executed with the "ref", "full-hd", and "sim-small" inputs, respectively.

TABLE I: Benchmarks used for experiments.
Suite            Benchmark Name
SPEC [10], [11]  art, astar, bwaves, bzip2, calculix, equake, gemsfdtd, gobmk, h264ref, hmmer, lbm, mcf, namd, omnetpp, perlbench, povray, sjeng, tonto, twolf, vortex
SD-VBS [19]      disparity, mser, sift, svm, texture, tracking
PARSEC [2]       blackscholes, fluidanimate, swaptions, streamcluster
SPLASH [20]      cholesky, fmm, lu, radix, radiosity, water-sp

Performance Evaluation: We chose to evaluate our algorithm on a closed many-core system. In a closed system, the workload is fixed at the start, and the constituent tasks of the workload restart execution immediately after completion. Our workload comprises a random mix of all the available benchmarks with uniform distribution. Throughput, measured in terms of the average total Instructions Per Cycle (IPC) of the system over two billion cycles, is selected as the metric for measuring performance.

Figure 4 shows the performance of a closed many-core system with 256 cores under different-sized workloads when operated with a dynamic programming based and a greedy algorithm based scheduler. In our experiments, we observed that the performance under the greedy scheduler can differ from that under the dynamic programming based scheduler by up to 1.24%, even though both are proven theoretically optimal.

Fig. 4: Performance comparison between schedulers employing dynamic programming and the greedy algorithm on a closed 256-core many-core system under different-sized workloads (64 to 512 tasks).

This discrepancy is attributed to the fact that the greedy algorithm requires IPC concavity to be observed by all benchmarks at all times, while no such requirement is mandated by dynamic programming. In practice, IPC concavity is exhibited by all benchmarks most of the time, but there are a few rare exceptions. For example, benchmarks like h264ref and sift sometimes exhibit super-linearity, as shown in Figure 1, where a speedup of more than two can be observed on an allocation of only two cores. Hence, a slight degradation in performance is observed when the concavity assumption is violated.

Figure 5 shows the performance of closed many-core systems of various sizes under the different schedulers when a given workload is executed on them. We observed that the greedy scheduler results in near-optimal solutions in all cases, similar to the observations made in Figure 4.

Fig. 5: Performance comparison between schedulers employing dynamic programming and the greedy algorithm for a fixed 256-task workload on closed many-core systems of various sizes (64 to 512 cores).

Scalability: To get a measure of the real-world benefits of a greedy scheduler over a dynamic programming based scheduler, we ran both schedulers cycle-accurately on gem5 within a single core of our simulated many-core with representative inputs of various sizes.

We then recorded in Table II the simulated system time it took each scheduler, on average, to reach a solution within a scheduling epoch.

TABLE II: Comparison of the time taken by the greedy scheduler and the dynamic programming based scheduler to solve run-time many-core scheduling problems of different sizes.
Cores  Tasks  Dynamic Programming (ms)  Greedy (ms)  Improvement
64     32     7.985                     0.061        130.90x
64     64     16.203                    0.077        210.42x
64     128    32.910                    0.108        304.72x
128    64     71.005                    0.113        628.36x
128    128    144.214                   0.144        1,001.48x
128    256    285.617                   0.223        1,280.79x
256    128    586.276                   0.250        2,345.11x
256    256    1,163.108                 0.318        3,657.57x
256    512    2,314.798                 0.555        4,170.80x
512    256    4,553.491                 0.635        7,170.85x
512    512    9,056.819                 0.866        10,458.22x
512    1024   18,031.729                1.578        11,426.95x

Empirical evaluations show that the dynamic programming based scheduler requires 7.985 ms to solve the problem of run-time many-core scheduling for a 64-core many-core running 32 tasks. For the 10 ms scheduling epoch at which a multi-core OS operates, this results in an impractical overhead of 79.85%. The greedy scheduler, in comparison, has an overhead of only 0.61% for the same problem size.

Furthermore, the overhead of the dynamic programming based scheduler grows rapidly with increases in both the number of cores and the number of tasks executing on them. The overhead of the greedy scheduler, on the other hand, grows much more slowly. Even on a 512-core many-core running 1024 tasks, the overhead of the greedy algorithm stands at a somewhat acceptable 15.78%. To reduce the overhead further, it seems a many-core OS would need to operate with a much larger scheduling epoch.

It is important to note that, besides the direct reduction in processing overhead, the reduction in space overhead also plays a critical role in reducing the total problem solving time. When operating with large data structures, a scheduler is forced to perform page swaps with main memory when the last-level cache is saturated. Since main memory accesses have several times the latency of cache accesses, performance suffers enormously. Given that the space overhead of our greedy scheduler is much smaller than that of the dynamic programming based scheduler, cache saturation occurs in the former only for problems of much larger size than in the latter.

V. CONCLUSION

In this work, we proposed a theoretically optimal greedy algorithm for the problem of run-time many-core scheduling. Given the large optimization search space, the problem requires light-weight algorithms that can be applied online. A dynamic programming based scheduler can solve the problem optimally but has high scheduling overheads associated with it. As an alternative, we proposed a scheduler based on a greedy algorithm in this work. The proposed greedy scheduler exploits the concavity in IPC extraction inherent in many-core workloads to maintain theoretical optimality. In practice, it requires 10,000x less time than a dynamic programming based scheduler to reach a solution, while providing near-optimal performance.

ACKNOWLEDGMENT

This work was supported in parts by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre "Invasive Computing" (SFB/TR 89) and in parts by Huawei International Pte. Ltd. in Singapore.

REFERENCES

[1] I. Anagnostopoulos, V. Tsoutsouras, A. Bartzas, and D. Soudris. Distributed Run-time Resource Management for Malleable Applications on Many-core Platforms. In Design Automation Conference (DAC). ACM.
[2] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Parallel Architectures and Compilation Techniques (PACT), 2008.
[3] N. Binkert et al. The gem5 Simulator. In SIGARCH Computer Architecture News, 2011.
[4] J. Blazewicz, M. Machowiak, J. Węglarz, M. Y. Kovalyov, and D. Trystram. Scheduling Malleable Tasks on Parallel Processors to Minimize the Makespan. Annals of Operations Research, 2004.
[5] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, M. F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y.-h. Dai, et al. Corey: An Operating System for Many Cores. In Operating Systems Design and Implementation (OSDI), 2008.
[6] S. Buchwald, M. Mohr, and A. Zwinkau. Malleable Invasive Applications. In Software Engineering, 2015.
[7] D. G. Feitelson and L. Rudolph. Metrics and Benchmarking for Parallel Job Scheduling. In Job Scheduling Strategies for Parallel Processing, 1998.
[8] D. P. Gulati, C. Kim, S. Sethumadhavan, S. W. Keckler, and D. Burger. Multitasking Workload Scheduling on Flexible-core Chip Multiprocessors. In International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
[9] J. Henkel, A. Herkersdorf, L. Bauer, T. Wild, M. Hübner, R. K. Pujari, A. Grudnitsky, J. Heisswolf, A. Zaib, B. Vogel, V. Lari, and S. Kobbe. Invasive Manycore Architectures. In Asia and South Pacific Design Automation Conference (ASP-DAC), 2012.
[10] J. L. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium. Computer, 2000.
[11] J. L. Henning. SPEC CPU2006 Benchmark Descriptions. SIGARCH Computer Architecture News, 2006.
[12] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core Fusion: Accommodating Software Diversity in Chip Multiprocessors. In ACM SIGARCH Computer Architecture News, 2007.
[13] S. Kobbe, L. Bauer, D. Lohmann, W. Schröder-Preikschat, and J. Henkel. DistRM: Distributed Resource Management for On-chip Many-core Systems. In Conference on Hardware/Software Codesign and System Synthesis (CODES), 2011.
[14] V. Pallipadi and A. Starikovskiy. The Ondemand Governor. In The Linux Symposium, 2006.
[15] A. Pathania, V. Venkataramani, M. Shafique, T. Mitra, and J. Henkel. Distributed Scheduling for Many-Cores Using Cooperative Game Theory. In Design Automation Conference (DAC), 2016.
[16] M. Pricopi and T. Mitra. Bahurupi: A Polymorphic Heterogeneous Multi-core Architecture. Transactions on Architecture and Code Optimization (TACO), 2012.
[17] A. K. Singh, M. Shafique, A. Kumar, and J. Henkel. Mapping on Multi/Many-core Systems: Survey of Current and Emerging Trends. In Design Automation Conference (DAC), 2013.
[18] V. Vanchinathan. Performance Modeling of Adaptive Multi-core Architecture. http://www.comp.nus.edu.sg/~vvenkata/publications/thesis.pdf, 2015.
[19] S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B. Taylor. SD-VBS: The San Diego Vision Benchmark Suite. In International Symposium on Workload Characterization (IISWC), 2009.
[20] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In SIGARCH Computer Architecture News, 1995.

