
Scheduling
Keshav Pingali
University of Texas, Austin

Earlier lectures: how parallelism and locality arise in programs, and the ordering constraints between tasks needed for correctness or efficiency.

This lecture: how do we assign tasks to workers?
- multicore: workers might be cores
- distributed-memory machines: workers might be hosts/machines

A rich literature exists for dependence graph scheduling. Most of it is not very useful in practice, since it assumes unrealistic program and machine models (e.g., that task execution times are known). Nevertheless, it is useful to study, since it gives us intuition for what the issues are for scheduling in practice.

A model: dependence DAGs

Machine:
- P identical processors
- each processor has a local memory; all shared data is stored in a global memory

Program: a DAG with START and END nodes
- all nodes are reachable from START, and END is reachable from all nodes (START and END are not essential)
- nodes are computations
  - each computation can be executed by a processor in some number of time-steps
  - a computation may require reading/writing shared memory
  - node weight wi: time taken by a processor to perform computation i
- edges are precedence constraints, known as dependences
  - a node other than START can be executed only after its immediate predecessors in the graph have been executed

Questions this model raises:
- work assignment: how does a processor know which nodes it must execute?
- synchronization: if P1 executes node a and P2 executes node b, which depends on a, how does P2 know when P1 is done?

For now, let us defer these questions. In general, the time to execute the program depends on the work assignment; for now, assume only that if there is an idle processor and a ready node, that node is assigned immediately to an idle processor.

TP = best possible time to execute the program on P processors.

This is a very old model: PERT charts (late 1950s), for Program Evaluation and Review Technique, were developed by the US Navy to manage Polaris submarine contracts.

Work and critical path

- Work = Σi wi: the time required to execute the program on one processor; call this T1
- Path weight: the sum of the weights of the nodes on a path
- Critical path: a path from START to END that has maximal weight
  - this work must be done sequentially, so you need this much time regardless of how many processors you have
  - call this time T∞

Terminology

- Instantaneous parallelism IP(t): the maximum number of processors that can be kept busy at each point in the execution of the algorithm
- Maximal parallelism MP: the highest instantaneous parallelism
- Average parallelism AP = T1/T∞

These are properties of the computation DAG, not of the machine or the work assignment.

Algorithm for computing earliest start times of nodes

- Keep a value called minimum-start-time (mst) with each node, initialized to 0
- Do a topological sort of the DAG, ignoring node weights
- For each node n (≠ START) in topological order:
    for each node p in predecessors(n):
      mstn = max(mstn, mstp + wp)
- Complexity: O(|V|+|E|)

The critical path, and the instantaneous, maximal, and average parallelism, can easily be computed from this.

Speed-up

Speed-up(P) = T1/TP: intuitively, how much faster is it to execute the program on P processors than on 1 processor?

Bound on speed-up: regardless of how many processors you have, you need at least T∞ units of time, so

  speed-up(P) ≤ T1/T∞ = (Σi wi)/CP = AP
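The earliest-start-time pass above can be sketched in a few lines (a minimal sketch; the representation of the DAG as successor lists plus a weight map is my own choice, not from the slides):

```python
from collections import defaultdict, deque

def earliest_start_times(succ, weight):
    """succ: {node: [successor, ...]}; weight: {node: wi}.
    Returns mst[n], the earliest time at which node n can begin."""
    indeg = defaultdict(int)
    for ss in succ.values():
        for s in ss:
            indeg[s] += 1
    mst = {n: 0 for n in weight}
    # Kahn's topological sort, ignoring node weights.
    worklist = deque(n for n in weight if indeg[n] == 0)
    while worklist:
        n = worklist.popleft()
        for s in succ.get(n, []):
            mst[s] = max(mst[s], mst[n] + weight[n])  # mstn = max(mstn, mstp + wp)
            indeg[s] -= 1
            if indeg[s] == 0:
                worklist.append(s)
    return mst

# Diamond DAG: START -> {a, b} -> END, unit weights.
succ = {"START": ["a", "b"], "a": ["END"], "b": ["END"], "END": []}
w = dict.fromkeys(succ, 1)
mst = earliest_start_times(succ, w)
T1 = sum(w.values())                    # work
Tinf = max(mst[n] + w[n] for n in w)    # critical-path length
AP = T1 / Tinf                          # average parallelism
```

For this diamond DAG, T1 = 4 and T∞ = 3, so AP = 4/3.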

Amdahl's law

- Suppose a fraction p of a program can be done in parallel
- Suppose you have an unbounded number of parallel processors, and they operate infinitely fast
- Then speed-up will be at most 1/(1-p)

This follows trivially from the previous result. Plug in some numbers:
- p = 90% → speed-up ≤ 10
- p = 99% → speed-up ≤ 100

To obtain significant speed-up, most of the program must be performed in parallel; serial bottlenecks can really hurt you.

Scheduling when P < MP

Suppose P < MP. There will be times during the execution when only a subset of the ready nodes can be executed. The time to execute the DAG can then depend on which subset of P nodes is chosen for execution. To understand this better, it is useful to have a more detailed machine model.
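The bound is easy to check numerically. A sketch (the finite-P form 1/((1-p) + p/P) reduces to the slide's 1/(1-p) as P grows without bound):

```python
def amdahl_speedup(p, P=float("inf")):
    """Upper bound on speed-up when a fraction p of the work is parallel.
    With P processors: 1/((1-p) + p/P); with unbounded P: 1/(1-p)."""
    parallel_term = 0.0 if P == float("inf") else p / P
    return 1.0 / ((1.0 - p) + parallel_term)

# The slide's numbers: p = 90% caps speed-up at 10, p = 99% at 100.
```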

Schedules

What if we only had 2 processors? A schedule is a function from each node to a (processor, start time) pair, also known as a space-time mapping.

Machine model: processors operate synchronously (in lock-step), with barrier synchronization in hardware; at each step, a processor can assume all other processors have completed their tasks in all previous steps. Each processor P1, P2, ..., Pp has a private memory, and shared data lives in a shared memory.

Example, for a DAG with unit-weight nodes a, b, c, d between START and END:

Schedule 1 (finishes in 4 steps):

  time:  0      1  2  3
  P0:    START  a  c  END
  P1:           b  d

Schedule 2 (finishes in 5 steps):

  time:  0      1  2  3  4
  P0:    START  a  b  d  END
  P1:           c

Intuition: nodes along the critical path should be given preference in scheduling.

Optimal schedules

An optimal schedule is a shortest possible schedule for a given DAG and a given number of processors. The complexity of finding optimal schedules is one of the most studied problems in CS:
- DAG is a tree: a level-by-level schedule is optimal (Aho, Hopcroft)
- General DAGs:
  - variable number of processors (the number of processors is an input to the problem): NP-complete
  - fixed number of processors:
    - 2 processors: polynomial-time algorithm
    - 3, 4, 5 processors: complexity is unknown!
- Many heuristics are available in the literature

Heuristic: list scheduling

A node is ready when all of its predecessor nodes have completed execution. Fill in the schedule cycle-by-cycle: in each cycle, choose nodes from the ready list, using heuristics to choose the best nodes in case you cannot schedule all the ready nodes.

One popular heuristic: assign node priorities before scheduling, where the priority of node n is the weight of the maximal-weight path from n to END. Intuitively, the further a node is from END, the higher its priority.

List scheduling algorithm

  cycle c = 0;
  ready-list = {START};
  inflight-list = {};
  while (|ready-list| + |inflight-list| > 0) {
    while (|ready-list| > 0) {
      if (a processor is free at this cycle) {
        remove highest-priority node n from ready-list and add it to inflight-list;
        add n to the schedule at time c;
      }
      else break;
    }
    c = c + 1;  // increment time
    for each node n in inflight-list {  // determine newly ready tasks
      if (n finishes at time c) {
        remove n from inflight-list;
        add every ready successor of n in the DAG to ready-list;
      }
    }
  }

On the example DAG above, the priority heuristic picks the good schedule. List scheduling is not always guaranteed to produce an optimal schedule (otherwise we would have a polynomial-time algorithm!).
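The pseudocode above, combined with the longest-path-to-END priority, can be sketched as follows (my own representation: successor lists and a weight map; ties between equal-priority nodes are broken arbitrarily):

```python
def list_schedule(succ, weight, num_procs):
    """Greedy list scheduling; returns {node: (processor, start cycle)}."""
    pred = {n: set() for n in weight}
    for n, ss in succ.items():
        for s in ss:
            pred[s].add(n)
    # Priority of n = weight of the maximal-weight path from n to END.
    prio = {}
    def priority(n):
        if n not in prio:
            prio[n] = weight[n] + max((priority(s) for s in succ.get(n, [])),
                                      default=0)
        return prio[n]
    schedule = {}
    ready = [n for n in weight if not pred[n]]   # initially just START
    inflight = {}                                # node -> (processor, finish cycle)
    cycle = 0
    while ready or inflight:
        busy = {p for p, _ in inflight.values()}
        ready.sort(key=priority, reverse=True)   # prefer critical-path nodes
        while ready and len(busy) < num_procs:   # fill idle processors
            n = ready.pop(0)
            proc = min(set(range(num_procs)) - busy)
            busy.add(proc)
            schedule[n] = (proc, cycle)
            inflight[n] = (proc, cycle + weight[n])
        cycle += 1
        for n in [m for m, (_, f) in inflight.items() if f == cycle]:
            del inflight[n]                      # n finished: successors may be ready
            for s in succ.get(n, []):
                pred[s].discard(n)
                if not pred[s]:
                    ready.append(s)
    return schedule
```

On the unit-weight diamond DAG, this schedules a and b in the same cycle on different processors, matching the hand-drawn schedule.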

Generating dependence graphs

How do we produce dependence graphs in the first place? Two approaches:
- specify the DAG explicitly (parallel programming)
  - easy to make mistakes
  - data races: two tasks write to the same location but are not ordered by a dependence
- by compiler analysis of sequential programs, called dependence analysis

Let us study the second approach.

Data dependence

Consider basic blocks (straight-line code). Nodes represent statements. An edge S1 → S2, where S1 is executed before S2 in the basic block, is a:
- flow dependence (read-after-write, RAW): S1 writes to a variable that is read by S2
- anti-dependence (write-after-read, WAR): S1 reads from a variable that is written by S2
- output dependence (write-after-write, WAW): S1 and S2 write to the same variable
- input dependence (read-after-read, RAR): S1 and S2 read from the same variable (usually not important)
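The four kinds of dependence can be illustrated with a small helper (illustrative only; representing statements by their read/write sets is my own simplification):

```python
def classify_dependences(s1_reads, s1_writes, s2_reads, s2_writes):
    """Dependences from statement S1 to a later statement S2 in a basic
    block, given the sets of variables each statement reads and writes."""
    deps = []
    if s1_writes & s2_reads:
        deps.append("flow (RAW)")
    if s1_reads & s2_writes:
        deps.append("anti (WAR)")
    if s1_writes & s2_writes:
        deps.append("output (WAW)")
    if s1_reads & s2_reads:
        deps.append("input (RAR)")
    return deps

# S1: x = y + 1   S2: y = x * 2
# S1 writes x, which S2 reads (RAW); S1 reads y, which S2 writes (WAR).
```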

Imprecision and conservative approximation

In real programs, we often cannot determine precisely whether a dependence exists. With array subscripts i and j, for example: if i = j, the dependence exists; if i ≠ j, it does not. A dependence may exist for some invocations and not for others. The conservative approximation: when in doubt, assume the dependence exists. At worst, this will prevent us from executing some statements in parallel even when that would be legal.

Aliasing: two program names for the same storage location. For example, X(i) and X(j) are may-aliases. May-aliasing is the major source of imprecision in dependence analysis.

Compiling sequential programs to parallel code

- Write a sequential program; the compiler produces parallel code:
  - generates the control-flow graph
  - produces a computation DAG for each basic block by performing dependence analysis
  - generates a schedule for each basic block, using list scheduling or some other heuristic
  - the branch at the end of a basic block is scheduled on all processors
- Problem: the average basic block is fairly small (~5 RISC instructions)
- One solution: transform the program to produce bigger basic blocks

One transformation: loop unrolling

Original program:

  for i = 1,100
    X(i) = i

Unrolling the loop 4 times naively is not very useful, because the updates to i create a chain of dependences through the loop body:

  for i = 1,100,4
    X(i) = i
    i = i+1
    X(i) = i
    i = i+1
    X(i) = i
    i = i+1
    X(i) = i

Smarter loop unrolling: use a new name for the loop iteration variable in each unrolled instance:

  for i = 1,100,4
    X(i) = i
    i1 = i+1
    X(i1) = i1
    i2 = i+2
    X(i2) = i2
    i3 = i+3
    X(i3) = i3

If the compiler can also figure out that X(i), X(i+1), X(i+2), and X(i+3) are different locations, the dependence graph for the loop body consists of four independent statements.

We will study techniques for array dependence analysis later in the course. The problem can be formulated as an integer linear programming problem: is there an integer point within a certain polyhedron derived from the loop bounds and the array subscripts?
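As a stand-in for the integer-programming formulation, a brute-force check over the (small) iteration space shows what the dependence test decides; this is my own illustration, not how a real compiler implements the test:

```python
def cross_iteration_dep(f, g, lo, hi):
    """Is there a pair of distinct iterations i != j with f(i) == g(j),
    i.e., do subscript functions f and g ever name the same element of X?"""
    touched = {}
    for i in range(lo, hi + 1):
        touched.setdefault(f(i), set()).add(i)
    return any(i != j
               for j in range(lo, hi + 1)
               for i in touched.get(g(j), ()))

# X(i) in one iteration never collides with X(i) in another iteration,
# so the renamed unrolled copies X(i), X(i+1), ... are independent;
# X(i) vs X(i+1) does collide across iterations.
```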

Two applications

- Static scheduling: create the space-time diagram at compile time; used for VLIW code generation
- Dynamic scheduling: create the space-time diagram at runtime; useful for multicore scheduling of dense linear algebra

Scheduling instructions for VLIW machines

The model maps onto the machine as follows:
- processors → functional units
- local memories → registers
- global memory → memory
- nodes in the DAG → operations (load/store/add/mul/branch/...)
- one time step → one instruction

This is instruction-level parallelism. List scheduling is useful for scheduling code for pipelined, superscalar, and VLIW machines, and is used widely in commercial compilers; loop unrolling and array dependence analysis are also used widely.

These ideas originated in the late 70s and early 80s. Two key people:
- Bob Rau (Stanford, UIUC, TRW, Cydrome, HP)
  - transformations for making basic blocks larger: predication, software pipelining
  - hardware support for these techniques: predicated execution, rotating register files
  - most of these ideas were later incorporated into the Intel Itanium processor
- Josh Fisher (NYU, Yale, Multiflow, HP)
  - transformations for making basic blocks larger: trace scheduling, which uses the key idea of branch probabilities
  - the Multiflow compiler used loop unrolling

Dynamic scheduling for multicore processors

Reality: it is hard to build a single-cycle memory that can be accessed by large numbers of cores. The architectural change: decouple the cores, so there is no notion of a global step; each core/processor has its own PC and cache, and memory is accessed independently by each core.

New problem: since cores do not operate in lock-step, how does a core know when it is safe to execute a node? For example, if P0 executes a, P1 executes b, and P2 executes c and d, how does P2 know when P0 and P1 are done?

Solution: software synchronization. A counter is associated with each DAG node and decremented when a predecessor task is done. Software synchronization increases the overhead of parallel execution: we cannot afford to synchronize at the instruction level, so the nodes of the DAG must be coarse-grain, e.g., loop iterations.
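The counter scheme can be sketched with threads. This is a sketch under the assumption that tasks are coarse enough to amortize the locking; a real runtime would block on a condition variable instead of spinning:

```python
import threading
from collections import deque

def run_dag(succ, pred_count, task_fn, num_workers=2):
    """Execute a DAG of tasks; pred_count[n] = number of predecessors of n.
    Each finished task decrements its successors' counters; a task whose
    counter reaches zero becomes ready."""
    lock = threading.Lock()
    counts = dict(pred_count)
    ready = deque(n for n, c in counts.items() if c == 0)
    done = []
    def worker():
        while True:
            with lock:
                if not ready:
                    if len(done) == len(counts):
                        return            # everything has executed
                    continue              # spin; a real runtime would block
                n = ready.popleft()
            task_fn(n)                    # run the task outside the lock
            with lock:
                done.append(n)
                for s in succ.get(n, []):
                    counts[s] -= 1        # software synchronization counter
                    if counts[s] == 0:
                        ready.append(s)
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return done
```

Completion order respects the dependences regardless of how the workers interleave: a node is enqueued only after all of its predecessors have finished.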

Increasing granularity: block matrix algorithms

Original matrix multiplication:

  for I = 1,N
    for J = 1,N
      for K = 1,N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)

Block (tiled) matrix multiplication:

  for IB = 1,N step B
    for JB = 1,N step B
      for KB = 1,N step B
        for I = IB, IB+B-1
          for J = JB, JB+B-1
            for K = KB, KB+B-1
              C(I,J) = C(I,J) + A(I,K)*B(K,J)

With a 2x2 blocking of A, B, and C:

  C00 = A00*B00 + A01*B10
  C01 = A00*B01 + A01*B11
  C10 = A10*B00 + A11*B10
  C11 = A10*B01 + A11*B11

New problem: it is difficult to get accurate execution times for coarse-grain nodes, because of conditionals inside loop iterations, cache misses, exceptions, O/S processes, and parallel loops. Solution: runtime scheduling.

Dongarra et al (UTK): a programming model for specifying DAGs for parallel blocked dense linear algebra codes, e.g., tiled QR (using tiles and in/out notations):
- nodes: block computations
- DAG edges are specified by the programmer

The runtime system keeps track of ready nodes, assigns ready nodes to cores, and determines whether new nodes become ready when a node completes.
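The tiled loop nest runs as written; a pure-Python sketch (for simplicity I assume the block size b divides n):

```python
def blocked_matmul(A, B_, n, b):
    """C = A * B_ for n x n matrices (lists of lists), with b x b tiles."""
    C = [[0] * n for _ in range(n)]
    for IB in range(0, n, b):
        for JB in range(0, n, b):
            for KB in range(0, n, b):
                # One coarse-grain DAG node:
                # C-tile (IB,JB) += A-tile (IB,KB) * B-tile (KB,JB)
                for I in range(IB, IB + b):
                    for J in range(JB, JB + b):
                        for K in range(KB, KB + b):
                            C[I][J] += A[I][K] * B_[K][J]
    return C
```

Each innermost triple loop is one block computation, i.e., one coarse-grain node of the DAG that a runtime scheduler would assign to a core.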

DAGuE: Tiled QR

(Slide figures: the dataflow graph of tiled QR for a 2x2 processor grid; measurements on a machine with 81 nodes and 648 cores.)

Summary of multicore scheduling

Assumptions:
- the DAG of tasks is known
- each task is heavy-weight, and executing a task on one worker exploits adequate locality
- no assumptions about the runtimes of tasks
- no lock-step execution of processors or synchronous global memory

Scheduling:
- keep a work-list of tasks that are ready to execute
- use heuristic priorities to choose from the ready tasks

Summary

- Dependence graphs: nodes are computations, edges are dependences
- Static dependence graphs are obtained by studying the algorithm or by analyzing the program
- Limits on speed-ups: the critical path; Amdahl's law
- DAG scheduling:
  - heuristic: list scheduling (many variations)
  - static and dynamic scheduling
  - applications: VLIW code generation, multicore scheduling for dense linear algebra
- Major limitations:
  - works for topology-driven algorithms with fixed neighborhoods, since we know the tasks and dependences before executing the program
  - not very useful for data-driven algorithms, since tasks are created dynamically
  - one solution: work-stealing and work-sharing, which we study later
