
Original Title: Lecture 5

Uploaded by Phuoc Loc Nguyen


5-1

Algorithm Strength Reduction

Fast algorithm

- Use a polyphase filter bank to realize a linear-phase filter with a long tap length
- Use the FFT instead of the DFT
- Fast convolution algorithm
- Fast traceback Viterbi algorithm (for convolutional codes, Turbo codes)
- Detect first, then apply the BM (Berlekamp-Massey) algorithm, for Reed-Solomon codes
- Use a CORDIC machine instead of a complex multiplier
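As a rough illustration of the CORDIC item, the sketch below rotates a vector using only the shift-and-add style iterations that a CORDIC machine performs in place of a complex multiplication. It is a floating-point model, not a hardware description; the function name `cordic_rotate` and the default iteration count are assumptions made for this example and do not come from the lecture.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians with CORDIC rotation-mode iterations.

    Converges for |angle| up to roughly 1.74 rad. In hardware each
    multiplication by 2**-i is a right shift by i bits.
    """
    # Elementary rotation angles atan(2^-i) and the accumulated CORDIC gain.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        direction = 1.0 if z >= 0 else -1.0
        x, y = (x - direction * y * 2.0 ** -i,
                y + direction * x * 2.0 ** -i)
        z -= direction * angles[i]

    # Compensate the constant gain introduced by the micro-rotations.
    return x / gain, y / gain
```

Rotating (1, 0) by an angle θ should return approximately (cos θ, sin θ), which is the same result a complex multiplication by e^{jθ} would give.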

Memory management

- Memory banks, register file
- Local buffer (cache or FIFO)
- A multiple-port memory is replaced by multiple single-port memory banks

Resource allocation

- Use a finite-state machine (FSM) for timing and flow control, through enable/disable signals, in order to time-share the same PE

Power Management

- Clock gating, data gating
- Power-aware, energy-aware design

5-2

Convolutional Code

Encoder
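The encoder diagram itself did not survive in this text, so the sketch below only illustrates the general idea of a convolutional encoder. It assumes a rate-1/2 code with constraint length 3 and generator polynomials (7, 5) in octal, a common textbook choice that is not confirmed by the lecture; the function name `conv_encode` is likewise hypothetical.

```python
def conv_encode(bits, generators=(0b111, 0b101), constraint_length=3):
    """Encode a bit sequence with a feed-forward convolutional encoder.

    Assumed parameters: rate 1/2, constraint length 3, generators (7, 5) octal.
    The shift register starts in the all-zero state.
    """
    mask = (1 << constraint_length) - 1
    state = 0
    out = []
    for b in bits:
        # Shift the new input bit into the register.
        state = ((state << 1) | b) & mask
        # Each output bit is the parity of the register taps chosen by a generator.
        for g in generators:
            out.append(bin(state & g).count("1") % 2)
    return out
```

For the input 1 0 1 1 this sketch produces the output pairs 11 10 00 01.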

5-3

Localized operations, intensive computations, and matrix operations are features of many DSP algorithms.
Derive maximal concurrency by using both pipelining and parallel processing.

- How much concurrency is inherent in the algorithm?
- How does the array processor design depend on the algorithm?
- How is the algorithm best implemented in the array processor?

- By tracing the associated space-time index space and using proper arcs to display the dependencies
- It exhibits the full dependencies incurred in the execution of a specific algorithm
- Modularity, regularity, local interconnection

5-4

- Introduced by H. T. Kung and Leiserson, 1978
- Designs for matrix computations
- Illustrated by snapshots of operation

Motivations

- Improve performance of special-purpose systems (e.g. maximize processing per memory access)
- Reduce design and implementation costs

VSP Lecture5 - Systolic Array (cwliu@twins.ee.nctu.edu.tw)

5-5

- A network of processing elements (PEs) that computes and rhythmically passes data through it
- Multiple PEs to maximize processing per memory access

[Figure: two systems, each with a 100 ns memory: an ordinary system (10 MOPS) and a systolic array (40 MOPS)]

5-6

Example

- Systolic FIR filter
- Least-squares algorithm

5-7

1D array

2D array

3D array

5-8

A new class of pipelined array architectures

Benefits

- Simple and regular design (cost-effective)
- Concurrency and communication
- Modular and expandable

Drawbacks

- Not all algorithms can be implemented using a systolic architecture
- Cost in hardware and area
- Cost in latency

5-9

Systolic Fundamentals

- Systolic architectures are designed by applying linear mapping techniques to a regular dependence graph (DG).
- Regular dependence graph: the presence of an edge in a certain direction at any node in the DG implies the presence of an edge in the same direction at all nodes in the DG.
- The DG corresponds to a space representation: no time instance is assigned to any computation.
- Systolic architectures have a space-time representation, where each node is mapped to a certain processing element (PE) and is scheduled at a particular time instance.
- The systolic design methodology maps an N-dimensional DG to a lower-dimensional systolic architecture.

5-10

Space representation for FIR filter

[Figure: dependence graph of the FIR filter with fundamental edges $(0,1)^T$, $(1,0)^T$, and $(1,-1)^T$]

$y(n) = w_0 x(n) + w_1 x(n-1) + w_2 x(n-2)$
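One way to see how this FIR equation runs on a linear array is the clock-by-clock model below. It is a minimal sketch of just one possible design, assumed here for illustration: weights stay in their PEs, each input sample is broadcast to all PEs, and partial results move one PE per clock through a register. The function name `systolic_fir` is hypothetical.

```python
def systolic_fir(w, x):
    """Clock-by-clock model of a linear FIR array.

    Assumed design: PE j holds weight w[j], the current input is broadcast,
    and each PE adds its product to the partial sum that PE j+1 produced on
    the previous clock. The output is taken from PE 0.
    """
    taps = len(w)
    # regs[j] is the registered output of PE j from the previous clock;
    # regs[taps] is the constant-zero boundary input of the last PE.
    regs = [0] * (taps + 1)
    out = []
    for sample in x:
        new = [w[j] * sample + regs[j + 1] for j in range(taps)]
        out.append(new[0])
        regs = new + [0]
    return out
```

With `w = [1, 2, 3]` and a constant input of ones, the outputs are 1, 3, 6, 6, which matches evaluating the FIR sum directly with zero initial conditions.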

5-11

Definitions

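- Projection vector (also called iteration vector) $d = (d_1, d_2)^T$: two nodes that are displaced by $d$ or multiples of $d$ are executed by the same processor.
- Processor space vector $p^T = (p_1, p_2)$: any node with index $I = (i, j)^T$ is executed by processor $p^T I = (p_1 \; p_2)(i, j)^T$.
- Scheduling vector $s^T = (s_1, s_2)$: any node with index $I$ is executed at time $s^T I$.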

5-12

Many systolic architectures can be designed for a given algorithm by selecting different projection, processor space, and scheduling vectors.

Feasibility constraints

- If points $I_A$ and $I_B$ differ by $d$, i.e. $d = I_A - I_B$, so that they lie along the direction of the projection vector, they must be executed by the same processor. That is, $p^T I_A = p^T I_B$, or $p^T d = 0$.
- If points $I_A$ and $I_B$ are mapped to the same processor, i.e. $I_A - I_B = d$, they cannot be executed at the same time. That is, $s^T I_A \neq s^T I_B$, or $s^T d \neq 0$.
- If an edge $e$ exists in the DG, then an edge $p^T e$ is introduced in the systolic array with $s^T e$ delays.
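The constraints above can be checked mechanically. The sketch below tests p^T d = 0 and s^T d ≠ 0 for a candidate design and then maps each DG edge e to the pair (p^T e, s^T e). The edge set is the FIR one, with labels taken from the recurrences w(i,j) = w(i-1,j), x(i,j) = x(i,j-1), y(i,j) = y(i-1,j+1) + w(i,j)x(i,j); the particular choice d = (1,0), p = (0,1), s = (1,0) and the helper names are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))

def map_edges(d, p, s, edges):
    """Check feasibility of (d, p, s) and map every DG edge.

    Returns {edge name: (p^T e, s^T e)}, i.e. the displacement between PEs
    and the number of delays on the corresponding array edge.
    """
    if dot(p, d) != 0:
        raise ValueError("infeasible: p^T d must equal 0")
    if dot(s, d) == 0:
        raise ValueError("infeasible: s^T d must be nonzero")
    return {name: (dot(p, e), dot(s, e)) for name, e in edges.items()}

# Fundamental edges of the FIR dependence graph.
FIR_EDGES = {"weight": (1, 0), "input": (0, 1), "result": (1, -1)}

design = map_edges(d=(1, 0), p=(0, 1), s=(1, 0), edges=FIR_EDGES)
```

For this choice, `design` is `{"weight": (0, 1), "input": (1, 0), "result": (-1, 1)}`: weights stay in place with one delay, inputs travel to the neighbouring PE with zero delay (a broadcast), and results move in the opposite direction with one delay.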

5-13

Step 1: mapping algorithm to DG

- Based on the space-time indices in the recursive algorithm
- Shift-invariance (homogeneity) of the DG
- Localization of the DG: broadcast vs. transmitted data

- Processor assignment: a projection method may be applied (projection vector d)
- Scheduling: a permissible linear schedule may be applied (schedule vector s)
  - Preserve the inter-dependence
  - Nodes on an equitemporal hyperplane should not be projected to the same PE

5-14

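[Figure: dependence graph in the (n, k) plane with inputs x(0) to x(6), coefficients h(0) to h(4), and outputs y(0) to y(6), showing the projection vector d, the schedule vector s, the equitemporal hyperplanes, and the resulting array with D and 2D delay elements]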

5-15

Space-Time Representation

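The space representation or DG can be transformed to a space-time representation by interpreting one of the spatial dimensions as the temporal dimension. For a 2D DG, with $I = (i, j)^T$, $p^T = (p_1, p_2)$, and $s^T = (s_1, s_2)$:

$$\begin{pmatrix} i' \\ j' \\ t' \end{pmatrix} = T \begin{pmatrix} i \\ j \\ t \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ p_1 & p_2 & 0 \\ s_1 & s_2 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \\ t \end{pmatrix}$$

$$i' = t, \quad j' = p^T I, \quad t' = s^T I$$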
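As a small worked illustration of j' = p^T I and t' = s^T I, the sketch below maps every node of a 4-by-3 index space to a (processor, time) pair and checks whether two nodes collide in the same slot. The index-space size, the vector choices, and the helper names are assumptions made for this example.

```python
def space_time_map(nodes, p, s):
    """Map each DG node I = (i, j) to (processor p^T I, time s^T I)."""
    mapping = {}
    for (i, j) in nodes:
        processor = p[0] * i + p[1] * j
        time = s[0] * i + s[1] * j
        mapping[(i, j)] = (processor, time)
    return mapping

def has_conflict(mapping):
    """True if two different nodes share the same (processor, time) slot."""
    slots = list(mapping.values())
    return len(slots) != len(set(slots))

nodes = [(i, j) for i in range(4) for j in range(3)]
good = space_time_map(nodes, p=(0, 1), s=(1, 0))  # PE = j, time = i
bad = space_time_map(nodes, p=(0, 1), s=(0, 1))   # PE = j, time = j
```

With p = (0,1) and s = (1,0), node (3, 2) runs on processor 2 at time 3 and no slot is shared. With s = (0,1), all nodes in the same row of the index space land in the same slot, so that schedule is not usable with this processor assignment.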

5-16

Example

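[Figure: example array with schedule vector s and projection vector d, showing an input line and a result line with delay elements D]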

5-17

5-18

- Linear scheduling:
  $S_X = s^T I_X = (s_1 \; s_2)(i_X, j_X)^T$
  $S_Y = s^T I_Y = (s_1 \; s_2)(i_Y, j_Y)^T$
- Affine scheduling (a linear transformation followed by a translation):
  $S_X = s^T I_X + \gamma_X = (s_1 \; s_2)(i_X, j_X)^T + \gamma_X$
  $S_Y = s^T I_Y + \gamma_Y = (s_1 \; s_2)(i_Y, j_Y)^T + \gamma_Y$
- For a dependence relation $X \to Y$, where $I_X = (i_X, j_X)^T$ and $I_Y = (i_Y, j_Y)^T$, we have $S_Y \ge S_X + T_X$, where $S_X$ and $S_Y$ are the scheduling times for nodes X and Y, respectively, and $T_X$ is the computation time for node X.
- Each edge of a DG leads to an inequality for the selection of the scheduling vectors.

5-19

- Standard input RIA form: the indices of the inputs are the same for all equations
- Standard output RIA form: all the output indices are the same

FIR filter example (all outputs have index (i,j)):

w(i,j) = w(i-1,j)
x(i,j) = x(i,j-1)
y(i,j) = y(i-1,j+1) + w(i,j) x(i,j)

The FIR filtering problem cannot be expressed in standard input RIA form.
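To make the FIR recurrences concrete, the sketch below evaluates w(i,j) = w(i-1,j), x(i,j) = x(i,j-1), and y(i,j) = y(i-1,j+1) + w(i,j)x(i,j) over a finite index space. The boundary conditions are assumptions made for this example: w(0,j) is tap j, x(i,0) is input sample i, and any y outside the index space is zero. Under these assumptions the filter output for sample i appears at node (i, 0).

```python
def fir_via_ria(w, x):
    """Evaluate the FIR recurrences node by node over the (i, j) index space.

    i indexes input samples, j indexes taps. Boundary values (assumed):
    w(0, j) = w[j], x(i, 0) = x[i], y outside the index space = 0.
    """
    taps, n_samples = len(w), len(x)
    W, X, Y = {}, {}, {}
    for i in range(n_samples):
        for j in range(taps):
            # Weights propagate along i, inputs propagate along j.
            W[i, j] = w[j] if i == 0 else W[i - 1, j]
            X[i, j] = x[i] if j == 0 else X[i, j - 1]
            # Partial sums accumulate along the (1, -1) direction; row i-1
            # is already complete when row i is processed.
            Y[i, j] = Y.get((i - 1, j + 1), 0) + W[i, j] * X[i, j]
    return [Y[i, 0] for i in range(n_samples)]
```

For `w = [1, 2, 3]` and four input samples equal to one, the result is `[1, 3, 6, 6]`, the same as the direct sum of w_k x(n-k) with zero initial conditions.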

5-20

Selection of sT

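- Capture all the fundamental edges in the reduced dependence graph (RDG), which is constructed from the regular iterative algorithm (RIA)
- Construct the scheduling inequalities: $S_Y \ge S_X + T_X$ whenever there is an edge $X \to Y$

5-21

Example

For an edge $e = I_Y - I_X$ from node X to node Y, where $\gamma_X$ and $\gamma_Y$ are the translation terms of the affine schedule:

$$s^T (I_Y - I_X) + \gamma_Y - \gamma_X \ge T_X, \quad \text{i.e.} \quad s^T e + \gamma_Y - \gamma_X \ge T_X$$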
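The inequality s^T e + γ_Y − γ_X ≥ T_X, one per RDG edge X → Y, can be tested directly for a candidate schedule. The sketch below does this for the FIR recurrences; the edge list is read off those recurrences, while the computation times (0 for pure propagation of w and x, 1 for the multiply-add in y) and the zero translation terms are hypothetical values chosen only for illustration.

```python
def schedule_is_feasible(s, gamma, edges, compute_time):
    """Check s^T e + gamma[Y] - gamma[X] >= T_X for every RDG edge X -> Y.

    edges is a list of (X, Y, e) triples, with e = I_Y - I_X.
    """
    for src, dst, e in edges:
        lhs = s[0] * e[0] + s[1] * e[1] + gamma[dst] - gamma[src]
        if lhs < compute_time[src]:
            return False
    return True

# RDG edges of the FIR recurrences.
FIR_RDG = [
    ("w", "w", (1, 0)),   # w(i,j) = w(i-1,j)
    ("x", "x", (0, 1)),   # x(i,j) = x(i,j-1)
    ("y", "y", (1, -1)),  # y(i,j) depends on y(i-1,j+1)
    ("w", "y", (0, 0)),   # y(i,j) uses w(i,j)
    ("x", "y", (0, 0)),   # y(i,j) uses x(i,j)
]
TIMES = {"w": 0, "x": 0, "y": 1}   # hypothetical computation times
GAMMA = {"w": 0, "x": 0, "y": 0}   # hypothetical translation terms
```

Under these assumed values, s = (1, 0) satisfies every inequality, while s = (0, 1) and s = (1, 1) both fail on the y → y edge.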

5-22


5-23

Matrix-Matrix Multiplication

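$$\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$

a(i,j+1,k) = a(i,j,k)
b(i+1,j,k) = b(i,j,k)
c(i,j,k+1) = c(i,j,k) + a(i,j,k+1) b(i,j,k+1)

Standard output RIA form (each equation reindexed so that every output has index (i,j,k)):

a(i,j,k) = a(i,j-1,k)
b(i,j,k) = b(i-1,j,k)
c(i,j,k) = c(i,j,k-1) + a(i,j,k) b(i,j,k)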
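The matrix-multiplication recurrences can be checked by evaluating them over the 3D index space. The sketch below uses the equivalent reindexed form, in which every output has index (i,j,k): a(i,j,k) = a(i,j-1,k), b(i,j,k) = b(i-1,j,k), c(i,j,k) = c(i,j,k-1) + a(i,j,k)b(i,j,k). The boundary conditions are assumptions for this example: a(i,0,k) is element (i,k) of A, b(0,j,k) is element (k,j) of B, and c(i,j,0) = 0, with i, j, k running from 1 to n.

```python
def matmul_via_ria(A, B):
    """Multiply two n-by-n matrices by evaluating the RIA node by node."""
    n = len(A)
    a, b, c = {}, {}, {}
    # Boundary values (assumed): inputs enter on the faces of the index cube.
    for u in range(1, n + 1):
        for v in range(1, n + 1):
            a[u, 0, v] = A[u - 1][v - 1]   # a(i, 0, k) = A[i][k]
            b[0, u, v] = B[v - 1][u - 1]   # b(0, j, k) = B[k][j]
            c[u, v, 0] = 0                 # c(i, j, 0) = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                a[i, j, k] = a[i, j - 1, k]   # a propagates along j
                b[i, j, k] = b[i - 1, j, k]   # b propagates along i
                c[i, j, k] = c[i, j, k - 1] + a[i, j, k] * b[i, j, k]
    # The finished sums sit on the k = n face.
    return [[c[i, j, n] for j in range(1, n + 1)] for i in range(1, n + 1)]
```

For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] this returns [[19, 22], [43, 50]], the ordinary matrix product.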

5-24


Example

5-25

Example

5-26
