Professional Documents
Culture Documents
dortmund informatik 12
Optimizations
- Compilation for Embedded Processors -
Peter Marwedel
TU Dortmund
© Springer, 2010
Informatik 12
Germany
2014 年 01 月 17 日 These slides use Microsoft clip arts. Microsoft copyright restrictions apply.
Structure of this course
Modeling repository
3: 8:
ES-hardware 6: Application Test
mapping
4: system 7: Optimization
software (RTOS,
5: Evaluation &
middleware, …) validation & (energy,
cost, performance, …)
Never
true
Always true
technische universität fakultät für p. marwedel,
dortmund informatik informatik 12, 2014 - 10 -
Optimized version of Tin
Never true
Tin () {
READ (IN, sample, 1);
j==i-1 sum += sample; i++;
ji DATA = sample; d = DATA;
L0: if (i < N) return;
DATA = sum/N; d = DATA;
d = d*c; WRITE(OUT,d,1);
sum = 0; i = 0;
return;
}
Always true
technische universität fakultät für p. marwedel,
dortmund informatik informatik 12, 2014 - 11 -
technische universität fakultät für informatik
dortmund informatik 12
Peter Marwedel
TU Dortmund
© Springer, 2010
Informatik 12
Germany
These slides use Microsoft clip arts. Microsoft copyright restrictions apply.
High-level optimizations
•• Floating-Point
Floating-Pointvs.
vs.Fixed-Point
Fixed-Point •• Integer
Integervs.
vs.Fixed-Point
Fixed-Point
exponent,
exponent,mantissa
mantissa
Floating-Point
Floating-Point S 1 0 0 . . . 0 0 0 0 1 0
•• automatic
automaticcomputation
computation
and
andupdate
updateofofeach
each (a) Integer
exponent
exponentat atrun-time
run-time
Fixed-Point IWL=3 FWL
Fixed-Point
•• implicit
implicitexponent
exponent
•• determined S 1 0 0 . . . 0 0 0 0 1 0
determinedoff-line
off-line
hypothetical binary point
(b) Fixed-Point
© Ki-Il Kum, et al
technische universität fakultät für p. marwedel,
dortmund informatik informatik 12, 2014 - 14 -
Floating-point to fixed point conversion
Pros:
Lower cost
Faster
Lower power consumption
Sufficient SQNR, if properly scaled
Suitable for portable applications
Cons:
Decreased dynamic range
Finite word-length effect, unless properly scaled
• Overflow and excessive quantization noise
Extra programming effort
© Ki-Il Kum, et al. (Seoul National University): A Floating-point To Fixed-point C Converter
For Fixed-point Digital Signal Processors, 2nd SUIF Workshop, 1996
technische universität fakultät für p. marwedel,
dortmund informatik informatik 12, 2014 - 15 -
Development Procedure
Floating-Point
C Program
Range
Estimator
Floating-
Point to Range Estimation
Fixed-Point C Program
C Program
Converter Execution Manual
specification
Fixed-Point
C Program IWL
information
s
0.9 x 215 iwl=4.xxxxxxxxxxxx
*
x
iwl=0.xxxxxxxxxxxx
overflow if + >>5
result
technische universität fakultät für p. marwedel,
dortmund informatik informatik 12, 2014 - 18 -
Floating-Point to Fixed-Point Program Converter
Fixed-Point C Program
mulh
int iir1(int x) to access the upper
{ half of the multiplied
static int s = 0; result
int y; target dependent
y=sll(mulh(29491,s)+ (x>> 5),1); implementation
s = y; sll
return y; to remove 2nd sign bit
} opt. overflow check
Cycle s ADPCM
140000 125249
120000
100000
80000 61401
60000
40000 26718
20000
0
F ixe d -P oin t F ixe d -P oin t F loa tin g -P oin t
(16b ) (32b )
Array p[j][k]
Column major order
Row major order (C) (FORTRAN)
j=0
j=0 k=0 j=1
j=0
j=1 k=1
j=1
…
j=0
j=2 k=2
j=1
…
…
j=2
j=1
j=0
For row major order
Example:
…#define iter 400000 Improved locality
int a[20][20][20];
void computeijk() {int i,j,k;
for (i = 0; i < 20; i++) {
for (j = 0; j < 20; j++) {
for (k = 0; k < 20; k++) {
a[i][j][k] += a[i][j][k];}}}}
void computeikj() {int i,j,k;
for (i = 0; i < 20; i++) {
for (j = 0; j < 20; j++) {
for (k = 0; k < 20; k++) {
a[i][k][j] += a[i][k][j] ;}}}}…
start=time(&start);for(z=0;z<iter;z++)computeijk();
end=time(&end);
printf("ijk=%16.9f\n",1.0*difftime(end,start));
(SUIF interchanges array indexes instead of loops)
technische universität fakultät für p. marwedel,
dortmund informatik informatik 12, 2014 - 26 -
Results:
strong influence of the memory architecture
Loop structure: i j k Dramatic impact of locality
Processor Ti C6xx Sun SPARC Intel Pentium
reduction to [%] ~ 57% 35% 3.2 %
Time [s]
Not always the same impact .. [Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
#define
#define size
size 30
30 void ms1() {int i,j;
#define
#define iter
iter 40000
40000 for (i=0;i< size;i++){
int
int a[size][size];
a[size][size]; for (j=0;j<size;j++){
float
float b[size][size];
b[size][size]; a[i][j]+=17; }
for (j=0;j<size;j++){
b[i][j]-=13; }}}
void ss1() {int i,j;
for (i=0;i<size;i++){
for (j=0;j<size;j++){ void mm1() {int i,j;
a[i][j]+= 17;}} for(i=0;i<size;i++){
for(i=0;i<size;i++){ for(j=0;j<size;j++){
for (j=0;j<size;j++){ a[i][j] += 17;
b[i][j]-=13;}}} b[i][j] -= 13;}}}
40
with
with–o3
–o3
20
0
X86 gcc 3.2 -03 x86 gcc 2.95 -o3 Sparc gcc 3xo1 Sparc gcc 3x o3
Plattform
factor = 2
Better locality for access to p.
Less branches per execution
of the loop. More opportunities
for optimizations.
Tradeoff between code size
and improvement.
Extreme case: completely
unrolled loop (no branch).
factor
factor
Benefits quite small; penalties may be large [Till Buchwald, Diploma thesis, Univ.
Dortmund, Informatik 12, 12/2004]
Processor Ti C6xx
reduction to [%]
#define s 50
#define iter 150000
int a[s][s], b[s][s];
void compute() {
int i,k;
for (i = 0; i < s; i++) {
for (k = 1; k < s; k++) {
a[i][k] = b[i][k];
b[i][k] = a[i][k-1];
}}}
factor
Small benefits; [Till Buchwald, Diploma thesis, Univ.
Dortmund, Informatik 12, 12/2004]
Pentium P4
16 kB L1 data cache, 4 cycles/access
1 MB L2 cache, 14 cycles/access
Main memory, 200 cycles/access
U. Drepper: What every programmer should know about memory*, 2007, http://www.akkadia.
org/drepper/cpumemory.pdf; Dank an Prof. Teubner (LS6) für Hinweis auf diese Quelle
* In Anlehnung an das Papier „David Goldberg, What every programmer should know about
floating point arithmetic, ACM Computing Surveys, 1991 (auch für diesen Kurs benutzt).
* prefetching succeeds §
prefetching fails
technische universität fakultät für p. marwedel,
© Graphics: U. Drepper, 2007 - 37 -
dortmund informatik informatik 12, 2014
Impact of TLB misses and larger caches
Elements on different pages; run time Larger caches are shifting the
increase when exceeding the size of steps to the right
the TLB
i++
k++
i++
k++
k++ kk
j++ jj
i++ jj
i++
k++, j++
technische universität fakultät für p. marwedel, Monica Lam: The Cache Performance and Optimization
dortmund informatik informatik 12, 2014 of Blocked Algorithms, ASPLOS, 1991 - 40 -
Example
SPARC
In practice, results
by Buchwald are
disappointing.
One of the few
cases where an
improvement was
achieved:
Source: similar to
matrix mult.
Tiling-factor
Pentium [Till Buchwald, Diploma thesis, Univ.
Dortmund, Informatik 12, 12/2004]
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
160%
140%
120%
100%
80%
60%
40%
20%
0%
[Falk, 2002]
technische universität fakultät für p. marwedel,
dortmund informatik informatik 12, 2014 - 46 -
High-level optimizations
Initial arrays
Unfolded
Unfolded
arrays
arrays
Inter-array folding
ADOPT&DTSE required
to achieve real benefit
[C.Ghez et al.: Systematic high-level Address
Code Transformations for Piece-wise Linear
Indexing: Illustration on a Medical Imaging
Algorithm, IEEE WS on Signal Processing
System: design & implementation, 2000, pp.
623-632]