Project #1: Computer Architecture EE6304 Due Date: 3/8/2012 Team Number:22

Submitted By:
Kaur Guneet
Patel Jayendra
Index
Objective
General description for Alpha 21264 EV6 configuration
Part1:Preparation
Part2:Find CPI
Part3:Optimize CPI For Each Benchmark
Procedure followed for each Benchmark
Plots for benchmarks: GCC, Anagram, GO
Data tables for lowest CPIs for benchmarks: GCC, Anagram, GO
Part4:Define Cost Function
Part5:Optimize Cache for Performance/Cost
References
Objective
In this project, we have been asked to fine-tune the cache hierarchy of an Alpha
microprocessor for three individual benchmarks. The cache design parameters we can modify
are:
Cache levels: One or two levels, for data and instruction caches.
Unified caches: Selection of separate vs. unified instruction/data caches. For example,
you can have separate L1 caches and a unified L2 cache.
Size: Cache size, one of the most important choices.
Associativity: Selection of cache associativity (e.g. direct mapped, 2-way set associative,
etc.).
Block size: Block size of the cache, usually 64 or 32 bytes.
Block replacement policy: Selection between FIFO, LRU and Random.
While larger caches generally mean better performance, they also come at a greater cost. Thus,
sensible design choices and trade-offs are required. At the end of this project we also
define a cost function and use it to identify the optimal configuration.
General description for Alpha 21264 EV6 configuration

Architecture
Alpha architecture is a 64-bit load and store RISC architecture designed with particular emphasis
on speed, multiple instruction issue, multiple processors, and software migration from many
operating systems.
All registers are 64 bits long, and all operations are performed between 64-bit registers.
All instructions are 32 bits long. Memory operations are either load or store operations.
All data manipulation is done between registers.
The Alpha architecture supports the following data types:
8-, 16-, 32-, and 64-bit integers
IEEE 32-bit and 64-bit floating-point formats
VAX architecture 32-bit and 64-bit floating-point formats
Addressing
The basic addressable unit in the Alpha architecture is the 8-bit byte. The 21264 supports a 48-bit or 43-bit virtual address (selectable under IPR control).
Virtual addresses as seen by the program are translated into physical memory addresses
by the memory-management mechanism. The 21264 supports a 44-bit physical address.
Alpha 21264 Microarchitecture

The 21264 microprocessor is a high-performance third-generation implementation of the
Compaq Alpha architecture. The 21264 consists of the following sections, as shown in the
figure below:
Instruction fetch, issue, and retire unit (Ibox)
Integer execution unit (Ebox)
Floating-point execution unit (Fbox)
Onchip caches (Icache and Dcache)
Memory reference unit (Mbox)
External cache and system interface unit (Cbox)
Pipeline operation sequence
Figure:Alpha 21264 Microarchitecture
Part1:Preparation
We followed all the steps given in Part1 of the project document for setting up SimpleScalar.
The benchmarks were downloaded and verified to match the data given in the project handout.
Part2:Find CPI
Given:
Cache levels: Two levels.

Unified caches: Separate L1 data and instruction cache, unified L2 cache.
Size: 64K Separate L1 data and instruction caches, 1MB unified L2 cache.
Associativity: Two-way set-associative L1 caches, Direct-mapped L2 cache.
Block size: 64 bytes.
Block replacement policy: FIFO
L1 Miss Penalty=5 Cycles
L2 Miss Penalty=40 Cycles
Cache hit/Instruction execution=1 Cycle
Table showing data from running all the three benchmarks

Benchmark  Instructions  il1 Accesses  il1 Misses  il1 Missrate  dl1 Accesses  dl1 Misses  dl1 Missrate  dl2 Accesses  dl2 Misses  dl2 Missrate
CC1        337353464     337353464     1589871     0.0047        122134327     1287463     0.0105        3265974       401908      0.1231
GO         545823063     545823063     716280      0.0013        211701947     176428      0.0008        955573        69334       0.0726
Anagram    25729033      25729033      504         0.00000       9295564       18357       0.002         24107         5252        0.2179
Calculating CPI:
CPI = 1 + (L1 miss stall cycles per instruction) + (L2 miss stall cycles per instruction)
Benchmark  CPI
CC1        1.06634195
GO         1.01313547
ANAGRAM    1.01177939
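As a rough illustration of how miss counts and penalties combine into a CPI estimate, the sketch below adds stall cycles per instruction to the base CPI of 1. It assumes every L1 miss (instruction or data) costs the 5-cycle penalty and every L2 miss the 40-cycle penalty. This accounting is an assumption and not necessarily the exact expression behind the values reported above, so its output may differ somewhat from the table. The names `Counts` and `cpi` are hypothetical.

```python
from dataclasses import dataclass

L1_MISS_PENALTY = 5   # cycles, from the "Given" list
L2_MISS_PENALTY = 40  # cycles, from the "Given" list
BASE_CPI = 1.0        # cache hit / instruction execution = 1 cycle


@dataclass
class Counts:
    """Simulator counters for one benchmark run (hypothetical container)."""
    instructions: int
    il1_misses: int
    dl1_misses: int
    l2_misses: int


def cpi(c: Counts) -> float:
    """Base CPI plus stall cycles per instruction (assumed accounting)."""
    l1_stalls = (c.il1_misses + c.dl1_misses) * L1_MISS_PENALTY
    l2_stalls = c.l2_misses * L2_MISS_PENALTY
    return BASE_CPI + (l1_stalls + l2_stalls) / c.instructions


# Counter values copied from the Anagram row of the table.
anagram = Counts(instructions=25729033, il1_misses=504,
                 dl1_misses=18357, l2_misses=5252)
print(f"Anagram CPI under this accounting: {cpi(anagram):.5f}")
```

The same function can be applied to the CC1 and GO rows by substituting their counters.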
Part3: Optimize CPI for each Benchmark

Procedure followed for each Benchmark[a]
Below are the parameters and their corresponding combinations that were selected for running
the optimization.
L1 Cache: 128KB
L2 Cache : 1MB
Block size: 128B, 64B, 32B
Associativity: 8, 2, 1
Block Replacement Policy: LRU, FIFO and Random
Nsets can then be calculated using the formula:
Nsets = Cache size / (Block size x Associativity)
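The number of sets follows from the relation cache size = sets x block size x associativity. A minimal sketch, assuming sizes are given in bytes; the helper name `nsets` is hypothetical:

```python
KB = 1024


def nsets(cache_bytes: int, block_bytes: int, assoc: int) -> int:
    """Number of sets for a cache of the given capacity, block size and associativity."""
    sets, remainder = divmod(cache_bytes, block_bytes * assoc)
    if remainder:
        raise ValueError("cache size must be a multiple of block size x associativity")
    return sets


# 128KB L1 and 1MB L2 with 128-byte blocks and 8-way associativity
print(nsets(128 * KB, 128, 8))   # 128
print(nsets(1024 * KB, 128, 8))  # 1024
```

These two values correspond to the configuration written as L1:128:128:8:l_L2:1024:128:8:l in the notation used in this report.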
In addition, we chose three combinations of L1 cache with L2 cache for each of the three
benchmarks as shown below:
L1 separate, L2 separate
L1 separate, L2 unified
L1 unified, L2 unified
Data was collected for all these parameters using a Python script (it can be provided on
request but has not been included, as it wasn't listed as one of the deliverables). The script was
run for all the above parameters for each of the benchmarks to obtain data for the three L1 and
L2 cache combinations listed above.
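The original script is not part of this report. As a hedged sketch of how such a sweep could enumerate its configurations, the code below builds the 27 combinations of block size, associativity and replacement policy for one L1/L2 capacity pair, using the same block size, associativity and policy at both levels. All names (`label`, `sweep`) are hypothetical, and running the simulator itself is left out.

```python
from itertools import product

KB = 1024
BLOCK_SIZES = (128, 64, 32)   # bytes
ASSOCS = (8, 2, 1)
POLICIES = ("l", "f", "r")    # LRU, FIFO, Random


def label(l1_bytes: int, l2_bytes: int, block: int, assoc: int, policy: str) -> str:
    """Build a configuration label of the form L1:sets:block:assoc:policy_L2:..."""
    l1_sets = l1_bytes // (block * assoc)
    l2_sets = l2_bytes // (block * assoc)
    return (f"L1:{l1_sets}:{block}:{assoc}:{policy}"
            f"_L2:{l2_sets}:{block}:{assoc}:{policy}")


def sweep(l1_bytes: int = 128 * KB, l2_bytes: int = 1024 * KB) -> list[str]:
    """All block size x associativity x policy combinations for one capacity pair."""
    return [label(l1_bytes, l2_bytes, b, a, p)
            for b, a, p in product(BLOCK_SIZES, ASSOCS, POLICIES)]


configs = sweep()
print(len(configs))
print(configs[0])
```

Calling `sweep(64 * KB, 512 * KB)` would produce labels for a smaller capacity pair in the same way.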
Below are the plots showing the CPI for each of the benchmarks for all three above listed cache
configurations for L1 and L2 caches. The chart titles specify the benchmark and also the L1 and
L2 cache combination. The y-axis shows the CPI and the x-axis shows the L1 and L2
configuration corresponding to each and every CPI.
The data in these plots have been arranged to show the least CPI first and then follow an
ascending order reaching the highest CPI. The CPIs are easily traceable by looking at the
corresponding point on the x-axis which would give information on L1 and L2 cache
configurations pertaining to the CPI under observation.
Please note that the data tables and the graphs will follow the following notation:
L1:(No. of sets):(block size):(associativity):(replacement policy)_L2:(No. of sets):(block
size):(associativity):(replacement policy)
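To make the notation concrete, the sketch below splits a label into its fields and recovers the capacity of each level as sets x block size x associativity. The class and function names are hypothetical helpers, not part of the simulator.

```python
from typing import NamedTuple

POLICY_NAMES = {"l": "LRU", "f": "FIFO", "r": "Random"}


class Cache(NamedTuple):
    nsets: int
    block: int
    assoc: int
    policy: str

    @property
    def size_bytes(self) -> int:
        """Capacity implied by the three numeric fields."""
        return self.nsets * self.block * self.assoc


def parse_level(text: str) -> Cache:
    """Parse one level, e.g. 'L1:128:128:8:l'."""
    _name, nsets, block, assoc, policy = text.split(":")
    return Cache(int(nsets), int(block), int(assoc), policy)


def parse_label(label: str) -> tuple[Cache, Cache]:
    """Parse a full 'L1:..._L2:...' label into (L1, L2)."""
    l1_text, l2_text = label.split("_")
    return parse_level(l1_text), parse_level(l2_text)


l1, l2 = parse_label("L1:128:128:8:l_L2:1024:128:8:l")
print(l1.size_bytes // 1024, "KB", POLICY_NAMES[l1.policy])  # 128 KB LRU
print(l2.size_bytes // 1024, "KB")                           # 1024 KB
```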
Plots for benchmarks: GCC, Anagram, GO

GCC benchmark
Figure: GCC Benchmark: CPI - versus - L1separate_L2separate cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Figure: GCC Benchmark: CPI - versus - L1separate_L2unified cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Figure: GCC Benchmark: CPI - versus - L1unified_L2unified cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Taken together, the plots above show that a unified L1 cache with a unified L2 cache gives the
best possible CPI.
More of the data analysis is done in the following section.

Anagram benchmark
Figure: Anagram Benchmark: CPI - versus - L1separate_L2separate cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Figure: Anagram Benchmark: CPI - versus - L1separate_L2unified cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Figure: Anagram Benchmark: CPI - versus - L1unified_L2unified cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Taken together, the plots above show that a unified L1 cache with a unified L2 cache gives the
best possible CPI.
GO benchmark

Figure: GO Benchmark: CPI - versus - L1separate_L2separate cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Figure: GO Benchmark: CPI - versus - L1separate_L2unified cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Figure: GO Benchmark: CPI - versus - L1unified_L2unified cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Taken together, the plots above show that a unified L1 cache with a unified L2 cache gives the
best possible CPI.
Data tables for lowest CPIs for benchmarks: GCC, Anagram, GO

The data below contain the six lowest CPI values for each of the three L1 and L2 cache
configurations, for each of the benchmarks. They are listed in the tables below along
with their corresponding charts.
GCC benchmark

L1 and L2 cache config            CPI          Cache organization
L1:128:128:8:l_L2:1024:128:8:l    1.001753557  L1unified_L2unified
L1:128:128:8:f_L2:1024:128:8:f    1.001758036  L1unified_L2unified
L1:128:128:8:r_L2:1024:128:8:r    1.002152616  L1unified_L2unified
L1:256:64:8:l_L2:2048:64:8:l      1.002231389  L1unified_L2unified
L1:256:64:8:f_L2:2048:64:8:f      1.002353557  L1unified_L2unified
L1:256:64:8:r_L2:2048:64:8:r      1.002748137  L1unified_L2unified
L1:64:128:8:l_L2:512:128:8:l      1.026271362  L1separate_L2separate
L1:128:64:8:l_L2:1024:64:8:l      1.028479706  L1separate_L2separate
L1:64:128:8:f_L2:512:128:8:f      1.028799969  L1separate_L2separate
L1:64:128:8:r_L2:512:128:8:r      1.028815025  L1separate_L2separate
L1:128:64:8:f_L2:1024:64:8:f      1.031192022  L1separate_L2separate
L1:128:64:8:r_L2:1024:64:8:r      1.03282593   L1separate_L2separate
L1:64:128:8:l_L2:1024:128:8:l     1.034493517  L1separate_L2unified
L1:64:128:8:f_L2:1024:128:8:f     1.039331023  L1separate_L2unified
L1:64:128:8:r_L2:1024:128:8:r     1.040554121  L1separate_L2unified
L1:128:64:8:l_L2:2048:64:8:l      1.040779169  L1separate_L2unified
L1:256:128:2:l_L2:4096:128:2:l    1.045580361  L1separate_L2unified
L1:128:64:8:f_L2:2048:64:8:f      1.046203326  L1separate_L2unified
*listed in ascending order
Figure: GCC benchmark plot of CPI (y-axis: cycles per instruction) versus L1 and L2 cache configuration (x-axis) for the values listed above.

ANAGRAM benchmark

L1 and L2 cache config            CPI          Cache organization
L1:128:128:8:l_L2:1024:128:8:l    1.000004509  L1unified_L2unified
L1:512:128:2:l_L2:4096:128:2:l    1.000004509  L1unified_L2unified
L1:256:64:8:l_L2:2048:64:8:l      1.000006355  L1unified_L2unified
L1:1024:64:2:l_L2:8192:64:2:l     1.000006355  L1unified_L2unified
L1:512:32:8:l_L2:4096:32:8:l      1.000008563  L1unified_L2unified
L1:2048:32:2:l_L2:16384:32:2:l    1.000008563  L1unified_L2unified
L1:512:128:1:l_L2:8192:128:1:l    1.000321074
L1:512:128:1:f_L2:8192:128:1:f    1.000321074
L1:512:128:1:r_L2:8192:128:1:r    1.000321074
L1:512:128:1:l_L2:4096:128:1:l    1.000333232
L1:512:128:1:f_L2:4096:128:1:f    1.000333232
L1:512:128:1:r_L2:4096:128:1:r    1.000333232
L1:1024:64:1:l_L2:16384:64:1:l    1.000342179
L1:1024:64:1:f_L2:16384:64:1:f    1.000342179
L1:1024:64:1:r_L2:16384:64:1:r    1.000342179
L1:1024:64:1:l_L2:8192:64:1:l     1.000348435
L1:1024:64:1:f_L2:8192:64:1:f     1.000348435
L1:1024:64:1:r_L2:8192:64:1:r     1.000348435
Figure: Anagram Benchmark: CPI - versus - L1_L2 cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

GO benchmark

L1 and L2 cache config            CPI       Cache organization
L1:512:128:2:l_L2:4096:128:2:l    1.000804  L1unified_L2unified
L1:1024:128:1:l_L2:8192:128:1:l   1.000816  L1unified_L2unified
L1:1024:128:1:f_L2:8192:128:1:f   1.000816  L1unified_L2unified
L1:1024:128:1:r_L2:8192:128:1:r   1.000816  L1unified_L2unified
L1:512:128:2:f_L2:4096:128:2:f    1.000855  L1unified_L2unified
L1:512:128:2:r_L2:4096:128:2:r    1.000892  L1unified_L2unified
L1:64:128:8:r_L2:512:128:8:r      1.004668
L1:64:128:8:l_L2:512:128:8:l      1.004956
L1:64:128:8:f_L2:512:128:8:f      1.005102
L1:256:128:2:l_L2:2048:128:2:l    1.005156
L1:256:128:2:r_L2:2048:128:2:r    1.005336
L1:256:128:2:f_L2:2048:128:2:f    1.005828
L1:64:128:8:r_L2:1024:128:8:r     1.008772
L1:64:128:8:l_L2:1024:128:8:l     1.008818
L1:64:128:8:f_L2:1024:128:8:f     1.009194
L1:256:128:2:l_L2:4096:128:2:l    1.009255
L1:256:128:2:r_L2:4096:128:2:r    1.009582
L1:256:128:2:f_L2:4096:128:2:f    1.010047
Figure: GO Benchmark: CPI - versus - L1_L2 cache config (y-axis: cycles per instruction; x-axis: L1 and L2 cache configuration)

Data Analysis
From analysis of this data, L1 unified with L2 unified gives the
lowest CPI values for all three benchmarks.
The trend followed by the L1 and L2 cache configurations for the three benchmarks, going
from lower CPI to higher CPI, is:
GCC Benchmark: L1 unified, L2 unified first
ANAGRAM Benchmark: L1 unified, L2 unified first, then the remaining configurations alternate (back and forth)
GO Benchmark: L1 unified, L2 unified first
We can now reach the following conclusions for the most optimized CPI:
L1 unified and L2 unified for all three benchmarks
Block size of 128 for all three benchmarks
Associativity of 8 for GCC and Anagram; associativity of 2 for GO
Replacement policy is LRU ("l") for all three benchmarks
GCC Benchmark (selected cache config for the best six CPIs)
GCC                               CPI
L1:128:128:8:l_L2:1024:128:8:l    1.001753557
L1:128:128:8:f_L2:1024:128:8:f    1.001758036
L1:128:128:8:r_L2:1024:128:8:r    1.002152616
L1:256:64:8:l_L2:2048:64:8:l      1.002231389
L1:256:64:8:f_L2:2048:64:8:f      1.002353557
L1:256:64:8:r_L2:2048:64:8:r      1.002748137
From these values we infer that, going from the lowest to the highest CPI, the ordering is
governed by block size->replacement->associativity
For example:
128:8:l will have a better CPI than 128:8:f
128:8:f will have a better CPI than 128:8:r
128:8:r will have a better CPI than 64:8:l
64:8:l will have a better CPI than 64:8:f
Notation:
Block_size:associativity:replacement_policy
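The ordering described for GCC can be checked mechanically. The sketch below, using the six GCC values from this report, sorts the configurations once by CPI and once by the key (larger block size first, then LRU before FIFO before Random) and compares the two orders. The helper `trend_key` and the policy ranking table are hypothetical names introduced for illustration.

```python
POLICY_RANK = {"l": 0, "f": 1, "r": 2}  # LRU, then FIFO, then Random

# Best six GCC configurations and CPIs, deliberately listed out of order.
gcc_best = [
    ("L1:256:64:8:r_L2:2048:64:8:r", 1.002748137),
    ("L1:128:128:8:f_L2:1024:128:8:f", 1.001758036),
    ("L1:256:64:8:l_L2:2048:64:8:l", 1.002231389),
    ("L1:128:128:8:r_L2:1024:128:8:r", 1.002152616),
    ("L1:128:128:8:l_L2:1024:128:8:l", 1.001753557),
    ("L1:256:64:8:f_L2:2048:64:8:f", 1.002353557),
]


def trend_key(label: str) -> tuple[int, int]:
    """Sort key: larger block size first, then replacement policy rank."""
    _name, _nsets, block, _assoc, policy = label.split("_")[0].split(":")
    return (-int(block), POLICY_RANK[policy])


by_cpi = [label for label, _cpi in sorted(gcc_best, key=lambda row: row[1])]
by_trend = sorted((label for label, _cpi in gcc_best), key=trend_key)
print(by_cpi == by_trend)
```

Associativity does not appear in the key because all six of these configurations are 8-way.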
ANAGRAM Benchmark (selected cache config for the best six CPIs)
Anagram                           CPI
L1:128:128:8:l_L2:1024:128:8:l    1.000004509
L1:512:128:2:l_L2:4096:128:2:l    1.000004509
L1:256:64:8:l_L2:2048:64:8:l      1.000006355
L1:1024:64:2:l_L2:8192:64:2:l     1.000006355
L1:512:32:8:l_L2:4096:32:8:l      1.000008563
L1:2048:32:2:l_L2:16384:32:2:l    1.000008563
The CPI follows a trend, from the lowest to the highest CPI values, governed by
associativity->block size->replacement
For example:
128:8:l will have a better CPI than 128:2:l
128:2:l will have a better CPI than 64:8:l
Notation:
Block_size:associativity:replacement_policy
GO Benchmark (selected cache config for the best six CPIs)

GO                                CPI
L1:512:128:2:l_L2:4096:128:2:l    1.000804
L1:1024:128:1:l_L2:8192:128:1:l   1.000816
L1:1024:128:1:f_L2:8192:128:1:f   1.000816
L1:1024:128:1:r_L2:8192:128:1:r   1.000816
L1:512:128:2:f_L2:4096:128:2:f    1.000855
L1:512:128:2:r_L2:4096:128:2:r    1.000892
The CPI follows a trend, from the lowest to the highest CPI values, governed by
replacement->associativity->block size
For example:
128:1:l will have a better CPI than 128:1:f
Notation:
Block_size:associativity:replacement_policy
However, for the optimized CPI, we select the configuration with the lowest CPI
among the six lowest CPI values listed above.
The choices below have been made to obtain the lowest possible CPI, which gives the
best possible performance under the assumptions listed at the start of Part3 of this report[a].
CPI is a function of block size, associativity and replacement policy. We also consider the L1 and
L2 cache configuration combinations. As can be seen in the table below, a unified L1 and unified
L2 cache, a bigger block size, higher associativity and the LRU replacement policy help us satisfy
the objective of this part.
The most optimized L1 and L2 configuration is chosen to be:
L1 unified and L2 unified
               GCC                     Anagram                 GO
Cache Config   L1 unified, L2 unified  L1 unified, L2 unified  L1 unified, L2 unified
Block size     128                     128                     128
Associativity  8                       8                       2
Replacement    "l"                     "l"                     "l"
However, there are tradeoffs for considering the best possible CPI. We will look into this in
Part4 of this report.
Part 4: Define Cost Function

Since we are given fixed L1 and L2 cache sizes, we consider the cost to be a function of:
Associativity:
Cost increases as the associativity is increased. Therefore, we assume that the cost to
implement 8-way associativity is higher than the cost of 2-way associativity, which in turn
is higher than the cost of 1-way associativity.
Replacement policy:
We assume a 2% cost increase from Random to FIFO and a 5% increase from FIFO to LRU.
L1, L2 cache configuration:
- Unified or Separate:
Separate instruction and data caches cost more than a unified cache because of
the hardware overhead of the separate caches.
- L1 is more expensive than L2:
Since the L1 cache is internal, L1 is taken to be more expensive than the L2
cache.
The above choices together define the cost function for the caches in
terms of area overhead and performance.
Data supporting the above factors influencing the cost function follows in Part5 of
this report.
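The factors above can be combined into a single number in many ways. The sketch below is one possible shape for such a cost function, not the one used to produce the cost values in Part5. Only the 2% and 5% replacement-policy steps come from this report (read here as multiplicative); the associativity weights, the separate-cache overhead and the L1/L2 weights are hypothetical placeholders chosen only to respect the stated ordering.

```python
# Hypothetical weights: only their ordering is taken from the report.
ASSOC_WEIGHT = {1: 1.0, 2: 2.0, 8: 4.0}
SPLIT_OVERHEAD = 1.10          # hypothetical extra cost of separate I/D caches
L1_WEIGHT, L2_WEIGHT = 2.0, 1.0  # hypothetical: L1 costs more than L2

# From the report: +2% from Random to FIFO, +5% from FIFO to LRU.
POLICY_FACTOR = {"r": 1.00, "f": 1.02, "l": 1.02 * 1.05}


def level_cost(assoc: int, policy: str, split: bool, level_weight: float) -> float:
    """Relative cost of one cache level."""
    cost = level_weight * ASSOC_WEIGHT[assoc] * POLICY_FACTOR[policy]
    return cost * SPLIT_OVERHEAD if split else cost


def cache_cost(assoc: int, policy: str, l1_split: bool, l2_split: bool) -> float:
    """Relative cost of an L1 + L2 hierarchy sharing associativity and policy."""
    return (level_cost(assoc, policy, l1_split, L1_WEIGHT)
            + level_cost(assoc, policy, l2_split, L2_WEIGHT))


for assoc in (1, 2, 8):
    print(assoc, round(cache_cost(assoc, "l", l1_split=False, l2_split=False), 3))
```

With different placeholder weights the absolute numbers change, but the orderings described in this part are preserved as long as the weights keep the same ranking.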
Part 5: Optimize Cache for Performance/Cost

The factors that improve CPI also trade off against cost, as stated in Part4
of this report. The following observations have been made from the analyzed data:
- As the cost increases, the CPI improves, getting closer to CPI ideal = 1, since better
performance is obtained by using higher associativity, a better replacement policy and a
better cache configuration.
- It is observed from the plot (Fig. 1) that as the cost increases, the configurations give CPI
values that are closer to CPI ideal = 1 than when the cost is at the lower end. This is also
depicted by the narrowing of the CPI curve as the cost increases.
- The stability also increases with the use of a cache configuration that gives a desirable
CPI, but nothing comes for free, so we end up paying a higher cost.
Hence we choose values in the right half of the plots, which give a better CPI.
GCC Benchmark
The plot (Fig. 1) shows the GCC benchmark, with cost on the left vertical axis and CPI on the
right vertical axis. Both are plotted against a common x-axis, which shows the L1 and L2
cache configuration.
The arrows on the plot show the chosen CPIs and cache configurations for optimized cost.
Fig. 1
The configuration chosen for cost optimization along with good performance is shown
below with its CPI, cost and cache configuration. It was chosen because it gives a good CPI
value for its cost. A lower-CPI alternative is shown in Fig. 3, but it has a cost of 49.5,
which means a relatively high increase in cost for a small change in CPI toward
CPI ideal. Hence we choose the configuration in Fig. 2 as the optimum.
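Benchmark  L1 and L2 cache config            CPI          Cost  Cache organization
GCC        L1:1024:64:2:l_L2:8192:64:2:l     1.003789425  27    L1unified_L2unified
Fig. 2 (Optimal Configuration)

Benchmark  L1 and L2 cache config            CPI          Cost  Cache organization
GCC        L1:128:128:8:f_L2:1024:128:8:f    1.001758036  49.5  L1unified_L2unified
Fig. 3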
GCC Optimal Configuration: L1 unified, L2 unified Cache; Block size: 64, Associativity: 2,
Replacement Policy: LRU
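The trade-off behind this choice can be put in numbers. The sketch below compares the two GCC candidates (Fig. 2 and Fig. 3) by how much CPI improvement each extra unit of cost buys; the `Option` class and `marginal_gain` helper are hypothetical names, while the CPI and cost values are the ones reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    label: str
    cpi: float
    cost: float


def marginal_gain(cheap: Option, costly: Option) -> float:
    """CPI reduction obtained per extra unit of cost when moving to the costlier option."""
    return (cheap.cpi - costly.cpi) / (costly.cost - cheap.cost)


fig2 = Option("L1:1024:64:2:l_L2:8192:64:2:l", 1.003789425, 27)
fig3 = Option("L1:128:128:8:f_L2:1024:128:8:f", 1.001758036, 49.5)

extra_cost_pct = 100 * (fig3.cost - fig2.cost) / fig2.cost
cpi_gain_pct = 100 * (fig2.cpi - fig3.cpi) / fig2.cpi

print(f"extra cost: {extra_cost_pct:.1f}%")
print(f"CPI improvement: {cpi_gain_pct:.2f}%")
print(f"CPI gained per unit cost: {marginal_gain(fig2, fig3):.2e}")
```

Roughly 83% more cost for about a 0.2% lower CPI is the kind of imbalance that motivates keeping the cheaper configuration.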
Anagram Benchmark
which means a relatively high increase in cost for a small change in CPI value that guarantees
CPI to be closer to CPI ideal. Hence we choose the optimized CPI to be in the one in Fig 4.
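Benchmark  L1 and L2 cache config            CPI          Cost  Cache organization
Anagram    L1:1024:64:2:l_L2:8192:64:2:l     1.000006355  27    L1unified_L2unified
Fig. 4

Benchmark  L1 and L2 cache config            CPI          Cost  Cache organization
Anagram    L1:512:32:8:r_L2:4096:32:8:r      1.000008519  48    L1unified_L2unified
Fig. 5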
ANAGRAM Optimal Configuration: L1 unified, L2 unified Cache; Block size: 64,
Associativity: 2, Replacement Policy: LRU
GO Benchmark
which means a relatively high increase in cost for a small change in CPI value that guarantees
CPI to be closer to CPI ideal. Hence we choose the optimized CPI to be in the one in Fig 6.
Benchmark  L1 and L2 cache config            CPI          Cost
GO         L1:1024:64:2:l_L2:8192:64:2:l     1.001065804  27
Fig. 6

Benchmark  L1 and L2 cache config            CPI          Cost
GO         L1:128:128:8:r_L2:1024:128:8:r    1.001156244  48
Fig. 7
GO Optimal Configuration: L1 separate, L2 unified Cache; Block size: 64, Associativity: 2,
Replacement Policy: LRU
Below is a table comparing all of the optimized cost and performance based results from the
three benchmarks.
OPTIMAL CONFIGURATION
Among the three benchmarks, Anagram achieves the best (lowest) CPI at the optimized
cost (Fig. 8).
Benchmark  L1 and L2 cache config            CPI          Cost  Cache organization
GCC        L1:1024:64:2:l_L2:8192:64:2:l     1.003789425  27    L1unified_L2unified
Anagram    L1:1024:64:2:l_L2:8192:64:2:l     1.0000064    27    L1unified_L2unified
GO         L1:1024:64:2:l_L2:8192:64:2:l     1.001065804  27
Fig. 8
The best configuration for all the benchmarks, considered in terms of average CPI, is
shown in Fig. 10, as asked in the question.
Average CPI   1.001620543
Average Cost  27
Fig. 9
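The averages in Fig. 9 follow directly from the three rows of Fig. 8. A minimal sketch of that arithmetic, with the dictionary name `fig8` chosen here for illustration:

```python
from statistics import mean

# (CPI, cost) per benchmark, as listed in Fig. 8.
fig8 = {
    "GCC": (1.003789425, 27),
    "Anagram": (1.0000064, 27),
    "GO": (1.001065804, 27),
}

avg_cpi = mean(cpi for cpi, _cost in fig8.values())
avg_cost = mean(cost for _cpi, cost in fig8.values())

print(f"Average CPI:  {avg_cpi:.9f}")
print(f"Average Cost: {avg_cost}")
```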
Optimal Configurations:
Optimal Config for all Benchmarks (in terms of avg CPI)
GCC      L1:1024:64:2:l_L2:8192:64:2:l
Anagram  L1:1024:64:2:l_L2:8192:64:2:l
GO       L1:1024:64:2:l_L2:8192:64:2:l
Fig. 10
References
- Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy & David A. Patterson
- http://datasheets.chipdb.org/DEC/21264/21264-Alpha.pdf
- http://www.eecs.umich.edu/~taustin/eecs573_public/instruct-progs.tar.gz
- www.simplescalar.com
