
REAL-TIME PRESENTATION

An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs
Wutikuer ADILIJIANG & George RAZERA

IESE5
01/2017
INTRODUCTION

TILED ARCHITECTURES

A HARDWARE/OS SCHEME TO AVOID CACHE INCOHERENCE

EVALUATION SETUP

EXPERIMENTAL RESULTS

CONCLUSION
INTRODUCTION
Tiled CMPs
Tiled chip multiprocessors

Cache
A fast and small memory unit
Problem
The delay or area overheads become too high in CMPs with more than 16 cores at 65 nm and smaller process nodes.
INTRODUCTION
- Software/hardware mechanism
- Supports shared-memory parallel applications
- Forgoes hardware-maintained cache coherence

In current CMPs, cache coherence for parallel applications relies on:
- Buses and crossbars
  A bottleneck as the number of cores increases
- Fully distributed directory coherence protocols
  Access latencies increase
  The required area increases
  Hard to implement and verify
TILED ARCHITECTURES
A baseline architecture
- 32 processors
- Processor element (PE)
- L1 cache
- Remote cache access controller (RAC)

Avoids the possibility of incoherence by not allowing multiple modifiable shared copies of data.
Treats all L1s as a single logical cache to avoid replication of data.
TILED ARCHITECTURES
Every memory line can reside in only one L1 cache; processors in other tiles must perform remote cache reads and writes to access it.

Instead of trying to keep the L1 caches coherent, the scheme avoids duplicate copies of a single cache line.

Extends the traditional page table with a new table that maps virtual pages to architectural tiles.
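As an illustration of that extra table, here is a minimal sketch in C of what a single entry could hold, assuming one entry per virtual page; the type and field names (map_entry_t, home_tile, read_only) are hypothetical and not taken from the presented scheme.

/* Hypothetical sketch of one MAP entry: shaped like a page-table entry,
 * but it records the home tile of a virtual page rather than a frame. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t vpn;        /* virtual page number described by this entry        */
    uint8_t  home_tile;  /* tile whose L1 is allowed to hold lines of the page */
    bool     read_only;  /* page is still in the read-only sharing state       */
    bool     valid;      /* entry has been filled in by the OS                 */
} map_entry_t;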
TILED ARCHITECTURES

Data Placement and Remote Cache Accesses

Virtual address:
- index the local L1 cache
- perform a local TLB lookup to obtain the physical address
- perform a local MAP lookup to obtain the identity of the home tile
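A minimal sketch in C of the lookup sequence listed above, assuming the MAP lookup simply returns a home-tile id; MY_TILE_ID, tlb_lookup, map_lookup, l1_access and remote_cache_access are hypothetical names, and in hardware the lookups proceed in parallel rather than one after the other.

#include <stdint.h>

extern uint8_t  MY_TILE_ID;                         /* id of the local tile        */
extern uint64_t tlb_lookup(uint64_t vaddr);         /* virtual -> physical address */
extern uint8_t  map_lookup(uint64_t vaddr);         /* virtual page -> home tile   */
extern uint64_t l1_access(uint64_t paddr);          /* access the local L1         */
extern uint64_t remote_cache_access(uint8_t tile, uint64_t paddr); /* via the RAC  */

uint64_t load(uint64_t vaddr)
{
    uint64_t paddr = tlb_lookup(vaddr);   /* local TLB: physical address   */
    uint8_t  home  = map_lookup(vaddr);   /* local MAP: home tile identity */

    if (home == MY_TILE_ID)
        return l1_access(paddr);          /* the line may only live here   */

    return remote_cache_access(home, paddr); /* RAC request to the home tile */
}

If the MAP lookup names the local tile, the access completes in the local L1; otherwise the RAC forwards the request to the home tile, which is how remote cache reads and writes are performed.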
TILED ARCHITECTURES

Write
- The first processor to write to a page: the OS intercepts the write and marks that processor as the owner
- Subsequent writes/reads without a local mapping: the OS generates an entry pointing to the owner node
- The OS does not need to keep track of which processors are sharing the page

Read (read-only data sharing)
- The first processor to touch a page obtains a mapping while the OS marks the page as read-only
- Subsequent reads by other processors with existing mappings are allowed
- Other processors touching the page are allowed to create a local mapping

Most of the state transitions occur only at the OS level, and the hardware state machine is fairly simple.
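A rough sketch in C of those OS-level transitions, assuming the OS is invoked whenever a processor touches a page for which it has no local MAP entry; the states, the page_info type and the map_insert helper are hypothetical, and invalidation of stale read-only mappings on the first write is left out of this sketch.

#include <stdint.h>
#include <stdbool.h>

enum page_state { UNTOUCHED, READ_ONLY, OWNED };

struct page_info {
    enum page_state state;
    uint8_t         owner_tile;   /* meaningful once state == OWNED */
};

/* create a local MAP entry on `tile` pointing at `home` */
extern void map_insert(uint8_t tile, uint64_t vpn, uint8_t home);

void on_map_fault(struct page_info *pg, uint64_t vpn,
                  uint8_t faulting_tile, bool is_write)
{
    switch (pg->state) {
    case UNTOUCHED:                       /* first touch of the page           */
        pg->owner_tile = faulting_tile;
        pg->state = is_write ? OWNED : READ_ONLY;
        map_insert(faulting_tile, vpn, faulting_tile);
        break;

    case READ_ONLY:
        if (is_write) {                   /* OS intercepts the first writer    */
            pg->state = OWNED;            /* and marks it as the owner         */
            pg->owner_tile = faulting_tile;
            map_insert(faulting_tile, vpn, faulting_tile);
        } else {                          /* read-only sharing: readers may    */
            map_insert(faulting_tile, vpn, faulting_tile); /* map it locally   */
        }
        break;

    case OWNED:                           /* later accesses without a mapping  */
        map_insert(faulting_tile, vpn, pg->owner_tile); /* point to the owner  */
        break;
    }
}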
EVALUATION SETUP
Performance analysis: Splash-2 and ALPBench benchmarks.
The benchmarks were compiled with gcc 3.4.4 and glibc 2.3.5 for PowerPC.
Liberty Simulation Environment (LSE) simulator
EXPERIMENTAL RESULTS
We start by comparing the overall performance of our architecture against the hardware distributed directory system (Dir-Coh).

Average efficiency of 81% for Dir-Coh (speedup divided by the number of processors).
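For reference, the efficiency quoted above is just the speedup normalized by the processor count; as an illustrative calculation (not a figure from the slides), 81% efficiency on 16 processors corresponds to a speedup of roughly 13:

\[
  \text{efficiency} = \frac{\text{speedup}}{P}, \qquad 0.81 \times 16 \approx 13 .
\]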


EXPERIMENTAL RESULTS
To better understand the behavior of the system, we track the outcome of each processor memory request.
EXPERIMENTAL RESULTS
Local, remote and average latencies

Network and contention effects.


EXPERIMENTAL RESULTS

Speedup results
Conclusion

Proposal of a novel, cost-effective software/hardware mechanism to support shared memory

Evaluation on the Splash-2 and ALPBench benchmarks

Two mechanisms to perform page migration and the sharing of read-only data

Performance within 16% on average for 16 and 32 processors.
Thank You For Your Attention!
Questions?

Wutikuer ADILIJIANG & George RAZERA

IESE5
01/2017