
Department of

Computer Science and Engineering

Course Code: 10211CS129
Course Title: Modern Computer Architecture
L-T-P-C: 3-0-0-3

Category : Program Core


Unit-2 Multi Core Architecture

School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of
Science and Technology
Course Outcomes

CO No.  Course Outcome                                              K-Level
CO2     Analyze different computer architectures and applications.  K3

Knowledge Level (Based on revised Bloom’s Taxonomy)


K1-Remember K2-Understand K3-Apply K4-Analyze K5-Evaluate K6-Create



TextBooks

1. Hamacher, V.C., Vranesic, Z.G., and Zaky, S.G., "Computer Organization", 5/e, McGraw-Hill, 2014. [Units 1, 2]

2. Patterson, D.A., and Hennessy, J.L., "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers, Inc., 2017. [Units 3, 4, 5]



Unit 2 Multi Core Architecture 9 Hours

Memory technologies, hierarchical memory systems, the locality principle and caching, direct-mapped caches, block size, cache conflicts, associative caches, write strategies, advanced optimizations, performance improvement techniques, DRAM – organization, access techniques, scheduling algorithms and signal systems. Tiled Chip Multicore Processors (TCMP), Network on Chip (NoC), NoC router – architecture, design, routing algorithms and flow control techniques, Advanced topics in NoC and storage – compression, prefetching, QoS.
Memory Technologies
• Modern computer architectures incorporate three principal memory technologies: DRAM, SRAM, and magnetic storage media, including hard-disk drives and tapes.

• A fourth, non-volatile random access memory (NVRAM), is emerging as a technology sitting between DRAM and mass storage.
Memory Technologies
• The term memory hierarchy is used in computer architecture when discussing performance issues in computer architectural design, algorithm predictions, and lower-level programming constructs involving locality of reference.
• A "memory hierarchy" in computer storage distinguishes each level in the "hierarchy" by response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by the controlling technology.
• Memory is categorized into volatile and non-volatile memories; the former requires the system to be constantly powered ON to maintain stored data.
• Furthermore, a typical computer system provides a hierarchy of different types of memories for data storage.
HIERARCHICAL MEMORY
SYSTEMS
MEMORY HIERARCHY
Memory Technologies
• There is a capacity/performance/price gap between each pair of adjacent levels of storage.
• Level 0 to Level 3 storage devices are volatile memory subsystems which are accessed by the CPU directly. The Level 4 storage devices are classified as I/O devices.
• The CPU interacts with memory for two operations, i.e., READ or WRITE.
• READ is for getting either instructions or data (operands).
• WRITE is generally for writing results upon instruction execution.
• To access memory, the address of the memory location is required. This address is always loaded into the Memory Address Register (MAR) by the CPU.
• A READ or WRITE operation is always carried out on the location specified by the MAR.
• In the case of READ, the memory returns the data to the CPU, while in the case of WRITE the data to be written onto the memory location is supplied by the CPU.
Memory Technologies
• The communication of the address, the data, and the associated control signals takes place over the bus.
• A bus is a set of physical connections between two entities used for communication using electrical signals.
• This external bus has three components, namely:
(i) Address bus
(ii) Data bus
(iii) Control signals.
• The Memory Address Register (MAR) and the Memory Data Register (MDR) play an important role in this communication.
Memory Technologies
CPU Memory Communication Interface

 Data transfer rate or bandwidth is one of the measures of the


performance of the external bus between CPU and Memory.
 The maximum amount of information that can be transferred to or from
the memory per unit time is the data transfer rate or bandwidth and is
measured in bits or words per second.
CACHE MEMORY

 Cache Memory is a special very high-speed memory. The cache is a smaller and
faster memory that stores copies of the data from frequently used main memory
locations.

 There are various independent caches in a CPU, which store instructions and data.

 The most important use of cache memory is that it is used to reduce the average time
to access data from the main memory.
 Cache memory is faster than main memory.
 Cache memory, also called CPU memory, can be accessed more quickly than regular RAM.
 This memory is typically integrated directly with the CPU chip or placed on a separate chip that has a separate bus interconnect with the CPU.



CACHE MEMORY

Characteristics of Cache Memory

 Cache memory is an extremely fast memory type that acts as a buffer


between RAM and the CPU.
 Cache Memory holds frequently requested data and instructions so that
they are immediately available to the CPU when needed.
 Cache memory is costlier than main memory or disk memory but more
economical than CPU registers.
 Cache Memory is used to speed up and synchronize with a high-speed
CPU.



THE LOCALITY PRINCIPLE AND
CACHING
• Locality of reference refers to a phenomenon in which a computer program tends to access the same set of memory locations over a particular time period. In other words, locality of reference refers to the tendency of the computer program to access instructions whose addresses are near one another. The property of locality of reference is mainly exhibited by loops and subroutine calls in a program.
• In the case of loops, the central processing unit repeatedly refers to the set of instructions that constitute the loop.
• In the case of subroutine calls, the same set of instructions is fetched from memory every time the subroutine is invoked.
• References to data items also get localized, meaning the same data item is referenced again and again.
THE LOCALITY PRINCIPLE AND CACHING

[Figure: CPU reference – cache is checked first; main memory is accessed on a miss]

In the above figure, you can see that the CPU wants to read or fetch the data or
instruction. First, it will access the cache memory as it is near to it and provides very
fast access. If the required data or instruction is found, it will be fetched. This situation
is known as a cache hit. But if the required data or instruction is not found in the cache
memory then this situation is known as a cache miss.
THE LOCALITY PRINCIPLE AND CACHING

• The main memory will be searched for the required data or instruction, and if found, one of two things can happen:
• The first way is for the CPU to fetch the required data or instruction, use it, and discard it. But when the same data or instruction is required again, the CPU has to access the same main memory location for it, and we already know that main memory is the slowest level to access.
• The second way is to store the data or instruction in the cache memory so that if it is needed again in the near future it can be fetched in a much faster way.
THE LOCALITY PRINCIPLE AND CACHING

Cache Operation: It is based on the principle of locality of reference. There are two ways in which data or instructions are fetched from main memory and stored in cache memory. These two ways are the following:
Temporal Locality – Temporal locality means that the data or instruction being fetched now may be needed again soon. So we should store that data or instruction in the cache memory so that we can avoid searching main memory again for the same data.
THE LOCALITY PRINCIPLE AND CACHING

• Spatial Locality – Spatial locality means that instructions or data near the memory location currently being fetched may be needed in the near future. This is slightly different from temporal locality: here we are talking about nearby memory locations, while in temporal locality we were talking about the actual memory location that was being fetched.
THE LOCALITY PRINCIPLE AND CACHING

• Cache Performance: The performance of the cache is measured in terms of the hit ratio. When the CPU refers to memory and finds the data or instruction within the cache memory, it is known as a cache hit. If the desired data or instruction is not found in the cache memory and the CPU refers to main memory to find it, it is known as a cache miss.
• Hit + Miss = Total CPU References
• Hit Ratio (h) = Hit / (Hit + Miss)
• Miss Ratio = 1 - Hit Ratio (h)
• Miss Ratio = Miss / (Hit + Miss)
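As a quick check of these formulas, a minimal C sketch (the hit and miss counts are hypothetical values, not from the slides) computes the hit ratio and the miss ratio:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical counts of CPU references that hit or missed the cache. */
        double hits = 950.0, misses = 50.0;
        double total = hits + misses;          /* Hit + Miss = Total CPU References */

        double hit_ratio  = hits / total;      /* h = Hit / (Hit + Miss) */
        double miss_ratio = 1.0 - hit_ratio;   /* Miss Ratio = 1 - h     */

        printf("Hit ratio  = %.2f\n", hit_ratio);   /* 0.95 */
        printf("Miss ratio = %.2f\n", miss_ratio);  /* 0.05 */
        return 0;
    }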
DIRECT MAPPED CACHES

• There are three different types of mapping used for the purpose of cache memory, which are as follows:
• Direct Mapping
• Associative Mapping
• Set-Associative Mapping
DIRECT MAPPED CACHES

1. Direct Mapping
• The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line. In other words, in direct mapping each memory block is assigned to a specific line in the cache.

• If a line is already occupied by a memory block when a new block needs to be loaded, the old block is discarded. The address is split into two parts: an index field and a tag field.

• The tag field is stored in the cache along with the data, while the index field selects the cache line. Direct mapping's performance is directly proportional to the hit ratio.
DIRECT MAPPED CACHES

• For purposes of cache access, each main memory address can be viewed as consisting of three fields. The least significant w bits identify a unique word or byte within a block of main memory. In most contemporary machines, the address is at the byte level. The remaining s bits specify one of the 2^s blocks of main memory.
DIRECT MAPPED CACHES

• The cache logic interprets these s bits as a tag of (s - r) bits (the most significant portion) and a line field of r bits. This latter field identifies one of the m = 2^r lines of the cache. The line field (line offset) provides the index bits in direct mapping.
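As an illustration of this address breakdown, the following minimal C sketch extracts the tag, line (index), and word/byte offset fields from a 32-bit byte address. The field widths (64-byte blocks, 1024 cache lines) are assumptions chosen for the example, not values from the slides:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed geometry: 64-byte blocks (w = 6 offset bits), 1024 lines (r = 10 index bits). */
    #define OFFSET_BITS 6
    #define INDEX_BITS  10

    int main(void) {
        uint32_t addr = 0x1234ABCDu;   /* example byte address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                 /* w least significant bits   */
        uint32_t line   = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* r line-field (index) bits  */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* remaining (s - r) tag bits */

        printf("address 0x%08X -> tag 0x%X, line %u, offset %u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)line, (unsigned)offset);
        return 0;
    }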
DIRECT MAPPING STRUCTURE
ASSOCIATIVE MAPPING

• In this type of mapping, associative memory is used to store both the content and the address of the memory word. Any block can go into any line of the cache.

• This means that the word ID bits are used to identify which word in the block is needed, while the tag becomes all of the remaining bits. This enables the placement of any block at any place in the cache memory.

• It is considered to be the fastest and most flexible mapping form. In associative mapping, there are no index bits (the index field has zero length).
ASSOCIATIVE MAPPING

ASSOCIATIVE MAPPING STRUCTURE


SET ASSOCIATIVE MAPPING

• This form of mapping is an enhanced form of direct mapping in which the drawbacks of direct mapping are removed. Set-associative mapping addresses the problem of possible thrashing in the direct mapping method.
• It does this by saying that instead of having exactly one line that a block can map to in the cache, we will group a few lines together, creating a set.
• Then a block in memory can map to any one of the lines of a specific set. Set-associative mapping allows two or more blocks of main memory that share the same index to be present in the cache at the same time.
• Set-associative cache mapping combines the best of direct and associative cache mapping techniques. In set-associative mapping the index bits are given by the set offset bits.
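A minimal sketch of how a main memory block chooses its set: the set index comes from the set offset (index) bits, and the block may then occupy any way within that set. The number of sets and the associativity below are illustrative assumptions:

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_SETS 256   /* assumed number of sets */
    #define NUM_WAYS 4     /* assumed associativity  */

    int main(void) {
        uint32_t block = 70000;              /* example main memory block number     */
        uint32_t set   = block % NUM_SETS;   /* index (set offset) bits select a set */
        uint32_t tag   = block / NUM_SETS;   /* remaining bits form the tag          */

        printf("block %u -> set %u, tag %u (may be placed in any of %d ways)\n",
               (unsigned)block, (unsigned)set, (unsigned)tag, NUM_WAYS);
        return 0;
    }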
SET ASSOCIATIVE MAPPING

SET ASSOCIATIVE MAPPING STRUCTURE


BLOCK SIZE
• Block size is the unit of information exchanged between cache and main memory.
• As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality.
• There is a high probability that data in the vicinity of a referenced word will be referenced in the near future.
• As the block size increases, more useful data are brought into the cache.
• The hit ratio will begin to decrease, however, as the block becomes even larger and the probability of using the newly fetched data becomes less than the probability of reusing the data that have to be evicted from the cache to make room for the new block.
CACHE CONFLICT

• One of the notable features of MultiCore processors is that


threads will share a single cache at some level.
• There are two issues that can occur with shared caches: capacity
misses and conflict misses.
• A conflict cache miss is where one thread has caused data
needed by another thread to be evicted from the cache.
• The worst example of this is thrashing where multiple threads
each require an item of data and that item of data maps to the
same cache line for all the threads.
• Shared caches usually have sufficient associativity to avoid this
being a significant issue. However, there are certain attributes of
computer systems that tend to make this likely to occur.
WRITE STRATEGIES

• Whenever a processor wants to write a word, it checks whether the address it wants to write the data to is present in the cache or not. If the address is present in the cache, it is a write hit.
• We can update the value in the cache and avoid an expensive main memory access. But this results in the inconsistent data problem: as the cache and main memory now hold different data, it will cause problems when two or more devices share the main memory (as in a multiprocessor system).
• This is where Write Through and Write Back come into the picture.
WRITE STRATEGIES
• Write Through:

• In write-through, data is simultaneously updated in the cache and in memory. This process is simpler and more reliable. It is used when there are no frequent writes to the cache (the number of write operations is small).

• It helps in data recovery (in case of a power outage or system failure). A data write will experience latency (delay) as we have to write to two locations (both memory and cache). It solves the inconsistency problem, but it undermines the advantage of having a cache for write operations (as the whole point of using a cache was to avoid multiple accesses to main memory).
WRITE STRATEGIES
• Write Back:
• The data is updated only in the cache and written to memory at a later time. Data is updated in memory only when the cache line is about to be replaced (cache line replacement is done using policies such as Least Recently Used (LRU), FIFO, LIFO, and others, depending on the application).
Write Back is also known as Write Deferred.
• Dirty Bit: Each block in the cache needs a bit to indicate whether the data present in the cache has been modified (dirty) or not modified (clean). If it is clean, there is no need to write it back to memory. The scheme is designed to reduce write operations to memory.
If the cache fails, or if the system fails or power is lost, the modified data will be lost, because it is nearly impossible to restore data from the cache once it is lost.
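A minimal C sketch of the write-back idea with a dirty bit, assuming a tiny direct-mapped cache; the structure names and sizes are illustrative only, not the cache organisation of any particular machine:

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define LINES     4
    #define MEM_WORDS 64

    typedef struct {
        bool     valid;
        bool     dirty;   /* set when the cached copy differs from main memory */
        uint32_t tag;
        int      data;
    } cache_line;

    static int        mem[MEM_WORDS];
    static cache_line cache[LINES];

    /* Write a word: update only the cache and mark the line dirty (write-back). */
    void write_word(uint32_t addr, int value) {
        uint32_t index = addr % LINES, tag = addr / LINES;
        cache_line *line = &cache[index];

        if (line->valid && line->dirty && line->tag != tag)
            mem[line->tag * LINES + index] = line->data;  /* write back the evicted dirty line */

        line->valid = true;
        line->tag   = tag;
        line->data  = value;
        line->dirty = true;   /* memory is updated only later, on replacement */
    }

    int main(void) {
        write_word(5, 42);
        write_word(5 + LINES, 99);   /* maps to the same line as address 5, forcing a write-back */
        printf("mem[5] after write-back = %d\n", mem[5]);   /* 42 */
        return 0;
    }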
ADVANCED OPTIMIZATIONS
• Some of the optimizations to improve cache performance are based on metrics such as hit time, miss rate, miss penalty, cache bandwidth and power consumption.
• Most of the techniques are based on hardware, while some rely on software/programming techniques.
• Compiler Optimizations: These techniques rely on optimization of software rather than hardware. The two techniques that we are going to discuss are Loop Interchange and Blocking.
• Loop Interchange: Some programs have nested loops that access data in memory in non-sequential order. Simply exchanging the nesting of the loops can make the code access the data in the order in which they are stored. Since the arrays are laid out in row-major fashion, accessing them in different ways can affect the miss rate greatly, as the sketch below shows.
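A minimal sketch of loop interchange (the array dimensions are illustrative): because C arrays are row-major, the second version walks memory sequentially and therefore enjoys far better spatial locality:

    #define ROWS 5000
    #define COLS 100
    static double x[ROWS][COLS];

    /* Before: inner loop strides through memory in steps of COLS elements. */
    void before(void) {
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                x[i][j] = 2 * x[i][j];
    }

    /* After loop interchange: accesses the row-major array sequentially. */
    void after(void) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                x[i][j] = 2 * x[i][j];
    }

    int main(void) { before(); after(); return 0; }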
ADVANCED OPTIMIZATIONS
• Blocking: This optimization improves temporal locality to
reduce misses. When we need to deal with multiple arrays,
with some arrays accessed by rows and some by columns,
storing in a specific way doesn’t solve the problem. Such
orthogonal accesses mean that transformations such as loop
interchange still leave plenty of room for improvement.
Consider the example code given below.
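The example code itself is not reproduced in these slides; the sketch below shows the matrix multiply usually used to illustrate blocking, first unblocked and then blocked with an assumed blocking factor B (N and B are illustrative values):

    #include <string.h>

    #define N 256          /* matrix dimension (illustrative)      */
    #define B 32           /* blocking factor, tuned to cache size */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    static double x[N][N], y[N][N], z[N][N];

    /* Unblocked: z is walked column by column for every (i, j), so all three
       N-by-N matrices compete for the cache at once.                         */
    void multiply(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double r = 0;
                for (int k = 0; k < N; k++)
                    r += y[i][k] * z[k][j];
                x[i][j] = r;
            }
    }

    /* Blocked: works on B-by-B submatrices so the working set fits in the
       cache, improving temporal locality. x must start out at zero.         */
    void multiply_blocked(void) {
        memset(x, 0, sizeof x);
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < MIN(jj + B, N); j++) {
                        double r = 0;
                        for (int k = kk; k < MIN(kk + B, N); k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] += r;
                    }
    }

    int main(void) { multiply(); multiply_blocked(); return 0; }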

• The number of capacity misses clearly depends on N and


the size of the cache. If it can hold all three N-by-N matrices,
then all is well, provided there are no cache conflicts.
ADVANCED OPTIMIZATIONS
• Compiler-Controlled Prefetching:
• An alternative to hardware prefetching is for the compiler to insert prefetch instructions to request data before the processor needs it.
• This can be done in two ways:
1) Register prefetch loads the value into a register.
2) Cache prefetch loads data only into the cache and not into a register.
 This helps in reducing the miss rate in the same way as hardware prefetching, but with greater accuracy, since the blocks to load are specified by the compiler with certainty.
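A hedged sketch of a cache prefetch inserted in software, using the GCC/Clang builtin __builtin_prefetch; the prefetch distance of 16 elements is an assumption that would need tuning for a real machine:

    #include <stdio.h>
    #include <stddef.h>

    /* Sum an array while prefetching data a fixed distance ahead of the current
       access, hiding part of the memory latency behind useful work.            */
    double sum_with_prefetch(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* read prefetch, low temporal locality */
            s += a[i];
        }
        return s;
    }

    int main(void) {
        double data[1024];
        for (size_t i = 0; i < 1024; i++) data[i] = (double)i;
        printf("sum = %.0f\n", sum_with_prefetch(data, 1024));
        return 0;
    }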
PERFORMANCE IMPROVEMENT
TECHNIQUES
• We can improve cache performance by using a larger cache block size and higher associativity, and by reducing the miss rate, the miss penalty, and the time to hit in the cache.
• Average Memory Access Time (AMAT):
• AMAT helps in analysing the cache memory and its performance. The lower the AMAT, the better the performance. AMAT can be calculated as follows.
• For simultaneous access:
AMAT = Hit Ratio * Cache access time + Miss Ratio * Main memory access time
     = (h * tc) + (1 - h) * tm
• For hierarchical access:
AMAT = Hit Ratio * Cache access time + Miss Ratio * (Cache access time + Main memory access time)
     = (h * tc) + (1 - h) * (tc + tm)
PERFORMANCE IMPROVEMENT
TECHNIQUES
• Example 1: What is the average memory access time for a machine with a
cache hit rate of 75% and cache access time of 3 ns and main memory
access time of 110 ns.
• Solution:
• Average Memory Access Time(AMAT) = (h * tc) + (1-h) * (tc + tm)
• Given, Hit Ratio(h) = 75/100 = 3/4 = 0.75
• Miss Ratio (1-h) = 1-0.75 = 0.25
• Cache access time(tc) = 3ns
• Main memory access time(effectively) = tc + tm = 3 + 110 = 113 ns
• Average Memory Access Time(AMAT) = (0.75 * 3) + (0.25 * (3+110))

• = 2.25 + 28.25
• = 30.5 ns
• Note: AMAT can also be calculated as Hit Time + (Miss Rate * Miss
Penalty)
PERFORMANCE IMPROVEMENT
TECHNIQUES
• Example 2: Calculate AMAT when Hit Time is 0.9 ns, Miss Rate is
0.04, and Miss Penalty is 80 ns.
• Solution :
• Average Memory Access Time(AMAT) = Hit Time + (Miss Rate *
Miss Penalty)
• Here, Given,
• Hit time = 0.9 ns
Miss Rate = 0.04
• Miss Penalty = 80 ns
• Average Memory Access Time(AMAT) = 0.9 + (0.04*80)
• = 0.9 + 3.2
• = 4.1 ns
• Hence, if Hit time, Miss Rate, and Miss Penalty are reduced, the AMAT
reduces which in turn ensures optimal performance of the cache.
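The two worked examples above can be reproduced with a few lines of C (all values are taken directly from the examples):

    #include <stdio.h>

    int main(void) {
        /* Example 1: hierarchical access, h = 0.75, tc = 3 ns, tm = 110 ns */
        double h = 0.75, tc = 3.0, tm = 110.0;
        printf("Example 1: AMAT = %.1f ns\n", h * tc + (1.0 - h) * (tc + tm));        /* 30.5 ns */

        /* Example 2: AMAT = Hit Time + (Miss Rate * Miss Penalty) */
        double hit_time = 0.9, miss_rate = 0.04, miss_penalty = 80.0;
        printf("Example 2: AMAT = %.1f ns\n", hit_time + miss_rate * miss_penalty);   /* 4.1 ns */
        return 0;
    }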
DRAM
ORGANIZATION:
• Dynamic random access memory (DRAM) is a type of semiconductor
memory that is typically used for the data or program code needed by a
computer processor to function.

• DRAM is a common type of random access memory (RAM) used in


personal computers (PCs), workstations, and servers.

• RAM is a part of the computer's Main Memory, which is directly


accessible by the CPU. Random access allows the PC processor to
directly access any part of the memory rather than proceeding
sequentially from a starting place.
DRAM
• DRAM is located close to a computer's processor and enables
faster access to data than storage media such as hard disk drives
and solid-state drives. RAM is used to store the data that is
currently processed by the CPU.
DRAM
• Dynamic random access memory stores each bit of data in a separate
capacitor within an integrated circuit. The charging and discharging of
the capacitor represent 0 and 1, i.e., the two possible values that can be
stored in a bit.
• DRAM is widely used in digital electronics where low-cost and high-
capacity memory is required.
• How does DRAM work?
• Memory is made of bits of data or program code that are arranged in a
two-dimensional grid.
DRAM
• DRAM will store bits of data in a storage or memory cell, consisting of a
capacitor and a transistor.
• The storage cells are typically organized in a rectangular configuration. When
a charge is sent through a column, the transistor at the column is activated.
• A DRAM storage cell is dynamic, meaning that it needs to be refreshed or
given a new electronic charge every few milliseconds to compensate for
charge leaks from the capacitor.
• The memory cells will work with other circuits that can be used to identify
rows and columns, track the refresh process, and instruct a cell whether or not
to accept a charge and read or restore data from a cell.
• DRAM is one option of semiconductor memory that a system designer can
use when building a computer. Alternative memory choices include static
RAM (SRAM), electrically erasable programmable read-only memory
(EEPROM), NOR flash, and NAND flash. Many systems use more than one
type of memory.
DRAM
• Types of DRAM
• Many types of DRAM can be used in a device.
Below are some common types of DRAM.
DRAM
1. Asynchronous DRAM: The memory access was not
synchronized with the system clock. That's why it's called
asynchronous. It was the first type of DRAM in use but was
gradually replaced by synchronous DRAM.
• RAS only Refresh: ROR is a classic asynchronous DRAM
type, and it is refreshed by opening each row in turn. An
external counter is required to refresh the rows sequentially.
The refresh cycles are spread across the overall refresh
interval.
• CAS before RAS refresh: CBR is used to reduce external
circuitry. The counter required for the refresh was incorporated
into the main chip. It became the standard format for the
refresh of an asynchronous DRAM.
DRAM
• 2. Synchronous DRAM: In synchronous DRAM (SDRAM), the clock is
synchronized with the memory interface. All the signals are processed on
the rising edge of the clock.
• Synchronous DRAM syncs memory speeds with CPU clock speeds,
letting the memory controller know the CPU clock cycle. It allows the
CPU to perform more instructions at a time. SDRAM is used in most
computer systems today.
• Single Data Rate SDRAM: Single data rate SDRAM (SDR SDRAM or
SDR) is the original generation of SDRAM. It made a single transfer of
data per clock cycle.
• Double Data Rate SDRAM: DDR SDRAM almost doubles the
bandwidth in the data rate of SDRAM by using double pinning. This
process allows for data to transfer on rising and falling edges of a clock
signal.
It has been available in different iterations over time, including DDR2
SDRAM, DDR3 SDRAM, and DDR4 SDRAM.
DRAM
• 3. Graphics DRAM: Graphics DRAMs are asynchronous and
synchronous DRAMs designed for graphics-related tasks such
as texture memory and framebuffers found on video cards.
• Video DRAM: Video DRAM (VRAM) is a dual-ported variant
of DRAM that was once commonly used to store the
framebuffer in some graphics adaptors.
• Window DRAM: Window DRAM (WRAM) is a variant of VRAM that was once used in graphics adaptors such as the Matrox Millennium and ATI 3D Rage Pro. WRAM was designed to perform better and cost less than VRAM. WRAM offered up to 25% greater bandwidth than VRAM and accelerated commonly used graphical operations such as text drawing and block fills.
DRAM
• DRAM IC Packages
• DRAM memory chips are available in a variety of IC packages. It
is found that the DRAM packages used in computers may be
different from those found in other electronics equipment due to
the different requirements.
• Dual-in-line or DIL package is a standard leaded package for
integrated circuits.
• SMT DRAMs are available in a variety of surface-mount
packages. These conform to all the usual SMT packages, and the
size and basic format depend upon the silicon chip size, the
number of leads required, and the application for which it is
intended.
DRAM
• DRAM Memory Module Formats
• Although DRAM is produced as integrated circuits, typically in a surface
mount format for mounting onto printed circuit boards, the memory
available for use in PCs and other computer applications is often in the
format of small modules containing many different ICs.
• These multi-chip modules are available in the following number of
formats, such as:
1. Single inline memory module (SIMM)
2. Dual inline memory module (DIMM)
DRAM
• 1. Single inline memory module (SIMM): Single inline
memory module packaging is considered obsolete now and was
used in the 1980s to 1990s. SIMMs came in 30 and 72 pin sets
and typically had 32-bit data transfer rates.
• 2. Dual inline memory module (DIMM): Current memory
modules come in DIMMs. "Dual inline" refers to pins on both
sides of the modules. A DIMM originally had a 168-pin
connector supporting a 64-bit data bus, which is twice the data
width of SIMMs.
• The wider bus means that more data can pass through a DIMM,
translating to faster overall performance. Latest DIMMs based
on fourth-generation double data rate (DDR4) SDRAM have
288-pin connectors for increased data throughput.
DRAM
Advantages of DRAM
• The main advantages of DRAM include the following:
• Its design is simple, and each cell requires only one transistor.
• The cost is low in comparison to alternative types of memory such as SRAM.
• It provides high integration density.
• It has a simple memory cell structure.
Disadvantages of DRAM
• The main disadvantages of DRAM are as follows:
• DRAM is volatile.
• Inter-signal coupling exists in DRAMs.
• Power consumption is high relative to other options.
• The DRAM manufacturing process is complex.
ACCESS TECHNIQUES

• There are 4 types of memory access techniques:
1. Sequential Access
2. Random Access
3. Direct Access
4. Associative Access
ACCESS TECHNIQUES
Sequential Access
• In this method, the memory is accessed in a specific
linear sequential manner, like accessing in a single
Linked List.
• The access time depends on the location of the
data.
ACCESS TECHNIQUES

Random Access
• In this method, any location of the memory can be
accessed randomly like accessing in Array. Physical
locations are independent in this access method.
ACCESS TECHNIQUES
• Direct Access
• In this method, individual blocks or records have a unique address based on physical location. Access is accomplished by direct access to reach a general vicinity, plus sequential searching, counting, or waiting to reach the final destination.
• This method is a combination of the above two access methods. The access time depends on both the memory organization and the characteristics of the storage technology. The access is semi-random or direct.
ACCESS TECHNIQUES

• Associative Access

• In this method, a word is accessed based on its content rather than its address. This access method is a special type of random access method. An application of associative memory access is cache memory.
SCHEDULING ALGORITHMS AND
SIGNAL SYSTEMS
• A Process Scheduler schedules different processes to be assigned to the CPU
based on particular scheduling algorithms. There are six popular process
scheduling algorithms
• First-Come, First-Served (FCFS) Scheduling
• Shortest-Job-Next (SJN) Scheduling
• Priority Scheduling
• Shortest Remaining Time
• Round Robin(RR) Scheduling
• Multiple-Level Queues Scheduling
• These algorithms are either non-preemptive or preemptive. Non-preemptive
algorithms are designed so that once a process enters the running state, it
cannot be preempted until it completes its allotted time, whereas the
preemptive scheduling is based on priority where a scheduler may preempt a
low priority running process anytime when a high priority process enters into
a ready state.
SCHEDULING ALGORITHMS AND
SIGNAL SYSTEMS

First Come First Serve (FCFS)

• Jobs are executed on a first come, first served basis.
• It is a non-preemptive scheduling algorithm.
• Easy to understand and implement.
• Its implementation is based on a FIFO queue.
• Poor in performance, as the average wait time is high (see the sketch below).
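A minimal FCFS sketch with an assumed set of burst times (the values are illustrative) showing why the average waiting time tends to be high when a long job arrives first:

    #include <stdio.h>

    int main(void) {
        /* Burst times in arrival (FIFO) order; all processes arrive at t = 0. */
        int burst[] = {24, 3, 3};
        int n = 3, wait = 0, total_wait = 0;

        for (int i = 0; i < n; i++) {
            printf("P%d waits %d units\n", i + 1, wait);
            total_wait += wait;
            wait += burst[i];   /* each process waits for all earlier arrivals */
        }
        printf("Average waiting time = %.2f\n", (double)total_wait / n);  /* 17.00 */
        return 0;
    }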
SCHEDULING ALGORITHMS AND
SIGNAL SYSTEMS

Shortest Job Next (SJN)

• This is also known as shortest job first, or SJF.
• This is a non-preemptive scheduling algorithm.
• Best approach to minimize waiting time.
• Easy to implement in batch systems where the required CPU time is known in advance.
• Impossible to implement in interactive systems where the required CPU time is not known.
• The processor should know in advance how much time the process will take.
SCHEDULING ALGORITHMS AND
SIGNAL SYSTEMS
• Priority Based Scheduling
• Priority scheduling is a non-preemptive algorithm
and one of the most common scheduling algorithms
in batch systems.
• Each process is assigned a priority. Process with
highest priority is to be executed first and so on.
• Processes with same priority are executed on first
come first served basis.
• Priority can be decided based on memory
requirements, time requirements or any other
resource requirement.
SCHEDULING ALGORITHMS AND
SIGNAL SYSTEMS
• Shortest Remaining Time
• Shortest remaining time (SRT) is the preemptive version of the SJN algorithm.
• The processor is allocated to the job closest to completion, but it can be preempted by a newly ready job with a shorter time to completion.
• Impossible to implement in interactive systems where the required CPU time is not known.
• It is often used in batch environments where short jobs need to be given preference.
SCHEDULING ALGORITHMS
AND SIGNAL SYSTEMS

• Round Robin Scheduling

• Round Robin is a preemptive process scheduling algorithm.
• Each process is provided a fixed time to execute, called a quantum.
• Once a process has executed for the given time period, it is preempted and another process executes for a given time period.
• Context switching is used to save the states of preempted processes (see the sketch below).
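A minimal Round Robin sketch with an assumed quantum of 4 time units and three processes arriving at time 0 (all values are illustrative):

    #include <stdio.h>
    #include <stdbool.h>

    int main(void) {
        int remaining[] = {10, 5, 8};   /* remaining burst time of each process */
        const int n = 3, quantum = 4;
        int time = 0;
        bool done = false;

        while (!done) {
            done = true;
            for (int i = 0; i < n; i++) {   /* cycle through the ready queue */
                if (remaining[i] > 0) {
                    int slice = remaining[i] < quantum ? remaining[i] : quantum;
                    printf("t=%2d: P%d runs for %d units\n", time, i + 1, slice);
                    time += slice;          /* preempt after the quantum expires */
                    remaining[i] -= slice;
                    done = false;
                }
            }
        }
        printf("All processes finish at t=%d\n", time);   /* t=23 */
        return 0;
    }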
SCHEDULING ALGORITHMS AND
SIGNAL SYSTEMS

• Multiple-Level Queues Scheduling


• Multiple-level queues are not an independent scheduling
algorithm. They make use of other existing algorithms to
group and schedule jobs with common characteristics.
• Multiple queues are maintained for processes with common
characteristics.
• Each queue can have its own scheduling algorithms.
• Priorities are assigned to each queue.
• For example, CPU-bound jobs can be scheduled in one queue
and all I/O-bound jobs in another queue. The Process
Scheduler then alternately selects jobs from each queue and
assigns them to the CPU based on the algorithm assigned to
the queue.
SIGNAL SYSTEMS

• Control signals regulate the operations and coordination of all


processor components while executing the instructions. It is the
control unit of the CPU, which is responsible for generating
control signals. If we divide the instruction cycle into two phases,
it would be the fetch phase and the execution phase.
• Even the processor's hardware is divided into two sections. The section that deals with the fetch phase also generates the control signals. We refer to this section or unit as the Control Unit. The other section deals with the execution phase: it reads the data operands of the instruction, performs the required computation, and stores the result. We often refer to this unit as the ALU.
• Thus, the control unit generates control signals so that the processing of instructions takes place in the correct sequence and at the right time.
SIGNAL SYSTEMS

How are Control Signals Generated?


• There are two approaches to design a control unit for
generating the control signals.
• Hardwired Control
• Micro programmed Control
SIGNAL SYSTEMS
• Hardwired Control
• To execute a particular instruction, a system needs to
execute a sequence of steps such as:
• Instruction fetch
• Instruction decoding
• Operand fetch
• Instruction execution
• Operand store
• Each of these steps requires one clock cycle.
SIGNAL SYSTEMS

• Micro programmed Control


• The hardwired control uses logic circuitry to generate the
control signals. However, the micro programmed control
uses software to generate control signals.
• The control signals’ settings in the micro programmed unit
depend on a special program (control program) stored in a
special memory.
TILED CHIP MULTI CORE
PROCESSORS (TCMP)



TILED CHIP MULTICORE
PROCESSORS (TCMP)
 A tile processor is a type of processor architecture that is designed to
efficiently handle parallel computations by dividing them into smaller,
independent tasks called tiles.

 Each tile contains a subset of the overall computation and can be processed
independently or in parallel with other tiles.

 The architecture of a tile processor typically consists of multiple processing


elements (PEs) or cores, which are responsible for executing the instructions
and performing computations.

 Each processing element of a tile contains a high-performance arithmetic unit with its own private register file and local SRAM memory. These PEs are organized in a grid or array-like structure, where each PE is connected to its neighbouring PEs for communication and data exchange.



TILED CHIP MULTICORE PROCESSORS (TCMP)

Here are the key components and characteristics of a typical tile processor
architecture:

1.Grid Structure: The PEs are arranged in a grid structure, often referred to as
a mesh or a 2D array. The grid structure allows for easy interconnection and
communication between neighboring PEs.

2.Local Memory: Each PE has its own local memory, which is used to store
data and instructions specific to that PE. This local memory provides low-
latency access and enables fast data sharing between neighboring PEs.

3.Communication Network: The PEs are interconnected through a


communication network that allows them to exchange data and synchronize
their operations. The network can be either a shared bus, a crossbar switch, or
a more complex network-on-chip (NoC) architecture.



TILED CHIP MULTICORE PROCESSORS (TCMP)
4.Tile Division: The computation is divided into tiles, which
are typically small, fixed-sized portions of the overall problem.
Each tile is assigned to a specific PE for processing. The tile
size is chosen to balance the workload across PEs and optimize
data locality.

5.Task Scheduling: A tile processor employs task scheduling


algorithms to distribute tiles across the available PEs. The
scheduler assigns tiles to PEs based on various factors such as
load balancing, data dependencies, and available resources.

6.Data Parallelism: The tile processor exploits data parallelism


by executing the same set of instructions on multiple data
elements simultaneously. Each PE operates on a subset of data
within a tile, enabling efficient parallel execution.
TILED CHIP MULTICORE PROCESSORS (TCMP)

7.Synchronization and Communication: PEs often need to synchronize and


exchange data during computation. Synchronization barriers are employed to
ensure that all PEs have completed their assigned tasks before proceeding to
the next stage. Communication primitives like message passing or shared
memory mechanisms facilitate data exchange between PEs.

8.Scalability: Tile processors can be scaled up to accommodate a larger number


of PEs, allowing for higher parallelism and increased computational power. The
scalability of the architecture enables efficient utilization of resources for
various application domains.

Overall, the tile processor architecture optimizes parallel computation by


dividing the workload into smaller, manageable tasks and efficiently
distributing them across multiple PEs.



Tile Processor Architecture

The Tile Processor Architecture consists of a 2D grid of identical


compute elements, called tiles. Each tile is a powerful, full
featured computing system that can independently run an entire
operating system, such as Linux. Likewise, multiple tiles can be
combined to run a multiprocessor operating system such as SMP
Linux.



TILED CHIP MULTICORE PROCESSORS (TCMP)

Instead of using buses or rings to connect the many on-chip cores,


the Tile Architecture couples its processors using five 2D mesh
networks. They are
1. User dynamic network (UDN)
2. I/O dynamic network (IDN)
3. Static network (STN)
4. Memory dynamic network (MDN)
5. Tile dynamic network (TDN)

which provide the transport medium for off-chip memory access,


I/O, interrupts, and other communication activity.

2D Mesh is a very popular topology in Network on Chip due to its


facilitated implementation, simplicity of the XY routing strategy and
the network scalability.
TILED CHIP MULTICORE PROCESSORS (TCMP) – Tile Processor design (figure)



Network on Chip (NoC)

A Network-on-Chip (NoC) is an interconnect infrastructure that


is integrated into a chip or a system-on-chip (SoC). It provides a
communication framework for connecting the various
components within the chip, such as processors, memory units,
accelerators, and input/output interfaces.

It replaces traditional bus-based interconnects with a packet-


switched network, similar to how data is transferred in computer
networks.

SoC is a chip that integrates components of a computer including


a central processing unit (CPU), graphics, and memory
interfaces.
Network on Chip (NoC)
Key Characteristics And Concepts Associated With Network-on-chip Architectures:
1.Communication Channels: NoCs consist of a network of communication
channels, often referred to as "links," which allow for the exchange of data and
control information between the components. These channels can be point-to-point
connections or shared communication buses.

2.Routers: Routers form the backbone of a NoC and are responsible for routing
packets of data across the network. They receive packets from input channels and
determine the appropriate path to forward the packets to the desired destination.

3.Topologies: NoCs can have various network topologies, determining the structure
and connectivity of the communication channels. Common topologies include
mesh, torus, ring, star, and tree, each with its own advantages and trade-offs in
terms of scalability, latency, and power consumption.



Network on Chip (NoC)

4.Packetization: Data communication in NoCs is typically done using packets.


Data is divided into smaller packets, each containing a portion of the data and
additional control information. Packetization helps in efficient data routing, flow
control, and error handling within the network.

5.Routing Algorithms: NoCs employ routing algorithms to determine the path


that packets should take from the source to the destination. Routing algorithms
consider factors such as network congestion, load balancing, and minimizing
packet latency and power consumption.

6.Quality-of-Service (QoS): NoCs can support different classes of traffic with


varying QoS requirements, such as real-time or latency-sensitive data. QoS
mechanisms in NoCs ensure that packets are prioritized and delivered according
to their specific requirements.



Network on Chip (NoC)

7.Power Optimization: NoCs often employ power optimization


techniques to minimize energy consumption. Techniques like
power gating, voltage scaling, and dynamic routing can be used
to reduce power dissipation in the network.

Network-on-Chip architectures are commonly used in complex


systems and applications that require high-performance and
scalable interconnect solutions.

They enable efficient communication between components,


support parallel processing, and help in managing increasing
data transfer requirements in modern chip designs.
Network on Chip (NoC) - ARCHITECTURE



Network on Chip (NoC) - ARCHITECTURE

Communication layer: It is the component responsible for managing the exchange of data and control information between different components or processing elements within the chip. It serves as the interface between the components and the underlying network infrastructure, facilitating communication and data transfer.
Application layer : It is the topmost layer of the communication stack,
responsible for managing the communication requirements and interactions of
the higher-level software applications or functional units with the NoC
infrastructure. It provides an interface between the software or higher-level
modules and the underlying NoC communication layer.
Software Adoption : The utilization of software-based techniques, tools, and
frameworks to optimize and enhance the performance, efficiency, and
functionality of NoC architectures. Software adoption plays a crucial role in
effectively utilizing the capabilities of the NoC and enabling the development of
complex and high-performance systems.



Network on Chip (NoC) - ARCHITECTURE

Hardware Adoption : The utilization of specialized hardware


components, designs, and techniques to enhance the performance,
efficiency, and functionality of the NoC architecture. Hardware
adoption plays a crucial role in optimizing the communication
infrastructure, reducing latency, improving power efficiency, and
meeting the requirements of complex system designs.

OSI Layer Mapping: In Network-on-Chip (NoC) architectures, the mapping of the traditional OSI layers may not directly apply. NoCs typically have their own unique architecture and communication protocols that may differ from traditional networking protocols.



NOC ROUTER – ARCHITECTURE AND DESIGN

Network-on-Chip (NoC) router architecture and design are essential components


of a NoC system. A NoC router facilitates communication between different
processing elements or IP cores within the chip, enabling efficient data transfer
and synchronization.
Key Aspects of NoC Router:
 Input and Output Ports: A NoC router typically has multiple input and output
ports to accommodate communication with neighboring routers and connected
processing elements. The number of ports depends on the specific NoC design
and network topology. Each port serves as a communication interface for
receiving and transmitting data packets.

 Input Buffering: Input buffering is crucial in NoC routers to handle incoming


packets and manage congestion. Each input port is typically associated with a
buffer that temporarily stores incoming packets until they can be forwarded to
the appropriate output port. Buffering helps to accommodate variations in
processing speeds and mitigate congestion within the NoC.
NoC Router – Architecture and Design



NoC Router – Architecture and Design

 Routing Mechanism: The routing mechanism determines how data packets are
forwarded from the input ports to the desired output ports. NoC routers
employ routing algorithms or tables to make routing decisions based on packet
headers, destination addresses, and network conditions. Common routing
algorithms include deterministic routing, adaptive routing, and dimension-
ordered routing.
 Crossbar Switch or Interconnect: The interconnect or crossbar switch connects
the input ports to the output ports, allowing for the exchange of data packets. It
enables the routing and switching of packets between different ports based on
the routing decisions made by the router. The interconnect is a critical
component that should have sufficient bandwidth to accommodate the traffic
demands of the NoC.
 Arbitration: In situations where multiple input ports contend for the same
output port, arbitration mechanisms resolve conflicts and determine the order
of packet transmission. Different arbitration algorithms can be employed, such
as round-robin, prioritized, or fairness-based arbitration, to ensure fair and
efficient resource allocation.
NoC Router – Architecture and Design

Error Handling and Reliability: NoC routers may incorporate


error detection and error correction mechanisms to ensure
reliable data transmission. This includes error detection codes,
retransmission strategies, and fault tolerance techniques to
detect and recover from errors that may occur during packet
transmission.

Flow of Control: Tasks that need to be executed by different components or processing elements within the system are generated either by software running on the processors or by hardware modules triggering specific operations.



Network-on-Chip Design Methodologies

 Network-on-Chip (NoC) design methodologies refer to the approaches and


techniques used in designing efficient and scalable communication
architectures for complex System-on-Chip (SoC) designs.

 NoCs are a paradigm for on-chip communication that replace traditional bus-
based communication architectures with a network of interconnected
communication nodes.
Methodologies Used In Noc Design :

1.Topology Selection: The choice of NoC topology is crucial for achieving


efficient communication. Different topologies, such as mesh, torus, or tree-based
structures, have different trade-offs in terms of latency, power consumption,
scalability, and fault tolerance.



Network-on-Chip Design Methodologies

2.Routing Algorithms: Routing algorithms determine how packets are


forwarded through the NoC. They should be designed to minimize latency,
congestion, and power consumption while ensuring deadlock freedom and
maintaining Quality of Service (QoS) requirements. Various routing strategies,
including deterministic, adaptive, and stochastic approaches, can be employed.

3.Flow Control: Flow control mechanisms regulate the flow of data in the NoC
to prevent congestion and ensure reliable communication. Techniques like
credit-based flow control, wormhole routing, or virtual channel allocation are
commonly used to manage the data flow and handle contention.

4.Performance Analysis and Optimization: Designers employ performance


analysis tools and techniques to evaluate the NoC's performance
characteristics, including latency, throughput, and scalability.



Network-on-Chip Design Methodologies

5.Integration with System-Level Design: NoC design methodologies should


consider the integration of the communication architecture with the overall system
design. This includes interface standards, integration of IP cores, arbitration
mechanisms, and synchronization protocols to ensure seamless integration and
compatibility with the rest of the SoC.
6.Verification and Testing: Robust verification and testing methodologies are
essential to ensure the correctness and reliability of the NoC design. Techniques
such as formal verification, simulation, emulation, and prototyping are used to
validate the design and detect any functional or performance issues.
7.Design Space Exploration: Design space exploration involves exploring
different design options, such as varying topology, routing algorithms and buffer
sizes to identify the optimal configuration that meets the design constraints.
Techniques like performance modeling, trade-off analysis, and heuristic-based
exploration help in finding the best design solution.



Network-on-Chip Design Methodologies



Routing Algorithms And Flow Control Techniques

Routing Algorithms in Network-on-Chips (NoCs)


1.Dimension-Ordered Routing: This routing algorithm is commonly used in
mesh-based NoCs. It routes packets in a predetermined order, prioritizing
specific dimensions (e.g., X or Y) at each hop. It simplifies routing decisions
and reduces routing overhead.

2.XY Routing: XY routing is another routing algorithm used in mesh-based NoCs. Each packet is routed in two steps: first in the X direction towards the destination column and then in the Y direction to reach the destination router. It is a deterministic routing algorithm that is easy to implement (see the sketch after this list).

3.West-First Routing: This routing algorithm is commonly used in torus-based


NoCs. It prioritizes the West direction when routing packets, aiming to reduce
the average distance traveled by packets and minimize congestion.



Routing Algorithms And Flow Control Techniques

4.Adaptive Routing: Adaptive routing algorithms dynamically select the path


for each packet based on the current network conditions. They take into
account factors such as congestion, link reliability, and available bandwidth to
make routing decisions. Adaptive routing can improve performance and avoid
congestion hotspots, but it adds complexity to the routing process.

5.Minimal Routing: Minimal routing aims to find the shortest path between
the source and destination routers. It selects the next hop that minimizes the
distance to the destination, considering the topology of the NoC. Minimal
routing algorithms are often used in combination with other routing
techniques to optimize performance.
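A minimal sketch of the XY routing decision (item 2 above) as made at a single router of a 2D mesh; the coordinate convention and port names are illustrative assumptions:

    #include <stdio.h>

    typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

    /* XY routing: travel fully along the X dimension first, then along Y. */
    port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
        if (dst_x > cur_x) return EAST;
        if (dst_x < cur_x) return WEST;
        if (dst_y > cur_y) return NORTH;
        if (dst_y < cur_y) return SOUTH;
        return LOCAL;   /* packet has reached its destination tile */
    }

    int main(void) {
        const char *names[] = {"EAST", "WEST", "NORTH", "SOUTH", "LOCAL"};
        /* Router (1,1) forwarding a packet destined for tile (3,2). */
        printf("next hop: %s\n", names[xy_route(1, 1, 3, 2)]);   /* EAST: X dimension first */
        return 0;
    }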



Routing Algorithms And Flow Control Techniques

Flow Control Techniques in Network-on-Chips (NoCs)


1.Credit-Based Flow Control: Credit-based flow control uses credit tokens to regulate the flow of packets in the NoC. Each router grants a certain number of credits to a sender, indicating the available buffer space in the receiver. The sender can send packets as long as it has sufficient credits. This technique helps prevent congestion and provides flow control in a distributed manner (a sketch follows after this list).

2.Virtual Channel Flow Control: Virtual channel flow control assigns multiple
virtual channels to each physical link in the NoC. Each virtual channel
maintains its own buffers and control signals. It enables packets from different
traffic flows to be multiplexed onto the same physical link, reducing
congestion and providing isolation between different traffic classes.



Routing Algorithms And Flow Control Techniques

3.Wormhole Routing: In wormhole routing, packets are divided into small flits
(flow control digits) and sent through the network as soon as the first flit arrives
at a router. The subsequent flits follow the same path, allowing pipelining of
packet transmission. Wormhole routing reduces latency but requires careful
management of flow control to avoid deadlocks and head-of-line blocking.
4.Store-and-Forward: Store-and-forward is a simple flow control technique
where the entire packet is received at a router before being forwarded to the next
router. This approach ensures that the packet is error-free and reduces the
chances of congestion. However, it introduces higher latency compared to other
flow control techniques.
5.Cut-Through Routing: Cut-through routing is a variant of wormhole routing
where flits are forwarded through the network as soon as they arrive at a router,
without waiting for the complete packet. This technique reduces latency but
requires additional mechanisms to handle flow control and ensure data integrity.
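A minimal sketch of the credit-based flow control described in item 1 above, for a single sender port and its downstream buffer; the buffer depth and the number of flits are assumed values:

    #include <stdio.h>

    #define BUFFER_DEPTH 4   /* downstream buffer slots = initial credits (assumed) */

    static int credits = BUFFER_DEPTH;

    /* Sender: may transmit a flit only while it holds at least one credit. */
    static int try_send(int flit_id) {
        if (credits == 0) {
            printf("flit %d stalled: no credits\n", flit_id);
            return 0;
        }
        credits--;   /* one buffer slot is now occupied downstream */
        printf("flit %d sent, credits left = %d\n", flit_id, credits);
        return 1;
    }

    /* Receiver: returns a credit when it drains a flit from its buffer. */
    static void credit_return(void) { credits++; }

    int main(void) {
        for (int i = 0; i < 6; i++) {
            if (!try_send(i)) {      /* back-pressure: wait for a credit */
                credit_return();     /* downstream router frees one slot */
                try_send(i);
            }
        }
        return 0;
    }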



Advanced Topics In NoC And Storage

1.Scalability and Heterogeneity: Scaling NoCs to accommodate larger system


sizes and diverse architectures is a challenging task. Researchers investigate
scalable NoC designs, network partitioning techniques, hierarchical interconnects,
and support for heterogeneous components to address the scalability and
heterogeneity requirements of modern systems.

2.3D Integration and NoCs: Three-dimensional integration techniques, such as through-silicon vias (TSVs) and inter-layer communication, offer opportunities for vertically stacking multiple layers of processing and memory components. Research focuses on designing efficient 3D NoC architectures, considering thermal management, power delivery, and communication bottlenecks.

3.Network Virtualization: Network virtualization aims to partition the NoC


resources into multiple virtual networks, enabling isolation, customization, and
sharing of communication infrastructure among different applications or tenants.
Researchers explore techniques for virtual NoC design, mapping virtual networks
onto physical resources, and ensuring isolation and QoS guarantees.
Advanced Topics In NoC And Storage

4.Machine Learning and Artificial Intelligence in NoCs: The application of


machine learning and AI techniques in NoCs is an emerging area of research.
Researchers explore using machine learning algorithms to optimize routing
decisions, traffic prediction, congestion control, power management, and fault
detection in NoCs.
5.Optical and Photonic NoCs: Optical and photonic interconnects offer the
potential for high-speed and energy-efficient communication in NoCs.
Researchers investigate the integration of optical components, such as
waveguides and photonic switches, into NoC architectures to overcome
bandwidth limitations and reduce power consumption.
6.Neuromorphic NoCs: Neuromorphic computing, inspired by the brain's neural
architecture, involves the design of specialized hardware for efficient and brain-
like computation. Researchers explore the integration of neuromorphic systems
with NoCs to create efficient communication architectures for neuromorphic
computing platforms.
Compression, Prefetching, QoS

Compression, prefetching, and Quality of Service (QoS) are techniques that


can be employed in Network-on-Chip (NoC) architectures to enhance
performance, efficiency, and resource management.

1.Compression in NoCs: Compression techniques can be applied to reduce the


amount of data transmitted across NoC links, thereby improving bandwidth
utilization and reducing energy consumption. Data compression can be
performed at different levels in the NoC, such as compressing packets, flits, or
individual data fields.

Compression algorithms like Run-Length Encoding (RLE), Huffman


coding, and Lempel-Ziv-Welch (LZW) are commonly used. However,
compression introduces additional latency and requires compression and
decompression units at each router, which need to be carefully balanced with the
benefits gained.
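A minimal sketch of Run-Length Encoding (RLE), one of the compression schemes mentioned above, applied to a byte stream; the sample data stands in for a packet payload and is purely illustrative:

    #include <stdio.h>
    #include <stddef.h>

    /* Encode src as (count, value) pairs; returns the number of bytes written to dst. */
    size_t rle_encode(const unsigned char *src, size_t n, unsigned char *dst) {
        size_t out = 0;
        for (size_t i = 0; i < n; ) {
            unsigned char value = src[i];
            unsigned char count = 1;
            while (i + count < n && src[i + count] == value && count < 255)
                count++;
            dst[out++] = count;
            dst[out++] = value;
            i += count;
        }
        return out;
    }

    int main(void) {
        unsigned char data[] = {7, 7, 7, 7, 0, 0, 3, 3, 3};   /* e.g. a flit payload */
        unsigned char packed[2 * sizeof data];
        size_t packed_len = rle_encode(data, sizeof data, packed);
        printf("%zu bytes compressed to %zu bytes\n", sizeof data, packed_len);   /* 9 -> 6 */
        return 0;
    }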



Compression, Prefetching, QoS

Prefetching in NoCs: Prefetching involves fetching data in advance before it is


actually needed, aiming to reduce memory access latency and improve overall
system performance.

 In NoCs, prefetching techniques can be employed at different levels, such as


cache prefetching or network-level prefetching. Cache prefetching predicts
future memory accesses and brings data into caches ahead of time.

 Network-level prefetching predicts communication patterns and initiates


data transfer before explicit requests are made. Prefetching algorithms, such
as stride prefetching and next-line prefetching, can be employed to
anticipate memory access patterns and optimize data movement in NoCs.
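A minimal sketch of stride prefetching: a tiny predictor watches consecutive addresses, infers a repeating stride, and issues a (simulated) prefetch for the next expected address. The single-entry predictor and the 64-byte stride are assumptions for illustration only:

    #include <stdio.h>
    #include <stdint.h>

    static uint64_t last_addr   = 0;
    static int64_t  last_stride = 0;

    /* On each demand access, detect a repeating stride and "prefetch" the next address. */
    static void on_access(uint64_t addr) {
        int64_t stride = (int64_t)(addr - last_addr);
        if (stride != 0 && stride == last_stride)
            printf("access 0x%llx -> prefetch 0x%llx (stride %lld)\n",
                   (unsigned long long)addr,
                   (unsigned long long)(addr + (uint64_t)stride),
                   (long long)stride);
        last_stride = stride;
        last_addr   = addr;
    }

    int main(void) {
        /* A streaming access pattern with a fixed stride of 64 bytes (one cache line). */
        for (uint64_t a = 0x1000; a < 0x1200; a += 64)
            on_access(a);
        return 0;
    }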



Compression, Prefetching, QoS
Quality of Service (QoS) in NoCs: QoS mechanisms in NoCs aim to
guarantee performance levels, prioritize traffic, and meet the requirements of
different applications or traffic classes.

Traffic Differentiation: Different traffic classes or applications can be


assigned different priorities or service levels based on their criticality or
performance requirements. This allows higher-priority traffic to be given
preferential treatment in terms of routing, buffer allocation, and resource
utilization.

Traffic Shaping and Scheduling: QoS mechanisms include traffic shaping


techniques to regulate the rate of data transmission, preventing congestion
and ensuring fairness. Traffic scheduling algorithms, such as weighted
round-robin or strict priority queuing, can be employed to prioritize and
allocate resources based on predefined QoS parameters.

Compression, Prefetching, QoS

Bandwidth Reservation: QoS mechanisms can include bandwidth


reservation techniques to allocate a certain amount of bandwidth to
specific traffic classes or applications, ensuring that they meet their
performance requirements.

Congestion Management: QoS in NoCs also involves congestion


management techniques to detect and alleviate congestion in the
network. Congestion-aware routing algorithms, flow control
mechanisms, and congestion control protocols can be employed to
maintain optimal performance and fairness.

