
Unit - 1

Performance Tuning Concepts


Elements of Cloud Infrastructure, Hardware, Storage, Operating Systems, Hypervisors, Networks,
Power Management, Introduction to the term Performance, Elements of Performance Tuning,
Resource Management, Resource Provisioning and Monitoring, Process, Resource Allocation, Resource
Monitoring, Performance Analysis and Performance Monitoring Tools.

What is Cloud Performance?

Cloud performance refers to the monitoring of applications and servers in
private and public clouds. Cloud performance management tools are used
not only to track cloud server performance in terms of processor, memory
and storage usage, but also to observe how cloud-based applications are
performing. Cloud performance monitoring helps administrators track the
widely varying workloads to which cloud applications are subjected, and to
note potential problems under peak loads. Cloud performance software can
help identify instances where a particular application may require additional
resources to prevent potential outages.
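As a rough illustration of this kind of resource monitoring, the sketch below polls basic server metrics and flags any resource that crosses a utilization threshold. It is a minimal example, assuming the third-party psutil library is available; the 80% threshold, the five iterations and the check interval are arbitrary values chosen for illustration.

    import time
    import psutil  # third-party library for system metrics

    THRESHOLD = 80.0  # percent utilization that triggers a warning (arbitrary)

    def check_server_metrics():
        """Collect CPU, memory and disk utilization and report anything above THRESHOLD."""
        metrics = {
            "cpu": psutil.cpu_percent(interval=1),      # CPU usage over a 1-second sample
            "memory": psutil.virtual_memory().percent,  # RAM in use
            "disk": psutil.disk_usage("/").percent,     # root filesystem usage
        }
        for name, value in metrics.items():
            if value > THRESHOLD:
                print(f"WARNING: {name} utilization at {value:.1f}%")
        return metrics

    for _ in range(5):        # short polling loop; a real monitor would export these values
        check_server_metrics()
        time.sleep(60)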


Many factors need to be considered for performance tuning, and they can
become more complex in the context of cloud computing, including:

• The service consumer, e.g., an application or a browser.
• The network speed, e.g., the speed of internal networks and of public networks.
• Cloud provider tunables, e.g., database and network configurations.


The service consumer varies greatly depending on the type of cloud computing service consumed.
Taking storage as an example, a native application is a consumer of the storage service. Tuning, or
configuration, concerns the manner in which the service is consumed; this includes the use of any
available caching systems, which can eliminate many of the occasions on which the application has
to request block storage services from the cloud provider.
The network can be tuned in any number of ways, including the path and method of transmission
from the service consumer to the service provider. Network tuning is typically a matter of
configuration within the routers and other transmission points. It is often achieved through trial and
error, and it should not impact others who are also using the network, so a compromise must be
found between the requirements of the cloud computing workload and those of the other users
sharing the network.
Lastly, the cloud computing provider normally exposes some tunables, generally around the
processing and location of data. The main intention is to place the data as close to the processing as
possible, eliminating many of the unnecessary hops an internal process requires to fetch the data.
For a highly interactive application that makes many database requests during its normal course of
processing, placing the data as close as possible becomes an absolute requirement.
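The caching idea mentioned above can be sketched very simply: if the application keeps recently read blocks in a local cache, repeated reads never reach the provider. The example below is only an illustration; fetch_block_from_provider is a hypothetical function standing in for whatever storage API the provider actually offers.

    from functools import lru_cache

    def fetch_block_from_provider(block_id: int) -> bytes:
        """Hypothetical call to the cloud provider's block storage service."""
        return b"\x00" * 4096  # placeholder data; a real implementation would call the provider API

    @lru_cache(maxsize=1024)          # keep up to 1024 recently used blocks in memory
    def read_block(block_id: int) -> bytes:
        """Return a storage block, hitting the provider only on a cache miss."""
        return fetch_block_from_provider(block_id)

    # Repeated reads of the same block are now served locally:
    read_block(42)   # first read: provider request
    read_block(42)   # second read: answered from the cache, no network hop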


Elements of Cloud Infrastructure: Hardware

Server Technology
A server is a computer program that provides resources to the other computers in a
network. Servers usually run on dedicated computers and provide several services on the
network. The server manages communication between the computers in the network
through network protocols. The server hardware varies depending on the application
running on the server.
Client-server type services include:
• File Server: Provides file management functions.
• Print Server: Provides printing functions and shares the printer for access over the network.
• Database Server: Provides the functionality of the databases running on the server.


Blade Server
A blade server is a single circuit board populated with hardware components such as
processors, memory and network cards that are usually found on multiple boards. Since the
blade server uses laptop-style technology, it is thin and requires less power and cooling
than other servers.
The individual blades in the chassis (cabinet) are connected using a bus system. The cabinets
combine to form the blade server, and the entire chassis shares a common power supply and
cooling resources. Each blade has its own operating system and software applications
installed on it.


Rack Servers
A rack server is a server designed to fit into a rack. The rack contains multiple mounting
slots and bays; each bay holds a hardware unit that is screwed into place. A single rack can
hold multiple servers and network components. Keeping all the servers in a rack reduces the
floor space needed in the organization, and a cooling system is necessary to prevent
excessive heat in the room. Each rack server has its own power supply. Rack servers come
in different sizes.
U is the standard unit of measure of vertical usable space, i.e., the height of racks (metal frames
designed to hold the hardware devices) and cabinets. Servers come in sizes from 1U to 7U,
where 1U is equal to 1.75 inches.


Enterprise Server
An enterprise server is a computer whose programs collectively serve the requirements of
an enterprise. Mainframe computer systems were traditionally used as enterprise servers. Because of
their ability to manage enterprise-wide programs, UNIX servers and Wintel servers are also generally
called enterprise servers.
Examples of enterprise servers are Sun Microsystems' UNIX-based systems, IBM iSeries
systems, HP systems and so on.


High Performance Server


High performance computer systems are among the most powerful and flexible research instruments
today. They are used in fields such as climatology, quantum chemistry, computational medicine,
high energy physics and many other areas. The hardware structure, or architecture, determines
to a large extent what is and is not possible in speeding up the system.
HPC is an essential part of business success in many organizations. Over the past years, the
HPC cluster has disrupted the supercomputing market. Typical HPC systems can deliver
industry-leading, cost-effective performance.


Server Workload:

Server workload can be defined as the amount of processing that the server is assigned at a
given time. The workload can consist of a certain number of application programs running and
of users interacting with the system's applications.

The workload can also be specified as a benchmark used to evaluate the server's performance,
which is expressed in terms of response time and throughput. Response time is the
time between a user request and the response to that request. Throughput is the amount
of work accomplished over a period of time. The amount of work handled by the server
indicates the efficiency and performance of that particular server.
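A minimal sketch of how these two measures can be taken in practice is shown below; handle_request is a hypothetical placeholder for whatever work the server actually performs, and the figures it produces are purely illustrative.

    import time

    def handle_request() -> None:
        """Hypothetical stand-in for the real work done per request."""
        time.sleep(0.01)  # simulate 10 ms of processing

    def measure(num_requests: int = 100) -> None:
        response_times = []
        start = time.perf_counter()
        for _ in range(num_requests):
            t0 = time.perf_counter()
            handle_request()
            response_times.append(time.perf_counter() - t0)   # per-request response time
        elapsed = time.perf_counter() - start
        throughput = num_requests / elapsed                    # work completed per unit time
        avg_response = sum(response_times) / len(response_times)
        print(f"Average response time: {avg_response * 1000:.1f} ms")
        print(f"Throughput: {throughput:.1f} requests/second")

    measure()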


Memory Workload: Programs and instructions require memory to store data and to
perform intermediate computations. The memory workload is the amount of memory used
by the server over a given period of time, or at a specific instant of time. Usage of main
memory increases with paging and segmentation, which rely heavily on virtual memory;
if the number of programs to be executed becomes large, more memory is needed and it
must be managed effectively.

CPU Workload: The CPU workload indicates the number of instructions executed by the
processor during a given period of time. If the CPU is constantly overloaded, more
processing power is needed. Performance can be improved for the same number of
instructions by decreasing the number of cycles each instruction requires, as the sketch
below illustrates.
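The classic relationship behind that last statement is CPU time = instruction count × cycles per instruction (CPI) ÷ clock rate. The short calculation below, with made-up numbers, shows how reducing CPI alone shortens execution time.

    def cpu_time(instruction_count: float, cpi: float, clock_hz: float) -> float:
        """CPU time = instructions * cycles-per-instruction / clock rate."""
        return instruction_count * cpi / clock_hz

    # Illustrative numbers only: 1 billion instructions on a 2 GHz processor.
    before = cpu_time(1e9, cpi=2.0, clock_hz=2e9)   # 1.00 s
    after  = cpu_time(1e9, cpi=1.5, clock_hz=2e9)   # 0.75 s
    print(f"Speedup from lowering CPI: {before / after:.2f}x")   # 1.33x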

I/O Workload: The number of inputs received by a server and the number of outputs
produced by the server over a particular duration of time is called the I/O workload.

Database Workload: The workload of a database is analyzed by determining the number
of queries executed by the database over a given period of time, or the average number
of queries executed at an instant of time.

Storage
Storage networking is the process of linking storage devices together and connecting them
to other IT networks. It provides a centralized repository for data that can be accessed by
users, and it uses high-speed connections to deliver fast performance.

Storage networking is used in reference to storage area networks (SANs), which link
multiple storage devices and provide block-level storage.

Storage networking can also refer to network attached storage (NAS) devices. A NAS is
a standalone device that connects to a network and provides file-level storage. The
NAS unit typically has no keyboard or display and is controlled and configured over the
network using a browser. A full general-purpose operating system is not needed on a NAS
device; usually a stripped-down operating system such as FreeNAS is used. Network
attached storage uses file-based protocols such as NFS, Server Message Block (SMB/CIFS)
and NCP (NetWare Core Protocol).

Storage networking can include direct-attached storage (DAS), network-attached
storage (NAS) and storage area networks (SANs).

The benefits of storage networking are improved performance, reliability and
availability; it also makes it easier to back up data for disaster recovery purposes.
Storage networks are often used alongside storage management technologies such as
storage resource management software, virtualization and compression.

Types of Storage System: Storage is a technology consisting of components and
recording media that retain digital data for future use. It is a core
function and fundamental component of computers. All computers use a storage
hierarchy that places the small, fast, expensive storage options close to the CPU. The
fast volatile technologies are referred to as memory and the slower permanent technologies
are referred to as storage.

Computers represent data using the binary numeral system. Text, numbers, pictures and
audio are converted into strings of bits, or binary digits, each having a value of 0 or 1. Storage
in a computer is usually measured in bytes, kilobytes (KB), megabytes (MB),
gigabytes (GB) and, presently, terabytes (TB).


Types of Storage:
Primary Storage:
Primary storage, also known as memory, is directly accessible to the CPU.
The CPU continuously reads the instructions stored there and executes them as
and when required. RAM, ROM and cache memory are examples of primary
memory.

Secondary Storage:
Secondary storage, also referred to as external memory or auxiliary storage, is
storage that is not directly accessible by the CPU. The computer uses
input/output channels to access secondary storage and transfers the desired data
via an intermediate area in primary storage. Secondary storage is non-volatile.
Hard disks are an example of secondary storage. Other examples of
secondary storage technologies are flash memory, floppy disks, magnetic tapes,
optical devices such as CD and DVD, and Zip drives.


Storage Devices and Technologies: Hard Disk Drive


RAID or Disk Array


A Redundant Array of Independent Disks (RAID) is an arrangement in which hard disk drives are
grouped together using hardware or software and treated as a single data storage unit. The data is
recorded across multiple hard disk drives in parallel, improving access speed significantly.
The array of multiple hard disk drives that forms the RAID can also be partitioned and
assigned a file system.

RAID Functions:
• Striping
Striping is the process in which consecutive logical blocks of data are stored on the
consecutive physical disks that form the array.
• Mirroring
Mirroring is the process in which the same data is written to the same block on two or more physical
disks in the array.
• Parity Calculation
If there are N disks in the RAID array, N-1 consecutive blocks are used for storing
data and the Nth block is used for storing the parity. When any of the N-1 data blocks
is altered, N-2 XOR calculations are performed on the N-1 blocks. The data blocks and the
parity block are written across the array of hard disk drives that forms the RAID array.
If one of the N blocks fails, the data in that particular block is reconstructed using N-2 XOR
calculations on the remaining N-1 blocks. If two or more blocks fail in the RAID array, the
reconstruction of the data from the failed blocks is not possible.
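A toy illustration of the parity idea, assuming a 4-disk array with byte-sized blocks: the parity block is the XOR of the three data blocks, and a single lost block can be rebuilt by XOR-ing the survivors.

    from functools import reduce

    def xor_blocks(*blocks: int) -> int:
        """XOR a set of equal-sized blocks together (here each 'block' is one byte)."""
        return reduce(lambda a, b: a ^ b, blocks)

    # Three data blocks on disks 1..3 (illustrative byte values).
    d1, d2, d3 = 0b10110010, 0b01101100, 0b11110000
    parity = xor_blocks(d1, d2, d3)          # stored on the 4th disk

    # Suppose disk 2 fails: its block is recovered from the remaining blocks plus parity.
    recovered_d2 = xor_blocks(d1, d3, parity)
    assert recovered_d2 == d2
    print(f"Recovered block: {recovered_d2:08b}")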

Optical Storage
Optical storage is a low-cost and reliable storage medium used in personal computers
for incremental data archiving. Optical storage is available in one of three basic formats:
Compact Disc (CD), Digital Versatile Disc (DVD) and Blu-ray Disc (BD). The per-unit
media costs of recordable CDs, DVDs and Blu-ray discs are very low. However, compared
with hard disk drives or tape drives, the capacities and speeds of optical discs are
comparatively low.


Solid State Drives:


Solid state drives do not contain any moving parts, unlike magnetic drives. Solid state
storage is based on flash memory and also appears as thumb drives (USB flash drives),
Memory Sticks and Secure Digital cards. SSDs are relatively expensive for their capacity
when compared to the other storage types, but are very convenient for backing up data.
SSDs are presently available with capacities from around 500 GB up to several TB.


Network and Online Storage:


Network storage is a method in which a user's data is stored and backed up onto the
company's network servers. Files stored online reside on a hard disk drive located remotely
from the user's computer and can be accessed from a remote location. As internet access
becomes more widespread, remote backup services are gaining popularity. Backing up user
data over the internet to a remote location helps protect it against worst-case scenarios such
as fires, floods or earthquakes, which would destroy any backups stored in the same location.


FC-AL (Fibre Channel Arbitrated Loop)


FC-AL is a Fibre Channel topology used to connect devices in a loop, similar to a
token ring network. A token is used to prevent data from colliding when two or more streams are
sent at the same time. FC-AL passes data using a one-way loop technique.
FC-AL technology eliminates the need for expensive Fibre Channel switches and allows several servers
and storage devices to be connected. The Fibre Channel arbitrated loop is also simply called an
arbitrated loop.
Fibre Channel has three major topologies for connecting ports:
• Switched Fabric: A network topology using crossbar switches that connect devices.

• Point-to-Point: Allows two-way data communication connecting one device to another.

• Arbitrated Loop: Devices are connected in a loop, and only two devices can communicate
at the same time. This is the topology used in FC-AL.
A Fibre Channel arbitrated loop can connect up to 127 ports (devices), with one port attached to the
fabric. In FC-AL only one port can transmit data at a time; FC-AL uses an arbitration signal to choose
the port. Once a port has been selected by the arbitration signal, it can use the Fibre Channel, a
gigabit-speed network technology used for network storage.


The FC-AL topology has the following properties:

• All devices share the same bandwidth in the loop
• Can be cabled using hub or loop technology
• If a port malfunctions in the loop, all ports stop working
• Two ports function as an arbitrated loop and not as a point-to-point connection
• 127 ports (devices) can be supported, with one port attached to the fabric
• Has a serial architecture compatible with the small computer system interface (SCSI)
The FC-AL topology is used where interconnection of storage devices is needed and where multiple
node connections require high-bandwidth connections.
FC-AL loops are of two types:
• Private loops: Private loops are not connected to the fabric, so the nodes are inaccessible to
nodes that do not belong to the loop.
• Public loops: Public loops are connected to the fabric through one FL Port and are accessible
to other nodes that are not part of the loop.


FABRIC
Storage Area Network (SAN) fabric is the hardware used to connect workstations and servers to
the storage devices in a SAN. Fibre Channel switching technology is used in a SAN fabric to
enable any-server-to-any-storage-device connectivity.
A fabric is a network topology used to connect network nodes with each other using one or more
network switches.
Switched Fabric in Fibre Channel:
Switched fabric is a topology in which the devices are connected to each other through one or
more Fibre Channel switches. This topology has the best scalability of the three Fibre Channel
topologies (the others being arbitrated loop and point-to-point), as the traffic is spread across
multiple physical links; it is also the only one requiring switches. The visibility among the various
devices, also called nodes, in a fabric is controlled with zoning.
There are two methods of implementing zoning:
•Hardware Zoning: Hardware zoning is a port-based zoning method. Physical ports are
assigned to a zone, and a port can be assigned to one zone or to multiple zones at the same time.
Devices are zoned according to the port they are connected to; if a device is shifted to a different
port, the zoning no longer applies to it.
•Software Zoning: Software zoning is an SNS-based zoning method. The device's physical
connectivity to a port does not play a role in the definition of zones. Even if a device is
connected to a different port, it remains in the same zone.

Storage Area Network:


A Storage Area Network (SAN) is a dedicated network that carries data between computer
systems and storage devices, which can include tape and disk resources. SAN forms a
communication infrastructure providing physical connections and consists of a management
layer, which organizes the connections, storage elements, and computer systems so that the
data transfer is secure and robust.


Zoning
Zoning in a storage area network is the allocation of resources for device load balancing
and for allowing only certain users access to certain data. Zoning is similar in concept to a file system.
Zoning can be of two kinds, hard or soft. In hard zoning, each device is assigned to a particular zone
and this assignment does not change. In soft zoning, device assignments can be changed to
accommodate variations in demand on different servers in the network.
Zoning is used to minimize the risk of data corruption, to help secure data against hackers and
to limit the spread of viruses and worms. The disadvantage of zoning is that it complicates scaling
if the number of users and servers in a SAN increases significantly.


Storage Virtualization
Storage virtualization is a concept in which the storage system uses virtualization to enable
better functionality and advanced features within and across storage systems.
Storage virtualization hides the complexity of the SAN by pooling multiple storage
devices together so that they appear as a single storage device.
Storage virtualization can be of three types:
•Host based: In host-based virtualization the virtualization layer is provided by a server and
presents a single drive to the applications. Host-based storage virtualization depends on
software at the server, often at the OS level. A volume manager is the tool used to enable this
functionality; the volume manager is configured so that several drives are presented as a single
resource which can be divided as needed.
•Appliance based: In appliance-based virtualization a hardware appliance is used which sits
on the storage network.
•Network based: Network-based virtualization is similar to the appliance-based approach except
that it works at the switching level. Appliance-based and network-based virtualization work at the
storage infrastructure level. Data can be migrated from one storage unit to another without
reconfiguring any of the servers; the virtualization layer handles the remapping.


Storage virtualization can be implemented using software applications.


The main reasons to implement storage virtualization are:

• Improved storage management in an IT environment

• Better availability with automated management

• Better storage utilization

• Less energy usage

• Increase in loading and backup speed


• Cost effective, no need to purchase additional software and hardware
Storage virtualization can be applied to different storage functions such as physical storage,
RAID arrays, LUNs, storage zones and logical volumes.
The two main types of virtualization are:
•Block Virtualization: The abstraction (separation) of logical storage (partition) from the
physical storage so that the partition can be accessed without regard to the physical storage.
The separation of the logical storage from the physical storage allows greater flexibility to
manage storage for the users.
•File Virtualization: File virtualization eliminates the dependencies between the data accessed
at the file level and the physical location where the files are stored. This optimizes storage
use and server consolidation, and also allows non-disruptive file migrations to be performed.

Operating Systems

The collection of software that manages computer hardware resources and provides common
services for computer programs is called an operating system. The operating system is required
for the functioning of application programs.
Various tasks, such as recognizing input from the keyboard, transferring output to the display,
keeping track of files and directories on the disk and controlling peripheral devices, are
performed by the operating system. It is also responsible for security, ensuring that
unauthorized users do not access the system. The operating system also provides a consistent
application interface, which is important when more than one type of computer uses the same
operating system. The Application Program Interface (API) allows a developer to write an
application on a particular computer and run the same application on another computer having
a different amount of memory or storage.
Users normally interact with the operating system through a set of commands. The
command processor, or command line interpreter, in the operating system accepts and executes
the commands. A graphical user interface allows users to enter commands by pointing
and clicking at objects on the screen. Examples of modern operating systems are Android, BSD,
iOS, Linux, Microsoft Windows and IBM z/OS.


A real-time operating system (RTOS) is one that is used to control machinery, scientific instruments
and industrial systems. An RTOS has minimal user-interface capabilities and no end-user
interface; the system is a sealed box when delivered for use. An RTOS executes a
particular operation in precisely the same amount of time every time it occurs.


Features of the operating system:

•Multi-User: A multi-user operating system allows two or more users to run
programs at the same time.
•Multiprocessing: A multiprocessing operating system supports running a program
on more than one CPU.
•Multitasking: A multitasking operating system allows more than one program
to run concurrently.
•Multithreading: A multithreading operating system allows different parts of a single
program to run concurrently.
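As a small illustration of the multithreading feature, the sketch below runs two parts of one program concurrently using Python's standard threading module; the worker function and its timings are made up for the example.

    import threading
    import time

    def worker(name: str, delay: float) -> None:
        """One part of the program; several of these run concurrently."""
        for i in range(3):
            time.sleep(delay)
            print(f"{name}: step {i}")

    # Two threads of the same program executing concurrently.
    t1 = threading.Thread(target=worker, args=("thread-A", 0.1))
    t2 = threading.Thread(target=worker, args=("thread-B", 0.15))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print("both parts finished")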


Cloud Performance Tuning


The operating system performs the following system tasks (a short sketch of inspecting running
processes follows this list):
•Process management: Process management is responsible for ensuring that each process
and application receives the processor time it requires to function and that as many processor
cycles as possible are put to use. Depending on the operating system, the basic unit of software
that the operating system deals with when scheduling the work done by the processor is called a
process or a thread. A process is software that performs actions and can be controlled by a
user, by other applications or by the operating system.

•Memory management: Each process must have enough memory in which to execute, and it should
not run into the memory space of another process. The various types of memory in the system must
be used properly so that each process can run effectively. Memory management is responsible
for accomplishing this. The operating system sets up memory boundaries for types of software and
for individual applications.

•Device management: A driver is a program used to control the path between the operating
system and the hardware on the computer's motherboard. The driver acts as a translator between the
electrical signals of the hardware and the high-level programming languages of the operating system. The
operating system assigns high-priority blocks to drivers so that the hardware resource can be released
and made ready for further use as quickly as possible.

•File management: File management, also known as the file system, is the system that the operating
system uses to organize and keep track of files. A hierarchical file system uses directories to organize
files into a tree structure. An entry is created by the operating system in the file system to record the start
and end locations of a file, the file name, the file type, archiving information, the user's permissions to read
and modify the file, and the date and time of the file's creation.
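To make the process-management task above concrete (as noted before the list), here is a minimal sketch that lists running processes and the CPU time each has received so far, assuming the third-party psutil library is installed.

    import psutil  # third-party library exposing OS process information

    # Walk the process table much as a simple task monitor would.
    for proc in psutil.process_iter(attrs=["pid", "name", "cpu_times"]):
        info = proc.info
        cpu = info["cpu_times"]
        if cpu is None or info["name"] is None:   # some entries may deny access
            continue
        total = cpu.user + cpu.system             # CPU seconds consumed so far
        print(f"{info['pid']:>6}  {info['name']:<25} {total:8.2f} s")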

Virtualization overview:
The technique in which one physical resource is split into multiple virtual resources is called
virtualization. In this way a single physical resource can be used for multiple purposes and
perform the required actions.
The benefits of virtualization are:
• Reduced hardware cost - a single server acts as multiple servers.
• Optimized workload - resources are shared dynamically.
• IT responsiveness and flexibility - it gives a single consolidated view of, and access to, all
available resources.

Virtualization, or system virtualization, is the creation of many virtual systems within a single
physical system. These virtual systems run independent operating systems on virtual resources.


Hypervisor:
A hypervisor is a technique that allows multiple operating systems to run on a single piece of
hardware at the same time, with the hardware virtualized. It is also called a virtual machine manager.
Hypervisor types:
There are basically two types of hypervisor, type 1 and type 2.
•A type 1 hypervisor runs directly on the physical hardware, without an intermediate operating
system. It is also called a "bare metal" hypervisor. In this case the hypervisor itself is the
operating system managing the hardware. This layer is more efficient than a type 2 hypervisor
because there is no host OS in between, so the physical resources directly serve the needs of the
individual virtual machines. This hypervisor is built only to host other operating systems.
Most enterprise hypervisors are type 1 hypervisors.


•A type 2 hypervisor runs on top of an operating system, for example Virtual PC or VirtualBox. In this
case there are two layers beneath each virtual machine: the operating system and the hypervisor.
It is more of an application that is installed over an operating system.


Hypervisor features:
Most organizations run their servers in virtual environments in their data centers. This helps
them carry their workloads with high availability and better performance.
Operating systems and workloads can be consolidated onto one server, reducing the cost of
operations and hardware.
Multiple operating systems can be run on a single piece of hardware at the same time, each running
applications according to its requirements.
Dynamic assignment of virtual resources to physical resources is possible through methods such as
dispatching and paging.
Workloads are managed with ease on a single server to improve performance, system utilization and
cost.


Network Topologies
You may have come across the word topology used in networks. “Topology” refers to the
physical arrangement of network components and media within a networking structure.
Topology Types
There are four primary kinds of LAN topologies: bus, tree, star, and ring.
Bus topology

Bus topology is a linear LAN architecture in which transmissions from network components
propagate through the length of the medium and are received by all other components. A bus is
the common signal path directed through wires or other media across one part of a network to
another. Bus topology is commonly implemented by IEEE 802.3/Ethernet networks.

Tree topology
Tree topology is similar to the bus topology. A tree network can contain multiple nodes and branches. As in
a bus topology, transmissions from network components propagate through the length of the medium and
are received by all other components.
Disadvantage of the bus topology compared with the tree topology is that the entire network goes down if
the connection to any one user is broken, disrupting communication between all users.
Advantages – less cabling required, hence cost-effective.
Star Topology

Star topology is a kind of LAN topology in which the network is created with the help of a common
central switch or hub, and the nodes are connected to it as in a hub-and-spoke model (point-to-point
connections). A star network is often combined with a bus topology; the central hub is then
connected to the backbone of the bus. This combination is called a tree.
Advantages – disruption of the connection between a node and the central device will not hinder
communication between the other nodes, and the network remains functional.
Disadvantage – more cabling is required, so it is a bit higher on the cost factor.

Ring topology
Ring topology is a single closed loop formed by unidirectional transmission links between a
series of repeaters linked to each other. A repeater connects each station to the network.
Even though logically a ring, ring topologies are mostly cabled as a closed-loop star; such a
topology works as a unidirectional closed loop rather than as a set of point-to-point links.
Token Ring is an example of ring topology.
In the event of a connection failure between two components, redundancy is used to avoid
collapse of the entire ring.


OSI Model
To define and abstract network communications we use the Open Systems Interconnection
(OSI) model. The OSI model has seven logical layers, listed below.
Each logical layer has a well-defined function:
OSI model functionality by layer
Layer 7 - Application: Interaction with application software

Layer 6 - Presentation: Data formatting

Layer 5 - Session: Host-to-host connection management

Layer 4 - Transport: Host-to-host data transfer

Layer 3 - Network: Addressing and routing

Layer 2 - Data Link: Local network data transfer

Layer 1 - Physical: Physical hardware

Switching Concepts

Switching Concepts (continued)


Application-specific integrated circuits (ASICs) are used in switches to build and maintain their
filter tables.
A layer 2 switch can be called a multiport bridge, as it breaks up collision domains.
Layer 2 switches are faster than routers because they do not look at the Network layer header
information; they look only at the frame's hardware addresses to decide whether to forward,
flood, or drop the frame. Independent bandwidth is provided on each port of the switch in
order to create dedicated collision domains.
Layer 2 switching provides the following:
• Hardware-based bridging (ASIC)
• Wire speed
• Low latency
• Low cost
The efficiency of layer 2 switching is maintained because no modification is made to the data
packet. The switching process is less error-prone and considerably faster than routing
processes because the device reads only the frame encapsulation of the packet. The
increased bandwidth provided by the individual collision domain on each interface makes it
possible to connect multiple devices to each interface.
Switch Functions at Layer 2


Basically, the functions of a layer 2 switch are address learning, forward/filter decisions and
loop avoidance.
Address learning
Layer 2 switches maintain a MAC database that records the source hardware address learned
when each frame is received on an interface. This information is also known as a
forward/filter table.
Forward/filter decisions
When a frame is received on an interface, the switch looks up the destination hardware
address in the MAC database and forwards the frame out of the specified destination port.
Loop avoidance
Network loops can occur when multiple connections are created between switches for
redundancy purposes. The Spanning Tree Protocol (STP) is used to avoid network loops while
maintaining redundancy.
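A toy model of address learning and forward/filter decisions is sketched below: the forward/filter table is just a dictionary from MAC address to port, learned from source addresses and consulted for destinations. The frame fields and port numbers are invented for the illustration.

    class Layer2Switch:
        """Minimal model of a learning switch's forward/filter table."""

        def __init__(self, num_ports: int) -> None:
            self.num_ports = num_ports
            self.mac_table: dict[str, int] = {}   # MAC address -> port (forward/filter table)

        def receive(self, in_port: int, src_mac: str, dst_mac: str) -> list[int]:
            # Address learning: remember which port the source MAC lives on.
            self.mac_table[src_mac] = in_port
            # Forward/filter decision: known destination -> one port, unknown -> flood.
            if dst_mac in self.mac_table:
                return [self.mac_table[dst_mac]]
            return [p for p in range(self.num_ports) if p != in_port]

    sw = Layer2Switch(num_ports=4)
    print(sw.receive(0, "aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb"))  # unknown dst: flood [1, 2, 3]
    print(sw.receive(1, "bb:bb:bb:bb:bb:bb", "aa:aa:aa:aa:aa:aa"))  # learned dst: forward [0]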
Limitations of Layer 2 Switching
Layer 2 switches do not break up broadcast domains by default, which limits the size and growth
potential of the network and reduces overall performance.
The slow convergence time of spanning trees, together with the growth of broadcasts and
multicasts as the network expands, is a major hitch, and is the reason layer 2 switches are
supplemented by routers (layer 3 devices) in an internetwork.

Switch Vs Hub Vs Router
A switch is a networking device that can transmit incoming data to the destination port without disturbing
any other ports. "Without disturbing any other ports" means that the switch can differentiate between
two different computers by using their MAC addresses. Switches are mostly used in local area networks.
A switch is also known as a layer 2 device because it works at the Data-Link layer. We can also say that the
switch acts as the manager of data transmission for all the computers connected to it, because the switch
holds the information about what is being transmitted between those computers.

A hub acts as a repeater which transfers, or broadcasts, the incoming data to all the computers connected
to it. The biggest drawback of the hub is that it broadcasts every data packet to the whole network; to
overcome this limitation, we use switches in place of hubs. A hub is a layer 1 device because it works at
the Physical layer.

The router is a more intelligent device than the switch and the hub. The router is used to connect two or
more different networks and to maintain the flow of data between them. It is also known as a layer 3
device because routers work at the Network layer, which is layer 3. For example, if we have to connect
two MANs (Metropolitan Area Networks), we use routers to connect the two different networks.

What is Routing?
Forwarding packets from one network to another and determining the path using various metrics are
the basic functions of a router. The load on the link between devices, delay, bandwidth, reliability and
even hop count are the metrics of routing.
Routers can do all that switches do, and more. Routers look into the destination and source IP
addresses in the network header, which is added during packet encapsulation at the network layer.
What is a VLAN?
A VLAN is a virtual LAN. Switches create broadcast domains by means of VLANs. Normally a router
creates broadcast domains, but with the help of VLANs a switch can create broadcast domains as well.
The administrator places some switch ports in a VLAN other than the default VLAN (VLAN 1).
The ports in a single VLAN form a single broadcast domain.
When switches communicate with each other, some ports on switch A can be placed in VLAN 5 and
other ports on switch B can also be in VLAN 5. In such a case the broadcasts between
these two switches will not be seen on any port in any other VLAN, other than VLAN 5.
The devices that are in VLAN 5 can all communicate among themselves as they are
on the same VLAN. Communication with the devices of other VLANs is not possible without
extra configuration.


When do I need a VLAN?


VLANs are required in the following situations:
• when you have a large number of devices on your LAN
• when there is a lot of broadcast traffic on your LAN
• when a group of users requires security, or slowdowns due to broadcasts must be avoided
• when a group of users running the same applications needs to be in a separate broadcast
domain, away from regular users
• when multiple virtual switches are required within a single switch


Power management

Power management is a feature of some electrical appliances, especially copiers, computers and
computer peripherals such as monitors and printers, that turns off the power or switches the system to
a low-power state when inactive. In computing this is known as PC power management and is built
around a standard called ACPI (Advanced Configuration and Power Interface). This supersedes APM.
All recent (consumer) computers have ACPI support.

PC power management for computer systems is desired for many reasons, particularly:
• Reduce overall energy consumption
• Prolong battery life for portable and embedded systems
• Reduce cooling requirements

• Reduce noise
• Reduce operating costs for energy and cooling
• Lower power consumption also means lower heat dissipation, which increases system stability, and
less energy use, which saves money and reduces the impact on the environment.


Processor level techniques


The power management for microprocessors can be done for the whole processor, or in specific areas.
With dynamic voltage scaling and dynamic frequency scaling, the CPU core voltage, clock rate, or both, can be
altered to decrease power consumption at the price of potentially lower performance. This is sometimes done
in real time to optimize the power-performance tradeoff.
Examples:
• AMD Cool'n'Quiet
• AMD PowerNow!
• IBM EnergyScale
• Intel SpeedStep
• Transmeta LongRun and LongRun2
• VIA LongHaul (PowerSaver)

Additionally, processors can selectively power off internal circuitry (power gating). For example: Newer Intel
Core processors support ultra-fine power control over the functional units within the processors.

AMD CoolCore technology gets more efficient performance by dynamically activating or turning off parts of the
processor.

Intel VRT technology splits the chip into a 3.3V I/O section and a 2.9V core section. The lower core voltage
reduces power consumption.
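On Linux systems that expose the kernel's cpufreq interface, the frequency-scaling settings discussed above can be inspected from user space. The sketch below simply reads the sysfs attributes for CPU 0; the exact files present vary by driver and kernel version, so treat the paths as typical rather than guaranteed.

    from pathlib import Path

    CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")   # typical location; may vary

    def read(name: str) -> str:
        """Return the contents of one cpufreq attribute, or 'n/a' if it is absent."""
        try:
            return (CPUFREQ / name).read_text().strip()
        except OSError:
            return "n/a"

    print("governor:          ", read("scaling_governor"))   # e.g. 'powersave' or 'performance'
    print("current freq (kHz):", read("scaling_cur_freq"))
    print("min/max (kHz):     ", read("scaling_min_freq"), "/", read("scaling_max_freq"))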

Heterogeneous computing
ARM's big.LITTLE architecture can migrate processes between faster "big" cores and more
power efficient "LITTLE" cores.
Hibernation in computing
When a computer system hibernates it saves the contents of the RAM to disk and powers down
the machine. On startup it reloads the data. This allows the system to be completely powered
off while in hibernate mode. This requires a file the size of the installed RAM to be placed on the
hard disk, potentially using up space even when not in hibernate mode. Hibernate mode is
enabled by default in some versions of Windows and can be disabled in order to recover this
disk space.


Advanced Power Management


Advanced power management (APM) is an API developed by Intel and Microsoft and released
in 1992 which enables an operating system running an IBM-compatible personal computer to
work with the BIOS (part of the computer's firmware) to achieve power management.
Revision 1.2 was the last version of the APM specification, released in 1996. ACPI is intended
as the successor to APM. Microsoft dropped support for APM in Windows Vista. The Linux
kernel still mostly supports APM, with the last fully functional APM support shipping in 3.3.
APM uses a layered approach to manage devices. APM-aware applications (which include
device drivers) talk to an OS-specific APM driver. This driver communicates to the APM-aware
BIOS, which controls the hardware. There is the ability to opt-out of APM control on a device-
by-device basis, which can be used if a driver wants to communicate directly with a hardware
device.


The layers in APM.


Communication occurs both ways; power management events are sent from the BIOS to the
APM driver, and the APM driver sends information and requests to the BIOS via function calls.
In this way the APM driver is an intermediary between the BIOS and the operating system.
Power management happens in two ways; through the above mentioned function calls from the
APM driver to the BIOS requesting power state changes, and automatically based on device
activity.


Advanced Configuration and Power Interface


In computing, the Advanced Configuration and Power Interface (ACPI) specification provides an
open standard for device configuration and power management by the operating system.
First released in December 1996, ACPI defines platform-independent interfaces for hardware
discovery, configuration, power management and monitoring. With the intention of replacing
Advanced Power Management, the MultiProcessor Specification and the Plug and Play BIOS
Specification, the standard brings power management under the control of the operating
system, as opposed to the previous BIOS-central system which relied on platform-specific
firmware to determine power management and configuration policy.[2] The specification is
central to Operating System-directed configuration and Power Management (OSPM), a system
implementing ACPI which removes device management responsibilities from legacy firmware
interfaces.
The standard was originally developed by Intel, Microsoft and Toshiba, and was later joined by
HP and Phoenix. The latest version is "Revision 5.0", which was published on 6 December
2011. As the ACPI technology gained wider adoption with many operating systems and
processor architectures, the desire to improve the governance model of the specification has
increased significantly. In October 2013, original developers of the ACPI standard agreed to
transfer all assets to the UEFI Forum, where all future development will be taking place.


BatteryMAX (idle detection)


BatteryMAX is an Idle Detection System used for computer power management developed at
Digital Research, Inc.'s European Development Centre (EDC) in Hungerford, UK. It was
invented by British-born engineers Roger Gross and John Constant in August 1989 and was
first released with DR DOS 5.0. It was created to address the new genre of portable personal
computers (lap-tops) which ran from battery power. As such, it was also an integral part of
Novell's PalmDOS 1.0 operating system tailored for early palmtops in 1992.
Power saving in laptop computers traditionally relied on hardware inactivity timers to determine
whether a computer was idle. It would typically take several minutes before the computer could
identify idle behavior and switch to a lower power consumption state. By monitoring software
applications from within the operating system, BatteryMAX is able to reduce the time taken to
detect idle behavior from minutes to microseconds. Moreover it can switch power states around
20 times a second between a user's keystrokes. The technique was named Dynamic Idle
Detection and includes halting, or stopping the CPU for periods of just a few microseconds until
a hardware event occurs to restart it.
DR DOS 5.0 was the first personal computer operating system to incorporate an Idle Detection
System for power management. A US patent describing the Idle Detection System was filed on
9 March 1990 and was granted on 11 October 1994.


CPU power dissipation


Central Processing Unit power dissipation or CPU power dissipation is the process in which
central processing units (CPUs) consume electrical energy, and dissipate this energy both by
the action of the switching devices contained in the CPU (such as transistors or vacuum tubes)
and by the energy lost in the form of heat due to the impedance of the electronic circuits.

CPU power management


• Designing CPUs that perform tasks efficiently without overheating is a major consideration of
nearly all CPU manufacturers to date. Some implementations of CPUs use very little power;
for example, the CPUs in mobile phones often use just a few hundred milliwatts of electricity.
Some microcontrollers, used in embedded systems may use a few milliwatts. In comparison,
CPUs in general purpose personal computers, such as desktops and laptops, dissipate
significantly more power because of their higher complexity and speed. These microelectronic
CPUs may consume power in the order of a few watts to hundreds of watts. Historically, early
CPUs implemented with vacuum tubes consumed power on the order of many kilowatts.


• CPUs for desktop computers typically use a significant portion of the power consumed by the
computer. Other major uses include fast video cards, which contain graphics processing
units, and the power supply. In laptops, the LCD's backlight also uses a significant portion of
overall power. While energy-saving features have been instituted in personal computers for
when they are idle, the overall consumption of today's high-performance CPUs is
considerable. This is in strong contrast with the much lower energy consumption of CPUs
designed for low-power devices. One such CPU, the Intel XScale, can run at 600 MHz with
only half a watt of power, whereas x86 PC processors from Intel in the same performance
bracket consume roughly eighty times as much energy.

Implications of increased clock frequencies
Processor manufacturers consistently delivered increases in clock rates and instruction-level parallelism, so that
single-threaded code executed faster on newer processors with no modification. Now, to manage CPU power
dissipation, processor makers favor multi-core chip designs, and software has to be written in a multi-threaded or
multi-process manner to take full advantage of the hardware. Many multi-threaded development paradigms
introduce overhead, and will not see a linear increase in speed vs number of processors. This is particularly true
while accessing shared or dependent resources, due to lock contention. This effect becomes more noticeable as
the number of processors increases. Recently, IBM has been exploring ways to distribute computing power more
efficiently by mimicking the distributional properties of the human brain.

Energy Star
• Energy Star (trademarked ENERGY STAR) is an international standard for energy efficient
consumer products originated in the United States of America. The EPA estimates that it
saved about $14 billion in energy costs in 2006 alone. The Energy Star program has helped
spread the use of LED traffic lights, efficient fluorescent lighting, power management systems
for office equipment, and low standby energy use.


VESA Display Power Management Signaling


VESA Display Power Management Signaling (DPMS) is a standard from the VESA
(Video Electronics Standards Association) consortium for managing the power supply of video
monitors for computers through the graphics card, e.g., shutting off the monitor after the
computer has been unused for some time (idle), to save power.
DPMS 1.0 was issued by VESA in 1993, and was based on the United States Environmental
Protection Agency's (EPA) earlier Energy Star power management specifications. Subsequent
revisions were rolled into future VESA BIOS Extensions.

Successfully managing performance ensures that your system is
efficiently using resources and that your server provides the best
possible services to your users and to your business needs. Moreover,
effective performance management can help you quickly respond to
changes in your system.

Elements of Performance Tuning


The fundamental element of resource management is the discovery process. It involves
searching for suitable physical resources, matching the user's request, on which the virtual
machines are to be created.
Resource scheduling selects the best resource from the matched physical resources. In effect it
identifies the physical resource on which the virtual machines are to be created in order to
provision resources from the cloud infrastructure.
Resource allocation allocates the selected resource to the job or task in the user's request; in
practice this means submitting the job to the selected cloud resource. After the job has been
submitted, the resource is monitored.


Resource Management :
Resource management includes processes such as resource discovery (which includes resource
scheduling), resource allocation and resource monitoring. These processes manage
resources such as disk space, CPU cores and network bandwidth. These resources must
be further sliced and shared among different virtual machines, which mostly run heterogeneous
workloads.
The taxonomy of resource management elements is depicted in the figure below.


The fundamental element of resource management is the discovery process. It involves
searching for the appropriate available resource types that match the application requirements.
The process is managed by the cloud service provider and is carried out by a resource broker or
user broker to discover the available resources.
Discovery produces a detailed description of the resources available. Resource discovery provides
a way for a resource management system (RMS) to determine the state of the resources that are
managed by it and by the other RMSs that interoperate with it. Resource discovery works with the
dissemination of resources to provide information about the state of resources to the information
server.
The allocation process is the process of assigning available resources to the cloud applications
that need them over the internet. These resources are allocated based on the user's request and
on a pay-per-use model. In this process, scheduling and dispatching methods are applied to
allocate the resources: the scheduler schedules the assigned resources for the client, and the
dispatcher then allocates the assigned resources to the client.
Resource monitoring is a key tool for controlling and managing hardware and software
infrastructures. It also provides information and Key Performance Indicators (KPIs) for both
platforms and applications in the cloud, used for data collection to assist in deciding how to
allocate the resources. It is also a key component for monitoring the state of the resources in the
event of failure, whether at the physical layer or at the services layer.
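The discovery-scheduling-allocation flow described above can be sketched in a few lines: filter the hosts that can satisfy the request, pick the best match, and reserve the capacity on it. The Host fields and the "least-loaded" selection rule are illustrative assumptions, not a description of any particular cloud platform.

    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        free_cpus: int
        free_mem_gb: int

    def discover(hosts: list[Host], cpus: int, mem_gb: int) -> list[Host]:
        """Discovery: find physical resources that can hold the requested VM."""
        return [h for h in hosts if h.free_cpus >= cpus and h.free_mem_gb >= mem_gb]

    def schedule_and_allocate(hosts: list[Host], cpus: int, mem_gb: int) -> Host | None:
        """Scheduling: pick the least-loaded candidate; allocation: reserve capacity on it."""
        candidates = discover(hosts, cpus, mem_gb)
        if not candidates:
            return None                         # no suitable resource discovered
        best = max(candidates, key=lambda h: (h.free_cpus, h.free_mem_gb))
        best.free_cpus -= cpus                  # allocation: commit the resources
        best.free_mem_gb -= mem_gb
        return best

    pool = [Host("host-a", 4, 16), Host("host-b", 16, 64)]
    print(schedule_and_allocate(pool, cpus=8, mem_gb=32))   # VM lands on host-b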


Performance Analysis and Performance Monitoring Tools


The first step in a strategy for managing system performance is to set measurable objectives.
You can begin by setting goals that match the demands of your business and identifying areas
of the system where performance improvements can have a positive effect.
The tasks that follow make up a performance strategy. Implementing a performance strategy is
an iterative process that begins with defining your performance goals or objectives, and then
repeating the rest of the tasks until you accomplish your goals.
• Set performance goals.
  - Set goals that match the demands of your business.
  - Identify the areas of the system where an improvement in performance can affect your
    business.
  - Set goals that can be measured.
  - Make the goals reasonable.
• Collect performance data continuously with Collection Services.
• Always save performance measurement data before installing a new software release, a
  major hardware upgrade, a new application, or a large number of additional workstations
  or jobs.
• Your data should include typical medium to heavy workloads.


• Check and analyze performance data.

  – Summarize the collected data and compare the data to objectives or resource guidelines.
  – Perform monthly trend analysis that includes at least the previous three months of
    summarized data. As time progresses, include at least six months of summary data to
    ensure that a trend is consistent. Make decisions based on at least three months of trend
    information and your knowledge of upcoming system demands (a short trend-analysis
    sketch follows this list).
  – Analyze performance data to catch situations before they become problems. Performance
    data can point out objectives that have not been met. Trend analysis shows you whether
    resource consumption is increasing significantly or performance objectives are
    approaching or exceeding guideline values.

• Tune performance whenever guidelines are not met.

• Plan for capacity when:

  – Trend analysis shows significant growth in resource utilization.
  – A major new application, additional interactive workstations, or new batch jobs will be
    added to the current hardware configuration.
  – You have reviewed business plans and expect a significant change.
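A minimal version of the monthly trend analysis mentioned above: given a few months of summarized utilization figures (made-up numbers here), compute the average month-over-month growth and flag it if it exceeds an arbitrary guideline value.

    # Monthly average CPU utilization (%) from summarized collections - illustrative values.
    monthly_cpu = {"Jan": 52.0, "Feb": 55.5, "Mar": 58.9, "Apr": 63.1, "May": 67.8, "Jun": 72.4}

    values = list(monthly_cpu.values())
    growth = [b - a for a, b in zip(values, values[1:])]    # month-over-month change
    avg_growth = sum(growth) / len(growth)

    print(f"Average growth: {avg_growth:.1f} percentage points per month")
    if avg_growth > 3.0:                                    # arbitrary guideline value
        print("Resource consumption is trending up significantly - plan for capacity.")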


Collecting performance data for analysis


Collecting data is an important step toward improving performance. When you collect
performance data, you gather information about your server that can be used to understand
response times and throughput. It is a way to capture the performance status of the server, or
set of servers, involved in getting your work done. The collection of data provides a context, or a
starting point, for any comparisons and analysis that can be done later. When you use your first
data collections, you have a benchmark for future improvements and a start on improving your
performance today. You can use the performance data you collect to make adjustments,
improve response times, and help your systems achieve peak performance. Performance
problem analysis often begins with the simple question: "What changed?" Performance data
helps you answer that question.
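One lightweight way to capture such a baseline, assuming the third-party psutil library is available, is to append a timestamped snapshot of key metrics to a CSV file; later collections can then be compared against it to answer "what changed?". The file name is an illustrative choice.

    import csv
    import time
    from pathlib import Path
    import psutil  # third-party library for system metrics

    SNAPSHOT_FILE = Path("perf_baseline.csv")   # illustrative file name

    def record_snapshot() -> None:
        """Append one timestamped row of basic server metrics to the baseline file."""
        row = [
            time.strftime("%Y-%m-%d %H:%M:%S"),
            psutil.cpu_percent(interval=1),
            psutil.virtual_memory().percent,
            psutil.disk_usage("/").percent,
        ]
        new_file = not SNAPSHOT_FILE.exists()
        with SNAPSHOT_FILE.open("a", newline="") as f:
            writer = csv.writer(f)
            if new_file:
                writer.writerow(["timestamp", "cpu_pct", "mem_pct", "disk_pct"])
            writer.writerow(row)

    record_snapshot()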


System workload
An accurate and complete definition of a system's workload is critical to predicting or
understanding its performance.
A difference in workload can cause far more variation in the measured performance of a system
than differences in CPU clock speed or random access memory (RAM) size. The workload
definition must include not only the type and rate of requests sent to the system, but also the
exact software packages and in-house application programs to be executed.
Workloads can be classified into the following categories:
Multiuser
A workload that consists of a number of users submitting work through individual terminals.
Typically, the performance objectives of such a workload are either to maximize system
throughput while preserving a specified worst-case response time or to obtain the best possible
response time for a constant workload.
Server
A workload that consists of requests from other systems. For example, a file-server workload
is mostly disk read and disk write requests. It is the disk-I/O component of a multiuser workload
(plus NFS or other I/O activity), so the same objective of maximum throughput within a given
response-time limit applies. Other server workloads consist of items such as math-intensive
programs, database transactions, printer jobs.
Workstation
A workload that consists of a single user submitting work through a keyboard and receiving
results on the display of that system. Typically, the highest-priority performance objective of
such a workload is minimum response time to the user's requests.

Performance objectives

After defining the workload that your system will have to process, you can choose performance
criteria and set performance objectives based on those criteria.
The overall performance criteria of computer systems are response time and throughput.
Response time is the elapsed time between when a request is submitted and when the
response from that request is returned. Examples include:

 The amount of time a database query takes


 The amount of time it takes to echo characters to the terminal
 The amount of time it takes to access a web page

Throughput is a measure of the amount of work that can be accomplished over some unit of
time. Examples include:
 Database transactions per minute
 Kilobytes of a file transferred per sec
 Kilobytes of a file read or written per sec
 Web server hits per sec
Innovation Centre for Education

The relationship between these metrics is complex. Sometimes you can have higher throughput
at the cost of response time or better response time at the cost of throughput. In other
situations, a single change can improve both. Acceptable performance is based on reasonable
throughput combined with reasonable response time.
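For example, one of the throughput metrics listed above, web server hits per second, can be approximated from an access log with standard text tools. This is only an illustrative sketch; it assumes an Nginx/Apache-style log at /var/log/nginx/access.log whose fourth field is a timestamp such as [10/Oct/2023:13:55:36:

$ awk '{print $4}' /var/log/nginx/access.log | cut -c2-21 | sort | uniq -c | sort -rn | head
  # counts requests per one-second timestamp and shows the busiest seconds first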

In planning for or tuning any system, make sure that you have clear objectives for both
response time and throughput when processing the specified workload. Otherwise, you risk
spending analysis time and resource dollars improving an aspect of system performance that
is of secondary importance.
Performance monitoring tools
There are numerous performance analysis and monitoring tools; the following sections discuss
some of the most common ones.
nmon
nmon is a free tool to analyze AIX and Linux performance. This free tool gives you a huge
amount of information all on one screen.

72
Innovation Centre for Education

The nmon tool is designed for AIX and Linux performance specialists to use for monitoring and
analyzing performance data, including:
– CPU utilization
– Memory use
– Kernel statistics and run queue information
– Disk I/O rates, transfers, and read/write ratios
– Free space on file systems
– Disk adapters
– Network I/O rates, transfers, and read/write ratios
– Paging space and paging rates
– CPU and AIX specification
– Top processes
– IBM HTTP Web cache
– User Defined disk groups
– Machine details and resources
– Asynchronous I/O -- AIX only
– Workload Manager (WLM) -- AIX only
– IBM TotalStorage® Enterprise Storage Server® (ESS) disks -- AIX only
– Network File System (NFS)
– Dynamic LPAR (DLPAR) changes -- only pSeries p5 and OpenPower for either AIX or
Linux
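As a usage sketch (the flag values here are arbitrary examples), nmon can be run interactively to toggle the views listed above, or in capture mode to write the same data to a file for later analysis:

$ nmon                     # interactive mode; press c, m, d, n to toggle the CPU, memory, disk and network views
$ nmon -f -s 60 -c 120     # capture mode: one snapshot every 60 seconds, 120 snapshots (2 hours), written to a .nmon file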
73
Innovation Centre for Education

Performance analysis with the trace facility


The operating system's trace facility is a powerful system-observation tool.
The trace facility captures a sequential flow of time-stamped system events, providing a fine
level of detail on system activity. Events are shown in time sequence and in the context of other
events. Trace is a valuable tool for observing system and application execution. Unlike other
tools that only provide CPU utilization or I/O wait time, trace expands that information to aid in
understanding what events are happening, who is responsible, when the events are taking
place, how they are affecting the system and why.
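On AIX, a minimal trace session might look like the following sketch (it assumes the default trace log location and a short 10-second capture window):

$ trace -a                   # start tracing asynchronously in the background
$ sleep 10                   # let the workload of interest run while events are captured
$ trcstop                    # stop the trace
$ trcrpt -o /tmp/trace.rpt   # format the raw trace log into a readable report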
Simple Performance Lock Analysis Tool (splat)
The Simple Performance Lock Analysis Tool (splat) is a software tool that generates reports on
the use of synchronization locks. These include the simple and complex locks provided by the
AIX kernel, as well as user-level mutexes, read and write locks, and condition variables provided
by the PThread library. The splat tool is not currently equipped to analyze the behavior of the
Virtual Memory Manager (VMM) and PMAP (process-mapping) locks used in the AIX kernel.

74
Innovation Centre for Education

Analyzing collected performance data


The key to analyzing performance data is to do it on a regular basis. Just like performance
monitoring, analyzing is an ongoing process. Performance analysis is a method for
investigating, measuring, and correcting deficiencies so that system performance meets the
user's expectations. It does not matter that the system is a computer; it could be an automobile
or a washing machine. The problem-solving approach is essentially the same:
– Understand the symptoms of the problem.
– Use tools to measure and define the problem.
– Isolate the cause.
– Correct the problem.
– Use tools to verify the correction.

75
Innovation Centre for Education

The following tasks help you to analyze your data. These tasks allow you to focus on the
problem, isolate the cause, and make the corrections. Use these same tasks to verify the
corrections.
View collected performance data
Look at the data that you collected with Performance Tools.
Summarize performance data into reports
Performance reports provide a way for you to effectively research areas of the system that
are causing performance problems.
Create graphs of performance data to show trends
Use historical data to show the changes in resource utilization on your system over time.
Work with performance database files
Find information about the performance database files -- what they contain, how to use them,
and how to create them.

76
IBM application for Performance Management:
The data that Performance Explorer collects is stored in Performance Explorer database files.

The following table shows the Performance Explorer (PEX) data files collected by the system when
using data collection commands. Type the Display File Field Description (DSPFFD) command as follows
to view the contents for a single file:

DSPFFD FILE(xxxxxxxxx)
where xxxxxxxxx is the name of the file that you want to display.

Type of information contained in file File name


Trace Resources Affinity QAYPEAFN
Auxiliary storage management event data QAYPEASM
Auxiliary storage pool (ASP) information data QAYPEASPI
Base event data QAYPEBASE
Basic configuration information QAYPECFGI

Communications event data QAYPECMN

Disk event data QAYPEDASD


Disk server event data QAYPEDSRV

Event type and subtype mapping QAYPEEVENT

File Serving event data QAYPEFILSV

Configured filter information QAYPEFTRI

Performance measurement counter (PMC) selection QAYPEFQCFG

Heap event data QAYPEHEAP


Unit 2
Fundamentals of High Availability
High Availability (HA)
• In the IT world, high availability means that a
system operates continuously (100 percent
operational) without downtime, despite
failures in hardware, software, or applications.
High Availability (HA)
A high availability (HA) setup is an infrastructure
without a single point of failure.
It prevents a single server failure from being a
downtime event by adding redundancy to
every layer of your architecture.
A load balancer facilitates redundancy for the
backend layer (web/app servers), but for a
true high availability setup, you need to have
redundant load balancers as well.
What is HA Clustering ?
• One service goes down => others take over its
work
• IP address takeover, service takeover
• Not designed for high performance
• Not designed for high throughput (load
balancing)
Does it Matter ?
• Downtime is expensive
• You miss out on service; you miss out on
opportunity
• Your boss complains; your boss's boss
complains
• New users don't return
The Rules of HA
• Keep it Simple
• Prepare for Failure
• Complexity is the enemy
• Test your HA setup
Myths
• Virtualization will solve your HA Needs
• Live migration is the solution to all your
problems
• HA will make your platform more stable
What do you care about?
Your data?
• Consistent
• Realtime
• Eventual consistency
Your Connection
• Always
• Most of the time
Eliminating the SPOF
Find out what Will Fail
• Disks
• Fans
• Power (Supplies)
Find out what Can Fail
• Network
• Running out of memory
High Availability/Clustering in Linux

How to Configure and Maintain High
Availability/Clustering in Linux
High Availability/Clustering in Linux
• High Availability (HA) simply refers to a quality
of a system to operate continuously without
failure for a long period of time.
• HA solutions can be implemented using
hardware and/or software, and one of the
common solutions to implementing HA is
clustering.
High Availability/Clustering in Linux
• In computing, a cluster is made up of two or
more computers (commonly known as nodes
or members) that work together to perform a
task.

• In such a setup, only one node provides the
service, with the secondary node(s) taking over
if it fails.
Cluster Types
• Storage: provide a consistent file system
image across servers in a cluster, allowing the
servers to simultaneously read and write to a
single shared file system.
• High Availability: eliminate single points of
failure by failing over services from one
cluster node to another when a node
becomes inoperative.
Cluster Types
• Load Balancing: dispatch network service
requests to multiple cluster nodes to balance
the request load among the cluster nodes.
• High Performance: carry out parallel or
concurrent processing, thus helping to
improve performance of applications.
Cluster Types

• Another widely used solution to providing HA
is replication (specifically data replication).
• Replication is the process by which one or
more (secondary) databases can be kept in
sync with a single primary (or master)
database.
Cluster Setup
• To setup a cluster, we need at least two
servers. For the purpose of this demo, we will
use two Linux servers:

Node1: 192.168.1.10
Node2: 192.168.1.11
Cluster Setup
• We will demonstrate the basics of how to
deploy, configure and maintain high
availability/clustering in Ubuntu 16.04/18.04
and CentOS 7.
• We will demonstrate how to add Nginx HTTP
service to the cluster.
Cluster Setup
Configuring Local DNS Settings on Each Server

In order for the two servers to communicate with
each other, we need to configure the
appropriate local DNS settings in the
/etc/hosts file on both servers.
Cluster Setup

$ sudo vi /etc/hosts

Add the following entries with actual IP addresses


of your servers.

192.168.1.10 node1.example.com
192.168.1.11 node2.example.com

Save the changes and close the file.


Cluster Setup
Installing Nginx Web Server

Now install Nginx web server using the following


commands.

$ sudo apt install nginx [On Ubuntu]


$ sudo yum install epel-release [On CentOS 7]
$ sudo yum install nginx [On CentOS 7]
Cluster Setup
• Once the installation is complete, start the
Nginx service for now and enable it to auto-
start at boot time, then check if it’s up and
running using the systemctl command.
• On Ubuntu, the service should be started
automatically immediately after package pre-
configuration is complete; you can simply
enable it.
Cluster Setup

$ sudo systemctl enable nginx


$ sudo systemctl start nginx
$ sudo systemctl status nginx
Cluster Setup
• After starting the Nginx service, we need to
create custom webpages for identifying and
testing operations on both servers.

• We will modify the contents of the default


Nginx index page as shown.
Cluster Setup
$ echo "This is the default page for
node1.example.com" | sudo tee
/usr/share/nginx/html/index.html #VPS1
$ echo "This is the default page for
node2.example.com" | sudo tee
/usr/share/nginx/html/index.html #VPS2
Installing and Configuring Corosync
and Pacemaker

Next, we have to install Pacemaker, Corosync,


and Pcs on each node as follows.

$ sudo apt install corosync pacemaker pcs    #Ubuntu
$ sudo yum install corosync pacemaker pcs    #CentOS
Installing and Configuring Corosync
and Pacemaker
Once the installation is complete, make sure
that pcs daemon is running on both servers.

$ sudo systemctl enable pcsd


$ sudo systemctl start pcsd
$ sudo systemctl status pcsd
Creating the Cluster
• During the installation, a system user called
“hacluster” is created.
• So we need to set up the authentication
needed for pcs. Let’s start by creating a new
password for the “hacluster” user, we need to
use the same password on all servers:

$ sudo passwd hacluster


Creating the Cluster
• Next, on one of the servers (Node1), run the
following command to set up the
authentication needed for pcs.

• $ sudo pcs cluster auth node1.example.com


node2.example.com -u hacluster -p
password_here --force
Creating the Cluster
• Now create a cluster and populate it with
some nodes (the cluster name cannot exceed
15 characters, in this example, we have used
examplecluster) on Node1 server.

• $ sudo pcs cluster setup --name


examplecluster node1.example.com
node2.example.com
Creating the Cluster
Now enable the cluster on boot and start the
service.

$ sudo pcs cluster enable --all


$ sudo pcs cluster start --all
Creating the Cluster
Now check if the cluster service is up and
running using the following command.

$ sudo pcs status


OR
$ sudo crm_mon -1
Creating the Cluster
• From the output of the above command, you
can see that there is a warning about no
STONITH devices, yet the STONITH is still
enabled in the cluster.
• In addition, no cluster resources/services have
been configured.
Configuring Cluster Options
• The first option is to disable STONITH (or
Shoot The Other Node In The Head), the
fencing implementation on Pacemaker.

• This component helps to protect your data


from being corrupted by concurrent access.
For the purpose of this guide, we will disable it
since we have not configured any devices.
Configuring Cluster Options
To turn off STONITH, run the following command:

$ sudo pcs property set stonith-enabled=false

Next, also ignore the Quorum policy by running the


following command:

$ sudo pcs property set no-quorum-policy=ignore


Configuring Cluster Options
After setting the above options, run the
following command to see the property list
and ensure that the above options, stonith
and the quorum policy are disabled.

$ sudo pcs property list


Adding a Resource/Cluster Service
• We will look at how to add a cluster resource.
We will configure a floating IP which is the IP
address that can be instantly moved from one
server to another within the same network or
data center.
• In short, a floating IP is a common technical
term used for IPs which are not bound strictly
to one single interface.
Adding a Resource/Cluster Service
• In this case, it will be used to support failover in a
high-availability cluster.

• Keep in mind that floating IPs aren’t just for


failover situations, they have a few other use
cases.

• We need to configure the cluster in such a way


that only the active member of the cluster
“owns” or responds to the floating IP at any given
time.
Adding a Resource/Cluster Service
We will add two cluster resources: the floating IP address
resource called “floating_ip” and a resource for the
Nginx web server called “http_server”.

First start by adding the floating_ip as follows. In this


example, our floating IP address is 192.168.1.20.

$ sudo pcs resource create floating_ip


ocf:heartbeat:IPaddr2 ip=192.168.1.20
cidr_netmask=24 op monitor interval=60s
Adding a Resource/Cluster Service
where:

floating_ip: is the name of the service.


“ocf:heartbeat:IPaddr2”: tells Pacemaker which
script to use (IPaddr2 in this case), which
namespace it is in (heartbeat), and what
standard it conforms to (ocf).
“op monitor interval=60s”: instructs Pacemaker
to check the health of this service every 60
seconds by calling the agent’s monitor action.
Adding a Resource/Cluster Service
Then add the second resource, named
http_server.
Here, resource agent of the service is
ocf:heartbeat:nginx.

$ sudo pcs resource create http_server


ocf:heartbeat:nginx
configfile="/etc/nginx/nginx.conf" op monitor
timeout="20s" interval="60s"
Adding a Resource/Cluster Service
Once you have added the cluster services, issue the
following command to check the status of
resources.

$ sudo pcs status resources

Looking at the output of the command, the two


added resources: “floating_ip” and “http_server”
have been listed. The floating_ip service is off
because the primary node is in operation.
Adding a Resource/Cluster Service
• If you have firewall enabled on your system,
you need to allow all traffic to Nginx and all
high availability services through the firewall
for proper communication between nodes:
Adding a Resource/Cluster Service
-------------- CentOS 7 --------------
$ sudo firewall-cmd --permanent --add-service=http
$ sudo firewall-cmd --permanent --add-service=high-availability
$ sudo firewall-cmd --reload
Adding a Resource/Cluster Service
-------------- Ubuntu --------------
$ sudo ufw allow http
$ sudo ufw allow high-availability

$ sudo ufw reload


Testing High Availability/Clustering
• The final and important step is to test that our
high availability setup works.
• Open a web browser and navigate to the
address 192.168.1.20 you should see the
default Nginx page from the
node2.example.com.
Testing High Availability/Clustering
To simulate a failure, run the following
command to stop the cluster on
node2.example.com.

$ sudo pcs cluster stop node2.example.com

Then reload the page at 192.168.1.20, you


should now access the default Nginx web page
from the node1.example.com.
Testing High Availability/Clustering
Alternatively, you can simulate an error by
telling the service to stop directly, without
stopping the cluster on any node, using
the following command on one of the nodes:

$ sudo crm_resource --resource http_server --force-stop
Testing High Availability/Clustering
• Then you need to run crm_mon in interactive
mode (the default); within the resource monitor
interval (60 seconds in this example), you should
be able to see the cluster notice that http_server
failed and move it to another node.
Cluster quorum disk
• A cluster quorum disk is the storage medium
on which the configuration database is stored
for a cluster computing network.
• The cluster configuration database, also called
the quorum, tells the cluster which physical
server(s) should be active at any given time.
• The quorum disk comprises a shared block
device that allows concurrent read/write
access by all nodes in a cluster.
Cluster quorum disk
• In networking, clustering is the use of multiple
servers (computers) to form what appears to
users as a single highly available system.
• A Web page request is sent to a "manager"
server, which then determines which of several
other servers to forward the request for handling.
• Cluster computing is used to load-balance the
traffic on high-traffic Web sites.
• Load balancing involves dividing the work up
among multiple servers so that users get served
faster.
Cluster quorum disk
• Although clusters comprise multiple servers, users or other
computers see any given cluster as a single virtual server.
• The physical servers themselves are called cluster nodes.
• The quorum tells the cluster which node should be active at
any given time, and intervenes if communications fail
between cluster nodes by determining which set of nodes
gets to run the application at hand.
• The set of nodes with the quorum keeps running the
application, while the other set of nodes is taken out of
service.
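On a Pacemaker/Corosync cluster such as the one configured earlier, the quorum state can be inspected directly; this is only an illustrative check, not a quorum-disk setup procedure:

$ sudo corosync-quorumtool -s   # shows expected votes, total votes and whether the cluster is quorate
$ sudo pcs status               # the cluster status output also reports partition/quorum information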
Redundant Network Connectivity
Network Redundancy

• Network redundancy is a process through which


additional or alternate instances of network
devices, equipment and communication mediums
are installed within network infrastructure.
• It is a method for ensuring network availability in
case of a network device or path failure and
unavailability. As such, it provides a means of
network failover.
Network Redundancy

• Network redundancy is a simple concept to


understand.
• If you have a single point of failure and it fails
you, then you have nothing to rely on.
• If you put in a secondary (or tertiary) method
of access, then when the main connection
goes down, you will have a way to connect to
resources and keep the business operational.
Network Redundancy
• Network redundancy is primarily implemented
in enterprise network infrastructure to
provide a redundant source of network
communications.
• It serves as a backup mechanism for quickly
swapping network operations onto redundant
infrastructure in the event of unplanned
network outages.
Network Redundancy

• Typically, network redundancy is achieved


through the addition of alternate network
paths, which are implemented through
redundant standby routers and switches.
• When the primary path is unavailable, the
alternate path can be instantly deployed to
ensure minimal downtime and continuity of
network services.
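As an illustrative sketch of redundant paths at the host level, two NICs can be bonded in active-backup mode with NetworkManager; the interface names eth1 and eth2 below are placeholders for your actual devices:

$ sudo nmcli connection add type bond con-name bond0 ifname bond0 mode active-backup
$ sudo nmcli connection add type bond-slave ifname eth1 master bond0
$ sudo nmcli connection add type bond-slave ifname eth2 master bond0
$ sudo nmcli connection up bond0    # if eth1's link fails, traffic fails over to eth2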
Active-Active high availability cluster
• An active-active cluster is typically made up of at
least two nodes, both actively running the same
kind of service simultaneously.
• The main purpose of an active-active cluster is to
achieve load balancing.
• Load balancing distributes workloads across all
nodes in order to prevent any single node from
getting overloaded.
• Because there are more nodes available to serve,
there will also be a marked improvement in
throughput and response times.
Active-Active high availability cluster
• For example, in a set up, which consists of a
load balancer and two (2) HTTP servers (i.e. 2
nodes), is an example of this type of HA
cluster configuration.
• Instead of connecting directly to an HTTP
server, web clients go through the load
balancer, which in turn connects each client to
any of the HTTP servers behind it.
Active-Active high availability cluster
• Assigning of clients to the nodes in the cluster
isn't an arbitrary process.
• Rather, it's based on what ever load balancing
algorithm is set on the load balancer.
• So, for example in a "Round Robin" algorithm,
the first client to connect is sent to the 1st
server, the second client to the 2nd server, the
3rd client back to the 1st server, the 4th client
back to the 2nd server, and so on.
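A minimal sketch of such a round-robin setup, assuming the load balancer itself runs Nginx and fronts the two backend nodes from the earlier cluster example (any existing default server block listening on port 80 may need to be adjusted first):

$ sudo tee /etc/nginx/conf.d/load-balancer.conf > /dev/null <<'EOF'
upstream backend {
    # round robin is the default balancing algorithm
    server 192.168.1.10;
    server 192.168.1.11;
}
server {
    listen 80;
    location / {
        proxy_pass http://backend;
    }
}
EOF
$ sudo nginx -t && sudo systemctl reload nginx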
Active-Active high availability cluster
• In order for the high availability cluster to operate
seamlessly, it's recommended that the two nodes
be configured for redundancy. In other words,
their individual configurations/settings must be
virtually identical.

• Another thing to bear in mind is that a cluster like


this works best when the nodes store files in a
shared storage like a NAS.
Active-Passive high availability cluster
• Like the active-active configuration, active-
passive also consists of at least two nodes.
• However, as the name "active-passive" implies,
not all nodes are going to be active.
• In the case of two nodes, for example, if the first
node is already active, the second node must be
passive or on standby.
• The passive (a.k.a. failover) server serves as a
backup that's ready to take over as soon as the
active (a.k.a. primary) server gets disconnected or
is unable to serve.
Active-Passive high availability cluster
• When clients connect to a 2-node cluster in active-
passive configuration, they only connect to one server.
• In other words, all clients will connect to the same
server.
• Like in the active-active configuration, it's important
that the two servers have exactly the same settings
(i.e., redundant).
• If changes are made on the settings of the primary
server, those changes must be cascaded to the failover
server.
• So when the failover does take over, the clients won't
be able to tell the difference.
Active-Passive high availability cluster
• High-Availability cluster aka Failover-cluster
(active-passive cluster) is one of the most widely
used cluster types in the production
environment.
• This type of cluster provides you the continued
availability of services even one of the node from
the group of computer fails.
• If the server running application has failed for
some reason (hardware failure), cluster software
(pacemaker) will restart the application on
another node.
Active-Passive high availability cluster
• Mostly in production, you can find this type of
cluster is mainly used for databases, custom
application and also for file sharing.
• Fail-over is not just starting an application; it has
some series of operations associated with it; like
mounting filesystems, configuring networks and
starting dependent applications.

• CentOS 7 / RHEL 7 supports failover clustering
using Pacemaker.
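For the Pacemaker resources created in the earlier cluster example, colocation and ordering constraints are what keep the floating IP and the web service moving together during a failover; a minimal sketch:

$ sudo pcs constraint colocation add http_server with floating_ip INFINITY   # run both resources on the same node
$ sudo pcs constraint order floating_ip then http_server                     # bring up the IP before the web server
$ sudo pcs constraint show                                                    # review the configured constraints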
N+1 redundancy
What does N+1 redundancy mean?

N+1 redundancy is a formula meant to express a


form of resilience used to ensure system
availability in the event of a component
failure.
The formula suggests that components (N) have
at least one independent backup component
(+1).
N+1 redundancy
• The "N" can refer to many different
components that make up a data center
infrastructure, including servers, hard disks,
power supplies, switches, routers and cooling
units.
• The level of resilience is referred to as
active/passive or standby, as backup
components do not actively participate within
the system during normal operation.
N+1 redundancy
• It is also possible to have N+1 redundancy with
active/active components. In such cases, the
backup component remains active in the
operation even if all other components are fully
functional.
• In the event that one component fails, however,
the system will be able to perform. An
active/active approach is considered superior in
terms of performance and resiliency.
N+1 redundancy
• Very simply, it means that anything in your data
center can be shut down and kept down for a
period of time, without directly affecting ongoing
processing.
• If everything has been well maintained, the
statistical chance of a simultaneous failure
practically drops off the charts.
• Further, to actually bring down a facility that is
designed and installed for concurrent
maintainability most likely takes not two, but at
least three sequential events.
N-to-1 configuration
• An N-to-1 configuration is based on the concept that
multiple, simultaneous server failures are unlikely;
therefore, a single redundant server can protect
multiple active servers.
• When a server fails, its applications move to the
redundant server.
• For example, in a 4-to-1 configuration, one server can
protect four servers.
• This configuration reduces redundancy cost at the
server level from 100 percent to 25 percent. In this
configuration, a dedicated, redundant server is cabled
to all storage and acts as a spare when a failure occurs.
N-to-1 configuration
• An N-to-1 failover configuration reduces the
cost of hardware redundancy and still
provides a potential, dedicated spare.
• By contrast, in a simple asymmetric (one-to-one)
configuration there is no performance penalty
and no issues with multiple applications
running on the same system; however, the
drawback there is the 100 percent redundancy
cost at the server level.
N-to-1 configuration
• The problem with this design is the issue of
failback.
• When the failed server is repaired, you must
manually fail back all services that are hosted
on the failover server to the original server.
• The failback action frees the spare server and
restores redundancy to the cluster.
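If the services were managed by Pacemaker as in the earlier unit, a manual failback might look like the following sketch (the resource and node names are the hypothetical ones used earlier):

$ sudo pcs resource move http_server node1.example.com   # fail the service back to the repaired node
$ sudo pcs resource clear http_server                     # remove the location constraint that "move" creates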
N-to-1 configuration
• Most shortcomings of early N-to-1 cluster
configurations are caused by the limitations of
storage architecture.
• Typically, it is impossible to connect more than
two hosts to a storage array without complex
cabling schemes and their inherent reliability
problems, or expensive arrays with multiple
controller ports.
Innovation Centre for Education

Hardware
Performance
Tuning

1
Innovation Centre for Education

Introduction
The server is the heart of the entire network operation. The performance of the server is a
critical factor in the efficiency of the overall network, and it affects all users. Although simply
replacing the entire server with a newer and faster one might be an alternative, it is often more
appropriate to replace or to add only to those components that need it and to leave the other
components alone. Often, poor performance is due to bottlenecks in individual hardware
subsystems, an incorrectly configured operating system, or a poorly tuned application. The
proper tools can help you to diagnose these bottlenecks and removing them can help improve
performance.
For example, adding more memory or using the correct device driver can improve performance
significantly. Sometimes, however, the hardware or software might not be the direct cause of the
poor performance. Instead, the cause might be the way in which the server is configured. In this
case, reconfiguring the server to suit the current needs might also lead to a considerable
performance improvement.
In this unit we study processors, memory, and storage in detail, and also cover power
management and heat dissipation as they relate to performance.

2
Innovation Centre for Education

CPU or Processor
The central processing unit (CPU or processor) is the key component of any computer system.
In this chapter, we cover several different CPU architectures from Intel (IA32, Intel 64
Technology, and IA64) and AMD (AMD64), and outline their main performance characteristics.
Processor technology
The central processing unit (CPU) has outperformed all other computer subsystems in its
evolution. Thus, most of the time, other subsystems such as disk or memory will impose a
bottleneck upon your application (unless pure number crunching or complex application
processing is the desired task).
Understanding the functioning of a processor in itself is already quite a difficult task, but today IT
professionals are faced with multiple and often very different CPU architectures.
Comparing different processors is no longer a matter of looking at the CPU clock rate, but is
instead a matter of understanding which CPU architecture is best suited for handling which kind
of workload. Also, 64-bit computing has finally moved from high-end UNIX® and mainframe
systems to the Intel-compatible arena, and has become yet another new technology to be
understood.
The Intel-compatible microprocessor has evolved from the first 4004 4-bit CPU, produced in
1971, to the current line of Xeon and Core processors. AMD offers the world’s first IA32
compatible 64-bit processor. Our overview of processors begins with the current line of Intel
Xeon CPUs, followed by the AMD Opteron and the Intel Core™ Architecture. For the sake of
simplicity, we do not explore earlier processors.
3
Innovation Centre for Education

Intel Xeon processors


Moore’s Law states that the number of transistors on a chip doubles about every two years.
Similarly, as transistors have become smaller, the frequency of the processors has increased,
which is generally equated with performance.
However, around 2003, physics started to limit advances in obtainable clock frequency.
Transistor sizes have become so small that electron leakage through transistors has begun to
occur. Those electron leaks result in large power consumption and substantial extra heat, and
could even result in data loss. In addition, cooling processors at higher frequencies by using
traditional air cooling methods has become too expensive.
This is why the material that comprises the dielectric in the transistors has become a major
limiting factor in the frequencies that are obtainable. Manufacturing advances have continued to
enable a higher per-die transistor count, but have only been able to obtain about a 10%
frequency improvement per year. For that reason, processor vendors are now placing more
processors on the die to offset the inability to increase frequency. Multi-core processors provide
the ability to increase performance with lower power consumption.
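On a Linux server, the resulting socket and core counts and clock rates can be checked quickly; a small illustrative example using standard tools:

$ lscpu | grep -E 'Model name|Socket|Core|Thread|MHz'   # sockets, cores per socket, threads per core, clock rate
$ grep -c ^processor /proc/cpuinfo                      # total logical CPUs visible to the operating system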

4
Innovation Centre for Education

Dual-core processors
Intel released its first dual-core Xeon processors in October 2005. Dual-core processors are two
separate physical processors combined onto a single processor socket. Dual-core processors
provide twice as many cores, but each core runs at a lower clock speed than an equivalent
single-core chip to reduce waste heat. Waste heat is the heat that is produced from
electron leaks in the transistors.
Recent dual-core Xeon processor models that are available in System x servers:
Xeon 7100 Series MP processor (Tulsa)

5
Innovation Centre for Education

Xeon 5100 Series DP processor (Woodcrest)


The Woodcrest processor is the first Xeon DP processor that uses the Intel Core
microarchitecture instead of the Netburst microarchitecture. Intel core microarchitecture is
explained in detail in the following sections.

Processor Speed L2 cache Front-side bus Power (TDP)


Xeon 5110 1.6 GHz 4 MB 1066 MHz 65 W
Xeon 5120 1.86 GHz 4 MB 1066 MHz 65 W
Xeon 5130 2.00 GHz 4 MB 1333 MHz 65 W
Xeon 5140 2.33 GHz 4 MB 1333 MHz 65 W
Xeon 5148 LV 2.33 GHz 4 MB 1333 MHz 40 W
Xeon 5150 2.66 GHz 4 MB 1333 MHz 65 W
Xeon 5160 3.0 GHz 4 MB 1333 MHz 80 W
Table - Woodcrest processor models

6
Innovation Centre for Education

Xeon 5200 Series DP processor (Wolfdale)


The Wolfdale dual-core processor is based on the new 45 nm manufacturing process. It
features SSE4, which provides expanded computing capabilities over its predecessor. With the
Intel Core microarchitecture and Intel EM64T, the Wolfdale delivers superior performance and
energy efficiency to a broad range of 32-bit and 64-bit applications.

Processor model  Speed  L2 cache  Front-side bus  Power (TDP)
E5205 1.86 GHz 6 MB 1066 MHz 65 W
L5238 2.66 GHz 6 MB 1333 MHz 35 W
X5260 3.33 GHz 6 MB 1333 MHz 80 W
X5272 3.40 GHz 6 MB 1600 MHz 80 W
X5270 3.50 GHz 6 MB 1333 MHz 80 W
Table- Wolfdale processor models

7
Innovation Centre for Education

Xeon 7200 Series MP Processor (Tigerton)


Tigerton comes with 7200 series 2-core and 7300 series 4-core options. For detailed
information about this topic, refer to the Xeon 7300 Tigerton discussion in the quad-core
section below.
Xeon 5500 (Gainestown)
The Intel 5500 series, with Intel Nehalem Microarchitecture, brings together a number of
advanced technologies for energy efficiency, virtualization, and intelligent performance.
Integrated with Intel QuickPath Technology, Intelligent Power Technology and Intel Virtualization
Technology, the Gainestown is available with a range of features for different computing
demands. Most of the Gainestown models are quad-core. Refer to the quad-core section below
for specific models.

8
Innovation Centre for Education

Quad-core processors
Quad-core processors differ from single-core and dual-core processors by providing four
independent execution cores. Although some execution resources are shared, each logical
processor has its own architecture state with its own set of general purpose registers and
control registers to provide increased system responsiveness. Each core runs at the same clock
speed.
Intel quad-core and six-core processors include the following:
Xeon 5300 Series DP processor (Clovertown)
The Clovertown processor is a quad-core design that is actually made up of two Woodcrest dies
in a single package. Each Woodcrest die has 4 MB of L2 cache, so the total L2 cache in
Clovertown is 8 MB.
The Clovertown processors are also based on the Intel Core microarchitecture as described in
the following Intel core microarchitecture topic.
Processor models available include the E5310, E5320, E5335, E5345, and E5355. The
processor front-side bus operates at either 1066 MHz (processor models ending in 0) or 1333
MHz (processor models ending in 5). For specifics, see the table below. None of these
processors support Hyper-Threading.

9
Innovation Centre for Education

In addition to the features of the Intel Core microarchitecture, the features of the Clovertown
processor include:
Intel Virtualization Technology - processor hardware enhancements that support software-
based virtualization.
Intel 64 Architecture (EM64T) - support for both 64-bit and 32-bit applications.
Demand-Based Switching (DBS) - technology that enables hardware and software power
management features to lower average power consumption of the processor while maintaining
application performance.
Intel I/O Acceleration Technology (I/OAT) - reduces processor bottlenecks by offloading
network-related work from the processor.

Processor model  Speed  L2 cache  Front-side bus  Power (TDP)  Demand-based switching
E5310 1.60 GHz 8 MB 1066 MHz 80 W No
E5320 1.86 GHz 8 MB 1066 MHz 80 W Yes
E5335 2.00 GHz 8 MB 1333 MHz 80 W No
E5345 2.33 GHz 8 MB 1333 MHz 80 W Yes
E5355 2.66 GHz 8 MB 1333 MHz 120 W Yes
Table - Clovertown processor models

10
Innovation Centre for Education

Xeon 7300 Series MP processor (Tigerton)


The Xeon 7300 Tigerton processor is the first quad-core processor that Intel offers for the multi-
processor server (7xxx series) platform. It is based on the Intel Core microarchitecture
described in the following section, “Intel Core microarchitecture”. The Tigerton comes with 2-core and
4-core options. Xeon 7200 Tigerton dual-core processors are a concept similar to a two-way
SMP system except that the two processors, or cores, are integrated into one silicon die. This
brings the benefits of two-way SMP with lower software licensing costs for application software
that licenses per CPU socket. It also brings the additional benefit of less processor power
consumption and faster data throughput between the two cores. To keep power consumption
down, the resulting core frequency is lower, but the additional processing capacity means an
overall gain in performance.
Each Tigerton core has separate L1 instruction and data caches, as well as separate execution
units (integer, floating point, and so on), registers, issue ports, and pipelines for each core. A
multi-core processor achieves more parallelism than Hyper-Threading technology because
these resources are not shared between the two cores.

11
Innovation Centre for Education

The Tigerton processor series is available in a range of features to match different computing
demands. Up to 8 MB L2 Cache and 1066 MHz front-side bus frequency are supported, as
listed in the table below. All processors integrate the APIC Task Priority Register, which is a
new Intel VT extension that improves interrupt handling and further optimizes virtualization
software efficiency.
For specific processor SKUs, see the below Table.

Processor model  Speed  L2 Cache  Front-side bus  Power (TDP)
E7310 1.60 GHz 4 MB 1066 MHz 80 W
L7345 1.86 GHz 8 MB 1066 MHz 80 W
E7320 2.13 GHz 4 MB 1066 MHz 80 W
E7330 2.40 GHz 6 MB 1066 MHz 80 W
X7350 2.93 GHz 8 MB 1066 MHz 130 W
Table - Tigerton processor models

12
Innovation Centre for Education

Xeon 5400 Series DP processor (Harpertown)


The Harpertown is based on the new 45 nm manufacturing process and features up to a 1600
MHz front-side bus clock rate and 12 MB L2 cache. Harpertown includes new Intel Streaming
SIMD Extensions 4 (SSE4) instructions, thus providing building blocks for delivering expanded
capabilities for many applications. The new 45nm enhanced Intel Core microarchitecture
delivers more performance per watt in the same platforms.
For specific processor SKUs, see below table.

Processor model  Speed  L2 Cache  Front-side bus  Power (TDP)
E5405 2.00 GHz 12 MB 1333 MHz 80 W
E5420 2.50 GHz 12 MB 1333 MHz 80 W
L5430 2.66 GHz 12 MB 1333 MHz 50 W
E5440 2.83 GHz 12 MB 1333 MHz 80 W
X5450 3.00 GHz 12 MB 1333 MHz 120 W
X5460 3.16 GHz 12 MB 1333 MHz 120 W
X5482 3.20 GHz 12 MB 1600 MHz 150 W
Table - Harpertown processor models

13
Innovation Centre for Education

Xeon 7400 Series MP Processor (Dunnington)


Dunnington comes with 4-core and 6-core options. For more detailed information about this
topic, refer to the six-core processor section below.
Xeon 5500 Series DP processor (Gainestown)
The Intel 5500 series is based on the Nehalem microarchitecture. As the follow-on to the
5200/5400 (Wolfdale/ Harpertown), the Gainestown offers the models listed in the below table.
Processor model  Speed  L3 cache  QPI link speed (gigatransfers/s)  Power (TDP)
W5580 3.20 GHz 8 MB 6.4 GT/s 130 W
X5570 2.93 GHz 8 MB 6.4 GT/s 95 W
X5560 2.80 GHz 8 MB 6.4 GT/s 95 W
X5550 2.66 GHz 8 MB 6.4 GT/s 95 W
E5540 2.53 GHz 8 MB 5.86 GT/s 80 W
E5530 2.40 GHz 8 MB 5.86 GT/s 80 W
L5520 2.26 GHz 8 MB 5.86 GT/s 60 W
E5520 2.26 GHz 8 MB 5.86 GT/s 80 W
L5506 2.13 GHz 4 MB 4.8 GT/s 60 W
E5506 2.13 GHz 4 MB 4.8 GT/s 80 W
E5504 2.00 GHz 4 MB 4.8 GT/s 80 W
E5502 1.86 GHz 4 MB 4.8 GT/s 80 W
Table- Gainestown processor models

14
Innovation Centre for Education

Six-core processors
Six-core processors extend the quad-core paradigm by providing six independent execution
cores.
Xeon 7400 Series MP processor (Dunnington)
With enhanced 45 nm process technology, the Xeon 7400 series Dunnington processor features
a single-die 6-core design with 16 MB of L3 cache. Both 4-core and 6-core models of
Dunnington have shared L2 cache between each pair of cores. They also have a shared L3
cache across all cores of the processor. A larger L3 cache increases efficiency of cache-to-core
data transfers and maximizes main memory-to-processor bandwidth. Specifically built for
virtualization, this processor comes with enhanced Intel VT, which greatly optimizes
virtualization software efficiency.

15
Innovation Centre for Education

As the figure below illustrates, Dunnington processors have three levels of cache on the processor die:
 L1 cache
– The L1 cache in each core, 32 KB for instructions and 32 KB for data, is used to store
micro-operations (decoded executable machine instructions). It serves those to the processor at rated
speed. This additional level of cache saves decode time on cache hits.
 L2 cache
– Each pair of cores in the processor has 3 MB of shared L2 cache, for a total of 6 MB (4-core) or 9 MB (6-core) of L2
cache. The L2 cache implements the Advanced Transfer Cache technology.
 L3 cache
– The Dunnington processors have 12 MB (4-core) or 16 MB (6-core) of shared L3 cache.

Figure - Block diagram of Dunnington 6-core processor package


16
Innovation Centre for Education

Dunnington comes with 4-core and 6-core options; for specific models, see the table below

Processor model  Cores per processor  Speed  L3 Cache  Front-side bus  Power (TDP)
L7445 4 2.13 GHz 12 MB 1066 MHz 50 W
E7420 4 2.13 GHz 8 MB 1066 MHz 90 W
E7430 4 2.13 GHz 12 MB 1066 MHz 90 W
E7440 4 2.40 GHz 16 MB 1066 MHz 90 W
L7455 6 2.13 GHz 12 MB 1066 MHz 65 W
E7450 6 2.40 GHz 12 MB 1066 MHz 90 W
X7460 6 2.66 GHz 16 MB 1066 MHz 130 W
Table - Dunnington processor models (quad-core and 6-core)

17
Innovation Centre for Education

Intel Core microarchitecture


The Intel Core microarchitecture is based on a combination of the energy-efficient Pentium® M
microarchitecture found in mobile computers, and the current Netburst microarchitecture that is
the basis for the majority of the Xeon server processors. The Woodcrest processor is the first
processor to implement the Core microarchitecture.
The key features of the Core microarchitecture include:
 Intel Wide Dynamic Execution
The Core microarchitecture is able to fetch, decode, queue, execute, and retire up to four
instructions simultaneously in the pipeline. The previous Netburst Microarchitecture was only
able to run three instructions simultaneously in the pipeline. The throughput is improved
effectively by processing more instructions in the same amount of time.
In addition, certain individual instructions are able to be combined into a single instruction in a
technique known as macrofusion. By being combined, more instructions can fit within the
pipeline. To use the greater pipeline throughput, the processors have more accurate branch
prediction technologies and larger buffers, thus providing less possibility of pipeline stalls and
more efficient use of the processor.

18
Innovation Centre for Education

 Intel Intelligent Power Capability


The advantage of the Pentium M™ microarchitecture is the ability to use less power and enable
longer battery life in mobile computers. Similar technology has been improved and modified and
added into the Core microarchitecture for high-end computer servers.
The Intelligent Power Capability provides fine-grained power control that enables sections of the
processor that are not in use to be powered down. Additional logic is included in components
such as the ALU unit, FP unit, cache logic, and bus logic that improves power consumption on
almost an instruction-by-instruction basis. Processor components can then be powered on the
instant they are needed to process an instruction, with minimal lag time so that performance is
not jeopardized. Most importantly, the actual power utilization is substantially lower with the Core
microarchitecture due to this additional power control capability.

19
Innovation Centre for Education

 Intel Advanced Smart Cache


The L2 cache in the Core microarchitecture is shared between cores instead of each core using a separate L2.
The below Figure illustrates the difference in the cache between the traditional Xeon with Netburst
microarchitecture and the Intel Core microarchitecture.
The front-side bus utilization would be lower, similar to the Tulsa L3 shared cache as discussed in “Xeon 7100
Series MP processor (Tulsa)”. With a dual-core processor, the Core microarchitecture allows for one, single
core to use the entire shared L2 cache if the second core is powered down for power-saving purposes. As the
second core begins to ramp up and use memory, it will allocate the L2 memory away from the first CPU until it
reaches a balanced state where there is equal use between cores.
Single core performance benchmarks, such as SPEC Int and SPEC FP, benefit from this architecture because
single core applications are able to allocate and use the entire L2 cache. SPEC Rate benchmarks balance the
traffic between the cores and more effectively balance the L2 caches.

Figure - Intel Xeon versus Intel Core Architecture


Innovation Centre for Education

 Intel Smart Memory Access


Intel Smart Memory Access allows for additional technology in the processor to prefetch more
often, which can assist performance. Previously, with the NetBurst architecture, when a write
operation appeared in the pipeline, all subsequent read operations would stall until the write
operation completed. In that case, prefetching would halt, waiting for the write operation to
complete, and the pipeline would fill with nop or stall commands instead of productive
commands. Instead of making forward progress, the processor would make no progress until
the write operation completed.
Using Intel’s memory disambiguation technologies, additional load-to-memory operations are
executed prior to a store operation completing. If the load instruction turns out to be incorrect,
the processor is able to back out the load instruction and all dependent instructions that might
have executed based on that load. However, if the load instruction turns out to be valid, the
processor has spent less time waiting and more time executing, which improves the overall
instruction level parallelism.
 Intel Advanced Digital Media Boost
Advanced Digital Media Boost increased the execution of SSE instructions from 64 bits to 128
bits. SSE instructions are Streaming Single Instruction Multiple Data Instructions that
characterize large blocks of graphics or high bandwidth data applications. Previously, these
Instructions would need to be broken into two 64-bit chunks in the execution stage of the
processor, but now support has been included to have those 128-bit instructions executed at
one per clock cycle.
Innovation Centre for Education

Intel Nehalem microarchitecture


The new Nehalem is built on the Core microarchitecture and incorporates significant system
architecture and memory hierarchy changes. Being scalable from 2 to 8 cores, the Nehalem
includes the following features:
 QuickPath Interconnect (QPI)
QPI, previously named Common System Interface or CSI, acts as a high-speed
interconnection between the processor and the rest of the system components, including
system memory, various I/O devices, and other processors. A significant benefit of the QPI
is that it is point-to-point. As a result, no longer is there a single bus that all processors must
connect to and compete for in order to reach memory and I/O. This improves scalability and
eliminates the competition between processors for bus bandwidth.
Each QPI link is a point-to-point, bi-directional interconnection that supports up to 6.4 GTps
(giga transfers per second). Each link is 20 bits wide using differential signaling, thus
providing bandwidth of 16 GBps. The QPI link carries QPI packages that are 80 bits wide,
with 64 bits for data and the remainder used for communication overhead. This gives a
64/80 rate for valid data transfer; thus, each QPI link essentially provides bandwidth of 12.8
GBps, and equates to a total of 25.6 GBps for bi-directional interconnection.
Innovation Centre for Education

 Integrated Memory Controller


The Nehalem replaces front-side bus memory access with an integrated memory controller
and QPI. Unlike the older front-side bus memory access scheme, this is the Intel approach
to implementing the scalable shared memory architecture known as non-uniform memory
architecture (NUMA) that we introduced with the IBM eX4 “Hurricane 4” memory controller
on the System x3950 M2.
As a part of Nehalem’s scalable shared memory architecture, Intel integrated the memory
controller into each processor’s silicon die. With that, the system can provide independent
high bandwidth, low latency local memory access, as well as scalable memory bandwidth as
the number of processors is increased. In addition, by using Intel QPI, the memory
controllers can also enjoy fast efficient access to remote memory controllers.
Innovation Centre for Education

Compared to its predecessor, Nehalem’s cache hierarchy extends to three levels, as shown in the
figure below. The first two levels are dedicated to individual cores and stay relatively small. The
third level cache is much larger and is shared among all cores.

Figure - Intel Nehalem versus Intel Core


Innovation Centre for Education

AMD Opteron processors


The first Opteron processor was introduced in 2003 after a major effort by AMD to deliver the first 64-bit
processor capable of running in 64-bit mode and running existing 32-bit software as fast as or faster than
current 32-bit Intel processors. Opteron was designed from the start to be an Intel-compatible processor, but
AMD also wanted to introduce advanced technology that would set Opteron apart from processors developed
by Intel.
The Opteron processor has a physical address limit of 40-bit addressing (meaning that up to 1 TB of memory
can be addressed) and the ability to be configured in 2-socket and 4-socket multi-processor configurations.
Apart from improved scalability, there was another significant feature introduced with the Opteron CPU, namely
the 64-bit addressing capability of the processor. With the Itanium processor family, Intel introduced a
completely new 64-bit architecture that was incompatible with existing software. AMD decided there was
significant market opportunity for running existing software and so upgraded the existing x86 architecture to
a 64-bit architecture.
As with the transitions that the x86 architecture underwent in years past (the move to 16-bit with the
8086 and the move to 32-bit with the 80386), the new AMD64 architecture simply expanded the
IA32 architecture by adding support for new 64-bit addressing, registers, and instructions.
The advantage of this approach is the ability to move to 64-bit without having to perform a major rewrite of all operating system
and application software. The Opteron CPU has three distinct operation modes that enable the CPU to run in
either a 32-bit mode or a 64-bit mode, or a mixture of both. This feature gives users the ability to transition
smoothly to 64-bit computing. The AMD64 architecture is discussed in more detail in “64-bit extensions:
AMD64 and Intel 64 Technology” in this unit.
Innovation Centre for Education

The below figure shows the architecture of the Opteron Revision F processor

Figure - Architecture of the Opteron Revision F processor

Internally, each core of the processor is connected directly to a 1 MB L2 cache. Memory


transfers between the two caches on the processors occur directly through a crossbar switch,
which is on the chip. By direct communication between the L2 caches on different cores, the
HyperTransport is not used for processor core-to-core communication. The bandwidth of the
HyperTransport can then be used to communicate with other devices or processors.
Innovation Centre for Education

AMD quad-core Barcelona


Barcelona is the third-generation AMD Opteron processor, and it extends the design of the
Revision F processor. It features a native multi-core design where all four cores are on one
piece of silicon, as opposed to packaging two dual-core die together into one single processor.
With several significant enhancements over previous generations, Barcelona provides
increased performance and energy efficiency.
Features of the Barcelona processor include:
 Optimizer Technology increases memory throughput greatly compared to previous generations of AMD
Opteron processors.
 Floating Point Accelerator provides 128-bit SSE floating point capabilities, which enables each core to
simultaneously execute up to four floating point operations per clock for a significant performance
improvement.
 Smart Cache represents significant cache enhancements with 512 KB L2 cache per core and 2 MB shared
L3 cache across all fourcores.
 CoolCore Technology can reduce energy consumption by turning off unused parts of the processor.
 Independent Dynamic Core Technology enables variable clock frequency for each core, depending on the
specific performance requirement of the applications it is supporting, thereby helping to reduce power
consumption.
 Dual Dynamic Power Management (DDPM) technology provides an independent power supply to the cores and to the
memory controller, allowing the cores and memory controller to operate on different voltages, depending on
usage.
Innovation Centre for Education

The AMD Opteron processors are identified by a four-digit model number in the form ZYXX, with
the following meaning:

 Z indicates the maximum scalability of the processor.

 1000 Series = 1-way servers and workstations


 2000 Series = Up to 2-way servers and workstations
 8000 Series = Up to 8-way servers and workstations
 Y indicates socket generation. This digit is always 3 for third-generation AMD Opteron
processors for Socket F (1207) and Socket AM2.

 1000 Series in Socket AM2 = Models 13XX


 2000 Series in Socket F (1207) = Model 23XX
 8000 Series in Socket F (1207) = Model 83XX

 XX indicates relative performance within the series.

 XX values above 40 indicate a third-generation quad-core AMD Opteron processor.


Innovation Centre for Education

AMD quad-core Shanghai


Shanghai is the next-generation AMD processor, which will be produced on a 45 nm process. Shanghai
will have a third iteration of HyperTransport and is likely to have three times more cache than
Barcelona, with improved performance and power characteristics.
Innovation Centre for Education

Opteron split-plane
Split-plane, also referred to as Dual Dynamic Power Management (DDPM) technology, was
introduced in the Barcelona processor as described during “AMD quad-core Barcelona”. The
below figure illustrates that in a split plane implementation, power delivery is separated between
the cores and the integrated memory controller, as opposed to a unified power plane design.
The new and flexible split-plane design allows power to be dynamically allocated as needed,
thus making more efficient use of power and boosting performance.
For instance, when the core is executing a CPU-intensive computation, the memory controller
can adjust the power requirement to lower the power consumption for more efficiency.
Conversely, when needed to boost memory performance, the memory controller can request
increased power supply to obtain higher frequency and achieve lower memory latency.
Because the power is coming from the system board, refreshing the system board design is
required to support a split-plane power scheme implementation. For the best level of
compatibility, however, a split-plane CPU will work well with the feature disabled while installed
in a system board that does not support the split-plane implementation.
Innovation Centre for Education

Figure- Split-plane versus unified plane


Innovation Centre for Education

IBM CPU passthru card


AMD Opteron-based systems such as the System x3755 support up to four processors and
their associated memory. In a four-way configuration, the x3755 has the configuration that is
illustrated in the figure below.

Figure - Block diagram of the x3755 with four processors installed

Note that each adjacent processor is connected through HyperTransport links, forming a
square. The third HT link on each processor card connects to an I/O device, or is unused.
Innovation Centre for Education

The x3755 also supports a 3-way configuration (by removing the processor card in slot 4 in the
upper-left quadrant) as shown in the previous Figure. This configuration results in processor
connections as shown in the top half of the below Figure.

Figure - The benefit of the passthru card for three-way configurations

However, with the addition of the IBM CPU Passthru card, part number 40K7547, the
processors on cards 1 and 3 are directly connected together, as depicted in the bottom half of
the above figure.
The passthru card basically connects two of the HyperTransport connectors together and
provides a seamless connection between the processors on either side. The resulting block
diagram of the x3755 is shown in next slide.
Figure -Block diagram of a three-way x3755 with a passthru card installed

There are performance benefits to be realized by adding the passthru card in a three-way
configuration. Without the passthru card, snoop requests and responses originating from one of
the two end processors (refer to the figure “The benefit of the passthru card for three-way
configurations”), as well as certain non-local references, must travel over two hops. With the
passthru card, this now becomes only one hop.
The benefit in decreased latency and increased memory throughput is shown in below figure.

Figure - Memory throughput benefit of the passthru card

[ For more information see the white paper Performance of the IBM System x 3755 by Douglas
M Pase and Matthew A Eckl, which is available from:
http://www.ibm.com/servers/eserver/xseries/benchmarks/related.html]
64-bit computing
As discussed above, there are three 64-bit implementations in the Intel-compatible processor
marketplace:
 Intel IA64, as implemented on the Itanium 2 processor
 Intel 64 Technology, as implemented on the 64-bit Xeon DP and Xeon MP processors
 AMD AMD64, as implemented on the Opteron processor

There exists some uncertainty as to the definition of a 64-bit processor and, even more
importantly, the benefit of 64-bit computing.
Definition of 64-bit:
A 64-bit processor is a processor that is able to address 64 bits of virtual address space. A 64-
bit processor can store data in 64-bit format and perform arithmetic operations on 64-bit
operands. In addition, a 64-bit processor has general purpose registers (GPRs) and arithmetic
logical units (ALUs) that are 64 bits wide.
The Itanium 2 has both 64-bit addressability and GPRs and 64-bit ALUs. So, it is by definition a
64-bit processor.
Intel 64 Technology extends the IA32 instruction set to support 64-bit instructions and
addressing, but are Intel 64 Technology and AMD64 processors real 64-bit chips? The answer is
yes. Where these processors operate in 64-bit mode, the addresses are 64-bit, the GPRs are
64 bits wide, and the ALUs are able to process data in 64-bit chunks. Therefore, these
processors are full-fledged 64-bit processors in this mode.
Note that while IA64, Intel 64 Technology, and AMD64 are all 64-bit, they are not compatible for
the following reasons:
Intel 64 Technology and AMD64 are, with exception of a few instructions such as 3DNOW,
binary compatible with each other. Applications written and compiled for one will usually run at
full speed on the other.
IA64 uses a completely different instruction set from the other two. 64-bit applications written for
the Itanium 2 will not run on the Intel 64 Technology or AMD64 processors, and vice versa.
64-bit extensions: AMD64 and Intel 64 Technology


Both the AMD AMD64 and Intel 64 Technology (formerly known as EM64T) architectures
extend the well-established IA32 instruction set with:
 A set of new 64-bit general purpose registers (GPRs)
 64-bit instruction pointers
 The ability to process data in 64-bit chunks
 Up to 1 TB of addressable physical memory
 64-bit integer support and a 64-bit flat virtual address space
Even though the names of these extensions suggest that the improvements are simply in
memory addressability, both AMD64 and Intel 64 Technology are fully functional 64-bit
architectures.
There are three distinct operation modes available in AMD64 and Intel 64 Technology:
 32-bit legacy mode
 Compatibility mode
 Full 64-bit mode (long mode)
For more information about the AMD64 architecture, see: http://www.x86-64.org/
For more information about Intel 64 Technology, see:
http://www.intel.com/technology/64bitextensions/
Benefits of 64-bit (AMD64, Intel 64 Technology) computing


Processors using the Intel 64 Technology and AMD64 architectures are making this transition very smooth by
offering 32-bit and 64-bit modes. This means that the hardware support for 64-bit will be in place before you
upgrade or replace your software applications with 64-bit versions. IBM System x already has many models
available with the Intel 64 Technology-based Xeon and AMD64 Opteron processors.
Here are examples of applications that will benefit from 64-bit computing:
 Encryption applications
Most encryption algorithms are based on very large integers and would benefit greatly from the use of 64-bit
GPRs and ALUs. Although modern high-level languages allow you to specify integers above the 2^32 limit, in a
32-bit system, this is achieved by using two 32-bit operands, thereby causing significant overhead when
moving those operands through the CPU pipelines. A 64-bit processor will allow you to perform a 64-bit integer
operation with one instruction.
 Scientific applications
Scientific applications are another example of workloads that need 64-bit data operations. Floating-point
operations do not benefit from the larger integer size because floating-point registers are already 80 or 128 bits
wide even in 32-bit processors.
 Software applications requiring more than 4 GB of memory
The biggest advantage of 64-bit computing for commercial applications is the flat, potentially massive, address
space.
32-bit enterprise applications such as databases are currently implementing Page Addressing Extensions
(PAE) and Addressing Windows Extensions (AWE) addressing schemes to access memory above the 4 GB
limit imposed by 32-bit address limited processors. With Intel 64 Technology and AMD64, these 32-bit
addressing extension schemes support access to memory up to 128 GB in size.
In addition, 32-bit applications might also get a performance boost from a 64-bit Intel 64
Technology or AMD64 system running a 64-bit operating system. When the processor runs in
Compatibility mode, every process has its own 4 GB memory space, not the 2 GB or 3 GB
memory space each gets on a 32-bit platform. This is already a huge improvement compared to
IA32, where the operating system and the application had to share those 4 GB of memory.
When the application is designed to take advantage of more memory, the availability of the
additional 1 or 2 GB of physical memory can create a significant performance improvement. Not
all applications take advantage of the global memory available. APIs in code need to be used to
recognize the availability of more than 2 GB of memory.
Furthermore, some applications will not benefit at all from 64-bit computing and might even
experience degraded performance. If an application does not require greater memory capacity
or does not perform high-precision integer or floating-point operations, then 64-bit will not
provide any improvement.
64-bit memory addressing


The width of a memory address dictates how much memory the processor can address. A 32-bit processor can
address up to 2^32 bytes or 4 GB. A 64-bit processor can theoretically address up to 2^64 bytes or 16 Exabytes
(16,777,216 Terabytes), although current implementations address a smaller limit, as shown in the table below.

Processor | Flat addressing | With PAE*
Intel 32-bit Xeon MP processors, including Foster MP and Gallatin | 4 GB (32-bit) | 128 GB
Intel 64-bit Xeon DP Nocona | 64 GB (36-bit) | 128 GB in compatibility mode
Intel 64-bit Xeon MP Cranford | 64 GB (36-bit) | 128 GB in compatibility mode
Intel 64-bit Xeon MP Potomac | 1 TB (40-bit) | 128 GB in compatibility mode
Intel 64-bit dual-core MP, including Paxville, Woodcrest, and Tulsa | 1 TB (40-bit) | 128 GB in compatibility mode
Intel 64-bit quad-core MP, including Clovertown, Tigerton, and Harpertown | 1 TB (40-bit) | 128 GB in compatibility mode
Intel 64-bit six-core MP Dunnington | 1 TB (40-bit) | 128 GB in compatibility mode
AMD Opteron Barcelona | 256 TB (48-bit) | 128 GB in compatibility mode
Table - Memory supported by processors
* These values may be further limited by the operating system

The 64-bit extensions in the processor architectures Intel 64 Technology and AMD64 provide better
performance for both 32-bit and 64-bit applications on the same system. These architectures are based on 64-
bit extensions to the industry-standard x86 instruction set, and provide support for existing 32-bit applications.
Processor performance
Processor performance is a complex topic because the effective CPU performance is affected
by system architecture, operating system, application, and workload. This is even more so with
the choice of three different CPU architectures, IA32, Itanium 2, and AMD64/Intel 64.
An improvement in system performance gained by a CPU upgrade can be achieved only when
all other components in the server are capable of working harder. Compared to all other server
components, the Intel processor has experienced the largest performance improvement over
time and in general, the CPU is much faster than every other component. This makes the CPU
the least likely server component to cause a bottleneck. Often, upgrading the CPU with the
same number of cores simply means the system runs with lower CPU utilization, while other
bottlenecked components become even more of a bottleneck.
In general, server CPUs execute workloads that have very random address characteristics. This
is expected because most servers perform many unrelated functions for many different users.
So, core clock speed and L1 cache attributes have a lesser effect on processor performance
compared to desktop environments. This is because with many concurrently executing threads
that cannot fit into the L1 and L2 caches, the processor core is constantly waiting for L3 cache
or memory for data and instructions to execute.
Comparing CPU architectures


Every CPU we have discussed so far had similar attributes. Every CPU has two or more
pipelines, an internal clock speed, an L1 cache, and an L2 cache (some also have an L3 cache). The various
caches are organized in different ways; some of them are 2-way associative, whereas others go
up to 16-way associativity. Some have an 800 MHz FSB; others have no FSB at all (Opteron).
Which is fastest? Is the Xeon DP the fastest CPU of them all because it is clocked at up to 3.8
GHz? Or is the Itanium Montvale the fastest because it features up to 18 MB of L3 cache? Or is
it perhaps the third generation Opteron because its L3 cache features a 32-way associativity?
As is so often the case, there is never one simple answer. When comparing processors, clock
frequency is only comparable when comparing processors of the same architectural family. You
should never compare isolated processor subsystems across different CPU architectures and
think you can make a simple performance statement. Comparing different CPU architectures is
therefore a very difficult task and has to take into account available application and operating
system support.
As a result, we do not compare different CPU architectures in this section, but we do compare the
features of the different models of one CPU architecture.
Cache associativity
Cache associativity is necessary to reduce the lookup time to find any memory address stored
in the cache. The purpose of the cache is to provide fast lookup for the CPU, because if the
cache controller had to search the entire memory for each address, the lookup would be slow
and performance would suffer.
To provide fast lookup, some compromises must be made with respect to how data can be
stored in the cache. Obviously, the entire amount of memory would be unable to fit into the
cache because the cache size is only a small fraction of the overall memory size. The
methodology of how the physical memory is mapped to the smaller cache is known as set
associativity.
First, we must define some terminology. Referring to below figure, main memory is divided into
pages. Cache is also divided into pages, and a memory page is the same size as a cache page.
Pages are divided up into lines or cache lines. Generally, cache lines are 64 bytes wide.
For each page in memory or in cache, the first line is labeled cache line 1, the second line is
labeled cache line 2, and so on. When data in memory is to be copied to cache, the line that this
data is in is copied to the equivalent slot in cache.
Looking at the figure below, when copying cache line 1 from memory page 0 to cache, it is stored
in cache line 1 in the cache. This is the only slot where it can be stored in cache. This is a one-
way associative cache, because for any given cache line in memory, there is only one position
in cache where it can be stored. This is also known as direct mapped, because the data can
only go into one place in the cache.
With a one-way associative cache, if cache line 1 in another memory page needs to be copied
to cache, it too can only be stored in cache line 1 in cache. You can see from this that you would
get a greater cache hit rate if you use greater associativity.

Figure - One-way associative (direct mapped) cache


Figure - A 2-way set associative cache

Above figure shows the 2-way set associative cache implementation. Here there are two
locations in which to store the first cache line for any memory page. As the figure illustrates,
main memory on the right side will be able to store up to two cache line 1 entries concurrently.
Cache line 1 for page 0 of main memory could be located in way-a of the cache; cache line 1 for
page n of main memory could be located in way-b of the cache simultaneously.
Expanding on a one-way and two-way set associative cache, a 3-way set associative cache, as
shown in below figure provides three locations. A 4-way set associative cache provides four
locations. An 8-way set associative cache provides eight possible locations in which to store the
first cache line from up to eight different memory pages.

Figure - 3-way set associative cache


Set associativity greatly minimizes the cache address decode logic necessary to locate a
memory address in the cache. The cache controller simply uses the requested address to
generate a pointer into the correct cache page. A hit occurs when the requested address
matches the address stored in one of the fixed number of cache locations associated with that
address. If the particular address is not there, a cache miss occurs.
Notice that as the associativity increases, the lookup time to find an address within the cache
could also increase, because more pages of cache must be searched. To avoid longer cache
lookup times as associativity increases, the lookups are performed in parallel. However, as the
associativity increases, so does the complexity and cost of the cache controller.
For high performance X3 Architecture systems such as the System x3950, lab measurements
determined that the most optimal configuration for cache was 9-way set associativity, taking into
account performance, complexity, and cost.
A fully associative cache, in which any memory cache line could be stored in any cache location,
could be implemented, but this is almost never done because of the expense (in both cost and
die area) of the parallel lookup circuits required.
Large servers generally have random memory access patterns, as opposed to sequential
memory access patterns. Higher associativity favors random memory workloads due to its ability
to cache more distributed locations of memory.
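As a concrete illustration of the mapping described above, the following minimal Python sketch computes which cache set (and tag) a physical address maps to for a given line size and associativity. The cache size and address used in the example are illustrative and not tied to any specific IBM or Intel controller.

def cache_location(address, cache_size, line_size=64, ways=2):
    """Return (set index, tag) for an address in a set-associative cache.

    With ways=1 this is the direct-mapped case described above: every address
    has exactly one possible slot.  Higher associativity gives `ways` candidate
    slots per set, which the controller searches in parallel.
    """
    num_sets = cache_size // (line_size * ways)
    line_number = address // line_size
    set_index = line_number % num_sets
    tag = line_number // num_sets
    return set_index, tag

# Example: a 1 MB cache with 64-byte lines, direct-mapped versus 2-way.
addr = 0x12345678
print(cache_location(addr, 1024 * 1024, ways=1))
print(cache_location(addr, 1024 * 1024, ways=2))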
Cache size
Faster, larger caches usually result in improved processor performance for server workloads.
Performance gains obtained from larger caches increase as the number of processors within the
server increase. When a single CPU is installed in a four-socket SMP server, there is little
competition for memory access.
Consequently, when a CPU has a cache miss, memory can respond, and with the deep pipeline
architecture of modern processors, the memory subsystem usually responds before the CPU
stalls. This allows one processor to run fast almost independently of the cache hit rate.
On the other hand, if there are four processors installed in the same server, each queuing
multiple requests for memory access, the time to access memory is greatly increased, thus
increasing the potential for one or more CPUs to stall. In this case, a fast L2 hit saves a
significant amount of time and greatly improves processor performance.
As a rule, the greater the number of processors in a server, the more gain from a large L2
cache. In general:
With two CPUs, expect 4% to 6% improvement when you double the cache size
With four CPUs, expect 8% to 10% improvement when you double the cache size

With eight or more CPUs, you might expect as much as 10% to 15% performance gain when
you double processor cache size.
Shared cache
Shared cache is introduced in “Intel Core microarchitecture” section, as shared L2 cache to
improve resource utilization and boost performance. In Nehalem, sharing cache moves to the
third level cache for overall system performance considerations. In both cases, the last level of
cache is shared among different cores and provides significant benefits in multi-core
environments, as opposed to a dedicated cache implementation.
Benefits of shared cache include:
 Improved resource utilization, which makes efficient usage of the cache. When one core idles,
the other core can take all the shared cache.
 Shared cache, which offers faster data transfer between cores than system memory, thus
improving system performance and reducing traffic to memory.
 Reduced cache coherency complexity, because a coherency protocol does not need to be maintained
for the shared cache level; the data there is shared rather than distributed, so it remains consistent.
 More flexible design of the code relating to communication of threads and cores because
programmers can leverage this hardware characteristic.
 Reduced data storage redundancy, because the same data in the shared cache needs to be
stored only once.
CPU clock speed (frequency)


Processor clock speed affects CPU performance because it is the speed at which the CPU
executes instructions. Measured system performance improvements because of an increase in
clock speed are usually not directly proportional to the clock speed increase. For example, when
comparing a 3.0 GHz CPU to an older 1.6 GHz CPU, you should not expect to see 87%
improvement. In most cases, performance improvement from a clock speed increase will be
about 30% to 50% of the percentage increase in clock speed. So for this example, you could
expect about 26% to 44% system performance improvement when upgrading a 1.6 GHz CPU to
a 3.0 GHz CPU.
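That rule of thumb can be captured in a couple of lines; a minimal sketch in Python, where the 30% to 50% efficiency range is simply the estimate quoted above, not a measured value:

def clock_upgrade_gain(old_ghz, new_ghz, efficiency=(0.30, 0.50)):
    """Rough system-level gain from a clock-speed upgrade, applying the
    30%-to-50% rule of thumb described above."""
    clock_increase = (new_ghz - old_ghz) / old_ghz      # 0.875 for 1.6 GHz -> 3.0 GHz
    return tuple(round(clock_increase * e, 2) for e in efficiency)

print(clock_upgrade_gain(1.6, 3.0))   # -> (0.26, 0.44), i.e. roughly 26% to 44%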
Scaling versus the number of processor cores


In general, the performance gains shown in below figure, can be obtained by adding CPUs when the server
application is capable of efficiently utilizing additional processors—and there are no other bottlenecks
occurring in the system.

Figure - Typical performance gains when adding processors

These scaling factors can be used to approximate the achievable performance gains that can be obtained
when adding CPUs and memory to a scalable Intel IA-32 server.
For example, begin with a 1-way 3.0 GHz Xeon MP processor and add another Xeon MP processor. Server
throughput performance will improve up to about 1.7 times. If you increase the number of Xeon processors to
four, server performance can improve to almost three times greater throughput than the single processor
configuration.
At eight processors, the system has slightly more than four times greater throughput than the single processor
configuration. At 16 processors, the performance increases to over six-fold greater throughput than the single
CPU configuration.
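Those approximate scaling factors can be turned into a quick estimator; a minimal Python sketch in which the multipliers are simply read off the figures quoted above (illustrative and workload dependent, not a general law):

# Approximate throughput relative to a single Xeon MP processor, taken from the
# scaling figures quoted above (workload dependent; assumes no other bottleneck).
SCALING = {1: 1.0, 2: 1.7, 4: 3.0, 8: 4.0, 16: 6.0}

def estimated_throughput(single_cpu_tps, cpus):
    """Scale a measured single-processor throughput by the rule-of-thumb factor."""
    return single_cpu_tps * SCALING[cpus]

print(estimated_throughput(1000.0, 4))   # -> ~3000, i.e. almost three times one CPU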
Database applications such as IBM DB2, Oracle, and Microsoft SQL Server usually provide the
greatest performance improvement with increasing numbers of CPUs. These applications have
been painstakingly optimized to take advantage of multiple CPUs. This effort has been driven by
the database vendors’ desire to post #1 transaction processing benchmark scores. High-profile
industry-standard benchmarks do not exist for many applications, so the motivation to obtain
optimal scalability has not been as great. As a result, most non-database applications have
significantly lower scalability. In fact, many do not scale beyond two to four CPUs.
Processor features in BIOS


BIOS levels permit various settings for performance in certain IBM System x servers.
Processor Adjacent Sector Prefetch
 When this setting is enabled (and enabled is the default for most systems), the processor
retrieves both sectors of a cache line when it requires data that is not currently in its cache.
 When this setting is disabled, the processor will only fetch the sector of the cache line that
includes the data requested. For instance, only one 64-byte line from the 128-byte sector will
be prefetched with this setting disabled.
 This setting can affect performance, depending on the application running on the server and
memory bandwidth utilization. Typically, it affects certain benchmarks by a few percent,
although in most real applications it will be negligible. This control is provided for benchmark
users who want to fine-tune configurations and settings.
Processor Hardware Prefetcher
 When this setting is enabled (disabled is the default for most systems), the processor is able
to prefetch extra cache lines for every memory request. Recent tests in the performance lab
have shown that you will get the best performance for most commercial application types if
you disable this feature. The performance gain can be as much as 20%, depending on the
application.
 For high-performance computing (HPC) applications, we recommend that you enable HW
Prefetch. For database workloads, we recommend that you leave HW Prefetch disabled.
IP Prefetcher
 Each core has one IP prefetcher. When this setting is enabled, the prefetcher examines the
history of reads in the L1 cache in order to detect access patterns and to load predictable
data ahead of time.
DCU Prefetcher
 Each core has one DCU prefetcher. When this setting is enabled, the prefetcher detects
multiple reads from a single cache line within a certain period of time and loads the
following line into the L1 cache.
Basically, hardware prefetch and adjacent sector prefetch are L2 prefetchers. The IP prefetcher
and DCU prefetcher are L1 prefetchers.
All prefetch settings decrease the miss rate for the L1/L2/L3 cache when they are enabled.
However, they consume bandwidth on the front-side bus, which can reach capacity under heavy
load. By disabling all prefetch settings, multi-core setups achieve generally higher performance
and scalability. Most benchmarks turned in so far show that they obtain the best results when all
prefetchers are disabled.
Below table lists the recommended settings based on benchmark experience in IBM labs.
Customers should always use these settings as recommendations and evaluate their
performance based on actual workloads. When in doubt, always vary the settings to determine which
combination delivers maximum performance.

Table - Recommended prefetching settings


Memory

Memory technology
This section introduces key terminology and technology that are related to memory; the topics
discussed are:
 “DIMMs and DRAMs”
 “Ranks”
 “SDRAM”
 “Registered and unbuffered DIMMs”
 “Double Data Rate memory”
 “Fully-buffered DIMMs”
 “MetaSDRAM”
 “DIMM nomenclature”
 “DIMMs layout”
 “Memory interleaving”
DIMMs and DRAMs

DRAM chips on a DIMM


Figure - DRAM capacity as printed on a PC3200 (400 MHz) DDR DIMM

The sum of the capacities of the DRAM chips (excluding any used for ECC functions) equals the
capacity of the DIMM. Using the previous example, the DRAMs in the above figure are 8 bits wide, so:

8 x 128M = 1024 Mbits = 128 MB per DRAM


128 MB x 8 DRAM chips = 1024 MB or 1 GB of memory
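The same arithmetic can be written as a small helper; a minimal Python sketch that mirrors the worked example above (the function name is illustrative):

def dimm_capacity_mb(dram_depth_m, dram_width_bits, data_drams):
    """DIMM capacity in MB from the DRAM organisation printed on the module.

    A "128M x 8" DRAM is 128M locations deep and 8 bits wide, i.e. 1024 Mbit
    or 128 MB per chip; ECC chips are not counted towards usable capacity.
    """
    per_dram_mb = dram_depth_m * dram_width_bits // 8   # Mbit -> MB
    return per_dram_mb * data_drams

print(dimm_capacity_mb(128, 8, 8))   # -> 1024 MB, i.e. a 1 GB DIMM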
Ranks
A rank is a set of DRAM chips on a DIMM that provides eight bytes (64 bits) of data. DIMMs are
typically configured as either single-rank (1R) or double-rank (2R) devices but quad-rank
devices (4R) are becoming more prevalent.
Using x4 DRAM devices, and not including DRAMS for ECC, a rank of memory is composed of
64 / 4 = 16 DRAMs. Similarly, using x8 DRAM devices, a rank is composed of only 64 / 8 = 8
DRAMs.
It is common, but less accurate, to refer to memory ranking in terms of “sides”. For example,
single-rank DIMMs can often be referred to as single-sided DIMMs, and double-ranked DIMMs
can often be referred to as double-sided DIMMs.
However, single ranked DIMMs, especially those using x4 DRAMs often have DRAMs mounted
on both sides of the DIMMs, and quad-rank DIMMs will also have DRAMs mounted on two
sides. For these reasons, it is best to standardize on the true DIMM ranking when describing the
DIMMs.
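The chip counts above follow directly from the 64-bit rank width; a minimal Python sketch, where the ECC option assumes the usual eight check bits per 64-bit word:

def drams_per_rank(dram_width_bits, ecc=False):
    """Number of DRAM chips needed to supply one 64-bit rank.

    x4 devices need 64/4 = 16 chips and x8 devices need 8; with ECC,
    eight additional check bits add 8/width more chips.
    """
    chips = 64 // dram_width_bits
    if ecc:
        chips += 8 // dram_width_bits
    return chips

print(drams_per_rank(4), drams_per_rank(8), drams_per_rank(4, ecc=True))   # 16 8 18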
DIMMs may have many possible DRAM layouts, depending on word size, number of ranks, and
manufacturer design. Common layouts for single and dual-rank DIMMs are identified here:
 x8SR = x8 single-ranked modules
These have five DRAMs on the front and four DRAMs on the back with empty spots in between
the DRAMs, or they can have all 9 DRAMs on one side of the DIMM only.
 x8DR = x8 double-ranked modules
These have nine DRAMs on each side for a total of 18 (no empty slots).
 x4SR = x4 single-ranked modules
These have nine DRAMs on each side for a total of 18, and they look similar to x8 double-
ranked modules.
 x4DR = x4 double-ranked modules
These have 18 DRAMs on each side, for a total of 36.
The rank of a DIMM also impacts how many failures a DIMM can tolerate using redundant bit
steering.
SDRAM
Synchronous Dynamic Random Access Memory (SDRAM) is used commonly in servers today,
and this memory type continues to evolve to keep pace with modern processors. SDRAM
enables fast, continuous bursting of sequential memory addresses. After the first address is
supplied, the SDRAM itself increments an address pointer and readies the next memory
location that is accessed. The SDRAM continues bursting until the predetermined length of data
has been accessed. The SDRAM supplies and uses a synchronous clock to clock out data from
the SDRAM chips. The address generator logic of the SDRAM module also uses the system-
supplied clock to increment the address counter to point to the next address.
There are two types of SDRAM currently on the market: registered and unbuffered. Only
registered SDRAMs are now used in System x servers, however. Registered and unbuffered
DIMMs cannot be mixed together in a server.
With unbuffered DIMMs, the memory controller communicates directly with the DRAMs, giving
them a slight performance advantage over registered DIMMs. The disadvantage of unbuffered
DIMMs is that they have a limited drive capability, which means that the number of DIMMs that
can be connected together on the same bus remains small, due to electrical loading.
In contrast, registered DIMMs use registers to isolate the memory controller from the DRAMs,
which leads to a lighter electrical load. Therefore, more DIMMs can be interconnected and
larger memory capacity is possible. The register does, however, typically impose a clock or
more of delay, meaning that registered DIMMs often have slightly longer access times than their
unbuffered counterparts.
Double Data Rate memory

Data transfers made to and from an SDRAM DIMM use a synchronous clock signal to establish
timing. For example, SDRAM memory transfers data whenever the clock signal makes a
transition from a logic low level to a logic high level. Faster clock speeds mean faster data
transfer from the DIMM into the memory controller (and finally to the processor) or PCI
adapters. However, electromagnetic effects induce noise, which limits how fast signals can be
cycled across the memory bus, and have prevented memory speeds from increasing as fast as
processor speeds.

All Double Data Rate (DDR) memories, including DDR, DDR2, and DDR3, increase their
effective data rate by transferring data on both the rising edge and the falling edge of the clock
signal. DDR DIMMs use a 2-bit prefetch scheme such that two sets of data are effectively
referenced simultaneously. Logic on the DIMM multiplexes the two 64-bit results (plus ECC bits)
to appear on subsequent data transfers. Thus, two data transfers can be performed during one
memory bus clock cycle, enabling double the data transfer rate over non-DDR technologies.
DDR2

DDR2 is the technology follow-on to DDR, with the primary benefits being the potential for faster
throughput and lower power. In DDR2, the memory bus is clocked at two times the frequency of
the memory core. Stated alternatively, for a given memory bus speed, DDR2 allows the memory
core to operate at half the frequency, thereby enabling a potential power savings.
Although DDR memory topped out at a memory bus clock speed of 200 MHz, DDR2 memory
increases the memory bus speed to as much as 400 MHz. Note that even higher DDR2 speeds
are available, but these are primarily used in desktop PCs. Below figure shows standard and
small form factor DDR2 DIMMs.

Figure - A standard DDR2 DIMM (top) and small form-factor DDR2 DIMM (bottom)
DDR2 also enables additional power savings through use of a lower operating voltage. DDR
uses a range of 2.5 V to 2.8 V. DDR2 only requires 1.8 V.

Because the DDR2 memory core is clocked at half the bus rate, only the highest-speed DDR2
parts have memory core speeds similar to DDR. DDR2 therefore employs a number of mechanisms to
reduce the potential for performance loss. DDR2 increases the number of bits prefetched into
I/O buffers from the memory core to 4 per clock, thus enabling the sequential memory
throughput for DDR and DDR2 memory to be equal when the memory bus speeds are equal.

However, because DDR2 still has a slower memory core clock when the memory bus speeds
are equal, the memory latencies of DDR2 are typically higher.
Fortunately, the latest generation and most common DDR2 memory speeds typically also
operate at significantly higher memory frequency.
DDR2 Memory is commonly found in recent AMD Opteron systems, as well as in the HS12
server blade and the x3850M2 and x3950M2 platforms.
The below table lists the common DDR2 memory implementations

Table - DDR2 memory implementations


DDR3

DDR3 is the next evolution of DDR memory technology. Like DDR2, it promises to deliver ever-
increasing memory bandwidth and power savings. DDR3 DIMMs are used on all 2-socket-
capable servers using the Intel 5500-series processors and newer.

DDR3 achieves its power efficiency and throughput advantages over DDR2 using many of the
same fundamental mechanisms employed by DDR2.

DDR3 improvements include:


 Lower supply voltages, down to 1.5 V in DDR3 from 1.8 V in DDR2
 Memory bus clock speed increases to four times the core clock
 Prefetch depth increases from 4-bit in DDR2 to 8-bit in DDR3
 Continued silicon process improvements
Fully-buffered DIMMs

As CPU speeds increase, memory access must keep up so as to reduce the potential for
bottlenecks in the memory subsystem. As the replacement for the parallel memory bus design
of DDR2, FB-DIMMs enabled additional bandwidth capability by utilizing a larger number of
serial channels to the memory DIMMs. This increase in memory channels, from two in DDR2 to
four with FB-DIMMs, allowed a significant increase in memory bandwidth without a significant
increase in circuit board complexity or the numbers of wires on the board.

Although fully-buffered DIMM (FB-DIMM) technology replaced DDR2 memory in most systems,
it too has been replaced by DDR3 memory in the latest 2-socket systems utilizing the Intel
5500-series processors. Non-IBM systems using the Intel 7400-series processors or earlier also
still utilize FB-DIMMs, though the x3850 M2 and x3950 M2 systems utilize IBM eX4 chipset
technology to allow memory bandwidth improvements while maintaining usage of standard
DDR2 DIMMs.

FB-DIMMs use a serial connection to each DIMM on the channel. As shown in below figure, the
first DIMM in the channel is connected to the memory controller. Subsequent DIMMs on the
channel connect to the one before it. The interface at each DIMM is a buffer known as the
Advanced Memory Buffer (AMB).
Figure - Comparing the DDR stub-bus topology with FB-DIMM serial topology
Figure - Fully buffered DIMMs architecture


Advanced Memory Buffer

An FB-DIMM uses a buffer known as the Advanced Memory Buffer (AMB), as circled in the
below figure. The AMB is a memory interface that connects an array of DRAM chips to the serial
memory channel. The AMB is responsible for handling FB-DIMM channel and memory requests
to and from the local FB-DIMM, and for forwarding requests to other AMBs in other FB-DIMMs
on the channel.

Figure - Advanced Memory Buffer on an FB-DIMM


The functions that the AMB performs include the following:

 Channel initialization to align the clocks and to verify channel connectivity. This is a
synchronization of all DIMMs on a channel so that they are all communicating at the same time.
 Support the forwarding of southbound frames (writing to memory) and northbound frames
(reading from memory), servicing requests directed to a specific FB-DIMM's AMB and merging
the return data into the northbound frames.
 Detect errors on the channel and report them to the memory controller.
 Act as a DRAM memory buffer for all read, write, and configuration accesses addressed to a
specific FB-DIMM's AMB.
 Provide a read and write buffer FIFO.
 Support an SMBus protocol interface for access to the AMB configuration registers.
 Provide a register interface for the thermal sensor and status indicator.
 Function as a repeater to extend the maximum length of FB-DIMM Links.
Low Power FB-DIMMs

While FB-DIMMs commonly consume up to twice the power of standard DDR2 DIMMs due to
the usage of the AMB, significant developments have been made to improve this in the latest
generation of FB-DIMMs. While the performance aspects of these DIMMs are fundamentally
unchanged, the power consumption of the DIMMs can be cut by up to 50%.

The following are some of the mechanisms used to reduce FB-DIMM power:

 AMB and DRAM process improvements


 Lower DRAM and AMB voltages
 Potential elimination of a level of voltage regulation, which saps efficiency
FB-DIMM performance

By using serial memory bus technology requiring fewer board traces per channel, FB-DIMMs
enabled memory controller designers to increase the number of memory channels, and thus
aggregate bandwidth, without a significant increase in memory controller complexity. This daisy-
chained, serial architecture adds latency to memory accesses, however, and when multiplied
over many DIMMs can add up to a significant performance impact.
While all applications will have slightly different behavior, the below figure illustrates the
performance loss for a moderately memory-intensive database application as FB-DIMMs are
added to the memory channels of a server. Note that for this application, the performance loss
was measured at ~2% overall per additional DIMM added across the memory channels.
MetaSDRAM
Many DDR2- and DDR3-based servers are now supporting a new type of memory DIMM,
known as MetaSDRAM, often shortened to the founding company’s name, MetaRAM. This
technology’s primary benefit is to allow high capacity DIMM sizes without the exponentially
higher prices always commanded by the largest DIMM sizes.
MetaSDRAM accomplishes this by using a custom-designed chipset on each DIMM module
which makes multiple, inexpensive SDRAMs appear as a single, large capacity SDRAM.
Because this chipset acts as an onboard buffer and appears as a single electrical load on the
memory bus, higher frequencies can often be achieved using MetaSDRAM-based DIMMs than
would be possible using standard DIMMs, which often have to switch to slower speeds as
multiple DIMMs are added per channel.
MetaSDRAM DIMMs offset the additional power consumption imposed by the additional SDRAMs by
implementing intelligent power management techniques, allowing two to four times the memory
capacity to fit within a typical server’s power and cooling capability.
Due to the inclusion of the additional MetaSDRAM chipset between the memory bus and the
DDR2 or DDR3 SDRAMs, there is some latency added to the memory transactions. The impact
of this latency, however, has been measured across a number of commercial application
environments to impact overall application performance by just 1% to 2%. Because the
technology has the potential to double a server’s memory size, this small impact is well under
the potential gains achievable from the added memory capacity. However, if MetaSDRAM is
being used only to reduce cost in a server’s memory subsystem, it is important to realize that
performance can be slightly lower than standard DIMMs, depending on configuration.
DIMM nomenclature
The speed of a memory DIMM is most commonly indicated by numeric PC and DDR values on
all current DIMM types. The PC value correlates to the theoretical peak transfer rate of the
module, whereas the DDR value represents the bus transfer rate in Millions of transfers per
second. The tables in this section list the nomenclature and the bus speed, transfer rate, and
peak throughput.
With DDR, DDR2, and DDR3, because the SDRAM transfers data on both the falling and the
rising edges of the clock signal, transfers speeds are double the memory bus clock speed. The
peak throughput per channel can be calculated by multiplying the DDR transfer rate by the
transfer size of 64 bits (8 bytes) per transfer.
The below Table summarizes the common nomenclatures for DDR memory.

Table - DDR memory implementations


Table - DDR2 memory implementations

Table - DDR3 memory implementations


DIMMs layout

The DIMM location within the system’s DIMM sockets is often mandated by a system’s
installation guide to ensure that DIMMs are in a supported ordering, can be seen by the memory
controller, and are spread across the available memory channels to optimize bandwidth.
However, there are also other performance implications for the DIMM layout.

For DDR and DDR2 DIMMs, optimal performance is typically obtained when fully populating the
available DIMM slots with equivalent memory size and type of DIMMs. When this is not feasible
in a solution, populating in multiples of 4 or 8 DIMMs, all of the same size and type, can often
have a similar outcome.

These methods allow the memory controller to maintain the maximum number of open memory
pages, and allow the memory controller to optimize throughput and latency by allowing address
bit permuting (sometimes called symmetric mode or enhanced memory mapping). This reorders
memory addresses to optimize memory prefetch efficiency and reduces average memory
read/write latencies significantly.
Memory interleaving
Interleaving is a technique that is often used to organize DIMMs on the motherboard of a server
to improve memory transfer performance. The technique can be implemented within a single
cache line access or across multiple cache lines to improve total memory bandwidth. When two
DIMMs are grouped together and accessed concurrently to respond to a single cache line
request, the interleave is defined as two-way. When four DIMMs are grouped together and
accessed concurrently for a single cache line, the interleave is four-way.

Interleaving improves memory performance because each DIMM in the interleave is given its
memory address at the same time. Each DIMM begins the access while the memory controller
waits for the first access latency time to expire. Then, after the first access latency has expired,
all DIMMs in the interleave are ready to transfer multiple 64-bit objects in parallel, without delay.

Although interleaving was often a tunable parameter in older server architectures, it is typically
optimized at system boot time by the BIOS on modern systems, and therefore does not need to
be explicitly tuned.
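To illustrate the idea, the following minimal Python sketch models a simple N-way interleave in which consecutive cache lines rotate across DIMM groups. Real memory controllers use more elaborate mappings, so this is illustrative only.

def interleave_target(address, ways=2, line_size=64):
    """Which DIMM group in a simple N-way interleave serves a given cache line.

    Simplified model: consecutive cache lines rotate across the interleaved
    DIMM groups so that all of them can work on a burst in parallel.
    """
    return (address // line_size) % ways

# Four consecutive cache lines alternate between two DIMM groups (two-way interleave).
for line in range(4):
    addr = line * 64
    print(hex(addr), "-> DIMM group", interleave_target(addr))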
Memory performance
Performance with memory can be simplified into two main areas: bandwidth and latency.

Bandwidth
Basically, the more memory bandwidth you have, the better system performance will be,
because data can be provided to the processors and I/O adapters faster. Bandwidth can be
compared to a highway: the more lanes you have, the more traffic you can handle.

The memory DIMMs are connected to the memory controller through memory channels, and the
memory bandwidth of a system is calculated by multiplying the data width of a channel by the
number of channels, and then by the frequency of the memory.

For example, if a processor is able to support up to 400 MHz (DDR-400) registered ECC
memory and has two 8-byte channels from the memory controller to access the memory, then
the memory bandwidth of the system will be 8 bytes*2 channels*400 MHz, or 6.4 GBps.
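That calculation is easy to reproduce; a minimal Python sketch of the same formula (the function name is illustrative):

def peak_memory_bandwidth_gbps(channel_width_bytes, channels, transfers_per_sec_millions):
    """Theoretical peak bandwidth = channel width x number of channels x transfer rate."""
    return channel_width_bytes * channels * transfers_per_sec_millions / 1000.0

# The example above: two 8-byte DDR-400 channels.
print(peak_memory_bandwidth_gbps(8, 2, 400))   # -> 6.4 GBps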
Latency
The performance of memory access is usually described by listing the number of memory bus clock cycles that
are necessary for each of the 64-bit transfers needed to fill a cache line. Cache lines are multiplexed to
increase performance, and the addresses are divided into row addresses and column addresses.
A row address is the upper half of the address (that is, the upper 32 bits of a 64-bit address). A column address
is the lower half of the address. The row address must be set first, then the column address must be set. When
the memory controller is ready to issue a read or write request, the address lines are set, and the command is
issued to the DIMMs.
When two requests have different column addresses but use the same row address, they are said to “occur in
the same page.” When multiple requests to the same page occur together, the memory controller can set the
row address once, and then change the column address as needed for each reference. The page can be left
open until it is no longer needed, or it can be closed after the first request is issued. These policies are referred
to as a page open policy and a page closed policy, respectively.
The act of changing a column address is referred to as Column Address Select (CAS).
There are three common access times:
CAS: Column Address Select
RAS to CAS: delay between row access and column access
RAS: Row Address Strobe
Sometimes these numbers are expressed as x-y-y by manufacturers.


These numbers are expressed in clocks, and might be interpreted as wait times or latency.
Therefore, the lower these numbers are the better, because access times imply data access
latency.
CAS Latency (CL) measures the number of memory clocks that elapse between the time a
memory controller sets the column address to request a line of data and when the DIMM is able
to respond with that data. Even if other latencies are specified by memory manufacturers, CL is
the most commonly used when talking about latency. If you look at the sticker on a DIMM
(below figure), it might list the CL value for that particular device.

Figure - CAS Latency value as printed on a PC3200 (400 MHz) DDR DIMM
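Because CL is quoted in memory-bus clocks, it is straightforward to convert it into nanoseconds; a minimal Python sketch (the CL3 value in the example is illustrative):

def cas_latency_ns(cl_clocks, bus_clock_mhz):
    """Convert a CAS Latency given in memory-bus clocks into nanoseconds."""
    return cl_clocks / bus_clock_mhz * 1000.0

# PC3200 (DDR-400) uses a 200 MHz bus clock with data transferred on both edges,
# so a CL3 part waits 3 / 200 MHz = 15 ns for the column access.
print(cas_latency_ns(3, 200))   # -> 15.0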
Loaded versus unloaded latency


When talking about latency, manufacturers generally refer to CL, which is a theoretical latency
and corresponds to an unloaded latency condition. A single access to memory from a single
thread, executed by a single processor while no other memory accesses are occurring is
referred to as unloaded latency.
As soon as multiple processors or threads concurrently access memory, the latencies
increase. This condition is termed loaded latency. While the conditions for unloaded latency are
easily defined when memory accesses only happen one at a time, loaded latency is much more
difficult to specify. Since latency will always increase as load is added, and is also dependent on
the characteristics of the load itself (for example, the percentage of the time the data is found
within an open page), loaded latency cannot be characterized by any single, or even small
group, of data points. For this reason alone, memory latencies are typically compared only in
the unloaded state.
Since real systems and applications are almost never unloaded, there is typically very little value
in utilizing unloaded latency as a comparison point between systems of different architectures.
STREAM benchmark
Many benchmarks exist to test memory performance. Each benchmark acts differently and
gives different results because each simulates different workloads. However, one of the most
popular and simple benchmarks is STREAM.
STREAM is a synthetic benchmark program that measures sequential access memory
bandwidth in MBps and the computational rate for simple vector kernels. The benchmark is
designed to work with larger data sets than the available cache on most systems, so the results
are indicative of very large vector-style applications. It provides real-world sustained memory
bandwidth and not the theoretical peak bandwidth that is provided by vendors.
It is critical to note, however, that very few server applications access large sections of memory
sequentially as the STREAM benchmark does. More commonly, real application workloads
access memory randomly, and random memory performance can be many times slower than
sequential memory bandwidth due to system-level memory access latencies. Such latencies
include overheads from the processor cache, bus protocols, memory controller, and the
memory latency itself. In addition, tuning the memory subsystem for maximum bandwidth is
often done to the detriment of random memory performance, so a system that is better with
STREAM could actually perform worse in many real world workloads.
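For illustration only, here is a minimal STREAM-style "triad" kernel written with NumPy. It is not the official benchmark (which is distributed by its authors as tuned C/Fortran code), and interpreter overheads will understate what tuned native code achieves, but it shows the basic idea of measuring sustained sequential bandwidth on arrays larger than the caches.

import time
import numpy as np

# STREAM-style triad: a = b + scalar * c, on arrays much larger than the caches.
n = 20_000_000                       # ~160 MB per float64 array
b = np.random.rand(n)
c = np.random.rand(n)
scalar = 3.0

start = time.perf_counter()
a = b + scalar * c
elapsed = time.perf_counter() - start

bytes_moved = 3 * n * 8              # read b, read c, write a
print(f"Sustained sequential bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")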
SMP and NUMA architectures


There are two fundamental system architectures used in the x86 server market: SMP and
NUMA. Each architecture can have its strengths and limitations, depending on the target
workload, so it is important to understand these details when comparing systems.
SMP architecture
Prior to the introduction of Intel 5500-series processors, most systems based on Intel
processors typically used the Symmetric Multiprocessing (SMP) architecture. The exception to
this is the IBM x3950 M2, which uses a combination of SMP in each node and NUMA
architecture between nodes.
SMP systems are fundamentally defined by having one or more Front-Side Buses (FSB) that
connect the processors to a single memory controller, also known as a north bridge. Although
older system architectures commonly deployed a single shared FSB, newer systems often have
one FSB per processor socket.
In an SMP architecture, the CPU accesses memory DIMMs through the separate memory
controller or north bridge. The north bridge handles traffic between the processor and memory,
and controls traffic between the processor and I/O devices, as well as data traffic between I/O
devices and memory. In the below figure, the central position of the north bridge and the shared
front-side bus are shown. These components play the dominant role in determining memory
performance.
Figure - An Intel dual-processor memory block

Whether the memory controller and processors implement shared, or separate front side buses, there is still a
single point of contention in the SMP design. As shown in above figure, if a single CPU is consuming the
capabilities of the FSB, memory controller, or memory buses, there will be limited gain from using a second
processor. For applications that have high front-side bus and memory controller utilizations (which is most
common in scientific and technical computing environments), this architecture can limit the potential for
performance increases. This becomes more of an issue as the number of processor cores are increased,
because the requirement for additional memory bandwidth increases for these workloads, as well.
The speed of the north bridge is tied to the speed of the front-side bus, so even as processor clock speeds
increase, the latency to memory remains virtually the same. The speed of the front-side bus places an upper
bound on the rate at which a processor can send data to or receive data from memory. For this reason,
aggregate memory bandwidth of the memory channels attached to the memory controller is always tuned to be
equal to or greater than the aggregate bandwidth of the front-side bus interfaces attached to the memory
controller.
NUMA architecture
To overcome the limitations of a shared memory controller design, the latest Intel processors, as
well as Opteron processors, utilize a Non-Uniform Memory Architecture (NUMA), rather than an
SMP design. In NUMA configurations, the memory controller is integrated into the processor,
which can be a benefit for two reasons:
The memory controller is able to be clocked at higher frequencies, because the communications
between the processor cores and memory controller do not have to go out to the system board.
In this design, as the processor speed is increased, the memory controller speed can often also
be increased, thus reducing the latency through the memory controller while increasing the
sustainable bandwidth through the memory controller.
As additional processors are added to a system, additional memory controllers are also added.
Because the demand for memory increases as processing capability increases, the additional
memory controllers provide linear increases in memory bandwidth to accommodate the load
increase of additional processors, thus enabling the potential for very efficient processor scaling
and high potential memory bandwidths.
In x86 server architectures, all system memory must be available to all server processors. This
requirement means that with a NUMA architecture, we have a new potential performance
concern, remote memory. From a processor point of view, local memory refers to the memory
connected to the processor’s integrated memory controller. Remote memory, in comparison, is
that memory that is connected to another processor’s integrated memory controller.
The below figure depicts the architecture of an AMD Opteron processor. As in the previous figure,
there is a processor core and cache. However, in place of a bus interface unit and an external
memory controller, there is an integrated memory controller (MCT), an interface to the
processor core (SRQ), three HyperTransport (HT) units and a crossbar switch to handle routing
data, commands, and addresses between processors and I/O devices.

Figure - An AMD dual-processor memory block
The 32-bit 4 GB memory limit

A memory address is a unique identifier for a memory location at which a processor or other device can store a piece of data for
later retrieval. Each address identifies a single byte of storage. All applications use virtual addresses, not physical. The operating
system maps any (virtual) memory requests from applications into physical locations in RAM. When the combined total amount of
virtual memory used by all applications exceeds the amount of physical RAM installed in the server, the difference is stored in the
page file, which is also managed by the operating system.

32-bit CPUs, such as older Intel Xeon processors, have an architectural limit of only being able to directly address 4 GB of
memory. With many enterprise server applications now requiring significantly greater memory capacities, CPU and operating
system vendors have developed methods to give applications access to more memory.

The first method was implemented by Microsoft with its Windows NT® 4.0 Enterprise Edition operating system. Prior to
Enterprise Edition, the 4 GB memory space in Windows was divided into 2 GB for the operating system kernel and 2 GB for
applications. Enterprise Edition offered the option to allocate 3 GB to applications and 1 GB to the operating system.

The 32-bit Linux kernels, by default, split the 4 GB virtual address space of a process in two parts: 3 GB for the user-space virtual
addresses and the upper 1 GB for the kernel virtual addresses. The kernel virtual area maps to the first 1 GB of physical RAM
and the rest is mapped to the available physical RAM.

The potential issue here is that the kernel maps directly all available kernel virtual space addresses to the available physical
memory, which means a maximum of 1 GB of physical memory for the kernel. For more information, see the article High Memory
in the Linux Kernel, which is available at: http://kerneltrap.org/node/2450

For many modern applications, however, 3 GB of memory is simply not enough. To address more than 4 GB of physical memory
on 32-bit operating systems, two mechanisms are typically used: Physical Address Extension (PAE) and for Windows, Address
Windowing Extensions (AWE).
Physical Address Extension


32-bit operating systems written for the 32-bit Intel processor use a segmented memory addressing scheme.
The maximum directly addressable memory is 4 GB (2^32). However, an addressing scheme was created to
access memory beyond this limit: the Physical Address Extension (PAE).
This addressing scheme is part of the Intel Extended Server Memory Architecture and takes advantage of the
fact that the 32-bit memory controller actually has 36 bits available for memory and L2
addressing. The extra four address bits are normally not used, but can be employed along with the PAE
scheme to generate addresses above the standard 4 GB address limit.
PAE uses a four-stage address generation sequence and accesses memory using 4 KB pages, as shown in
below figure

Figure - PAE-36 address translation


Windows PAE and Address Windowing Extensions
Although recent Service Packs for Enterprise and Datacenter editions of Windows Server 2003 and 2008 now
have PAE enabled by default, older versions of these operating systems may not. For these older operating
systems, the user must add the /PAE switch to the corresponding entry in the BOOT.INI file to access memory
above the 4 GB boundary.

PAE is supported only on 32-bit versions of the Windows operating systems.


64-bit versions of Windows do not support, nor need, PAE, as sufficient address space is
already accommodated.
The following 32-bit Windows versions support PAE, with the given amount of physical RAM
indicated:

 Windows 2000 Advanced Server (8 GB maximum)


 Windows 2000 Datacenter Server (32 GB maximum)
 Windows XP (all versions) (4 GB maximum)
 Windows Server 2003, Standard Edition (4 GB maximum)
 Windows Server 2003, Enterprise Edition (32 GB maximum)
 Windows Server 2003, Enterprise Edition R2 or SP1 (64 GB maximum)
 Windows Server 2003, Datacenter Edition (64 GB maximum)
 Windows Server 2003, Datacenter Edition R2 or SP1 (128 GB maximum)
 Windows Server 2008, Standard and Web Editions (4 GB maximum)
 Windows Server 2008, Enterprise Edition (64 GB maximum)
 Windows Server 2008, Datacenter Edition (64 GB maximum)

64-bit memory addressing

To break through the 4 GB limitation of 32-bit addressing, CPU and operating system vendors
extended the x86 specification to 64 bits. Known by many names, this technology is most
generally referred to as x86-64 or x64; Intel refers to it as EM64T, and AMD uses the
name AMD64. Fundamentally, this technology enables significantly increased memory
addressability for both operating systems and applications.
The table below illustrates the differences between 32-bit and 64-bit operating systems.

Table - Virtual memory limits


a. 3 GB for the application and 1 GB for the kernel if the system is booted with the /3GB switch.
b. 4 GB if the 32-bit application has the LARGEADDRESSAWARE flag set (LAA).

The width of a memory address dictates how much memory the processor can address. As
shown in the table below, a 32-bit processor can address up to 2^32 bytes, or 4 GB. A 64-bit
processor can theoretically address up to 2^64 bytes, or 16 exabytes (16,777,216 terabytes).

Table - Relation between address space and number of address bits
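On a Linux host, one quick and hedged way to confirm whether the running kernel and userland are 32-bit or 64-bit is with standard commands such as:

$ getconf LONG_BIT   # prints 32 or 64 for the running userland
$ uname -m           # machine architecture, e.g. i686 or x86_64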



Current implementation limits are driven by memory technology and economics. As a result,
the physical addressing limits of processors are typically implemented using fewer than the full 64
potential address bits, as shown in the table below.

Table - Memory supported by processors

These values are the limits imposed by the processors themselves. They represent the
maximum theoretical memory space of a system using these processors.

[Tip: Both Intel 64 and AMD64 server architectures can utilize either the 32-bit (x86) or 64-bit
(x64) versions of their respective operating systems. However, the 64-bit architecture
extensions will be ignored if 32-bit operating system versions are employed. For systems using
x64 operating systems, it is important to note that the drivers must also be 64-bit capable.]
UNIT-2

Hardware Performance Tuning
General Hardware Recommendations
Simultaneous Multithreading
• Simultaneous multithreading (SMT), commonly known as Hyper-Threading in Intel CPUs, is a technology that enables the operating system to see a single core/CPU as two cores/CPUs.
• This feature can usually be enabled or disabled in the compute node's BIOS.
• It's important to understand that SMT will not make individual jobs run faster.
• Rather, it allows two jobs to run simultaneously where only one job would have run before.
• Thus, in some cases, SMT can increase the number of jobs completed within the same time span compared with having it turned off.
• CERN has seen a throughput increase of 20% with SMT enabled.
The following guidelines are known for specific use cases:
– Enable it for general-purpose workloads
– Disable it for virtual router applications
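As a small sketch for checking the current state on a Linux compute node (assuming a kernel recent enough, roughly 4.19 or later, to expose the SMT control interface):

$ cat /sys/devices/system/cpu/smt/active   # 1 = SMT logical CPUs in use, 0 = not
$ lscpu | grep -i 'thread'                 # "Thread(s) per core: 2" typically indicates SMT/Hyper-Threading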
Notable CPU Flags
The following CPU flags have special relevance to virtualization. On Linux-based systems, you can see which flags are enabled and functional by running:

• $ cat /proc/cpuinfo

• vmx (Intel) and svm (AMD): hardware virtualization support
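For a quick check (a minimal sketch; output varies by distribution), you can count these flags directly:

$ grep -cE 'vmx|svm' /proc/cpuinfo    # a non-zero count means the virtualization flag is present
$ lscpu | grep -i virtualization      # reports VT-x or AMD-V on most recent distributions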
• NUMA (non-uniform memory access) is a method of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, improving performance and the ability of the system to be expanded.
• NUMA is used in symmetric multiprocessing (SMP) systems. An SMP system is a "tightly coupled," "share everything" system in which multiple processors working under a single operating system access each other's memory over a common bus or "interconnect" path.
• Ordinarily, a limitation of SMP is that as microprocessors are added, the shared bus or data path gets overloaded and becomes a performance bottleneck.
• NUMA adds an intermediate level of memory shared among a few microprocessors so that all data accesses don't have to travel on the main bus.
General Hardware Recommendations
• It's recommended to ensure that each NUMA
node has the same amount of memory.
• If you plan to upgrade the amount of memory
in a compute node, ensure the amount is
balanced on each node.
• For example, do not upgrade one node by
16GB and another by 8GB.
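As a quick, hedged check on a Linux compute node (assuming the numactl package is installed), you can list each NUMA node with its CPUs and memory:

$ numactl --hardware    # shows every NUMA node with its CPUs and total/free memory, so imbalances are easy to spot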
Memory Speeds

• Memory speeds are known to vary by chip.

• If possible, ensure all memory in a system is the same brand and type.
Upgrade your system
• I guess this goes without saying but every
performance article I write will always include
this point front-and-center.
• Each piece of hardware has its own maximum
speed. Where that speed barrier lies in
comparison to other hardware in the same
category almost always correlates directly
with cost.
• You cannot tweak a go-cart to outrun a Corvette without spending at least as much money as just buying a Corvette, and that's without considering the time element.
• If you bought slow hardware, then you will have a slow Hyper-V environment.
• Fortunately, this point has a corollary: don't panic. Production systems, especially server-class systems, almost never experience demand levels that compare to the stress tests that admins put on new equipment.
• If typical load levels were that high, it's doubtful that virtualization would have caught on so quickly.
• We use virtualization for so many reasons nowadays that we forget "cost savings through better utilization of under-loaded server equipment" was one of the primary drivers of early virtualization adoption.
BIOS Settings for Hyper-V Performance
• Don't neglect your BIOS! It contains some of the most important settings for Hyper-V.
• C States. Disable C States! Few things impact Hyper-V performance quite as strongly as C States. Names and locations will vary, so look in areas related to Processor/CPU, Performance, and Power Management. If you can't find anything that specifically says C States, then look for settings that disable or minimize power management.
• C1E is usually the worst offender for Live Migration problems, although other modes can cause issues.
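C states themselves live in the firmware, but a related knob is the host's Windows power plan; as a hedged sketch (the plan aliases are standard, though exact names can vary by build), you can switch to the High Performance plan from an elevated prompt:

PS> powercfg /list                   # list the available power schemes
PS> powercfg /setactive SCHEME_MIN   # SCHEME_MIN is the built-in alias for the High Performance plan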
• Virtualization support: A number of features have popped up through the years, but most BIOS manufacturers have since consolidated them all into a global "Virtualization Support" switch, or something similar.
• I don't believe that current versions of Hyper-V will even run if these settings aren't enabled.
• Here are some individual component names, for those special BIOSs that break them out:
Virtual Machine Extensions (VMX)
• AMD-V (AMD CPUs/mainboards). Be aware that Hyper-V can't (yet?) run nested virtual machines on AMD chips.
• VT-x, or sometimes just VT (Intel CPUs/mainboards). Required for nested virtualization with Hyper-V in Windows 10/Server 2016.
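Once the firmware switch is on, nested virtualization is exposed per VM on Intel hosts; as an illustrative sketch (the VM name is a placeholder, and the VM must be powered off first):

PS> Set-VMProcessor -VMName "TestVM" -ExposeVirtualizationExtensions $true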
• Data Execution Prevention: DEP means less for performance and more for security. It's also a requirement.
• But we're talking about your BIOS settings and you're in your BIOS, so we'll talk about it.
• Just make sure that it's on. If you don't see it under the DEP name, look for:
– No Execute (NX): AMD CPUs/mainboards
– Execute Disable (XD): Intel CPUs/mainboards
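As a hedged aside (not from the original material), you can confirm from within Windows that the hardware NX/XD support is usable and how the DEP policy is set:

PS> wmic OS Get DataExecutionPrevention_Available   # TRUE when the NX/XD bit is available to Windows
PS> bcdedit /enum {current}                         # the 'nx' line shows the DEP policy (OptIn, OptOut, AlwaysOn, AlwaysOff)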
Storage Settings for Hyper-V Performance
Learn to live with the fact that storage is slow.
• Remember that speed tests do not reflect real-world load, and that a file copy does not test anything except permissions.
• Learn to live with Hyper-V's I/O scheduler. If you want a computer system to have 100% access to storage bandwidth, start by checking your assumptions.
• Just because a single file copy doesn't go as fast as you think it should does not mean that the system won't perform its production role adequately.
• If you're certain that a system must have total and complete storage speed, then do not virtualize it.
• The only way that a VM can get that level of speed is by stealing I/O from other guests.
Enable read caches
– Your SAN/NAS will likely have a read cache
– Your internal storage will likely have a read cache
– Cluster Shared Volumes have a read cache (sizing it is sketched just after this list)
• Carefully consider the potential risks of write
caching. If acceptable, enable write caches. If
your internal disks, DAS, SAN, or NAS has a
battery backup system that can guarantee clean
cache flushes on a power outage, write caching is
generally safe.
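As a hedged example for clustered hosts on Windows Server 2012 R2 or later (the 512 MB figure is arbitrary, not a recommendation), the CSV read cache is sized with PowerShell:

PS> (Get-Cluster).BlockCacheSize         # current CSV block cache size in MB
PS> (Get-Cluster).BlockCacheSize = 512   # reserve 512 MB of host RAM for the CSV read cache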
• Internal batteries that report their status
and/or automatically disable caching are best.
• UPS-backed systems are sometimes OK, but
they are not foolproof.
• Prefer few arrays with many disks over many
arrays with few disks.
• Unless you're going to store VMs on a remote system, do not create an array just for Hyper-V.
• By that, I mean that if you've got six internal bays, do not create a RAID-1 for Hyper-V and a RAID-x for the virtual machines.
• That's a Microsoft SQL Server 2000 design. This is 2017 and you're building a Hyper-V server. Use all the bays in one big array.
• Do not architect your storage to make the hypervisor/management operating system go fast. I can't believe how many times I read on forums that Hyper-V needs lots of disk speed.
• After boot-up, it needs almost nothing. The hypervisor remains resident in memory.
• Unless you're doing something questionable in the management OS, it won't even page to disk very often.
• Architect storage speed in favor of your virtual machines.
• Set your Fibre Channel SANs to use very tight WWN masks.
• Live Migration requires a hand-off from one system to another, and the looser the mask, the longer that takes.
• With 2016 the guests shouldn't crash, but the hand-off might be noticeable.
• Keep iSCSI/SMB networks clear of other traffic. I see a lot of recommendations to put each and every iSCSI NIC on a system into its own VLAN and/or layer-3 network.
• I'm on the fence about that; a network storm in one iSCSI network would probably justify it.
• However, keeping those networks quiet would go a long way on its own.
• For clustered systems, multi-channel SMB needs each adapter to be on a unique layer-3 network (according to the docs; from what I can tell, it works even with same-net configurations).
• If using gigabit, try to physically separate iSCSI/SMB from your virtual switch.
• Meaning, don't make that traffic endure the overhead of virtual switch processing, if you can help it.
• Round-robin MPIO might not be the best policy, although it's the most recommended.
• If you have one of the aforementioned network storms, round robin will negate some of the benefits of VLAN/layer-3 segregation.
• I like least queue depth, myself.
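As a sketch (assuming the Windows MPIO feature and its PowerShell module are installed), the default load-balance policy can be switched to Least Queue Depth:

PS> Get-MSDSMGlobalDefaultLoadBalancePolicy               # show the current default policy
PS> Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy LQD   # LQD = Least Queue Depth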
• MPIO and SMB multi-channel are much faster and more efficient than the best teaming.
• If you must run MPIO or SMB traffic across a team, create multiple virtual or logical NICs.
• It will give the teaming implementation more opportunities to create balanced streams.
• Use jumbo frames for iSCSI/SMB connections if everything supports it (host adapters, switches, and back-end storage).
• You'll improve the header-to-payload bit ratio by a meaningful amount.
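As an illustrative sketch (the adapter name is a placeholder and the exact value depends on the driver; 9014 is common), jumbo frames can be enabled on an iSCSI/SMB adapter and then verified end to end:

PS> Set-NetAdapterAdvancedProperty -Name "iSCSI-A" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
PS> ping -f -l 8972 192.168.50.10   # don't-fragment ping near a 9000-byte MTU; failures mean something in the path is not configured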
• Enable RSS on SMB-carrying adapters. If you have RDMA-capable adapters, absolutely enable that.
• Use dynamically expanding VHDX, but not dynamically expanding VHD.
• I still see people recommending fixed VHDX for operating system VHDXs, which is just absurd. Fixed VHDX is good for high-volume databases, but mostly because they'll probably expand to use all the space anyway.
• Dynamic VHDX enjoys higher average write
speeds because it completely ignores zero
writes.
• No defined pattern has yet emerged that
declares a winner on read rates, but people
who say that fixed always wins are making
demonstrably false assumptions.
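For illustration (the path and size are placeholders), a dynamically expanding VHDX is created with:

PS> New-VHD -Path "D:\VMs\app01.vhdx" -SizeBytes 100GB -Dynamic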
• Do not use pass-through disks. The performance is sometimes a little bit better, but sometimes it's worse, and it almost always causes some other problem elsewhere.
• The trade-off is not worth it. Just add one spindle to your array to make up for any perceived speed deficiencies.
• If you insist on using pass-through for performance reasons, then I want to see the performance traces of production traffic that prove it.
• Don't let fragmentation keep you up at night. Fragmentation is a problem for single-spindle desktops/laptops, "admins" that never should have been promoted above first-line help desk, and salespeople selling defragmentation software.
• If you're here to disagree, you had better have a URL to performance traces that I can independently verify before you even bother entering a comment.
• I have plenty of Hyper-V systems of my own on storage ranging from 3 spindles up to more than 100 spindles, and the first time I even feel compelled to run a defrag (much less get anything out of it) I'll be happy to issue a mea culpa.
• For those keeping track, we're at 6 years and counting.
Memory Settings for Hyper-V Performance
• There isn't much that you can do for memory.
• Buy what you can afford and, for the most part, don't worry about it.
• Buy and install your memory chips optimally. Multi-channel memory is somewhat faster than single-channel.
• Your hardware manufacturer will be able to help you with that.
• Don't over-allocate memory to guests. Just because your file server had 16 GB before you virtualized it does not mean that it has any use for 16 GB.
• Use Dynamic Memory unless you have a system that expressly forbids it. It's better to stretch your memory dollar farther than wring your hands about whether or not Dynamic Memory is a good thing.
• Until directly proven otherwise for a given server, it's a good thing.
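As a hedged example (the VM name and sizes are placeholders, not recommendations, and the VM must be off to change this setting), Dynamic Memory is configured per VM with:

PS> Set-VMMemory -VMName "FileServer01" -DynamicMemoryEnabled $true -MinimumBytes 1GB -StartupBytes 2GB -MaximumBytes 8GB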
• Don't worry so much about NUMA. I've read volumes and volumes on it.
• I even spent a lot of time configuring it on a high-load system. Wrote some about it.
• Never got any of that time back.
• I've had some interesting conversations with people that really did need to tune NUMA.
• They constitute… oh, I'd say about 0.1% of all the conversations that I've ever had about Hyper-V.
• The rest of you should leave NUMA enabled at defaults and walk away.
Network Settings for Hyper-V Performance
• Networking configuration can make a real difference to Hyper-V performance.
• Learn to live with the fact that gigabit networking is "slow" and that 10GbE networking often has barriers to reaching 10 Gbps in a single test.
• Most networking demands don't even bog down gigabit.
• It's just not that big of a deal for most people.
• Learn to live with the fact that (a) your four-spindle disk array can't fill up even one 10GbE pipe, much less the pair that you assigned to iSCSI, and that (b) it's not Hyper-V's fault.
• I know this doesn't apply to everyone, but wow, do I see lots of complaints about how Hyper-V can't magically pull or push bits across a network faster than a disk subsystem can read and/or write them.
• Disable VMQ on gigabit adapters. I think some
manufacturers are finally coming around to
the fact that they have a problem.
• Too late, though. The purpose of VMQ is to
redistribute inbound network processing for
individual virtual NICs away from CPU 0, core
0 to the other cores in the system.
• Current-model CPUs are fast enough that they
can handle many gigabit adapters.
• If you are using a Hyper-V virtual switch on a network team and you've disabled VMQ on the physical NICs, disable it on the team adapter as well.
• I've been saying that since shortly after 2012 came out and people are finally discovering that I'm right, so, yay? Anyway, do it.
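As a sketch (adapter and team names are placeholders), VMQ can be inspected and then disabled on the gigabit NICs and on the team interface:

PS> Get-NetAdapterVmq                              # shows which adapters currently have VMQ enabled
PS> Disable-NetAdapterVmq -Name "GbE-1","GbE-2"    # the physical gigabit NICs
PS> Disable-NetAdapterVmq -Name "Team1"            # the team (multiplex) adapter, if it exposes VMQ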
• Don't worry so much about vRSS. RSS is like VMQ, only for non-VM traffic. vRSS, then, is the projection of VMQ down into the virtual machine.
• Basically, with traditional VMQ, the VMs' inbound traffic is separated across pNICs in the management OS, but then each guest still processes its own data on vCPU 0. vRSS splits traffic processing across vCPUs inside the guest once it gets there.
• The "drawback" is that distributing processing and then redistributing processing causes more processing.
• So, the load is nicely distributed, but it's also higher than it would otherwise be. The upshot: almost no one will care.
• Set it or don't set it; it's probably not going to impact you a lot either way.
• If you're new to all of this, then you'll find an "RSS" setting on the network adapter inside the guest.
• If that's on in the guest (off by default) and VMQ is on and functioning in the host, then you have vRSS. Woo hoo.
• Don't blame Hyper-V for your networking ills. I mention this in the context of performance because your time has value.
• I'm constantly called upon to troubleshoot Hyper-V "networking problems" because someone is sharing MACs or IPs or trying to get traffic from the dark side of the moon over a Cat-3 cable with three broken strands.
• Hyper-V is also almost always blamed by people that just don't have a functional understanding of TCP/IP. More wasted time that I'll never get back.
• Use one virtual switch. Multiple virtual
switches cause processing overhead without
providing returns. This is a guideline, not a
rule, but you need to be prepared to provide
an unflinching, sure-footed defense for every
virtual switch in a host after the first.
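For illustration (the switch and team adapter names are placeholders), a single external switch bound to the host's team, shared with the management OS, is created with:

PS> New-VMSwitch -Name "vSwitch-External" -NetAdapterName "Team1" -AllowManagementOS $true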
• Don't mix gigabit with 10 gigabit in a team.
• Teaming will not automatically select 10GbE over the gigabit.
• 10GbE is so much faster than gigabit that it's best to just kill gigabit and converge on the 10GbE.

• 10x gigabit cards do not equal 1x 10GbE card.
• I'm all for only using 10GbE when you can justify it with usage statistics, but gigabit just cannot compete.
