
INFRASTRUCTURE ARCHITECTURE

ESSENTIALS FOR
DATA CENTER AND CLOUD
Shankar Kambhampaty

Infrastructure Architecture Essentials for Data Center and Cloud

First Release

Copyright © 2022 Shankar Kambhampaty

All rights reserved. This release of the book, or any part of it, may not be duplicated, stored in an information system for retrieval purposes,
or communicated in any form or by any means, electronic, mechanical, photocopying, recording, or scanning without the author’s written
permission.

Limits of Liability: While the author has used his best efforts in preparing this book, the author makes no warranties or representations
concerning the accuracy or completeness of the book’s contents and disclaims any express or implied warranties of merchantability or
fitness for a particular purpose. There are no warranties that extend beyond the descriptions contained in this paragraph. No warranty
may be created or extended by written sales materials or sales representatives. The completeness and accuracy of the content provided
in this book and the opinions stated are not warranted or guaranteed to produce any results. The advice and strategies contained herein
are not suitable for every individual. The author shall not be held responsible or liable for any commercial damages or any type of loss,
including but not limited to loss of profit or special, incidental, consequential, or other damages.

Disclaimer: This disclaimer is to inform readers of this book that the views, thoughts, and opinions stated in this book belong solely to the
author and not necessarily to the author’s employer, institution, or other groups or individuals. The book is intended for educational purposes
only and does not replace independent professional judgment. All the references marked within square brackets “[]” in different chapters are
meant to refer to quoted material, additional material, or illustrative examples. The Intellectual Property (IP) and all rights for the referenced
content are with the respective organizations/institutions/owners. No portion of that content can be reproduced without written consent from
the respective organizations/institutions/owners. The contents of this release of the book have been checked for accuracy. The author cannot
guarantee full agreement since deviations cannot be precluded entirely. As the book is intended for educational purposes, the author shall not
be responsible for any omissions, errors, or damages arising out of the use of the information contained in the book.

Trademarks: All company names, brand names, and product names used in this book are trademarks, registered trademarks, or trade names
of the respective holders. The author is not associated with any product or vendor mentioned in this book.

Cover image and icons in figures in this book are licensed from Adobe.

First Release: January 5, 2022


To my mother

Preface

After the launch of the third edition of my book on SOA and MSA in the summer of 2018, I received several
messages from students and professionals on the need for articles with insights on infrastructure
architecture. There are not many books on this subject, and most of the available content is provided
by technology vendors. Over the past year, I began putting together a series of articles on various aspects
of infrastructure architecture to share through my blog or an online magazine. I had prepared draft
versions of the articles when I got a suggestion to pull their content together into a book. That is how
this book became a reality!

The core concepts for infrastructure architecture have their origins in data centers to provide a scalable,
secure, highly available, and performant environment for applications. Over the past decade, with the
cloud gaining maturity and adoption, these concepts have also been extended to the cloud. Therefore,
I have tried to balance both the data center and cloud infrastructure architecture in this book.

Many new entrants to the IT industry have directly begun working on cloud platforms without a
background in data center solutions. After all, the cloud, by design, abstracts and wraps infrastructure so
that there is no need to know what runs where and how everything is wired together! Just call an API and
get the infrastructure capability you need. While that is a big plus, getting a good understanding of how
infrastructure architecture is done in a data center goes a long way in building the conceptual clarity
needed in the long run for architecting sophisticated solutions in both data center and cloud. This book
attempts to provide conceptual clarity on infrastructure architecture for solutions deployed in a data
center and the cloud to all students and computing professionals.

Like any other field, the infrastructure technology space is vast, with many possible architecture scenarios.
It is always a challenge to decide how much detail to cover in a book of this kind. Moreover,
technologies change frequently and come from several vendors with varying capabilities. The
architecture concepts, however, remain largely the same. I therefore took the approach of focusing on the
essentials and provided, for every major topic, many references to industry and academic literature that
a reader can use to dive deeper.

Who Should Read This Book

A book such as this one typically caters to the needs of several types of readers.
1. Students, software developers: Read this book to get an overview of key aspects of infrastructure
architecture.

2. Architects: Study each chapter of the book carefully and relate to your experience.

3. Project Managers: Focus only on those sections that apply to your projects and revisit them
whenever needed.

4. CIOs and CTOs: Review the content to address any gaps you may have in your understanding and
provide your feedback on what I should further elaborate on in the subsequent releases of this book.

I want to stay connected with you, provide additional information from time to time, and answer any
questions you may have. Please post any questions or suggestions you may have at my blog –
https://archtecht.com.

I trust readers will benefit from the infrastructure architecture concepts discussed in this book and
welcome any constructive comments or suggestions on the content or format of this book.

Shankar Kambhampaty
ShankarKambhampaty@gmail.com

Acknowledgments

This book is dedicated to my mother, who encouraged me to share my knowledge. I am indebted to my
father, Shri. K. V. J. R. Krupanidhi, for all his guidance through every step of the journey. I am grateful for
the continuous encouragement I received from my sisters, Sharada and Rama, as I went through putting
together the book. I am also obliged to Prasad, Rama, Prasanna, Ravi, and Ritu for their continuous support.

I want to take this opportunity to thank the Almighty and the “Line of Gurus”. I would like to express my
deepest gratitude to Dr. M. Narasimharao and Shri. K. M. Sastry, who provided remarkable insights.

This book would not have been possible without the unstinted support of my wife, Mallika, our daughter,
Sasirekha, and our son, Harish Rohan. They have always been with me, despite many sacrificed weekends
and evenings.

I must thank my senior colleagues at DXC, Erik Wahab, Vinod Bagal, James Brady, Nachiket Sukhtankar,
and Lokendra Kumar Sethi, who have always supported me in all my publications.

I have been fortunate to have worked with Ram Mynampati, AS Murthy, Ganesan Sekhar, Dr. TV Prabhakar,
Arun Jain, Kedarnath Udiyavar, Sree Arimanitahya, Dan Hushon, and other remarkable individuals at
Satyam, Polaris, and CSC/DXC. They have valued knowledge and sharing of it through publications. I am
grateful to them for their constant encouragement.

My special thanks to Rahul Shah and Vishal Thapa for all the figures in the book, the cover design, and the
formatting of the entire text to make the book look attractive. It would not have been possible for the book
to come together without their support. They spent their personal time helping me.

Mahendran Raju, Ajit Deshpande, Vijay Nanduri, MAS Naveed, Kamalakkanan Jayaraman, Swamiraj
Govindan, Altaf Anees Sarker, and Sasirekha Kambhampaty helped me with reviews of the different
chapters in the book. They gave many useful suggestions, and I truly appreciate their inputs.

Madhav Negi and Altaf Anees Sarkar suggested structuring the articles I had written in the form of a book,
and many thanks to them for that.

I deeply value the discussions I had with my team on infrastructure architecture. Krishna Dhavala, Dinesh
Batla, Darpan Verma, Yashpal Singh, Raj Sekhar Mishra, Bajrang Gupta, Nikhil Naik, Venkat Narkulla, Pintu
Maity, Suresh Yaram, Venkat Godavarthy, Radhakrishna Arugula, Rajeev Mandapati, Ramakrishna Jasti, Sri
Charan Surapaneni, SriRam Anjanadri, Ismail Shaik Mohammed, Imran Shaik, Siddharam Gour, Chandra
Sekhar, and Kamlesh Singh - Thank you all.

Shankar Kambhampaty

Contents

Data Center......................................................................................................1
1.1 IT, ITSM, and ITIL......................................................................................................................... 1
1.2 Data Center ................................................................................................................................. 3
1.2.1 Infrastructure capabilities.................................................................................................................................3
1.2.2 Power and Cooling ...........................................................................................................................................5
1.2.3 Cabling ..............................................................................................................................................................5
1.2.4 Security..............................................................................................................................................................5
1.2.5 Automation........................................................................................................................................................5
1.2.6 Monitoring.........................................................................................................................................................5
1.2.7 Data Center Tiers..............................................................................................................................................6
1.2.8 Active and Passive Data Centers.....................................................................................................................6
References..............................................................................................................................................6

Cloud................................................................................................................7
2.1 Private Cloud .............................................................................................................................. 7
2.2 Public Cloud................................................................................................................................ 8
2.3 Hybrid Cloud................................................................................................................................ 8
2.4 Cloud Adoption Framework.......................................................................................................... 9
2.5 Migration Strategies to Cloud.................................................................................................... 11
2.6 Well-Architected Framework...................................................................................................... 12
2.7 Landing Zones........................................................................................................................... 12
2.8 Agile approach for cloud deployments ...................................................................................... 13
References............................................................................................................................................15

Architecture Documents for Infrastructure Solutions...................................... 17


3.1 Conceptual Technology Architecture (CTA)................................................................................ 17
3.2 Logical Technology Architecture (LTA)....................................................................................... 19
3.3 Physical Technology Architecture (PTA).................................................................................... 21
3.4 Architecture documents for Cloud solutions.............................................................................. 23
References............................................................................................................................................23

Architecting Process for Infrastructure Solutions........................................... 25


4.1 Develop conceptual architecture for infrastructure solutions..................................................... 25
4.1.1 Determine technology drivers....................................................................................................................... 25
4.1.2 Identify infrastructure capabilities................................................................................................................ 27
4.1.3 Develop CTA .................................................................................................................................................. 28
4.2 Develop Logical Technology and Physical Technology Architecture........................................... 30
References............................................................................................................................................32

Compute........................................................................................................33
5.1 Mainframe running z/OS or Linux.............................................................................................. 34
5.1.1 Mainframe in Hosted Facility and Cloud...................................................................................................... 35
5.2 Mid-range running AIX or IBM i.................................................................................................. 36
5.2.1 Mid-range on Cloud........................................................................................................................................ 37
5.3 x86 servers running Linux or Windows....................................................................................... 37
5.3.1 Virtualization ................................................................................................................................................. 37
5.3.2 Hypervisors..................................................................................................................................................... 37
5.3.3 Servers for Virtualization............................................................................................................................... 38
5.3.4 Processor, Memory, and Benchmarks.......................................................................................................... 38
5.3.5 Virtual Server Options for Cloud................................................................................................................... 40
5.4 Compute Characteristics........................................................................................................... 41
References............................................................................................................................................41

Network.........................................................................................................43
6.1 Network Basics.......................................................................................................................... 44
6.1.1 OSI Model....................................................................................................................................................... 44
6.1.2 LAN and WAN................................................................................................................................................. 45
6.1.3 Virtual LAN (VLAN) ....................................................................................................................................... 45
6.1.4 Basic Network Diagram................................................................................................................................. 51
6.1.5 Address Translation ...................................................................................................................................... 52
6.1.6 Proxy............................................................................................................................................................... 54
6.2 Network Architecture................................................................................................................. 56
6.2.1 Three-tier Architecture................................................................................................................................... 56
6.2.2 Two-Tier Spine-Leaf architecture.................................................................................................................. 57
6.3 Network virtualization................................................................................................................ 58
6.4 Network services in the cloud.................................................................................................... 59
References............................................................................................................................................59

Storage..........................................................................................................61
7.1 Block Storage............................................................................................................................ 62
7.1.1 Block Storage Options on-premises............................................................................................................. 64
7.1.2 Block Storage Options on Cloud................................................................................................................... 64
7.2 File Storage............................................................................................................................... 65
7.2.1 File storage options for data center.............................................................................................................. 65
7.2.2 File storage options for cloud....................................................................................................................... 66
7.3 Object Storage........................................................................................................................... 66
7.3.1 Object storage options for data center......................................................................................................... 66
7.3.2 Object storage options for cloud................................................................................................................... 67
7.4 Storage Media........................................................................................................................... 67
7.5 Storage Tiers............................................................................................................................. 68
References............................................................................................................................................68

Backup and Restore........................................................................................ 69
8.1 Backup/Restore criteria............................................................................................................. 70
8.2 Solution patterns for backup and restore................................................................................... 71
8.3 Back-end Network..................................................................................................................... 73
8.4 Types of Backup........................................................................................................................ 73
8.5 Operational Recovery................................................................................................................. 74
8.6 Backup/Restore solutions.......................................................................................... 75
8.6.1 Backup/Restore storage options for data center......................................................................................... 75
8.6.2 Backup/Restore storage options for cloud.................................................................................. 76
References............................................................................................................................................76

Disaster Recovery..........................................................................................77
9.1 Disaster Recovery Characteristics............................................................................................. 78
9.2 Disaster Recovery Process........................................................................................................ 79
9.2.1 Preparation..................................................................................................................................................... 79
9.2.2 Execution........................................................................................................................................................ 80
9.3 Replication for DR ..................................................................................................................... 81
References............................................................................................................................................84

Monitoring......................................................................................................85
10.1 IT Infrastructure Monitoring..................................................................................................... 86
Hardware Monitoring.............................................................................................................................................. 86
Illustration ............................................................................................................................................................... 86
Server Monitoring.................................................................................................................................................... 87
Storage Monitoring.................................................................................................................................................. 88
Network Monitoring................................................................................................................................................. 88
10.2 Application Monitoring............................................................................................................ 89
10.3 Event Monitoring and Correlation............................................................................................. 90
10.4 IT Operations Analytics (ITOA)................................................................................................. 92
10.5 Artificial Intelligence Operations (AIOps)................................................................................. 92
References............................................................................................................................................94

Security..........................................................................................................95
11.1 Access Security....................................................................................................................... 96
11.2 Connectivity Security............................................................................................................... 97
11.3 Data Security........................................................................................................................... 99
11.4 Application Security............................................................................................................... 100
11.5 Cyber Security....................................................................................................................... 104
References..........................................................................................................................................106

Index............................................................................................................................................. 107

Chapter 1

Data Center

Organizations build applications to fulfill their business processes and deploy them on on-premises
data centers or on the public cloud. Regardless of where they are deployed, there are some essential
infrastructure components that most applications need – servers, storage, and network. All these
components require infrastructure solutioning to work together in a data center to meet the demands
of the applications. There are, of course, a few features offered by the public cloud that do not require this
kind of solutioning, such as serverless compute (e.g., AWS Lambda or Azure Functions). In this chapter,
the basic data center concepts will be outlined.

1.1 IT, ITSM, and ITIL

ITSM/ITIL – Key Processes
• Service Strategy
• Service Design
• Service Transition
• Service Operation
• Continual Service Improvement

In this section, the relevant terms in the context of this book are defined before describing their role in
a data center.

“Information technology (IT) is the use of any computers, storage, networking and other physical devices,
infrastructure and processes to create, process, store, secure and exchange all forms of electronic
data”[1].

A wide range of IT components and applications are set up, configured, and efficiently managed at
run-time for an enterprise to conduct its business smoothly. The extent of the ability to build, run, and
maintain IT infrastructure and applications represents the IT capability of the enterprise. It is typically
the responsibility of an in-house IT department to build this capability or to outsource it, in part or in whole,
to third-party organization(s). The IT department of the enterprise provides the IT capability as a set
of end-to-end services and forms the service provider to the enterprise’s business.

Information Technology Service Management (ITSM) is how the IT department of the enterprise
(including the third-party organizations) manages the delivery of IT services to consumers. It is defined
by five processes – service strategy, service design, service transition, service operation, and continual
service improvement[2]. Figure 1.1 depicts the key processes of ITSM pictorially[3].

Information Technology Infrastructure Library (ITIL) is a framework with a collection of best practices
for each major phase of the ITSM to improve and optimize resources and make them efficient. While ITIL
is a popular framework for service management, there are others. One example is Control Objectives for
Information and Related Technology (COBIT) developed by the Information Systems Audit and Control
Association (ISACA) for IT management and governance.

In an ITSM/ITIL organization, service strategy and service design processes result in services meant
to meet consumer requirements and priorities. These services are made available through a service
catalog. Consumers order the services from the service catalog, which triggers workflows for approvals



and order fulfillment in an ITSM tool for service management (e.g., ServiceNow, EasyVista). When
introducing or retiring services, the related change management and release management processes
ensure that risk and impact are controlled and there are no interruptions to other services.
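As a hypothetical sketch of the catalog-ordering flow described above (the class and field names are my own illustration, not taken from any ITSM product such as ServiceNow or EasyVista), ordering a service and gating its fulfillment on workflow approvals might be modeled as:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogItem:
    """A service offered in the service catalog."""
    name: str             # e.g., "Provision Linux VM" (illustrative)
    approvers: list       # roles whose approval the workflow requires

@dataclass
class ServiceOrder:
    """A consumer's order, which triggers an approval workflow."""
    item: CatalogItem
    requester: str
    approvals: set = field(default_factory=set)

    def approve(self, role: str) -> None:
        # Record an approval only from a role the workflow expects
        if role in self.item.approvers:
            self.approvals.add(role)

    @property
    def fulfillable(self) -> bool:
        # Fulfillment starts only once every required approval is granted
        return self.approvals == set(self.item.approvers)

vm = CatalogItem("Provision Linux VM", approvers=["manager", "capacity"])
order = ServiceOrder(vm, requester="alice")
order.approve("manager")
order.approve("capacity")
print(order.fulfillable)  # True
```

In a real ITSM tool, the approval chain and fulfillment steps would of course be configured workflows rather than code, but the gating logic is the same.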

[Figure: four ITSM process areas surround a central ITSM tool (service management) and exchange
workflows, rules, alerts, and notifications with it. Service Strategy & Service Design covers Service
Catalog Management, Service Level Management, Service Portfolio, and Service Continuity. Service
Transition covers Change Management and Release Management. Service Operation covers Event
Management, Incident Management, Problem Management, and Knowledge Management. Continual
Service Improvement covers Operational Metrics, Analytical Reports, Operational Dashboards (with
Trends), and Automation.]

Figure 1.1: ITSM Processes

During service operation to deliver the service, events occur. An event is referred to as an incident when
it is unplanned, negatively affects the service, and needs a response to restore the service.
Incidents have underlying causes that represent problems. Problems need to be addressed to mitigate or
prevent incidents from occurring. The data relating to problems, incidents, and other service management
processes are structured and managed through a knowledge management process. Therefore, the key
processes for service operation are event management, incident management, problem management,
and knowledge management[3].
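The event-to-incident distinction above can be sketched in code. This is a minimal illustration only; the class names, fields, and the problem identifier are assumptions of mine, not part of any ITIL specification or tool:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    source: str              # the component that raised the event
    planned: bool            # planned maintenance vs. unplanned occurrence
    degrades_service: bool   # does it negatively affect the service?

@dataclass
class Incident:
    event: Event
    problem_id: Optional[str] = None  # root cause, filled in by problem management

def classify(event: Event) -> Optional[Incident]:
    # Per the text: an event becomes an incident only when it is
    # unplanned AND negatively affects the service.
    if not event.planned and event.degrades_service:
        return Incident(event)
    return None

maintenance = Event("db-server-01", planned=True, degrades_service=True)
outage = Event("db-server-01", planned=False, degrades_service=True)
assert classify(maintenance) is None   # planned work is not an incident
incident = classify(outage)
assert incident is not None
incident.problem_id = "PRB-1042"       # hypothetical id; problem management links the cause
```

Linking incidents to a shared problem record is what lets problem management fix the cause once and prevent recurring incidents.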

The data center is in a state of churn due to new infrastructure technologies and applications being
introduced on an ongoing basis to address changing business and IT demands. Also, some of the
current infrastructure and application deployments may have to be retired or refreshed. Therefore, there
is a need to continuously assess the state of operations and identify opportunities to plug inefficiencies
and implement solutions (e.g., automation) for service improvement. For this purpose, operational data
trends are studied through dashboards, metrics are captured, and analytics are run. All these efforts fall
under the continual service improvement ITSM process.

A configuration management database (CMDB) is set up to have an inventory of all IT components that
gets updated when additions/deletions are made through purchase, stock, license, asset, and contract
management functions in the organization.
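To make the CMDB idea concrete, here is a toy sketch of an inventory of configuration items (CIs) that is updated on additions and retirements. The structure is illustrative only; real CMDBs (e.g., in ServiceNow) use far richer CI schemas and relationship models:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ConfigurationItem:
    ci_id: str
    ci_type: str   # e.g., "server", "switch", "license" (illustrative types)
    owner: str

class CMDB:
    """Toy configuration management database: an inventory of CIs."""

    def __init__(self) -> None:
        self._items = {}

    def add(self, ci: ConfigurationItem) -> None:
        # invoked when purchase/stock/asset management records an addition
        self._items[ci.ci_id] = ci

    def retire(self, ci_id: str) -> None:
        # invoked when an asset is decommissioned or a license lapses
        self._items.pop(ci_id, None)

    def inventory(self, ci_type: Optional[str] = None) -> List[ConfigurationItem]:
        # full inventory, or filtered by CI type
        return [c for c in self._items.values()
                if ci_type is None or c.ci_type == ci_type]

cmdb = CMDB()
cmdb.add(ConfigurationItem("CI-001", "server", "IT-Ops"))
cmdb.add(ConfigurationItem("CI-002", "switch", "Network"))
cmdb.retire("CI-002")
print(len(cmdb.inventory()))  # 1
```

The value of a CMDB comes from keeping this inventory authoritative: every change flows through the add/retire paths so that incident, change, and asset processes all see the same picture.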



1.2 Data Center

Data center – What does it contain?
Key Components
• IT Systems
• Power, Cooling
• Cabling
• Security
• Automation
• Monitoring

A data center is where IT becomes real, and ITSM/ITIL processes come into action. Software and
infrastructure components that run the applications of an organization are deployed in its data center,
referred to as on-premises. Some large data centers rent out space to third parties for deploying
infrastructure components. Such data centers are called colocation data centers, as equipment from
different companies is “co-located” in the data center. Other data centers offer full-fledged infrastructure
capabilities with massive scalability as services that are accessed over the public internet or private
dedicated network links on a pay-per-use model. These are referred to as cloud data centers. In this
book, the words “data center” refer to an on-premises data center. The word “cloud” refers to cloud data centers.

Needless to say, a data center is critical to an enterprise. A data center outage caused by a disaster,
whether an earthquake, terrorism, or any other event, can ruin an enterprise. Hence, a secondary data
center is set up for business continuity in case the primary data center fails.

1.2.1 Infrastructure capabilities


The data center hosts the IT systems of an organization to support its business and IT needs. Most
organizations have business applications deployed on IT infrastructure that is a mix of platforms:
mainframe, mid-range, and Linux/Wintel. The IT infrastructure needs to be architected for performance,
scalability, availability, security, and many other non-functional requirements to meet the goals of the
business applications. For most enterprises, the following emerge as the core infrastructure capabilities
required to support the applications (see Chapter 3 for an explanation):

• Compute (Chapter 5)
• Network (Chapter 6)
• Storage (Chapter 7)
• Backup and Restore (Chapter 8)
• Disaster Recovery (Chapter 9)
• Monitoring (Chapter 10)
• Security (Chapter 11)

Infrastructure Architecture Definition


In simple terms, infrastructure architecture is a description of the structure and interaction of
infrastructure components.
According to The Open Group, for many people, infrastructure architecture is “the architecture of the
low-level hardware, networks, and system software (sometimes called “middleware”) that supports the
applications software and business systems of an enterprise”[13]. This definition is the basis for using
the term “infrastructure architecture” in this book.



Figure 1.2 depicts a high-level view of primary and secondary data centers with representative infrastructure
components. It also shows each infrastructure capability and the chapter in which the corresponding
infrastructure component is described. For instance, a firewall is a representative network component;
firewalls are shown in Figure 1.2 with “Network (Chapter 6)” next to them to indicate that firewalls and the
associated network infrastructure capability are described in Chapter 6. The secondary data center and its
components are used for disaster recovery, which is explained in detail in Chapter 9.

(Figure summary: client devices of customers connect over the internet, and those of employees over
MPLS, to the primary and secondary data centers. Each data center contains web servers, web application
firewalls, network firewalls (Chapter 6), monitoring (Chapter 10), security (Chapter 11), automated
provisioning, and compute in the form of x86 virtual and physical servers, mid-range servers (e.g., AIX,
IBM i), and mainframes (e.g., z/OS) (Chapter 5); object, file, block, and mainframe storage (Chapter 7);
and backup and restore (Chapter 8). DR replication (Chapter 9) connects the storage of the two data
centers.)

Figure 1.2: Data Centers with representative infrastructure components

Note: The term “infrastructure” has a much broader connotation in TOGAF in the context of enterprise architecture. However, the
scope of infrastructure architecture covered in this book is limited to the “Phase D – Technology architecture” of the Architecture
Development Method of TOGAF. Therefore, the words “infrastructure” and “technology” are used synonymously in the context of
capabilities, solutions, and architecture for data center and cloud.



1.2.2 Power and Cooling
One key aspect of the data center is the power and air-conditioning needed to run and cool all the
systems[4]. Data centers consume immense amounts of energy. It is said that data centers across the
world consume as much as 2% of electricity and contribute a similar percentage of carbon emissions
due to the use of electricity generated from non-renewable energy sources[5]. While computing systems
consume a large share of the energy, a significant share of the energy consumed by data centers goes
into cooling the systems. Cooling is essential because computing systems emit substantial heat as they
perform processing, and this heat needs to be dissipated through cooling systems.

The focus of current data center design practices is on optimizing energy consumption and cooling,
thereby reducing carbon emissions. To run a data center uninterrupted 24x7, there must be redundancy
in power feeds, including backup power generators and uninterruptible power supplies.

1.2.3 Cabling
The data center uses a great deal of copper and fiber-optic cabling to interconnect its different systems.
Structured and planned cabling strategies are key to avoiding a tangled mesh of wires and to ensuring
efficient dissipation of heat. The use of best practices, including pre-terminated cables (with connectors
fitted at the factory) and patch panels (to interconnect different LAN or fiber-optic circuits), is key to
setting up data center connections efficiently[6]. Equally important is documenting all cable
configurations for ready use when changes are required.

Key Infrastructure Capabilities
• Compute
• Network
• Storage
• Backup and Restore
• Disaster Recovery
• Monitoring
• Security

1.2.4 Security
While physical security using fencing, front-gate security, smart surveillance with CCTV cameras, laser
scanners, and thermal imaging equipment is necessary, logical security measures are equally
important[7]. Logical security measures involve setting up biometric devices and multi-factor
authentication to ensure that only authorized personnel have access to the systems in the data center.

1.2.5 Automation
Running a data center involves several processes, including workflows, scheduling, monitoring,
maintenance, application delivery, and so on. These processes must be automated for a data center to
work smoothly without issues. Automation is designed and implemented using APIs and configuration
management tools such as Ansible, Chef, Puppet, and OpenStack.
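The idempotent, declarative style used by configuration management tools such as Ansible and Puppet can be illustrated with a minimal sketch (the configuration keys are hypothetical): declare the desired state and apply only the difference, so that re-running the automation is safe.

```python
# Hypothetical sketch of idempotent configuration management:
# compute the delta between current and desired state, apply only that.
def apply_state(current, desired):
    """Bring `current` to `desired` and return the changes that were needed."""
    changes = {k: v for k, v in desired.items() if current.get(k) != v}
    current.update(changes)
    return changes

server = {"ntp": "pool.ntp.org", "ssh": "enabled"}
desired = {"ntp": "time.corp.example", "ssh": "enabled"}

first_run = apply_state(server, desired)   # only "ntp" needs changing
second_run = apply_state(server, desired)  # nothing to do: idempotent
```

Idempotence is what lets such tools run on a schedule across thousands of servers without causing unintended changes on machines that are already compliant.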

1.2.6 Monitoring
The staff running the data center require a complete view of all the equipment and its usage and
performance to respond in a timely manner to any IT problems[8]. A class of software called Data Center
Infrastructure Management (DCIM), such as that from Sunbird, provides dashboards for all critical
aspects of a data center to improve uptime, capacity planning and utilization, and data center energy
efficiency. Data center monitoring solutions are also available from vendors such as SolarWinds to
observe the health of critical components of the data center and enable timely action.
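A monitoring dashboard's simplest building block is a threshold check over collected metrics. A hypothetical sketch (the metric names and thresholds are illustrative, not from any DCIM product):

```python
# Illustrative thresholds a data center operator might configure.
THRESHOLDS = {"cpu_pct": 85, "temp_c": 30, "power_kw": 400}

def check(metrics):
    """Return only the metrics that breach their configured thresholds."""
    return {name: value for name, value in metrics.items()
            if name in THRESHOLDS and value > THRESHOLDS[name]}

# One polling cycle: CPU is over threshold, temperature and power are fine.
alerts = check({"cpu_pct": 91, "temp_c": 24, "power_kw": 380})
```

Real monitoring stacks layer trending, alert routing, and de-duplication on top of this basic idea, but the threshold comparison remains the core.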



1.2.7 Data Center Tiers
Uptime Institute has classified data centers into four tiers depending on their power, cooling,
fault-tolerance, and maintenance capabilities[9].
1. Tier 1 – Basic capacity: single path for cooling and power, and no backup components. Expected
uptime of 99.671% per year.
2. Tier 2 – Redundant capacity: single path for cooling and power, with redundant and backup
components. Expected uptime of 99.741% per year.
3. Tier 3 – Concurrently maintainable: multiple cooling and power paths, and redundant systems.
Expected uptime of 99.982% per year.
4. Tier 4 – Fault tolerant: physically isolated redundancy for every component. Expected uptime of
99.995% per year.
The colocation and cloud provider Switch developed a Tier 5 Platinum data center standard to designate
the highest possible level of data center excellence[10]. This standard is distinct from the Uptime
Institute classification.
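The expected-uptime percentages above translate directly into allowed downtime per year (a year has 525,600 minutes). The short Python sketch below computes it:

```python
# Convert a yearly uptime percentage into allowed downtime per year.
MINUTES_PER_YEAR = 525_600

def downtime_minutes(uptime_pct):
    """Minutes of downtime per year implied by an uptime percentage."""
    return (100 - uptime_pct) / 100 * MINUTES_PER_YEAR

for tier, pct in [("Tier 1", 99.671), ("Tier 2", 99.741),
                  ("Tier 3", 99.982), ("Tier 4", 99.995)]:
    print(f"{tier}: ~{downtime_minutes(pct) / 60:.1f} hours/year")
```

Tier 1's 99.671% allows roughly 28.8 hours of downtime a year, while Tier 4's 99.995% allows only about 26 minutes, which shows why each successive tier demands so much more redundancy.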

1.2.8 Active and Passive Data Centers


A data center that services applications and functions as an active application site serving user requests is
an active data center. On the other hand, a data center that has infrastructure ready to be activated to
service applications (e.g., in the event of failure of the primary data center) is a passive data center. Data
centers are established based on two models:
1. Active-active: Both primary and secondary data centers run applications (workloads) in production
environments and serve user requests. A load balancer distributes user requests across both data
centers[11].
2. Active-passive: The primary data center functions as the active application site, serves user requests,
and replicates critical business data to the secondary. The secondary data center is ready to be activated
to service applications should the primary data center fail for any reason[12].
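The two models can be contrasted with a toy routing sketch (hypothetical; it ignores health checks, session affinity, and the data replication that real deployments require):

```python
from itertools import cycle

# Active-active: a load balancer distributes requests across both sites
# (plain round-robin here for illustration; real balancers use richer policies).
sites = cycle(["primary", "secondary"])

def route_active_active():
    return next(sites)

# Active-passive: the secondary serves requests only when the primary is down.
def route_active_passive(primary_up):
    return "primary" if primary_up else "secondary"
```

In the active-active model both sites carry production load all the time; in the active-passive model the secondary sits idle (but replicated-to) until a failover event flips `primary_up` to false.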

References
[1] R. Castagna, “information technology (IT)”, https://searchdatacenter.techtarget.com/definition/IT.
[2] S. Kempter, A. Kempter, “IT Process Wiki - the ITIL® Wiki”, https://wiki.en.it-processmaps.com/index.php/Main_Page.
[3] EasyVista, “ITSM CAPABILITIES MAP”, https://www.easyvista.com/itsm-capabilities-map.
[4] R. Kal, “Data Center Design 101: Everything You Need to Know”, https://www.vxchnge.com/blog/data-center-design.
[5] M. Ole, “This is how we reduce data centers’ carbon footprint”, https://blog.sintef.com/sintefenergy/this-is-how-we-reduce-data-centers-carbon-footprint/.
[6] W. Ross, “4 Data Center Cabling Strategies That Will Make Your Job Easier”, https://www.vxchnge.com/blog/data-center-cabling-strategies.
[7] E. Sampera, “What to Know About Logical Security vs Physical Security”, https://www.vxchnge.com/blog/logical-security-vs-physical-security.
[8] B. Tom, “How to Retain Control With Data Center Monitoring Software”, https://www.vxchnge.com/blog/control-with-data-center-monitoring-software.
[9] Uptime Institute, “Tier Classification System”, https://uptimeinstitute.com/tiers.
[10] Switch, “Tier 5® Platinum Data Centers”, https://www.switch.com/tier-5/#:~:text=Switch%20invented%20the%20Tier%205,operating%20data%20centers%20since%202000.
[11] T. Slattery, “Active-active data centers key to high-availability application resilience”, https://www.techtarget.com/searchnetworking/tip/Active-active-data-centers-key-to-high-availability-application-resilience.
[12] Citrix, “Active-passive site deployment”, https://docs.citrix.com/en-us/citrix-adc/current-release/global-server-load-balancing/deployment-types/active-passive-site-deployment.html.
[13] OpenGroup, “Infrastructure Architecture”, https://pubs.opengroup.org/architecture/togaf80-doc/arch/p4/infra/infra_arch.htm#Approach



Chapter 2

Cloud

Types of Cloud

Private Cloud
• Control and Performance
• Two types
  – Converged
  – Hyperconverged

Public Cloud
• Elasticity and Change
• Hyperscalers

Hybrid Cloud
• Flexibility and Resiliency
• Hyperscalers

2.1 Private Cloud

A private cloud is a cloud computing environment dedicated to an organization, i.e., all the hardware and
software resources are private to the organization. With the advancement of technologies for compute,
storage, and network, it has become possible for vendors to put all of them together as a bundle in a rack
or two. A pre-packaged bundle of compute (servers), storage, network, and related software is referred to
as converged infrastructure[1]. While such a system will have components restricted to a specific set of
technologies (e.g., x86 compute only), and even those of a particular vendor, it has several advantages. It
reduces “touch labor” and the potential failures due to related errors, and it improves the overall
manageability of the system. It can also scale quickly through simplified deployments of modular
configurations to bring new services to market rapidly at a lower cost. Vendors of such systems provide
tools to provision and configure compute, storage, network, monitoring, and security; these tools offer
consistent interfaces and automate as much as possible. A private cloud is part of the strategy of
organizations that require full control over the underlying infrastructure for performance, compliance with
data privacy policies, and integration with legacy systems.

The bundling of infrastructure and software is at two levels:

1) Converged Infrastructure (CI): A hardware-based approach to converging compute, storage, network,
and related software using pre-validated and pre-racked configurations. An example is FlashStack from
Cisco and Pure Storage[2].

2) Hyperconverged Infrastructure (HCI): A software-based approach to converging compute, storage,
network, and related software using commodity components. While hyperconverged infrastructure is not
necessarily superior to converged infrastructure, it is generally regarded as the future of data center
design. An example is VxRack from Dell EMC[3].

The characteristic feature of these systems is that they bring together compute, storage, and network in
a “box” with simplified management and a lower cost of deployment, maintenance, and support. This
feature has made them good candidates for the private cloud[4].

As indicated earlier, the private cloud is strategic to enterprises when they wish to have full control of
their applications, infrastructure, and data, and to optimize them for performance[16]. Systems-of-record
solutions, such as core banking applications, need high performance, integration with mainframe and
mid-range systems, and support for strict data privacy regulations. For such use cases, a private cloud is
a good option to consider, as it provides tight control over the underlying infrastructure and data storage.

Chapter 2: Cloud 7
2.2 Public Cloud
Public cloud is a cloud computing environment open to all. Public cloud platforms, such as Amazon Web
Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), have grown in maturity and adoption
over the past two decades. These platforms provide compute, storage, network, and other infrastructure
resources that organizations access over the internet or dedicated networks on a pay-per-use model.

Organizations need to provision infrastructure resources quickly, ideally at the click of a button.
Additionally, many organizations need to address short-duration spikes in demand for infrastructure
resources. On a different dimension, market disruption is driving enterprises to digital transformation
initiatives that necessitate rapidly changing systems of engagement with new features that interact with
mobile and browser clients at web scale[5]. Public cloud platforms offer highly elastic, self-service
capabilities to address these needs with automated provisioning, configuration, and management of
infrastructure services. A public cloud is a suitable option for an enterprise when elasticity and change
are key considerations for the applications and infrastructure that enable its business[16].

Traditional data centers would not be able to address these requirements at a reasonable cost. The tech
giants Amazon, Microsoft, and Google have built public cloud platforms that offer both infrastructure as
a service (IaaS) and platform as a service (PaaS) capabilities from their specially designed data centers,
accessed as services through an API model.

Public cloud is strategic to an enterprise when solutions that implement business processes must support
a large number of users accessing from different types of devices and must undergo frequent changes
due to changing technologies and business demands. The services offered by the public cloud platforms
(e.g., Amazon AWS, Microsoft Azure, Google GCP) provide virtually unlimited elasticity with features to
auto-scale resources as demand increases. For this reason, these platforms are referred to as
hyperscalers. The hyperscalers regularly introduce new capabilities in their cloud platforms, and
enterprises can formulate strategies to enhance their products and services using these capabilities.
Many use cases are addressed more effectively by the public cloud. Systems of engagement that must
support many web and mobile clients may be implemented on the public cloud. Development
environments may be spun up and brought down easily and quickly on the public cloud. Workloads that
require a large number of infrastructure resources for only a few hours in a day, week, or month may be
more efficiently managed on the public cloud. Some public cloud platforms (e.g., Azure) can extend the
life of out-of-support systems (e.g., Windows 2003). Enterprises may also offer software as a service
(SaaS) solutions on the public cloud with lower investments and time to market.
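The auto-scaling that makes hyperscalers elastic rests on a simple control loop: measure utilization, then resize the fleet toward a utilization target. A hypothetical, simplified scale-out/scale-in decision in Python (target and sizes are illustrative):

```python
# Toy auto-scaling decision: size the fleet so average utilization
# moves toward a target (60% here, purely illustrative).
def desired_instances(current, avg_util, target=0.6):
    """Return the instance count that brings utilization near the target."""
    return max(1, round(current * avg_util / target))

spike = desired_instances(current=4, avg_util=0.9)    # demand spike: scale out
off_peak = desired_instances(current=6, avg_util=0.3)  # off-peak: scale in
```

Real cloud auto-scalers add cooldown periods, minimum/maximum fleet sizes, and multiple metrics, but the proportional resize above captures the elasticity principle.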

2.3 Hybrid Cloud


“Hybrid cloud is IT infrastructure that connects at least one public cloud and at least one private cloud and
provides orchestration, management and application portability between them to create a single, flexible,
optimal cloud environment for running a company’s computing workloads”[6]. A hybrid cloud is a good option
when flexibility and resiliency are key considerations for the applications and infrastructure of an enterprise[16].

Many enterprises have workloads on virtualization infrastructure in the data center, on private cloud platforms,
and on the public cloud. A hybrid cloud combines the resources across the data center and the public cloud
into an environment that may be orchestrated and managed, ideally from a “single pane of glass.” A hybrid
cloud brings the flexibility to move workloads between the private cloud and the public cloud

based on business demands. For instance, in an enterprise in the retail sector, workload demands during
the holiday season are likely to be significantly higher than in the off-season. Workloads deployed on a
private cloud environment can scale further into the public cloud. Applications will have to be architected
to leverage this flexibility, and the necessary tooling deployed to manage the processes. For instance, a
hybrid cloud DR architecture that backs up application data from the private cloud to the public cloud for
use in case of disaster (or cyber attack) is one way to leverage flexibility to increase the resiliency of IT
infrastructure.

The move from monolithic application architecture towards microservices architecture, coupled with
container orchestration in a hybrid cloud environment, results in efficient use of resources. Application
components may be deployed on infrastructure resources belonging to a hybrid cloud using infrastructure
as code (IaC) tools and techniques (e.g., Terraform). Further, application components and microservices,
packaged as containers, may be deployed through CI/CD pipelines to the container-orchestrated hybrid
cloud environment using tools like Kubernetes. Such deployments take advantage of the collective
resources of the private cloud (and other virtualization infrastructure in the data center) and the public
cloud to provide a highly efficient and resilient IT infrastructure through optimal utilization of resources
in both.
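The holiday-season scenario above is often called "cloud bursting," and the placement decision can be sketched in a few lines (capacities and workload sizes are illustrative only): fill private-cloud capacity first, and burst the remainder to the public cloud.

```python
# Toy cloud-bursting placement: workloads are (name, size) pairs; sizes
# and the private-cloud capacity are in arbitrary illustrative units.
def place(workloads, private_capacity):
    """Assign each workload to the private cloud until full, then burst."""
    placement, used = {}, 0
    for name, size in workloads:
        if used + size <= private_capacity:
            placement[name] = "private"
            used += size
        else:
            placement[name] = "public"
    return placement

plan = place([("web", 4), ("batch", 8), ("reports", 6)], private_capacity=10)
```

A real hybrid scheduler would also weigh data gravity, egress costs, and latency, but the capacity-first ordering conveys the basic economics of bursting.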

2.4 Cloud Adoption Framework


Enterprises are rethinking their business models and IT to be more competitive in the marketplace and to
implement solutions that better meet their needs. The pay-per-use model inherent in cloud solutions shifts
the related IT expenditure from capital expenditure (CAPEX, purchases for long-term use) to operational
expenditure (OPEX, day-to-day expenses), which in many cases is advantageous to the financial accounting
of the enterprise. Some products and services are viable only on the cloud. Cloud is thus changing the way
existing and emerging business objectives are formulated and realized. A strategy that takes into account
the objectives of an enterprise and optimally leverages cloud capabilities to meet its infrastructure
requirements enables a competitive advantage for the enterprise. The key cloud providers (Amazon AWS,
Microsoft Azure, and Google GCP) have all formulated cloud adoption frameworks to get the strategy and
implementation right. A summary of their cloud adoption frameworks is shown pictorially in Figure 2.1.

(Figure summary: the AWS approach proceeds through Envision, Align, Launch, and Scale, guided by six
perspectives: Business, People, Governance, Platform, Security, and Operations. The Azure approach
proceeds through Define Strategy, Plan, Ready, and Adopt, with Govern and Manage alongside. The GCP
approach progresses through Tactical, Strategic, and Transformational phases, guided by four themes:
Learn, Lead, Scale, and Secure.)

Figure 2.1: Cloud adoption frameworks for AWS, Azure, and GCP

The cloud adoption frameworks for the public cloud platforms, AWS, Azure, and GCP, facilitate the
migration of workloads to the cloud, based on best practices, as outlined by Amazon, Microsoft,
and Google.

AWS advocates the following four-step approach for cloud adoption to realize its benefits[7]:

1. Envision transformation opportunities consistent with business objectives.
2. Bring stakeholders into alignment by identifying gaps and developing strategies to address them.
3. Execute pilots to ensure business value and fine-tune the direction.
4. Subsequently, scale the pilots to full-scale deployment.

In addition, it lays out six perspectives that the concerned stakeholders need to own and manage to
develop the related capabilities on the cloud.

Azure promotes a methodology for cloud adoption that also involves four steps[8]:

1. Define strategy and a business case.
2. Create an actionable cloud adoption plan.
3. Prepare cloud environments with landing zones.
4. Migrate workloads to those cloud environments.

It also emphasizes creating an operations baseline and a governance baseline for effective management
and governance.

GCP recommends an approach with three phases[9]:

1. Tactical.
2. Strategic.
3. Transformational.

Each of the above three phases involves four themes (learn, lead, scale, and secure) that lead the
organization towards greater maturity on the cloud transformation journey.

All the frameworks have certain cloud adoption steps in common:

1) Establish criteria to ensure that cloud adoption results in business value.
2) Define a strategy for cloud adoption, keeping in mind business objectives and the associated
investments.
3) Formulate a migration strategy for applications.
4) Employ the principles of a well-architected framework.
5) Prepare landing zones and run pilots.
6) Migrate applications; govern and manage cloud deployments.

The first two steps are related to business, and the next four to IT. Since the focus of this book is on IT,
the next four sections discuss them.

2.5 Migration Strategies to Cloud
Enterprises have several types of applications. To ensure that applications are migrated to the cloud
efficiently, with optimal utilization of resources and lower cost, and deliver business value in a timely
manner, it is necessary to formulate a migration strategy that specifies the WHAT, WHY, HOW, WHERE,
and WHEN of migration to the cloud.

WHAT and WHY – An assessment of the application landscape determines which applications are suitable
candidates to migrate to the cloud to deliver business value and realize cost benefits. Migrating all
applications to the cloud does not necessarily result in business value to the enterprise. For instance, it
may make more business sense to RETAIN legacy platforms with deep integration to custom applications
in the data center. It may also be possible to identify rarely used or redundant applications (those that
implement the same functionality). In such cases, RETIRING applications that are no longer needed may
be the best treatment. x86-based Windows and Linux applications are obvious targets for migration in
the initial phases.

HOW – The key migration approaches are:

a) REHOST – This is essentially a lift-and-shift of the application from a virtual machine in the data
center to a virtual machine on the cloud, without making any changes to the application.

b) REPLATFORM – When some changes are made to the application as part of the migration to take
advantage of cloud capabilities, the application is said to be replatformed.

c) REFACTOR (or RE-ENGINEER) – This involves re-architecting the application to take full advantage of
the native capabilities of the cloud platform.

d) REPURCHASE – In some cases, there is an opportunity to replace a product sourced from a third-party
provider with an alternative service from a SaaS provider, followed by data migration to the SaaS service.
Such an approach constitutes a REPURCHASE of the capability.

Note: Some cloud providers refer to other approaches specific to the services they offer. For instance,
AWS offers a RELOCATE service to perform a hypervisor-level “lift and shift” of applications to VMware
Cloud on AWS[10].

Cloud Migration
• Cloud Adoption Framework
• Well-Architected Framework
• Migration Strategies
  – WHAT and WHY
    • RETAIN
    • RETIRE
    • MIGRATE
  – HOW
    • REHOST
    • REPLATFORM
    • REFACTOR
    • REPURCHASE
• Agile Approach
  – Landing Zone

WHERE and WHEN – Applications are migrated to environments defined in the cloud called landing zones,
which are discussed in section 2.7 of this chapter. A migration plan is prepared to address the WHEN of
migration to the cloud, considering dependencies and the availability of resources.
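As an illustration only, the treatments above can be caricatured as a rule-of-thumb classifier (the attribute names are hypothetical; real migration assessments weigh cost, compliance, dependencies, and many other factors):

```python
# Toy "6R" classifier for migration treatment; the ordering of the rules
# encodes a plausible priority, not an authoritative methodology.
def treatment(app):
    """Return a migration treatment for an application described as a dict."""
    if app.get("redundant"):
        return "RETIRE"
    if app.get("platform") in ("mainframe", "mid-range"):
        return "RETAIN"
    if app.get("saas_alternative"):
        return "REPURCHASE"
    if app.get("needs_cloud_native"):
        return "REFACTOR"
    if app.get("minor_changes"):
        return "REPLATFORM"
    return "REHOST"  # x86 apps with no special needs: lift and shift
```

A portfolio assessment would run every application through rules like these as a first pass, then refine the answers with workshops and dependency analysis.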

2.6 Well-Architected Framework
A well-architected framework provides guidance for architecting solutions using cloud services to support
application deployments based on a consistent approach and best practices. The key cloud platform
providers, namely AWS, Azure, and GCP, have, by and large, specified the same principles as the
foundation of a well-architected cloud solution[11]. Table 2-1 summarizes the principles of a
well-architected framework.

Principle                Description                                          AWS   Azure   GCP

Performance Efficiency   Use resources efficiently and adapt to                ✓     ✓      ✓
                         changes in demand.

Reliability              Perform the intended function correctly and           ✓     ✓      ✓
                         consistently, and recover from failures.

Security                 Protect data and applications from threats.           ✓     ✓      ✓

Operational Excellence   Define and implement processes to support             ✓     ✓      ✓
                         the development and running of applications
                         effectively in production.

Cost Optimization        Manage costs to maximize the value delivered.         ✓     ✓      ✓

Table 2-1: Principles of a well-architected framework

A solution developed based on the principles of the well-architected framework will run efficiently on
the cloud, meeting the requirements and delivering business value to the enterprise.

2.7 Landing Zones


Any application, wherever it runs, needs compute, storage, network, and data resources. Traditionally,
on-premises data centers provided these resources. Cloud platforms provide them in their own data
centers, architected for elasticity, along with management consoles that make changes possible at the
click of a button.

When enterprises consider the cloud as a target for their applications, they need to plan for the
provisioning, management, and configuration of thousands (if not hundreds of thousands) of resources
related to compute, storage, network, and data. Any attempt to perform such tasks manually at scale is
error-prone and defeats the benefits of the cloud. Thus, there is a need to automate cloud resource
provisioning, management, and configuration.

To this end, cloud platforms provide access to resources through APIs that may be invoked
programmatically through scripts. Tools with scripting capabilities used for automation in data centers,
such as Chef, Puppet, Ansible, and Terraform, are supported by cloud platforms to provision resources in
the cloud environment in an automated manner[12]. Scripts specify infrastructure as code (IaC) for both
provisioning and configuration. A landing zone is an environment provisioned and configured on the cloud
with compute, storage, network, and data-related resources, into which applications or workloads are
deployed. It brings efficiency and ease of management to cloud deployments. Cloud management teams
specify landing zones to define environments for development (DEV), testing (TEST), user acceptance
(UAT), and production (PROD) in which to deploy applications.

The landing zone is an agile technique for infrastructure[13]. A given environment with thousands of
resources pre-defined as infrastructure as code (IaC) scripts may be spun up and down in minutes based
on business demand. Provisioning may be triggered by developers through automation, with no manual
intervention by operations teams.

A landing zone is, thus, a building block for the cloud and an integral part of the cloud adoption
framework of every major cloud provider. An enterprise typically establishes a cloud center of excellence
(CCOE), or a cloud management team, that defines the landing zones to be provisioned in the cloud with
the necessary monitoring and security controls to enable cloud adoption[14].

In other words, a landing zone enables a “software-defined data center” for an enterprise in the cloud.
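In the spirit of IaC, a landing zone can be sketched as a declarative specification that a tool walks through to provision resources. The structure and names below are hypothetical, not any provider's real schema; a real tool such as Terraform would call the cloud provider's APIs at each step.

```python
# Hypothetical landing-zone specification for a DEV environment,
# declared as data rather than as imperative setup steps.
LANDING_ZONE_DEV = {
    "network": {"vpc": "10.10.0.0/16", "subnets": ["10.10.1.0/24"]},
    "compute": {"instances": 2, "size": "small"},
    "storage": {"object_buckets": ["dev-artifacts"]},
    "controls": {"monitoring": True, "encryption_at_rest": True},
}

def provision(zone):
    """Pretend-provision each resource group in declaration order."""
    return [f"provisioned {group}" for group in zone]

steps = provision(LANDING_ZONE_DEV)
```

Because the specification is data, the same definition can be stamped out repeatedly for TEST, UAT, and PROD with only parameter changes, which is exactly what makes landing zones repeatable and auditable.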

2.8 Agile approach for cloud deployments


In general, agility relates to business, and agile to IT. Agility is the ability of an enterprise to respond to
change, whether changing customer demand or market conditions. Agile, on the other hand, is a set of
tools and techniques that help an enterprise achieve agility. Markets across industry segments have been
in a state of disruption, and these disruptions mandate changes to the products and services offered by
enterprises. Enterprises therefore need the agility to make the required changes to products and services
rapidly. IT needs to support the business, and for this, IT demands agile tools and techniques. In the
context of the cloud, agile tools and techniques are needed at both the infrastructure level and the
application level.

Two important techniques support agile in applications: one is DevOps, and the other is containerization.
DevOps involves the setup of CI/CD pipelines that foster automation of build, test, and deploy activities,
so that any change can be moved into production in a few hours, sometimes even minutes; tools like
Jenkins, Bamboo, Chef, Puppet, and Ansible support DevOps. Containerization packages the functionality
of an application into independent, deployable units, scales them, and optimizes the use of the underlying
environment for efficiency[15]. Container orchestration platforms bring greater manageability and
resiliency, enabling automation for scale and efficiency. Docker and Kubernetes, the latter available in
various distributions (such as the upstream open-source distribution, OpenShift, and Tanzu), support
agile processes.
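The CI/CD idea of staged, fail-fast automation can be sketched as follows (a toy model; real pipelines in tools like Jenkins add triggers, artifacts, parallelism, and approval gates around this core):

```python
# Toy CI/CD pipeline: run stages in order, stop at the first failure.
def run_pipeline(stages, change):
    """Run (name, stage_fn) pairs against a change; return a pass/fail log."""
    log = []
    for name, stage in stages:
        ok = stage(change)
        log.append((name, "pass" if ok else "fail"))
        if not ok:
            break  # fail fast: later stages never run on a broken change
    return log

stages = [
    ("build", lambda c: True),                        # compile/package
    ("test", lambda c: c.get("tests_pass", False)),   # automated tests
    ("deploy", lambda c: True),                       # push to environment
]

good = run_pipeline(stages, {"tests_pass": True})
bad = run_pipeline(stages, {})
```

The fail-fast ordering is the whole value proposition: a change that breaks the tests never reaches the deploy stage, which is what makes frequent, small releases safe.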

An agile approach for cloud is shown pictorially in Figure 2.2.

(Figure summary: a CCOE governs the cloud and provides guidance to business/IT teams. On the
development side, application teams build and deploy through CI, producing application builds and
containers, and develop landing zone IaC scripts. On the deployment side, CI/CD builds are deployed and
the IaC scripts are executed to provision landing zones in the cloud for the DEV, TEST, UAT, and PROD
environments.)

Figure 2.2: Agile approach for cloud deployments

As indicated in the earlier section, a CCOE, or the cloud management team in an enterprise, establishes
cloud governance. It defines landing zones, for which the business or IT development teams develop
infrastructure as code (IaC) scripts for provisioning, configuration, and management.

The agile application development process involves the use of CI tools and techniques by application
teams to create application builds and containers. The teams deploy those builds and containers using
CI/CD tools and techniques to the DEV, TEST, UAT, and PROD environments. To do so, they use
infrastructure as code (IaC) scripts to perform infrastructure provisioning, configuration, and
management. Infrastructure changes are handled by executing scripts that implement the required
changes. Thus, changes to functionality result in changes to code that activate CI/CD pipelines to create
and deploy builds. In a well-integrated agile process, application teams trigger infrastructure as code
(IaC) scripts to provision, configure, and manage infrastructure resources.

In this book, the term “data center” is being used to refer to a traditional on-premises data center managed
by IT or their third-party providers. The term “cloud” is used to refer to public cloud platforms.

The architecture documents required for infrastructure solutions are discussed in the next chapter.

References
[1] NetApp, "What Is Converged Infrastructure (CI)?", https://www.netapp.com/data-storage/flexpod/what-is-converged-infrastructure/.
[2] Cisco, "FlashStack with Cisco UCS X-Series and Cisco Intersight", https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/flashstack-with-ucs-x-and-intersight.html.
[3] DELL, "VxRack System FLEX", https://www.dell.com/en-us/work/shop/povw/vmware-vxrack.
[4] A. Miller, "Converged vs Hyperconverged Infrastructure: The Differences Between CI & HCI", https://www.bmc.com/blogs/converged-infrastructure-vs-hyper-converged-infrastructure/#.
[5] Mainstream Technologies, "Systems of Record vs Systems of Engagement", https://www.mainstream-tech.com/systems-of-record-vs-systems-of-engagement/.
[6] S. Vennam, IBM, "Hybrid Cloud", https://www.ibm.com/in-en/cloud/learn/hybrid-cloud.
[7] AWS, "AWS Cloud Adoption Framework (AWS CAF)", https://aws.amazon.com/professional-services/CAF/.
[8] Microsoft, "Microsoft Cloud Adoption Framework for Azure", https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/.
[9] Google Cloud, "The Google Cloud Adoption Framework", https://services.google.com/fh/files/misc/google_cloud_adoption_framework_whitepaper.pdf.
[10] AWS, "AWS Prescriptive Guidance - Overview", https://docs.aws.amazon.com/prescriptive-guidance/latest/migration-retiring-applications/overview.html.
[11] K. Stalcup, "The Azure Well-Architected Review is Worth Your Time", https://www.parkmycloud.com/blog/azure-well-architected/.
[12] S. Strut, IBM, "Infrastructure as Code: Chef, Ansible, Puppet, or Terraform?", https://www.ibm.com/cloud/blog/chef-ansible-puppet-terraform.
[13] E. Rifkin, Microsoft, "Creating cloud ready environments with Azure landing zones", https://azure.microsoft.com/en-in/blog/creating-cloud-ready-environments-with-azure-landing-zones/.
[14] D. Ramachandani, "Building a Cloud Centre of Excellence in 2020: 13 Pitfalls and Practical Steps", https://www.contino.io/insights/cloud-centre-of-excellence-2020.
[15] S. Kambhampaty, "Why Your IT Strategy Should Extend The Value Of Cloud With Containerization", https://www.forbes.com/sites/forbestechcouncil/2021/07/16/why-your-it-strategy-should-extend-the-value-of-cloud-with-containerization/.
[16] S. Kambhampaty, "Choosing The Right Cloud Strategy For Your Enterprise", https://www.forbes.com/sites/forbestechcouncil/2021/12/15/choosing-the-right-cloud-strategy-for-your-enterprise/?sh=6b55ad3851b2.

Chapter 3

Architecture Documents for Infrastructure Solutions

Documents to capture three views of Infrastructure Architecture:
• Conceptual Technology Architecture (CTA)
• Logical Technology Architecture (LTA)
• Physical Technology Architecture (PTA)

The TOGAF standard defines technology architecture and stipulates its creation in Phase D of the
Architecture Development Method (ADM)[1]. It also refers to the Architecture Definition Document as a
deliverable from Phase D, with guidance on what needs to be part of the technology architecture
definition document. The scope of infrastructure architecture covered in this book is limited to
"Phase D - Technology Architecture" of the Architecture Development Method of TOGAF. Therefore, the
words "infrastructure" and "technology" are used synonymously in the context of capabilities,
solutions, and architecture for data center and cloud.

In practice, it is better to have three separate architecture documents for infrastructure or technology
architecture – one that documents the conceptual technology architecture, the second that describes
the logical technology architecture, and the third the physical technology architecture, each capturing a
different view of architecture[2]. Each view (and the associated document) addresses the needs of a
particular set of stakeholders as one view of the architecture cannot satisfactorily describe the solution
from the perspectives of all stakeholders[3]. One set of stakeholders would need a conceptual view.
The second would need a clear, logical view of how the architecture elements (along with associated
software) come together to meet specific requirements of the business/application, together with their
relationships and interactions. The third set of stakeholders needs the specifics (such as IP addresses to be assigned) to
stand up the infrastructure components, i.e., the physical view of the infrastructure solution[4].

Thus, it is common to have three separate documents to cover the guidance provided by TOGAF and
industry best practices.

3.1 Conceptual Technology Architecture (CTA)


The CTA is a conceptual view of the architecture. The conceptual view describes what infrastructure (or
technology) capabilities are required[4].

The approach that is adopted is to start with business capabilities and map them to the infrastructure
capabilities required to deliver them. Infrastructure capabilities identified may be implemented to fulfill
the needs of the business. In general, for most enterprises, compute, network, storage, backup/restore,
disaster recovery, monitoring, and security emerge as core infrastructure capabilities.

Chapter 3: Architecture Documents for Infrastructure Solutions 17


While in some cases, CTA might document a conceptual view of specific capabilities (e.g., security), the
conceptual view is generally defined at a much broader level, taking into account different infrastructure
capabilities such as compute, storage, and so on. Therefore, it is common to have a CTA done at the
data center level for a business unit or an enterprise.

Several templates are available from different sources for preparing CTA. These may be suitably modified
based on the organization’s requirements[5]. The following is the indicative list of sections that could be
part of the CTA formulated by enterprise or infrastructure architects.

Conceptual Technology Architecture – Indicative list of sections


1. Context

2. Objectives

3. Overview

4. Scope

5. IT Strategy

6. Design Principles

7. Technology Standards & Guidelines

8. Infrastructure Capabilities

a. Compute

b. Network

c. Storage

d. Backup/Restore

e. Disaster Recovery

f. Monitoring

g. Security

h. Others

9. Service Management

a. Service Catalog

b. Service Monitoring

10. Guidance for developing Infrastructure Solutions

a. Architecture Documents

b. Architecting Process



3.2 Logical Technology Architecture (LTA)
The LTA is the key document that defines the logical view of the infrastructure solution. It describes
the how of the solution. This document should provide complete clarity on the different infrastructure
components, their current state (if a version already exists), and how they are implemented in the target
state. It should also describe the services needed to support infrastructure capabilities (monitoring,
backup/restore, DR, and security) for effectively operationalizing the target architecture.

Templates are available in the public domain for preparing LTA[6]. Considering the guidance provided
by TOGAF 9.2 on what should be part of the architecture deliverable document, the following is the
indicative list of sections that should be part of the LTA formulated by the infrastructure architect[7].

Logical Technology Architecture – Indicative list of sections


1. Solution Overview

• Solution Context

2. Requirements

• Business Requirements
• Technical Requirements
• Application Requirements
• SLA Requirements
3. Requirements Traceability Matrix

4. Alignment to Design Principles

5. Scope

• In Scope
• Out of Scope
6. Solution Detail – Logical Architecture

• Baseline (AS-IS) Architecture

• Architecture Decisions

• Target Architecture

– Compute

– Storage

– Network

• Architecture/Solution Building Blocks


• Detailed Building Blocks



7. Target Architecture Operational Model

8. Transition from Baseline to Target Architecture

9. Support Services

• Network

• Monitoring

• Backup/Restore

• Disaster Recovery

• Security

10. Risks and Constraints

11. Assumptions and Dependencies

Key points related to the indicative list of sections for LTA are as follows:
1) Solution Overview: The solution overview section provides the overview of the solution presented
in the rest of the LTA document, in a page or two. It also provides a context in which this solution
operates, i.e., what other solutions are related.

2) Requirements: When business units want infrastructure solutions, they tend to request X number
of servers. Specifying the number of servers is not a requirement but a solution. It is the role of
the infrastructure architect to understand the need and specify the infrastructure components for
the solution.

3) Design principles: At the enterprise level, or sometimes at the individual business unit level, specific
design principles are identified (e.g., improve security). The LTA must indicate how the design
principles are adhered to.

4) Solution detail: The solution detail section describes the target state architecture, including the
Architecture Building Blocks (ABB) and Solution Building Blocks (SBB) of the architecture. It is
recommended that the ABBs and SBBs defined in the solution be consistent with the TOGAF definitions.
For instance, Data Center, Rack, Switch, and Storage are high-level ABBs, while Server and ESXi Cluster
are next-level ABBs. The VM that hosts the application is an SBB[8].

5) Transition and Target Operating Model: The architect should carefully consider the transition from
a baseline (AS-IS) to the target architecture and describe the key points to be kept in mind. The
architect should also describe the operating model for the target state.

6) Support services: The run-time environment will require certain support services that include
solutions for network, monitoring, backup/restore, disaster recovery, security, and so on. The
architect should assess the need for these solutions and factor them into the overall solution
suitably.
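The ABB/SBB decomposition mentioned in the solution-detail point above is essentially a containment hierarchy. The following sketch models it in Python; the class design and the building-block names used are illustrative choices for this book's example, not structures prescribed by TOGAF.

```python
from dataclasses import dataclass, field

@dataclass
class BuildingBlock:
    """A node in the architecture decomposition (an ABB or an SBB)."""
    name: str
    kind: str  # "ABB" or "SBB"
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

# Illustrative hierarchy: high-level ABBs contain next-level ABBs,
# which in turn contain the SBBs that realize them.
data_center = BuildingBlock("Data Center", "ABB")
rack = data_center.add(BuildingBlock("Rack", "ABB"))
esxi = rack.add(BuildingBlock("ESXi Cluster", "ABB"))
app_vm = esxi.add(BuildingBlock("Application VM", "SBB"))

def solution_building_blocks(block):
    """Collect the names of all SBBs beneath a building block."""
    found = [block.name] if block.kind == "SBB" else []
    for child in block.children:
        found.extend(solution_building_blocks(child))
    return found

print(solution_building_blocks(data_center))  # ['Application VM']
```

Walking the tree from any ABB yields the concrete SBBs it ultimately contains, which is useful when tracing a logical element down to deployable components.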



3.3 Physical Technology Architecture (PTA)
The PTA takes the logical technology architecture defined in the LTA to the next level and presents the
physical view of the infrastructure solution. It describes the where of the solution, i.e., the specifics of
where the infrastructure components would be deployed. This document should provide enough clarity for
implementation teams to stand up the infrastructure, configure it correctly from an architecture perspective,
and ensure that adequate inputs are provided to operations teams for recovery of the solution should
it go down for some reason. (Operations teams use the inputs to create a more detailed operations
manual for the recovery of infrastructure components.) The PTA should also contain the list of infrastructure
components for project teams to order. The list is prepared in discussion with the identified vendors.

Templates are also available in the public domain for preparing PTA[6]. Considering the guidance provided
by TOGAF 9.2 on what should be part of the architecture deliverable document, the following is the
indicative list of sections that should be part of the PTA formulated by the infrastructure architect[7]:

Physical Technology Architecture – Indicative list of sections


1. Solution Overview

• Solution Context

2. Requirements

• Business Requirements

• Technical Requirements

• Application Requirements

• SLA Requirements

3. Scope

• In Scope

• Out of Scope

4. Solution Detail – Physical Architecture

• Baseline (AS-IS) Physical Architecture

• Target Physical Architecture

5. Solution – Hardware, Software

• Component List

• Component Specification

6. Solution – Configuration

• Implementation Detail

• Physical Layout



7. Support services

• Network

• Monitoring

• Backup/Restore

• Disaster Recovery

• Security

8. System Recovery

9. Management and Control

10. Risks and Constraints

11. Assumptions And Dependencies

Key points related to the indicative list of sections for PTA are as follows:
1. Solution Overview, Requirements, and Scope: For continuity and context, it is advisable to have
these sections similar to those in the LTA with suitable changes, where necessary. These sections
may also provide a reference to the LTA and other documents.

2. Solution detail: The physical architecture of the solution is described in this section. The baseline
architecture (AS-IS), if one exists, and target physical architecture are described in this section.

3. Solution – Hardware & Software: An important activity performed by the infrastructure architect
is to make a complete list of the infrastructure components (including software) and to work
with vendors and the project teams to determine the right fit at optimal cost. The finalized list of
components, with part numbers, is specified in this section and constitutes the bill of materials.
Organization processes to order the components that need to be procured must be initiated after
due approvals.

4. Solution – Setup & Configure: This important section provides the information needed for
implementation teams to stand up the infrastructure and configure it correctly from an architecture
perspective. It may be noted that the PTA is an architecture document and is not meant to cover
all the specifics of configuration and implementation. Such detail is maintained in the implementation
manuals or run books of the different infrastructure components.

5. Support solutions: As indicated earlier, the run-time environment will require certain support services
for network, monitoring, backup/restore, disaster recovery, security, and so on. The specifics at the
physical level will need to be provided by the architect to leverage the support solutions.

The LTA and PTA represent the architecture at the logical and physical levels. To formulate these
architecture documents, the architect must know the different infrastructure capabilities, components
available from vendors, and their architectural considerations. These are described at length over the
rest of the book.



3.4 Architecture documents for Cloud solutions
Cloud platforms (public, private, or hybrid) have the infrastructure already set up. Hence, the PTA mentioned
above is not needed. However, there is still a need to list all the software deployed on the cloud for the
solution, as licensing requirements must be met. It is common practice to capture the software needed
in either a simple SharePoint solution or an Excel sheet, and to implement the approval workflows
either in an automated manner or manually. The LTA must still be developed, as every solution deployed
must have an approved architecture document that describes the requirements and the solution. Only
then can future changes be made efficiently in a people-independent manner.
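The software list and approval tracking described above need not be elaborate; the sketch below models the same idea in a few lines of Python. The field names, statuses, and the "Acme Analytics" product are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch of a cloud software inventory with approval tracking.
# Fields and statuses are illustrative, not a standard schema.
inventory = []

def register_software(name, version, license_type):
    """Record a software item pending approval."""
    entry = {"name": name, "version": version,
             "license": license_type, "status": "PENDING"}
    inventory.append(entry)
    return entry

def approve(name):
    """Mark a registered software item as approved."""
    for entry in inventory:
        if entry["name"] == name:
            entry["status"] = "APPROVED"

register_software("PostgreSQL", "14", "open source")
register_software("Acme Analytics", "3.2", "commercial")  # hypothetical product
approve("PostgreSQL")

pending = [e["name"] for e in inventory if e["status"] == "PENDING"]
print(pending)  # ['Acme Analytics']
```

In practice the same record structure could live in a SharePoint list or a spreadsheet, with the approval step wired to a manual or automated workflow.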

References
[1] TOGAF 9.2, "Introduction to Part II - ADM Overview", https://pubs.opengroup.org/architecture/togaf9-doc/arch/chap04.html.
[2] Essential Project Documentation, "Technology Modelling", https://enterprise-architecture.org/docs/technology_architecture/technology_architecture_modelling_overview/.
[3] TOGAF 8.1.1, "Developing Architecture Views", https://pubs.opengroup.org/architecture/togaf8-doc/arch/chap31.html.
[4] P. Robinson, "The Tao of Technology Architecture – Part 1", https://www.ferroquesystems.com/the-tao-of-technology-architecture-part-1/.
[5] M. A. Ogush et al., "HP Architecture Template, description with examples", https://www.cs.helsinki.fi/group/os3/HP_arch_template_vers13_withexamples.pdf.
[6] C. Michaud, "Templates Repository for Software Development Process", https://blog.cm-dm.com/pages/Software-Development-Process-templates.
[7] TOGAF 9.2, "11. Phase D: Technology Architecture", https://pubs.opengroup.org/architecture/togaf9-doc/arch/chap11.html#tag_11_03_08.
[8] TOGAF 9.2, "33. Building Blocks", https://pubs.opengroup.org/architecture/togaf9-doc/arch/chap33.html.



Chapter 4

Architecting Process for Infrastructure Solutions

Architecting Process:
• Develop Conceptual Technology Architecture (CTA)
• Develop Logical Technology Architecture (LTA)
• Develop Physical Technology Architecture (PTA)

An infrastructure solution must provide a suitable deployment environment to the applications of
business or IT of an enterprise, run-time operational capabilities, and disaster recovery (DR)
capabilities for business continuity (in the event of a disaster). The suitability of the deployment
environment depends on its ability to support the non-functional capabilities of the applications in
terms of performance, scalability, availability, security, and so on. The run-time operational capabilities
include structured deployment of the applications, planned changes after deployment with minimal
impact to the environment (and other solutions), and operational recovery in the event of failures.
The disaster recovery capabilities include an operationally ready DR site, replication of application data
to the DR site, and a well-understood and mature process/automation to operate from the DR site, should a
disaster occur.

The role of an infrastructure architect is to formulate infrastructure solutions to meet the above needs.
To this end, the infrastructure architect should follow a structured process to architect an infrastructure
solution. As discussed in the previous chapter, three documents define the infrastructure solutions and
need to be prepared by architects. The architecting process involves the preparation of these documents,
which is described in the following two sub-sections.

4.1 Develop conceptual architecture for infrastructure solutions
As indicated in the previous chapter, the conceptual technology architecture (CTA) describes what
infrastructure capabilities are required at the data center level for a business unit or an enterprise. It is
not specific to a particular infrastructure solution but the foundation on which infrastructure solutions
are defined. Figure 4.1 depicts the process for defining CTA.

4.1.1 Determine technology drivers


The starting point to formulating CTA is to determine the technology drivers of an enterprise. These are
IT strategy, design principles, and technology standards & guidelines.

IT Strategy: IT strategy (information technology strategy) is an approach for creating infrastructure
capabilities to meet IT and business goals[1]. An IT strategy is explicitly stated in a document or part
of the senior management communication that sets the direction for the IT department to create

Chapter 4: Architecting Process for Infrastructure Solutions 25


infrastructure capabilities to support an organization’s overall business strategy. An IT Plan is also
formulated with specific initiatives, outcomes, and timelines to implement the IT strategy[2].

Design Principles: “Design Principles are a set of considerations that form the basis of any good
product”[3]. They are rules that guide architects and designers when making decisions and trade-offs
on architecture and design. When two or more architecture/design options are equally good, design
principles help architects and designers choose the one that best fits the principles defined.

For example, most enterprises have design principles related to improved security and cost optimization[4].
When the architect identifies two or more solution options that meet the requirements equally, the
architect uses the design principles to choose the option that is more secure, costs less, or both.

Technology standards & guidelines: “A standard is a document that provides requirements, specifications,
guidelines or characteristics that can be used consistently to ensure that materials, products, processes
and services are fit for their purpose”[5]. Organizations prepare and maintain a list of technology standards
to control technologies for solutions keeping in mind long-term benefits, licensing costs, and support
skillsets. Use of technology not in the list of approved standards typically requires exception approvals
from all stakeholders and sponsors.

[Figure content: the technology drivers (IT strategy, design principles, and technology standards & guidelines) are determined, and the infrastructure capabilities (compute, network, storage, backup/restore, disaster recovery, monitoring, and security) are identified; both feed into developing the Conceptual Technology Architecture.]

Figure 4.1: CTA Architecting Process



4.1.2 Identify infrastructure capabilities
An organization delivers goods and services to its clients/customers. The organization’s IT supports
its business with applications and infrastructure that enable the delivery of goods and services.
Infrastructure capabilities are conceptual-level elements that capture what the infrastructure does[6].

As indicated in the previous chapter, the approach adopted is to start with business capabilities and map
them to the infrastructure capabilities required to deliver them. A value stream analysis of the business
capabilities may be conducted to establish the processes/activities required to deliver their value. (A value
stream is the set of activities required to deliver goods or services.) Infrastructure capabilities may
then be mapped to realize the business capabilities of the enterprise[7]. Such an approach results in a core
set of infrastructure capabilities that a data center or business unit needs to support. In general, for most
enterprises, compute, network, storage, backup/restore, disaster recovery, monitoring, and security emerge
as a core set of capabilities. Each of these capabilities has been described in subsequent chapters.

Applications deliver the business capabilities required by organizations. Hence, an important aspect of
ensuring that infrastructure capabilities support business capabilities is defining the characteristics of
applications deployed on the infrastructure.

Application Criticality Tiers


The applications’ criticality levels are expressed in tiers[8]. In general, three tiers of criticality are defined.
While the names of the tiers vary from organization to organization, there is typically a tier equivalent
to CRITICAL for business-critical and highly demanding applications, STANDARD for less critical
applications, and LOW PRIORITY for the rest. An indicative set of tiers is given in Table 4-1.

Tier 1 – CRITICAL
Workload: Production environment for CRITICAL applications.
Characteristics of Tier:
• Critical production workloads.
• 12/12 (RPO/RTO) DR using identical infrastructure.
• Requires high availability – clustered or load balanced.
• The highest tier of DR.
• Integrated end-to-end monitoring.

Tier 2 – STANDARD
Workload: Production environment for STANDARD applications.
Characteristics of Tier:
• Standard production workloads.
• 24/24 (RPO/RTO) DR using similar but not necessarily identical infrastructure.
• Does not require high availability.
• Requires integrated DR.

Tier 3 – LOW PRIORITY
Workload: Production for LOW PRIORITY applications and non-production environments for all applications.
Characteristics of Tier:
• Low priority production workloads that require 48/48 (RPO/RTO) DR using similar but not necessarily identical infrastructure.
• Does not require high availability.
• Non-production workloads that do not require integrated DR.

Table 4-1: Application Criticality Tiers



4.1.3 Develop CTA
The infrastructure architect (or an enterprise architect) developing the CTA determines the technology
drivers, including IT strategy, design principles, and technology standards & guidelines. They identify
the infrastructure capabilities needed to support the business and formulate the conceptual technology
architecture that serves as a foundation for all infrastructure solutions for a given data center or
business unit.

The CTA specifies key infrastructure capabilities required for all infrastructure solutions. It also specifies
the service characteristics related to the infrastructure capabilities.

Service characteristics: As discussed in chapter 1, in an ITSM/ITIL compliant organization, the IT group in


an enterprise (supported by any third-party outsourced organizations) offers IT capabilities as services.
Business units order these services through a service catalog (established by the service strategy and
service design ITSM processes). The services offered via the service catalog are defined in terms of the features or qualities
provided by the service with an associated cost (whether internal or payable to an external outsourced
organization). Service characteristics are the features or qualities offered by a service. Each of the
services is also governed by a service-level agreement. As per Gartner, “service-level agreement (SLA)
sets the expectations between the service provider and the customer and describes the products or
services to be delivered, the single point of contact for end-user problems, and the metrics by which the
effectiveness of the process is monitored and approved”[9].

Indicative service characteristics are provided for four key infrastructure capabilities – compute, storage,
backup/restore, and disaster recovery.

Illustration – Indicative service characteristics

1. Compute
Compute is the infrastructure capability that supports deployment of applications. Tiers are defined for
different compute service characteristics. Typically, three tiers are defined – HIGH, MEDIUM, and LOW.

An indicative table with Compute service tiers and their characteristics is given in Table 4-2[10].

Tier     Platform OS                        Availability (%)   Monitoring   Clustering/Load Balanced   DR Tier
High     Windows, Linux, IBM i, AIX, z/OS   98.99              Premium      Yes                        High
Medium   Windows, Linux, IBM i, AIX, z/OS   96.00              Advanced     No                         Medium
Low      Windows, Linux, IBM i, AIX, z/OS   90.00              Basic        No                         Low

Table 4-2: Compute service tiers

The Compute service tiering, in general, aligns with the tiers of application criticality.
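The availability percentages in Table 4-2 translate directly into an annual downtime budget, which is often how stakeholders reason about the tiers. The arithmetic can be sketched as follows (the tier percentages used are the indicative values from the table):

```python
def annual_downtime_hours(availability_percent):
    """Convert an availability percentage into allowed downtime per year."""
    return (1 - availability_percent / 100) * 365 * 24

# Indicative availability values from Table 4-2.
for tier, availability in [("High", 98.99), ("Medium", 96.00), ("Low", 90.00)]:
    print(f"{tier}: {annual_downtime_hours(availability):.1f} hours/year")
# High: 88.5 hours/year
# Medium: 350.4 hours/year
# Low: 876.0 hours/year
```

Even the High tier here permits roughly 88 hours of downtime a year, which is why truly critical workloads typically also require clustering or load balancing rather than relying on the availability figure alone.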



2. Storage
Storage is classified into various tiers based on data service characteristics – performance and price.
Table 4-3 illustrates different tiers defined for storage[11].

Tier 0 – Mission-critical data for uninterrupted, disruption-free access and usage: response time <5 ms; price: High; storage media examples: NVMe, SSD.

Tier 1 – Frequently accessed (hot) data and high-performance workloads: response time <8 ms; price: High; storage media examples: SSD.

Tier 2 – Infrequently accessed (warm) data with short-term retention requirements: response time <12 ms; price: Medium; storage media examples: SAS.

Tier 3 – Archival and rarely accessed data with long-term retention requirements: response time <30 ms; price: Low; storage media examples: SATA.

Table 4-3: Storage Tiers

3. Backup/Restore
Table 4-4 gives indicative backup/restore service characteristics for each of the Compute service tiers[12].

Backup/Restore                                    Compute Service Tier
Service Characteristic                            HIGH         MEDIUM                  LOW
Recovery Point Objective                          < 12 hours   < 24 hours              < 48 hours
Recovery Time Objective                           < 12 hours   < 24 hours              < 48 hours
Backup Success Rate                               99%          97%                     95%
Backup Window                                     Anytime      Off-hours (8pm – 8am)   Off-hours (8pm – 8am)
Onsite short-term retention (STR) backup period   4 weeks      4 weeks                 2 weeks
Onsite long-term retention (LTR) backup period    7 years      7 years                 7 years
Off-site backup                                   Yes          Yes                     No
Monitoring & Support                              7x24         7x24                    7x24

Table 4-4: Backup/Restore service characteristics
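A backup scheduler can enforce the windows in Table 4-4 with a simple check. The sketch below encodes the indicative 8 pm to 8 am off-hours window from the table; the hour boundaries and tier names are those illustrative values, not a standard.

```python
# Backup windows per compute tier, per the indicative values in Table 4-4.
# None means a backup may start at any time.
BACKUP_WINDOWS = {
    "HIGH": None,        # anytime
    "MEDIUM": (20, 8),   # off-hours: 8 pm to 8 am
    "LOW": (20, 8),
}

def in_backup_window(tier, hour):
    """True if a backup may start at the given hour (0-23) for the tier."""
    window = BACKUP_WINDOWS[tier]
    if window is None:
        return True
    start, end = window
    # The window wraps past midnight: valid at/after start OR before end.
    return hour >= start or hour < end

print(in_backup_window("HIGH", 14))    # True
print(in_backup_window("MEDIUM", 14))  # False
print(in_backup_window("MEDIUM", 22))  # True
```

The wrap-around comparison (`or` instead of `and`) is the key detail for any window that crosses midnight.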



4. Disaster Recovery
Recovering from a data center outage requires careful planning, implementation, and regular testing
to ensure that the systems come back up based on pre-defined service characteristics. Table 4-5
provides an indicative set of DR service characteristics[13].

DR Service Characteristic    DR-High      DR-Medium    DR-Low
Recovery Point Objective     < 12 hours   < 24 hours   < 48 hours
Recovery Time Objective      < 12 hours   < 24 hours   < 48 hours

Table 4-5: Disaster Recovery service characteristics
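Operationally, the RPO figures in Table 4-5 can be checked against the age of the last successful replication to the DR site. A minimal sketch, assuming the indicative tier values from the table:

```python
# Indicative RPO ceilings (hours) per DR tier, from Table 4-5.
DR_RPO_HOURS = {"DR-High": 12, "DR-Medium": 24, "DR-Low": 48}

def rpo_met(tier, hours_since_last_replication):
    """True if the last replicated copy is recent enough for the tier's RPO."""
    return hours_since_last_replication < DR_RPO_HOURS[tier]

print(rpo_met("DR-High", 10))   # True
print(rpo_met("DR-High", 13))   # False
print(rpo_met("DR-Low", 13))    # True
```

Regular DR testing would alert whenever this check fails, well before an actual disaster exposes the gap.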

4.2 Develop Logical Technology and Physical Technology Architecture
[Figure content: Architectural Elaboration Process – the user/customer provides requirements; the architect captures the infrastructure requirements, solution options, assumptions, risks, and constraints into the Logical Technology Architecture (requirements, logical solution, ABBs & SBBs); with input from technology providers, the architect elaborates it into the Physical Technology Architecture (hardware & software, sizing & configuration), which the implementation team implements as the solution.]

Figure 4.2: LTA & PTA Architecting Process

The process to develop logical and physical technology architecture is shown in Figure 4.2. It involves
three steps – capture infrastructure requirements, prepare the logical technology architecture (LTA), and
prepare the physical technology architecture (PTA).

An architect assigned to prepare LTA & PTA would need to –

1. Gain a comprehensive knowledge of the conceptual technology architecture, including the IT strategy,
design principles, and technology standards and guidelines for the deployment environment.

2. Understand the infrastructure requirements of applications and other related components that need
to be deployed in discussion with key stakeholders.

3. Discuss multiple solution options and constraints with the stakeholders and arrive at an agreed
infrastructure solution that is fit for purpose.



Infrastructure Requirements: While the stakeholder requiring an infrastructure solution might have
sent a written requirements document, the infrastructure architect must discuss with the stakeholder
to comprehensively elicit both explicit and implicit requirements. The architect develops a good
understanding of –

1. Objectives of Project

2. Business Requirements

3. Technical Requirements and SLA

a. Special requirements of applications or third-party software

b. SLA requirements when solution is in production

4. Scope

5. System Context

a. Interfaces

b. Current state, if an existing solution needs to be enhanced

6. Performance Requirements

a. Concurrent users

b. Response time

7. Scalability Requirements

a. Peak volume/load

8. Sizing Considerations

a. Environments (DEV, TEST, UAT and PROD)

9. Application Criticality Tier

10. High-Availability Requirements

11. DR Requirements

12. Support Service Considerations

a. Network

b. Storage

c. Backup/Restore

d. Monitoring

e. Security
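For the sizing considerations above, a rough first cut of server count can be derived from the peak load and a high-availability policy. The numbers and the N+1 policy in the sketch below are illustrative assumptions, not a sizing standard; real sizing is done with vendors against measured workload profiles.

```python
import math

def size_servers(peak_concurrent_users, users_per_server, high_availability):
    """Rough first-cut server count: enough capacity for the peak load,
    plus one spare node (N+1) when high availability is required."""
    servers = math.ceil(peak_concurrent_users / users_per_server)
    if high_availability:
        servers += 1  # N+1 redundancy
    return servers

# Illustrative PROD sizing: 5,000 peak users, 800 users per server, HA required.
print(size_servers(5000, 800, high_availability=True))   # 8
print(size_servers(5000, 800, high_availability=False))  # 7
```

A calculation of this kind would typically be repeated per environment (DEV, TEST, UAT, PROD), with non-production environments sized smaller per the application criticality tier.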



Solution Options and Constraints: Based on the understanding of the CTA and the requirements gathered
in discussion with the stakeholder, the infrastructure architect works with infrastructure vendors
to identify options, size infrastructure components, and formulate multiple solution options keeping
in mind costs and benefits, to provide the required ROI to the sponsor of the solution. The architect
identifies assumptions, risks, and constraints and gets the stakeholder’s buy-in to move forward with
one solution option.

Logical Technology Architecture (LTA): For the chosen solution option, the infrastructure architect
clearly identifies the architecture building blocks (ABB) and solution building blocks (SBB) as described
by the TOGAF framework and formulates the LTA[14]. The architect also has reviews conducted and
refinements made, and ensures agreement with all stakeholders, including the implementation teams.
A list of all software and associated licenses is also prepared to ensure compliance.

Physical Technology Architecture (PTA): The finalization of LTA is the starting point for the development
of PTA. The focus of PTA is to define a physical technology architecture with specific hardware and
software, their setup, and configuration. For cloud (both private and public), other than a list of software
used, there would not be a need in most cases to develop a PTA, as there would not be a need to stand
up any hardware. The PTA is reviewed by all stakeholders, especially the implementation team. The
implementation team then implements the solution and delivers it to the stakeholder.

The rest of the chapters in the book describe the architectural aspects of the seven infrastructure
capabilities – compute, network, storage, backup and restore, disaster recovery, monitoring, and security.

References
[1] R. Lebeaux, “IT strategy (information technology strategy)”, https://searchcio.techtarget.com/definition/IT-strategy-information-technology-strategy.
[2] CIO Wiki, “IT Strategy (Information Technology Strategy)”, https://cio-wiki.org/wiki/IT_Strategy_(Information_Technology_Strategy).
[3] B. Brignell, “Design Principles - An open source collection of Design Principles and methods”, https://principles.design/.
[4] AWS, “AWS Well-Architected Framework”, https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html.
[5] ISO, “Standards”, https://www.iso.org/standards.html.
[6] Essential Project Documentation, “Technology Modelling - Defining Technology Capabilities”, https://enterprise-architecture.org/docs/technology_architecture/define_technology_capability/.
[7] CIO Wiki, “IT Capability”, https://cio-wiki.org/wiki/IT_Capability.
[8] J. Ferraro, “Three Top Tips for Successful Business Continuity Planning”, https://esj.com/Articles/2009/06/09/Business-Continuity.aspx?Page=1.
[9] Gartner Glossary, “Service-Level Agreement (SLA)”, https://www.gartner.com/en/information-technology/glossary/sla-service-level-agreement.
[10] S. Samy, “Service Criticality Tiers Standard and Architecture”, https://www.linkedin.com/pulse/service-criticality-tiers-standard-architecture-sherif-samy/.
[11] R. Sheldon, “tiered storage”, https://searchstorage.techtarget.com/definition/tiered-storage.
[12] IBM, “Backup and Restore: An Essential Guide”, https://www.ibm.com/cloud/learn/backup-and-restore.
[13] E. Sullivan, “disaster recovery (DR)”, https://searchdisasterrecovery.techtarget.com/definition/disaster-recovery.
[14] TOGAF 9.2, “33. Building Blocks”, https://pubs.opengroup.org/architecture/togaf9-doc/arch/chap33.html.



Chapter 5

Compute

[Figure: Primary and secondary data centers with representative infrastructure components – compute, network, storage, backup and restore, DR replication, monitoring, and security – with the compute components (Chapter 5) highlighted by a dashed outline.]

Figure 5.1: Data Centers with representative infrastructure components (focus of this chapter highlighted)

The infrastructure components on which applications are deployed represent the compute capability of
the infrastructure. In Figure 5.1, showing the data center deployments, the representative components
related to the focus of this chapter, namely compute capability, are highlighted (in a box with a dashed
outline). These are essentially server components.

There are three key platforms in a data center on which applications are deployed –

1. Mainframe running z/OS or Linux – large compute power[1].

2. Mid-range running AIX or IBM i (earlier called AS/400, iSeries, and System i) – compute power between mainframes and microprocessors.

3. x86 servers running Linux or Windows – commodity servers with microprocessors with relatively lesser compute power.

5.1 Mainframe running z/OS or Linux

The IBM mainframe continues to be a highly available and reliable server environment for applications.

The earlier versions of the IBM Mainframe (System/360, eServer zSeries, z9 & z10, zEnterprise System including
z196, zEC12) have evolved into the z13, z14, and z15 that support z/OS[2]. IBM has also launched LinuxONE,
currently the only Linux-only mainframe, which may be used as a private cloud solution or hosted in IBM
data centers.

[Figure: z/OS and Linux run on virtual processors defined under z/VM (a Type 1 hypervisor), which sits on a logical partition (LPAR); the LPARs are created by the Processor Resource/System Manager (PR/SM), which does physical hardware partitioning on IBM Z hardware (z13, z14, z15), the successor to System/360, eServer zSeries, z9 & z10, and the zEnterprise System (z196, zEC12).]

Figure 5.2 – IBM Mainframe Z: OS perspective

The operating system perspective for IBM Mainframe Z is shown in Figure 5.2[18].

1. The Processor Resource/System Manager (PR/SM) resides on the IBM Z hardware that does
physical partitioning.

2. Logical partitioning, LPAR, is created over the physical partitions.

3. z/VM is the hypervisor that is configured over the LPAR.

4. Virtual processors are defined using z/VM.

5. The operating system, z/OS, or Linux, is set up in the virtual processor.

6. z/OS and Linux can also be installed directly within an LPAR created by PR/SM.

Applications may be written using Fortran, COBOL, JCL, SQL, Assembler, CLIST, REXX, PL/I, C, and C++
(and languages supported by POSIX) and deployed on z/OS. The applications written with languages
supported by Linux may be deployed on the virtual processor (executed by CP), IFL, or on IBM LinuxONE.

The mainframe has a central processor (CP) and specialty engines. Workloads that run on CP are charged
software license costs, while those running on specialty engines are not. The monthly software license
charge (MLC) charges are based on usage of central processor measured in millions of service units
(MSU) per every hour in a month for each LPAR (or capacity group) considering peak usage and rolling
average computations.
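The rolling-average idea can be sketched in a few lines. This is an illustrative model only (function names are hypothetical): a common scheme keys on the peak of a rolling four-hour average of MSU per LPAR, though the actual IBM billing computation has more refinements.

```python
def rolling_four_hour_average(msu_by_hour):
    """4-hour rolling average over hourly MSU samples for one LPAR.

    For the first few hours, average over the hours seen so far.
    """
    averages = []
    for i in range(len(msu_by_hour)):
        window = msu_by_hour[max(0, i - 3):i + 1]
        averages.append(sum(window) / len(window))
    return averages


def monthly_peak_r4ha(msu_by_hour):
    """Peak of the rolling 4-hour average, the figure a
    rolling-average MLC scheme would charge on."""
    return max(rolling_four_hour_average(msu_by_hour))


# A short synthetic run of hourly MSU readings for one LPAR:
samples = [100, 120, 200, 400, 380, 150, 90, 80]
print(monthly_peak_r4ha(samples))  # 282.5
```

Note the effect: a one-hour spike to 400 MSU is smoothed by the rolling average, so the billed peak (282.5) stays well below the instantaneous peak.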

Key specialty engines on the mainframe are[3]:

• zIIP – a co-assist processor that offloads and executes instructions from eligible workloads.
Eligible database workloads can be run on this processor; it cannot be used for batch or CICS
programs. No MLC charges.

• zAAP – runs eligible Java workloads. Discontinued in System z13 and beyond.

• IFL – Integrated Facility for Linux; supports only Linux workloads. No MLC charges.

• ICF – internal coupling facility is a special LPAR that has requisite synchronizing software and
handles functions to share, cache, update, and balance data access across multiple processors.

• SAP – System Assist Processor that coordinates I/O subsystems. Multiple SAP engines may be
configured in a mainframe system.

5.1.1 Mainframe in Hosted Facility and Cloud


While z/OS hosting services on zSeries mainframe are available in IBM data centers, they are not
available, as of now, in the cloud. zLinux (Linux on z/OS) hosting services are available in IBM data
centers and in the cloud.

5.2 Mid-range running AIX or IBM i
The mid-range IBM systems are best suited for compute and data-intensive workloads like defense or
financial services.

The IBM System p with an AIX-based OS and the IBM System i with the IBM i operating system have evolved
into IBM Power Systems (based only on POWER chips), which support both AIX and IBM i[4].

[Figure: AIX and IBM i run in logical partitions (LPARs) over the POWER Hypervisor. The System p line (eServer pSeries, eServer p5, System p5, System p; mostly AIX-based) and the System i line (AS/400, eServer iSeries, eServer i5, System i5, System i; OS/400 and IBM i based) have converged into IBM Power Systems, based only on POWER chips.]

Figure 5.3 – IBM Power Systems: OS perspective

The IBM i operating system is bundled with the database (DB2 for i), which is therefore not installed or
charged separately[5]. It is a turnkey solution system where most basic systems come bundled with the
operating system. IBM i and OS/400 are ‘object-based’ operating systems, unlike UNIX, Linux, and Windows,
which are ‘file-based’ operating systems.

The operating system perspective for IBM Power Systems is shown in Figure 5.3[19].

1. On IBM Power Systems, the POWER Hypervisor resides directly on the hardware.
2. Logical partitions (LPARs) are defined using the POWER Hypervisor.
3. The operating system, AIX or IBM i, is set up in each LPAR.

Applications written with C, C++, Java, Python, IBM COBOL for AIX, Fortran, Perl, PHP, REXX, SQL,
and so on may be deployed on logical partitions (LPARs) for AIX on IBM Power Systems. Likewise,
LPARs may also be defined on IBM Power Systems for IBM i for deploying applications written using
IBM i Control Language, RPG, COBOL/400, COBOL, C, Java, CLP, PHP, Node.js, C++, Ruby, Orion, Python,
Fortran, and so on.

5.2.1 Mid-range on Cloud


IBM i and AIX hosting services are currently available in IBM data centers and on the cloud through
their IBM Power Virtual Server offering[6]. Additionally, they are available as cloud services from IBM
partner, Skytap[7].

5.3 x86 servers running Linux or Windows


There are two types of x86 servers that are typically deployed in a data center. They are[8]:

1. Physical Server: It is a physical computer dedicated to a single tenant, i.e., with dedicated compute,
memory, and disk storage for the user of the system. It is also called “bare-metal server” and is
relatively more expensive as all the server resources are dedicated to one tenant. This type of server
is used when –

a. Special requirements need to be addressed that require specialized hardware to be plugged


into the server.

b. Application server requires specific configurations that are not well-supported by virtual
environments.

2. Virtual Server (or virtual machine): It is a “compute resource that uses software instead of
a physical computer to run programs and deploy apps”[9]. It is the server on which an operating system
(Linux or Windows) is installed before deploying the applications.

5.3.1 Virtualization
Virtualization is the process of dividing a physical machine into multiple unique and isolated units called
virtual machines (VM) using virtualization software[10]. A virtualization software (for instance, VMware
vSphere) is installed on the physical machine(s) that enables virtual machines’ creation.

5.3.2 Hypervisors
A component of virtualization software called hypervisor makes it possible to create virtual machines
on the same physical machine[11]. There are two types of hypervisors:

1. Type 1 – Bare Metal Hypervisor: It runs directly on the physical machine and acts as a lightweight
operating system. ESXi Hypervisor from VMware and Hyper-V Hypervisor from Microsoft are
examples of this type of hypervisor.

When a server is virtualized with a bare-metal hypervisor such as ESXi (or Hyper-V), it is called a
host (or a node). These hosts can be configured as a cluster (e.g., VMware ESXi cluster) for VMs to

be moved around in case of failures of hosts (high availability) or to manage the changing load on
the cluster (load balancing). The hosts configured as a cluster share resources such as processor,
memory, storage, and network.

Note 1: vSphere HA is a feature that enables high availability by restarting the failed VMs on other
ESXi hosts that have spare capacities. Likewise, the vSphere DRS feature enables load balancing by
treating the resources of all ESXi hosts as a global pool and automatically migrates VMs to different
ESXi hosts. A cluster has shared storage for all its ESXi hosts that maintain virtual machine disk
(VMDK) files accessible to all the VMs in hosts within the cluster.

Note 2: The storage used by a VM is stored as a file with a .vmdk extension. The format of the file is
virtual machine disk (VMDK).

2. Type 2 – Hosted Hypervisor: It runs as a software layer on an operating system. Examples are
Oracle VM VirtualBox and Microsoft Virtual PC.

5.3.3 Servers for Virtualization


Since many servers are deployed in virtualization environments, vendors provide specific configurations
to optimize space and power[12].

1. Rack servers: These fit into server racks and are suited for intensive computing operations. They
are self-sufficient servers with their own hardware, including memory, RAID controller, data drives,
power supply, and cooling unit.

2. Blade servers: These fit into server chassis, which provides space and power. Cabling is also optimal.
Blade servers take up less space and consume moderate power while providing high processing
power. They are also hot-swappable, and that feature improves serviceability.

Several important concepts must be considered when specifying an x86 server for the solution.

5.3.4 Processor, Memory, and Benchmarks


Key considerations of the components of server that would have a bearing on the capability of a server are
as follows[13]:

1. Processor:

a. A processor (CPU) is a physical component that provides central processing unit capability to
a server. There may be more than one processor in a server.

b. A core is an operation unit within the processor. A single processor may have multiple physical cores.

c. A socket is an array of pins on the motherboard to hold a processor.

d. Many processors use a hyperthreaded model. A thread is a unit of execution in a process that is
executed by a core (in a processor)[14]. Multithreading enables the core (in a processor) to execute
several threads, each running a task of a process concurrently. Hyperthreading runs processes
in parallel by making a single physical processor core available to the operating system as
two “logical” cores. The operating system schedules processes on the two “logical” cores in

a multi-processor system as it does on two physical cores[15]. With hyperthreading, effectively, the
total cores are doubled. Figure 5.4 provides the physical and operating system perspective.

e. Virtual processors (vCPUs) are assigned to a VM by the virtualization software. Each vCPU
represents the portion of the physical processor that is allocated to the VM.

2. Memory: In addition to the amount of memory in the server, it is important to analyze the type of RAM
(SRAM, DRAM) in the specific model of the server and the amount of Cache memory (L1/L2/L3).

3. Virtualization: As part of virtualization, the hypervisor can overprovision vCPUs per core in a ratio
higher than 1:1, such as 3:1 or 4:1, effectively making a higher number of vCPUs available.

Illustration: Consider a server with one socket and a processor with 4 cores.

a. Processor view: Total number of physical cores = 1X4 = 4 physical cores.


b. Operating System view: With hyperthreading, the total number of logical cores = 2X4 = 8 cores.
c. Virtualization view: With 4:1 overprovisioning, total vCPUs = 32.
d. These 32 vCPUs can then be allocated to VMs using virtualization software such as vCenter from
VMware.
Figure 5.4 depicts the processor view, operating system view, and virtualization view for the
above illustration.

[Figure: Processor view – one socket holding a processor with 4 physical cores, each core running two hyperthreads. Operating system view – 8 logical cores (CPU 0 to CPU 7). Virtualization view – 32 vCPUs (vCPU 0 to vCPU 31) with 4:1 overprovisioning.]

Figure 5.4: Processor view-operating system view-virtualization view
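The arithmetic of the illustration can be captured in a small helper (an illustrative sketch with hypothetical names; real sizing must also reserve headroom for hypervisor overhead):

```python
def total_vcpus(sockets, cores_per_socket, hyperthreading=True,
                overprovision_ratio=1):
    """vCPUs a hypervisor can hand out on one host:
    physical cores -> logical cores (x2 with hyperthreading)
    -> vCPUs (x overprovision ratio)."""
    physical_cores = sockets * cores_per_socket
    logical_cores = physical_cores * (2 if hyperthreading else 1)
    return logical_cores * overprovision_ratio


# The illustration above: 1 socket x 4 cores, hyperthreaded, 4:1
print(total_vcpus(1, 4, hyperthreading=True, overprovision_ratio=4))  # 32
```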

4. Server expandability: The type and number of expansion slots, ports, dedicated server storage, and
other components determine a server's expandability. For instance, a server might offer only a single
PCIe 2.0 slot, or provide two PCIe slots, one PCIe 2.0 and another PCIe 3.0. More slots enable greater
scalability at peak loads.

5. Security-related cryptography features: Some servers have secure data encryption built-in as a
hardware feature using a crypto co-processor to carry out cryptographic operations.

6. Processor clock speed: The faster the clock speed, the more instructions can be executed per
second, and the more quickly applications run. However, increasing the number of cores on a processor
might require lowering the clock speed, allowing more applications to run simultaneously with each
running a little more slowly. Thus, a balance must be achieved based on the type of workload
(compute-intensive vs. I/O-intensive).

The above concepts need to be kept in mind when arriving at the server configuration.

Intel and AMD processors for x86 servers have traditionally used a CISC architecture.
The Intel Xeon processor is currently widely used in server models of several vendors.

5.3.5 Virtual Server Options for Cloud


All public cloud platforms offer compute as infrastructure-as-a-service (IaaS) and platform-as-a-service
(PaaS) services. They also provide compute as serverless and container services. In addition, they also
offer a private cloud deployment option. Table 5-1 provides an overview of key compute services from
major cloud platform providers[16].

Cloud Platform | Amazon Web Services (AWS) | Microsoft Azure | Google Cloud Platform (GCP)
IaaS Services | EC2; EC2 Auto Scaling; Lightsail; VMware Cloud on AWS | Virtual Machines; Virtual Machine Scale Sets | Compute Engine
PaaS Services | Elastic Beanstalk | App Service and Cloud Services | App Engine
Serverless compute services | Lambda; Fargate; Serverless Application Repository | Azure Functions | Cloud Functions; Knative
Container services | Elastic Container Service (ECS); Elastic Kubernetes Service (EKS) | Azure Kubernetes Service (AKS); Azure Service Fabric; Azure Cloud Services; Azure Container Instances | Google Kubernetes Engine (GKE); Container Registry
Private Cloud | AWS Outposts | Microsoft Azure Stack | Google Distributed Cloud Hosted

Table 5-1: Key compute services for AWS, Azure, and GCP

5.4 Compute Characteristics
Tiers are defined for different service characteristics. Typically, there are three tiers – HIGH, MEDIUM,
and LOW. An indicative table with Compute service tiers is given in Table 5-2.

Tier | Platform OS | Availability (%) | Monitoring | Clustering/Load Balanced | DR Tier
High | Windows, Linux, IBM i, AIX, z/OS | 98.99 | Premium | Yes | High
Medium | Windows, Linux, IBM i, AIX, z/OS | 96.00 | Advanced | No | Medium
Low | Windows, Linux, IBM i, AIX, z/OS | 90.00 | Basic | No | Low

Table 5-2 – Compute service tiers

The Compute service tiering, in general, aligns with the tiers of application criticality described in
Table 4-1[17].
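One way to make the availability column of Table 5-2 concrete is to translate each percentage into permitted downtime. The sketch below assumes a 30-day month; the function name is illustrative:

```python
def allowed_downtime_minutes(availability_percent, period_hours=30 * 24):
    """Maximum downtime (in minutes) consistent with an availability
    percentage over a period (default: a 30-day month)."""
    return (1 - availability_percent / 100) * period_hours * 60


for tier, pct in [("High", 98.99), ("Medium", 96.00), ("Low", 90.00)]:
    print(f"{tier}: about {allowed_downtime_minutes(pct):.0f} minutes/month")
```

So a High-tier service at 98.99% may be down roughly 7 hours a month, while a Low-tier service at 90% may be down three full days.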

References
[1] IBM, “Mainframe solutions”, https://www.ibm.com/it-infrastructure/mainframes.
[2] IBM, “Mainframe - Family tree and chronology”, https://www.ibm.com/ibm/history/exhibits/mainframe/mainframe_FT1.html.
[3] H. Rama, “MAINFRAME SPECIALTY ENGINES”, https://www.cmg.org/2017/07/mainframe-specialty-engines/.
[4] IBM, “A Brief History of the IBM AS/400 and iSeries”, https://www.ibm.com/ibm/history/documents/index.html.
[5] IBM, “DB2 for i Frequently Asked Questions”, https://www.ibm.com/downloads/cas/1DAL4A8G.
[6] IBM, “AIX & IBM i POWER on IBM Cloud”, https://cloud.ibm.com/catalog/services/power-systems-virtual-server.
[7] Skytap, “Using Power VMs in Skytap”, https://help.skytap.com/kb-using-power-vms.html.
[8] B. Lee, “Physical server vs Virtual machine: The Choice is open”, https://www.vembu.com/blog/physical-server-vs-virtual-machine-choice-open/.
[9] VMware, “Virtual Machine”, https://www.vmware.com/topics/glossary/content/virtual-machine.
[10] VMware, “Server Virtualization”, https://www.vmware.com/topics/glossary/content/server-virtualization.
[11] VMware, “Hypervisor”, https://www.vmware.com/topics/glossary/content/hypervisor.
[12] Serverstack, “Difference between Rack servers and Blade Servers”, https://www.serverstack.in/2019/01/19/difference-between-rack-servers-and-blade-servers/.
[13] U. Panda, “How to decide VMware vCPU to physical CPU ratio”, https://www.cloudpanda.org/blogs/how-to-decide-vmware-vcpu-to-physical-cpu-ratio.
[14] R. Bauer, BackBlaze, https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/.
[15] Wikipedia, “Hyper-Threading”, https://en.wikipedia.org/wiki/Hyper-threading.
[16] IntelliPaat, “AWS vs Azure vs Google Cloud: Choosing the Right Cloud Platform”, https://intellipaat.com/blog/aws-vs-azure-vs-google-cloud/.
[17] S. Samy, “Service Criticality Tiers Standard and Architecture”, https://www.linkedin.com/pulse/service-criticality-tiers-standard-architecture-sherif-samy/.
[18] IBM, “ZOS Mainframe concepts”, https://www.ibm.com/docs/en/zos-basic-skills?topic=concepts-mainframe-hardware-evolving-design.
[19] IBM, “Introduction to IBM Power Virtualization Management e-Learning (text only)”, https://www.ibm.com/docs/en/power-sys-solutions/0008-ESS?topic=P8ESS/p8eew/elearning/powervm_script.html.

Chapter 6

Network

[Figure: Primary and secondary data centers with representative infrastructure components – compute, network, storage, backup and restore, DR replication, monitoring, and security – with the network components (Chapter 6) highlighted by a dashed outline.]

Figure 6.1: Data Centers with representative infrastructure components (focus of this chapter highlighted)

The network capability is foundational to the data center. In Figure 6.1, showing the data center deployments,
the representative components related to the focus of this chapter, namely network capability, are highlighted
(in a box with a dashed outline). All other infrastructure components sit on the network and interact over the
network. The network architecture concepts are equally important in the cloud, except that there is no
need to stand up any infrastructure related to them; they may all be configured using the services
provided by cloud providers. Other than organizations whose products/services run only on the public
cloud, all others need to deal with network components in both the data center and the cloud, and with
their connectivity solutions. This chapter presents the key network architecture concepts.

Note: It may be noted in Figure 6.1 that the network cables run across the data centers and are
connected to all infrastructure components. While they contribute to network capability, they are not
shown highlighted with a dashed outline to avoid cluttering the figure with unnecessary detail.

Key components of a data center network: LAN & WAN, VLAN, subnetwork, DMZ, firewall, switch, router,
load balancer, forward proxy/reverse proxy, NAT.

6.1 Network Basics

This section touches on basic concepts of networking. A data center network is a set of firewalls,
routers, switches, and several other network components wired together through fiber-optic or copper
cables. The network components use a set of protocols to communicate over these cables.

6.1.1 OSI Model

The communication protocols are based on the 7-Layer OSI model summarized in Table 6-1[1].

Layer | Purpose | Example protocols
Application | Communication services for use by end-user applications. | FTP
Presentation | Data formatting for presentation, including encryption and decryption. | HTTPS, SSL
Session | Session setup, coordination, and termination between the applications. | NetBIOS
Transport | Data transfer between end systems and hosts. | TCP, UDP
Network | Routing of data packets based on IP address. | IP; Layer 3 switches, routers
Data Link | Node-to-node data frame transfer based on MAC address. | Layer 2 switches
Physical | Physical link (wired or wireless) for communication. | Layer 1 hubs, NICs, cables

Table 6-1: 7 Layers of OSI Model

Layers 1, 2, and 3 are particularly important when defining infrastructure architecture for deploying new
network solutions.

6.1.2 LAN and WAN
A local area network (LAN) is a computer network that connects computers and other devices in a small
area. LANs typically use Ethernet and provide high data transfer rates – Fast Ethernet at 100 Mbps,
or Gigabit Ethernet at 1/10/40/100 Gbps.

A wide area network (WAN) is a network that provides connectivity across multiple regions. WANs are
of many types –

a) Site to site (or point to point) connected through leased lines.

b) Multisite connected using MPLS (Multiprotocol Label Switching) – MPLS establishes private
connection linking data centers and branch offices by directing data through a path via labels.

c) Software-based SD-WAN – SD-WAN is a software-defined wide area network that allows multisite
traffic to traverse on MPLS or less-costly internet links based on the criticality of the traffic. Encryption
is used for traffic that is sent via internet links.

6.1.3 Virtual LAN (VLAN)


A VLAN is a virtual network formed through the logical grouping of devices on a LAN. The devices are
typically switches, and the LAN is Ethernet. With a LAN, a network packet is received by all devices on it.
With a VLAN, the network packet is sent only to the specific set of devices that constitute what is called
a broadcast domain. VLANs partition the network at Layer 2.

LANs have historically also used protocols such as Token Ring and FDDI (Fiber Distributed Data Interface).
VLANs use the IEEE 802.1q and Inter-Switch Link (ISL) protocols[2]. The Ethernet LAN represents the
collision domain on which Ethernet (CSMA/CD) frames collide. A VLAN represents the broadcast domain,
i.e., a group of devices configured to receive broadcast traffic (Layer 2 data link frames) from one
another. Without VLANs, a broadcast message sent from a host reaches all network devices, increasing
CPU overhead on each device and reducing overall network security. With VLAN configuration, a broadcast
from the host is limited to devices on the VLAN.

A VLAN can be created from one or multiple LANs. It enables the network administrator to segment a
specified group of servers/desktops into isolated network segments and limit access between them. Two
types of VLANs are commonly in use[3] –

1. Port-based (Untagged) VLANs: A single physical switch is simply split into multiple logical switches.

2. Tagged VLANs: Multiple VLANs use a single switch port. Tags are attached to the individual
Ethernet frames as they exit the port. Tags contain the VLAN identifiers specifying the VLAN to
which the frame belongs. When both switches understand tagged VLANs, the connection can be
accomplished using a single cable between what are called “trunk” ports.

Illustration – Port-based (Untagged) VLAN
1. One Switch S1 – Figure 6.2

• All the servers have been connected to one physical switch. However, only the following servers
can communicate with each other due to the configuration of the VLAN.

• VLANs

– VLAN 1 – Server P1 with Server P2.

– VLAN 2 – Server P8 with Server P9.

[Figure: Eight servers connected to the ports of one physical switch S1; VLAN 1 groups Servers P1 and P2, and VLAN 2 groups Servers P8 and P9, on the same switch.]

Figure 6.2: Port-based VLAN – One Switch, Two VLANs

2. Two Switches S1 and S2 – Figure 6.3

[Figure: Servers P1, P2, P8, and P9 connected to Switch S1, and Servers Q1, Q2, Q8, and Q9 connected to Switch S2, with one inter-switch cable per VLAN.]

Figure 6.3: Port-based VLAN – Two Switches, Two VLANs

• 4 Servers are connected to Switch S1, and the other 4 servers to Switch S2, as shown.

• Both VLANs are configured on both physical switches, and since it is a port-based VLAN configuration,
one cable per VLAN is required between the switches. Therefore, two cables are required to connect both VLANs.

– One cable from Switch S1 Port 4 to Switch S2 Port 4 for VLAN 1.

– One from Switch S1 Port 8 to Switch S2 Port 8 for VLAN 2.

• Only the following servers can communicate with each other due to the configuration of
the VLAN.

• VLANs

– VLAN 1 - Server P1, Server P2, Server Q1, Server Q2.

– VLAN 2 – Server P8, Server P9, Server Q8, Server Q9.
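The broadcast-domain behavior in these illustrations can be modeled as a toy membership check (purely illustrative; the sets below mirror the two-switch example, not any real switch configuration):

```python
# VLAN ID -> servers in that broadcast domain (two-switch illustration)
vlans = {
    1: {"P1", "P2", "Q1", "Q2"},
    2: {"P8", "P9", "Q8", "Q9"},
}


def can_communicate(server_a, server_b):
    """True if both servers sit in the same VLAN (broadcast domain)."""
    return any(server_a in members and server_b in members
               for members in vlans.values())


print(can_communicate("P1", "Q2"))  # True: both in VLAN 1, even across switches
print(can_communicate("P1", "P8"))  # False: different VLANs, same switch
```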

Illustration – Tagged VLAN

[Figure: The same two switches, S1 and S2, connected by a single cable between trunk ports; the trunk carries traffic for both VLANs.]

Figure 6.4: Tagged VLANs – Two Switches

• Figure 6.4 shows 4 Servers connected to Switch S1 and the other 4 servers to Switch S2.

• One single physical connection is established between the two physical switches. In this
illustration, both ports S1-8 and S2-8 are configured as “trunk” ports and will carry traffic for
both VLANs.

• VLAN tags (IEEE 802.1q) are used for VLAN1 and VLAN2. Tags allow for separation of VLAN1
and VLAN2 traffic without the need for physical separation.

• VLAN tags are set as traffic exits a Switch S1 port, so Switch S2 also needs to understand
802.1q tags, because inserting a tag changes the Ethernet frame.
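The tag itself is small: 4 bytes carrying the 0x8100 TPID followed by a 16-bit TCI that packs the priority (3 bits), drop-eligible indicator (1 bit), and VLAN ID (12 bits). A minimal encoder, for illustration only (not a production networking tool):

```python
import struct


def dot1q_tag(vlan_id, priority=0, dei=0):
    """Build the 4-byte IEEE 802.1q tag inserted into an Ethernet
    frame after the source MAC address."""
    if not 0 <= vlan_id <= 4095:
        raise ValueError("VLAN ID must fit in 12 bits")
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", 0x8100, tci)  # network byte order


print(dot1q_tag(2).hex())  # 81000002
```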

Subnetwork: Subnetwork or subnet is a logical network partition at Layer 3 of the OSI model. At Layer 3,
i.e., at the Network layer, each computer or host has at least one IP address as a unique identifier. The
Internet Protocol (IP) is used for sending data from one computer to another over the internet[4].

A subnet is defined to subdivide a large IP network into smaller, more efficient subnetworks. Dividing
a large network into smaller, interconnected networks minimizes the broadcast traffic on a single
network segment, thereby improving available network bandwidth. It also optimizes the usage of
available IP address space.

Subnet masks split an IP address into bits that identify the network and host parts. When a device sees
the network identification and host identification bits of another device’s IP address, it can determine if
it is part of the same network or some other network. Figure 6.5 shows the structure of the IP address
and the network identification bits and host identification bits.

[Figure: the 32-bit IP address 168.8.252.2 (10101000.00001000.11111100.00000010) is split into
network identification bits and host identification bits; after subnetting, some of the host
identification bits become subnet identification bits.]

Figure 6.5: IP Address and Subnetting
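The mask-and-compare logic described above can be demonstrated with Python's standard `ipaddress` module. The sketch below applies a subnet mask to two addresses and checks whether their network identification bits match; the second address in each call is an illustrative neighbor, not taken from the figure:

```python
import ipaddress

def same_network(ip_a: str, ip_b: str, mask: str) -> bool:
    """Apply the subnet mask to both addresses and compare the network bits."""
    net_a = ipaddress.ip_network(f"{ip_a}/{mask}", strict=False)
    net_b = ipaddress.ip_network(f"{ip_b}/{mask}", strict=False)
    return net_a == net_b

# 168.8.252.2 (the example address in Figure 6.5) with a 255.255.255.0 mask:
assert same_network("168.8.252.2", "168.8.252.200", "255.255.255.0")      # same subnet
assert not same_network("168.8.252.2", "168.8.251.2", "255.255.255.0")    # different subnet
```

This is exactly the test a host performs to decide whether to deliver a packet locally or send it to its gateway.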

Subnetting is the segmentation of a network address space. It allows its connected devices to
communicate with each other. Routers are used to communicate between subnets. Subnet masks
specify the range of IP addresses used within a subnet. Two types of subnet specifications have been
in existence[5]:

1. Classful: Three classes of subnets have been defined as summarized in Table 6-2. Older and
less used.

Class     Subnet Mask   Mask Format      Host bits   Number of Hosts         Fixed leading bits   Number of Networks
Class A   8-bit         255.0.0.0        24          2^24 – 2 (=16777214)    1                    2^(8-1) (=128)
Class B   16-bit        255.255.0.0      16          2^16 – 2 (=65534)       2                    2^(16-2) (=16384)
Class C   24-bit        255.255.255.0    8           2^8 – 2 (=254)          3                    2^(24-3) (=2097152)

Table 6-2: Classful subnet specification

Classes disproportionately distribute the number of available IP addresses.
2. Classless: Classless Inter-Domain Routing (CIDR) notation is used to specify a subnet, as
summarized in Table 6-3. A trailing “/” slash and a number are used that specify how many bits are
used to identify the network portion of the address. Currently widely used.

CIDR Notation   Available Hosts              Subnet Mask
/8              2^(32-8) – 2 (=16777214)     255.0.0.0
/9              2^(32-9) – 2 (=8388606)      255.128.0.0
/10             2^(32-10) – 2 (=4194302)     255.192.0.0
/11             2^(32-11) – 2 (=2097150)     255.224.0.0
/12             2^(32-12) – 2 (=1048574)     255.240.0.0
/13             2^(32-13) – 2 (=524286)      255.248.0.0
/14             2^(32-14) – 2 (=262142)      255.252.0.0
/15             2^(32-15) – 2 (=131070)      255.254.0.0
/16             2^(32-16) – 2 (=65534)       255.255.0.0
/17             2^(32-17) – 2 (=32766)       255.255.128.0
/18             2^(32-18) – 2 (=16382)       255.255.192.0
/19             2^(32-19) – 2 (=8190)        255.255.224.0
/20             2^(32-20) – 2 (=4094)        255.255.240.0
/21             2^(32-21) – 2 (=2046)        255.255.248.0
/22             2^(32-22) – 2 (=1022)        255.255.252.0
/23             2^(32-23) – 2 (=510)         255.255.254.0
/24             2^(32-24) – 2 (=254)         255.255.255.0
/25             2^(32-25) – 2 (=126)         255.255.255.128
/26             2^(32-26) – 2 (=62)          255.255.255.192
/27             2^(32-27) – 2 (=30)          255.255.255.224
/28             2^(32-28) – 2 (=14)          255.255.255.240
/29             2^(32-29) – 2 (=6)           255.255.255.248
/30             2^(32-30) – 2 (=2)           255.255.255.252
Table 6-3: Classless subnet specification

Illustration: A /20 would indicate that 20-bits are used to identify the network, and the remaining
12-bits are used to identify the host.
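The same arithmetic can be verified with the standard `ipaddress` module; the 10.1.16.0/20 network used here is an arbitrary example:

```python
import ipaddress

# A /20: 20 network bits, leaving 32 - 20 = 12 host bits.
net = ipaddress.ip_network("10.1.16.0/20")
host_bits = net.max_prefixlen - net.prefixlen
usable_hosts = 2 ** host_bits - 2        # minus network and broadcast addresses
print(net.netmask, host_bits, usable_hosts)   # 255.255.240.0 12 4094
assert str(net.netmask) == "255.255.240.0"
assert usable_hosts == 4094              # matches the /20 row of Table 6-3
```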

6.1.4 Basic Network Diagram
A basic network diagram is shown in Figure 6.6. It shows a router and a DMZ with several infrastructure
components bounded by two Firewalls. A switch, three web servers, an email server, and a DNS server
are shown as part of the DMZ. A switch, two application servers, and a database server are shown
beyond the internal firewall.

Demilitarized Zone (DMZ): A demilitarized zone (DMZ) is a network segment that prevents outside users
from gaining direct access to an organization’s internal network. It represents a “neutral zone” between
the internet and an enterprise’s intranet.

Illustration – Basic Network Diagram


In the diagram in Figure 6.6, the network segment between the external and internal firewalls is the
DMZ. In the DMZ are a switch, a load balancer with three web servers, an email server, and a DNS server
that are not directly exposed to the internet. However, external devices and software can access them
over restricted ports (e.g., Port 80). The servers in the DMZ have access to the application servers and
database server, which are beyond the internal firewall, over specific ports.


Figure 6.6: Basic Network Diagram

Firewall: “A firewall is a network device that monitors and controls incoming and outgoing network
traffic”[6]. A firewall controls traffic based on security rules specified in its configuration. It constitutes
a barrier between a trusted network and an untrusted network. Firewalls secure both LAN and WAN
environments and are of two types.

a. Traditional firewall: controls incoming or outgoing traffic at a point within the network. It tracks
traffic, typically in Layers 2 – 4 of the OSI model. Both stateless (inspects each packet in isolation)
and stateful (applies intelligence and keeps track of the entire cycle of a connection flow) methods
may be employed by the firewall[7].

b. Next-Gen firewall: application-aware, recognizes user of application through inspection of traffic,
blocks malware, provides integrated IPS, performs deep packet inspection, and recognizes and
decrypts SSL and SSH. It tracks traffic, typically in Layers 2 – 7 of the OSI model.
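First-match rule evaluation, common to both firewall types, can be illustrated with a toy stateless packet filter. The rule table, field names, and ports below are illustrative assumptions, not any vendor's configuration syntax:

```python
# Each rule: (action, protocol, destination port); "*" matches anything.
# Rules are evaluated top-down; the first match wins.
RULES = [
    ("allow", "tcp", 80),      # permit web traffic into the DMZ
    ("allow", "tcp", 443),
    ("deny",  "*",   "*"),     # default rule: drop everything else
]

def filter_packet(protocol: str, dst_port: int) -> str:
    """Return the action of the first matching rule (stateless filtering)."""
    for action, proto, port in RULES:
        if proto in ("*", protocol) and port in ("*", dst_port):
            return action
    return "deny"              # implicit deny if no rule matched

assert filter_packet("tcp", 80) == "allow"
assert filter_packet("tcp", 22) == "deny"     # SSH stopped by the default rule
assert filter_packet("udp", 53) == "deny"
```

A stateful firewall would additionally remember established flows, so return traffic for an allowed outbound connection is permitted without a separate inbound rule.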

Router: A router forwards data from one network to another. It is a Layer 3 device used extensively to
forward internet traffic. There are two types of routing that may be performed.

a. Static Routing: A route table is created and maintained by a network administrator manually on
a router.

b. Dynamic Routing: A route table is created and maintained by routing protocol on a router. Commonly
used routing protocols include RIP (Routing Information Protocol), EIGRP (Enhanced Interior
Gateway Routing Protocol), and OSPF (Open Shortest Path First). While routers share dynamic
routing information with each other, the use of routing protocol brings enhanced routing capabilities
by dynamically choosing an optimal path when there are changes to network infrastructure.

Load balancer: A load balancer is a device that sits between the clients and servers and efficiently
distributes client requests to servers. The primary function of the load balancer is to provide high
availability for the hosted application. Load balancers may act at Layer 4 (IP, TCP, FTP, UDP) or
Layer 7 (HTTP). They may also be external facing handling requests from external sources or
internal facing. In Figure 6.6, a load balancer is shown distributing requests to the three web servers
at Layer 7.

Switch: Network switches connect devices on a network by receiving data from one device and
forwarding it to another device. Layer 2 network switches operate at the data link layer (OSI layer 2),
inspect frames, and use MAC addresses to forward data. Multilayer switches perform all functions that
Layer 2 switches do. Layer 3 network switch is one type of multilayer switch that forwards data using
destination IP address. Multilayer switches can perform routing functions, including static routing and
dynamic routing. Multilayer switches can inspect deeper into the protocol stack.

6.1.5 Address Translation


Address translation is done in networks to ensure that external users do not learn the IP addresses
of internal servers. A NAT device performs network address translation and hides the IP addresses of
internal servers from external clients. There is another reason to use NAT: private IP addresses are
not routable on the internet, so hosts with private IP addresses cannot communicate directly with
external internet addresses. Internet-routable public IP addresses incur costs to an organization, and
hence an organization would want to minimize the number of public IPs. To address these requirements,
organizations use private IP ranges on the internal network and deploy NAT on the perimeter devices to
translate private IPs to a public IP for communication over the internet.

Network Address Translation (NAT): Network address translation is the process by which IP addresses
within a data packet are replaced with different IP addresses. Either routers or firewalls perform
this process. Assume the router shown in Figure 6.6 to be NAT capable. The LAN side IP address
is 192.168.0.1, and the internet side IP address is 202.29.120.110. For any packet being sent to the
internet, the NAT would change the IP address field of the sender in the packet to 202.29.120.110 when

Chapter 6: Network 52
sending the packet to the internet to hide the IP addresses of internal servers. For all the clients/servers
on the internet, only IP 202.29.120.110 is visible.

Then the question arises – how does the router correctly send the responses from servers on the internet
to the requesting client on the internal network and vice versa? It works based on the combination of
IP address and port number for each client communicating from the internal network. The NAT device
maintains a mapping of internal IP addresses and port numbers on which data is sent by internal clients
and the IP addresses of corresponding external servers while performing address translation. NAT hides
the internal device details from external clients/servers in the process.

Illustration – Network Address Translation


Consider Figure 6.7, which shows the basic network diagram extended to include a client on an internal
network and a server on an external network.

1. The client on the internal network initiates a request to the router to connect to 148.211.63.19, port
18 and asks it to respond to IP address 192.168.0.16, port 23767.

2. The router sends a request to the remote server, 148.211.63.19, port 18, and indicates that responses
be provided on IP address 202.29.120.110, port 32122.

3. When data comes back from the server, the router accepts it on the response-port number 32122
provided by it to the server.

4. The router then passes the data to the client on the internal network, the IP address 192.168.0.16,
and response-port 23767 that the client provided to the router when it started the conversation.
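The four steps can be sketched as a translation table keyed by the public-side port. The addresses and ports follow the illustration; the sequential port allocation is a simplifying assumption:

```python
import itertools

PUBLIC_IP = "202.29.120.110"

class Nat:
    """Toy NAT: maps internal (ip, port) pairs to public-side ports."""

    def __init__(self):
        self._ports = itertools.count(32122)   # next free public-side port
        self.table = {}                        # public port -> (internal ip, port)

    def outbound(self, src_ip: str, src_port: int):
        """Steps 1-2: rewrite the sender to the public IP and a mapped port."""
        public_port = next(self._ports)
        self.table[public_port] = (src_ip, src_port)
        return PUBLIC_IP, public_port

    def inbound(self, dst_port: int):
        """Steps 3-4: look up the mapping and forward to the internal client."""
        return self.table[dst_port]

nat = Nat()
pub_ip, pub_port = nat.outbound("192.168.0.16", 23767)    # client -> 148.211.63.19:18
assert (pub_ip, pub_port) == ("202.29.120.110", 32122)    # all the server ever sees
assert nat.inbound(32122) == ("192.168.0.16", 23767)      # response routed back
```

The external server only ever sees the public IP and mapped port, which is how NAT hides the internal device details.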


Figure 6.7: Network Address Translation (NAT)

6.1.6 Proxy
In simple terms, a proxy acts on behalf of someone or something else. There are two types of
network proxy:

1. Forward Proxy: A forward proxy (or simply proxy) is an intermediary server that forwards requests
on behalf of multiple clients to an external network. It is typically placed in the DMZ to forward
requests from an isolated, internal network to the internet through a firewall. Forward proxy hides
internal client IPs from devices on external network. Firewalls can also perform such functions and
may be used for that purpose.

2. Reverse Proxy: A reverse proxy is an intermediary server that accepts requests on behalf of multiple
servers. It is also placed in the DMZ. Reverse proxy hides IPs of internal servers from external clients.

Figure 6.8 depicts Forward and Reverse proxy deployment.


Figure 6.8: Forward Proxy and Reverse Proxy

6.1.7 Microsegmentation
It is common to refer to traffic in the data center as being north-south or east-west.

1. North-South traffic: For traffic from a server in a data center to reach the internet, it must traverse
the WAN. The traffic that flows inward and outward between the servers (LAN) and the WAN is
north-south traffic. Traditionally, most data center traffic was of this kind.

2. East-West traffic: The traffic flow within a data center, VLAN, or subnet is referred to as
east-west traffic. For example, the data communication from the application server to the DB server
constitutes east-west traffic. This kind of traffic is the norm in contemporary data centers.

Figure 6.9 depicts north-south traffic and east-west traffic pictorially.


Figure 6.9: North-South traffic and East-West traffic

Microsegmentation is a method of creating zones in data centers and cloud environments by dividing
them into distinct security segments at the individual workload level[8]. It enables the isolation of
workloads from one another and secures them based on a Zero Trust (trust no user or device) approach.
Using microsegmentation, network administrators can create policies to restrict network traffic, reduce
the network attack surface, and provide consistent security across data centers and cloud platforms.

Traditional network segmentation has been at the Layer 2 level by defining multiple virtual segments
(VLANs) and at the Layer 3 level by defining subnets. It continues to work well for north-south traffic
crossing the perimeter security. However, security for east-west traffic between workloads needs to be
much more granular – VM and workload level[9]. Microsegmentation addresses this need by enabling
the creation of microsegments, isolating, and securing them through policies. At a VM level, multiple
virtual NICs (Network Interface Cards) such as production NIC, management NIC, and backup NIC
may be assigned, and microsegments may be configured for traffic between the VMs using network
virtualization solutions such as VMware NSX.

6.2 Network Architecture
The network architecture in a data center is the architecture for the LAN based on Layer 2 and Layer
3 switching and routing to structure the flow of traffic. A hierarchical network has been found to be
more effective for improved manageability and troubleshooting than a flat network. Hence, network
architecture in a data center using switching and routing components has been hierarchical.

In the past, north-south traffic constituted the major portion of data flow in a data center, while
east-west traffic currently constitutes a significant portion. Hence, two types of network architecture
have come into vogue to support these traffic flows.

1. Three-tier network architecture: It has switches in three layers – best suited for north-south traffic.

2. Two-tier spine-leaf architecture: It has switches in two layers – best suited for east-west traffic.

6.2.1 Three-tier Architecture


Cisco formulated the three-tier hierarchical network architecture. The three layers are:

1. Access: This layer of devices connects user devices such as PCs, IP phones, wireless access points,
printers, and scanners to the network.

2. Distribution: This layer of devices does not provide service to end devices but aggregates data from
the access switches.

3. Core: This layer constitutes the network’s backbone and provides a high-speed connection between
different distribution layer devices.

Figure 6.10 depicts the three-tier network architecture.

[Figure: three layers of switches – Core at the top, Distribution (Layer 3 aggregation) in the middle,
and Access (Layer 2) at the bottom – serving logically separated environments ENV A, ENV B, and ENV C.]

Figure 6.10: Three-tier network architecture

Access Layer: This layer has access switches that implement Layer 2 VLANs for different logically
separated environments (shown as ENV A, ENV B, and ENV C in Figure 6.10). End-user devices are
connected to switches in this layer, and traffic is restricted to the Layer 2 VLANs. These are traditional
switches, typically with 24 to 48 ports of 1 or 10 Gbps each. This layer implements several Layer 2
switching services. One of these is spanning tree, which prevents loops arising from multiple connections
between two network switches or between two ports on the same switch. Without it, a loop creates
repeated broadcast messages that flood the network.
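The loop condition that spanning tree guards against can be illustrated with a small union-find sketch: a link between two switches that are already connected closes a loop and would be put into a blocking state. This illustrates only the loop-detection idea, not the actual Spanning Tree Protocol election; switch names are illustrative:

```python
def find(parent: dict, x: str) -> str:
    """Union-find root lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def redundant_links(switches, links):
    """Return links that would close a loop (candidates for blocking)."""
    parent = {s: s for s in switches}
    blocked = []
    for a, b in links:
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            blocked.append((a, b))   # already connected: this link forms a loop
        else:
            parent[ra] = rb          # merge the two connected components
    return blocked

links = [("S1", "S2"), ("S2", "S3"), ("S3", "S1")]   # a triangle of switches
assert redundant_links(["S1", "S2", "S3"], links) == [("S3", "S1")]
```

Blocking the redundant link leaves a loop-free tree while keeping every switch reachable; the blocked link can take over if an active one fails.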

Distribution Layer: Layer 3 switches in this layer are called distribution or aggregation switches as they
aggregate data from switches in the access layer. Every Layer 2 switch is connected to a corresponding
Layer 3 switch for the logically separated environment. If any device in an ENV needs to have connectivity
to VLANs defined in a different ENV, it is implemented through tunneling in Layer 3. Layer 3 tunneling
uses network layer tunneling protocols (e.g., IPSec) for the exchange of data packets by the addition of
a new IP header to an IP packet before sending them across a tunnel created over an IP-based network.
Essentially, the Layer 3 switch routes the data to the right ENV at Layer 3.

Core Layer: The switches in this layer, called core switches, have high throughput and advanced routing
capabilities. This layer is the backbone of the network. A packet received by the core switch is routed
to the correct distribution switch and onward to the access switch where the destination device for the
packet is connected. The only service provided by core switches is to route traffic at the fastest possible
speed.

The three-tier network architecture worked well for north-south traffic. However, with increasing
east-west traffic in data centers, the three hops corresponding to the three tiers increase to four, five, or
more, adding significant latency and making latency unpredictable. Spanning tree has also exhibited
brittle failure modes, with network issues resulting in network outages[13]. Cisco introduced virtual-port-
channel (vPC) technology to overcome the limitations of the Spanning Tree Protocol[10]. It was also
possible to extend the Layer 2 boundary to the core switches and have Layer 2 VLANs spread across the
ENVs. This approach enabled specific capabilities such as vMotion of VMs. Multiple connections
could be made between access switches and distribution switches. However, vPC too works best when
most traffic is north-south between clients and servers.

6.2.2 Two-Tier Spine-Leaf architecture


With east-west traffic becoming predominant as a result of virtualized components sending data to
one another, a different type of network architecture called spine-leaf came into existence. There are
two tiers – one called spine and the other called leaf. The switches in the spine layer carry traffic to the
outside (perimeter) to address the needs of north-south traffic. An example of a spine switch is the
Cisco Nexus 7000 series, and of a leaf switch the Cisco Nexus 3000 series.

Figure 6.11 shows the Spine-Leaf network architecture.

The leaf layer switches are connected to each of the spine layer switches in a mesh topology. Spine
switches are not connected to one another. If one spine switch were to go down, then the traffic is
routed through the other spine switches.


Figure 6.11: Spine-Leaf network architecture

Every server is only two hops away from any other server. Traffic from leaf switch to leaf switch goes via
a spine switch and constitutes east-west traffic. The Layer 2 boundary may be at the spine switches or
only at the leaf switches. No spanning tree is used. FabricPath with the IS-IS protocol is used with
equal-cost load balancing, which gives high throughput; alternatively, the VXLAN protocol is used.
Scalability of east-west traffic throughput is accomplished by adding more spine switches. Likewise, the
port capacity of the leaf layer scales by adding more leaf switches. North-south traffic goes through the
spine layer if the Layer 2 boundary is extended to the spine. Typically, however, the Layer 2 boundary is
restricted to the leaf layer, as otherwise broadcast traffic would spread across all the ports of the switches.
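The two-hop property can be checked on a toy topology: every leaf connects to every spine, spines are not interconnected, and any leaf-to-leaf path is leaf → spine → leaf. Switch names are illustrative:

```python
import itertools

spines = ["spine1", "spine2"]
leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]

# Full mesh: every leaf has a link to every spine; spines are not interconnected.
links = {(leaf, spine) for leaf in leaves for spine in spines}

def hops(src_leaf: str, dst_leaf: str):
    """Any spine connected to both leaves yields a two-hop path."""
    for spine in spines:
        if (src_leaf, spine) in links and (dst_leaf, spine) in links:
            return 2
    return None

# East-west traffic between any pair of leaves is always exactly two hops,
# and losing one spine still leaves a path through the remaining spines.
assert all(hops(a, b) == 2 for a, b in itertools.combinations(leaves, 2))
```

Adding a spine adds another equal-cost two-hop path between every leaf pair, which is why east-west throughput scales with the spine count.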

6.3 Network virtualization


Network virtualization delivers in software what was traditionally delivered in hardware. Network
virtualization may be used to create virtual network layers on the same physical network infrastructure.

The spine-leaf architecture described in the previous section could be the physical network
infrastructure and is referred to as an underlay network. A network virtualization software can run
on the underlay network to create an overlay network that can run different virtual network layers.

Two key network virtualization technologies are widely used[11]:

1. Cisco’s Application Centric Infrastructure (ACI): Integrated overlay approach, i.e., includes hardware
and software. It uses a virtualization technology called VXLAN (virtual extensible local area network
technology).

2. VMware’s NSX-T: Adopts an approach of software overlay over server infrastructure. It uses
a generic technology for network virtualization called GENEVE.

6.4 Network services in the cloud
All public cloud platforms offer services to define the network and configure it remotely through admin
consoles. Table 6-4 provides an overview of key network services from major cloud platform providers[12].

Service                  Amazon Web Services (AWS)                 Microsoft Azure            Google Cloud Platform (GCP)
Virtual Network          Amazon Virtual Private Cloud (VPC)        Virtual Networks (VNets)   Virtual Private Cloud
Data Center integration  Direct Connect                            ExpressRoute               Google Cloud Interconnect
Load Balancer            Elastic Load Balancer                     Load Balancer              Google Cloud Load Balancing
DNS                      Amazon Route 53                           Azure DNS                  Google Cloud DNS
Firewall                 AWS Firewall / Web Application Firewall   Azure Firewall             Google Cloud firewalls

Table 6-4: Key network services for AWS, Azure, and GCP

References

[1] Shaw K., “The OSI model explained and how to easily remember its 7 layers”, https://www.networkworld.com/article/3239677/the-osi-model-explained-and-how-to-easily-remember-its-7-layers.html.
[2] Harmoush E., “Virtual Local Area Networks (VLANs)”, https://www.practicalnetworking.net/stand-alone/vlans/.
[3] Quick B., “How Do VLANs Work?”, https://www.inteltech.com/how-do-vlans-work/.
[4] Ferguson K., “subnet (subnetwork)”, https://www.techtarget.com/searchnetworking/definition/subnet.
[5] Erikberg, “Notes: Networks, Subnets, and CIDR”, https://erikberg.com/notes/networks.html.
[6] Palmer G., “Network Device and Technologies 1.1 SY0-401”, https://zymitry.com/network-devices-technologies/.
[7] Njoroge J., “When a Traditional Firewall Doesn’t Go Far Enough”, https://gtb.net/why-gtb/blog/when-traditional-firewall-doesn%E2%80%99t-go-far-enough.
[8] VMware, “What is Micro-Segmentation?”, https://www.vmware.com/topics/glossary/content/micro-segmentation.
[9] Palo Alto Networks, “What is Microsegmentation?”, https://www.paloaltonetworks.com/cyberpedia/what-is-microsegmentation.
[10] Cisco, “Cisco Data Center Spine-and-Leaf Architecture: Design Overview White Paper”, https://www.cisco.com/c/en/us/products/collateral/switches/nexus-7000-series-switches/white-paper-c11-737022.html.
[11] Morin J., Shaw S., “Network Virtualization For Dummies”, https://www.vmware.com/content/microsites/learn/en/47785_REG.html.
[12] Wickramasinghe S., “AWS vs Azure vs GCP: Comparing The Big 3 Cloud Platforms”, https://www.bmc.com/blogs/aws-vs-azure-vs-google-cloud-platforms/.
[13] Ferro G., “Why Spanning Tree Is Evil”, https://www.networkcomputing.com/networking/why-spanning-tree-evil.

Chapter 7

Storage

[Figure: primary and secondary data centers with representative infrastructure components – network,
compute (x86, mid-range, mainframe), storage (block, file, object, mainframe), security, monitoring,
automated provisioning, DR replication, and backup and restore – with chapter cross-references.]

Figure 7.1: Data Centers with representative infrastructure components (focus of this chapter highlighted)

Storage is an essential capability needed in the infrastructure of an enterprise to be able to host
applications and their data. In Figure 7.1, showing the data center deployments, the representative
components related to the focus of this chapter, namely storage capability, are highlighted (in a box
with a dashed outline).

Enterprise systems continuously add data to their structured and unstructured data stores and process
it by OLTP, OLAP, and ML systems. Additionally, multiple copies of data are maintained for operational
recovery and disaster recovery requirements. Further, regulatory requirements in several industry
segments require data copies to be archived for several years. Due to cyber-attacks in recent times,

more copies are also being stored in cyber vaults (CV) to recover from such attacks. Consequently,
there is an enormous and increasing demand for storage in enterprises.

The capability of storage is assessed by the following parameters[1] –

1. IOPS: Input/output operations per second.
2. Throughput: Number of bits (Gbps) or bytes (GBps) a system can read or write per second.
3. Latency: Duration for a single data request to be received, the correct data to be located, and the
response to be provided by the storage media.
4. Capacity: Amount of data that can be stored in GB or TB.
5. Availability: Percentage of time that a storage system is available
for use, i.e., uptime.
6. Durability: Measure of storage system’s long-term data protection ability, i.e., not suffer from
degradation, bit rot, or other corruption. It is expressed as a percentage.
7. Types of storage media supported: NVMe (SSD), SSD, SAS, SATA.
8. Storage efficiency: Optimize storage through deduplication and compression techniques.
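The first three parameters are related: throughput is approximately IOPS multiplied by the I/O block size, and latency caps the IOPS achievable per outstanding request. A back-of-the-envelope sketch (the figures are illustrative, not any product's specification):

```python
def throughput_mbps(iops: int, block_size_kb: int) -> float:
    """Throughput (MB/s) = IOPS x I/O block size."""
    return iops * block_size_kb / 1024

def max_iops_per_queue(latency_ms: float) -> float:
    """With one outstanding request, IOPS is capped at 1 / latency."""
    return 1000 / latency_ms

# e.g. a drive doing 20,000 IOPS at 8 KB blocks moves about 156 MB/s ...
assert round(throughput_mbps(20_000, 8), 1) == 156.2
# ... and a 0.5 ms device can serve at most 2000 IOPS at queue depth 1.
assert max_iops_per_queue(0.5) == 2000
```

This is why IOPS and throughput must always be quoted together with the block size used to measure them.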
There are three types of storage solutions in use in both on-premises and cloud environments.

1. Block Storage.
2. File Storage.
3. Object Storage.

7.1 Block Storage


Block storage is a type of storage in which data is stored in fixed-size blocks. Metadata is maintained by
the storage system for each block[2]. A software program for the storage system uses metadata to locate
the desired blocks for data retrieval. A raw volume of data storage from the constituent blocks is made
available to a physical server (with additional storage controllers) or virtual machine[3]. Each raw volume
can function as an individual hard drive or storage repository. It can be mounted as a volume or mapped
as a drive for access from the operating system[4].

Storage Area Network (SAN)


Storage area network (SAN) provides block storage. In simple terms, SAN “is a network of disks that is
accessed by a network of servers”[5]. SAN is generally accessed using Fibre Channel, Fibre Channel over
Ethernet (FCoE), or iSCSI protocols. Dell EMC’s PowerMax is an example of SAN storage.

Before SAN, direct-attached storage (DAS) was the solution employed for storage, with disks connected
directly to servers and containing both applications and data. While disks attached to one server were
accessible from other servers, the accessed data had to flow over the servers’ LAN, and moving large
amounts of data caused performance bottlenecks due to bandwidth issues.

On the other hand, the SAN has a separate network for the Fibre channel to interconnect disks. It does not use
the LAN, so the transfer of SAN storage data does not impact LAN performance. Setting up and maintaining
the separate Fibre Channel network costs more but delivers greater performance. iSCSI and FCoE
protocols involve transferring data over the standard network components of standard Ethernet LAN. The
iSCSI (short for “Internet SCSI”) protocol enables clients to send SCSI commands to SCSI storage devices
on remote servers. iSCSI can work over long distances using existing network infrastructure. While iSCSI
and FCoE may have disadvantages of using standard Ethernet LAN from the point of view of performance
bottlenecks due to bandwidth issues, they provide the advantage of lower cost of setup and maintenance.

There are three layers to the SAN.

1. Storage layer: This layer has an array of physical disks, most often configured with RAID options to
improve storage capacity, reliability, or both. RAID (redundant array of independent disks) protects
data in the case of a disk/drive failure. It is configured to appear to the operating system (OS) as
a single logical drive. A Logical Unit Number (LUN) is typically assigned to one or more disks (e.g.,
LUN0, LUN1) and may be accessed by servers.
2. Fabric layer: The layer with SAN switches, routers, gateways, and protocol bridges constitutes the
fabric layer over which the host layer accesses data in the storage layer.
3. Host layer: The hosts that connect with the SAN storage via the fabric layer constitute the host
layer. Each of these hosts has a separate network adapter for Fibre Channel, different from an
Ethernet adapter, called a host bus adapter (HBA). The hosts run the business applications and
databases and communicate with the SAN storage over the SAN fabric using the HBA.
Illustration – Storage Area Network


Figure 7.2 – Storage Area Network (SAN)

Figure 7.2 depicts the SAN solution with the three layers. The host layer consists of physical and virtual
servers (e.g., ESXi host with VMs), fabric Layer shows SAN Switches (e.g., Brocade), and storage layer
(e.g., PowerMax) depicts LUNs configured on the disk arrays (with RAID) for providing block storage.

7.1.1 Block Storage Options on-premises


Block storage in the data center is provided by –

1. Storage Area Network (SAN)

a. It is typically a mix of EFD (Enterprise Flash Storage that is Solid State Drive with higher
performance), FC (Fiber Channel), SAS (Serial-attached SCSI), SATA (Serial AT Attachment).
b. DELL/EMC’s PowerMax is an example of a SAN system.
2. Virtual SAN

a. With the evolution of software-defined storage, there is an option to use Virtual SAN (vSAN). Instead
of using central storage with a separate network for accessing the storage layer, it is possible to pool
storage across multiple servers and use it as a Virtual SAN. The storage is accessible over ethernet.
It has the advantage of lower cost by pooling available storage capacity through software means.

VMware vSAN works with vSphere hypervisor and can be managed with vSphere client[6].

7.1.2 Block Storage Options on Cloud


Public cloud platforms provide a rich collection of block storage and database services summarized in
Table 7-1[7]. Relational data services on cloud platforms use block storage, and hence they have also
been listed.

Service          Amazon Web Services (AWS)      Microsoft Azure                Google Cloud Platform (GCP)
Block storage    Elastic Block Storage (EBS)    Azure Disk                     Persistent disks
In-Memory Store  ElastiCache                    Redis Cache                    Memorystore
RDBMS            Amazon RDS – Aurora,           Azure relational database      Cloud SQL – MySQL,
                 PostgreSQL, MySQL, MariaDB,    services – SQL Server,         PostgreSQL, and SQL
                 Oracle Database, and SQL       PostgreSQL, MySQL, MariaDB     Server databases
                 Server
Indexed NoSQL    DynamoDB                       Cosmos DB                      Datastore, Bigtable

Table 7-1: Key block storage and database services for AWS, Azure, and GCP

7.2 File Storage
Many files are generated both by individuals and applications in any organization. The file storage
solution is an effective mechanism to store and retrieve file-based data from a system that provides
access to operating systems as a mount point or drive mapping.

Network Attached Storage (NAS) has disk arrays that are managed by an operating system. It provides
network interfaces and is accessed using file service protocols (NFS/CIFS) over Ethernet, which differs
from the block-based protocols such as Fibre Channel (FC) and iSCSI used in SANs. NAS is
better suited for unstructured data. NAS provides high-capacity storage at a lower cost, and admins can add
more disks to scale capacity.

Figure 7.3 depicts NAS file storage connected over Ethernet to physical servers and to virtual machines
(VMs) on a virtualization server (e.g., an ESXi host). Most NAS products also support iSCSI for
block-level access to storage.

[Figure 7.3 shows a NAS device reached over an IP network: VMs on a virtualization server (ESXi host) and a physical server connect through their Ethernet interfaces and access the NAS using the NFS/CIFS file protocols, with iSCSI also supported.]

Figure 7.3 – Network Attached Storage (NAS)

7.2.1 File storage options for data center


Network Attached Storage (NAS) devices are widely used in data centers for file storage. These typically
have SAS disk shelves; NetApp’s FAS8000 series is an example of such a file storage solution.

7.2.2 File storage options for cloud
The file storage services on public cloud platforms are summarized in Table 7-2[8].

Service       | Amazon Web Services (AWS)              | Microsoft Azure                   | Google Cloud Platform (GCP)
--------------|----------------------------------------|-----------------------------------|----------------------------
File storage  | Elastic File System (EFS); FSx for     | Azure File Storage; Avere vFXT    | Google Filestore
              | Windows File Server and Lustre         |                                   |

Table 7-2: Key file storage services for AWS, Azure, and GCP

7.3 Object Storage


When custom metadata is stored along with the actual data to facilitate retrieval, the combination represents an object.
Mechanisms to store and retrieve such objects are part of an object storage solution. Objects are typically
unstructured, without hierarchies, reside in a unified address space, and are accessed using REST APIs over
HTTP. With enterprises increasingly needing to deal with unstructured data (images, audio, and video)
and the growing demand for storing, retrieving, and transmitting it, object storage has become
critical for many businesses. This form of storage is best suited for data that does not change and is
widely offered as cloud storage by cloud providers.
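The REST-over-HTTP access pattern can be illustrated by constructing the URL and custom-metadata headers for an S3-style object request. This is a minimal sketch; the endpoint, bucket, and metadata names are hypothetical, and a real request would also carry authentication headers (e.g., AWS Signature v4):

```python
# Sketch: constructing an S3-style REST request for an object carrying
# custom metadata. Custom metadata travels in x-amz-meta-* headers
# alongside the data, which is what lets object stores retrieve by
# metadata. All names below are illustrative, not a real deployment.

def build_object_request(endpoint, bucket, key, metadata=None):
    """Return (url, headers) for a PUT/GET of an object with custom metadata."""
    url = f"https://{bucket}.{endpoint}/{key}"
    headers = {}
    for name, value in (metadata or {}).items():
        # User-defined metadata keys are conventionally lower-cased.
        headers[f"x-amz-meta-{name.lower()}"] = value
    return url, headers

url, headers = build_object_request(
    "s3.example.com", "media-assets", "video/intro.mp4",
    metadata={"Department": "marketing", "Retention": "7y"},
)
```

The object is addressed by a flat key in a unified address space; there is no directory hierarchy, only the key string.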

7.3.1 Object storage options for data center


Object storage is seeing increasing adoption in data centers as well, driven by growing data volumes,
regulatory requirements, and data security threats[9]. Object storage in the data center has the following
characteristics:

a. Objects are stored across a group of drives.

b. They are replicated multiple times across distinct zones where possible.

c. IOPS: low IOPS for SATA disks and high IOPS for SSDs.

DELL/EMC’s ECS is an example of an Object Storage system.

7.3.2 Object storage options for cloud
Object storage is perhaps one of the earliest services to have gained adoption on the cloud.
A comparison of object storage services offered by AWS, Azure, and GCP is given in Table 7-3[8].

Service            | Amazon Web Services (AWS)                     | Microsoft Azure           | Google Cloud Platform (GCP)
-------------------|-----------------------------------------------|---------------------------|--------------------------------
Object storage     | S3 (Simple Storage Service)                   | Blob Storage              | Google Cloud Storage
Long-term storage  | Amazon Glacier                                | Azure Archive Storage     | Google Storage (Archive class)
Hybrid storage     | Storage Gateway                               | StorSimple                | Egnyte Sync
Data transfer      | Snowball Edge, Import/Export Disk, Snowmobile | Data Box, Import/Export   | Storage Transfer Service

Table 7-3: Key object storage services for AWS, Azure, and GCP

7.4 Storage Media


Storage media technologies have been developing at a phenomenal pace, delivering more storage
capability per unit price. Table 7-4 provides indicative parameters for storage media currently
in use[10].

Key Parameters | NVMe (SSD)            | SSD                        | SAS Hard Disk        | SATA Hard Disk
---------------|-----------------------|----------------------------|----------------------|-------------------
IOPS           | 200,000 – 10,000,000  | 60,000 – 400,000           | 188 – 400            | 73 – 79
Throughput     | 3 GBps – 32 GBps      | 6 Gbps – 12 Gbps           | 6 Gbps – 12 Gbps     | 500 Mbps – 1 Gbps
Latency        | 10 – 225 microseconds | 100 microseconds – 100 ms  | 4 – 10 milliseconds  | 10 – 30 milliseconds
Interface      | PCIe U.2, PCIe M.2    | PCIe M.2, SAS, SATA        | SAS                  | SATA

Table 7-4: Storage Media
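Note that the throughput figures mix units: GBps is gigabytes per second, while Gbps is gigabits per second (eight times smaller). A small helper, with illustrative unit factors, normalizes the figures for comparison:

```python
# Sketch: normalising the mixed throughput units above to gigabits per
# second. "GBps" (gigabytes/s) is 8x larger than "Gbps" (gigabits/s).

UNIT_TO_GBPS = {
    "GBps": 8.0,          # gigabytes/s -> gigabits/s
    "Gbps": 1.0,
    "MBps": 8.0 / 1000,   # megabytes/s -> gigabits/s
    "Mbps": 1.0 / 1000,
}

def to_gbps(value, unit):
    """Convert a throughput figure to gigabits per second."""
    return value * UNIT_TO_GBPS[unit]

# An NVMe drive at 3 GBps moves 24 Gbit/s, well beyond a SATA drive's
# ~1 Gbps ceiling.
nvme_low = to_gbps(3, "GBps")
sata_low = to_gbps(500, "Mbps")
```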

7.5 Storage Tiers
An enterprise’s storage is classified into tiers based on data service characteristics such as performance
and price. Table 7-5 illustrates the different tiers defined for storage[11].

Tier   | Data Service Characteristics                                       | Performance (Response Time) | Price  | Storage Media Examples
-------|--------------------------------------------------------------------|-----------------------------|--------|-----------------------
Tier 0 | Mission-critical data for uninterrupted, disruption-free access    | < 5 ms                      | High   | NVMe, SSD
Tier 1 | Frequently accessed (hot) data and high-performance workloads      | < 8 ms                      | High   | SSD
Tier 2 | Infrequently accessed (warm) data with short-term retention needs  | < 12 ms                     | Medium | SAS
Tier 3 | Archival and rarely accessed data with long-term retention needs   | < 30 ms                     | Low    | SATA

Table 7-5: Storage Tiers
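The tiering logic of Table 7-5 can be sketched as a lookup that picks the cheapest tier whose response-time target still meets a workload's requirement. The thresholds mirror the table; the selection rule itself is illustrative:

```python
# Sketch: choosing a storage tier from the response-time targets in
# Table 7-5. Lower-numbered tiers are faster and more expensive, so we
# pick the cheapest (highest-numbered) tier that satisfies the need.

TIERS = [  # (max response time in ms, tier name, media examples)
    (5, "Tier 0", "NVMe, SSD"),
    (8, "Tier 1", "SSD"),
    (12, "Tier 2", "SAS"),
    (30, "Tier 3", "SATA"),
]

def select_tier(required_ms):
    """Return the cheapest tier whose response-time target fits the need."""
    for max_ms, tier, _media in reversed(TIERS):  # cheapest first
        if max_ms <= required_ms:
            return tier
    return "Tier 0"  # tighter than any target: use the fastest tier

archive_tier = select_tier(50)   # relaxed requirement -> cheapest tier
hot_tier = select_tier(10)       # needs responses within 10 ms
```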

References
[1] Pritchard S., “Storage performance metrics: Five key areas to look at”, https://www.computerweekly.com/feature/Storage-performance-metrics-Five-key-areas-to-look-at.
[2] Sullivan E., “block storage”, https://searchstorage.techtarget.com/definition/block-storage.
[3] Atlantic.Net, “What is Block Storage?”, https://www.atlantic.net/dedicated-server-hosting/what-is-block-storage/.
[4] Poojary N., “Understanding Object Storage and Block Storage Use Cases”, https://cloudacademy.com/blog/object-storage-block-storage/.
[5] Bigelow S., “What is a SAN? Ultimate storage area network guide”, https://searchstorage.techtarget.com/definition/storage-area-network-SAN.
[6] Larcom A., “VMware vSAN vs. SAN: What are the differences?”, https://searchvmware.techtarget.com/tip/How-VMware-vSAN-differs-from-a-traditional-VSAN.
[7] Chapter247, “AWS vs Azure vs Google Cloud - A detailed comparison of the Cloud Services Giants”, https://www.chapter247.com/blog/aws-vs-azure-vs-google-cloud-a-detailed-comparison-of-the-cloud-services-giants/.
[8] A. Adshead, “Cloud storage 101: NAS file storage on AWS, Azure and GCP”, https://www.computerweekly.com/feature/Cloud-storage-101-NAS-file-storage-on-AWS-Azure-and-GCP.
[9] U. Boppana, “Red Hat Ceph Storage 3 greatly advances object storage capabilities”, https://www.redhat.com/en/blog/rise-object-storage-modern-datacenter.
[10] R. Sheldon, TechTarget, “NVMe speeds vs. SATA and SAS: Which is fastest?”, https://searchstorage.techtarget.com/feature/NVMe-SSD-speeds-explained.
[11] STONEFLY, “Everything you need to know about Tiered Storage”, https://stonefly.com/resources/what-is-tiered-storage.

Chapter 8

Backup and Restore

[Figure 8.1 shows primary and secondary data centers with representative infrastructure components: client devices (customers over the Internet, employees over MPLS) reaching web servers through web application firewalls and firewalls (security, Chapter 11), monitoring (Chapter 10), automated provisioning, networks (Chapter 6), compute (x86 virtual and physical servers, mid-range such as AIX and IBM i, and mainframe; Chapter 5), storage (object, file, block, and mainframe; Chapter 7), and DR replication between the sites (Chapter 9); the backup and restore components (Chapter 8) are highlighted.]

Figure 8.1: Data Centers with representative infrastructure components (focus of this chapter highlighted)

Backup/restore is the capability to create a copy of a system’s data (server or desktop) and use it for recovery,
should the original data be lost or corrupted[1]. A tape library, physical or virtual (VTL), and storage space
are required to store the backups. The focus of backup is on operational recovery as per the defined Recovery
Point Objective (RPO). Enterprises apply two types of policies for retention of backup
data, for compliance with organizational processes and regulations: short-term retention (STR) and long-
term retention (LTR)[2]. In Figure 8.1, showing the data center deployments, the representative components
related to the focus of this chapter, namely the backup/restore capability, are highlighted (in a box with
a dashed outline). It may be noted that backups are different from snapshots and are meant to address

Chapter 8: Backup and Restore 69


different needs. A snapshot is a point-in-time “picture” of data (e.g., a VM, or SAN or NAS storage) meant
as a short-term solution, used primarily in development and test environments to test patching, updates,
and other changes; it is usually stored locally and is rolled back in case of failure[3]. Backups, on the
other hand, are copies of a system’s data used to recover the system in case of failure.

In a data center or cloud, the operational recovery of a system is carried out by performing several
activities. One important activity is restoring data from backups[4]. Several criteria have a bearing on
the choice of technology for backup/restore and the definition of operational processes for recovery.

Key backup/restore criteria for operational recovery: RPO, RTO, required backup success rate, available
backup window, STR backup period, LTR backup period, off-site backup, and monitoring & support.

8.1 Backup/Restore criteria
Key criteria to be kept in mind for backup/restore for operational recovery and disaster recovery are as
follows[5]:

1. Recovery Point Objective (RPO): The maximum allowable data loss in the event of a disaster.
It is measured in terms of time and depends on the maximum age of the data or files in backup
storage[6].

2. Recovery Time Objective (RTO): The maximum time allowed to recover from an adverse incident and
restore normal operations for users[7].

3. Backup Success rate: Percentage of attempts of backup when data is copied correctly and
completely[8].

4. Available Backup Window: The window of time during which it is suitable to take backups of data[9].

5. Onsite short-term retention (STR) backup period: Duration for which backup copies are retained in
an environment that enables quick restore of data in the event of failure of a system.

6. Onsite long-term retention (LTR) backup period: Duration for backup copies to be maintained
onsite for point-in-time restore.

7. Off-site backup: Remote site at which backups are stored.

8. Monitoring & Support: The period during which the operations team provides monitoring and
support services.

A “3-2-1 backup” strategy is typically employed: maintain 3 copies of the backup data, stored on
2 different types of media, with 1 copy kept off-site[5].
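One reading of the 3-2-1 rule (3 copies, 2 different media types, 1 off-site) can be checked mechanically; a sketch with illustrative field names:

```python
# Sketch: checking a set of backup copies against the 3-2-1 rule: at
# least 3 copies, on at least 2 different media types, with at least
# 1 copy off-site. The 'media' and 'offsite' field names are illustrative.

def satisfies_3_2_1(copies):
    """copies: list of dicts with 'media' and 'offsite' keys."""
    three_copies = len(copies) >= 3
    two_media = len({c["media"] for c in copies}) >= 2
    one_offsite = any(c["offsite"] for c in copies)
    return three_copies and two_media and one_offsite

plan = [
    {"media": "disk", "offsite": False},    # primary backup on disk
    {"media": "vtl", "offsite": False},     # second copy on a virtual tape library
    {"media": "object", "offsite": True},   # third copy in off-site cloud storage
]
```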



Table 8-1 gives indicative backup/restore service characteristics, against the criteria above, for each
compute service tier.

Backup/Recovery Service Characteristics    | Compute Service Tier: High | Medium                | Low
-------------------------------------------|----------------------------|-----------------------|----------------------
Recovery Point Objective                   | < 12 hours                 | < 24 hours            | < 48 hours
Recovery Time Objective                    | < 12 hours                 | < 24 hours            | < 48 hours
Backup Success Rate                        | 99%                        | 97%                   | 95%
Backup Window                              | Off-hours (8pm – 8am)      | Off-hours (8pm – 8am) | Off-hours (8pm – 8am)
Onsite short-term retention (STR) period   | 4 weeks                    | 4 weeks               | 2 weeks
Onsite long-term retention (LTR) period    | 7 years                    | 7 years               | 7 years
Off-site backup                            | Yes                        | Yes                   | No
Monitoring & Support                       | 7x24                       | 7x24                  | 7x24

Table 8-1: Backup/Recovery service characteristics
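The RPO rows of Table 8-1 lend themselves to a simple compliance check on the age of the last successful backup; a sketch using the table's indicative values:

```python
# Sketch: verifying that the most recent successful backup still
# satisfies the RPO for a compute service tier (hours from Table 8-1).

from datetime import datetime, timedelta

RPO_HOURS = {"High": 12, "Medium": 24, "Low": 48}

def rpo_compliant(last_backup, tier, now):
    """True if the age of the last backup is within the tier's RPO."""
    return now - last_backup < timedelta(hours=RPO_HOURS[tier])

check_time = datetime(2022, 1, 10, 8, 0)
ten_hours_old = datetime(2022, 1, 9, 22, 0)
fourteen_hours_old = datetime(2022, 1, 9, 18, 0)
```

A monitoring job would run this against the backup catalog and alert when a system drifts out of its RPO budget.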

8.2 Solution patterns for backup and restore


There are two patterns for backup/restore that are widely applied[10] –

1. Agent-based backup/restore pattern


In an agent-based backup pattern, a small application called an agent is deployed on the physical
server or virtual machine from which a backup is taken. Each agent supports data backup for specific
applications on the physical server or virtual machine[11]. The agent backs data up to a backup server
or directly to a storage pool (orchestrated by the backup server). The restore is performed by the backup
server, either through the agent deployed on the physical server or virtual machine, or directly to a
new virtual machine that the backup server provisions.

Illustration
Figure 8.2 depicts agent-based backup with IBM’s Spectrum Protect and DELL/EMC’s Networker
technologies.

IBM’s Spectrum Protect has agents (clients) deployed on the servers (AIX, Windows, Linux) from which
backups are taken, for backing up SQL Server, DB2, and so on. The agent takes the backup and sends it
to the Spectrum Protect server, which then writes it to a storage target such as DELL/EMC’s Data Domain.
During the restore process, the IBM Spectrum Protect server uses the client to restore the data, or spins
up a new virtual machine and restores the backup to it. Figure 8.2 also shows inline deduplication of
non-prod data and LTR prod data to on-premises cloud storage (e.g., DELL/EMC ECS). In this case, data
deduplication is performed while the backup data is being copied to the backup device.



DELL/EMC’s NetWorker has DD Boost or NetWorker Management Module (NMM) components
deployed on the physical servers or virtual machines; these send the backup to DELL/EMC’s
Data Domain. DD Boost is software that enhances the interaction of backup servers and clients
with a Data Domain backup appliance.

[Figure 8.2 shows agent-based backup: IBM Spectrum Protect clients on x86 physical/virtual servers and on IBM AIX and IBMi LPARs back up over the front-end network to IBM Spectrum Protect servers for the non-prod and prod environments, which write short-term-retention copies to VTLs over the back-end network; a DELL/EMC NetWorker backup server uses DD Boost to write to Data Domain; and long-term-retention prod and non-prod data is inline-deduplicated to buckets on on-premises cloud storage.]

Figure 8.2: Agent-based backup
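The inline deduplication mentioned in the illustration can be sketched as content-hash-based chunking: a chunk is written to the backup target only if its hash has not been seen before. Real appliances such as Data Domain use variable-length chunking and far larger stores; this is a toy model:

```python
# Sketch of inline deduplication: data is split into fixed-size chunks,
# each chunk is identified by its content hash, and only previously
# unseen chunks are stored. A per-backup "recipe" of hashes allows restore.

import hashlib

class DedupStore:
    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}    # content hash -> chunk bytes (stored once)
        self.recipes = {}   # backup name -> ordered list of hashes

    def backup(self, name, data):
        hashes = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # dedup happens here
            hashes.append(digest)
        self.recipes[name] = hashes

    def restore(self, name):
        return b"".join(self.chunks[h] for h in self.recipes[name])

store = DedupStore()
data = b"A" * 8192 + b"B" * 4096
store.backup("mon", data)
store.backup("tue", data + b"C" * 4096)  # only the new C-chunk is stored
```

Even though "tue" repeats all of Monday's data, the store holds only three unique chunks, which is why deduplicated backup appliances achieve large space savings.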

2. Agentless backup/restore pattern


In an agentless backup pattern, no agent is installed on each physical server or virtual machine
from which backup is taken. Just one instance of a backup solution is deployed on a server.
It connects with all the servers on the network and performs the backup of required data.

Examples of this pattern –

– IBM Spectrum Protect for Virtual Environments is a feature by which the backup server
takes backup of the complete image of the VM as a VMDK file. No agent is used. The restore
process involves spinning up a new VM and deploying the backed-up image.

– Likewise, Networker Image-level backup is a feature by which the backup server takes
backup of the complete image of VM as a VMDK file without an agent. The restore would be
to a new VM, and hence no agent is required.



– NetApp SnapVault for its FAS8000 series NAS product enables taking a baseline snapshot,
which is essentially a point-in-time copy of the source data being protected. Snapshot copies taken
subsequently contain just the differences from the source data. These copies can be stored on the
source volume or on a different destination volume.

8.3 Back-end Network


When architecting backup/restore solutions, a common practice is to define[12] –

1. Front-end network: Network segments (VLANs) used by applications to communicate with their
components (e.g., databases) and interfaces (e.g., middleware) constitute the front-end network.

2. Back-end network: Network segments (VLANs) used by backup devices to perform backups and
monitoring & management tools to communicate with their components constitute the back-end
network.

This practice separates the backup/restore operations traffic from the application traffic, which avoids
performance bottlenecks.

8.4 Types of Backup


There are three types of backup[13] –

1. Full backup: A backup of all files, objects, and bytes. It represents a complete copy of all data and can
be used for recovery without additional steps. A full backup takes time, depending on the
size of the data and the number of systems on which it needs to be done.

2. Differential backup: A differential backup copies the data that has changed since the last full
backup. During restore, the last full backup is used, and the differential backup is applied on
top of it, thus saving time. However, as the number of days since the full backup increases, the data to
be backed up also grows, increasing the time taken for each differential backup.

3. Incremental backup: A backup of the changes made since the last backup (full or incremental).
The last full backup is used during restoration, and subsequent incremental backups are applied
in the correct order. This type requires the smallest backup window, which suits most
enterprises.
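The restore implications of the three backup types can be sketched as a function that builds the chain of backups needed for a restore; the scheme shown (the latest differential supersedes earlier ones, while incrementals accumulate) is one common convention:

```python
# Sketch: building the restore chain for the three backup types. Given
# backups in time order, a restore needs the last full backup plus
# either the latest differential or every incremental taken since.

def restore_chain(backups):
    """backups: time-ordered list of ('full'|'differential'|'incremental', label)."""
    last_full = max(i for i, (kind, _) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    for kind, label in backups[last_full + 1:]:
        if kind == "differential":
            # A differential covers everything since the full backup,
            # so it replaces any earlier differentials/incrementals.
            chain = [backups[last_full], (kind, label)]
        elif kind == "incremental":
            chain.append((kind, label))
    return chain

weekly_incr = [("full", "Sun"), ("incremental", "Mon"), ("incremental", "Tue")]
weekly_diff = [("full", "Sun"), ("differential", "Mon"), ("differential", "Tue")]
```

Incrementals trade a short backup window for a longer, order-sensitive restore chain; differentials keep the restore to two steps at the cost of growing backup sizes.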



8.5 Operational Recovery
Operational recovery is the recovery of a system’s applications and infrastructure after a failure of
any kind. The RPO, RTO, and availability criteria determine the time available for the operations
team to recover a system from failure. RPO and RTO were explained earlier.

Availability is dependent on two factors –

1. Mean Time Between Failures (MTBF): Average time between system failures. MTBF relates to
system uptime and is largely outside the control of the operations team.

2. Mean Time To Repair (MTTR): Average time to troubleshoot, repair, and restore the system from
failure. MTTR is related to downtime a system can tolerate to comply with availability criteria.

Availability = MTBF / (MTBF + MTTR), i.e., Uptime / (Uptime + Downtime)[14].
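The formula can be exercised directly; a minimal sketch:

```python
# Sketch: availability from MTBF and MTTR, per the formula above.

def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails every ~1000 hours and takes 1 hour to repair
# achieves "three nines" of availability.
three_nines = availability(999, 1)
```

Note that since MTBF is largely outside the operations team's control, reducing MTTR (faster troubleshooting and restore) is the main lever the team has for improving availability.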

Operational recovery involves several activities, including recovery of data from backups. When a system
goes down with associated servers and applications, the data is recovered from backups.

The typical reasons for a system to fail are as follows[15]:

1. Accidental deletion of a file, entire directory, or a VM.

2. Space expansion for databases.

3. Improper patching.

4. Active Directory changes.

When the system needs to be recovered, the actions taken by the operations team for recovery are as
follows:

1. Bring up Network VLANs, if needed.

2. Bring up Servers with the correct operating systems.

3. Deploy applications.

4. Restore data from backups.

Illustration – Database server failure

An indicative sequence of steps involved in the operational recovery of a database server from a failure is
as follows:

Restart the database server –

1. The server starts, and the OS comes up.

• Bring up the database in crash recovery mode.

• Crash recovery has 3 phases: conduct an assessment, identify the transactions to be rolled back, and undo those transactions.
• Roll back the incomplete transactions and bring up the database.
• This approach gives the best RPO.

2. The server starts, but the OS does not come back up.

• If the OS is corrupted, set up a new VM and install the backup agent (for agent-based backup).

• Launch a restore of the backup (e.g., with IBM Spectrum Protect).
• Restore the full database backup, followed by the incremental backups.
• Apply archived transaction logs.

8.6 Backup/Restore solutions


Several backup/restore solutions are available both for deployments in the data center and the cloud.

8.6.1 Backup/Restore storage options for data center


Several vendors provide products to meet the needs of backup/restore for the different storage tiers.
Examples of backup/restore technology products are provided in Table 8-2.

Storage Tier                                                             | Backup/Restore Product Options (Examples)
-------------------------------------------------------------------------|------------------------------------------
Tier 0 – Mission-critical data for uninterrupted, disruption-free access | IBM Spectrum Protect[16]
Tier 1 – Frequently accessed (hot) data and high-performance workloads   | DELL/EMC Networker[17]
Tier 2 – Infrequently accessed (warm) data, short-term retention         | Veritas NetBackup[18]
Tier 3 – Archival and rarely accessed data, long-term retention          | Veritas NetBackup[18]

Table 8-2: Product Options (Examples) for backup/restore in the data center



8.6.2 Backup/Restore storage options for cloud
A comparison of backup/restore services offered by AWS, Azure, and GCP is given in Table 8-3[19].

Service                 | Amazon Web Services (AWS)                     | Microsoft Azure          | Google Cloud Platform (GCP)
------------------------|-----------------------------------------------|--------------------------|----------------------------
Backup                  | AWS Backup                                    | Azure Backup             | Google Cloud Storage
On-prem to cloud backup | Storage Gateway, AWS DataSync                 | StorSimple, AzCopy       | Egnyte Sync, gsutil
Data transfer           | Snowball Edge, Import/Export Disk, Snowmobile | Data Box, Import/Export  | Storage Transfer Service

Table 8-3: Key Backup/Restore services for AWS, Azure, and GCP

References
[1] Acronis, “Data Backup – What is it?”, https://www.acronis.com/en-sg/articles/data-backup/.
[2] B. Posey, “Backup retention policy best practices: A guide for IT admins”, https://searchdatabackup.techtarget.com/answer/What-are-some-data-retention-policy-best-practices.
[3] C. Puricica, Veeam, “Why snapshots alone are not backups”, https://www.veeam.com/blog/why-snapshots-alone-are-not-backups.html.
[4] W. Preston, “Why is Operational Recovery Needed?”, https://storageswiss.com/2017/06/16/why-is-operational-recovery-needed/.
[5] Cloudian, “Data Backup in Depth: Concepts, Techniques, and Storage Technologies”, https://cloudian.com/guides/data-backup/data-backup-in-depth/.
[6] Druva, “Recovery point objective definition”, https://www.druva.com/glossary/what-is-a-recovery-point-objective-definition-and-related-faqs/.
[7] C. Puricica, “Demystifying Recovery Objectives”, https://www.veeam.com/blog/rto-rpo-definitions-values-common-practice.html.
[8] D. Russel, Gartner Research, “Best Practices for Repairing the Broken State of Backup”, https://www.gartner.com/en/documents/2574917/best-practices-for-repairing-the-broken-state-of-backup.
[9] Techopedia, “What Does Backup Window Mean?”, https://www.techopedia.com/definition/991/backup-window.
[10] Acronis, “Agent vs Agentless Backup: Why it Matters”, https://www.acronis.com/en-sg/articles/agent-vs-agentless-backup/.
[11] Databarracks, “What are the advantages of Agent vs Agentless backup?”, https://www.databarracks.com/blog/what-are-the-advantages-of-agent-vs-agentless-backup.
[12] Techopedia, “Front and Back Ends”, https://www.techopedia.com/definition/24794/front-and-back-ends.
[13] PARABLU, “Demystifying Data Backups”, https://parablu.com/demystifying-data-backups-types-of-backups/.
[14] WEIBULL.COM, “Availability and the Different Ways to Calculate It”, https://www.weibull.com/hotwire/issue79/relbasics79.htm.
[15] W. Preston, “Why is Operational Recovery Needed?”, https://storageswiss.com/2017/06/16/why-is-operational-recovery-needed/.
[16] IBM, “IBM Spectrum Protect”, https://www.ibm.com/products/data-protection-and-recovery.
[17] DELL, “Dell EMC NetWorker Data Protection Software”, https://www.delltechnologies.com/en-in/data-protection/data-protection-suite/networker-data-protection-software.htm.
[18] Veritas, “NETBACKUP - Best-in-class enterprise data backup and recovery”, https://www.veritas.com/protection/netbackup.
[19] Chapter247, “AWS vs Azure vs Google Cloud - A detailed comparison of the Cloud Services Giants”, https://www.chapter247.com/blog/aws-vs-azure-vs-google-cloud-a-detailed-comparison-of-the-cloud-services-giants/.



Chapter 9

Disaster Recovery

[Figure 9.1 repeats the data center diagram of Figure 8.1, showing primary and secondary data centers with client devices (customers over the Internet, employees over MPLS), security (Chapter 11), monitoring (Chapter 10), automated provisioning, networks (Chapter 6), compute (Chapter 5), storage (Chapter 7), and backup and restore (Chapter 8); the DR replication components (Chapter 9) between the sites are highlighted.]

Figure 9.1 – Data Centers with representative infrastructure components (focus of this chapter highlighted)

A disaster is an adverse incident that prevents the operation of applications/infrastructure from
a data center. Based on the outage duration, the enterprise makes the call to declare a disaster.
Disaster recovery is the capability to recover successfully from a disaster in accordance with
pre-defined parameters; two important parameters define the characteristics of recovery. In Figure
9.1, showing the data center deployments, the representative components related to the focus of this
chapter, namely the DR capability, are highlighted (in a box with a dashed outline).

RPO is the Recovery Point Objective: the maximum allowable data loss in the event of a disaster. It is
measured in terms of time and depends on the maximum age of the data or files in backup storage[1].

Chapter 9: Disaster Recovery 77


RTO is the Recovery Time Objective: the maximum time allowed to recover from the adverse incident
and restore normal operations for users[2].

Replication options for DR: host-based, appliance-based, VM-snapshot, hypervisor-based, and
array-based replication.

DR operators and engineers have to recover all of the infrastructure and applications within the
specified RTO when a data center goes down. The state of the data and services is not known at the
time of the disaster, and the interdependencies between the various environments are also not well
understood at that time. The process of disaster recovery, at a high level, consists of the following[3] –
1. Identification of the following affected by the disaster –

a. Business applications.

b. Infrastructure assets.

c. Order in which the assets need to be recovered.

d. Interdependencies between the assets and applications.

2. Actual physical recovery of each of the identified assets.

3. Recovery of any data that may have been lost or damaged during the disaster.

4. Confirmation of the following after recovery –

a. Data quality.

b. Business application functionality.

c. Integration between various systems, applications, services, and platforms.

9.1 Disaster Recovery Characteristics


Successfully recovering from a data center outage as per pre-defined characteristics requires careful
planning, processes, automation, and regular readiness testing[4]. Table 9-1 provides an indicative set of
DR service characteristics[3].

DR Service Characteristics | DR-HIGH    | DR-MEDIUM  | DR-LOW
---------------------------|------------|------------|------------
Recovery Point Objective   | < 12 hours | < 24 hours | < 48 hours
Recovery Time Objective    | < 12 hours | < 24 hours | < 48 hours

Table 9-1: DR service characteristics



Tiers are defined to specify the DR characteristics. While called by various names in different
organizations, they broadly fall under three tiers referred to here as – DR-HIGH (Tier 1), DR-MEDIUM
(Tier 2), DR-LOW (Tier 3)[5].

The DR-HIGH, DR-MEDIUM, and DR-LOW tiers should not be confused with Compute criticality tiers,
although there is a clear mapping between them –

• The HIGH compute criticality tier maps to the DR-HIGH tier.

• The MEDIUM compute criticality tier maps to the DR-MEDIUM tier.

• The LOW compute criticality tier maps to the DR-LOW tier.
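The mapping above, together with the indicative RPO/RTO values of Table 9-1, can be captured in a small lookup table; the structure is illustrative:

```python
# Sketch: mapping compute criticality tiers to DR tiers with the
# indicative RPO/RTO values (hours) from Table 9-1.

DR_TIERS = {
    "HIGH":   {"dr_tier": "DR-HIGH",   "rpo_hours": 12, "rto_hours": 12},
    "MEDIUM": {"dr_tier": "DR-MEDIUM", "rpo_hours": 24, "rto_hours": 24},
    "LOW":    {"dr_tier": "DR-LOW",    "rpo_hours": 48, "rto_hours": 48},
}

def dr_requirements(compute_criticality):
    """Return the DR tier and RPO/RTO targets for a compute criticality tier."""
    return DR_TIERS[compute_criticality.upper()]
```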

The DR-HIGH tier is aligned to the needs of the production environment of the organization’s most
critical applications, and hence to the CRITICAL application criticality tier. In Table 9-1 it is characterized
as having an RPO and RTO of under 12 hours, with an indicative set of DR service characteristics.

The DR-MEDIUM tier is aligned to the needs of the production environment of the organization’s
STANDARD application criticality tier. In Table 9-1 it is characterized as having an RPO and RTO of
under 24 hours, with an indicative set of DR service characteristics.

The DR-LOW tier is aligned to the non-production environments of the organization’s NON-CRITICAL
application criticality tier. In Table 9-1 it is characterized as having an RPO and RTO of under 48 hours,
with an indicative set of DR service characteristics.

9.2 Disaster Recovery Process


Disaster recovery is reactive, responding to a disaster incident. A proactive step precedes it:
preparing for a disaster scenario, called business continuity planning[6]. Only when business continuity
(BC) planning is done right can disaster recovery (DR) succeed, with alignment of business
planning, architecture strategy, solutions, and operations. Disaster recovery, therefore, has two aspects
to it – preparation and execution.

9.2.1 Preparation
The key activities that are performed in preparing for DR[7] are shown pictorially in Figure 9.2:

1. Establish DR operational characteristics: RPO and RTO are two key characteristics.

2. Set up an alternate DR site, referred to as the secondary site, with the right model:

a. Active-active: Both the primary and secondary data centers run applications (workloads) in the production
environment and serve user requests. A load balancer distributes user requests across both
data centers.

b. Active-passive: The primary data center functions as the active application site, serves user requests,
and replicates critical business data to the secondary. The secondary data center is ready to be
activated to serve applications should the primary data center fail for any reason.



3. Form DR operations team with a clear and well-documented approach to be employed in the event
of a disaster. The DR team –

a. Maintains DR infrastructure and automation.

b. Defines procedures and processes to be employed during DR.

c. Conducts DR tests at regular intervals to ensure that the organization is DR-ready.

4. Establish procedures to replicate data to the DR site for identified applications: This is discussed in
the next section.

5. Declare DR when the disaster event occurs.

[Figure 9.2 depicts the preparation activities as a flow: establish DR operational characteristics (RPO and RTO); set up an alternate DR site with the right model; form the DR operations team; establish procedures to replicate data; and declare DR when a disaster event occurs.]

Figure 9.2: Preparation

9.2.2 Execution
All the preparations for DR come to fruition should a DR event occur[8]. The activities to be carried out as
part of DR execution are shown in Figure 9.3 –

1. Recover shared infrastructure platforms and services.

a. Shared infrastructure platforms: Network components, servers, block storage (SAN),


file storage (NAS), and object storage.

b. Shared infrastructure services: DNS, AD, Citrix, and so on.

2. Recover shared application platforms and services.

a. Shared application platforms: Virtualization servers (ESXi hosts) or private cloud platforms,
Mainframe, Mid-range (AIX and IBMi), SQL Server, Oracle RDBMS, IBM DB2, and so on.

b. Shared application services: Tomcat, WebSphere, MQ-Series.



3. Recover business application and services.

a. Provision infrastructure for business applications to be brought up in the secondary site.

b. Deploy applications and related services.

c. Attach storage with the correct data to the compute.

4. Test and switch all traffic to the secondary site.

[Figure 9.3 depicts DR execution as a flow: recover shared infrastructure platforms and services; recover shared application platforms and services; recover business applications and services; and test and switch all traffic to the secondary site.]

Figure 9.3 – DR Execution

9.3 Replication for DR


Backup is making copies of data and preserving them at one or more locations for operational recovery,
i.e., for when the actual data is corrupted or unavailable for some reason. Replication, on the other hand,
is copying data from the primary data center to the secondary data center for disaster recovery.

1. Host (or Guest-OS) based replication


Replication software is deployed on a physical or virtual server, and the data is replicated using the
software (e.g., Vision Solutions DoubleTake, Veritas Volume Replicator).

2. Appliance-based replication
Appliances are deployed in primary and secondary sites. Data that needs to be replicated is
deduplicated in the appliance and replicated (e.g., IBM Spectrum Protect server[9], DELL/EMC
RecoverPoint[10]).

Illustration
Figure 9.4 depicts IBM Spectrum Protect servers deployed in primary and secondary data centers.
The clients are deployed on each of the virtual servers with applications. The server uses the clients
on the primary data center to take backups of the applications’ data into SAN-attached Virtual Tape
Library (VTL). IBM Spectrum Protect is configured to work with an appliance to deduplicate and
replicate backups to the secondary data center.
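A rough sketch of the deduplication idea behind such appliances (a simplification for illustration, not vendor code): chunks of data are fingerprinted by content hash, and only chunks not already present at the secondary site are shipped.

```python
import hashlib

def dedup_chunks(data: bytes, chunk_size: int, seen: set) -> list:
    """Split data into fixed-size chunks and return only the chunks
    whose content hash has not been replicated before."""
    new_chunks = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen:      # chunk not yet at the secondary site
            seen.add(digest)
            new_chunks.append(chunk)
    return new_chunks

# A repeat backup of mostly unchanged data replicates far fewer chunks.
seen = set()
first = dedup_chunks(b"AAAA" * 256, chunk_size=64, seen=seen)
second = dedup_chunks(b"AAAA" * 256 + b"BBBB" * 16, chunk_size=64, seen=seen)
```

Of the second, larger backup only the single new chunk needs to cross the WAN link, which is why appliances deduplicate before replicating.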



Figure 9.4 – Appliance-based Replication (diagram: IBM Spectrum Protect servers in the primary data center and secondary DR site, with backup/restore clients on the x86 virtual servers and hardware appliances performing appliance-based replication between the sites)

3. VM-Snapshot Replication
VM-level snapshots are taken at the primary site and sent to the secondary site. They are
point-in-time snapshots created in the hypervisor (e.g., VMware ESXi) and replicated
asynchronously[11].

4. Hypervisor-based replication
The replication software plugs into the hypervisor and copies VM-level data from the primary
site to the secondary site. The data in the secondary site is continuously kept up to date either
synchronously or asynchronously with changes made on the primary site (e.g., Zerto Enterprise
Cloud Edition for VMs, VMware SRM).

Illustration
Figure 9.5 shows replication at the virtual machine level based on the Zerto platform that enables
replication at the hypervisor level[12].



The core of Zerto’s replication technology involves two components:

• Zerto Virtual Manager (ZVM) – manages disaster recovery, business continuity, and offsite backup functionality at the site level. It works with VMware vCenter or Microsoft System Center Virtual Machine Manager.

• Virtual Replication Appliance (VRA) – replicates the VMs and associated virtual disks (e.g., VMDKs). Essentially, one VRA is installed per ESXi/Hyper-V host.

The VRA copies data as it is created, before it leaves the hypervisor. This continuous block-level replication minimizes data loss in the event of a disaster.

Figure 9.5 – Hypervisor-based Replication (diagram: ZVM and VRA components alongside vCenter and the VMs on each host, performing VM-level replication from the primary data center to the secondary DR site)

5. Array-based (or storage-based) Replication
The data in SAN storage arrays (e.g., DELL/EMC PowerMax) and NAS/iSCSI storage arrays (e.g., NetApp FAS8000 series) may be replicated at the storage level by employing specific tools (e.g., DELL/EMC SRDF[13] or NetApp SnapMirror[14]).

Illustration
Figure 9.6 provides a diagrammatic view of storage-based replication with DELL/EMC’s Symmetrix Remote Data Facility (SRDF). It works with DELL/EMC’s PowerMax SAN storage and replicates data from the primary site to the secondary site synchronously or asynchronously.

The copy C1 shown in the primary site (Figure 9.6) is replicated to the secondary site as C2. A Business Continuity Volume (BCV) is created that is refreshed with updates from C2 to support the RPO. The BCV constitutes a “gold” copy of the data. Snapshots are also taken on the BCV for critical applications at regular intervals, so that rollback is easy should it be necessary.
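The relationship between snapshot frequency and the RPO can be sanity-checked with a small helper (the timestamps and the four-hour RPO below are assumed values for illustration):

```python
from datetime import datetime, timedelta

def meets_rpo(snapshot_times: list, disaster_time: datetime,
              rpo: timedelta) -> bool:
    """True if at least one snapshot taken before the disaster is recent
    enough that restoring it loses no more data than the RPO allows."""
    usable = [t for t in snapshot_times if t <= disaster_time]
    if not usable:
        return False
    return disaster_time - max(usable) <= rpo

# Snapshots every six hours, checked against a four-hour RPO.
snaps = [datetime(2022, 1, 1, h) for h in (0, 6, 12, 18)]
ok = meets_rpo(snaps, datetime(2022, 1, 1, 14), timedelta(hours=4))
```

A disaster at 14:00 is covered by the 12:00 snapshot, but one at 23:00 would breach the same RPO; this is why snapshot intervals must be chosen from the RPO, not the other way around.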



Figure 9.6 – Storage-based Replication (diagram: users reach the production application server in the primary data center over the MPLS network; DELL/EMC SRDF performs storage-level replication of copy C1 on production storage to copy C2 on DR storage, with a BCV gold copy and snapshots on the BCV at the secondary data center)

References
[1] Druva, “Recovery point objective definition”, https://www.druva.com/glossary/what-is-a-recovery-point-objective-definition-and-related-faqs/.
[2] C. Puricica, “Demystifying Recovery Objectives”, https://www.veeam.com/blog/rto-rpo-definitions-values-common-practice.html.
[3] “Process of Disaster recovery”, https://mksh.com/5-elements-of-a-disaster-recovery-plan-is-your-business-prepared/.
[4] E. Sullivan, “disaster recovery (DR)”, https://searchdisasterrecovery.techtarget.com/definition/disaster-recovery.
[5] T. G. Cagle, “The benefits of the three-tiered system of prioritizing recovery efforts”, http://www.instavisiontech.com/2021/07/12/the-benefits-of-the-three-tiered-system-of-prioritizing-recovery-efforts/.
[6] J. Moore, “What is BCDR? Business continuity and disaster recovery guide”, https://searchdisasterrecovery.techtarget.com/definition/Business-Continuity-and-Disaster-Recovery-BCDR.
[7] J. Sipple, MKS&H, “5 Elements of a Disaster Recovery Plan – Is Your Business Prepared?”, https://mksh.com/5-elements-of-a-disaster-recovery-plan-is-your-business-prepared/.
[8] R. Long, “Disaster Recovery Strategy Execution, or Will It Really Work?”, https://www.mha-it.com/2017/01/16/disaster-recovery-strategy-execution/.
[9] IBM, “Tivoli Storage Manager – Replication of client node data”, https://www.ibm.com/docs/en/tsm/7.1.1?topic=server-replication-client-node-data.
[10] DELL, “DELL EMC RecoverPoint”, https://www.delltechnologies.com/en-in/data-protection/recoverpoint.htm.
[11] VMware, “vSphere Replication”, https://www.vmware.com/in/products/vsphere/replication.html.
[12] Zerto, “Hypervisor-Based Replication”, https://www.zerto.com/wp-content/uploads/2019/09/hypervisor-based-replication.pdf.
[13] DELL EMC, “Dell EMC SRDF”, https://www.delltechnologies.com/asset/en-us/products/storage/technical-support/docu95482.pdf.
[14] NetApp, “SnapMirror software: Unified replication, faster recovery”, https://www.netapp.com/data-protection/backup-recovery/snapmirror-data-replication/.



Chapter 10

Monitoring

Figure 10.1: Data Centers with representative infrastructure components (the monitoring components, the focus of this chapter, are shown highlighted in the primary and secondary data centers)

Monitoring is a capability to capture and analyze vital parameters of infrastructure to respond promptly
and take corrective action when necessary. Several application components and infrastructure
components need to be monitored for smooth functioning and quick resolution of issues in a data
center or a cloud platform. In Figure 10.1, showing the data center deployments, the representative
components related to the focus of this chapter, namely monitoring capability, are highlighted (in a box
with a dashed outline).

Several techniques exist to monitor IT infrastructure and applications in a data center[1].

Chapter 10: Monitoring 85


1. IT Infrastructure Monitoring.

2. Application monitoring.

3. Event monitoring and correlation.

4. IT Operations Analytics (ITOA).

5. Artificial Intelligence Operations (AIOps).

10.1 IT Infrastructure Monitoring


IT Infrastructure Monitoring (ITIM) is the real-time monitoring of infrastructure hardware, processes,
and equipment in the data center or cloud[2]. It typically involves availability monitoring of servers,
hypervisors, storage, network, and database instances to collect resource utilization data. It also
includes functionality to perform historical data analysis and reporting. Infrastructure components support the Simple Network Management Protocol (SNMP) over UDP ports 161 (queries) and 162 (traps), through which they send unsolicited alert messages (traps). Monitoring tools read these messages to monitor an infrastructure component remotely, without the need for an agent.
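At the transport level, receiving a trap amounts to reading a UDP datagram; a minimal sketch follows (illustrative only: real trap payloads are ASN.1/BER-encoded and would be decoded by an SNMP library such as pysnmp, and binding to the standard port 162 requires privileges, so an unprivileged port is used here):

```python
import socket

def receive_trap(sock: socket.socket, bufsize: int = 4096):
    """Block until one datagram arrives and return (payload, sender).
    Decoding the BER-encoded trap body is left to an SNMP library;
    this sketch shows only the agentless, unsolicited transport."""
    payload, sender = sock.recvfrom(bufsize)
    return payload, sender

# Bind the listener (162 in production; an unprivileged port here).
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 10162))

# Simulate a device emitting an unsolicited trap.
device = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
device.sendto(b"linkDown trap (illustrative payload)", ("127.0.0.1", 10162))

payload, sender = receive_trap(listener)
device.close()
listener.close()
```

The key point the sketch illustrates is that the device pushes the message; the monitoring tool never has to install anything on it.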

Hardware Monitoring
Hardware monitoring is “nuts & bolts” monitoring of the hardware. It involves monitoring power supply,
fans, temperature, disks, arrays, memory, CPU and reports the health at the hardware level of servers
and other data center infrastructure components. An example of hardware monitoring software is
SolarWinds hardware monitoring software that monitors server hardware from different vendors[3].

Illustration
An example of hardware monitoring of DELL systems, devices, and components is DELL OpenManage[4].
An indicative list of parameters that can be monitored with DELL OpenManage is as follows:

• Fans
• Temperatures
• Power supplies
• Memory
• CPU
• Smart Array
• HBA
• Fibre Channel & iSCSI
• SMART drive monitoring
• MAC & IP address
• Link up/down



Server Monitoring
Servers host applications and other software to support business and infrastructure functions. Server
monitoring is the process of gaining visibility into the server’s system resources. It helps in capacity
planning by analyzing the server’s system resource usage.

The typical server parameters that are monitored are –

1. Server

a. CPU.

b. Memory.

c. Processes.

d. Operating system services.

e. File System/Disk.

2. Virtualization

a. Server virtualization – e.g., monitoring VM, disk, vCPU, memory, and resources that may be
reclaimed from large VMs in the environment to reduce inefficiency and improve performance. It
also includes monitoring clusters that have the highest resource demands, hosts that are being
heavily utilized, datastores running out of disk space, storage capacity, and utilization of the
vSAN environment.

b. Desktop virtualization – e.g., monitoring of critical parameters of Citrix virtual apps and
virtual desktops, including license server health, broker server connectivity, connection failures,
logon duration, latency, NetScaler connectivity, SSL certificate expiry, firmware upgrades, and
NetScaler backups.

Examples of server and virtualization monitoring tools are:

1. Nagios[5]: Monitors disk space on the server, memory, CPU usage, services on Windows/Linux,
license usage, server air temperature, WAN, and internet connection latencies.

2. IBM Tivoli Monitoring (ITM)[6]: Thresholds may be set for parameters such as disk, memory, and
CPU; base operating system and availability monitors then evaluate these parameters against the thresholds.

3. vRealize Operations (vROPS)[7]: Monitors VMware-based VM resources for server and desktop
virtualization.

4. SolarWinds Server & Application Monitor[8]: Monitors parameters of servers, including Citrix XenApp
and XenDesktop.
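The threshold mechanism these tools share can be sketched in a few lines (the metric names and threshold values are illustrative assumptions, not defaults of any tool):

```python
# Illustrative thresholds, in percent utilization.
THRESHOLDS = {"cpu": 85.0, "memory": 90.0, "disk": 80.0}

def evaluate(metrics: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return an alert string for every metric that breaches its threshold."""
    alerts = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value >= limit:
            alerts.append(f"ALERT: {name} at {value:.1f}% (threshold {limit:.1f}%)")
    return alerts

# One polling cycle's readings from a server.
alerts = evaluate({"cpu": 92.5, "memory": 61.0, "disk": 88.0})
```

In a real deployment the breaches would be forwarded as events to the correlation layer discussed later in this chapter, rather than printed locally.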



Storage Monitoring
Storage monitoring tracks the performance, availability, and health of storage devices, both physical
and virtual. Three forms of storage – block, file, and object storage, have been discussed in chapter 7.

The typical storage parameters that are monitored are –

1. IOPS.

2. Throughput.

3. Latency.

4. CPU utilization.

5. Queue depth.

6. Capacity.
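Several of these parameters are related: throughput is IOPS multiplied by average I/O size, and, by Little’s law, the average queue depth is approximately IOPS multiplied by latency. A quick sketch with assumed numbers:

```python
def throughput_mbps(iops: float, io_size_kb: float) -> float:
    """Throughput in MB/s from IOPS and average I/O size."""
    return iops * io_size_kb / 1024

def avg_queue_depth(iops: float, latency_ms: float) -> float:
    """Little's law: outstanding I/Os = arrival rate x time in system."""
    return iops * latency_ms / 1000

tp = throughput_mbps(20000, io_size_kb=8)    # 20k IOPS of 8 KB I/Os
qd = avg_queue_depth(20000, latency_ms=0.5)  # at 0.5 ms latency
```

These back-of-the-envelope relations are useful when reading monitoring dashboards: if queue depth climbs while IOPS stays flat, latency is rising.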

Examples of storage monitoring tools are:

1. Unisphere for DELL/EMC SAN storage[9]: Monitors cache write pending, SRDF consistency, front-end
(FE) and back-end (BE) utilization, thin pool usage, and disks.

2. OnCommand Unified Manager for NetApp NAS storage[10]: Monitors parameters such as volume and
aggregate space, chassis temperatures, power supplies, disks, shelves, and switches.

3. Elastic Cloud Storage (ECS) Probe[11]: Parameters for ECS object storage.

Network Monitoring
Network monitoring tracks the real-time status of network components such as routers, servers, and firewalls. It also monitors various aspects of a network and its operation, such as traffic, bandwidth utilization, and uptime. It helps detect device failures and can alert administrators to issues.

Network monitoring tools can proactively identify deficiencies, improve efficiency, and ensure that the network is running optimally.

Examples of network monitoring tools are:

1. Paessler PRTG network monitoring tool[12]: Monitors network bandwidth, network traffic, SNMP, WMI, SSH, and other parameters.

2. SevOne Network Data Platform[13]: Collects multi-vendor network performance metrics and flow data.



10.2 Application Monitoring
Application components are deployed on a variety of servers. Monitoring of the components and the
associated servers includes the following:

1. Web server – e.g., IIS, Apache.

2. Application server – e.g., Tomcat, WebSphere, JBoss.

3. Database servers – e.g., SQL Server, Oracle, DB2.

4. Middleware servers – e.g., MQ Series.

5. ERP servers – e.g., SAP.

The key areas of application monitoring are[14] –

1. Application uptime/availability monitoring: Continuously checks to see if the application is up and


running and responding to requests it receives.

2. Application performance monitoring (APM): Tracks key software application performance metrics
starting at the entry point of the web server/ application server. APM is being extended to include
the front-end, namely, the web browser, mobile, or IoT application which has traditionally been
part of end-user experience monitoring (EUEM). Extending APM to include EUEM helps get an end-
to-end perspective, optimize service performance and response time, and improve user experience.
Such a performance analysis discipline formed by APM and EUEM is referred to as digital experience
monitoring (DEM). DEM includes –

a. Synthetic monitoring – active (controlled) monitoring that simulates user actions by recording
and replaying them while measuring performance and availability. Synthetic monitoring is performed for both
types of front-ends –

i. Web-based.

ii. Non-web based.

b. Real user monitoring – passive monitoring of actual user interactions, typically by injecting JavaScript into
each page and capturing and analyzing the responses (e.g., DCRUM).

A group of software vendors developed and specified the Application Performance Index (Apdex)
to report application performance. The anticipated satisfaction of a user is assessed and
reported as a numerical score[15]. Tools such as Dynatrace generate Apdex reports.
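The Apdex formula itself is simple: samples at or under a target time T count as satisfied, samples up to 4T count as tolerating (at half weight), and the rest as frustrated. A sketch with assumed response times:

```python
def apdex(response_times_s, target_t: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total samples,
    where satisfied means <= T and tolerating means <= 4T."""
    satisfied = sum(1 for t in response_times_s if t <= target_t)
    tolerating = sum(1 for t in response_times_s
                     if target_t < t <= 4 * target_t)
    return (satisfied + tolerating / 2) / len(response_times_s)

# Six fast, two tolerable, and two frustrated samples against T = 0.5 s.
score = apdex([0.2, 0.3, 0.4, 0.4, 0.5, 0.5, 1.0, 1.5, 3.0, 5.0], 0.5)
```

The score ranges from 0 (all users frustrated) to 1 (all satisfied); the sample above yields 0.70.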

3. Application error monitoring: Finds application bugs to enable developers to prioritize and fix them.
Two types of error monitoring[16]–

a. Front-end monitoring – detects issues with front-end components deployed in web servers and
application servers.

b. Back-end monitoring – detects errors in integrating with back-end components like databases,
middleware, and ERP servers.



4. Application log monitoring: Gathers, analyzes, and draws correlations from application log
data to provide insights into status and issues. Logging is enabled in the application by developers
using logging frameworks. One popular, free and open stack of tools for logging and related
monitoring is the ELK stack, which stands for Elasticsearch, Logstash, and Kibana[17].
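A minimal sketch of the kind of structured logging that feeds such a stack: each record is emitted as one JSON line, which Logstash or a log shipper can ingest directly (the field names here are illustrative, not an ELK requirement):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for log shippers."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order %s accepted", "A-1001")  # emitted as a JSON line

# A formatted sample record, for inspection.
sample = JsonFormatter().format(
    logging.LogRecord("orders", logging.INFO, __file__, 10,
                      "order %s accepted", ("A-1001",), None))
```

Structured one-line-per-event output spares Logstash fragile multi-line parsing and makes the fields directly searchable in Kibana.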

5. Application database monitoring: Monitors interaction between the application and its database
and performance of the database to identify issues with the database that could affect the
efficient working of the overlying application. A tool such as Dynatrace can do database
monitoring and report on all key parameters related to database accesses and responses
(e.g., what tables are being used, the latency, and so on).

6. Application security monitoring: Monitors the application for security issues, including malware
and other threats. Tools such as those from Contrast Security are used for this purpose[18].

Enterprises deploy application full-stack monitoring tools that provide a “360-degree” view of an
application. Such tools perform active monitoring by analyzing how the application behaves in normal
and abnormal scenarios and raise alerts. They create baselines and continuously refine them. For
instance, Dynatrace Manage OneAgent[19] is a full-stack monitoring tool that monitors processes,
services, application traffic, resources (CPU, Memory, Disk, Network), application response time,
transaction failures, errors, slowness, log monitoring, application availability.

10.3 Event Monitoring and Correlation


“Event monitoring in IT is the process of collecting, analyzing, and signaling event occurrences to
operating system processes, active database rules, and human operators”[20]. Events received by the
individual monitoring tools are forwarded to event monitoring and correlation tools.

Several types of events may be generated from a variety of sources due to changes to infrastructure
devices, computing resources, increasing data volumes, and many other reasons[21] –

1. Operating System events: Produced by operating systems (Windows, Linux, Unix, iOS, Android).

2. System events: Generated for abnormal states or system health and resource changes.

3. Network events: Produced by network ports, switches, or routers related to the health of the devices.

4. Web server events: Originated from web servers such as Microsoft IIS or Apache HTTP Server and their
related hardware and software.

5. Application events: Generated from business activity monitoring software for business transactions.

6. Database events: Related to reading, updating, and storing data in databases.

7. Other Data center devices: Generated from synthetic checks, probes, real user monitoring, and
client telemetry for user interactions.



Event correlation involves first collecting monitoring alerts from across the infrastructure landscape.
Subsequently, event aggregation, filtering, deduplication, and normalization are performed to categorize
them into groups (or clusters)[21]. This process enables the identification of issues. The issues, thus
identified, may be integrated with an IT Operations Management (ITOM) module of ITSM solution (e.g.,
ServiceNow). ITOM module implements the service operation process of ITSM discussed in chapter 1.

Examples of event monitoring and correlation tools are ScienceLogic SL1[20], EMC Smarts[22],
Opsview, and Splunk[26].
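The aggregation, filtering, and deduplication steps described above can be sketched as follows (the event fields and the grouping key are illustrative assumptions):

```python
from collections import defaultdict

def correlate(events: list) -> dict:
    """Deduplicate events, then group them by (source, type) so that
    one cluster represents one probable underlying issue."""
    seen = set()
    clusters = defaultdict(list)
    for event in events:
        key = (event["source"], event["type"], event["message"])
        if key in seen:          # drop exact duplicates
            continue
        seen.add(key)
        clusters[(event["source"], event["type"])].append(event)
    return dict(clusters)

raw = [
    {"source": "sw-01", "type": "linkDown", "message": "port 12 down"},
    {"source": "sw-01", "type": "linkDown", "message": "port 12 down"},  # dup
    {"source": "sw-01", "type": "linkDown", "message": "port 14 down"},
    {"source": "db-03", "type": "diskFull", "message": "/var at 98%"},
]
clusters = correlate(raw)
```

Four raw alerts collapse into two clusters, which is what lets an operations team work a short list of issues instead of a flood of events.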

Figure 10.2 provides a pictorial representation of the integration of the monitoring tools for event
monitoring and correlation.

Figure 10.2 – Event Monitoring (diagram: an ITSM solution, e.g., Micro Focus SMAX or ServiceNow, with its ITOM module receives issues from an event monitoring and correlation layer, e.g., SL1 or Opsview; that layer collects alert data from hardware/server/storage monitoring tools, e.g., ITM, Nagios, vROps, and network monitoring and performance tools, e.g., SevOne, which in turn monitor servers, storage, gateways, load balancers, routers, switches, VPN servers, firewalls, and other data center devices)



10.4 IT Operations Analytics (ITOA)
Many events are generated continuously in any data center. Event monitoring and correlation alone may
result in a significant number of issues that the operations team must address reactively. With
advancements in big data and analytics technologies, the event monitoring and correlation described earlier
may be taken to the next level using these techniques.

IT Operations Analytics is “the practice of monitoring systems and gathering, processing, analyzing and
interpreting data from various IT operations sources to guide decisions and predict potential issues”[24].

While Big Data technologies, such as Hadoop and Cassandra, are well-suited to run analytics on
massive amounts of data and extract intelligence, there are many specialized tools for ITOA. Examples
are Elastic, Sumo Logic, Evolven, Micro Focus OpsBridge[23], Splunk[26].

ITOA tools provide analytics functionality, generally static in nature, to analyze past monitoring data and
determine issues. To do so, data from multiple sources is correlated and analyzed using specialized
techniques. In other words, ITOA is about using data mining techniques to discover patterns and
correlations, determine complex issues and their root causes, and enable operations teams to resolve them.

However, the frequently changing infrastructure in distributed environments has presented several
limitations to the analytics provided by the ITOA tools. ITOA tools and solutions have evolved to include
AI/ML and predictive capabilities in analytics to overcome the limitations. That led to concepts of AIOps.

10.5 Artificial Intelligence Operations (AIOps)


While ITOA applies Big Data and related analytics mainly to past data that is relatively
static, AIOps extends these with AI/ML techniques. Combining Big Data with AI/ML adds
prediction capabilities for issues and outages in real-time, dynamic environments[25]. These
intelligence techniques can also trigger self-healing, i.e., automated remedial action.

An AIOps tool has event correlation, anomaly detection, and root cause determination capabilities. When
applied to IT infrastructure monitoring data using AIOps tools, the machine learning algorithms enable
operations teams and applications teams to work efficiently to detect issues early and resolve them
quickly to minimize the impact on business and customers. With AIOps, analytics may be performed on
massive amounts of complex data in changing IT environments to predict and prevent outages, improve
uptime, and resolve issues using automation as and when they arise.
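One of the simplest building blocks of such anomaly detection is a z-score test against a metric’s recent history (a toy sketch; production AIOps models are considerably more sophisticated):

```python
from statistics import mean, stdev

def is_anomaly(history: list, latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest sample if it deviates from the historical mean
    by more than z_threshold standard deviations."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# CPU utilization history hovering around 42%.
cpu_history = [41.0, 43.5, 40.2, 44.1, 42.3, 39.8, 43.0, 41.7]
spike = is_anomaly(cpu_history, 97.0)   # sudden jump
normal = is_anomaly(cpu_history, 42.0)  # within the usual range
```

The value of learning a baseline, as AIOps tools do, is precisely that 97% is anomalous for this server even though it would be unremarkable for another.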

Examples of AIOps tools are Micro Focus OpsBridge[23] and Splunk ITSI[26]. Many AIOps tools are powerful,
with all the capabilities of event monitoring/correlation tools and ITOA tools[27]. They can effectively
replace other legacy tools in the environment that may have implemented those capabilities and bring in
prediction, root cause analysis, and self-healing capabilities. Figure 10.3 depicts a monitoring solution
with AIOps.



Figure 10.3 – Monitoring with AIOps (diagram: the same layered structure as Figure 10.2, with an AIOps platform, e.g., OpsBridge or Splunk, in place of the event monitoring and correlation layer)



References
[1] V. Bieri, Nexthink, “The Periodic Table of ITOps Tools”, https://www.nexthink.com/blog/the-periodic-table-of-itops-tools/.
[2] AppDynamics, “What is Infrastructure Monitoring? Best Practices & Use Cases”, https://www.appdynamics.com/topics/what-is-infrastructure-monitoring#~5-application-performance-monitoring-resources.
[3] SolarWinds, “Hardware Monitoring Software”, https://www.solarwinds.com/server-application-monitor/use-cases/hardware-monitor.
[4] DELL/EMC, “OpenManage Systems Management Solutions – Simplify, automate and optimize your IT operations”, https://www.delltechnologies.com/en-in/solutions/openmanage/index.htm.
[5] Nagios, “What can Nagios help you do?”, https://www.nagios.org/.
[6] IBM, “Overview of IBM Tivoli Monitoring”, https://www.ibm.com/docs/en/itcam-app-mgr/7.2.1?topic=introduction-overview-tivoli-monitoring.
[7] S. Burns, “Use vROps Manager to perform system health monitoring”, https://searchvmware.techtarget.com/tip/Use-vROps-Manager-to-perform-system-health-monitoring.
[8] SolarWinds, “SolarWinds Server & Application Monitor Features”, https://www.solarwinds.com/server-application-monitor/use-cases.
[9] DELL, “Unisphere”, https://www.delltechnologies.com/en-in/learn/data-storage/unisphere.htm.
[10] NetApp, “Introduction to OnCommand Unified Manager”, https://docs.netapp.com/ocum-95/index.jsp.
[11] Broadcom, “Dell EMC Elastic Cloud Storage (ECS) Monitoring”, https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/ca-unified-infrastructure-management-probes/GA/alphabetical-probe-articles/monitor-technologies-using-restmon-probe/dell-emc-elastic-cloud-storage-ecs-monitoring.html.
[12] Paessler, “PRTG Network Monitor”, https://www.paessler.com/prtg.
[13] SevOne, “SevOne Network Data Platform”, https://www.sevone.com/products/sevone-data-platform/.
[14] Antonio C., “The 6 Main Types of Application Monitoring”, https://inspirationfeed.com/types-of-application-monitoring/.
[15] A. Gillis, “Application Performance Index (Apdex)”, https://searchitoperations.techtarget.com/definition/Application-Performance-Index-Apdex.
[16] O. Peter, Refugeictsolution.com.ng, “Application Monitoring and Error Tracking”, https://refugeictsolution.com.ng/2021/05/13/application-monitoring-and-error-tracking/.
[17] Elastic, “Free and open log monitoring”, https://www.elastic.co/log-monitoring.
[18] Contrast Security, “Application Security Monitoring”, https://www.contrastsecurity.com/use-cases/application-security-monitoring-asm.
[19] Dynatrace, “Dynatrace OneAgent”, https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-oneagent/.
[20] ScienceLogic, “Event Monitoring”, https://sciencelogic.com/glossary/event-monitoring.
[21] S. Stradley, BigPanda, “The Definitive Guide to Event Correlation in AIOps: Processes, Examples, and Checklist”, https://www.bigpanda.io/blog/event-correlation/.
[22] VMware/EMC, “EMC Smarts Server Manager”, https://docs.vmware.com/en/VMware-Smart-Assurance/9.4.2/302-003-101_01_Smarts_Server_Manager_942_user_and_config_guide.pdf.
[23] Micro Focus, “What is Event Correlation?”, https://www.microfocus.com/en-us/what-is/event-correlation.
[24] TechTarget Contributor, “IT operations analytics (ITOA)”, https://searchitoperations.techtarget.com/definition/IT-operations-analytics-ITOA.
[25] J. Compeer, StackState, “From ITOA to AIOps: 3 Key Differences You Should Know”, https://www.stackstate.com/blog/from-itoa-to-aiops-3-differences-you-should-know.
[26] Splunk, “Splunk IT Service Intelligence (ITSI)”, https://www.splunk.com/en_us/software/it-service-intelligence.html.
[27] Enlyft, “IT Management Software products”, https://enlyft.com/tech/it-management-software.



Chapter 11

Security

Figure 11.1: Data Centers with representative infrastructure components (the security components, the focus of this chapter, are shown highlighted in the primary and secondary data centers)

The one topic that has been most concerning to the CXOs of an enterprise is security. The widespread
adoption of penetration techniques by state and non-state actors has reached a point where the question
CXOs raise is not if but when the security of their IT systems will be compromised.

Security, in the context of infrastructure, is the capability developed by implementing a comprehensive


set of IT security solutions to address the security risks in the data center and the cloud. In Figure
11.1, showing the data center deployments, the representative components related to the focus of this
chapter, namely security capability, are highlighted (in a box with a dashed outline).

Chapter 11: Security 95


Information security in enterprises is established through security controls based on a model referred
to as CIA Triad. The CIA triad guides organizations on three dimensions – Confidentiality, Integrity,
and Availability[1].

• Confidentiality is the protection of sensitive and private information from unauthorized access.
• Integrity is the protection of data from unauthorized changes to ensure overall accuracy, consistency, and
completeness.
• Availability is to ensure access to the systems and the resources for authorized users.
The key security solutions for an enterprise to support the CIA triad fall under the following categories –

1. Access Security.
2. Connectivity Security.
3. Data Security.
4. Application Security.
5. Cyber Security.

11.1 Access Security


Access security involves authentication, authorization, and audit trail for all accesses to applications and
infrastructure in a data center or cloud through web clients, mobile clients, devices, and APIs exposed by
enterprise systems. “Authentication confirms that users are who they say they are. Authorization gives
those users permission to access a resource”[2]. The principle of least privilege (POLP), which restricts
users’ access rights to minimal privilege required to perform their function, is a best practice employed
for access security[24].

The starting point to implement access security is to establish the identity of users (human and system).
Therefore, an identity and access management (IAM) solution needs to be implemented that provides
access to enterprise resources. “Identity and access management, or IAM, is the security discipline that
makes it possible for the right entities (people or things) to use the right resources (applications or data)
when they need to, without interference, using the devices they want to use”[3]. Users and their privileges
are added, modified, and deleted across various systems using an identity management system.
An identity management system uses a directory to store all user definitions and privileges for each
user. An access manager authenticates and authorizes users against the directory before providing
access to the enterprise system[4].

Each enterprise system user typically uses five or more applications to get their work done. Establishing
a separate identity in each application and managing it is a nightmare both for the enterprise and its
users. Hence, a single sign-on system is needed to provide easy access to valid users and an elegant
mechanism for the enterprise to control its resources.
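The authenticate-then-authorize flow against a directory can be sketched as follows (the in-memory directory, hashing scheme, and privilege names are illustrative assumptions; real deployments use an LDAP directory and salted credential storage):

```python
import hashlib

# Illustrative directory: user -> (password hash, granted privileges).
DIRECTORY = {
    "asmith": (hashlib.sha256(b"s3cret").hexdigest(), {"invoices:read"}),
}

def authenticate(user: str, password: str) -> bool:
    """Confirm the user is who they say they are."""
    entry = DIRECTORY.get(user)
    return entry is not None and \
        hashlib.sha256(password.encode()).hexdigest() == entry[0]

def authorize(user: str, privilege: str) -> bool:
    """Least privilege: grant only what the directory explicitly lists."""
    entry = DIRECTORY.get(user)
    return entry is not None and privilege in entry[1]

ok = authenticate("asmith", "s3cret") and authorize("asmith", "invoices:read")
denied = authorize("asmith", "invoices:write")
```

Keeping both checks against one directory is what a single sign-on system buys: the user authenticates once, and every application consults the same source of privileges.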

Illustration
Figure 11.2 depicts the access security solution with Oracle products added to the basic network diagram
discussed in chapter 6. When a request comes to the web server for access to a particular application



functionality, the request is sent to Oracle Access Manager (OAM). The OAM application checks to see if
the user has already been authenticated. If not, it makes a request to an Oracle Internet Directory (OID) that
verifies the credentials and provides any authorization information, such as privileges, set for the user. The
identity and authorization information of a user is managed by an Oracle Identity Management system[5].

Figure 11.2 – Access Security

11.2 Connectivity Security


Connectivity security is an umbrella term used in this book for the mechanisms that may be considered
for securing user requests when establishing connectivity with applications in a data center or cloud.

There are two aspects to connectivity security.

1. Network architecture for secure communication: The security concepts addressed through enterprise
network architecture, including VLANs, firewalls, routers, three-tier architecture, spine-leaf
architecture, and microsegmentation, have been discussed in Chapter 6.

2. Security solutions for enhanced connectivity security: There are specific security solutions to
provide enhanced connectivity security for data center and cloud solutions. The key connectivity
security solutions are –

a. Web Application Firewall (WAF).
b. Intrusion Prevention System/Intrusion Detection System (IPS/IDS).
c. Data Leakage Prevention (DLP).
d. Virtual Private Network (VPN).



Figure 11.3 depicts the use of these solutions as an enhancement to the basic network diagram
discussed in Chapter 6.

Figure 11.3: Connectivity Security

a. Web Application Firewall: “A web application firewall (WAF) is an application firewall for HTTP
applications”[6]. Like a reverse proxy, it protects the servers hosting one or more web
applications by inspecting and filtering traffic between each web application and the internet.
In Figure 11.3, WAF is shown deployed in the DMZ as an appliance that examines HTTP traffic
before it reaches the web server. It may also be deployed as a server-side software plugin or
packaged as a cloud service to detect and filter threats that could expose online applications to
denial-of-service (DoS) or degrade performance. WAFs may be stateless or stateful.

Examples of attacks that WAF may be used to filter are SQL injection, Cross-site Scripting (XSS),
Layer 7 DoS, Cookie poisoning, Web scraping, and Unvalidated input.
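A WAF’s core inspection step can be sketched as matching requests against attack signatures before they reach the web server. The three regular-expression rules below are purely illustrative; production WAFs use far larger, curated rule sets (e.g., the OWASP ModSecurity Core Rule Set) together with input normalization and anomaly scoring.

```python
import re

# Hypothetical WAF signatures (illustrative only).
SIGNATURES = [
    (re.compile(r"'\s*(or|and)\s*'?\d+'?\s*=\s*'?\d+", re.I), "SQL injection"),
    (re.compile(r"<\s*script", re.I), "Cross-site scripting"),
    (re.compile(r"\.\./"), "Path traversal"),
]

def inspect_request(query_string):
    """Return the first matching threat name, or None if the request looks clean."""
    for pattern, threat in SIGNATURES:
        if pattern.search(query_string):
            return threat
    return None
```

A deployment would drop or log any request for which inspect_request() returns a threat name and forward the rest to the web server.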

b. Intrusion Prevention System/Intrusion Detection System (IPS/IDS): An Intrusion Detection
System (IDS) monitors network packets and raises alerts on detecting known signatures of
threats such as malware[7]. An Intrusion Prevention System (IPS) takes this a step further by
rejecting network packets that represent security threats. Figure 11.3 depicts the deployment
of IDS/IPS in the DMZ before the traffic reaches the web server.
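The alert-versus-reject distinction between IDS and IPS can be sketched as below. The signatures are illustrative only (the first is the opening prefix of the harmless EICAR test string); real systems match thousands of curated signatures and protocol anomalies.

```python
# Hypothetical signature list (illustrative only).
MALWARE_SIGNATURES = [b"X5O!P%@AP", b"\x90\x90\x90\x90\x90"]

def ids_inspect(packet, alerts):
    """IDS: raise an alert on a known signature, but still forward the packet."""
    for sig in MALWARE_SIGNATURES:
        if sig in packet:
            alerts.append(sig)
    return packet          # forwarded regardless

def ips_inspect(packet, alerts):
    """IPS: raise an alert and reject (drop) the offending packet."""
    for sig in MALWARE_SIGNATURES:
        if sig in packet:
            alerts.append(sig)
            return None    # dropped
    return packet
```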



c. Data Leakage Prevention (DLP): Data Loss Prevention (DLP) involves the detection and prevention
of data loss through unwanted destruction of sensitive data (e.g., a ransomware attack) and of data
leakage through the unauthorized transmission of data from within an organization to an external
recipient[8]. DLP threats occur through the web, email, and data storage devices. DLP solutions
enable discovery and control of sensitive data by setting business rules that classify confidential
information to prevent its disclosure, whether malicious or accidental.

In Figure 11.3, the DLP solution is shown deployed in the DMZ.
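The business-rule classification at the heart of DLP can be sketched as content inspection of outbound messages. The two rules below (a US SSN pattern, and card-like digit runs validated with the Luhn check) are hypothetical examples of such rules, not a real product’s policy set.

```python
import re

# Hypothetical DLP classification rules (illustrative only).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(number):
    """Luhn checksum: weeds out random digit runs that are not card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
                for i, d in enumerate(digits))
    return total % 10 == 0

def allow_outbound(message):
    """Block a message classified as containing sensitive data."""
    if SSN.search(message):
        return False
    for match in CARD.finditer(message):
        if luhn_ok(match.group()):
            return False
    return True
```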

d. Virtual Private Network (VPN): A Virtual Private Network (VPN) enables an encrypted and
protected connection over public networks. The encryption takes place in real time, making
it more difficult for third parties to track the websites that users are accessing, and thus provides
anonymity. A VPN server is a specially configured server to which users connect using a VPN
client. Any traffic from an application (e.g., a browser) configured to work with the VPN client
goes through the VPN client and is encrypted before being sent to the VPN server over a public
network. The internet service provider (ISP) and other third parties cannot view which websites
and information the users look at[9].

In Figure 11.3, a VPN client is shown that a user on the internet uses to access the organization’s
applications via a VPN server deployed in the DMZ.

11.3 Data Security


Data security is required when data is at rest, in use, and in motion[10].

1. Data at rest: Data at rest may exist in block storage, file storage, or object storage, discussed
in Chapter 7. Encryption and data masking techniques are widely used to ensure that data is
unusable, even if it reaches unauthorized users. Safe disposal of data is ensured using data wipe
techniques. Organizations may also deploy Data Leakage Prevention (DLP) solutions as an extra
measure to protect data at rest. Regulations such as GDPR and CCPA have been enacted to ensure
organizations build and deploy systems that implement data protection for personally identifiable
information (PII). Several security considerations need to be addressed for block, file, and object
storage[11].

a. Block storage security: A SAN security strategy involves multiple integrated layers or zones of
security, e.g., Zone A between switches, Zone B between servers and switches, and so on.
By doing so, the failure of one layer or zone does not compromise the data under protection.
Traditional FC SANs enjoy a natural security advantage over IP-based networks: an FC SAN is
an isolated private environment with fewer nodes than an IP network and is therefore vulnerable
to fewer security threats. However, there is no single comprehensive security solution available
for SANs. LUN masking and zoning, switch-wide and fabric-wide access control, role-based
access control, and logical partitioning of a fabric (virtual SAN) are the most commonly used
SAN security methods.

b. File storage security: NAS storage may be compromised by viruses, worms, unauthorized access,
snooping, and data tampering. Permissions and ACLs constitute the first level of protection for
NAS resources by restricting accessibility and sharing. These permissions are in addition to the
default attributes associated with files and folders. NAS devices use file service protocols
(NFS/CIFS) over Ethernet; therefore, authentication and authorization are implemented and
supported on NAS devices in the same way as in a Windows or UNIX file services environment.
To establish users’ identities and privileges, authentication and authorization mechanisms, such
as Kerberos and directory services, are implemented. Likewise, firewalls are deployed to protect
the storage infrastructure from unauthorized access.

c. Object storage security: Security for data at rest in object storage is essentially implemented
using encryption. Typically, each object is encrypted with its own encryption key, and the
encryption keys themselves are encrypted with a master encryption key. Client-side encryption
is also often employed to encrypt objects before storing them in object storage. The encryption
is supported by security keys, either provided natively by the vendor or by an external key
manager.
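The key hierarchy described above (a per-object key, itself wrapped by a master key) can be sketched as follows. The "cipher" here is a toy XOR keystream built from SHA-256 and is NOT secure; it stands in for an authenticated cipher such as AES-GCM purely so the wrapping structure is visible.

```python
import hashlib
import secrets

def _keystream_xor(key, data):
    # Toy XOR keystream (NOT secure) -- placeholder for a real cipher.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

def store_object(master_key, plaintext):
    object_key = secrets.token_bytes(32)       # unique key per object
    return {
        "ciphertext": _keystream_xor(object_key, plaintext),
        # The object key never leaves storage unencrypted:
        "wrapped_key": _keystream_xor(master_key, object_key),
    }

def read_object(master_key, stored):
    object_key = _keystream_xor(master_key, stored["wrapped_key"])
    return _keystream_xor(object_key, stored["ciphertext"])
```

Rotating the master key then only requires re-wrapping the small per-object keys, not re-encrypting every object.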

2. Data in use: Data in use is active and frequently accessed or updated by multiple users through
applications. The techniques employed to secure data in use restrict access by user role and
limit system access to only those who need it, with controls in place before access to content is
provided. In specific cases, information rights management (IRM) and digital rights protection
may be applied to ensure that only authorized users can use sensitive information. Another
approach is to mask personally identifiable and sensitive data before providing it to less secure
environments. In some cases, it may be sufficient to provide metadata to consumers instead of
raw data. This approach can help prevent the leakage of sensitive information. Products are also
becoming available that may be used to encrypt data in use (e.g., Sotero KeepEncrypt™).
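Masking personally identifiable data before it is handed to a less secure environment can be sketched as below. The field names and masking rules are hypothetical; real masking products apply configurable, often format-preserving, rules.

```python
def mask_email(email):
    # Keep only the first character of the local part, e.g. a***@example.com
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def mask_record(record):
    masked = dict(record)          # never mutate the source record
    if "email" in masked:
        masked["email"] = mask_email(masked["email"])
    if "ssn" in masked:
        # Keep only the last four digits, e.g. ***-**-6789
        masked["ssn"] = "***-**-" + masked["ssn"][-4:]
    return masked
```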

3. Data in motion: Data in motion is data in the process of being transferred from an environment
in which it is at rest (e.g., storage) to an environment whose security cannot be guaranteed, such
as one subject to third-party services. Encryption is the key technique employed in protecting
data in motion. Data is encrypted before it traverses any external or internal networks, using
protected tunnels such as HTTPS (SSL/Transport Layer Security), VPNs, and Generic Routing
Encapsulation. Several types of encryption need to be selectively applied, keeping in mind the
security requirements for data in motion. Two widely applied encryption techniques are[12] –

a. Symmetric encryption: Involves converting plaintext to ciphertext using the same key for
encryption and decryption. Examples are the Advanced Encryption Standard (AES) and Triple DES.

b. Asymmetric encryption: Uses two mathematically related keys, one to encrypt the data and
another to decrypt it. Examples are the Diffie-Hellman key exchange and RSA.
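The Diffie-Hellman key exchange mentioned above can be sketched with Python’s built-in modular exponentiation. The parameters below are deliberately toy-sized for readability; real deployments use standardized groups (e.g., the RFC 3526 MODP groups).

```python
import secrets

P = 0xFFFFFFFB   # a prime modulus (toy size, illustrative only)
G = 5            # generator (toy choice)

def keypair():
    # Private value kept secret; public value may be sent over the network.
    private = secrets.randbelow(P - 2) + 1
    return private, pow(G, private, P)

# Each party combines the peer's public value with its own private value;
# both arrive at the same shared secret without ever transmitting it.
a_priv, a_pub = keypair()
b_priv, b_pub = keypair()
shared_a = pow(b_pub, a_priv, P)   # (g^b)^a mod p
shared_b = pow(a_pub, b_priv, P)   # (g^a)^b mod p
```

The shared secret would then seed a symmetric cipher (e.g., AES) for the bulk of the data in motion.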

11.4 Application Security


Applications carry out an enterprise’s business processes and are used by both users and third-party
systems. Applications are, therefore, the most common target of external attacks, which exploit their
weaknesses and software vulnerabilities.

Application security is the discipline of security that involves processes, practices, and tools to protect
applications from threats and security weaknesses throughout the application life cycle[13]. Securing

Chapter 11: Security 100

applications involves remediating weaknesses and protecting applications not just during design and
development but also after deployment. The remediation process consists of identifying, planning, and
fixing security weaknesses during development, and protecting applications at runtime.

Security weaknesses are identified during development through static, dynamic, and interactive testing
performed by security scanning tools. Fixes are planned and incorporated into releases, keeping
in mind the dependencies involved. Runtime protection defends deployed applications against
malicious attacks in real time in the production environment.

Development time: The main activities conducted by developers for developing secure applications are:

1. Design secure applications to avoid known vulnerabilities and risks (e.g., the OWASP Top 10 from
the Open Web Application Security Project).

2. Use code repositories to develop secure and version-controlled code (e.g., GitHub).

3. Check out code from the repository.

4. Develop code, unit test, and run static code analysis (e.g., Java code with Eclipse and unit testing
with JUnit). Static code analysis is performed using Static Application Security Testing (SAST).

SAST[14]:
a. Static application security testing is a white box testing method to test code for vulnerabilities.

b. Typically, it plugs into IDEs to run scans on code.

c. Examples of vulnerabilities that can be identified through SAST are:

i. SQL injection.

ii. Cross-site scripting.

iii. Buffer Overflows.

d. Examples of SAST tools are:

i. Klocwork.

ii. SpectralOps.

iii. Veracode.

5. Check in code (e.g., create a pull request to merge code).

6. Create software build, run integration tests & conduct sanity tests to have a working application.

7. Deploy & run tests. Functional, non-functional, dynamic analysis and interactive security testing
are performed. Dynamic analysis is performed using Dynamic Application Security Testing (DAST),
Interactive Application Security Testing (IAST), and Software Composition Analysis (SCA).



DAST[15]:
a. Dynamic application security testing (DAST) is a black-box testing method performed by running
tests on an application in execution.

b. Simulates external attacks and identifies vulnerabilities based on outcomes.

c. Examples of vulnerabilities that can be identified through DAST are:

i. Path traversal.

ii. Insecure server configuration.

iii. SQL/Command injection.

iv. Cross-site scripting.

d. Examples of DAST tools are:

i. Acunetix.

ii. AppScan.

iii. Netsparker.

IAST[16]:
a. Interactive Application Security Testing (IAST) is a white box testing method on an application
instrumented with specific interfaces to identify vulnerabilities.

b. IAST testing is performed in real-time while the application is running to identify the problematic
code lines from a security perspective and notify the developer for remediation.

c. All the OWASP vulnerabilities may be identified with IAST tools.

d. Examples of IAST tools are:

i. Hdiv Detection (IAST).

ii. Synopsys Seeker.

iii. Veracode.



SCA[17]:
a. Software Composition Analysis (SCA) is used to scan an application’s codebase to identify all
open-source components, license compliance data, and security vulnerabilities.

b. For an application, SCA tools generate –

i. An inventory report of all open-source components.

ii. Information about each component.

iii. A list of open-source components with known vulnerabilities.

iv. A ranking of vulnerabilities.

v. The right patch or version to fix each vulnerability.

c. Examples of SCA tools:

i. FlexNet Code Insight.

ii. WhiteSource.

iii. JFrog Xray.
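The kind of development-time check that SAST tools perform can be illustrated with a toy scanner built on Python’s ast module. The two rules below (flag eval() calls, and flag execute() calls whose SQL argument is built by string concatenation) are only a sketch; real SAST tools apply hundreds of language-aware rules with data-flow analysis.

```python
import ast

def scan(source):
    """Return (finding, line) pairs for two toy static-analysis rules."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
        if name == "eval":
            # Arbitrary code execution risk.
            findings.append(("eval-use", node.lineno))
        if name == "execute" and node.args and isinstance(node.args[0], ast.BinOp):
            # SQL built by concatenation: SQL injection risk.
            findings.append(("sql-concat", node.lineno))
    return findings
```

Such a scan would typically run as an IDE plugin or a pre-merge pipeline step, failing the build when findings are reported.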

Runtime:
After an application is deployed into production, it is protected through web application firewalls (WAF),
bot management, and runtime application self-protection (RASP). WAF was explained earlier in this
chapter.

Bot Management[18]:
a. Bot management is part of the runtime security of applications to protect mobile apps, web
applications, and APIs from malicious bots while permitting access for the bots that help the business
of an enterprise.

b. Examples of bot attacks –

i. Fake Accounts.

ii. Credit Card Fraud.

iii. Marketing Fraud.

c. Bot management solutions protect applications from attacks by different approaches to detect and
manage bots. Examples of approaches –

i. Passive: Identify malicious bots with header information and web requests.

ii. Active: Challenge the web request with tests that bots cannot perform easily (e.g., Prompt user
with a CAPTCHA).

iii. Pattern identification: Classify activity and distinguish between human users, business bots, and
malicious bots.



d. Solutions for Bot Management include –

i. Imperva Bot Management.

ii. F5 Bot management.

iii. Radware Bot Manager.
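The passive approach above, classifying requests by header information, can be sketched as follows. The rules are illustrative only; real bot managers combine many signals (TLS fingerprints, behavioral patterns, IP reputation) and the bot names and markers below are just examples.

```python
# Hypothetical classification rules (illustrative only).
KNOWN_GOOD_BOTS = {"Googlebot", "Bingbot"}             # business-helpful crawlers
SUSPICIOUS_MARKERS = ("curl", "python-requests", "scrapy")

def classify(headers):
    ua = headers.get("User-Agent", "")
    if not ua:
        return "malicious-bot"      # browsers always send a User-Agent
    if any(bot in ua for bot in KNOWN_GOOD_BOTS):
        return "business-bot"       # permitted: helps the business
    if any(marker in ua.lower() for marker in SUSPICIOUS_MARKERS):
        return "malicious-bot"      # candidate for blocking or a CAPTCHA
    return "human"
```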

RASP[19]:
a. Runtime Application Self-Protection (RASP) is an application security method that protects an
application from the inside, unlike a WAF solution, which protects it from the outside.

b. Monitoring, detection, and protection code for RASP is deployed into the application servers.

c. All requests to the application are intercepted by RASP, and necessary security actions are taken.

d. Examples of RASP solutions are:

i. Hdiv RASP.

ii. Imperva Real-time Application Self Protection (RASP).

iii. Micro Focus Fortify Application Defender.
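The from-the-inside interception that distinguishes RASP from a WAF can be sketched as a decorator wrapped around an application’s request handlers. The handler, the request shape, and the two detection rules are all hypothetical; a real RASP agent instruments the runtime itself rather than individual functions.

```python
import functools

def rasp_protect(handler):
    # Protection code lives inside the application and runs before the handler.
    @functools.wraps(handler)
    def wrapper(request):
        params = " ".join(str(v) for v in request.get("params", {}).values())
        if "<script" in params.lower() or "' or '1'='1" in params.lower():
            return {"status": 403, "body": "blocked by RASP"}
        return handler(request)
    return wrapper

@rasp_protect
def greet(request):
    return {"status": 200, "body": "hello " + request["params"]["name"]}
```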

11.5 Cyber Security


Cyber security is the discipline of security that involves processes, practices, and tools to protect
compute, storage, network, and applications in the data center or cloud from attack, damage, or
unauthorized access[20]. Cyber Security is also known as information technology security or electronic
information security.

Cyber security protects information technology systems from cybercrime (for financial gain or
disruption), cyber-attacks (targeted information gathering), and cyber terrorism (to spread panic
and fear)[21].

The key types of cybersecurity threats and tools for protection are[22]:

1. Malware: Malicious software that can be used to cause harm to a user. Viruses, worms, Trojans, and
spyware are different forms of malware.

Tools for protection (Examples) – Avast Antivirus, Kaspersky Anti-Virus, Trend Micro Antivirus+

2. Social engineering: Uses human interaction to trick users into revealing sensitive information.

Tools for protection (Examples) – Policies, Training programs for users.

3. Phishing: Fraudulent email or messages meant to deceive users as being from reputable or known
sources to steal sensitive data.

Tools for protection (Examples) – Proofpoint Email Security and Protection, Mimecast Email
Security with Threat Protection, SpamTitan Email Security.



4. Distributed denial-of-service (DDoS) attacks: Flooding the target system with messages from
multiple sources to disrupt the traffic of the target system to prevent it from functioning or affect its
performance.

Tools for protection (Examples) – SolarWinds Security Event Manager, AWS Shield, Indusface
AppTrana.

5. Advanced persistent threats (APTs): Sustained targeted attacks to infiltrate a network and remain
undetected for an extended period to steal data.

Tools for protection (Examples) – Security Information and Event Management (SIEM) tools
such as SolarWinds Security Event Manager and Splunk Enterprise Security.

6. Man-in-the-middle (MitM): Attacks that intercept and relay messages between two parties who
believe they are communicating directly with each other.

Tools for protection (Examples): Hetty, Bettercap, Mitmproxy.

7. Ransomware: Involves locking the user’s computer system files and demanding a payment to
unlock them.

Tools for protection (Examples): Bitdefender Antivirus Plus, AVG Antivirus, Avast Antivirus.

8. Password attacks: As the password is the most widely used mechanism to authenticate users to
a system, obtaining the right password is a common attack approach. Brute-force attacks
(randomly trying different passwords) and dictionary attacks (trying a list of common passwords)
are often used by hackers to discover a password by trial and error. Password policies, including
account lockout, regular password changes, and password complexity requirements, mitigate
password attacks.
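The account-lockout mitigation above can be sketched as a counter of consecutive failed attempts. The threshold of three is a hypothetical policy value, and a real system would also store hashed passwords and add a lockout-expiry timer.

```python
MAX_ATTEMPTS = 3   # hypothetical lockout threshold

class Account:
    def __init__(self, password):
        self._password = password
        self.failed_attempts = 0
        self.locked = False

    def login(self, password):
        if self.locked:
            return False                  # locked accounts reject all attempts
        if password == self._password:
            self.failed_attempts = 0      # success resets the counter
            return True
        self.failed_attempts += 1
        if self.failed_attempts >= MAX_ATTEMPTS:
            self.locked = True            # brute-force/dictionary attack halted
        return False
```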

The National Institute of Standards and Technology (NIST) has developed a cybersecurity framework
that provides a uniform set of rules, guidelines, and standards for organizations[23]. The five core
functions of the NIST framework are Identify, Protect, Detect, Respond, and Recover. It is a standard
approach to cybersecurity and provides the foundation for an enterprise-wide strategy around cyber
risk and compliance.



References
[1] M. Chapple, “Confidentiality, Integrity And Availability – The CIA Triad”, https://www.certmike.com/confidentiality-integrity-and-availability-the-cia-triad/.
[2] Okta, “Authentication vs. Authorization”, https://www.okta.com/identity-101/authentication-vs-authorization/.
[3] IBM, “Why is IAM important?”, https://www.ibm.com/in-en/topics/identity-access-management.
[4] VMware, “What is Identity Management?”, https://www.vmware.com/topics/glossary/content/identity-management.
[5] Oracle, “Oracle Identity and Access Management Introduction”, https://docs.oracle.com/cd/B28196_01/idmanage.1014/b31291/overview_intro.htm.
[6] OWASP, “Web Application Firewall”, https://owasp.org/www-community/Web_Application_Firewall.
[7] J. Peters, Varonis, “IDS vs. IPS: What is the Difference?”, https://www.varonis.com/blog/ids-vs-ips/.
[8] Imperva, “Data Loss Prevention (DLP)”, https://www.imperva.com/learn/data-security/data-loss-prevention-dlp/.
[9] Kaspersky, “What is VPN? How It Works, Types of VPN”, https://www.kaspersky.com/resource-center/definitions/what-is-a-vpn.
[10] A. Froehlich, “How to secure data at rest, in use and in motion”, https://searchsecurity.techtarget.com/feature/Best-practices-to-secure-data-at-rest-in-use-and-in-motion.
[11] A. Broumandnia, “Information Storage and Management”, https://www.slideshare.net/AliBroumandnia/chapter-15-94885464.
[12] S. Gittlen, “Data security guide: Everything you need to know”, https://searchsecurity.techtarget.com/Data-security-guide-Everything-you-need-to-know.
[13] Micro Focus, “What is Application Security?”, https://www.microfocus.com/en-us/what-is/application-security.
[14] Synopsys, “Static Application Security Testing”, https://www.synopsys.com/glossary/what-is-sast.html.
[15] J. Peterson, “Dynamic Application Security Testing: DAST Basics”, https://www.whitesourcesoftware.com/resources/blog/dast-dynamic-application-security-testing/.
[16] J. Peterson, “All About IAST – Interactive Application Security Testing”, https://www.whitesourcesoftware.com/resources/blog/iast-interactive-application-security-testing/.
[17] Synopsys, “Software Composition Analysis”, https://www.synopsys.com/glossary/what-is-software-composition-analysis.html.
[18] F5, “Bot Management Solutions”, https://www.f5.com/solutions/application-security/bot-management.
[19] D. Thompson, “RASP: The What, Why and How”, https://www.whitesourcesoftware.com/resources/blog/rasp-runtime-application-self-protection/.
[20] J. D. Groot, Digital Guardian, “What is Cyber Security? Definition, Best Practices & More”, https://digitalguardian.com/blog/what-cyber-security.
[21] Kaspersky, “What is Cyber Security?”, https://www.kaspersky.co.in/resource-center/definitions/what-is-cyber-security.
[22] IBM, “What is cybersecurity?”, https://www.ibm.com/topics/cybersecurity.
[23] NIST, “Cybersecurity Framework”, https://www.nist.gov/cyberframework.
[24] L. Rosencrance, “principle of least privilege (POLP)”, https://searchsecurity.techtarget.com/definition/principle-of-least-privilege-POLP.



Index
representation 91
storage 1, 7, 8, 12, 13, 15, 28, 29, 32, 38, 40, 61, 62,
requirements 1, 8, 9, 12, 18, 29, 30, 31, 32, 37, 52, 61,
63, 64, 65, 66, 67, 68, 69, 70, 71, 75, 76, 77, 80, 81, 83,
66, 68, 75, 100
84, 87, 88, 94, 99, 100, 104
resiliency 8, 13
STR 69, 70, 71
resources 1, 7, 8, 9, 11, 12, 13, 14, 38, 68, 87, 90, 94,
strategic 8
96, 99, 106
strategies 4, 5, 6, 8, 10
restore 19, 20, 22, 28, 29, 32, 69, 70, 71, 72, 73, 74,
75, 76 strategy 1, 7, 9, 10, 11, 15, 28, 30, 32, 79, 84, 99, 105
RETAIN 11 subnet 48, 49, 50, 54, 59
retained 70 subnets 49, 55
retired 2 Subnetting 49
retiring 2, 15 Subnetwork 44, 48
Rollback 74 switch 6, 45, 46, 47, 51, 52, 57, 81, 99
routable 52 switches 44, 45, 48, 52, 56, 57, 59, 63, 88, 90, 99
route 52, 57 synchronizing 35
routed 57 synchronously 82, 83
router 52, 53 synthetic 90
routes 57

109
Infrastructure Architecture
Essentials for
Data Center and Cloud

Many new entrants to the IT industry have directly begun working on cloud platforms without a background in data center solutions and infrastructure architecture. This book gives readers the required conceptual clarity on IT infrastructure architecture to develop and maintain solutions both for the data center and the cloud.

• Describes an infrastructure architecting process based on industry standards.

• Focuses on the essentials of Compute, Network, Storage, Backup/Restore, Disaster Recovery, Monitoring, and Security from an architecture perspective.

• Caters to students, developers, architects and designers, and CXOs who need a good understanding of the concepts of infrastructure architecture.

• Provides many references to industry and academic literature at the end of every chapter to guide anyone who wishes to go deeper.

About the Author

Shankar Kambhampaty has been involved for 32 years in architecture, design, and development for several IT projects executed globally. He has provided leadership in technology and IT architecture over the past decade in several organizations. Shankar has written papers for international conferences and is the author of the book Service-Oriented Architecture & Microservices Architecture for Enterprise, Cloud, Big Data and Mobile, published by Wiley. He has also been a frequent speaker at architecture/technology events and an invited member of the Forbes Technology Council.

To know more, visit Shankar's blog: www.archtecht.com.