HCIA-Storage Learning Guide: Huawei Storage Certification Training
HCIA-Storage
Learning Guide
V4.5
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of
their respective holders.
Notice
The purchased products, services and features are stipulated by the contract made
between Huawei and the customer. All or part of the products, services and features
described in this document may not be within the purchase scope or the usage scope.
Unless otherwise specified in the contract, all statements, information, and
recommendations in this document are provided "AS IS" without warranties,
guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has
been made in the preparation of this document to ensure accuracy of the contents, but
all statements, information, and recommendations in this document do not constitute
a warranty of any kind, express or implied.
Website: http://e.huawei.com
Input: inputs data in a specific format, which depends on the processing mechanism.
For example, when a computer is used, the input data can be recorded on several
types of media, such as disks and tapes.
Processing: performs actions on the input data to extract more value from it. For
example, time card hours are processed to calculate payroll, or sales orders are
processed to generate sales reports.
Output: generates and outputs the processing result. The form of the output data
depends on the data use. For example, the output data can be an employee's salary.
Data is a record that reflects the attributes of an object and is the specific form that
carries information. Data becomes information after being processed, and information
needs to be digitized into data for storage and transmission.
increased by increasing the disk density at low cost. However, SSD capacity could be
doubled only by doubling the number of internal chips, which was difficult.
The MLC SSD proved that it is possible to double the capacity by storing
more bits in one cell. In addition, SSD performance is much higher than that of an
HDD. An SSD offers a read bandwidth of 240 MB/s, a write bandwidth of 215 MB/s,
a read latency of less than 100 microseconds, 50,000 read IOPS, and 10,000 write IOPS.
HDD vendors are facing a huge threat.
SSD flash chips have evolved from SLC, with one cell storing one bit, to MLC with two bits
and TLC with three bits, and have now developed into QLC with one cell storing four bits.
provide more computing power for the underlying data infrastructure, deliver efficient
and low-cost storage media, and narrow the gap between storage and computing. These
problems need to be solved by dedicated storage hardware. Concepts similar to
Memory Fabric also bring changes to the storage architecture.
The last trend is convergence. In the future, storage will be integrated with the data
infrastructure to support heterogeneous chip computing, streamline diversified protocols,
and collaborate with data processing and big data analytics to reduce data processing
costs and improve efficiency. For example, compared with the storage provided by
general-purpose servers, the integration of data and storage will lower the TCO because
data processing is offloaded from servers to storage. Object, big data, and other protocols
are converged and interoperate to implement migration-free big data. Such convergence
greatly affects the design of storage systems and is the key to improving storage
efficiency.
As science and technology develop, disk capacity is increasing and disk size is becoming
smaller. When it comes to storing information, however, a hard disk is still very large
compared with genes, yet the amount of information it stores is far less. Therefore,
scientists have started to use DNA to store data. At first, a few teams tried to write data
into the genomes of living cells, but the approach has a couple of disadvantages. Cells
replicate, introducing new mutations over time that can change the data. Moreover, cells
die, which means the data is lost. Later, teams attempted to store data using artificially
synthesized DNA, which is free from cells. Although DNA storage density is now high
enough that a small amount of artificial DNA can store a large amount of data, reading
and writing the data are not efficient. In addition, synthesizing DNA molecules is
expensive. However, it can be predicted that the cost will fall as gene sequencing
technologies develop.
References:
Bohannon, J. (2012). DNA: The Ultimate Hard Drive. Science. Retrieved from
https://www.sciencemag.org/news/2012/08/dna-ultimate-hard-drive
Akram, F., Haq, I. U., Ali, H., & Laghari, A. T. (2018). Trends to store digital data in DNA:
an overview. Molecular Biology Reports, 45(5), 1479–1490. doi:10.1007/s11033-018-4280-y
Although atomic storage has a short history as a technology, it is not a new concept.
As early as December 1959, physicist Richard Feynman gave a lecture at the annual
American Physical Society meeting at Caltech, "There's Plenty of Room at the Bottom: An
Invitation to Enter a New Field of Physics." In this lecture, Feynman considered the
possibility of using individual atoms as basic units for information storage.
In July 2016, researchers from Delft University of Technology in the Netherlands published
a paper in Nature Nanotechnology. They used chlorine atoms on copper plates to store 1
kilobyte of rewritable data. For now, however, the memory can only operate in a
highly clean vacuum environment or in a liquid nitrogen environment at a temperature
of minus 196°C (77 K).
References:
Erwin, S. A picture worth a thousand bytes. Nature Nanotech 11, 919–920 (2016).
https://doi.org/10.1038/nnano.2016.141
Kalff, F., Rebergen, M., Fahrenfort, E. et al. A kilobyte rewritable atomic memory. Nature
Nanotech 11, 926–929 (2016). https://doi.org/10.1038/nnano.2016.131
Because an atom is so small, the capacity of atomic storage will be much larger than that
of existing storage media of the same size. With the development of science and
technology in recent years, Feynman's idea has become a reality. To pay tribute to
Feynman's great idea, some research teams wrote his lecture into atomic memory.
Although the idea of atomic storage is incredible and its implementation is becoming
possible, atomic memory has strict requirements on the operating environment. Atoms
are in constant motion, and even the atoms inside solids vibrate in the ambient
environment, so it is difficult to keep them in an ordered state under normal conditions.
Atomic storage can only be used at low temperatures, in liquid nitrogen, or in vacuum
conditions.
While DNA storage and atomic storage are intended to reduce the size of storage and
increase its capacity, quantum storage is designed to improve performance and running
speed.
After years of research, both the storage efficiency and the lifecycle of quantum memory
have improved, but it is still difficult to put quantum memory into practice.
Quantum memory suffers from inefficiency, high noise, a short lifespan, and difficulty
operating at room temperature. Only by solving these problems can quantum memory be
brought to market.
Elements in the quantum state are easily lost due to the influence of the external
environment. In addition, it is difficult to guarantee 100% accuracy when preparing
quantum states and performing quantum operations.
References:
Wang, Y., Li, J., Zhang, S. et al. Efficient quantum memory for single-photon polarization
qubits. Nat. Photonics 13, 346–351 (2019). https://doi.org/10.1038/s41566-019-0368-8
Dou Jian-Peng, Li Hang, Pang Xiao-Ling, Zhang Chao-Ni, Yang Tian-Huai, Jin Xian-Min.
Research progress of quantum memory. Acta Physica Sinica, 2019, 68(3): 030307. doi:
10.7498/aps.68.20190039
Back-end (BE) ports connect a controller enclosure to a disk enclosure and provide
disks with channels for reading and writing data.
A cache is a memory chip on a disk controller. It provides fast data access and is a
buffer between the internal storage and external interfaces.
An engine is a core component of a development program or system on an
electronic platform. It usually provides support for programs or a set of systems.
Coffer disks store user data, system configurations, logs, and dirty data in the cache
to protect against unexpected power outages.
Built-in coffer disk: Each controller of Huawei OceanStor Dorado V6 has one or
two built-in SSDs as coffer disks. See the product documentation for more
details.
External coffer disk: The storage system automatically selects four disks as coffer
disks. Each coffer disk provides 2 GB space to form a RAID 1 group. The
remaining space can store service data. If a coffer disk is faulty, the system
automatically replaces the faulty coffer disk with a normal disk for redundancy.
Power module: The controller enclosure employs an AC power module for its normal
operations.
A 4 U controller enclosure has four power modules (PSU 0, PSU 1, PSU 2, and
PSU 3). PSU 0 and PSU 1 form a power plane to power controllers A and C and
provide mutual redundancy. PSU 2 and PSU 3 form the other power plane to
power controllers B and D and provide mutual redundancy. It is recommended
that you connect PSU 0 and PSU 2 to one PDU and PSU 1 and PSU 3 to another
PDU for maximum reliability.
A 2 U controller enclosure has two power modules (PSU 0 and PSU 1) to power
controllers A and B. The two power modules form a power plane and provide
mutual redundancy. Connect PSU 0 and PSU 1 to different PDUs for maximum
reliability.
2.1.3.2 CE Switch
Huawei CloudEngine series fixed switches are next-generation Ethernet switches for data
centers and provide high performance, high port density, and low latency. The switches
use a flexible front-to-rear or rear-to-front design for airflow and support IP SANs and
distributed storage networks.
2.1.4 HDD
2.1.4.1 HDD Structure
A platter is coated with magnetic material on both surfaces; the polarity of the magnetic
grains represents a binary information unit, or bit.
A read/write head reads and writes data for platters. It changes the polarities of
magnetic grains on the platter surface to save data.
The actuator arm moves the read/write head to the specified position.
The spindle has a motor and bearing underneath. It rotates the specified position on
the platter to the read/write head.
The control circuit controls the speed of the platter and movement of the actuator
arm, and delivers commands to the head.
Tracks may vary in the number of sectors. A sector can generally store 512 bytes of
user data, but some disks can be formatted into even larger sectors of 4 KB.
where the head does not need to change the track or read a specified sector, but
reads and writes all sectors sequentially and cyclically on one track.
External transfer rate is also called burst data transfer rate or interface transfer rate.
It refers to the data transfer rate between the system bus and the disk buffer and
depends on the disk port type and buffer size.
Serial transmission transfers less data per clock cycle than parallel transmission, but is
generally faster because the transmission frequency can be increased further.
Serial transmission is used for long-distance transmission. The PCIe interface is a typical
example of serial transmission: the transmission rate of a single lane is up to 2.5 Gbit/s.
Advantages:
It is applicable to a wide range of devices. One SCSI controller card can connect
to 15 devices simultaneously.
It provides high performance with multi-task processing, low CPU usage, fast
rotation speed, and a high transmission rate.
SCSI disks support diverse applications as external or built-in components with
hot-swappable replacement.
Disadvantages:
High cost and complex installation and configuration.
SAS port:
SAS is similar to SATA in its use of a serial architecture for a high transmission rate
and streamlined internal space with shorter internal connections.
SAS improves the efficiency, availability, and scalability of the storage system. It is
backward compatible with SATA for the physical and protocol layers.
Advantages:
SAS is superior to SCSI in its transmission rate, anti-interference, and longer
connection distances.
Disadvantages:
SAS disks are more expensive.
Fibre Channel port:
Fibre Channel was originally designed for network transmission rather than disk
ports. It has gradually been applied to disk systems in pursuit of higher speed.
Advantages:
Easy to upgrade. Supports optical fiber cables with a length over 10 km.
Large bandwidth
Strong universality
Disadvantages:
High cost
Complex to build
2.1.5 SSD
2.1.5.1 SSD Overview
Traditional disks use magnetic materials to store data, but SSDs use NAND flash with
cells as storage units. NAND flash is a non-volatile random access storage medium that
can retain stored data after the power is turned off. It quickly and compactly stores
digital information.
SSDs eliminate high-speed rotational components for higher performance, lower power
consumption, and zero noise.
SSDs do not have mechanical parts, but this does not mean that they have an infinite life
cycle. Because NAND flash is a non-volatile medium, original data must be erased before
new data can be written. However, there is a limit to how many times each cell can be
erased. Once the limit is reached, data reads and writes become invalid on that cell.
These four types of cells have similar costs but store different amounts of data.
Originally, the capacity of an SSD was only 64 GB or smaller. Now, a TLC SSD can store
up to 2 TB of data. However, each cell type has a different life cycle, resulting in different
SSD reliability. The life cycle is also an important factor in selecting SSDs.
For example, host page A was originally stored in flash page X, and the mapping
relationship was A to X. Later, the host rewrites the host page. Flash memory does
not overwrite data, so the SSD writes the new data to a new page Y, establishes the
new mapping relationship of A to Y, and cancels the original mapping relationship.
The data in page X becomes aged and invalid, which is also known as garbage data.
The host continues to write data to the SSD until it is full. In this case, the host
cannot write more data unless the garbage data is cleared.
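The page-remapping behavior described above can be summarized in a minimal sketch. The class and field names below are illustrative assumptions, not the implementation of any actual SSD controller:

```python
# Minimal sketch of flash-translation-layer (FTL) remapping on rewrite (illustrative only).

class SimpleFTL:
    def __init__(self, num_flash_pages):
        self.mapping = {}                          # host page -> flash page
        self.free_pages = list(range(num_flash_pages))
        self.garbage_pages = set()                 # pages holding stale (invalid) data

    def write(self, host_page):
        if not self.free_pages:
            raise RuntimeError("SSD full: garbage collection needed before new writes")
        new_page = self.free_pages.pop(0)          # flash never overwrites a page in place
        old_page = self.mapping.get(host_page)
        if old_page is not None:
            self.garbage_pages.add(old_page)       # the old copy becomes garbage data
        self.mapping[host_page] = new_page         # new mapping, e.g. A -> Y
        return new_page

ftl = SimpleFTL(num_flash_pages=4)
x = ftl.write("A")     # host page A is stored in flash page X
y = ftl.write("A")     # rewrite: data goes to a new page Y, and X becomes garbage
print(x in ftl.garbage_pages, ftl.mapping["A"] == y)   # True True
```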
SSD read process:
An 8-fold increase in read speed depends on whether the read data is evenly
distributed in the blocks of each channel. If the 32 KB data is stored in the blocks of
channels 1 through 4, the read speed can only support a 4-fold improvement at
most. That is why smaller files are transmitted at a slower rate.
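As a rough illustration of why the speedup depends on how the data is spread, the sketch below estimates the parallel read factor as the number of distinct channels holding part of the request (an 8-channel SSD with 4 KB flash pages is assumed):

```python
# Rough illustration: the read speedup is bounded by the number of distinct channels
# that actually hold part of the requested data (assumed: 8 channels, 4 KB pages).

def read_speedup(pages_by_channel):
    """pages_by_channel: one channel ID per 4 KB page of the request."""
    return len(set(pages_by_channel))

# A 32 KB request is 8 pages. Spread over all 8 channels -> up to an 8-fold speedup.
print(read_speedup([0, 1, 2, 3, 4, 5, 6, 7]))   # 8
# The same 32 KB stored only in channels 1 through 4 -> at most a 4-fold improvement.
print(read_speedup([1, 2, 3, 4, 1, 2, 3, 4]))   # 4
```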
Gbit/s Fibre Channel ports. SmartIO interface modules connect storage devices to
application servers.
The optical module rate must match the rate on the interface module label. Otherwise,
the storage system will report an alarm and the port will become unavailable.
Combines multiple physical disks into one logical disk array to provide larger storage
capacity.
Divides data into blocks and concurrently writes/reads data to/from multiple disks to
improve disk access efficiency.
Provides mirroring or parity for fault tolerance.
Hardware RAID and software RAID can be implemented in storage devices.
Hardware RAID uses a dedicated RAID adapter, disk controller, or storage processor.
The RAID controller has a built-in processor, I/O processor, and memory to improve
resource utilization and data transmission speed. The RAID controller manages
routes and buffers, and controls data flows between the host and the RAID array.
Hardware RAID is usually used in servers.
Software RAID has no built-in processor or I/O processor but relies on a host
processor. Therefore, a low-speed CPU cannot meet the requirements for RAID
implementation. Software RAID is typically used in enterprise-class storage devices.
Disk striping: Space in each disk is divided into multiple strips of a specific size. Data is
also divided into blocks based on strip size when data is being written.
Strip: A strip consists of one or more consecutive sectors in a disk, and multiple strips
form a stripe.
Stripe: A stripe consists of strips of the same location or ID on multiple disks in the
same array.
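With these definitions, the disk, stripe, and offset holding a given logical address can be derived as in the sketch below (the strip size and disk count are illustrative values, not a specific product's layout):

```python
# Sketch: map a logical byte address to (disk, stripe, offset) in a striped array.
# Parameters are illustrative only.

STRIP_SIZE = 64 * 1024        # bytes per strip
NUM_DISKS = 4                 # member disks (no parity in this example)

def locate(address):
    strip_index = address // STRIP_SIZE      # which strip overall, counted across disks
    disk = strip_index % NUM_DISKS           # strips are written to disks in round-robin order
    stripe = strip_index // NUM_DISKS        # strips at the same position form a stripe
    offset = address % STRIP_SIZE            # offset inside the strip
    return disk, stripe, offset

print(locate(0))            # (0, 0, 0)
print(locate(200 * 1024))   # (3, 0, 8192): the fourth strip, on disk 3 in stripe 0
```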
RAID generally provides two methods for data protection.
One is storing data copies on another redundant disk to improve data reliability and
read performance.
The other is parity. Parity data is additional information calculated using user data.
For a RAID array that uses parity, an additional parity disk is required. The XOR
(symbol: ⊕) algorithm is used for parity.
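The XOR parity method can be demonstrated directly: the parity strip is the XOR of the data strips, and any single lost strip can be rebuilt by XOR-ing the surviving strips with the parity. A short sketch with made-up strip contents:

```python
# Sketch of XOR parity: parity = d0 XOR d1 XOR d2; any one lost strip is recoverable.
from functools import reduce

def xor_parity(strips):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

d0, d1, d2 = b"\x0f\xaa", b"\xf0\x55", b"\x33\xcc"
p = xor_parity([d0, d1, d2])            # parity strip written to the parity disk

# The disk holding d1 fails: rebuild d1 from the surviving strips and the parity.
rebuilt_d1 = xor_parity([d0, d2, p])
print(rebuilt_d1 == d1)                 # True
```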
2.2.1.2 RAID 0
RAID 0, also referred to as striping, provides the best storage performance among all
RAID levels. RAID 0 uses the striping technology to distribute data to all disks in a RAID
array.
2.2.1.3 RAID 1
RAID 1, also referred to as mirroring, maximizes data security. A RAID 1 array uses two
identical disks including one mirror disk. When data is written to a disk, a copy of the
same data is stored in the mirror disk. When the source (physical) disk fails, the mirror
disk takes over services from the source disk to maintain service continuity. The mirror
disk is used as a backup to provide high data reliability.
The amount of data stored in a RAID 1 array is only equal to the capacity of a single disk,
because a copy of the data is retained on the other disk. That is, each gigabyte of data
requires 2 gigabytes of disk space. Therefore, a RAID 1 array consisting of two disks has a
space utilization of 50%.
2.2.1.4 RAID 3
RAID 3 is similar to RAID 0 but uses dedicated parity stripes. In a RAID 3 array, a
dedicated disk (parity disk) is used to store the parity data of strips in other disks in the
same stripe. If incorrect data is detected or a disk fails, data in the faulty disk can be
recovered using the parity data. RAID 3 applies to data-intensive or single-user
environments where data blocks need to be continuously accessed for a long time. RAID
3 writes data to all member data disks. However, when new data is written to any disk,
RAID 3 recalculates and rewrites parity data. Therefore, when a large amount of data
from an application is written, the parity disk in a RAID 3 array needs to process heavy
workloads. Parity operations have certain impact on the read and write performance of a
RAID 3 array. In addition, the parity disk is subject to the highest failure rate in a RAID 3
array due to heavy workloads. A write penalty occurs when just a small amount of data is
written to multiple disks, which does not improve disk performance as compared with
data writes to a single disk.
2.2.1.5 RAID 5
RAID 5 is improved based on RAID 3 and consists of striping and parity. In a RAID 5 array,
data is written to disks by striping. In a RAID 5 array, the parity data of different strips is
distributed among member disks instead of a parity disk.
Similar to RAID 3, a write penalty occurs when just a small amount of data is written.
2.2.1.6 RAID 6
Data protection mechanisms of all RAID arrays previously discussed considered only
failures of individual disks (excluding RAID 0). The time required for reconstruction
increases along with the growth of disk capacities. It may take several days instead of
hours to reconstruct a RAID 5 array consisting of large-capacity disks. During the
reconstruction, the array is in the degraded state, and the failure of any additional disk
will cause the array to be faulty and data to be lost. This is why some organizations or
units need a dual-redundancy system. In other words, a RAID array should tolerate
failures of up to two disks while maintaining normal access to data. Such dual-
redundancy data protection can be implemented in the following ways:
The first one is multi-mirroring. Multi-mirroring is a method of storing multiple
copies of a data block in redundant disks when the data block is stored in the
primary disk. This means heavy overheads.
The second one is a RAID 6 array. A RAID 6 array protects data by tolerating failures
of up to two disks even at the same time.
The formal name of RAID 6 is distributed double-parity (DP) RAID. It is essentially an
improved RAID 5 and also consists of striping and distributed parity. RAID 6 supports
double parity, which means that the array remains accessible even if two disks fail at the
same time.
2.2.1.7 RAID 10
For most enterprises, RAID 0 is not really a practical choice, while RAID 1 is limited by
disk capacity utilization. RAID 10 provides the optimal solution by combining RAID 1 and
RAID 0. In particular, RAID 10 provides superior performance by eliminating write penalty
in random writes.
A RAID 10 array consists of an even number of disks. User data is written to half of the
disks and mirror copies of user data are retained in the other half of disks. Mirroring is
performed based on stripes.
2.2.1.8 RAID 50
RAID 50 combines RAID 0 and RAID 5. Two RAID 5 sub-arrays form a RAID 0 array. The
two RAID 5 sub-arrays are independent of each other. A RAID 50 array requires at least
six disks because a RAID 5 sub-array requires at least three disks.
specified. Actually, RAID 2.0+ provides more flexible and specific data redundancy
protection methods. The storage space formed by disks in a disk domain is divided
into storage pools of a smaller granularity and hot spare space shared among
storage tiers. The system automatically sets the hot spare space based on the hot
spare policy (high, low, or none) set by an administrator for the disk domain and the
number of disks at each storage tier in the disk domain. In a traditional RAID array,
an administrator must specify a disk as the hot spare disk.
2. Storage Pool and Storage Tier
A storage pool is a storage resource container. The storage resources used by
application servers are all from storage pools.
A storage tier is a collection of storage media providing the same performance level
in a storage pool. Different storage tiers manage storage media of different
performance levels and provide storage space for applications that have different
performance requirements.
A storage pool created based on a specified disk domain dynamically allocates CKs
from the disk domain to form CKGs according to the RAID policy of each storage tier
for providing storage resources with RAID protection to applications.
A storage pool can be divided into multiple tiers based on disk types.
When creating a storage pool, a user is allowed to specify a storage tier and related
RAID policy and capacity for the storage pool.
OceanStor storage systems support RAID 1, RAID 10, RAID 3, RAID 5, RAID 50, and
RAID 6 and related RAID policies.
The capacity tier consists of large-capacity SATA and NL-SAS disks. DP RAID 6 is
recommended.
3. Disk Group
An OceanStor storage system automatically divides disks of each type in each disk
domain into one or more disk groups (DGs) according to disk quantity.
One DG consists of disks of only one type.
CKs in a CKG are allocated from different disks in a DG.
DGs are internal objects automatically configured by OceanStor storage systems and
typically used for fault isolation. DGs are not presented externally.
4. Logical Drive
A logical drive (LD) is a disk that is managed by a storage system and corresponds to
a physical disk.
5. CK
A chunk (CK) is a block of disk space of a specified size allocated from a disk in the disk
domain. It is the basic unit of a CKG.
6. CKG
A chunk group (CKG) is a logical storage unit that consists of CKs from different
disks in the same DG based on the RAID algorithm. It is the minimum unit for
allocating resources from a disk domain to a storage pool.
All CKs in a CKG are allocated from the disks in the same DG. A CKG has RAID
attributes, which are actually configured for corresponding storage tiers. CKs and
CKGs are internal objects automatically configured by storage systems. They are not
presented externally.
7. Extent
Each CKG is divided into logical storage spaces of a specific and adjustable size called
extents. Extent is the minimum unit (granularity) for migration and statistics of hot
data. It is also the minimum unit for space application and release in a storage pool.
An extent belongs to a volume or LUN. A user can set the extent size when creating
a storage pool. After that, the extent size cannot be changed. Different storage pools
may consist of extents of different sizes, but one storage pool must consist of extents
of the same size.
8. Grain
When a thin LUN is created, extents are divided into 64 KB blocks which are called
grains. A thin LUN allocates storage space by grains. Logical block addresses (LBAs)
in a grain are consecutive.
Grains are mapped to thin LUNs. A thick LUN does not involve grains.
9. Volume and LUN
A volume is an internal management object in a storage system.
A LUN is a storage unit that can be directly mapped to a host for data reads and
writes. A LUN is the external embodiment of a volume.
A volume organizes all extents and grains of a LUN and applies for and releases
extents to increase and decrease the actual space used by the volume.
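How these objects relate can be sketched as follows. The extent and grain sizes and the allocation logic below are simplified assumptions for illustration, not the storage system's real allocator:

```python
# Simplified sketch: a thin LUN allocates 64 KB grains out of extents that its volume
# obtains from the storage pool. Sizes and names are illustrative assumptions.

EXTENT_SIZE = 4 * 1024 * 1024    # fixed when the storage pool is created
GRAIN_SIZE = 64 * 1024           # thin LUNs allocate space by grains
GRAINS_PER_EXTENT = EXTENT_SIZE // GRAIN_SIZE

class ThinLUN:
    def __init__(self):
        self.grain_map = {}      # logical grain index -> (extent id, grain slot in extent)
        self.extent_count = 0    # extents granted to this LUN's volume so far
        self.next_slot = GRAINS_PER_EXTENT   # force an extent request on the first write

    def write(self, address):
        grain_index = address // GRAIN_SIZE
        if grain_index in self.grain_map:
            return self.grain_map[grain_index]        # grain already allocated
        if self.next_slot >= GRAINS_PER_EXTENT:
            self.extent_count += 1                    # apply for another extent from the pool
            self.next_slot = 0
        location = (self.extent_count - 1, self.next_slot)
        self.grain_map[grain_index] = location
        self.next_slot += 1
        return location

lun = ThinLUN()
print(lun.write(0))                  # (0, 0): first grain of the first extent
print(lun.write(10 * GRAIN_SIZE))    # (0, 1): space is allocated only as data is written
```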
2.2.3.2 RAID-TP
RAID protection is essential to a storage system for consistently high reliability and
performance. However, the reliability of RAID protection is challenged by uncontrollable
RAID array construction time due to drastic increase in capacity.
RAID-TP achieves optimal performance, reliability, and capacity utilization.
Customers have to purchase disks of larger capacity to replace existing disks for system
upgrades. In such a case, one system may consist of disks of different capacities. How can
the optimal capacity utilization be maintained in a system that uses a mix of disks with
different capacities?
RAID-TP uses Huawei's optimized FlexEC algorithm that allows the system to tolerate
failures of up to three disks, improving reliability while allowing a longer reconstruction
time window.
RAID-TP with FlexEC algorithm reduces the amount of data read from a single disk by
70%, as compared with traditional RAID, minimizing the impact on system performance.
In a typical 4:2 RAID 6 array, the capacity utilization is about 67%. The capacity
utilization of a Huawei OceanStor all-flash storage system with 25 disks is improved by
20% on this basis.
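These utilization figures follow directly from the ratio of data disks to total disks, as the short calculation below shows (the 25-disk grouping used in the second line is an illustrative assumption):

```python
# Capacity utilization of a parity-protected group = data disks / total disks.

def utilization(data_disks, parity_disks):
    return data_disks / (data_disks + parity_disks)

print(round(utilization(4, 2), 2))    # 0.67 -> a typical 4:2 RAID 6 array
print(round(utilization(22, 3), 2))   # 0.88 -> e.g. 25 disks with triple parity (illustrative)
```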
target responds to the SCSI requests, provides services through LUNs, and provides a task
management function.
2.3.2.4 FC Protocol
FC can be referred to as the FC protocol, FC network, or FC interconnection. As FC
delivers high performance, it is becoming more commonly used for front-end host access
on point-to-point and switch-based networks. Like TCP/IP, the FC protocol suite also
includes concepts from the TCP/IP protocol suite and the Ethernet, such as FC switching,
FC switch, FC routing, FC router, and SPF routing algorithm.
FC protocol structure:
FC-0: defines physical connections and selects different physical media and data
rates for protocol operations. This maximizes system flexibility and allows for existing
cables and different technologies to be used to meet the requirements of different
systems. Copper cables and optical cables are commonly used.
FC-1: defines the 8b/10b transmission encoding used to balance the transmitted bit
stream. The encoding also serves as a mechanism for transferring data and detecting
errors. 8b/10b encoding helps reduce component design costs and ensures optimum
transition density for better clock recovery. Note: 8b/10b encoding is also used by IBM
ESCON.
FC-2: includes the following items for sending data over the network:
How data should be split into small frames
How much data should be sent at a time (flow control)
Where frames should be sent (including defining service levels based on
applications)
FC-3: defines advanced functions such as striping (data is transferred through
multiple channels), multicast (one message is sent to multiple targets), and group
query (multiple ports are mapped to one node). Whereas FC-2 defines functions for a
single port, FC-3 defines functions that span multiple ports.
Figure 2-9
Fabric:
Similar to an Ethernet switching topology, a fabric topology is a mesh switching
matrix.
The forwarding efficiency is much greater than in FC-AL.
FC devices are connected to fabric switches through optical fibres or copper
cables to implement point-to-point communication between nodes.
FC frees the workstation from the management of every port. Each port
manages its own point-to-point connection to the fabric, and other fabric
functions are implemented by FC switches. On an FC network, there are seven
types of ports.
Device (node) port:
N_Port: Node port. A fabric device can be directly attached.
NL_Port: Node loop port. A device can be attached to a loop.
Switch port:
E_Port: Expansion port (connecting switches).
Ethernet network, the Ethernet needs to be enhanced to prevent packet loss. The
enhanced Ethernet is called Converged Enhanced Ethernet (CEE).
Port layer: describes the interfaces of the link layer and transport layer, including
how to request, interrupt, and set up connections.
Transport layer: defines how the transmitted commands, status, and data are
encapsulated into SAS frames and how SAS frames are decomposed.
Application layer: describes how to use SAS in different types of applications.
SAS has the following characteristics:
SAS uses the full-duplex (bidirectional) communication mode. The traditional parallel
SCSI can communicate only in one direction. When a device receives a data packet
from the parallel SCSI and needs to respond, a new SCSI communication link needs
to be set up after the previous link is disconnected. However, each SAS cable
contains two input cables and two output cables. This way, SAS can read and write
data at the same time, improving the data throughput efficiency.
Compared with SCSI, SAS has the following advantages:
As it uses the serial communication mode, SAS provides higher throughput and
may deliver higher performance in the future.
Four narrow ports can be bound as a wide link port to provide higher
throughput.
Scalability of SAS
SAS uses expanders to expand interfaces. One SAS domain supports a maximum of
16,384 disk devices.
A SAS expander is an interconnection device in a SAS domain. Similar to an Ethernet
switch, a SAS expander enables an increased number of devices to be connected in a
SAS domain, and reduces the cost in HBAs. Each expander can connect to a
maximum of 128 terminals or expanders. The main components in a SAS domain are
SAS expanders, terminal devices, and connection devices (or SAS connection cables).
A SAS expander is equipped with a routing table that tracks the addresses of all
SAS drives.
A terminal device can be an initiator (usually a SAS HBA) or a target (a SAS or
SATA disk, or an HBA in target mode).
Loops cannot be formed in a SAS domain. This ensures terminal devices can be
detected.
In reality, the number of terminal devices connected to an expander is far fewer than
128 due to bandwidth constraints.
Cable Connection Principles of SAS
Most storage device vendors use SAS cables to connect disk enclosures to controller
enclosures or connect disk enclosures. A SAS cable bundles four independent
channels (narrow ports) into a wide port to provide higher bandwidth. The four
independent channels provide 12 Gbit/s each, so a wide port can provide 48 Gbit/s of
bandwidth. To ensure that the data volume on a SAS cable does not exceed the
maximum bandwidth of the SAS cable, the total number of disks connected to a SAS
loop must be limited.
For a Huawei storage device, the maximum number of disks supported in a loop is 168.
That is, a loop comprises a maximum of seven disk enclosures, each with 24 disk slots.
However, all disks in the loop must be traditional SAS disks. As SSDs are becoming
more common, one must consider that SSDs deliver much higher transmission
speeds than SAS disks. Therefore, for SSDs, a maximum of 96 disks are supported in
a loop: four disk enclosures, each with 24 disk slots, form a loop.
A SAS cable is called a mini SAS cable when the speed of a single channel is 6 Gbit/s,
and a SAS cable is called a high-density mini SAS cable when the speed is increased
to 12 Gbit/s.
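The bandwidth and disk-count figures above can be checked with a quick calculation (a sketch; actual loop sizing also depends on the disk and enclosure models used):

```python
# Quick check of SAS wide-port bandwidth and loop sizing (illustrative).

LANE_GBPS = 12              # speed of one SAS channel (high-density mini SAS)
LANES_PER_WIDE_PORT = 4     # four narrow channels bundled into one wide port

print(LANE_GBPS * LANES_PER_WIDE_PORT)   # 48 Gbit/s per wide port

print(7 * 24)   # 168 -> maximum SAS disks in a loop (7 enclosures x 24 slots each)
print(4 * 24)   # 96  -> maximum SSDs in a loop (4 enclosures x 24 slots each)
```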
Compared with the traditional PCI bus, PCIe has the following advantages:
Dual channels, high bandwidth, and a fast transmission rate: A transmission mode
(RX and TX are separated) similar to the full-duplex mode is implemented. In
addition, a higher transmission rate can be provided. The first-generation PCIe X1
provides 2.5 Gbit/s, the second generation provides 5 Gbit/s, the PCIe 3.0 provides 8
Gbit/s, the PCIe 4.0 provides 16 Gbit/s, and the PCIe 5.0 provides up to 32 Gbit/s.
Compatibility: PCIe is compatible with PCI at the software layer but has upgraded
software.
Ease-of-use: Hot swap is supported. A PCIe bus interface slot contains hot swap
detection signals, supporting hot swap and hot plug.
Error processing and reporting: A PCIe bus uses a layered structure, in which the
software layer can process and report errors.
Virtual channels of each physical connection: Each physical channel supports multiple
virtual channels (in theory, eight virtual channels are supported for independent
communication control), thereby supporting QoS of each virtual channel and
achieving high-quality traffic control.
Reduced I/Os, board-level space, and crosstalk: A typical PCI bus data line requires at
least 50 I/O resources, while a PCIe x1 link requires only four. Fewer I/Os save
board-level space and allow greater spacing between I/Os, thereby reducing crosstalk.
Why PCIe? PCIe is future-oriented, and higher throughput can be achieved in the future.
PCIe provides increasing throughput using the latest technologies, and the transition
from PCI to PCIe is simplified because compatibility with PCI software is guaranteed
through layered protocols and drivers. The PCIe protocol features point-to-point
connection, high reliability, tree networking, full duplex, and frame-structure-based
transmission.
PCIe protocol layers include the physical layer, data link layer, transaction layer, and
application layer.
The physical layer in a PCIe bus architecture determines the physical features of the
bus. In future, the performance of a PCIe bus can be further improved by increasing
the speed or changing the encoding or decoding mode. Such changes only affect the
physical layer, facilitating upgrades.
The data link layer ensures the correctness and reliability of data packets transmitted
over a PCIe bus. It checks whether the data packet encapsulation is complete and
correct, adds the sequence number and CRC code to the data, and uses the ack/nack
handshake protocol for error detection and correction.
The transaction layer receives read and write requests from the software layer, creates
request packets, and transmits them to the data link layer. This type of packet is called
a transaction layer packet (TLP). The transaction layer also receives data link layer
packets (DLLPs) from the data link layer, associates them with the related software
requests, and transmits them to the software layer for processing.
The application layer is designed by users based on actual needs. Other layers must
comply with the protocol requirements.
RDMA uses related hardware and network technologies to enable NICs of servers to
directly read memory, achieving high bandwidth, low latency, and low resource
consumption. However, the RDMA-dedicated IB network architecture is incompatible with
a live network, resulting in high costs. RoCE effectively solves this problem. RoCE is a
network protocol that uses the Ethernet to carry RDMA. There are two versions of RoCE.
RoCEv1 is a link layer protocol and cannot be used in different broadcast domains.
RoCEv2 is a network layer protocol and can implement routing functions.
2.3.5.2 IB Protocol
IB technology is specifically designed for server connections and is widely used for
communication between servers (for example, replication and distributed working),
between a server and a storage device (for example, SAN and DAS), and between a
server and a network (for example, LAN, WAN, and the Internet).
IB defines a set of devices used for system communication, including channel adapters,
switches, and routers used to connect to other devices, such as host channel adapters
(HCAs) and target channel adapters (TCAs). The IB protocol has the following features:
Standard-based protocol: IB was designed by the InfiniBand Trade Association, which
was founded in 1999 and comprised 225 companies. Main members of the
association include Agilent, Dell, HP, IBM, InfiniSwitch, Intel, Mellanox, Network
Appliance, and Sun Microsystems. More than 100 other members help develop and
promote the standard.
Speed: IB provides high speeds.
Memory: Servers that support IB use HCAs to convert the IB protocol to the PCI-X or
PCI Express bus inside the server. The HCA supports RDMA, which is also called kernel
bypass. RDMA fits clusters well. It uses a virtual addressing scheme that lets a server
identify and use memory resources of other servers without involving any operating
system kernel.
RDMA helps implement transport offload. The transport offload function transfers
data packet routing from the OS to the chip level, reducing the service load of the
processor. An 80 GHz processor is required to process data at a transmission speed
of 10 Gbit/s in the OS.
The IB system includes CAs, switches, routers, repeaters, and the links that connect them.
CAs include HCAs and TCAs.
An HCA is used to connect a host processor to the IB structure.
A TCA is used to connect I/O adapters to the IB structure.
IB in storage: The IB front-end network is used to exchange data with customers. Data is
transmitted based on the IPoIB protocol. The IB back-end network is used for data
interaction between nodes in a storage device. The RPC module uses RDMA to
synchronize data between nodes.
IB layers include the application layer, transport layer, network layer, link layer, and
physical layer. The functions of each layer are described as follows:
Transport layer: responsible for in-order distribution and segmentation of packets,
channel multiplexing, and data transmission. It also sends, receives, and reassembles
data packet segments.
Network layer: provides a mechanism for routing packets from one substructure to
another. Each routing packet of the source and destination nodes has a global
routing header (GRH) and a 128-bit IPv6 address. A standard global 64-bit identifier
is also embedded at the network layer and this identifier is unique in all subnets.
Through the exchange of such identifier values, data can be transmitted across
multiple subnets.
Link layer: provides such functions as packet design, point-to-point connection, and
packet switching in the local subsystems. At the packet communication level, two
special packet types are specified: data transmission and network management
packets. The network management packet provides functions like operation control,
subnet indication, and fault tolerance for device enumeration. The data transmission
packet is used for data transmission. The maximum size of each packet is 4 KB. In
each specific device subnet, the direction and exchange of each packet are
implemented by a local subnet manager with a 16-bit identifier address.
Physical layer: defines connections at three rates: 1X, 4X, and 12X. The signal
transmission rates are 2.5 Gbit/s, 10 Gbit/s, and 30 Gbit/s, respectively. IBA therefore
allows multiple connections to reach a speed of up to 30 Gbit/s. Because full-duplex
serial communication is used, a 1X bidirectional connection requires only four signal
wires, and the 12X mode requires only 48.
A NAS device is a closed storage system. The Client Agent of the backup software
can only be installed on the production system instead of the NAS device. In the
traditional network backup process, data is read from a NAS device through the CIFS
or NFS sharing protocol, and then transferred to a backup server over a network.
Such a mode occupies network, production system and backup server resources,
resulting in poor performance and an inability to meet the requirements for backing
up a large amount of data.
The NDMP protocol is designed for the data backup system of NAS devices. It enables
NAS devices to send data directly to the connected disk devices or the backup servers on
the network for backup, without any backup client agent being required.
There are two networking modes for NDMP:
On a 2-way network, backup media is connected directly to a NAS storage system
instead of to a backup server. In a backup process, the backup server sends a backup
command to the NAS storage system through the Ethernet. The system then directly
backs up data to the tape library it is connected to.
In the NDMP 2-way backup mode, data flows are transmitted directly to backup
media, greatly improving the transmission performance and reducing server
resource usage. However, a tape library is connected to a NAS storage device, so
the tape library can back up data only for the NAS storage device to which it is
connected.
Tape libraries are expensive. To enable different NAS storage devices to share
tape devices, NDMP also supports the 3-way backup mode.
In the 3-way backup mode, a NAS storage system can transfer backup data to a NAS
storage device connected to a tape library through a dedicated backup network.
Then, the storage device backs up the data to the tape library.
Dual-controller Storage:
Currently, dual-controller architecture is mainly used in mainstream entry-level and
mid-range storage systems.
There are two working modes: Active-Standby and Active-Active.
Active-Standby
This is also called high availability (HA). Only one controller is working at a time,
while the other waits, synchronizes data, and monitors services. If the active
controller fails, the standby controller takes over its services. The active controller is
powered off or restarted before the takeover to prevent split-brain; its buses are
released so that the standby controller can take over the back-end and front-end
buses.
Active-Active
Two controllers are working at the same time. Each connects to all back-end
buses, but each bus is managed by only one controller. Each controller manages
half of all back-end buses. If one controller is faulty, the other takes over all
buses. This is more efficient than Active-Standby.
Mid-range Storage Architecture Evolution:
Mid-range storage systems always use an independent dual-controller architecture.
Controllers are usually of modular hardware.
The evolution of mid-range storage mainly focuses on the rate of host interfaces and
disk interfaces, and the number of ports.
A common feature is the convergence of SAN and NAS storage services.
Multi-controller Storage:
Most mission-critical storage systems use multi-controller architecture.
The main architecture models are as follows:
Bus architecture
Hi-Star architecture
Direct-connection architecture
Virtual matrix architecture
Mission-critical storage architecture evolution:
In 1990, EMC launched Symmetrix, a full bus architecture. A parallel bus connected
front-end interface modules, cache modules, and back-end disk interface modules
for data and signal exchange in time-division multiplexing mode.
In 2000, HDS adopted the switching architecture for Lightning 9900 products. Front-
end interface modules, cache modules, and back-end disk interface modules were
connected on two redundant switched networks, increasing communication channels
to dozens of times more than that of the bus architecture. The internal bus was no
longer a performance bottleneck.
In 2003, EMC launched the DMX series based on a full mesh architecture. All modules
were connected in point-to-point mode, providing theoretically larger internal
bandwidth but adding system complexity and limiting scalability.
Scale-up:
This traditional vertical expansion architecture continuously adds storage disks into
the existing storage systems to meet demands.
Advantage: simple operation at the initial stage
Disadvantage: As the storage system scale increases, resource increase reaches a
bottleneck.
Scale-out:
This horizontal expansion architecture adds controllers to meet demands.
Advantage: As the scale increases, the unit price decreases and the efficiency is
improved.
Disadvantage: The complexity of software and management increases.
Huawei SAS disk enclosure is used as an example.
Port consistency: In a loop, the EXP port of an upper-level disk enclosure is connected
to the PRI port of a lower-level disk enclosure.
Dual-plane networking: Expansion module A connects to controller A, while
expansion module B connects to controller B.
Symmetric networking: On controllers A and B, symmetric ports and slots are
connected to the same disk enclosure.
Forward and backward connection networking: Expansion module A uses forward
connection, while expansion module B uses backward connection.
Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed
the upper limit.
Huawei smart disk enclosure is used as an example.
Port consistency: In a loop, the EXP (P1) port of an upper-level disk enclosure is
connected to the PRI (P0) port of a lower-level disk enclosure.
Dual-plane networking: Expansion board A connects to controller A, while expansion
board B connects to controller B.
Symmetric networking: On controllers A and B, symmetric ports and slots are
connected to the same disk enclosure.
Forward connection networking: Both expansion modules A and B use forward
connection.
Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed
the upper limit.
IP scale-out is used for Huawei OceanStor V3 and V5 entry-level and mid-range series,
Huawei OceanStor V5 Kunpeng series, and Huawei OceanStor Dorado V6 series. IP scale-
out integrates TCP/IP, Remote Direct Memory Access (RDMA), and Internet Wide Area
RDMA Protocol (iWARP) to implement service switching between controllers, which
complies with the all-IP trend of the data center network.
PCIe scale-out is used for Huawei OceanStor 18000 V3 and V5 series, and Huawei
OceanStor Dorado V3 series. PCIe scale-out integrates PCIe channels and the RDMA
technology to implement service switching between controllers.
PCIe scale-out: features high bandwidth and low latency.
IP scale-out: employs standard data center technologies (such as ETH, TCP/IP, and
iWARP) and infrastructure, and boosts the development of Huawei's proprietary chips for
entry-level and mid-range products.
Next, let's move on to I/O read and write processes of the host. The scenarios are as
follows:
Local Write Process
A host delivers write I/Os to engine 0.
Engine 0 writes the data into the local cache, implements mirror protection, and
returns a message indicating that data is written successfully.
Engine 0 flushes dirty data onto a disk. If the target disk is on the local
computer, engine 0 directly delivers the write I/Os.
If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
Engine 1 writes dirty data onto disks.
Non-local Write Process
A host delivers write I/Os to engine 2.
After detecting that the LUN is owned by engine 0, engine 2 transfers the write
I/Os to engine 0.
Engine 0 writes the data into the local cache, implements mirror protection, and
returns a message to engine 2, indicating that data is written successfully.
Engine 2 returns the write success message to the host.
Engine 0 flushes dirty data onto a disk. If the target disk is on the local
computer, engine 0 directly delivers the write I/Os.
If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
Engine 1 writes dirty data onto disks.
Local Read Process
A host delivers read I/Os to engine 0.
If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to the
host.
If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from
the disk. If the target disk is on the local computer, engine 0 reads data from the
disk.
After the read I/Os are hit locally, engine 0 returns the data to the host.
If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
Engine 1 reads data from the disk.
Engine 1 accomplishes the data read.
Engine 1 returns the data to engine 0 and then engine 0 returns the data to the
host.
Non-local Read Process
The LUN is not owned by the engine that delivers read I/Os, and the host
delivers the read I/Os to engine 2.
After detecting that the LUN is owned by engine 0, engine 2 transfers the read
I/Os to engine 0.
If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to
engine 2.
Engine 2 returns the data to the host.
If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from
the disk. If the target disk is on the local computer, engine 0 reads data from the
disk.
After the read I/Os are hit locally, engine 0 returns the data to engine 2 and
then engine 2 returns the data to the host.
If the target disk is on a remote device, engine 0 transfers the I/Os to engine 1
where the disk resides.
Engine 1 reads data from the disk.
Engine 1 completes the data read.
Engine 1 returns the data to engine 0, engine 0 returns the data to engine 2, and
then engine 2 returns the data to the host.
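The common pattern in the four processes above is that an I/O is first forwarded to the engine that owns the LUN, and then, if necessary, to the engine where the target disk resides. A compressed sketch of this forwarding decision (the function and engine names are hypothetical, not the actual controller code):

```python
# Sketch of I/O forwarding by LUN ownership (illustrative only).

def io_path(receiving_engine, owning_engine, disk_engine):
    path = [receiving_engine]
    if receiving_engine != owning_engine:
        path.append(owning_engine)       # non-local I/O: forward to the LUN's owning engine
    # The owner caches and mirrors writes (or checks its cache for reads) and acknowledges;
    # it involves the engine where the target disk resides when flushing or on a cache miss.
    if owning_engine != disk_engine:
        path.append(disk_engine)
    return path

print(io_path("engine 2", "engine 0", "engine 1"))   # ['engine 2', 'engine 0', 'engine 1']
print(io_path("engine 0", "engine 0", "engine 0"))   # ['engine 0'] -> fully local I/O
```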
internal links, and each front-end port provides a communication link for the host. If any
controller restarts during an upgrade, services are seamlessly switched to the other
controller without impacting hosts or interrupting links. The host is unaware of
controller faults. Switchover is completed within 1 second.
The FIM has the following features:
Failure of a controller will not disconnect the front-end link, and the host is unaware
of the controller failure.
The PCIe link between the FIM and the controller is disconnected, and the FIM
detects the controller failure.
Service switchover is performed between the controllers, and the FIM redistributes
host requests to other controllers.
The switchover time is about 1 second, which is much shorter than switchover
performed by multipathing software (10-30s).
In global cache mode, host data is directly written into linear space logs, and the logs
directly copy the host data to the memory of multiple controllers using RDMA based on a
preset copy policy. The global cache consists of two parts:
Global memory: memory of all controllers (four controllers in the figure). This is
managed in a unified memory address, and provides linear address space for the
upper layer based on a redundancy configuration policy.
WAL: new write cache of the log type
The global pool uses RAID 2.0+, full-strip write of new data, and shared RAID groups
between multiple strips.
Another feature is back-end sharing, which includes sharing of back-end interface
modules within an enclosure and cross-controller enclosure sharing of back-end disk
enclosures.
Active-Active Architecture with Full Load Balancing:
Even distribution of unhomed LUNs
Data on LUNs is divided into 64 MB slices. The slices are distributed to different
virtual nodes based on the hash result (LUN ID + LBA).
Front-end load balancing
UltraPath selects appropriate physical links to send each slice to the
corresponding virtual node.
The front-end interconnect I/O modules forward the slices to the corresponding
virtual nodes.
Front-end: If there is no UltraPath or FIM, the controllers forward I/Os to the
corresponding virtual nodes.
Global write cache load balancing
The data volume is balanced.
Data hotspots are balanced.
Global storage pool load balancing
Usage of disks is balanced.
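The slice-based distribution described above can be sketched as follows. The CRC32 hash and the eight virtual nodes are illustrative assumptions; only the 64 MB slice size comes from the description above:

```python
# Sketch: distribute 64 MB LUN slices across virtual nodes by hashing LUN ID + LBA.
import zlib

SLICE_SIZE = 64 * 1024 * 1024    # data on a LUN is divided into 64 MB slices
NUM_VNODES = 8                   # illustrative virtual node count

def vnode_for(lun_id, lba):
    slice_start = (lba // SLICE_SIZE) * SLICE_SIZE
    key = f"{lun_id}:{slice_start}".encode()
    return zlib.crc32(key) % NUM_VNODES

# All addresses inside one 64 MB slice land on the same virtual node,
# while different slices spread across the virtual nodes.
print(vnode_for(1, 0) == vnode_for(1, 10 * 1024 * 1024))     # True
print({vnode_for(1, n * SLICE_SIZE) for n in range(16)})     # multiple distinct nodes
```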
2.5.2 NAS
Enterprises need to store a large amount of data and share the data through a network.
Therefore, network-attached storage (NAS) is a good choice. NAS connects storage
devices to the live network and provides data and file services.
For a server or host, NAS is an external device and can be flexibly deployed through the
network. In addition, NAS provides file-level sharing rather than block-level sharing,
which makes it easier for clients to access NAS over the network. UNIX and Microsoft
Windows users can seamlessly share data through NAS or File Transfer Protocol (FTP).
When NAS sharing is used, UNIX uses NFS and Windows uses CIFS.
NAS has the following characteristics:
NAS provides storage resources through file-level data access and sharing, enabling
users to quickly share files with minimum storage management costs.
NAS is a preferred file sharing storage solution that does not require multiple file
servers.
NAS also helps eliminate bottlenecks in user access to general-purpose servers.
NAS uses network and file sharing protocols for archiving and storage. These
protocols include TCP/IP for data transmission as well as CIFS and NFS for providing
remote file services.
A general-purpose server can be used to carry any application and run a general-purpose
operating system. Unlike general-purpose servers, NAS is dedicated to file services and
provides file sharing services for other operating systems using open standard protocols.
NAS devices are optimized based on general-purpose servers in aspects such as file
service functions, storage, and retrieval. To improve the high availability of NAS devices,
some NAS vendors also support the NAS clustering function.
The components of a NAS device are as follows:
NAS engine (CPU and memory)
One or more NICs that provide network connections, for example, GE NIC and 10GE
NIC.
optimizes the NFS client, so that the VM storage space can be created on the shared
space of the NFS server.
Working principles of CIFS: CIFS runs on top of TCP/IP and allows Windows computers to
access files on UNIX computers over a network.
The CIFS protocol applies to file sharing. Two typical application scenarios are as follows:
File sharing service
CIFS is commonly used in file sharing service scenarios such as enterprise file
sharing.
Hyper-V VM application scenario
SMB can be used to share the images of Hyper-V virtual machines promoted by
Microsoft. In this scenario, the failover feature of SMB 3.0 is required to ensure
service continuity upon a node failure and to ensure the reliability of VMs.
2.5.3 SAN
2.5.3.1 IP SAN Technologies
NIC + Initiator software: Host devices such as servers and workstations use standard NICs
to connect to Ethernet switches. iSCSI storage devices are also connected to the Ethernet
switches or to the NICs of the hosts. The initiator software installed on hosts virtualizes
NICs into iSCSI cards. The iSCSI cards are used to receive and transmit iSCSI data packets,
implementing iSCSI and TCP/IP transmission between the hosts and iSCSI devices. This
mode uses standard Ethernet NICs and switches, eliminating the need for adding other
adapters. Therefore, this mode is the most cost-effective. However, the mode occupies
host resources when converting iSCSI packets into TCP/IP packets, increasing host
operation overheads and degrading system performance. The NIC + initiator software
mode is applicable to scenarios that require relatively low I/O and bandwidth
performance for data access.
TOE NIC + initiator software: The TOE NIC processes the functions of the TCP/IP protocol
layer, and the host processes the functions of the iSCSI protocol layer. Therefore, the TOE
NIC significantly improves the data transmission rate. Compared with the pure software
mode, this mode reduces host operation overheads and requires minimal network
construction expenditure. This is a trade-off solution.
iSCSI HBA:
An iSCSI HBA is installed on the host to implement efficient data exchange between
the host and the switch and between the host and the storage device. Functions of
the iSCSI protocol layer and TCP/IP protocol stack are handled by the host HBA,
occupying the least CPU resources. This mode delivers the best data transmission
performance but requires high expenditure.
The iSCSI communication system inherits part of SCSI's features. The iSCSI
communication involves an initiator that sends I/O requests and a target that
responds to the I/O requests and executes I/O operations. After a connection is set
up between the initiator and target, the target controls the entire process as the
primary device. The target includes the iSCSI disk array and iSCSI tape library.
The iSCSI protocol defines a set of naming and addressing methods for iSCSI
initiators and targets. All iSCSI nodes are identified by their iSCSI names. In this way,
iSCSI names are distinguished from host names.
iSCSI uses iSCSI Qualified Name (IQN) to identify initiators and targets. Addresses
change with the relocation of initiator or target devices, but their names remain
unchanged. When setting up a connection, an initiator sends a request. After the
target receives the request, it checks whether the iSCSI name contained in the
request is consistent with that bound with the target. If the iSCSI names are
consistent, the connection is set up. Each iSCSI node has a unique iSCSI name. One
iSCSI name can be used in the connections from one initiator to multiple targets.
Multiple iSCSI names can be used in the connections from one target to multiple
initiators.
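As a simple illustration of the IQN layout (iqn.yyyy-mm.reversed-domain:optional-identifier), the following Python sketch checks whether a name follows that basic structure. The example names are hypothetical and are not taken from this guide.

```python
import re

# Basic IQN layout: "iqn." + the year-month the naming authority registered its
# domain + the reversed domain name + an optional ":identifier" suffix.
IQN_PATTERN = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.-]+(:.+)?$")

def is_valid_iqn(name: str) -> bool:
    """Return True if the string follows the basic IQN naming layout."""
    return bool(IQN_PATTERN.match(name.lower()))

# Hypothetical initiator and target names, for illustration only.
print(is_valid_iqn("iqn.1991-05.com.example:client01"))        # True
print(is_valid_iqn("iqn.2006-08.com.example:storage.lun0"))    # True
print(is_valid_iqn("eui.02004567a425678d"))                    # False (EUI-64 format, not IQN)
```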
Logical ports are created based on bond ports, VLAN ports, or Ethernet ports. Logical
ports are virtual ports that carry host services. A unique IP address is allocated to each
logical port for carrying its services.
Bond port: To improve reliability of paths for accessing file systems and increase
bandwidth, you can bond multiple Ethernet ports on the same interface module to
form a bond port.
VLAN: VLANs logically divide the physical Ethernet ports or bond ports of a storage system into multiple broadcast domains. When service data is sent or received on a VLAN, the data is tagged with a VLAN ID so that the networks and services of different VLANs are isolated from each other, improving service data security and reliability.
Ethernet port: Physical Ethernet ports on an interface module of a storage system.
Bond ports, VLANs, and logical ports are created based on Ethernet ports.
IP address failover: A logical IP address fails over from a faulty port to an available port, so that services are switched from the faulty port to the available port without interruption. After the faulty port recovers, services can be switched back to it, either automatically or manually. IP address failover applies to IP SAN and NAS.
During the IP address failover, services are switched from the faulty port to an available
port, ensuring service continuity and improving the reliability of paths for accessing file
systems. Users are not aware of this process.
The essence of IP address failover is a service switchover between ports. The ports can be
Ethernet ports, bond ports, or VLAN ports.
Ethernet port–based IP address failover: To improve the reliability of paths for
accessing file systems, you can create logical ports based on Ethernet ports.
Figure 2-10 Ethernet port-based IP address failover
Host services are running on logical port A of Ethernet port A. The corresponding
IP address is "a". Ethernet port A fails and thereby cannot provide services. After
IP address failover is enabled, the storage system will automatically locate
available Ethernet port B, delete the configuration of logical port A that
corresponds to Ethernet port A, and create and configure logical port A on
Ethernet port B. In this way, host services are quickly switched to logical port A
on Ethernet port B. The service switchover is executed quickly. Users are not
aware of this process.
Bond port-based IP address failover: To improve the reliability of paths for accessing
file systems, you can bond multiple Ethernet ports to form a bond port. When an
Ethernet port that is used to create the bond port fails, services are still running on
the bond port. The IP address fails over only when all Ethernet ports that are used to
create the bond port fail.
Figure 2-11 Bond port-based IP address failover
Multiple Ethernet ports are bonded to form bond port A. Logical port A created
based on bond port A can provide high-speed data transmission. When both
Ethernet ports A and B fail due to various causes, the storage system will
automatically locate bond port B, delete logical port A, and create the same
logical port A on bond port B. In this way, services are switched from bond port
A to bond port B. After Ethernet ports A and B recover, services will be switched back to bond port A.
control the access permission of each device or port. Moreover, you can set traffic
isolation zones. When there are multiple ISLs (E_Ports), an ISL only transmits the traffic
destined for ports that reside in the same traffic isolation zone.
Arbitration plane: communicates with the HyperMetro quorum server. This plane is
planned only when the HyperMetro function is planned for the block service.
The key software components and their functions are described as follows:
FSM: a management process of Huawei distributed storage that provides operation
and maintenance (O&M) functions, such as alarm management, monitoring, log
management, and configuration. It is recommended that this module be deployed on
two nodes in active/standby mode.
Virtual Block Service (VBS): a process that provides the distributed storage access
point service through SCSI or iSCSI interfaces and enables application servers to
access distributed storage resources
Object Storage Device (OSD): a component of Huawei distributed storage for storing
user data in distributed clusters.
REP: data replication network
Enterprise Data Service (EDS): a component that processes I/O services sent from
VBS.
SSDs use NAND flash as a persistent storage medium. Compared with traditional HDDs, SSDs offer higher speeds, lower power consumption, and lower latency. They are also smaller, lighter, and shock-resistant.
High performance
All-SSD configuration boasts high IOPS and low latency.
Support for FlashLink® technologies, such as intelligent multi-core scheduling, efficient RAID, hot and cold data separation, and low latency.
High reliability
Component failure protection, dual-redundancy design, and active-active working
mode; SmartMatrix 3.0 full-mesh architecture for high efficiency, low latency, and
collaborative operations.
Dual-redundancy design, power-off protection, and coffer disk.
Advanced data protection technologies: HyperSnap, HyperReplication, HyperClone,
and HyperMetro.
RAID 2.0+ underlying virtualization.
High availability
Supports online replacement of components, such as controllers, power supplies,
interface modules, and disks.
Supports disk roaming, which enables the storage system to automatically identify
relocated disks and resume their services.
Centrally manages storage resources in third-party storage systems.
2.6.1.4 Components
The controller enclosure uses a modular design and consists of a system enclosure,
controllers (including fan modules), power modules, BBU modules, and disk modules.
centric strategies. Existing IT systems are under increasing pressure to improve. For
example, it takes several hours to process data and integrate data warehouses in the bill
and inventory systems of banks and large enterprises. As a result, services like operation
analysis and service queries cannot be obtained in a timely manner.
Solution:
Huawei's high-performance all-flash solution resolves these problems. High-end all-flash
storage systems are used to carry multiple core applications (services like the transaction
system database). The processing time is reduced by more than half, the response latency
is shortened, and the service efficiency is improved several times over.
4K/8K video, autonomous driving, and big data analytics are raising data storage demands.
Huawei OceanStor 100D is a scale-out distributed storage product that supports the business needs of both today and tomorrow. It provides elastic on-demand services powered by cloud infrastructure and carries both critical and emerging workloads.
Huawei OceanStor 100D provides ultra-large data storage resource pools
featuring on-demand resource provisioning and elastic capacity expansion in
virtualization and private cloud environments. It improves storage deployment,
expansion, and operation and maintenance (O&M) efficiency using general-purpose
servers. Typical scenarios include Internet-finance channel access clouds, development
and testing clouds, cloud-based services, B2B cloud resource pools in carriers' BOM
domains, and e-Government clouds.
Mission-critical database
Huawei OceanStor 100D delivers enterprise-grade capabilities, such as distributed active-
active storage and consistent low latency, to ensure efficient and stable running of data
warehouses and mission-critical databases, including online analytical processing (OLAP)
and online transaction processing (OLTP).
Big data analytics
OceanStor 100D provides an industry-leading decoupled storage-compute solution for
big data, which integrates traditional data silos and builds a unified big data resource
pool for enterprises. It also leverages enterprise-grade capabilities, such as elastic large-
ratio erasure coding (EC) and on-demand deployment and expansion of decoupled
compute and storage resources, to improve big data service efficiency and reduce TCO.
Typical scenarios include big data analytics for finance, carriers (log retention), and
governments.
Content storage and backup archiving
OceanStor 100D provides high-performance and highly reliable object storage resource
pools to meet large throughput, frequent access to hotspot data, as well as long-term
storage and online access requirements of real-time online services such as Internet data,
online audio/video, and enterprise web disks. Typical scenarios include storage, backup,
and archiving of financial electronic check images, audio and video recordings, medical
images, government and enterprise electronic documents, and Internet of Vehicles (IoV).
Capacity-on-write: Upon receiving a write request from a host, a thin LUN uses direct-on-time to check whether physical storage space has been allocated to the logical storage space addressed by the request. If not, a space allocation task is triggered, and space is allocated with the grain as the minimum granularity. Then data is written to the newly allocated physical storage space.
Direct-on-time: When capacity-on-write is used, the relationship between the actual storage area and the logical storage area of data is not calculated using a fixed formula, but is determined by mappings created on demand under the capacity-on-write principle. Therefore, when data is read from or written to a thin LUN, the read or write request must be redirected to the actual storage area based on the mapping between the actual storage area and the logical storage area.
Mapping table: records the mapping between actual storage areas and logical storage areas. The mapping table is dynamically updated during writes and queried during reads.
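The minimal Python sketch below illustrates how a mapping table can support capacity-on-write and direct-on-time. It is an assumption-based illustration only (the grain size and data structures are hypothetical, not the product's internal design): space is allocated at grain granularity on the first write, and reads are redirected through the mapping.

```python
GRAIN_SIZE = 64 * 1024  # hypothetical grain size in bytes

class ThinLUN:
    def __init__(self):
        self.mapping = {}        # mapping table: logical grain -> physical grain
        self.next_physical = 0   # next free grain in the physical storage pool

    def write(self, logical_offset: int) -> int:
        """Capacity-on-write: allocate physical space only on the first write."""
        grain = logical_offset // GRAIN_SIZE
        if grain not in self.mapping:
            self.mapping[grain] = self.next_physical
            self.next_physical += 1
        return self.mapping[grain]           # physical grain that receives the data

    def read(self, logical_offset: int):
        """Direct-on-time: redirect the request through the mapping table."""
        return self.mapping.get(logical_offset // GRAIN_SIZE)

lun = ThinLUN()
print(lun.write(0))            # allocates physical grain 0
print(lun.write(128 * 1024))   # allocates physical grain 1
print(lun.read(0))             # -> 0 (redirected via the mapping table)
```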
3.1.2 SmartTier
3.1.2.1 Overview
SmartTier is also called intelligent storage tiering. It provides the intelligent data storage
management function that automatically matches data to the storage media best suited
to that type of data by analyzing data activities.
SmartTier migrates hot data to storage media with high performance (such as SSDs) and
moves idle data to more cost-effective storage media (such as NL-SAS disks) with more
capacity. This provides hot data with quick response and high input/output operations
per second (IOPS), thereby improving the performance of the storage system.
The I/O monitoring module collects statistics on the access frequency of data blocks. The data analysis module ranks the activity levels of all
data blocks (in the same storage pool) in descending order and the hottest data blocks
are migrated first.
The data migration module implements data migration. SmartTier migrates data blocks
based on the rank and the migration policy. Data blocks with higher activity levels are migrated to higher tiers (usually the high-performance tier or performance tier), and data blocks with lower activity levels are migrated to lower tiers (usually the performance tier or capacity tier).
SmartTier needs extra space for data exchange when dynamically migrating data.
Therefore, a storage pool configured with SmartTier needs to reserve certain free space.
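The following Python sketch illustrates the ranking-and-migration idea only; the tier names, access counts, and capacity value are assumptions for illustration, not product parameters.

```python
def plan_migrations(block_heat: dict, ssd_block_capacity: int) -> dict:
    """Rank data blocks by access frequency (hottest first) and place the hottest
    blocks on the high-performance tier; the rest go to the capacity tier."""
    ranked = sorted(block_heat.items(), key=lambda kv: kv[1], reverse=True)
    return {block: ("high-performance (SSD)" if rank < ssd_block_capacity
                    else "capacity (NL-SAS)")
            for rank, (block, _count) in enumerate(ranked)}

# Hypothetical access counts collected during a monitoring period.
heat = {"blk0": 950, "blk1": 12, "blk2": 430, "blk3": 3}
print(plan_migrations(heat, ssd_block_capacity=2))
```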
3.1.3 SmartQoS
3.1.3.1 Overview
SmartQoS dynamically allocates storage resources to meet certain performance goals of
specified applications.
As storage technologies develop, a storage system is capable of providing larger
capacities. Accordingly, a growing number of users choose to deploy multiple applications
on one storage device. Different applications contend for system bandwidth and Input/Output Operations Per Second (IOPS) resources, compromising the performance of critical applications.
SmartQoS helps users properly use storage system resources and ensures high
performance of critical services.
SmartQoS enables users to set performance indicators like IOPS or bandwidth for certain
applications. The storage system dynamically allocates system resources to meet QoS
requirements of certain applications based on specified performance goals. It gives
priority to certain applications with demanding QoS requirements.
For example, if a user enables a SmartQoS policy for two LUNs in a storage system and
sets performance objectives in the SmartQoS policy, the storage system can limit the
system resources allocated to the LUNs to reserve more resources for high-priority LUNs.
The performance goals controlled by SmartQoS are bandwidth, IOPS, and latency.
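One generic way to cap the IOPS of a lower-priority LUN is a token bucket. The sketch below illustrates that idea only and is not the product's internal algorithm; the limit value is a hypothetical example.

```python
import time

class IOPSLimiter:
    """Simple token bucket: allow at most `iops_limit` I/Os per second."""
    def __init__(self, iops_limit: int):
        self.iops_limit = iops_limit
        self.tokens = float(iops_limit)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to the elapsed time, up to the limit.
        self.tokens = min(self.iops_limit,
                          self.tokens + (now - self.last) * self.iops_limit)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # this I/O must wait, leaving resources for higher-priority LUNs

limiter = IOPSLimiter(iops_limit=500)   # hypothetical cap for a low-priority LUN
print(limiter.allow())                  # True until the budget for this second is used up
```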
3.1.4 SmartDedupe
3.1.4.1 Overview
SmartDedupe eliminates redundant data from a storage system. This deduplication
technology reduces the amount of physical storage capacity occupied by data to release
more storage capacity for increasing services.
Huawei OceanStor storage systems provide inline deduplication to deduplicate the data
that is newly written into the storage systems.
Inline deduplication deletes duplicate data before the data is written into disks.
Similarity-based deduplication analyzes data that has already been written to disks, uses similarity fingerprints to identify duplicate and similar data blocks, and then deduplicates them.
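Conceptually, inline deduplication fingerprints each incoming block and stores only blocks whose fingerprints have not been seen before. The Python sketch below is a minimal illustration of that idea, not the product implementation; SHA-256 is used here only as an example fingerprint function.

```python
import hashlib

fingerprint_library = {}   # fingerprint -> location (index) of the stored block
stored_blocks = []         # stands in for the physical storage pool

def write_block(data: bytes) -> int:
    """Return the location of the block, storing it only if it is new."""
    fp = hashlib.sha256(data).hexdigest()     # fingerprint of the incoming block
    if fp in fingerprint_library:
        return fingerprint_library[fp]        # duplicate: reference the existing block
    stored_blocks.append(data)
    fingerprint_library[fp] = len(stored_blocks) - 1
    return fingerprint_library[fp]

print(write_block(b"A" * 4096))   # 0: new block is stored
print(write_block(b"A" * 4096))   # 0: duplicate, no additional space consumed
print(write_block(b"B" * 4096))   # 1: new block is stored
```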
3.1.5 SmartCompression
3.1.5.1 Overview
SmartCompression reorganizes data to save space and improves the data transfer,
processing, and storage efficiency without losing any data. OceanStor Dorado V6 series
storage systems support inline compression and post-process compression.
Inline compression: The system deduplicates and compresses data before writing it to
disks. User data is processed in real time.
Post-process compression: Data is written to disks in advance and then read and
compressed when the system is idle.
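The principle of lossless compression can be shown with a generic example: the data is reorganized into a smaller form and can always be restored exactly. The sketch below uses Python's zlib purely for illustration; it does not represent the compression algorithm actually used by the storage system.

```python
import zlib

original = b"storage " * 512             # highly repetitive sample data
compressed = zlib.compress(original)     # inline or post-process: same principle
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed))   # compressed form is much smaller
assert restored == original                   # no data is lost
```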
3.1.6 SmartMigration
3.1.6.1 Overview
SmartMigration is a key service migration technology. Service data can be migrated
within a storage system and between different storage systems.
"Consistent" means that after the service migration is complete, all of the service data
has been replicated from a source LUN to a target LUN.
Service data synchronization between a source LUN and a target LUN includes initial
synchronization and data change synchronization. The two synchronization modes are
independent and can be performed at the same time to ensure that service data changes
on the host are synchronized to the source LUN and the target LUN.
the new data space. As shown in the slide, L0 to L4 are logical addresses, P0 to P8 are
physical addresses, and A to I are data.
3.2.2 HyperClone
3.2.2.1 Overview
HyperClone allows you to obtain full copies of LUNs without interrupting host services.
These copies can be used for data backup and restoration, data reproduction, and data
analysis.
3.2.2.3 Synchronization
When a HyperClone pair starts synchronization, the system generates an instant snapshot
for the source LUN, synchronizes the snapshot data to the target LUN, and records
subsequent write operations in a differential table.
When synchronization is performed again, the system compares the data of the source
and target LUNs, and only synchronizes the differential data to the target LUN. The data
written to the target LUN between the two synchronizations will be overwritten. Before
synchronization, users can create a snapshot for a target LUN to retain its data changes.
Relevant concepts:
1. Pair: In HyperClone, a pair has one source LUN and one target LUN. A pair is a
mirror relationship between the source and target LUNs. A source LUN can form
multiple HyperClone pairs with different target LUNs. A target LUN can be added to
only one HyperClone pair.
2. Synchronization: Data is copied from a source LUN to a target LUN.
3. Reverse synchronization: If data on the source LUN needs to be restored, you can
reversely synchronize data from the target LUN to the source LUN.
4. Differential copy: The differential data can be synchronized from the source LUN to
the target LUN based on the differential bitmap.
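The differential-table idea can be pictured with the minimal Python sketch below; the LUN and block structures are illustrative assumptions, not the product's data layout.

```python
def synchronize(source: list, target: list, differential: set) -> None:
    """Copy only the blocks recorded in the differential table to the target LUN."""
    for block in differential:
        target[block] = source[block]
    differential.clear()

# Hypothetical LUNs modelled as lists of blocks.
source_lun = ["a", "b", "c", "d"]
target_lun = ["a", "b", "c", "d"]   # state right after the initial synchronization
diff_table = set()

source_lun[1] = "B"                 # a host write is recorded in the differential table
diff_table.add(1)

synchronize(source_lun, target_lun, diff_table)
print(target_lun)                   # ['a', 'B', 'c', 'd']
```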
3.2.3 HyperReplication
3.2.3.1 Overview
As digitization advances in various industries, data has become critical to the efficient
operation of enterprises, and users impose increasingly demanding requirements on the
stability of storage systems. Although many enterprises have highly stable storage
systems, it is a big challenge for them to ensure data restoration from damage caused by
natural disasters. To ensure continuity, recoverability, and high availability of service data,
remote DR solutions emerge. The HyperReplication technology is one of the key
technologies used in remote DR solutions.
HyperReplication is Huawei's remote replication feature. It provides a flexible and powerful data replication function that facilitates remote data backup and restoration, continuous support for service data, and disaster recovery.
A primary site is a production center that includes primary storage systems, application
servers, and links.
A secondary site is a backup center that includes secondary storage systems, application
servers, and links.
HyperReplication supports the following two modes:
Synchronous remote replication between LUNs: Data is synchronized between
primary and secondary LUNs in real time. No data is lost when a disaster occurs.
However, the performance of production services is affected by the latency of the
data transmission between primary and secondary LUNs.
Asynchronous remote replication between LUNs: Data is periodically
synchronized between primary and secondary LUNs. The performance of production
services is not affected by the latency of the data transmission between primary and
secondary LUNs. If a disaster occurs, some data may be lost.
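The trade-off between the two modes can be summarized in the small sketch below; the helper names and structures are assumptions for illustration, and real replication runs inside the storage controllers rather than on the host.

```python
def sync_write(primary: list, secondary: list, addr: int, data: str) -> str:
    """Synchronous mode: data reaches both LUNs before the host is acknowledged."""
    primary[addr] = data
    secondary[addr] = data        # no data loss in a disaster, but the replication
    return "ack"                  # link latency is added to every host write

def async_write(primary: list, pending: dict, addr: int, data: str) -> str:
    """Asynchronous mode: the host is acknowledged immediately."""
    primary[addr] = data
    pending[addr] = data          # change recorded for the next replication period
    return "ack"

def replicate_period(secondary: list, pending: dict) -> None:
    """Periodic synchronization; writes since the last period may be lost in a disaster."""
    for addr, data in pending.items():
        secondary[addr] = data
    pending.clear()

primary_lun, secondary_lun, pending = ["-"] * 4, ["-"] * 4, {}
async_write(primary_lun, pending, 2, "X")
replicate_period(secondary_lun, pending)
print(secondary_lun)              # ['-', '-', 'X', '-']
```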
3.2.3.4 Phases
HyperReplication involves the following phases: creating a HyperReplication relationship,
synchronizing data, switching over services, and restoring data.
1. Create a HyperReplication pair.
2. Synchronize all data manually or automatically from the primary LUN to the
secondary LUN of the HyperReplication pair. In addition, periodically synchronize
incremental data on the primary LUN to the secondary LUN.
3. Check the data status of the HyperReplication pair and the read/write properties of
the secondary LUN to determine whether a primary/secondary switchover can be performed.
3.2.4 HyperMetro
3.2.4.1 Overview
HyperMetro is Huawei's active-active storage solution. Two DCs enabled with
HyperMetro back up each other and both are carrying services. If a device is faulty in a
DC or if the entire center is faulty, the other DC will automatically take over services,
solving the switchover problems of traditional DR centers. This ensures high data
reliability and service continuity, and improves the resource utilization of the storage
system.
The data dual-write technology ensures storage redundancy. No data is lost if there
is only one storage system running or the production center fails. Services are
switched over quickly, maximizing customer service continuity. This solution meets
the service requirements of RTO = 0 and RPO = 0.
HyperMetro and SmartVirtualization can be used together to support heterogeneous
storage and consolidate resources on the network layer to protect the existing
investment of the customer.
This solution can be smoothly upgraded to the 3DC solution with HyperReplication.
Based on the preceding features, the HyperMetro solution can be widely used in
industries such as healthcare, finance, and social security.
offline mode. Therefore, archiving helps reduce costs and facilitate media storage. A file
archiving system can also store files based on file attributes. These attributes can be
author, modification date, or other customized tags. An archiving system stores files
together with their metadata and attributes. In addition, an archiving system provides the
data compression function. In conclusion, archiving involves storing backup data that will
no longer be frequently accessed or updated in offline mode for a long term, and
attaching the "archived" tag to these data according to specific attributes for future
search.
Generally, backup refers to data backup or system backup, while DR refers to data
backup or application backup across equipment rooms. Backup is implemented using
backup software, whereas DR is implemented using replication or mirroring software. The
differences between the two are as follows:
1. DR is designed for protecting data against natural disasters, such as fires and
earthquakes. Therefore, a backup center must be set in a place which is away from
the production center at a certain distance. In contrast, data backup is performed
within a data center.
2. A DR system not only protects data but also guarantees business continuity. In
contrast, data backup only focuses on data security.
3. DR protects data integrity. In contrast, backup can only help recover data from a
point in time when a backup task is performed.
4. DR is performed in online mode while backup is performed in offline mode.
5. Data at the two sites of a DR system is kept consistent in real time, whereas backup data lags behind the production data.
6. When a fault occurs, a DR switchover process in a DR system lasts seconds to
minutes, while a backup system takes hours and maybe even dozens of hours to
recover data.
Backup and archiving systems are designed to protect data in different ways and the
combination of the two systems will provide more effective data protection. Backup is
designed to protect data by storing data copies. Archiving is designed to protect data by
organizing and storing data for a long term in a data management manner. In other
words, backup can be considered as short-term retention of data copies, while archiving
can be considered as long-term retention of files. In practice, we do not delete an original
copy after it is backed up. However, it will be fine if we delete an original copy after it is
archived, as we might no longer need to access it swiftly. Backup and archiving work
together to better protect data.
4.1.2 Architecture
4.1.2.1 Components
A backup system typically consists of three components: backup software, backup media,
and backup server.
The backup software is used to create backup policies, manage media, and provide additional functions. It is the core of a backup system and is used for creating
and managing copies of production data stored on storage media. Some backup software
can be upgraded with more functions, such as protection, backup, archiving, and
recovery.
Backup media include tape libraries, disk arrays, and virtual tape libraries. A virtual tape
library is essentially a disk array, but it can virtualize a disk storage into a tape library.
Compared with physical tape libraries, virtual tape libraries remain compatible with tape backup management software and conventional backup processes while greatly improving availability and reliability.
The backup server provides services for executing backup policies. The backup software
resides and runs on the backup server. Generally, a backup software client agent needs to
be installed on the service host to be backed up.
Three elements of a backup system are Backup Window (BW), Recovery Point Objective
(RPO), and Recovery Time Objective (RTO).
BW indicates a duration of time allowed for backing up the service data in a service
system without affecting the normal operation of the service system.
RPO ensures that the latest possible backup data is used for a DR switchover and indicates the maximum amount of data that may be lost. A smaller RPO means less data loss.
RTO refers to an acceptable duration of time and a service level within which a business
process must be restored, in order to minimize the impact of interruption on services.
Cloud backup involves backing up data from a local production center to a data center (a
central data center of an enterprise or a data center provided by a service provider) using
a standard network protocol over a WAN. Cloud backup is based on services, accessible
anywhere, flexible, secure, and can be shared and used on demand. Cloud backup
emerges as a brand new backup service based on broadband Internet and large storage
capacities. In conclusion, cloud backup provides data storage and backup services by
leveraging a variety of functions, such as cluster applications, grid technologies, and
distributed file systems, and integrating a variety of storage devices across the network
through application software.
application server and transmits the data to the backup media. The backup operation is
complete.
Strengths:
Backup data is transmitted without using LAN resources, significantly improving
backup performance while maintaining high network performance.
Weaknesses:
Backup agents adversely affect the performance of application servers.
LAN-Free backup requires a high budget.
Devices must meet certain requirements.
Server-Free
Server-Free backup has many strengths similar to those of LAN-Free backup. The source
device, target device, and SAN device are main components of the backup data channel.
The server is still involved in the backup process, but its workload is much lighter: it no longer serves as the main backup data channel and, like a traffic officer, only issues commands instead of carrying the data itself. Control flows are transmitted over the LAN, but data flows are not.
Direction of backup data flows: Backup data is transmitted over an independent network
without passing through a production server.
Strengths:
Backup data flows do not consume LAN resources and do not affect network
performance.
Services running on hosts remain nearly unaffected.
Backup performance is excellent.
Weaknesses:
Server-Free backup requires a high budget.
Devices must meet strict requirements.
Server-Less
Server-Less backup uses the Network Data Management Protocol (NDMP). NDMP is a
standard network backup protocol. It supports communications between intelligent data
storage devices, tape libraries, and backup applications. After a server sends an NDMP
command to a storage device that supports the NDMP protocol, the storage device can
directly send the data to other devices without passing through a host.
4.1.4.2 Deduplication
Digital transformations of enterprises have intensified the explosive growth of service
data. The total amount of backup data that needs to be protected is also increasing
sharply. In addition, more and more duplicate data is being generated from backup and
archiving operations. Mass redundant data consumes a lot of storage and bandwidth
resources and leads to issues like long backup windows, which further affect the
availability of service systems.
Huawei Data Protection Appliance supports source-side and parallel deduplication.
Deduplication is performed before backup data is transmitted to storage media, greatly
improving backup performance.
Source-Side Deduplication
Data or files are sliced using an intelligent content-based deduplication algorithm. Fingerprints are then created for the data blocks by hashing and are queried against the fingerprint libraries. If identical fingerprints exist, the corresponding blocks are already stored on the media servers, so the existing blocks are referenced instead of being transferred again. This preserves backup capacity and bandwidth and streamlines data transfer and storage.
Technical principles:
1. Creates a fingerprint for a data block by hashing.
2. Queries whether the fingerprint exists in the fingerprint library of the Data Protection
Appliance. If yes, it indicates that this data block is duplicate and does not need to be
sent to the Data Protection Appliance. If no, the data block will be sent to the Data
Protection Appliance and written to the backup storage pool. Then, the fingerprint of
this data block is recorded in the deduplication fingerprint library.
Parallel Deduplication
Most conventional deduplication modes are based on a single node and are prone to
inefficient data access, poor processing performance, and insufficient storage capacity in
the era of big data.
Huawei Data Protection Appliance uses the parallel deduplication technology by building
a deduplication fingerprint library on multiple nodes and distributing fingerprints on
multiple nodes in parallel. This effectively resolves the performance and storage capacity
problems in single-node solutions.
Technical principles:
After fingerprints are calculated for data blocks, the system uses the grouping algorithm
to locate specific server nodes. Different fingerprints are evenly distributed on different
nodes. In this way, the system queries whether these fingerprints exist on different server
nodes, for parallel deduplication.
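A simple way to picture the grouping algorithm is to hash each fingerprint to a node; the modulo scheme and node names below are illustrative assumptions, not the appliance's actual algorithm.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # hypothetical deduplication server nodes

def owner_node(fingerprint: str) -> str:
    """Map a fingerprint to the node whose fingerprint library should hold it."""
    return NODES[int(fingerprint, 16) % len(NODES)]

fp = hashlib.sha256(b"sample data block").hexdigest()
print(fp[:16], "->", owner_node(fp))     # fingerprints spread evenly across the nodes
```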
With fingerprint libraries, recycled data can be stored sequentially in the same space. This reduces the time required to query all fingerprints during each global deduplication, maximizes the effect of the storage read cache, minimizes disk seeks caused by random reads, and improves recovery efficiency.
Figure 4-1
1. Reads data to be protected through the backup client (agent client). Based on
different applications, the agent client can be deployed on the production server
(agent-based backup) or can be the agent client built in Huawei Data Protection
Appliance (agent-free backup).
2. Reads data from a production system to the Data Protection Appliance over the
network (TCP).
3. The Data Protection Appliance receives data and saves it to the backup storage.
For different backup modes, such as full backup, incremental backup, permanent
incremental backup, and differential backup, data is read and transmitted in different
ways. All data is transmitted or only unique data is transmitted with deduplication.
When remote DR is required, remote replication allows replication of backup data to
remote data centers.
Continuous Backup
Continuous backup is a process of continuously backing up data on production hosts to
backup media. Continuous backup is based on the block-level continuous data protection
technology. A backup agent client is installed on production hosts. Data on production
hosts is continuously backed up to the snapshot storage pool of the internal storage
system of the Data Protection Appliance and is stored in the native format. After certain
conditions are met, snapshots are created in the snapshot storage pool to manage data
at multiple points in time.
Figure 4-2
1. The snapshot storage pool allocates the base volume.
2. The agent client for continuous backup connects to the server of the Data Protection
Appliance.
3. The bypass monitoring drive in a partition of the production host continuously
captures data changes and caches the same data changes to the memory pool.
4. The agent client for continuous backup continuously transfers data to a storage
device in the snapshot storage pool of the Data Protection Appliance.
5. Source data in the partition of the production host is written to the base volume.
6. Data changes on the production host are written to the log volume first, and then
are written to the base volume storing the source data.
7. Snapshots of the base volume are managed based on the data retention policy for
continuous backup.
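The base-volume-plus-log idea behind continuous backup can be pictured with the highly simplified Python sketch below; the structures and helper names are illustrative assumptions only.

```python
base_volume = {0: "A", 1: "B", 2: "C"}    # initial copy of the protected partition
log_volume = []                           # continuously captured data changes
snapshots = []                            # point-in-time views of the base volume

def capture_change(block: int, data: str) -> None:
    """The monitoring driver records every change in the log volume first."""
    log_volume.append((block, data))

def apply_log_and_snapshot() -> None:
    """Replay logged changes into the base volume, then retain a snapshot."""
    while log_volume:
        block, data = log_volume.pop(0)
        base_volume[block] = data
    snapshots.append(dict(base_volume))   # retained per the continuous backup policy

capture_change(1, "B1")
apply_log_and_snapshot()
print(snapshots[-1])                      # {0: 'A', 1: 'B1', 2: 'C'}
```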
Advanced Backup
The advanced backup function of Huawei Data Protection Appliance effectively combines
years of experience in backup and DR and the independently developed copy data
storage system to ensure application data consistency. The advanced backup function
helps implement policy-based automation and DR, provide automation tools for
developers, support heterogeneous production storage, and implement real copy data
management.
Working principles:
Capture of production data: Data is captured in the native format. Format conversion is
not required. Data is accessible upon being mounted. SLA policy can be customized based
on applications. Retention duration, RPO, RTO, and data storage locations are intuitively
displayed.
Copy Management
Permanent incremental backup: Initial full backup and N incremental backups are
performed. A full copy is generated at each incremental backup point in time. Damages
to a copy at an incremental backup point in time will not impede recovery from any
other point in time.
No rollback: Point-in-time copies created through virtual cloning can reference both the source data and the current incremental data and can be used directly for recovery.
Copy Access and Use
No data movement: Data is mounted in minutes, and data volume does not affect
recovery efficiency.
A virtual copy can be mounted to multiple hosts.
Data can be recovered from any point in time.
A host automatically takes over the original production applications after the virtual copy
is mounted.
4.1.5 Applications
Databases
Databases are critical service applications in production systems. The native backup
function of databases relies on complicated manual operations. In addition, various
databases on different platforms need protection, which requires a broad compatibility of
backup products.
The Data Protection Appliance provides a graphical wizard. Users do not need to
manually execute backup and restoration scripts, which simplifies backup and recovery
operations. The database backup process is as follows:
Install a backup client agent on the production server to be protected and connect the
client agent to the management console. The backup client agent identifies the database
data on the production server, reads the files and data from the production server
through the backup API, and transfers the same files and data to the storage media of
the Data Protection Appliance to complete the backup. The management console of the
Data Protection Appliance sends control information to the client and the Data
Protection Appliance server and accordingly, manages the execution of a backup task.
The backup process: The backup client agent invokes the backup API of the database to read its data, performs deduplication or encryption, and then sends the data to the Data Protection Appliance to complete the backup.
The recovery process: The management console sends a recovery command to the backup client agent on the production server. The backup client agent reads data from the backup server and passes it to the recovery API of the database to complete the recovery.
The Data Protection Appliance connects to a database through a dedicated backup API, which varies with the database. For example, Oracle uses the RMAN interface and SQL Server uses the VDI interface.
Virtualization Platforms
The popularization of virtualization has increased the confidence of enterprises in storing
their core data in a virtual environment. Therefore, enterprises are in urgent need of data
protection in a virtual environment, in particular, data backup and recovery efficiency is a
major concern.
According to statistics from international authorities, in 2004 the direct financial loss resulting from natural and human-induced disasters reached 123 billion US dollars worldwide.
In 2005, 400 catastrophes occurred worldwide and caused losses of more than 230 billion
US dollars.
In 2006, the financial loss caused directly by natural and human-induced disasters was
lower than expected at 48 billion US dollars.
The occurrence rate of natural disasters that can be measured was three times greater in
the 1990s than the 1960s, while the financial loss was nine times greater.
The huge losses caused by small-probability natural disasters cannot be ignored.
According to IDC, among the companies that experienced disasters in the ten years
before 2000, 55% collapsed when the disasters occurred, 29% collapsed within 2 years
after the disasters due to data loss, and only 16% survived.
High availability (HA) ensures that applications remain accessible when a single component of the local system fails, whether the fault lies in a service process, a physical facility, or IT software or hardware.
The best HA is when a machine in the data center breaks down, but the users using the
data center service are unaware of it. However, if a server in a data center breaks down,
it takes some time for services running on the server to fail over. As a result, customers
will be aware of the failure.
The key indicator of HA is availability. Its calculation formula is Availability = Uptime/(Uptime + Downtime) = 1 - Downtime/(Uptime + Downtime). Availability is usually expressed in "nines", each level corresponding to a maximum allowed downtime per year:
4 nines: 99.99% availability = 0.01% x 365 x 24 x 60 minutes ≈ 52.56 minutes of downtime per year
5 nines: 99.999% availability = 0.001% x 365 x 24 x 60 minutes ≈ 5.26 minutes of downtime per year
6 nines: 99.9999% availability = 0.0001% x 365 x 24 x 60 x 60 seconds ≈ 31.5 seconds of downtime per year
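The downtime figures above follow directly from the availability formula; the short Python sketch below recomputes them for any number of nines.

```python
def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year (in minutes) for a given availability level."""
    minutes_per_year = 365 * 24 * 60
    return (1 - availability) * minutes_per_year

for label, a in [("4 nines", 0.9999), ("5 nines", 0.99999), ("6 nines", 0.999999)]:
    print(f"{label}: {downtime_minutes_per_year(a):.2f} minutes/year")
# 4 nines: 52.56, 5 nines: 5.26, 6 nines: 0.53 minutes (about 31.5 seconds)
```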
For HA, shared storage is usually used. In this case, RPO = 0. In addition, the active/active
HA mode is used to ensure that the RTO is almost 0. If the active/passive HA mode is
used, the RTO needs to be reduced to the minimum.
HA requires redundant servers to form a cluster to run applications and services. HA can
be categorized into the following types:
Active/Passive HA:
A cluster consists of only two nodes (active and standby nodes). In this configuration, the
system uses the active and standby machines to provide services. The system provides
services only on the active device.
When the active device is faulty, the services on the standby device are started to replace
the services provided by the active device.
Typically, cluster resource manager (CRM) software such as Pacemaker is used to control the switchover between the active and standby devices and to provide a virtual IP address for services.
Active/Active HA:
If a cluster consists of only two active nodes, it is called active-active. If the cluster has
multiple nodes, it is called multi-active.
In this configuration, the system runs the same load on all servers in the cluster.
Take the database as an example. The update of an instance will be synchronized to all
instances.
In this configuration, load balancing software, such as HAProxy, is used to provide virtual
IP addresses for services.
Pacemaker is a cluster manager. It uses the messaging and membership capabilities provided by the underlying cluster infrastructure (OpenAIS or Heartbeat) to detect node and resource failures and recover from them, achieving high availability of cluster services (also called resources).
HAProxy is free, open-source software written in C that provides high availability, load balancing, and proxying for TCP- and HTTP-based applications. HAProxy is especially suitable for heavily loaded websites that require session persistence.
A disaster is an unexpected event (caused by human errors or natural factors) that
results in severe faults or breakdown of the system in one data center. In this case,
services may be interrupted or become unacceptable. If the system unavailability reaches
a certain level at a specific time, the system must be switched to the standby site.
Disaster recovery (DR) refers to the capability of recovering data, applications, and
services in data centers at different locations when the production center is damaged by
a disaster.
In addition to the production site, a redundancy site is set up. When a disaster occurs and
the production site is damaged, the redundancy site can take over services from the
production site to ensure service continuity. To achieve higher availability, many users
even set up multiple redundant sites.
Main indicators for measuring a DR system
Recovery Point Objective (RPO) indicates the maximum amount of data that can be lost
when a disaster occurs.
Recovery Time Objective (RTO) indicates the time required for system recovery.
The smaller the RPO and RTO, the higher the system availability, and the larger the
investment for users.
The data-level disaster recovery focuses on protecting the data from loss or damage after
a disaster occurs. Low-level data-level disaster recovery can be implemented by manually
saving backup data to a remote place. For example, periodically transporting backup
tapes to a remote place is one of the methods. The advanced data disaster recovery
solution uses the network-based data replication tool to implement asynchronous or
synchronous data transmission between the production center and the disaster recovery
center. For example, the data replication function based on disk arrays is used.
Application-level DR creates hosts and applications in the DR site based on the data-level
DR. The support system consists of the data backup system, standby data processing
system, and standby network system. Application-level DR provides the application
takeover capability. That is, when the production center is faulty, applications can be
taken over by the DR center to minimize the system downtime and improve service
continuity.
SHARE, an IT information organization initiated by IBM in 1955, released the disaster
recovery standard SHARE 78 at the 78th conference in 1992. SHARE 78 has been widely
recognized in the world.
SHARE 78 divides disaster recovery into eight levels:
Backup or recovery scope
Status of a disaster recovery plan
Distance between the application location and the backup location
Connection between the application location and backup location
Transmission between the two locations
Data allowed to be lost
Backup data update
Ability of a backup location to start a backup job
The definition of remote disaster recovery is classified into seven levels:
Backup and recovery of local data
Access mode of batch storage and read
Access mode of batch storage and read + hot backup location
Network connection
Backup location of the working status
Dual online storage
Zero data loss
In addition, ISO 27001 released by International Organization for Standardization (ISO)
requires that related data and files be stored for at least one to five years.
consulting services for customers' service systems to ensure service continuity and data
protection.
Local HA solution: ensures high availability of key services in the data center and
prevents service interruption and data loss caused by single-component faults.
Active-passive DR solution: intra-city and remote DR are supported. When a disaster
occurs, services in the DR center can be quickly recovered and provide services for
external systems.
Active-active data center solution: In intra-city DR, the load of a critical service is balanced between two data centers, ensuring zero service interruption and zero data loss when a data center malfunctions.
Geo-redundant DR solution: defends against data center-level disasters and regional
disasters and provides higher service continuity for mission-critical services. Generally, the
intra-city active/standby + remote active/standby solution or intra-city active-active +
remote active/standby solution is used.
Currently, data centers can work in either active-passive mode or active-active mode.
In active-passive mode, some services are processed mainly in data center A with hot standby in data center B, and other services are processed mainly in data center B with hot standby in data center A, achieving an approximately active-active effect.
In active-active mode, all I/O paths can access active-active LUNs to achieve load
balancing and seamless failover.
Huawei's active-active data center solution adopts an active-active architecture and combines the industry-leading HyperMetro functions with web, database cluster, load balancing, transmission, and network components to provide customers with an end-to-end active-active data center solution within 100 km. Even if a device or an entire data center fails, services are not affected and are switched over automatically, ensuring service continuity.
In addition, professional and diversified tools are used to quickly collect and analyze
project information, design and implement solutions, and customize and deliver the most
appropriate professional service solutions for customers.
The primary site writes the data involved in the write request into time segment TPN+1
in the cache of LUN A and immediately returns the write complete response to the host.
During data synchronization, the system reads data generated in the previous replication
period in time segment TPN in the cache of LUN A, transmits the data to the standby
site, and writes the data into time segment TPX+1 in the cache of LUN B. If the usage of
LUN A's cache reaches a certain threshold, the system automatically writes data into
disks. In this case, a snapshot is generated on disks for the data in time segment TPN.
During data synchronization, the system reads the data from the snapshot on disks and
replicates the data to LUN B.
After the data synchronization is complete, the system writes data in time segments TPN
and TPX+1 separately in the caches of LUN A and LUN B into disks based on the disk
flushing policy (snapshots are automatically deleted), and waits for the next replication
period.
Switchover:
A primary/secondary switchover can be performed for a synchronous remote replication
pair when the pair is in the normal state.
In the split state, a primary/secondary switchover can be performed only after the
secondary LUN is set to writable.
For an asynchronous remote replication pair, a primary/secondary switchover requires the pair to be in the split state and the secondary LUN to be set to writable.
NAS Asynchronous Replication Principle
At the beginning of each period, the file system asynchronous remote replication creates
a snapshot for the primary file system. Based on the incremental information generated
from the time when the replication in the previous period is complete to the time when
the current period starts, the file system asynchronous remote replication reads the
snapshot data and replicates the data to the secondary file system. After the incremental
replication is complete, the data in the secondary file system is the same as that in the
primary file system, data consistency points are formed on the secondary file system.
Remote replication between file systems is supported. Directory-to-directory or file-to-file
replication mode is not supported.
A file system can be included in only one replication task, but a replication task can
contain multiple file systems.
File systems support only one-to-one replication. A file system cannot serve as the
replication source and destination at the same time. Cascading replication and 3DC are
not supported.
The minimum unit of incremental replication is the file system block size (4 KB to 64 KB).
The minimum synchronization period of asynchronous replication is 5 minutes.
Resumable transfer is supported after an interruption.
The customer has a vSphere virtual data center and wants to build a new data center for
DR.
Low TCO and high return on investment (ROI)
Huawei's Solution
An IT system, including storage devices, services, networks, and virtualization platforms, is
deployed in the DR center.
Install Huawei UltraVR DR component in the production center and DR center.
ConsistentAgent is installed on host machines of VMs to implement application-level
protection for VMs.
Customer Benefits
No need for reconstruction of the live network architecture
Flexible configuration of DR policies and one-click recovery
DR rehearsal and switchback
Traditional IT only plays a supporting role, and now IT is a type of service. To achieve the
goals of reducing costs, increasing productivity, and improving service quality, ITIL has set
off a frenzy around the world. Many famous multinational companies, such as IBM, HP,
Microsoft, P&G, and HSBC are active practitioners of ITIL. As the industry is gradually
changing from technology-oriented to service-oriented, enterprises' requirements for IT
service management are also increasing, which greatly helps standardize IT processes,
keep IT processes aligned with the business, and improve processing efficiency.
ITIL has strong support from the UK, other countries in Europe, North America, New Zealand, and Australia. Whether an enterprise has adopted ITIL is regarded as a key indicator for determining whether a supplier or outsourcing service contractor is qualified for bidding.
Information Collection
The information to be collected includes basic information, fault information, storage
device information, networking information, and application server information.
Customer information: provides the contact person and contact details.
Time when a fault occurs: records the time when the fault occurred.
Hardware module configuration: records the configuration information about the hardware of the storage device.