HCIA-Storage
Learning Guide
V4.5
and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their
respective holders.
Notice
The purchased products, services and features are stipulated by the contract made between
Huawei and the customer. All or part of the products, services and features described in this
document may not be within the purchase scope or the usage scope. Unless otherwise specified
in the contract, all statements, information, and recommendations in this document are provided
"AS IS" without warranties, guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made
in the preparation of this document to ensure accuracy of the contents, but all statements,
information, and recommendations in this document do not constitute a warranty of any kind,
express or implied.
Traditional storage is composed of disks. In 1956, IBM invented the world's first hard disk drive,
which used 50 x 24-inch platters with a capacity of only 5 MB. It was as big as two refrigerators and
weighed more than a ton. It was used in the industrial field at that time and was independent of the
mainframe.
External storage is also called direct attached storage. Its earliest form is Just a Bunch of Disks
(JBOD), which simply combines disks and presents them to hosts as a set of independent disks. It
only increases capacity and cannot ensure data security.
The disks deployed in servers have the following disadvantages: limited slots and insufficient
capacity; poor reliability, as data is stored on independent disks; disks becoming the system
performance bottleneck; low storage space utilization; and scattered data, as data is stored on
different servers.
JBOD solves the problem of limited slots to a certain extent, and RAID technology improves
reliability and performance. External storage gradually developed into storage arrays with controllers.
The controllers contain cache and support the RAID function. In addition, dedicated
management software can be configured. A storage array is presented to hosts as a single large,
high-performance, redundant disk.
DAS has the characteristics of scattered data and low storage space utilization.
As the amount of data in society increases explosively, data storage must provide flexible data
sharing, high resource utilization, and extended transmission distances. The emergence of networks
infuses new vitality into storage.
SAN: establishes a network between storage devices and servers to provide block storage services.
NAS: builds networks between servers and storage devices with file systems to provide file storage
services.
Since 2011, unified storage that supports both SAN and NAS protocols has been a popular choice.
Storage convergence sets a new trend: converged NAS and SAN. This convergence provides both
database and file sharing services, simplifying storage management and improving storage utilization.
SAN is a typical storage network. It first emerged as FC SAN, which uses a Fibre Channel network to
transmit data, and later added support for IP SAN.
Distributed storage uses general-purpose server hardware to build storage resource pools and is
applicable to cloud computing scenarios. Physical resources are organized using software to form a
high-performance logical storage pool, ensuring reliability and providing multiple storage services.
Generally, distributed storage scatters data to multiple independent storage servers in a scalable
system structure. It uses those storage servers to share storage loads and location servers to locate
storage information. Distributed storage architecture has the following characteristics: universal
hardware, unified architecture, and storage-network decoupling; linear expansion of performance
and capacity, up to thousands of nodes; elastic resource scaling and high resource utilization.
Storage virtualization consolidates the storage devices into logical resources, thereby providing
comprehensive and unified storage services. Unified functions are provided regardless of different
storage forms and device types.
The cloud storage system combines multiple storage devices, applications, and services. It uses
highly virtualized multi-tenant infrastructure to provide scalable storage resources for enterprises.
Those storage resources can be dynamically configured based on organization requirements.
Cloud storage is a concept derived from cloud computing, and is a new network storage technology.
Based on functions such as cluster applications, network technologies, and distributed file systems, a
cloud storage system uses application software to enable various types of storage devices on
networks to work together, providing data storage and service access externally. When a cloud
computing system stores and manages a huge amount of data, the system requires a matched
number of storage devices. In this way, the cloud computing system turns into a cloud storage
system. Therefore, we can regard a cloud storage system as a cloud computing system with data
storage and management as its core. In a word, cloud storage is an emerging solution that
consolidates storage resources on the cloud for people to access. Users can access data on the cloud
anytime, anywhere, through any networked device.
1.1.3.2 Storage Media
History of HDDs:
⚫ From 1970 to 1991, the storage density of disk platters increased by 25% to 30% annually.
⚫ Starting from 1991, the annual increase rate of storage density surged to 60% to 80%.
⚫ Since 1997, the annual increase rate rocketed up to 100% and even 200%, thanks to IBM's giant
magnetoresistive (GMR) head technology, which further improved disk head sensitivity and
storage density.
⚫ IBM 1301: used air-bearing heads to eliminate friction and its capacity reached 28 MB.
⚫ IBM 3340: was a pre-installed box unit with a capacity of 30 MB. It was also called "Winchester"
disk drive, named after the Winchester 30-30 rifle because it was planned to run on two 30 MB
spindles.
⚫ In 1992, 1.8-inch HDDs were invented.
History of SSDs:
⚫ Invented by Dawon Kahng and Simon Min Sze in 1967, the floating gate transistor has become
the basis of NAND flash technology. If you are familiar with MOSFETs, you'll find that this
transistor is similar to a MOSFET except for an additional floating gate in the middle, which is
how it got its name. The floating gate is wrapped in high-impedance materials and insulated
above and below, so it preserves the charges that enter it through the quantum tunneling effect.
⚫ In 1976, Dataram sold SSDs called Bulk Core. The SSD had the capacity of 2 MB (which was very
large at that time), and used eight large circuit boards, each board with eighteen 256 KB RAMs.
⚫ At the end of the 1990s, some vendors began to use the flash medium to manufacture SSDs. In
1997, Altec Computer Systeme launched a parallel SCSI flash SSD. In 1999, BiTMICRO released
an 18-GB flash SSD. Since then, flash SSD has gradually replaced RAM SSD and become the
mainstream product of the SSD market. The flash memory can store data even in the event of
power failure, which is similar to the HDD.
⚫ In May 2005, Samsung Electronics announced its entry into the SSD market, the first IT giant
entering this market. It is also the first SSD vendor that is widely recognized today.
⚫ In 2006, NextCom began to use SSDs in its laptops. Samsung launched a 32 GB SSD. According
to Samsung, the SSD market was worth 1.3 billion USD in 2007 and reached 4.5 billion USD in
2010. In September, Samsung launched the PRAM SSD, another SSD technology that used PRAM
as the storage medium, hoping to replace NOR flash memory. In November, Microsoft's
Windows Vista was released as the first PC operating system to support SSD-specific features.
⚫ In 2009, the capacity of SSDs caught up with that of HDDs. PureSilicon's 2.5-inch SSD provided 1
TB of capacity and consisted of 128 pieces of 64 Gbit MLC NAND memory. Finally, an SSD provided the
same capacity as an HDD of the same size. This is very important. HDD vendors once believed that
the HDD capacity could be easily increased by increasing the disk density with low costs.
However, the SSD capacity could be doubled only when the internal chips were doubled, which
was difficult. However, the MLC SSD proves that it is possible to double the capacity by storing
more bits in one cell. In addition, the SSD performance is much higher than that of HDD. The
SSD has the read bandwidth of 240 MB/s, write bandwidth of 215 MB/s, read latency less than
100 microseconds, 50,000 read IOPS, and 10,000 write IOPS. HDD vendors are facing a huge
threat.
SSD flash chips have evolved from SLC, which stores one bit per cell, to MLC with two bits, TLC with
three bits, and now QLC with four bits per cell.
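To make the progression concrete, the short sketch below (an illustration, not from the guide) shows how each additional bit per cell doubles the number of charge states a cell must distinguish, which is why endurance and write speed tend to drop from SLC to QLC.

```python
# Minimal sketch: each additional bit per cell doubles the number of
# distinct charge levels a NAND cell must hold and distinguish.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}  # bits per cell

for name, bits in CELL_TYPES.items():
    states = 2 ** bits  # distinguishable voltage states per cell
    print(f"{name}: {bits} bit(s)/cell -> {states} voltage states")
```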
1.1.3.3 Interface Protocols
Interface protocols refer to the communication modes and requirements that interfaces for
exchanging information must comply with.
Interfaces are used to transfer data between disk cache and host memory. Different disk interfaces
determine the connection speed between disks and controllers.
During the development of storage protocols, the data transmission rate is increasing. As storage
media evolves from HDDs to SSDs, the protocol develops from SCSI to NVMe, including the PCIe-
based NVMe protocol and NVMe over Fabrics (NVMe-oF) protocol to connect host networks.
NVMe-oF uses ultra-low-latency transmission protocols such as remote direct memory access
(RDMA) to remotely access SSDs, resolving the trade-off between performance, functionality, and
capacity during scale-out of next-generation data centers.
Released in 2016, the NVMe-oF specification supported both Fibre Channel and RDMA transports.
The RDMA-based transports include InfiniBand, RDMA over Converged Ethernet (RoCE), and the
Internet Wide Area RDMA Protocol (iWARP).
In November 2018, TCP was added as a transport option (NVMe/TCP), which was incorporated into
the NVMe-oF 1.1 specification.
NVMe is an SSD controller interface standard. It is designed for PCIe interface-based SSDs and aims
to maximize flash memory performance. It can provide intensive computing capabilities for
enterprise-class workloads in data-intensive industries, such as life sciences, financial services,
multimedia, and entertainment.
NVMe SSDs are commonly used in databases. Featuring high speed and low latency, NVMe can be
used for file systems and all-flash storage arrays to achieve excellent read/write performance. The
all-flash storage system using NVMe SSDs provides efficient storage, network switching, and
metadata communication.
The entire human society is rapidly evolving into an intelligent society. During this process, the data
volume is growing explosively. The average mobile data usage per user per day is over 1 GB. During
the training of autonomous vehicles, each vehicle generates 64 TB data every day. According to
Huawei's Global Industry Vision 2025, the amount of global data will increase from 33 ZB in 2018 to
180 ZB in 2025. Data is becoming a core business asset of enterprises and even countries. The smart
government, smart finance, and smart factory built based on effective data utilization greatly
improve the efficiency of the entire society. More and more enterprises have realized that data
infrastructure is the key to intelligent success, and storage is the core foundation of data
infrastructure. In the past, we classified storage systems based on new technology hotspots,
technical architecture, and storage media. As the economy and society transform from digitalization
to intelligence, we tend to call this new type of storage the storage of the intelligence era.
It has several trends:
First, intelligence, classified by Huawei as Storage for AI and AI in Storage. Storage for AI indicates
that in the future, storage will better support enterprises in AI training and applications. AI in
Storage means that storage systems use AI technologies and integrate AI into storage lifecycle
management to provide outstanding storage management, performance, efficiency, and stability.
Second, storage arrays will transform, for example, towards all-flash storage arrays. In the future,
more and more applications will require low latency, high reliability, and low TCO, and all-flash
storage arrays will be a good choice. Although new storage media will emerge to compete, all-flash
storage will be the mainstream storage medium in the future. Today, all-flash storage is still not
the mainstream in the storage market.
The third trend is distributed storage. In the 5G intelligent era, high-performance application
scenarios such as AI, HPC, and autonomous driving and the generated massive amount of data
require distributed storage devices. With dedicated hardware, they can provide efficient, cost-
effective, and EB-level large-capacity storage. Distributed storage is facing the challenges of
intensification and large-scale expansion, as well as the possible changes of chips and algorithms in
the future. Scientists are attempting to use chip, algorithm, and bus technologies to break the barriers of
the von Neumann architecture, to give the underlying data infrastructure more computing power
and efficient, low-cost storage media, and to narrow the gap between storage and computing.
These challenges call for dedicated storage hardware. Concepts similar to Memory Fabric are also
bringing changes to the storage architecture.
The last trend is convergence. In the future, storage will be integrated with the data infrastructure to
support heterogeneous chip computing, streamline diversified protocols, and collaborate with data
processing and big data analytics to reduce data processing costs and improve efficiency. For
example, compared with the storage provided by general-purpose servers, the integration of data
and storage will lower the TCO because data processing is offloaded from servers to storage. Object,
big data, and other protocols are converged and interoperate to implement migration-free big data.
Such convergence greatly affects the design of storage systems and is the key to improving storage
efficiency.
1.1.4.3 Data Storage Trend
In the intelligence era, we must focus on innovation in hardware, protocols, and technologies. From
IBM mainframes to x86 and then to virtualization, all-flash storage media and all-IP network
protocols have become a major trend.
In the intelligence era, Huawei Cache Coherence System (HCCS) and Compute Express Link (CXL) are
designed based on ultra-fast new interconnection protocols, helping to implement high-speed
interconnection between heterogeneous processors of CPUs and neural processing units (NPUs).
RoCE and NVMe support high-speed data transmission and containerization technologies. In
addition, new hardware and technologies provide abundant choices for data storage. The Memory
Fabric architecture implements memory resource pooling with all-flash + storage class memory
(SCM) and provides microsecond-level data processing performance. SCM media include Optane,
MRAM, ReRAM, FRAM, and Fast NAND. In terms of reliability, system reconstruction and data
migration are involved. As the chip-level design of all-flash storage advances, upper-layer
applications will be unaware of the underlying storage hardware.
Currently, the access performance of SSDs has been improved by 100-fold compared with that of
HDDs. For NVMe SSDs, the access performance is 10,000 times higher than that of HDDs. While the
latency of storage media has been greatly reduced, the ratio of network latency to the total latency
has rocketed from less than 5% to about 65%. That is to say, in more than half of the time, storage
media is idle, waiting for the network communication. How to reduce network latency is the key to
improving input/output operations per second (IOPS).
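As a rough illustration of this shift, the sketch below uses assumed latency figures (about 5 ms of media latency for an HDD, 0.1 ms for an NVMe SSD, and a fixed 0.2 ms network leg; none of these numbers come from this guide) to show how the network's share of end-to-end latency grows once media latency collapses, and how total latency bounds the IOPS achievable per outstanding request.

```python
# Illustrative sketch with assumed latencies (not figures from this guide):
# once media latency collapses, a fixed network latency dominates the total.
def latency_share(media_ms: float, network_ms: float) -> tuple[float, float]:
    total = media_ms + network_ms
    return total, network_ms / total * 100  # total latency, network share (%)

for label, media in [("HDD", 5.0), ("NVMe SSD", 0.1)]:
    total, share = latency_share(media, network_ms=0.2)
    iops_per_request = 1000 / total  # IOPS for one outstanding request
    print(f"{label}: total {total:.2f} ms, network share {share:.0f}%, "
          f"~{iops_per_request:.0f} IOPS per outstanding request")
```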
Kalff, F., Rebergen, M., Fahrenfort, E. et al. A kilobyte rewritable atomic memory. Nature Nanotech
11, 926–929 (2016). https://doi.org/10.1038/nnano.2016.131
Because an atom is so small, the capacity of atomic storage will be much larger than that of the
existing storage medium in the same size. With the development of science and technology in recent
years, Feynman's idea has become a reality. To pay tribute to Feynman's great idea, some research
teams wrote his lecture into atomic memory. Although the idea of atomic storage is incredible and
its implementation is becoming possible, atomic memory has strict requirements on the operating
environment. Atoms are always moving, and even the atoms inside solids vibrate at ambient
temperature, so it is difficult to keep them in an ordered state under normal conditions. Atomic
storage can therefore only be used at low temperatures, in liquid nitrogen, or in a vacuum.
If both DNA storage and atomic storage are intended to reduce the size of storage and increase the
capacity of storage, quantum storage is designed to improve performance and running speed.
After years of research, both the storage efficiency and the lifespan of quantum memory have
improved, but it is still difficult to put quantum memory into practice. Quantum memory suffers
from low efficiency, high noise, a short lifespan, and difficulty operating at room temperature. Only
by solving these problems can quantum memory be brought to market.
Elements in a quantum state are easily lost due to the influence of the external environment. In
addition, it is difficult to ensure 100% accuracy when preparing quantum states and performing
quantum operations.
References:
Wang, Y., Li, J., Zhang, S. et al. Efficient quantum memory for single-photon polarization qubits. Nat.
Photonics 13, 346–351 (2019). https://doi.org/10.1038/s41566-019-0368-8
Dou Jian-Peng, Li Hang, Pang Xiao-Ling, Zhang Chao-Ni, Yang Tian-Huai, Jin Xian-Min. Research
progress of quantum memory. Acta Physica Sinica, 2019, 68(3): 030307. doi:
10.7498/aps.68.20190039
⚫ Coffer disks store user data, system configurations, logs, and dirty data in the cache to protect
against unexpected power outages.
➢ Built-in coffer disk: Each controller of Huawei OceanStor Dorado V6 has one or two built-in
SSDs as coffer disks. See the product documentation for more details.
➢ External coffer disk: The storage system automatically selects four disks as coffer disks.
Each coffer disk provides 2 GB space to form a RAID 1 group. The remaining space can
store service data. If a coffer disk is faulty, the system automatically replaces the faulty
coffer disk with a normal disk for redundancy.
⚫ Power module: The controller enclosure employs an AC power module for its normal
operations.
➢ A 4 U controller enclosure has four power modules (PSU 0, PSU 1, PSU 2, and PSU 3). PSU 0
and PSU 1 form a power plane to power controllers A and C and provide mutual
redundancy. PSU 2 and PSU 3 form the other power plane to power controllers B and D
and provide mutual redundancy. It is recommended that you connect PSU 0 and PSU 2 to
one PDU and PSU 1 and PSU 3 to another PDU for maximum reliability.
➢ A 2 U controller enclosure has two power modules (PSU 0 and PSU 1) to power controllers
A and B. The two power modules form a power plane and provide mutual redundancy.
Connect PSU 0 and PSU 1 to different PDUs for maximum reliability.
2.1.4 HDD
2.1.4.1 HDD Structure
⚫ A platter is coated with magnetic materials on both surfaces with polarized magnetic grains to
represent a binary information unit, or bit.
⚫ A read/write head reads and writes data for platters. It changes the polarities of magnetic grains
on the platter surface to save data.
⚫ The actuator arm moves the read/write head to the specified position.
⚫ The spindle has a motor and bearing underneath. It rotates the specified position on the platter
to the read/write head.
⚫ The control circuit controls the speed of the platter and movement of the actuator arm, and
delivers commands to the head.
2.1.4.2 HDD Design
Each disk platter has two read/write heads to read and write data on the two surfaces of the platter.
Airflow prevents the head from touching the platter, so the head can move between tracks at a high
speed. A long distance between the head and the platter results in weak signals, and a short distance
may cause the head to rub against the platter surface. The platter surface must therefore be smooth
and flat. Any foreign matter or dust will shorten the distance and cause the head to rub against the
magnetic surface. This will result in permanent data corruption.
Working principles:
⚫ The read/write head starts in the landing zone near the platter spindle.
⚫ The spindle connects to all of the platters and a motor. The spindle motor rotates at a constant
speed to drive the platters.
⚫ When the spindle rotates, there is a small gap between the head and the platter. This is called
the flying height of the head.
⚫ The head is attached to the end of the actuator arm, which drives the head to the specified
position above the platter where data needs to be written or read.
⚫ The head reads and writes data in binary format on the platter surface. The read data is stored
in the disk's cache and then transmitted to the program.
⚫ In long-distance transmission, the time for data on each line to reach the peer end varies due to
wire resistance or other factors. The next transmission can be initiated only after data on all
lines has reached the peer end.
⚫ High transmission frequency causes serious circuit oscillation and generates interference
between the lines. The frequency of parallel transmission must therefore be carefully set.
Serial transmission:
⚫ Serial transmission is less efficient than parallel transmission, but is generally faster with
potential increases in transmission speed from increasing the transmission frequency.
⚫ Serial transmission is used for long-distance transmission. The PCIe interface is a typical
example of serial transmission. The transmission rate of a single lane is up to 2.5 Gbit/s.
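Assuming the 2.5 Gbit/s figure refers to a first-generation PCIe lane with 8-bit/10-bit encoding (an assumption, not stated above), a quick calculation sketches the usable payload bandwidth per lane.

```python
# Sketch: effective per-lane payload rate for a 2.5 Gbit/s serial link
# that uses 8b/10b encoding (as PCIe 1.0 and early Fibre Channel do).
line_rate_gbps = 2.5            # raw signalling rate per lane (Gbit/s)
encoding_efficiency = 8 / 10    # 8b/10b: 8 data bits carried per 10 line bits

payload_gbps = line_rate_gbps * encoding_efficiency   # 2.0 Gbit/s
payload_mbytes = payload_gbps * 1000 / 8              # ~250 MB/s
print(f"~{payload_gbps} Gbit/s payload, ~{payload_mbytes:.0f} MB/s per lane")
```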
2.1.4.10 Disk Ports
Disks are classified into IDE, SCSI, SATA, SAS, and Fibre Channel disks by port. These disks also differ
in their mechanical bases.
IDE and SATA disks use the ATA mechanical base and are suitable for single-task processing.
SCSI, SAS, and Fibre Channel disks use the SCSI mechanical base and are suitable for multi-task
processing.
Comparison:
⚫ SCSI disks provide faster processing than ATA disks under high data throughput.
⚫ ATA disks overheat during multi-task processing due to the frequent movement of the
read/write head.
⚫ SCSI disks provide higher reliability than ATA disks.
IDE disk port:
⚫ Multiple ATA versions have been released, including ATA-1 (IDE), ATA-2 (Enhanced IDE/Fast ATA),
ATA-3 (Fast ATA-2), ATA-4 (ATA33), ATA-5 (ATA66), ATA-6 (ATA100), and ATA-7 (ATA133).
⚫ ATA ports have several advantages and disadvantages:
➢ Their strengths are their low price and good compatibility.
➢ Their disadvantages are their low speed, limited applications, and strict restrictions on
cable length.
➢ The transmission rate of the PATA port is also inadequate for current user needs.
SATA port:
⚫ During data transmission, the data and signal lines are separated and use independent
transmission clock frequency. The transmission rate of SATA is 30 times that of PATA.
⚫ Advantages:
➢ A SATA port generally has 7+15 pins, uses a single channel, and transmits data faster than
ATA.
➢ SATA uses the cyclic redundancy check (CRC) for instructions and data packets to ensure
data transmission reliability.
➢ SATA surpasses ATA in interference protection.
SCSI port:
⚫ SCSI disks were developed to replace IDE disks to provide higher rotation speed and
transmission rate. SCSI was originally a bus-type interface and worked independently of the
system bus.
⚫ Advantages:
➢ It is applicable to a wide range of devices. One SCSI controller card can connect to 15
devices simultaneously.
➢ It provides high performance with multi-task processing, low CPU usage, fast rotation
speed, and a high transmission rate.
➢ SCSI disks support diverse applications as external or built-in components with hot-
swappable replacement.
⚫ Disadvantages:
➢ High cost and complex installation and configuration.
SAS port:
⚫ SAS is similar to SATA in its use of a serial architecture for a high transmission rate and
streamlined internal space with shorter internal connections.
⚫ SAS improves the efficiency, availability, and scalability of the storage system. It is backward
compatible with SATA for the physical and protocol layers.
⚫ Advantages:
➢ SAS is superior to SCSI in its transmission rate, anti-interference, and longer connection
distances.
⚫ Disadvantages:
➢ SAS disks are more expensive.
Fibre Channel port:
⚫ Fibre Channel was originally designed for network transmission rather than disk ports. It has
gradually been applied to disk systems in pursuit of higher speed.
⚫ Advantages:
➢ Easy to upgrade. Supports optical fiber cables with a length over 10 km.
➢ Large bandwidth
➢ Strong universality
⚫ Disadvantages:
➢ High cost
➢ Complex to build
2.1.5 SSD
2.1.5.1 SSD Overview
Traditional disks use magnetic materials to store data, but SSDs use NAND flash with cells as storage
units. NAND flash is a non-volatile random access storage medium that can retain stored data after
the power is turned off. It quickly and compactly stores digital information.
SSDs eliminate high-speed rotational components for higher performance, lower power
consumption, and zero noise.
SSDs do not have mechanical parts, but this does not mean that they have an infinite life cycle.
Because NAND flash is a non-volatile medium, original data must be erased before new data can be
written. However, there is a limit to how many times each cell can be erased. Once the limit is
reached, data reads and writes become invalid on that cell.
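A commonly used back-of-the-envelope estimate of SSD endurance follows from this erase limit: total bytes that can be written ≈ capacity × rated program/erase cycles ÷ write amplification. The sketch below uses assumed values (a 1 TB drive, 3,000 P/E cycles, a write amplification factor of 2) purely for illustration; they are not figures from this guide.

```python
# Rough endurance estimate (a common rule of thumb, not a vendor formula):
# TBW ≈ capacity * rated P/E cycles / write amplification factor (WAF).
def estimate_tbw(capacity_tb: float, pe_cycles: int, waf: float) -> float:
    return capacity_tb * pe_cycles / waf

# Assumed example: 1 TB TLC drive, 3,000 P/E cycles, WAF of 2.
print(f"Estimated endurance: {estimate_tbw(1.0, 3000, 2.0):.0f} TBW")
```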
⚫ Every 768 pages form a block. Every 1478 blocks form a plane.
⚫ A flash chip consists of two planes, with one storing blocks with odd sequence numbers and the
other storing even sequence numbers. The two planes can be operated concurrently.
ECC must be performed on the data stored in NAND flash, so a NAND flash page is not exactly 16 KB;
it carries an extra group of bytes. For example, a nominal 16 KB page actually contains 16,384 +
1,952 bytes: the 16,384 bytes store data, and the 1,952 bytes store the ECC check codes for that
data.
2.1.5.6 Address Mapping Management
The logical block address (LBA) may refer to an address of a data block or the data block that the
address indicates.
PBA: physical block address
The host accesses the SSD through the LBA. Each LBA generally represents a sector of 512 bytes. The
host OS accesses the SSD in units of 4 KB. The basic unit for the host to access the SSD is called host
page.
The flash page of an SSD is the basic unit for the SSD controller to access the flash chip, which is also
called the physical page. Each time the host writes a host page, the SSD controller writes it to a
physical page and records their mapping relationship.
When the host reads a host page, the SSD finds the requested data according to the mapping
relationship.
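A minimal sketch of this mapping idea follows. It is an illustration only (a real flash translation layer also manages blocks, wear leveling, and garbage collection): the controller keeps a table from host pages to flash pages, and every rewrite allocates a new flash page and marks the old one invalid, exactly as described in the write process below.

```python
# Minimal sketch of the flash translation layer (FTL) mapping idea:
# host pages are remapped to new flash pages on every write; the old
# flash page is not overwritten, it simply becomes invalid (garbage).
class SimpleFTL:
    def __init__(self) -> None:
        self.mapping: dict[int, int] = {}   # host page -> flash page
        self.invalid_pages: set[int] = set()
        self.next_free_page = 0

    def write(self, host_page: int) -> int:
        old = self.mapping.get(host_page)
        if old is not None:
            self.invalid_pages.add(old)     # old data becomes garbage
        new_page = self.next_free_page      # always write to a fresh page
        self.next_free_page += 1
        self.mapping[host_page] = new_page
        return new_page

    def read(self, host_page: int) -> int:
        return self.mapping[host_page]      # look up the physical page

ftl = SimpleFTL()
ftl.write(7)                                # host page A -> flash page 0
ftl.write(7)                                # rewrite: A -> page 1, page 0 is garbage
print(ftl.read(7), ftl.invalid_pages)       # 1 {0}
```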
2.1.5.7 SSD Read and Write Process
SSD write process:
⚫ The SSD controller connects to eight flash dies through eight channels. For better explanation,
the figure shows only one block in each die. Each 4 KB square in the blocks represents a page.
➢ The host writes 4 kilobytes to the block of channel 0 to occupy one page.
➢ The host continues to write 16 kilobytes. This example shows 4 kilobytes being written to
each block of channels 1 through 4.
➢ The host continues to write data to the blocks until all blocks are full.
⚫ When the blocks on all channels are full, the SSD controller selects a new block to write data in
the same way.
⚫ Green indicates valid data and red indicates invalid data. Unnecessary data in the blocks
becomes aged or invalid, and its mapping relationship is replaced.
⚫ For example, host page A was originally stored in flash page X, and the mapping relationship
was A to X. Later, the host rewrites the host page. Flash memory does not overwrite data, so the
SSD writes the new data to a new page Y, establishes the new mapping relationship of A to Y,
and cancels the original mapping relationship. The data in page X becomes aged and invalid,
which is also known as garbage data.
⚫ The host continues to write data to the SSD until it is full. In this case, the host cannot write
more data unless the garbage data is cleared.
SSD read process:
⚫ An 8-fold increase in read speed depends on whether the read data is evenly distributed in the
blocks of each channel. If the 32 KB data is stored in the blocks of channels 1 through 4, the
read speed can only support a 4-fold improvement at most. That is why smaller files are
transmitted at a slower rate.
⚫ The other is parity. Parity data is additional information calculated using user data. For a RAID
array that uses parity, an additional parity disk is required. The XOR (symbol: ⊕) algorithm is
used for parity.
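The following minimal sketch (illustrative block contents, not a real RAID implementation) shows how XOR parity works: the parity block is the XOR of the data blocks, and XOR-ing the parity with the surviving blocks rebuilds any single lost block.

```python
# Minimal XOR-parity sketch: parity = D0 ^ D1 ^ D2, and any one lost
# block can be rebuilt by XOR-ing the parity with the surviving blocks.
def xor_blocks(*blocks: bytes) -> bytes:
    result = bytearray(blocks[0])
    for blk in blocks[1:]:
        for i, b in enumerate(blk):
            result[i] ^= b
    return bytes(result)

d0, d1, d2 = b"\x11" * 4, b"\x22" * 4, b"\x44" * 4
parity = xor_blocks(d0, d1, d2)

# Simulate losing d1 and rebuilding it from the parity and the other blocks.
rebuilt_d1 = xor_blocks(parity, d0, d2)
assert rebuilt_d1 == d1
```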
2.2.1.2 RAID 0
RAID 0, also referred to as striping, provides the best storage performance among all RAID levels.
RAID 0 uses the striping technology to distribute data to all disks in a RAID array.
RAID 1, in contrast, mirrors data: the amount of data stored in a RAID 1 array is only equal to the
capacity of a single disk, and a copy of the data is retained on another disk. That is, each gigabyte of
data requires 2 gigabytes of disk space. Therefore, a RAID 1 array consisting of two disks has a space
utilization of 50%.
Currently, RAID 6 is implemented in different ways. Different methods are used for obtaining parity
data.
RAID 6 DP
⚫ The second parity data block is obtained by performing an XOR operation on diagonal data
blocks in the array. The process of selecting data blocks is relatively complex: DP 0 is obtained by
XORing D 0 (disk 1, stripe 0), D 5 (disk 2, stripe 1), D 10 (disk 3, stripe 2), and D 15 (disk 4, stripe
3); DP 1 is obtained by XORing D 1 (disk 2, stripe 0), D 6 (disk 3, stripe 1), D 11 (disk 4, stripe 2),
and P 3 (first parity disk, stripe 3); DP 2 is obtained by XORing D 2 (disk 3, stripe 0), D 7 (disk 4,
stripe 1), P 2 (first parity disk, stripe 2), and D 12 (disk 1, stripe 3). Therefore, DP 0 = D 0 ⊕ D 5 ⊕
D 10 ⊕ D 15, DP 1 = D 1 ⊕ D 6 ⊕ D 11 ⊕ P 3, and so on (see the sketch after this list).
⚫ A RAID 6 array tolerates failures of up to two disks.
⚫ A RAID 6 array provides relatively poor performance regardless of whether DP or P+Q is
implemented. Therefore, RAID 6 applies to the following two scenarios:
➢ Data is critical and should consistently remain online and available.
➢ Large-capacity (generally > 2 TB) disks are used. The reconstruction of a large-capacity disk
takes a long time, and data would be inaccessible for a long time if two disks failed at the
same time. A RAID 6 array tolerates the failure of another disk during the reconstruction of
one disk. Some enterprises therefore prefer a dual-redundancy RAID array for their large-
capacity disks.
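The sketch below reproduces the diagonal-parity rule described in the list above, assuming the 4-data-disk, 4-stripe layout implied by the example (data block D(4s + d) sits on data disk d of stripe s, and each stripe also has one row-parity block P(s)). It illustrates the RAID 6 DP idea, not Huawei's implementation.

```python
# Sketch of the RAID 6 DP (diagonal parity) example above, assuming a
# 4-data-disk / 4-stripe layout where D(4*s + d) = data[s][d], plus one
# row-parity column P(s) per stripe.
def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(blocks[0])
    for blk in blocks[1:]:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

BLOCK = 4  # illustrative block size in bytes
data = [[bytes([16 * s + d]) * BLOCK for d in range(4)] for s in range(4)]

# Row parity: P(s) = XOR of the four data blocks in stripe s.
row_parity = [xor_blocks(*data[s]) for s in range(4)]

# Diagonal parity: move one column to the right per stripe, wrapping through
# the row-parity column, e.g. DP0 = D0^D5^D10^D15 and DP1 = D1^D6^D11^P3.
rows = [data[s] + [row_parity[s]] for s in range(4)]      # 5 columns per stripe
diag_parity = [xor_blocks(*[rows[s][(k + s) % 5] for s in range(4)])
               for k in range(3)]
```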
2.2.1.7 RAID 10
For most enterprises, RAID 0 is not really a practical choice, while RAID 1 is limited by disk capacity
utilization. RAID 10 provides the optimal solution by combining RAID 1 and RAID 0. In particular,
RAID 10 provides superior performance because random writes incur no parity-related write penalty.
A RAID 10 array consists of an even number of disks. User data is written to half of the disks and
mirror copies of user data are retained in the other half of disks. Mirroring is performed based on
stripes.
If one disk in each of the two RAID 1 sub-arrays fails (such as disk 2 and disk 4), access to data in the
RAID 10 array remains normal, because intact copies of the data on the faulty disks 2 and 4 are
retained on the other two disks (disk 3 and disk 1). However, if both disks in the same RAID 1
sub-array (such as disks 1 and 2) fail at the same time, data will be inaccessible.
Theoretically, RAID 10 tolerates failures of half of the physical disks. However, in the worst case,
failures of two disks in the same sub-array may also cause data loss. Generally, RAID 10 protects data
against the failure of a single disk.
2.2.1.8 RAID 50
RAID 50 combines RAID 0 and RAID 5. Two RAID 5 sub-arrays form a RAID 0 array. The two RAID 5
sub-arrays are independent of each other. A RAID 50 array requires at least six disks because a RAID
5 sub-array requires at least three disks.
disk is associated with a storage tier. For example, SSDs are associated with the high
performance tier, SAS disks are associated with the performance tier, and NL-SAS disks are
associated with the capacity tier. A storage tier would not exist if there are no disks of the
corresponding type in a disk domain. A disk domain separates an array of disks from another
array of disks for fully isolating faults and maintaining independent performance and storage
resources. RAID levels are not specified when a disk domain is created. That is, data redundancy
protection methods are not specified. Actually, RAID 2.0+ provides more flexible and specific
data redundancy protection methods. The storage space formed by disks in a disk domain is
divided into storage pools of a smaller granularity and hot spare space shared among storage
tiers. The system automatically sets the hot spare space based on the hot spare policy (high,
low, or none) set by an administrator for the disk domain and the number of disks at each
storage tier in the disk domain. In a traditional RAID array, an administrator must specify a
disk as the hot spare disk.
2. Storage Pool and Storage Tier
A storage pool is a storage resource container. The storage resources used by application
servers are all from storage pools.
A storage tier is a collection of storage media providing the same performance level in a storage
pool. Different storage tiers manage storage media of different performance levels and provide
storage space for applications that have different performance requirements.
A storage pool created based on a specified disk domain dynamically allocates CKs from the disk
domain to form CKGs according to the RAID policy of each storage tier for providing storage
resources with RAID protection to applications.
A storage pool can be divided into multiple tiers based on disk types.
When creating a storage pool, a user is allowed to specify a storage tier and related RAID policy
and capacity for the storage pool.
OceanStor storage systems support RAID 1, RAID 10, RAID 3, RAID 5, RAID 50, and RAID 6 and
related RAID policies.
The capacity tier consists of large-capacity SATA and NL-SAS disks. DP RAID 6 is recommended.
3. Disk Group
An OceanStor storage system automatically divides disks of each type in each disk domain into
one or more disk groups (DGs) according to disk quantity.
One DG consists of disks of only one type.
CKs in a CKG are allocated from different disks in a DG.
DGs are internal objects automatically configured by OceanStor storage systems and typically
used for fault isolation. DGs are not presented externally.
4. Logical Drive
A logical drive (LD) is a disk that is managed by a storage system and corresponds to a physical
disk.
5. CK
A chunk (CK) is a disk space of a specified size allocated from a storage pool. It is the basic unit
of a RAID array.
6. CKG
A chunk group (CKG) is a logical storage unit that consists of CKs from different disks in the same
DG based on the RAID algorithm. It is the minimum unit for allocating resources from a disk
domain to a storage pool.
All CKs in a CKG are allocated from the disks in the same DG. A CKG has RAID attributes, which
are actually configured for corresponding storage tiers. CKs and CKGs are internal objects
automatically configured by storage systems. They are not presented externally.
7. Extent
Each CKG is divided into logical storage spaces of a specific and adjustable size called extents.
Extent is the minimum unit (granularity) for migration and statistics of hot data. It is also the
minimum unit for space application and release in a storage pool.
An extent belongs to a volume or LUN. A user can set the extent size when creating a storage
pool. After that, the extent size cannot be changed. Different storage pools may consist of
extents of different sizes, but one storage pool must consist of extents of the same size.
8. Grain
When a thin LUN is created, extents are divided into 64 KB blocks which are called grains. A thin
LUN allocates storage space by grains. Logical block addresses (LBAs) in a grain are consecutive.
Grains are mapped to thin LUNs. A thick LUN does not involve grains.
9. Volume and LUN
A volume is an internal management object in a storage system.
A LUN is a storage unit that can be directly mapped to a host for data reads and writes. A LUN is
the external embodiment of a volume.
A volume organizes all extents and grains of a LUN and applies for and releases extents to
increase and decrease the actual space used by the volume.
In a typical 4:2 RAID 6 array, the capacity utilization is about 67%. The capacity utilization of a
Huawei OceanStor all-flash storage system with 25 disks is improved by 20% on this basis.
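The sketch below models, in simplified form, the objects described above: CKs taken from different disks are bound into a CKG by a RAID policy, and the same arithmetic gives the 4:2 RAID 6 utilization mentioned here. The names and sizes are illustrative simplifications, not Huawei's internal data structures.

```python
# Illustrative model of the RAID 2.0+ objects described above; the names
# and sizes are simplified, not Huawei's internal data structures.
from dataclasses import dataclass

@dataclass
class Chunk:            # CK: fixed-size slice of one physical disk
    disk_id: int
    size_mb: int = 64

@dataclass
class ChunkGroup:       # CKG: CKs from different disks, bound by a RAID policy
    chunks: list
    data_chunks: int
    parity_chunks: int

    def utilization(self) -> float:
        return self.data_chunks / (self.data_chunks + self.parity_chunks)

# A 4+2 RAID 6 CKG built from CKs on six different disks.
ckg = ChunkGroup(chunks=[Chunk(disk_id=i) for i in range(6)],
                 data_chunks=4, parity_chunks=2)
print(f"RAID 6 (4+2) capacity utilization: {ckg.utilization():.0%}")   # ~67%
```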
⚫ The AIX architecture is structured in three layers: SCSI device driver, SCSI middle layer, and SCSI
adaptation driver.
2.3.1.4 SCSI Target Model
Based on the SCSI architecture, a target is divided into three layers: port layer, middle layer, and
device layer.
⚫ A PORT model in a target packages or unpackages SCSI instructions on links. For example, a
PORT can package instructions into FCP, iSCSI, or SAS formats, or unpackage instructions from
those formats.
⚫ A device model in a target serves as a SCSI instruction analyzer. It tells the initiator what device
the current LUN is by processing INQUIRY commands, and processes I/Os through READ/WRITE
commands.
⚫ The middle layer of a target maintains models such as LUN space, task set, and task (command).
There are two ways to maintain LUN space. One is to maintain a global LUN for all PORTs, and
the other is to maintain a LUN space for each PORT.
2.3.1.5 SCSI Protocol and Storage System
The SCSI protocol is the basic protocol used for communication between hosts and storage devices.
The controller sends a signal to the bus processor requesting to use the bus. After the request is
accepted, the controller's high-speed cache sends data. During this process, the bus is occupied by
the controller and other devices connected to the same bus cannot use it. However, the bus
processor can interrupt the data transfer at any time and allow other devices to use the bus for
operations of a higher priority.
A SCSI controller is like a small CPU with its own command set and cache. The special SCSI bus
architecture can dynamically allocate resources to tasks run by multiple devices in a computer. In
this way, multiple tasks can be processed at the same time.
2.3.1.6 SCSI Protocol Addressing
A traditional SCSI controller is connected to a single bus, so only one bus ID is allocated. An
enterprise-level server may be configured with multiple SCSI controllers, so there may be multiple
SCSI buses. In a storage network, each FC HBA or iSCSI network adapter is connected to a bus. A bus
ID must therefore be allocated to each bus to distinguish between them.
To address devices connected to a SCSI bus, SCSI device IDs and LUNs are used. Each device on the
SCSI bus must have a unique device ID. The HBA on the server also has its own device ID: 7. Each bus,
including the bus adapter, supports a maximum of 8 or 16 device IDs. The device ID is used to
address devices and identify the priority of the devices on the bus.
Each storage device may include sub-devices, such as virtual disks and tape drives. So LUN IDs are
used to address sub-devices in a storage device.
A ternary description (bus ID, target device ID, and LUN ID) is used to identify a SCSI target.
The iSCSI protocol encapsulates SCSI commands and block data into TCP packets for transmission
over IP networks. As the transport layer protocol of SCSI, iSCSI uses mature IP network technologies
to implement and extend SAN. The SCSI protocol layer generates CDBs and sends the CDBs to the
iSCSI protocol layer. The iSCSI protocol layer then encapsulates the CDBs into PDUs and transmits
the PDUs over an IP network.
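The toy sketch below illustrates only the nesting order described above (a SCSI CDB inside an iSCSI PDU, carried as the payload of a TCP segment and then an IP packet). The "headers" are placeholder length-prefixed tags, not the real protocol formats.

```python
# Toy illustration of the encapsulation order (CDB -> iSCSI PDU -> TCP -> IP);
# the "headers" here are placeholders, not the real protocol formats.
import struct

def wrap(layer_tag: bytes, payload: bytes) -> bytes:
    # Prefix the payload with a 2-byte tag and a 2-byte length field.
    return layer_tag + struct.pack(">H", len(payload)) + payload

scsi_cdb = bytes([0x28]) + bytes(9)          # READ(10) opcode + dummy fields
iscsi_pdu = wrap(b"IS", scsi_cdb)            # iSCSI PDU carrying the CDB
tcp_segment = wrap(b"TC", iscsi_pdu)         # PDU as TCP payload
ip_packet = wrap(b"IP", tcp_segment)         # segment as IP payload
print(len(ip_packet), "bytes on the wire in this toy example")
```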
2.3.2.2 iSCSI Initiator and Target
The iSCSI communication system inherits some of SCSI's features. The iSCSI communication involves
an initiator that sends I/O requests and a target that responds to the I/O requests and executes I/O
operations. After a connection is set up between the initiator and target, the target controls the
entire process as the primary device.
⚫ There are three types of iSCSI initiators: software-based initiator driver, hardware-based TCP
offload engine (TOE) NIC, and iSCSI HBA. Their performance increases in that order.
⚫ An iSCSI target is usually an iSCSI disk array or iSCSI tape library.
The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators and targets. All
iSCSI nodes are identified by their iSCSI names. This method distinguishes iSCSI names from host
names.
2.3.2.3 iSCSI Architecture
In an iSCSI system, a user sends a data read or write command on a SCSI storage device. The
operating system converts this request into one or multiple SCSI instructions and sends the
instructions to the target SCSI controller card. The iSCSI node encapsulates the instructions and data
into an iSCSI packet and sends the packet to the TCP/IP layer, where the packet is encapsulated into
an IP packet to be transmitted over a network. You can also encrypt the SCSI instructions for
transmission over an insecure network.
Data packets can be transmitted over a LAN or the Internet. The receiving storage controller
restructures the data packets and sends the SCSI control commands and data in the iSCSI packets to
corresponding disks. The disks execute the operation requested by the host or application. For a
data request, data will be read from the disks and sent to the host. The process is completely
transparent to users. Though SCSI instruction execution and data preparation can be implemented
by the network controller software using TCP/IP, the host would have to expend a lot of CPU resources
to process the SCSI instructions and data. If these transactions are processed by dedicated devices, the
impact on system performance will be reduced to a minimum. An iSCSI adapter combines the
functions of an NIC and an HBA. The iSCSI adapter obtains data by blocks, classifies and processes
data using the TCP/IP processing engine, and sends IP data packets over an IP network. In this way,
users can create IP SANs without compromising server performance.
2.3.2.4 FC Protocol
FC can be referred to as the FC protocol, FC network, or FC interconnection. As FC delivers high
performance, it is becoming more commonly used for front-end host access on point-to-point and
switch-based networks. Like TCP/IP, the FC protocol suite also includes concepts from the TCP/IP
protocol suite and the Ethernet, such as FC switching, FC switch, FC routing, FC router, and SPF
routing algorithm.
FC protocol structure:
⚫ FC-0: defines physical connections and selects different physical media and data rates for
protocol operations. This maximizes system flexibility and allows for existing cables and different
technologies to be used to meet the requirements of different systems. Copper cables and
optical cables are commonly used.
⚫ FC-1: defines the 8-bit/10-bit transmission encoding used to balance the transmitted bit
stream. The encoding also serves as a mechanism to transfer data and detect errors. The
excellent transfer capability of 8-bit/10-bit encoding helps reduce component design costs and
ensures optimum transfer density for better clock recovery. Note: 8-bit/10-bit encoding is also
used by IBM ESCON.
⚫ FC-2: includes the following items for sending data over the network:
➢ How data should be split into small frames
➢ How much data should be sent at a time (flow control)
➢ Where frames should be sent (including defining service levels based on applications)
⚫ FC-3: defines advanced functions such as striping (data is transferred through multiple
channels), multicast (one message is sent to multiple targets), and group query (multiple ports
are mapped to one node). When FC-2 defines functions for a single port, FC-3 can define
functions across ports.
⚫ FC-4: maps upper-layer protocols, such as IP, SCSI, and ATM, onto Fibre Channel. The SCSI
mapping is one subset of the FC protocol suite.
Like the Ethernet, FC provides the following network topologies:
⚫ Point-to-point:
➢ The simplest topology that allows direct communication between two nodes (usually a
storage device and a server).
⚫ FC-AL:
➢ Similar to the Ethernet shared bus topology but is in arbitrated loop mode rather than bus
connection mode. Each device is connected to another device end to end to form a loop.
➢ Data frames are transmitted hop by hop in the arbitrated loop and the data frames can be
transmitted only in one direction at any time. As shown in the figure, node A needs to
communicate with node H. After node A wins the arbitration, it sends data frames to node
H. However, the data frames are transmitted clockwise in the sequence of B-C-D-E-F-G-H,
which is inefficient.
Figure 2-9
⚫ Fabric:
➢ Similar to an Ethernet switching topology, a fabric topology is a mesh switching matrix.
➢ The forwarding efficiency is much greater than in FC-AL.
➢ FC devices are connected to fabric switches through optical fibres or copper cables to
implement point-to-point communication between nodes.
FC frees the workstation from the management of every port. Each port manages its own
point-to-point connection to the fabric, and other fabric functions are implemented by FC
switches. On an FC network, there are seven types of ports.
loss is not allowed. To ensure that FCoE runs properly on an Ethernet network, the Ethernet needs
to be enhanced to prevent packet loss. The enhanced Ethernet is called Converged Enhanced
Ethernet (CEE).
A SAS controller can directly control SATA disks. However, SAS disks cannot be used in a SATA
environment, because a SATA controller cannot control SAS disks.
At the protocol layer, SAS includes three types of protocols that are used for data transmission of
different devices.
⚫ The serial SCSI protocol (SSP) is used to transmit SCSI commands.
⚫ The SCSI management protocol (SMP) is used to maintain and manage connected devices.
⚫ The SATA channel protocol (STP) is used for data transmission between SAS and SATA.
When the three protocols operate cooperatively, SAS can be used with SATA and some SCSI devices.
The PCIe protocol features point-to-point connection, high reliability, tree networking, full duplex,
and frame-structure-based transmission.
PCIe protocol layers include the physical layer, data link layer, transaction layer, and application
layer.
⚫ The physical layer in a PCIe bus architecture determines the physical features of the bus. In
future, the performance of a PCIe bus can be further improved by increasing the speed or
changing the encoding or decoding mode. Such changes only affect the physical layer,
facilitating upgrades.
⚫ The data link layer ensures the correctness and reliability of data packets transmitted over a
PCIe bus. It checks whether the data packet encapsulation is complete and correct, adds the
sequence number and CRC code to the data, and uses the ack/nack handshake protocol for
error detection and correction.
⚫ The transaction layer receives read and write requests from the software layer, creates request
packets, and transmits them to the data link layer. This type of packet is called a transaction
layer packet (TLP). The transaction layer also receives response packets coming up from the
data link layer, associates them with the related software requests, and transmits them to the
software layer for processing.
⚫ The application layer is designed by users based on actual needs. Other layers must comply with
the protocol requirements.
2.3.4.2 NVMe Protocol
NVMe is short for Non-Volatile Memory Express. The NVMe standard is oriented to PCIe SSDs. Direct
connection from the native PCIe channel to the CPU can avoid the latency caused by communication
between the external controller (PCH) of the SATA and SAS interface and the CPU.
In terms of the entire storage process, NVMe serves not only as a logical device interface but also as
a command set and protocol specification. It exploits the low latency and parallelism of PCIe
channels, together with the parallelism of contemporary processors, platforms, and applications, to
greatly improve the read and write performance of SSDs at controllable cost, and it avoids the
latency introduced by the Advanced Host Controller Interface (AHCI), which constrained SSD
performance in the SATA era.
NVMe protocol stack:
⚫ In terms of the transmission path, I/Os of a SAS all-flash array are transmitted from the front-
end server to the CPU through the FC/IP front-end interface protocol of a storage device. They
are then transmitted to a SAS chip, a SAS expander, and finally a SAS SSD through PCIe links and
switches.
⚫ The Huawei NVMe-based all-flash storage system supports end-to-end NVMe. Data I/Os are
transmitted from a front-end server to the CPU through a storage device's FC-NVMe/NVMe
Over RDMA front-end interface protocol. Back-end data is transmitted directly to NVMe-based
SSDs through 100 Gbit/s RDMA. The CPU of the NVMe-based all-flash storage system appears to
communicate directly with NVMe SSDs via a shorter transmission path, providing higher
transmission efficiency and a lower transmission latency.
⚫ In terms of software protocol parsing, SAS- and NVMe-based all-flash storage systems differ
greatly in protocol interaction for data writes. If the SAS back-end SCSI protocol is used, four
protocol interactions are required for a complete data write operation. Huawei NVMe-based all-
flash storage systems require only two protocol interactions, making them twice as efficient as
SAS-based all-flash storage systems in terms of processing write requests.
Advantages of NVMe:
⚫ Low latency: Data is not read from registers when commands are executed, resulting in a low
I/O latency.
⚫ High bandwidth: A PCIe x4 link can provide close to 4 GB/s of throughput for a single drive (see
the calculation sketch after this list).
⚫ High IOPS: NVMe increases the maximum queue depth from 32 to 64,000. The IOPS of SSDs is
also greatly improved.
⚫ Low power consumption: The automatic switchover between power consumption modes and
dynamic power management greatly reduce power consumption.
⚫ Wide driver applicability: The driver applicability problem between different PCIe SSDs is solved.
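Assuming the bandwidth bullet refers to a PCIe 3.0 x4 link (8 GT/s per lane with 128b/130b encoding; the generation is an assumption, not stated above), the usable bandwidth works out to roughly 4 GB/s.

```python
# Sketch: usable bandwidth of a PCIe 3.0 x4 link (an assumed configuration).
lanes = 4
raw_gt_per_s = 8.0               # PCIe 3.0 signalling rate per lane
encoding_efficiency = 128 / 130  # 128b/130b line coding

gbytes_per_s = lanes * raw_gt_per_s * encoding_efficiency / 8
print(f"~{gbytes_per_s:.1f} GB/s usable bandwidth")   # ~3.9 GB/s
```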
Huawei OceanStor Dorado all-flash storage systems use NVMe-oF to implement SSD resource
sharing, and provide 32 Gbit/s FC-NVMe and NVMe over 100 Gbit/s RDMA networking designs. In
this way, the same network protocol is used for front-end network connection, back-end disk
enclosure connection, and scale-out controller interconnection.
RDMA uses related hardware and network technologies to enable NICs of servers to directly read
memory, achieving high bandwidth, low latency, and low resource consumption. However, the
RDMA-dedicated IB network architecture is incompatible with a live network, resulting in high costs.
RoCE effectively solves this problem. RoCE is a network protocol that uses the Ethernet to carry
RDMA. There are two versions of RoCE. RoCEv1 is a link layer protocol and cannot be used in
different broadcast domains. RoCEv2 is a network layer protocol and can implement routing
functions.
(switch). The NIC should support iWARP (if CPU offload is used). Otherwise, the entire iWARP stack
can be implemented in software, and most RDMA performance advantages are lost.
2.3.5.2 IB Protocol
IB technology is specifically designed for server connections and is widely used for communication
between servers (for example, replication and distributed working), between a server and a storage
device (for example, SAN and DAS), and between a server and a network (for example, LAN, WAN,
and the Internet).
IB defines a set of devices used for system communication, including channel adapters, switches,
and routers used to connect to other devices, such as host channel adapters (HCAs) and target
channel adapters (TCAs). The IB protocol has the following features:
⚫ Standard-based protocol: IB was designed by the InfiniBand Trade Association, which was
founded in 1999 and comprised 225 companies. Main members of the association include
Agilent, Dell, HP, IBM, InfiniSwitch, Intel, Mellanox, Network Appliance, and Sun Microsystems.
More than 100 other members help develop and promote the standard.
⚫ Speed: IB provides high speeds.
⚫ Memory: Servers that support IB use HCAs to convert the IB protocol to the PCI-X or PCI-Xpress
bus inside the server. The HCA supports RDMA and is also called kernel bypass. RDMA fits
clusters well. It uses a virtual addressing solution to let a server identify and use memory
resources from other servers without involving any operating system kernels.
⚫ RDMA helps implement transport offload. The transport offload function transfers data packet
routing from the OS to the chip level, reducing the service load of the processor. An 80 GHz
processor is required to process data at a transmission speed of 10 Gbit/s in the OS.
The IB system includes channel adapters (CAs), switches, routers, repeaters, and the links connecting
them. CAs include HCAs and TCAs.
⚫ An HCA is used to connect a host processor to the IB structure.
⚫ A TCA is used to connect I/O adapters to the IB structure.
IB in storage: The IB front-end network is used to exchange data with customers. Data is transmitted
based on the IPoIB protocol. The IB back-end network is used for data interaction between nodes in
a storage device. The RPC module uses RDMA to synchronize data between nodes.
IB layers include the application layer, transport layer, network layer, link layer, and physical layer.
The functions of each layer are described as follows:
⚫ Transport layer: responsible for in-order distribution and segmentation of packets, channel
multiplexing, and data transmission. It also sends, receives, and reassembles data packet
segments.
⚫ Network layer: provides a mechanism for routing packets from one substructure to another.
Each routing packet of the source and destination nodes has a global routing header (GRH) and
a 128-bit IPv6 address. A standard global 64-bit identifier is also embedded at the network layer
and this identifier is unique in all subnets. Through the exchange of such identifier values, data
can be transmitted across multiple subnets.
⚫ Link layer: provides such functions as packet design, point-to-point connection, and packet
switching in the local subsystems. At the packet communication level, two special packet types
are specified: data transmission and network management packets. The network management
packet provides functions like operation control, subnet indication, and fault tolerance for
device enumeration. The data transmission packet is used for data transmission. The maximum
size of each packet is 4 KB. In each specific device subnet, the direction and exchange of each
packet are implemented by a local subnet manager with a 16-bit identifier address.
⚫ Physical layer: defines connections at three rates: 1X, 4X, and 12X. The signal transmission rates
are 2.5 Gbit/s, 10 Gbit/s, and 30 Gbit/s, respectively. IBA therefore allows multiple connections
to reach a speed of up to 30 Gbit/s. Because full-duplex serial communication is used, a 1X
bidirectional connection requires only four signal wires, and a 12X connection requires only 48.
The NDMP protocol is designed for the data backup system of NAS devices. It enables NAS devices to
send data directly to the connected disk devices or the backup servers on the network for backup,
without any backup client agent being required.
There are two networking modes for NDMP:
⚫ On a 2-way network, backup media is connected directly to a NAS storage system instead of to a
backup server. In a backup process, the backup server sends a backup command to the NAS
storage system through the Ethernet. The system then directly backs up data to the tape library
it is connected to.
➢ In the NDMP 2-way backup mode, data flows are transmitted directly to backup media,
greatly improving the transmission performance and reducing server resource usage.
However, a tape library is connected to a NAS storage device, so the tape library can back
up data only for the NAS storage device to which it is connected.
➢ Tape libraries are expensive. To enable different NAS storage devices to share tape
devices, NDMP also supports the 3-way backup mode.
⚫ In the 3-way backup mode, a NAS storage system can transfer backup data to a NAS storage
device connected to a tape library through a dedicated backup network. Then, the storage
device backs up the data to the tape library.
Both controllers work at the same time. Each connects to all back-end buses, but each bus is
managed by only one controller, so each controller manages half of the back-end buses. If one
controller is faulty, the other takes over all buses. This is more efficient than the Active-Standby
mode.
Mid-range Storage Architecture Evolution:
⚫ Mid-range storage systems always use an independent dual-controller architecture. Controllers
are usually of modular hardware.
⚫ The evolution of mid-range storage mainly focuses on the rate of host interfaces and disk
interfaces, and the number of ports.
⚫ The common form factor is the convergence of SAN and NAS storage services.
Multi-controller Storage:
⚫ Most mission-critical storage systems use multi-controller architecture.
⚫ The main architecture models are as follows:
➢ Bus architecture
➢ Hi-Star architecture
➢ Direct-connection architecture
➢ Virtual matrix architecture
Mission-critical storage architecture evolution:
⚫ In 1990, EMC launched Symmetrix, a full bus architecture. A parallel bus connected front-end
interface modules, cache modules, and back-end disk interface modules for data and signal
exchange in time-division multiplexing mode.
⚫ In 2000, HDS adopted the switching architecture for Lightning 9900 products. Front-end
interface modules, cache modules, and back-end disk interface modules were connected on two
redundant switched networks, increasing communication channels to dozens of times more
than that of the bus architecture. The internal bus was no longer a performance bottleneck.
⚫ In 2003, EMC launched the DMX series based on a full mesh architecture. All modules were
connected in point-to-point mode, theoretically providing higher internal bandwidth, but this
added system complexity and limited scalability.
⚫ In 2009, to reduce hardware development costs, EMC launched the distributed switching
architecture by connecting a separated switch module to the tightly coupled dual-controller of
mid-range storage systems. This achieved a balance between costs and scalability.
⚫ In 2012, Huawei launched the Huawei OceanStor 18000 series, a mission-critical storage
product also based on distributed switching architecture.
Storage Software Technology Evolution:
A storage system combines unreliable, low-performance disks into high-reliability, high-performance
storage through effective management. Storage systems also provide data sharing, easy management,
and convenient data protection. Storage system software has evolved from basic RAID and cache, to
data protection features such as snapshot and replication, then to dynamic resource management
that improves data management efficiency, and to deduplication and tiered storage that improve
storage efficiency.
Distributed Storage Architecture:
⚫ A distributed storage system organizes local HDDs and SSDs of general-purpose servers into a
large-scale storage resource pool, and then distributes data to multiple data storage servers.
⚫ Huawei's current distributed storage follows the approach pioneered by Google: a distributed file
system is built across multiple servers, and storage services are implemented on top of that file system.
⚫ Most storage nodes are general-purpose servers. Huawei OceanStor 100D is compatible with
multiple general-purpose x86 servers and Arm servers.
➢ Protocol: storage protocol layer. The block, object, HDFS, and file services support local
mounting access over iSCSI or VSC, S3/Swift access, HDFS access, and NFS access
respectively.
➢ VBS: block access layer of FusionStorage Block. User I/Os are delivered to VBS over iSCSI or
SCSI.
➢ EDS-B: provides block services with enterprise features, and receives and processes I/Os
from VBS.
➢ EDS-F: provides the HDFS service.
➢ Metadata Controller (MDC): The metadata control device controls distributed cluster node
status, data distribution rules, and data rebuilding rules.
➢ Object Storage Device (OSD): the component that stores user data in the distributed cluster.
➢ Cluster Manager (CM): manages cluster information.
⚫ Port consistency: In a loop, the EXP (P1) port of an upper-level disk enclosure is connected to
the PRI (P0) port of a lower-level disk enclosure.
⚫ Dual-plane networking: Expansion board A connects to controller A, while expansion board B
connects to controller B.
⚫ Symmetric networking: On controllers A and B, symmetric ports and slots are connected to the
same disk enclosure.
⚫ Forward connection networking: Both expansion modules A and B use forward connection.
⚫ Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed the upper
limit.
IP scale-out is used for Huawei OceanStor V3 and V5 entry-level and mid-range series, Huawei
OceanStor V5 Kunpeng series, and Huawei OceanStor Dorado V6 series. IP scale-out integrates
TCP/IP, Remote Direct Memory Access (RDMA), and Internet Wide Area RDMA Protocol (iWARP) to
implement service switching between controllers, which complies with the all-IP trend of the data
center network.
PCIe scale-out is used for Huawei OceanStor 18000 V3 and V5 series, and Huawei OceanStor Dorado
V3 series. PCIe scale-out integrates PCIe channels and the RDMA technology to implement service
switching between controllers.
PCIe scale-out: features high bandwidth and low latency.
IP scale-out: employs standard data center technologies (such as ETH, TCP/IP, and iWARP) and
infrastructure, and boosts the development of Huawei's proprietary chips for entry-level and mid-
range products.
Next, let's move on to I/O read and write processes of the host. The scenarios are as follows:
⚫ Local Write Process
➢ A host delivers write I/Os to engine 0.
➢ Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message indicating that data is written successfully.
➢ Engine 0 flushes dirty data onto a disk. If the target disk is managed by the local engine,
engine 0 directly delivers the write I/Os.
➢ If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (engine 1
for example) where the disk resides.
➢ Engine 1 writes dirty data onto disks.
⚫ Non-local Write Process
➢ A host delivers write I/Os to engine 2.
➢ After detecting that the LUN is owned by engine 0, engine 2 transfers the write I/Os to
engine 0.
➢ Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message to engine 2, indicating that data is written successfully.
➢ Engine 2 returns the write success message to the host.
➢ Engine 0 flushes dirty data onto a disk. If the target disk is managed by the local engine,
engine 0 directly delivers the write I/Os.
➢ If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (engine 1
for example) where the disk resides.
➢ Engine 1 writes dirty data onto disks.
⚫ Local Read Process
➢ Upon reception of host I/Os, the FIM directly distributes the I/Os to appropriate
controllers.
⚫ Full interconnection among controllers
➢ Controllers in a controller enclosure are connected by 100 Gbit/s (40 Gbit/s for Dorado
3000 V6) RDMA links on the backplane.
➢ For scale-out to multiple controller enclosures, any two controllers can be directly
connected to avoid data forwarding.
⚫ Back-end full interconnection
➢ Dorado 8000 and 18000 V6 support BIMs, which allow a smart disk enclosure to be
connected to two controller enclosures and accessed by eight controllers simultaneously.
This technique, together with continuous mirroring, allows the system to tolerate failure of
7 out of 8 controllers.
➢ Dorado 3000, 5000, and 6000 V6 do not support BIMs. Disk enclosures connected to
Dorado 3000, 5000, and 6000 V6 can be accessed by only one controller enclosure.
Continuous mirroring is not supported.
The storage system supports three types of disk enclosures: SAS, smart SAS, and smart NVMe.
Currently, they cannot be used together on one storage system. Smart SAS and smart NVMe disk
enclosures use the same networking mode. In this mode, a controller enclosure uses the shared 2-
port 100 Gbit/s RDMA interface module to connect to a disk enclosure. Each interface module
connects to the four controllers in the controller enclosure through PCIe 3.0 x16. In this way, each
disk enclosure can be simultaneously accessed by all four controllers, achieving full interconnection
between the disk enclosure and the four controllers. A smart disk enclosure has two groups of uplink
ports and can connect to two controller enclosures at the same time. This allows the two controller
enclosures (eight controllers) to simultaneously access a disk enclosure, implementing full
interconnection between the disk enclosure and eight controllers. When full interconnection
between disk enclosures and eight controllers is implemented, the system can use continuous
mirroring to tolerate failure of 7 out of 8 controllers without service interruption.
Huawei storage provides E2E global resource sharing:
⚫ Symmetric architecture
➢ All products support host access in active-active mode. Requests can be evenly distributed
to each front-end link.
➢ They eliminate LUN ownership by controllers, making LUNs easier to use and balancing
loads. They accomplish this by dividing a LUN into multiple slices that are then evenly
distributed to all controllers using the DHT algorithm (see the sketch after this list).
➢ Mission-critical products reduce latency with intelligent FIMs that divide host I/Os into LUN
slices and send the requests to their target controllers.
⚫ Shared port
➢ A single port is shared by four controllers in a controller enclosure.
➢ Loads are balanced without host multipathing.
⚫ Global cache
➢ The system directly writes received I/Os (in one or two slices) to the cache of the
corresponding controller and sends an acknowledgement to the host.
➢ The intelligent read cache of all controllers participates in prefetch and cache hit of all LUN
data and metadata.
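The slice-based distribution described above can be illustrated with a minimal Python sketch. The 64 MB slice size and the hash over LUN ID and LBA follow the description in this guide; the hash function and virtual node count used here are simplifications chosen for illustration, not Huawei's actual implementation.

import hashlib

SLICE_SIZE = 64 * 1024 * 1024     # 64 MB slices, as described above
NUM_VIRTUAL_NODES = 8             # assumed virtual node count for illustration

def virtual_node_for(lun_id: int, lba: int) -> int:
    """Map an I/O to a virtual node by hashing (LUN ID, slice index)."""
    slice_index = lba // SLICE_SIZE
    key = f"{lun_id}:{slice_index}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_VIRTUAL_NODES

# Two LBAs inside the same 64 MB slice land on the same virtual node.
print(virtual_node_for(lun_id=3, lba=0), virtual_node_for(lun_id=3, lba=4096))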
FIMs of Huawei OceanStor Dorado 8000 and 18000 V6 series storage adopt Huawei-developed
Hi1822 chip to connect to all controllers in a controller enclosure via four internal links and each
front-end port provides a communication link for the host. If any controller restarts during an
upgrade, services are seamlessly switched to the other controller without impacting hosts and
interrupting links. The host is unaware of controller faults. Switchover is completed within 1 second.
The FIM has the following features:
⚫ Failure of a controller will not disconnect the front-end link, and the host is unaware of the
controller failure.
⚫ When a controller fails, the PCIe link between the FIM and that controller is disconnected,
allowing the FIM to detect the controller failure.
⚫ Service switchover is performed between the controllers, and the FIM redistributes host
requests to other controllers.
⚫ The switchover time is about 1 second, which is much shorter than switchover performed by
multipathing software (10-30s).
In global cache mode, host data is directly written into linear space logs, and the logs directly copy
the host data to the memory of multiple controllers using RDMA based on a preset copy policy. The
global cache consists of two parts:
⚫ Global memory: memory of all controllers (four controllers in the figure). This is managed in a
unified memory address, and provides linear address space for the upper layer based on a
redundancy configuration policy.
⚫ WAL: the write cache for new writes, organized as a log (write-ahead log)
The global pool uses RAID 2.0+, full-strip write of new data, and shared RAID groups between
multiple strips.
Another feature is back-end sharing, which includes sharing of back-end interface modules within an
enclosure and cross-controller enclosure sharing of back-end disk enclosures.
Active-Active Architecture with Full Load Balancing:
⚫ Even distribution of unhomed LUNs
➢ Data on LUNs is divided into 64 MB slices. The slices are distributed to different virtual
nodes based on the hash result (LUN ID + LBA).
⚫ Front-end load balancing
➢ UltraPath selects appropriate physical links to send each slice to the corresponding virtual
node.
➢ The front-end interconnect I/O modules forward the slices to the corresponding virtual
nodes.
➢ Front-end: If there is no UltraPath or FIM, the controllers forward I/Os to the
corresponding virtual nodes.
⚫ Global write cache load balancing
➢ The data volume is balanced.
➢ Data hotspots are balanced.
⚫ Global storage pool load balancing
➢ Usage of disks is balanced.
➢ The wear degree and lifecycle of disks are balanced.
➢ Data is evenly distributed.
➢ Hotspot data is balanced.
⚫ Three cache copies
For a smart disk array, the controller provides RAID and large-capacity cache, enables the disk array
to have multiple functions, and is equipped with dedicated management software.
2.5.2 NAS
Enterprises need to store a large amount of data and share the data through a network. Therefore,
network-attached storage (NAS) is a good choice. NAS connects storage devices to the live network
and provides data and file services.
For a server or host, NAS is an external device and can be flexibly deployed through the network. In
addition, NAS provides file-level sharing rather than block-level sharing, which makes it easier for
clients to access NAS over the network. UNIX and Microsoft Windows users can seamlessly share
data through NAS or File Transfer Protocol (FTP). When NAS sharing is used, UNIX uses NFS and
Windows uses CIFS.
NAS has the following characteristics:
⚫ NAS provides storage resources through file-level data access and sharing, enabling users to
quickly share files with minimum storage management costs.
⚫ NAS is a preferred file sharing storage solution that does not require multiple file servers.
⚫ NAS also helps eliminate bottlenecks in user access to general-purpose servers.
⚫ NAS uses network and file sharing protocols for archiving and storage. These protocols include
TCP/IP for data transmission as well as CIFS and NFS for providing remote file services.
A general-purpose server can be used to carry any application and run a general-purpose operating
system. Unlike general-purpose servers, NAS is dedicated to file services and provides file sharing
services for other operating systems using open standard protocols. NAS devices are optimized
based on general-purpose servers in aspects such as file service functions, storage, and retrieval. To
improve the high availability of NAS devices, some NAS vendors also support the NAS clustering
function.
The components of a NAS device are as follows:
⚫ NAS engine (CPU and memory)
⚫ One or more NICs that provide network connections, for example, GE NIC and 10GE NIC.
⚫ An optimized operating system for NAS function management
⚫ NFS and CIFS protocols
⚫ Disk resources that use industry-standard storage protocols, such as ATA, SCSI, and Fibre
Channel
NAS protocols include NFS, CIFS, FTP, HTTP, and NDMP.
⚫ NFS is a traditional file sharing protocol in the UNIX environment. It is a stateless protocol. If a
fault occurs, NFS connections can be automatically recovered.
⚫ CIFS is a traditional file sharing protocol in the Microsoft environment. It is a stateful protocol
based on the Server Message Block (SMB) protocol. If a fault occurs, CIFS connections cannot be
automatically recovered. CIFS is integrated into the operating system and does not require
additional software. Moreover, CIFS sends only a small amount of redundant information, so it
has higher transmission efficiency than NFS.
⚫ FTP is one of the protocols in the TCP/IP protocol suite. It consists of two parts: FTP server and
FTP client. The FTP server is used to store files. Users can use the FTP client to access resources
on the FTP server through FTP.
⚫ Hypertext Transfer Protocol (HTTP) is an application-layer protocol used to transfer hypermedia
documents (such as HTML). It is designed for communication between a Web browser and a
Web server, but can also be used for other purposes.
⚫ Network Data Management Protocol (NDMP) provides an open standard for NAS network
backup. NDMP enables data to be directly written to tapes without being backed up by backup
servers, improving the speed and efficiency of NAS data protection.
Working principles of NFS: Like other file sharing protocols, NFS also uses the C/S architecture.
However, NFS provides only the basic file processing function and does not provide any TCP/IP data
transmission function. The TCP/IP data transmission function can be implemented only by using the
Remote Procedure Call (RPC) protocol. NFS file systems are completely transparent to clients.
Accessing files or directories in an NFS file system is the same as accessing local files or directories.
One program can use RPC to request a service from a program located in another computer over a
network without having to understand the underlying network protocols. RPC assumes the existence
of a transmission protocol such as Transmission Control Protocol (TCP) or User Datagram Protocol
(UDP) to carry the message data between communicating programs. In the OSI network
communication model, RPC traverses the transport layer and application layer. RPC simplifies
development of applications.
RPC works based on the client/server model. The requester is a client, and the service provider is a
server. The client sends a call request with parameters to the RPC server and waits for a response.
On the server side, the process remains in a sleep state until the call request arrives. Upon receipt of
the call request, the server obtains the process parameters, outputs the calculation results, and
sends the response to the client. Then, the server waits for the next call request. The client receives
the response and obtains call results.
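As a minimal illustration of the RPC request/response model described above (not the ONC RPC protocol actually used by NFS), the following Python sketch uses the standard library's XML-RPC modules: the server registers a procedure, and the client calls it as if it were local. The procedure name and port are arbitrary examples.

import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def stat_file(name):
    """A trivial remote procedure that pretends to return file metadata."""
    return {"name": name, "size": 4096}

# Server side: register the procedure and wait for call requests.
server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
server.register_function(stat_file)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send a call request with parameters and wait for the response.
client = ServerProxy("http://127.0.0.1:8000")
print(client.stat_file("report.txt"))    # {'name': 'report.txt', 'size': 4096}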
One of the typical applications of NFS is using the NFS server as internal shared storage in cloud
computing. The NFS client is optimized based on cloud computing to provide better performance
and reliability. Cloud virtualization software (such as VMware) optimizes the NFS client, so that the
VM storage space can be created on the shared space of the NFS server.
Working principles of CIFS: CIFS runs on top of TCP/IP and allows Windows computers to access files
on UNIX computers over a network.
The CIFS protocol applies to file sharing. Two typical application scenarios are as follows:
⚫ File sharing service
➢ CIFS is commonly used in file sharing service scenarios such as enterprise file sharing.
⚫ Hyper-V VM application scenario
➢ SMB can be used to share mirrors of Hyper-V virtual machines promoted by Microsoft. In
this scenario, the failover feature of SMB 3.0 is required to ensure service continuity upon
a node failure and to ensure the reliability of VMs.
2.5.3 SAN
2.5.3.1 IP SAN Technologies
NIC + Initiator software: Host devices such as servers and workstations use standard NICs to connect
to Ethernet switches. iSCSI storage devices are also connected to the Ethernet switches or to the
NICs of the hosts. The initiator software installed on hosts virtualizes NICs into iSCSI cards. The iSCSI
cards are used to receive and transmit iSCSI data packets, implementing iSCSI and TCP/IP
transmission between the hosts and iSCSI devices. This mode uses standard Ethernet NICs and
switches, eliminating the need for adding other adapters. Therefore, this mode is the most cost-
effective. However, the mode occupies host resources when converting iSCSI packets into TCP/IP
packets, increasing host operation overheads and degrading system performance. The NIC + initiator
software mode is applicable to scenarios that require the relatively low I/O and bandwidth
performance for data access.
TOE NIC + initiator software: The TOE NIC processes the functions of the TCP/IP protocol layer, and
the host processes the functions of the iSCSI protocol layer. Therefore, the TOE NIC significantly
improves the data transmission rate. Compared with the pure software mode, this mode reduces
host operation overheads and requires minimal network construction expenditure. This is a trade-off
solution.
iSCSI HBA:
⚫ An iSCSI HBA is installed on the host to implement efficient data exchange between the host
and the switch and between the host and the storage device. Functions of the iSCSI protocol
layer and TCP/IP protocol stack are handled by the host HBA, occupying the least CPU resources.
This mode delivers the best data transmission performance but requires high expenditure.
⚫ The iSCSI communication system inherits part of SCSI's features. The iSCSI communication
involves an initiator that sends I/O requests and a target that responds to the I/O requests and
executes I/O operations. After a connection is set up between the initiator and target, the target
controls the entire process as the primary device. The target includes the iSCSI disk array and
iSCSI tape library.
⚫ The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators and
targets. All iSCSI nodes are identified by their iSCSI names. In this way, iSCSI names are
distinguished from host names.
⚫ iSCSI uses iSCSI Qualified Name (IQN) to identify initiators and targets. Addresses change with
the relocation of initiator or target devices, but their names remain unchanged. When setting
up a connection, an initiator sends a request. After the target receives the request, it checks
whether the iSCSI name contained in the request is consistent with that bound with the target.
If the iSCSI names are consistent, the connection is set up. Each iSCSI node has a unique iSCSI
name. One iSCSI name can be used in the connections from one initiator to multiple targets.
Multiple iSCSI names can be used in the connections from one target to multiple initiators.
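The IQN format mentioned above follows a well-known pattern (iqn.<year>-<month>.<reversed domain>[:<unique string>]). The following Python sketch checks that pattern with a simplified regular expression; it is an illustration, not a complete validator.

import re

# Simplified check of the iSCSI Qualified Name (IQN) format.
IQN_PATTERN = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.-]+(:.+)?$", re.IGNORECASE)

def is_valid_iqn(name: str) -> bool:
    return bool(IQN_PATTERN.match(name))

print(is_valid_iqn("iqn.2012-01.com.example:storage.tgt1"))   # True
print(is_valid_iqn("eui.0123456789abcdef"))                   # False (EUI-64, not IQN)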
Logical ports are created based on bond ports, VLAN ports, or Ethernet ports. Logical ports are
virtual ports that carry host services. A unique IP address is allocated to each logical port for carrying
its services.
⚫ Bond port: To improve reliability of paths for accessing file systems and increase bandwidth, you
can bond multiple Ethernet ports on the same interface module to form a bond port.
⚫ VLAN: VLANs logically divide the physical Ethernet ports or bond ports of a storage system into
multiple broadcast domains. On a VLAN, when service data is being sent or received, a VLAN ID
is configured for the data so that the networks and services of VLANs are isolated, further
ensuring service data security and reliability.
⚫ Ethernet port: Physical Ethernet ports on an interface module of a storage system. Bond ports,
VLANs, and logical ports are created based on Ethernet ports.
IP address failover: A logical IP address fails over from a faulty port to an available port. In this
way, services are switched from the faulty port to the available port without interruption. The
faulty port takes over services back after it recovers. This task can be completed automatically
or manually. IP address failover applies to IP SAN and NAS.
During the IP address failover, services are switched from the faulty port to an available port,
ensuring service continuity and improving the reliability of paths for accessing file systems. Users are
not aware of this process.
The essence of IP address failover is a service switchover between ports. The ports can be Ethernet
ports, bond ports, or VLAN ports.
⚫ Ethernet port–based IP address failover: To improve the reliability of paths for accessing file
systems, you can create logical ports based on Ethernet ports.
Figure 2-10 Ethernet port-based IP address failover
➢ Host services are running on logical port A of Ethernet port A. The corresponding IP
address is "a". Ethernet port A fails and thereby cannot provide services. After IP address
failover is enabled, the storage system will automatically locate available Ethernet port B,
delete the configuration of logical port A that corresponds to Ethernet port A, and create
and configure logical port A on Ethernet port B. In this way, host services are quickly
switched to logical port A on Ethernet port B. The service switchover is executed quickly.
Users are not aware of this process.
⚫ Bond port-based IP address failover: To improve the reliability of paths for accessing file
systems, you can bond multiple Ethernet ports to form a bond port. When an Ethernet port that
is used to create the bond port fails, services are still running on the bond port. The IP address
fails over only when all Ethernet ports that are used to create the bond port fail.
Figure 2-11 Bond port-based IP address failover
➢ Multiple Ethernet ports are bonded to form bond port A. Logical port A created based on
bond port A can provide high-speed data transmission. When both Ethernet ports A and B
fail due to various causes, the storage system will automatically locate bond port B, delete
logical port A, and create the same logical port A on bond port B. In this way, services are
switched from bond port A to bond port B. After Ethernet ports A and B recover, services
will be switched back to bond port A if failback is enabled. The service switchover is
executed quickly, and users are not aware of this process.
⚫ VLAN-based IP address failover: You can create VLANs to isolate different services.
➢ To implement VLAN-based IP address failover, you must create VLANs, allocate a unique ID
to each VLAN, and use the VLANs to isolate different services. When an Ethernet port on a
VLAN fails, the storage system will automatically locate an available Ethernet port with the
same VLAN ID and switch services to the available Ethernet port. After the faulty port
recovers, it takes over the services.
➢ VLAN names, such as VLAN A and VLAN B, are automatically generated when VLANs are
created. The actual VLAN names depend on the storage system version.
➢ Ethernet ports and their corresponding switch ports are divided into multiple VLANs, and
different IDs are allocated to the VLANs. The VLANs are used to isolate different services.
VLAN A is created on Ethernet port A, and the VLAN ID is 1. Logical port A that is created
based on VLAN A can be used to isolate services. When Ethernet port A fails due to various
causes, the storage system will automatically locate VLAN B and the port whose VLAN ID is
1, delete logical port A, and create the same logical port A based on VLAN B. In this way,
the port where services are running is switched to VLAN B. After Ethernet port A recovers,
the port where services are running will be switched back to VLAN A if failback is enabled.
➢ An Ethernet port can belong to multiple VLANs. When the Ethernet port fails, all VLANs will
fail. Services must be switched to ports of other available VLANs. The service switchover is
executed quickly, and users are not aware of this process.
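A minimal Python sketch of the failover decision described above: when the home port fails, the logical port is deleted and re-created on another healthy port carrying the same VLAN ID. The port names and data structures are hypothetical; real storage systems perform this internally.

# Hypothetical model of VLAN-based IP address failover.
ports = {
    "ethA": {"healthy": False, "vlan_id": 1},
    "ethB": {"healthy": True,  "vlan_id": 1},
    "ethC": {"healthy": True,  "vlan_id": 2},
}
logical_port = {"name": "lportA", "ip": "192.168.1.10", "home": "ethA", "vlan_id": 1}

def fail_over(lport, ports):
    """Move the logical port to a healthy port with the same VLAN ID."""
    if ports[lport["home"]]["healthy"]:
        return lport                         # home port is fine, nothing to do
    for name, port in ports.items():
        if port["healthy"] and port["vlan_id"] == lport["vlan_id"]:
            lport["home"] = name             # delete and re-create on the new port
            break
    return lport

print(fail_over(logical_port, ports)["home"])    # ethB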
2.5.3.2 FC SAN Technologies
FC HBA: The FC HBA converts SCSI packets into Fibre Channel packets, which does not occupy host
resources.
Here are some key concepts in Fibre Channel networking:
⚫ Fibre Channel Routing (FCR) provides connectivity to devices in different fabrics without
merging the fabrics. Different from E_Port cascading of common switches, after switches are
connected through an FCR switch, the two fabric networks are not converged and are still two
independent fabrics. The link switch between two fabrics functions as a router.
⚫ FC Router: a switch running the FC-FC routing service.
⚫ EX_Port: a type of port that functions like an E_Port, but does not propagate fabric services or
routing topology information from one fabric to another.
⚫ Backbone fabric: fabric of a switch running the Fibre Channel router service.
⚫ Edge fabric: fabric that connects a Fibre Channel router.
⚫ Inter fabric link (IFL): the link between an E_Port and an EX_Port, or between a VE_Port and a VEX_Port.
Another important concept is zoning. A zone is a set of ports or devices that communicate with each
other. A zone member can only access other members of the same zone. A device can reside in
multiple zones. You can configure basic zones to control the access permission of each device or
port. Moreover, you can set traffic isolation zones. When there are multiple ISLs (E_Ports), an ISL
only transmits the traffic destined for ports that reside in the same traffic isolation zone.
2.5.3.3 Comparison Between IP SAN and FC SAN
First, let's look back on the concept of SAN.
⚫ Protocol: Fibre Channel/iSCSI. The SAN architectures that use the two protocols are FC SAN and
IP SAN.
⚫ Raw device access: suitable for traditional database access.
⚫ Dependence on the application host to provide file access. Share access requires the support of
cluster software, which causes high overheads in processing access conflicts, resulting in poor
performance. In addition, it is difficult to support sharing in heterogeneous environments.
⚫ High performance, high bandwidth, and low latency, but high cost and poor scalability
Then, let's compare FC SAN and IP SAN.
⚫ To solve the poor scalability issue of DAS, storage devices can be networked using FC SAN to
support connection to more than 100 servers.
⚫ IP SAN is designed to address the management and cost challenges of FC SAN. IP SAN requires
only a few hardware configurations and the hardware is widely used. Therefore, the cost of IP
SAN is much lower than that of FC SAN. Most hosts have been configured with appropriate NICs
and switches, which are also suitable (although not perfect) for iSCSI transmission. High-
performance IP SAN requires dedicated iSCSI HBAs and high-end switches.
High reliability:
⚫ Multiple data protection technologies, such as power failure protection, data pre-copy, coffer
disk, and bad sector repair.
High availability:
⚫ Multiple advanced data protection technologies, such as snapshot, LUN copy, remote
replication, clone, volume mirroring, and active-active, and support for the NDMP protocol.
Intelligence and high efficiency:
⚫ Various control and management functions, such as SmartTier, SmartQoS, and SmartThin,
providing refined control and management.
⚫ DeviceManager supports GUI-based operation and management.
⚫ eService provides self-service intelligent O&M.
2.6.2.2 Product Form
A storage system consists of controller enclosures and disk enclosures, providing customers with an
intelligent storage platform that features high reliability, high performance, and large capacity.
Different types of controller enclosures and disk enclosures are configured for different models.
2.6.2.3 Convergence of SAN and NAS
Convergence of SAN and NAS storage technologies: One storage system supports both SAN and NAS
services at the same time and allows SAN and NAS services to share storage device resources. Hosts
can access any LUN or file system through the front-end port of any controller. During the entire
data life cycle, hot data gradually becomes cold data. If cold data occupies the cache or SSDs for a
long time, valuable resources will be wasted, and the long-term performance of the storage system
will be affected. The storage system uses the intelligent storage tiering technology to flexibly
allocate data storage media in the background.
The intelligent tiering technology needs to be deployed on a device with different media types. Data
is monitored in real time. Data that is not accessed for a long time is marked as cold data and is
gradually transferred from high-performance media to low-speed media, ensuring that service
response from devices does not slow down. After being activated, cold data can be quickly moved to
high-performance media, ensuring stable system performance.
Migration policies can be manually or automatically triggered.
2.6.2.4 Support for Multiple Service Scenarios
Huawei OceanStor hybrid flash storage system integrates SAN and NAS and supports multiple
storage protocols. It is used in a wide range of general-purpose scenarios, including but not limited
to government, finance, telecoms, manufacturing, backup, and DR.
2.6.2.5 Application Scenario – Active-Active Data Centers
Load balancing among controllers
RPO = 0 and RTO ≈ 0 for mission-critical services
Convergence of SAN and NAS: SAN and NAS active-active services can be deployed on the same
device. If a single controller is faulty, local switchover is supported.
Solutions that ensure uninterrupted service running for customers
The active-active solution can be used in industries such as healthcare, finance, and social security.
environments. It improves storage deployment, expansion, and operation and maintenance (O&M)
efficiency using general-purpose servers. Typical scenarios include Internet-finance channel access
clouds, development and testing clouds, cloud-based services, B2B cloud resource pools in carriers'
BOM domains, and e-Government clouds.
Mission-critical database
Huawei OceanStor 100D delivers enterprise-grade capabilities, such as distributed active-active
storage and consistent low latency, to ensure efficient and stable running of data warehouses and
mission-critical databases, including online analytical processing (OLAP) and online transaction
processing (OLTP).
Big data analytics
OceanStor 100D provides an industry-leading decoupled storage-compute solution for big data,
which integrates traditional data silos and builds a unified big data resource pool for enterprises. It
also leverages enterprise-grade capabilities, such as elastic large-ratio erasure coding (EC) and on-
demand deployment and expansion of decoupled compute and storage resources, to improve big
data service efficiency and reduce TCO. Typical scenarios include big data analytics for finance,
carriers (log retention), and governments.
Content storage and backup archiving
OceanStor 100D provides high-performance and highly reliable object storage resource pools to
meet large throughput, frequent access to hotspot data, as well as long-term storage and online
access requirements of real-time online services such as Internet data, online audio/video, and
enterprise web disks. Typical scenarios include storage, backup, and archiving of financial electronic
check images, audio and video recordings, medical images, government and enterprise electronic
documents, and Internet of Vehicles (IoV).
Blade server: for example, the Huawei E9000, which integrates computing, storage, and network resources.
The E9000 provides 12 U space for installing Huawei E9000 series server blades, storage nodes, and
capacity expansion nodes.
High-density server: for example, the Huawei X6800 and X6000 with their high node densities. The Huawei
X6800 is a 4 U server with four nodes, and the Huawei X6000 is a 2 U server with four nodes.
Rack server: for example, the Huawei RH series or TaiShan servers. The TaiShan 2280 is a 2 U 2-socket rack
server. It features high-performance computing, large-capacity storage, low power consumption,
easy management, and easy deployment and is designed for Internet, distributed storage, cloud
computing, big data, and enterprise services.
2.6.4.3 Storage System
The distributed storage system is distributed block storage software installed on servers. It
virtualizes the local disks of these servers into a storage resource pool and provides block
storage services.
2.6.4.4 Software Architecture
The hardware for the FusionCube solution includes servers, switches, SSDs, and uninterruptible
power supply (UPS) resources.
The software for the FusionCube solution includes FusionCube Builder, FusionCube Center, DR and
backup software, and distributed storage software.
2.6.4.5 Application Scenario – Edge Data Center Service Scenario
Plug-and-play: Full-stack edge data center, delivery as an integrated cabinet, zero onsite
configuration, and plug-and-play.
Attendance-free: Unified and centralized management of edge data centers, requiring no dedicated
personnel to be in attendance and significantly reducing O&M costs.
Edge-cloud synergy: Enterprise application ecosystem built on the cloud to quickly deliver
applications from the central data center to edge data centers, accelerating customer service
innovation.
into a thin LUN, the read or write request must be redirected to the actual storage area based on the
mapping relationship between the actual storage area and logical storage area.
A mapping table: This table is used to record the mapping between an actual storage area and a
logical storage area. A mapping table is dynamically updated during the write process and is queried
during the read process.
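The mapping table described above can be modeled with a short Python sketch: physical space is allocated only when a logical area is first written, and reads of unallocated areas return nothing. The grain size and data structures are hypothetical simplifications.

GRAIN = 8192                      # assumed allocation granularity in bytes

class ThinLUN:
    def __init__(self):
        self.mapping = {}         # logical grain index -> physical grain index
        self.next_physical = 0    # next free grain in the storage pool

    def write(self, logical_index, data):
        if logical_index not in self.mapping:        # allocate on first write
            self.mapping[logical_index] = self.next_physical
            self.next_physical += 1
        return self.mapping[logical_index]           # where the data actually lands

    def read(self, logical_index):
        # Unallocated areas consume no space; there is nothing to redirect to yet.
        return self.mapping.get(logical_index)

lun = ThinLUN()
print(lun.write(100, b"x" * GRAIN))   # 0: first grain allocated from the pool
print(lun.read(5))                    # None: never written, no space consumed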
3.1.1.4 Application Scenarios
SmartThin allocates storage space on demand. The storage system allocates space to application
servers as needed within a specific quota threshold, eliminating storage resource waste.
SmartThin can be used in the following scenarios:
⚫ SmartThin expands the capacity of banking transaction systems online without interrupting
ongoing services.
⚫ SmartThin dynamically allocates physical storage space on demand to email services and online
storage services.
⚫ SmartThin allows different services provided by a carrier to compete for physical storage space,
optimizing storage configurations.
3.1.1.5 Configuration Process
To use thin LUNs, you need to import and activate the license file of SmartThin on your storage
device.
After a thin LUN is created, if an alarm is displayed indicating that the storage pool has no available
space, you are advised to expand the storage pool as soon as possible. Otherwise, the thin LUN may
enter the write through mode, causing performance deterioration.
3.1.2 SmartTier
3.1.2.1 Overview
SmartTier is also called intelligent storage tiering. It provides the intelligent data storage
management function that automatically matches data to the storage media best suited to that type
of data by analyzing data activities.
SmartTier migrates hot data to storage media with high performance (such as SSDs) and moves idle
data to more cost-effective storage media (such as NL-SAS disks) with more capacity. This provides
hot data with quick response and high input/output operations per second (IOPS), thereby
improving the performance of the storage system.
3.1.2.2 Storage Tiers
In a storage pool, a storage tier is a collection of storage media that all deliver the same level of
performance. SmartTier divides disks into high-performance, performance, and capacity tiers based
on their performance levels. Each storage tier is comprised of the same type of disks and uses the
same RAID level.
(1) High-performance tier
Disk type: SSDs
Disk characteristics: SSDs deliver high IOPS and respond to I/O requests quickly. However, the cost
per unit of storage capacity is high.
Application characteristics: Applications with intensive random access requests are often deployed
at this tier.
Data characteristics: It carries the most active data (hot data).
(2) Performance tier
3.1.3 SmartQoS
3.1.3.1 Overview
SmartQoS dynamically allocates storage resources to meet certain performance goals of specified
applications.
As storage technologies develop, a storage system is capable of providing larger capacities.
Accordingly, a growing number of users choose to deploy multiple applications on one storage
device. Different applications contend for system bandwidth and Input/Output Operations Per
Second (IOPS) resources, compromising the performance of critical applications.
SmartQoS helps users properly use storage system resources and ensures high performance of
critical services.
SmartQoS enables users to set performance indicators like IOPS or bandwidth for certain
applications. The storage system dynamically allocates system resources to meet QoS requirements
of certain applications based on specified performance goals. It gives priority to certain applications
with demanding QoS requirements.
3.1.3.2 I/O Priority Scheduling
The I/O priority scheduling technology of SmartQoS is based on the priorities of LUNs, file systems,
or snapshots. Their priorities are determined by users based on the importance of services deployed
on the LUNs, file systems, or snapshots.
Each LUN or file system has a priority, which is configured by a user and saved in a storage system.
When an I/O request enters the storage system, the storage system gives a priority to the I/O
request based on the priority of the LUN, file system, or snapshot that will process the I/O request.
Then the I/O carries the priority throughout this processing procedure. When system resources are
insufficient, the system preferentially processes high-priority I/Os to improve the performance of
high-priority LUNs, file systems, or snapshots.
I/O priority scheduling is to schedule storage system resources, including CPU compute time and
cache resources.
3.1.3.3 I/O Traffic Control
SmartQoS traffic control consists of I/O class queue management, token distribution, and dequeue
control of controlled objects.
I/O traffic control uses a token-based mechanism. When a user sets an upper limit for the performance
of a traffic control group, that limit is converted into a corresponding number of tokens. If the IOPS
is limited, each I/O operation consumes one token; if the bandwidth is limited, tokens are allocated
per sector of data transferred.
The storage system adjusts the number of tokens in an I/O queue based on the priority of a LUN, file
system, or snapshot. The more tokens a LUN, file system, or snapshot I/O queue has, the more
resources the system allocates to the LUN, file system, or snapshot I/O queue. The storage system
preferentially processes I/O requests in the I/O queue of the LUN, file system, or snapshot.
For example, if a user enables a SmartQoS policy for two LUNs in a storage system and sets
performance objectives in the SmartQoS policy, the storage system can limit the system resources
allocated to the LUNs to reserve more resources for high-priority LUNs. The performance goals
measured by the SmartQoS are bandwidth, IOPS, and latency.
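The token-based traffic control described above behaves like a classic token bucket: tokens accrue at the configured limit, and an I/O proceeds only if enough tokens are available. The following Python sketch is a generic illustration of that idea, not Huawei's implementation; the rate and burst values are arbitrary.

import time

class TokenBucket:
    """Generic token bucket: tokens accrue at `rate` per second up to `burst`."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost      # an IOPS limit costs 1 token per I/O;
            return True              # a bandwidth limit would cost tokens per sector
        return False

limiter = TokenBucket(rate=1000, burst=1000)        # e.g. a 1000 IOPS upper limit
print(sum(limiter.allow() for _ in range(1500)))    # roughly 1000 I/Os admitted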
3.1.3.4 Application Scenarios
SmartQoS dynamically allocates storage resources to ensure performance for critical services and
high-priority users.
1. Ensuring the performance for critical services
If OLTP and archive backup services are concurrently running on the same storage device, both
services need sufficient system resources.
(1) Online Transaction Processing (OLTP) service is a key service and has high requirements on
real-time performance.
(2) Backup service has a large amount of data with fewer requirements on latency.
SmartQoS specifies the performance objectives for different services to ensure the performance
of critical services. You can use either of the following methods to ensure the performance of
critical services: Set I/O priorities to meet the high priority requirements of OLTP services, and
create SmartQoS traffic control policies to meet service requirements.
2. Ensuring the service performance for high-level users
For cost reduction, some users will not build their dedicated storage systems independently.
They prefer to run their storage applications on the storage platforms offered by storage
resource providers. This lowers the total cost of ownership (TCO) and ensures the service
continuity. On such shared storage platforms, services of different types and features contend
for storage resources, so the high-priority users may fail to obtain their desired storage
resources.
SmartQoS creates SmartQoS policies and sets I/O priorities for different subscribers. In this way,
when resources are insufficient, services of high-priority subscribers can be preferentially
processed and their service quality requirements can be met.
3.1.3.5 Configuration Process
Before configuring SmartQoS, read the configuration process and check its license file. The service
monitoring function provided by the storage system can be used to obtain the I/O characteristics of
LUNs or file systems and use them as the basis of SmartQoS policies. SmartQoS policies adjust and
control applications to ensure continuity of critical services.
3.1.4 SmartDedupe
3.1.4.1 Overview
SmartDedupe eliminates redundant data from a storage system. This deduplication technology
reduces the amount of physical storage capacity occupied by data to release more storage capacity
for increasing services.
Huawei OceanStor storage systems provide inline deduplication to deduplicate the data that is newly
written into the storage systems.
Inline deduplication deletes duplicate data before the data is written into disks.
Similarity-based (post-process) deduplication analyzes data that has already been written to disks
based on similarity fingerprints to find duplicate and similar data blocks, and then deduplicates them.
3.1.4.2 Inline Deduplication Working Principles
Deduplication data block size specifies the deduplication granularity.
Fingerprint is a fixed-length binary value that represents a data block. OceanStor Dorado V6 uses a
weak hash algorithm to calculate a fingerprint for any data block. The storage system saves the
mapping relationship between the fingerprints and storage locations of all data blocks in a
fingerprint library.
Inline deduplication working principles
1. The storage system uses a weak hash algorithm to calculate a fingerprint for any data block that
is newly written into the storage system.
2. The storage system checks whether the fingerprint for the new data block is consistent with the
fingerprint in the fingerprint library.
➢ If yes, a byte-by-byte comparison is performed by the storage system.
➢ If no, the storage system determines that the data newly written is a new data block,
writes the new data block to a disk, and records the mapping between the fingerprint and
storage location of the data block in the fingerprint library.
3. The storage system performs a byte-by-byte comparison to check whether the new data block
is consistent with the existing one.
➢ If yes, the storage system determines that the new data block is the same as the existing
one, deletes the data block, and directs the fingerprint and storage location mapping of the
data block to the original data block in the fingerprint library.
➢ If no, the storage system writes the data block to a disk. The processing procedure is the
same as that when the deduplication function is disabled.
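The inline deduplication flow above can be condensed into a small Python sketch. An in-memory dict stands in for the fingerprint library, SHA-1 stands in for the unspecified weak hash, and the byte-by-byte comparison step is kept; none of this reflects the product's internal data structures.

import hashlib

fingerprints = {}    # fingerprint -> storage location (the "fingerprint library")
storage = {}         # storage location -> data block
next_loc = 0

def write_block(data: bytes) -> int:
    """Inline deduplication: return the location where `data` ends up stored."""
    global next_loc
    fp = hashlib.sha1(data).digest()         # stands in for the weak hash
    if fp in fingerprints:
        loc = fingerprints[fp]
        if storage[loc] == data:             # byte-by-byte comparison
            return loc                       # duplicate: point to the existing block
    loc = next_loc
    next_loc += 1
    storage[loc] = data                      # new block: write it to "disk"
    fingerprints[fp] = loc                   # record fingerprint -> location
    return loc

a = write_block(b"A" * 8192)
b = write_block(b"A" * 8192)
print(a == b, len(storage))                  # True 1 -> only one copy is stored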
3.1.4.3 Post-processing Similarity Deduplication Working Principles
An opportunity table saves data blocks' fingerprint and location information for identifying hot
fingerprints.
The process of deleting similar duplicate data is as follows:
1. Write data and calculate the fingerprint and write it to the opportunity table.
A storage system divides the newly written data into blocks, uses the similar fingerprint
algorithm to calculate the similar fingerprint of the newly written data block, writes the data
block to a disk, and writes the fingerprint and location information of the data block to the
opportunity table.
2. After data is written, the storage system periodically performs similarity deduplication.
(1) The storage system periodically checks whether there are similar fingerprints in the
opportunity table.
➢ If yes, it performs byte-by-byte comparison.
➢ If no, it continues the periodic check.
(2) The storage system performs a byte-by-byte comparison to check whether similar blocks
are actually the same.
➢ If they are the same, the storage system determines that the new data block is the
same as the original data block. The storage system deletes the data block, and maps
its fingerprint and storage location to the remaining data block.
➢ If they are just similar, the system performs differential compression on the data blocks,
records their fingerprints in the fingerprint library, updates the metadata of the data blocks
with the fingerprints, and reclaims the space of the redundant data blocks.
3.1.4.4 Application Scenarios
In Virtual Desktop Infrastructure (VDI) applications, users create multiple virtual images on a physical
storage device. These images have a large amount of duplicate data. As the amount of duplicate
data increases, the storage system struggles to keep up with service requirements. SmartDedupe
can delete duplicate data among images to release storage resources and store more service data.
3.1.4.5 Configuration Process
When creating a LUN, you need to select an application type. The deduplication function of the
application type has been configured by default. You can run commands to view the deduplication
and compression status of each application type.
3.1.5 SmartCompression
3.1.5.1 Overview
SmartCompression reorganizes data to save space and improves the data transfer, processing, and
storage efficiency without losing any data. OceanStor Dorado V6 series storage systems support
inline compression and post-process compression.
Inline compression: The system deduplicates and compresses data before writing it to disks. User
data is processed in real time.
Post-process compression: Data is written to disks in advance and then read and compressed when
the system is idle.
3.1.5.2 Working Principles
The LZ77 algorithm is a lossless compression algorithm. It replaces repeated data with a reference
to an identical string that has already been encoded, so compression is achieved by exploiting the
repetition between the current data and previously encoded data.
LZ77 uses a sliding window to implement this. The scan head moves through the string from the
beginning, and a sliding window of length N precedes the scan head. If the string at the scan head
matches a string in the window, the pair (offset in the window, longest match length) is used to
replace the repeated string.
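To make the sliding-window idea concrete, here is a toy LZ77-style encoder and decoder in Python. The window size and the (offset, length, next byte) token format are simplified for illustration and are not the exact variant used by the storage system.

def lz77_encode(data: bytes, window: int = 32):
    """Toy LZ77 encoder: emit (offset, length, next_byte) tokens."""
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):     # search the sliding window
            length = 0
            while (i + length < len(data) and
                   data[j + length] == data[i + length] and length < 255):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_byte = data[i + best_len] if i + best_len < len(data) else None
        tokens.append((best_off, best_len, next_byte))
        i += best_len + 1
    return tokens

def lz77_decode(tokens):
    out = bytearray()
    for off, length, nxt in tokens:
        for _ in range(length):
            out.append(out[-off])                  # copy from the sliding window
        if nxt is not None:
            out.append(nxt)
    return bytes(out)

data = b"abababababx"
tokens = lz77_encode(data)
assert lz77_decode(tokens) == data                 # lossless round trip
print(tokens)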
3.1.5.3 Application Scenarios
More CPU resources will be occupied as the volume of data that is compressed by the storage
system increases.
1. Databases: Data compression is an ideal choice for databases. Many users would be happy to
sacrifice a little performance to recover over 65% of their storage capacity.
2. File services: Data compression is also applied to file services. Peak hours occupy half of the
total service time and the dataset compression ratio of the system is 50%. In this scenario,
SmartCompression slightly decreases the IOPS.
3. Engineering data and seismic geological data: The data has similar requirements to database
backups. This type of data is stored in the same format, but there is not as much duplicate data.
Such data can be compressed to save the storage space.
SmartDedupe + SmartCompression
Deduplication and compression can be used together to save more space. SmartDedupe can be
combined with SmartCompression for data testing or development systems, storage systems with a
file service enabled, and for engineering data systems.
3.1.5.4 Configuration Process
The SmartCompression configuration process includes checking the license and enabling
SmartCompression. Checking the license is to ensure the availability of SmartCompression.
3.1.6 SmartMigration
3.1.6.1 Overview
SmartMigration is a key service migration technology. Service data can be migrated within a storage
system and between different storage systems.
"Consistent" means that after the service migration is complete, all of the service data has been
replicated from a source LUN to a target LUN.
3.1.6.2 Working Principles
SmartMigration synchronizes and splits service data to migrate all data from a source LUN to a target
LUN.
3.1.6.3 SmartMigration Service Data Synchronization
Pair: In SmartMigration, a pair is a source LUN and the target LUN that the data will be migrated to.
A pair can have only one source LUN and one target LUN.
LM modules manage SmartMigration in a storage system.
Dual-write writes changed data to both the source and target LUNs during service data
migration.
A LOG records data changes on the source LUN and is used to determine whether the changed data
must be concurrently written to the target LUN using the dual-write technology.
A data change log (DCL) records differential data that fails to be written to the target LUN during the
data change synchronization.
Service data synchronization between a source LUN and a target LUN includes initial synchronization
and data change synchronization. The two synchronization modes are independent and can be
performed at the same time to ensure that service data changes on the host are synchronized to the
source LUN and the target LUN.
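A minimal Python sketch of dual-write with a data change log (DCL), as described above: a host write during migration goes to both LUNs, and any write that fails to reach the target LUN is recorded so it can be synchronized later. The grain-level model and names are hypothetical.

# Hypothetical model of dual-write plus a data change log (DCL).
dcl = set()                        # grains whose write to the target LUN failed

def host_write(grain, data, source, target, target_ok=True):
    """During migration, writes go to both LUNs; failures are recorded in the DCL."""
    source[grain] = data
    if target_ok:
        target[grain] = data
    else:
        dcl.add(grain)             # differential data to re-synchronize later

source, target = {}, {}
host_write(0, "new", source, target)
host_write(1, "new", source, target, target_ok=False)
print(dcl)                         # {1}: grain to copy before the pair is split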
3.1.6.4 SmartMigration Pair Splitting
Splitting is performed on a single pair. The splitting process includes stopping service data
synchronization between the source and target LUNs in a pair to exchange LUN information, and
removing the data migration relationship after the exchange is complete.
During splitting, host services are briefly suspended. After information is exchanged, services are
delivered to the target LUN. The service migration is transparent to users.
1. LUN information exchange
Before a target LUN can take over services from a source LUN, the two LUNs must synchronize
and then exchange information.
2. Pair splitting
Pair splitting: Data migration relationship between a source LUN and a target LUN is removed
after LUN information is exchanged.
The consistency splitting of SmartMigration means that multiple pairs exchange LUN
information at the same time and concurrently remove pair relationships after the information
exchange is complete, ensuring data consistency at any point in time before and after
pairs are split.
3.1.6.5 Configuration Process
The configuration processes of SmartMigration in a storage system include checking the license file,
creating a SmartMigration task, and splitting a SmartMigration task.
A target LUN stores a point-in-time data duplicate of the source LUN only after the pair is split. The
data duplicate can be used to recover the source LUN in the event that the source LUN is damaged.
In addition, the data duplicate is accessible in scenarios such as application testing and data analysis.
3.2.1 HyperSnap
3.2.1.1 Overview
HyperSnap is a consistent copy of the source data at a specific point in time. It is an available
copy of a specified data set and contains a static image of the source data at the point in time
when the copy was created.
3.2.1.2 Working Principles
Redirect on write (ROW) technology is the core technology for snapshot. In the overwrite scenario, a
new space is allocated to new data. After the write operation is successful, the original space is
released.
Data organization: LUNs created in a storage pool of OceanStor Dorado V6 consist of data and
metadata volumes.
⚫ A metadata volume records the data organization information (logical block address (LBA),
version, and clone ID) and data attributes. A metadata volume is organized in a tree structure.
Logical block address (LBA) indicates the address of a logical block. Version corresponds to the
point in time of a snapshot, and Clone ID is the number of data copies.
⚫ Data volume stores data written to a LUN.
Source volume stores the source data for which a snapshot is to be created. It is presented as a LUN
to users.
Snapshot volume is a logical data copy generated after a snapshot is created for a source volume. It
is presented as a snapshot LUN to users.
Inactive: A snapshot is in the inactive state. In this state, the snapshot is unavailable and can be used
only after being activated.
3.2.1.3 Lossless Performance
The process of activating a snapshot is to save a data state of a source object at the time when it is
activated. Specific operations include creating a mirror for the source object of the snapshot and
associating the mirror with an activated snapshot. The principle of creating a mirror for the source
object is as follows: Common LUNs and snapshots in the OceanStor Dorado V6 all-flash storage
system use the ROW-based read and write mode. In ROW-based read and write mode, each time the
host writes new data, the system reallocates space to store the new data and updates the LUN
mapping table to point to the new data space. As shown in the slide, L0 to L4 are logical addresses,
P0 to P8 are physical addresses, and A to I are data.
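The following minimal Python sketch illustrates the ROW idea described above. It is illustrative only; the class and method names (for example, SimpleRowLun) are hypothetical and do not reflect OceanStor internals. Each write goes to newly allocated space, only the mapping table is updated, and activating a snapshot simply freezes a copy of the mapping table.

# Minimal redirect-on-write (ROW) sketch. Illustrative only.
class SimpleRowLun:
    def __init__(self):
        self.next_free = 0          # next free physical block address
        self.mapping = {}           # LUN mapping table: LBA -> physical address
        self.physical = {}          # physical space: address -> data
        self.snapshots = []         # each snapshot is a frozen copy of the mapping table

    def _allocate(self, data):
        addr = self.next_free
        self.next_free += 1
        self.physical[addr] = data
        return addr

    def write(self, lba, data):
        """Every write (new data or overwrite) goes to newly allocated space."""
        old = self.mapping.get(lba)
        self.mapping[lba] = self._allocate(data)
        # Release the old space only if no snapshot still references it.
        if old is not None and not any(s.get(lba) == old for s in self.snapshots):
            del self.physical[old]

    def read(self, lba):
        return self.physical[self.mapping[lba]]

    def create_snapshot(self):
        """Activating a snapshot just freezes the current mapping table."""
        self.snapshots.append(dict(self.mapping))
        return len(self.snapshots) - 1

    def read_snapshot(self, snap_id, lba):
        return self.physical[self.snapshots[snap_id][lba]]

lun = SimpleRowLun()
lun.write(0, "A")
snap = lun.create_snapshot()   # point-in-time image of the source LUN
lun.write(0, "B")              # the overwrite is redirected to new space
assert lun.read(0) == "B" and lun.read_snapshot(snap, 0) == "A"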
3.2.1.4 Snapshot Rollback
When required, the rollback function can immediately restore the data in a storage system to the
state at the snapshot point in time, offsetting data damage or data loss caused by misoperations or
viruses after the snapshot point in time.
Rollback copies the snapshot data back to a target object (a source LUN or another snapshot). Once a rollback starts, the target object can immediately present the snapshot data. To make the target object available immediately after the rollback starts, the system performs the following operations:
⚫ Read redirection: reads that hit data not yet rolled back on the target object are redirected to the snapshot.
⚫ Rollback before write: before new data is written to the target object, the corresponding snapshot data is rolled back to that location first.
Stopping a rollback only stops the data copy process; it cannot restore the data of the target object to the state before the rollback. Therefore, you are advised to create a snapshot for data protection before starting a rollback.
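A hedged sketch of the two rollback mechanisms follows, assuming a simple block map; the names are hypothetical and real storage systems handle this at a much lower level. Reads of not-yet-rolled-back blocks are redirected to the snapshot, and a host write first rolls back the affected block so the background copy cannot later overwrite the new data.

# Illustrative snapshot rollback sketch (hypothetical names, not OceanStor code).
class RollbackTarget:
    def __init__(self, target_blocks, snapshot_blocks):
        self.target = dict(target_blocks)      # current target object data
        self.snapshot = dict(snapshot_blocks)  # point-in-time snapshot data
        self.rolled_back = set()               # LBAs already copied back

    def _rollback_block(self, lba):
        if lba not in self.rolled_back:
            if lba in self.snapshot:
                self.target[lba] = self.snapshot[lba]
            self.rolled_back.add(lba)

    def read(self, lba):
        # Read redirection: blocks not yet rolled back are served from the snapshot.
        if lba not in self.rolled_back and lba in self.snapshot:
            return self.snapshot[lba]
        return self.target.get(lba)

    def write(self, lba, data):
        # Rollback before write: roll the block back first, then apply the host data.
        self._rollback_block(lba)
        self.target[lba] = data

    def background_rollback(self):
        # The background copy process skips blocks that were already handled.
        for lba in list(self.snapshot):
            self._rollback_block(lba)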
3.2.2 HyperClone
3.2.2.1 Overview
HyperClone allows you to obtain full copies of LUNs without interrupting host services. These copies
can be used for data backup and restoration, data reproduction, and data analysis.
3.2.2.2 Working Principles
HyperClone provides a full copy of the source LUN's data at the synchronization start time. The
target LUN can be read and written immediately, without waiting for the copy process to complete.
The source and target LUNs are physically isolated. Operations on the member LUNs do not affect
each other. When data on the source LUN is damaged, data can be reversely synchronized from the
target LUN to the source LUN. A differential bitmap records the data written to the source and
target LUNs to support subsequent incremental synchronization.
3.2.2.3 Synchronization
When a HyperClone pair starts synchronization, the system generates an instant snapshot for the
source LUN, synchronizes the snapshot data to the target LUN, and records subsequent write
operations in a differential table.
When synchronization is performed again, the system compares the data of the source and target
LUNs, and only synchronizes the differential data to the target LUN. The data written to the target
LUN between the two synchronizations will be overwritten. Before synchronization, users can create
a snapshot for a target LUN to retain its data changes.
Relevant concepts:
1. Pair: In HyperClone, a pair has one source LUN and one target LUN. A pair is a mirror
relationship between the source and target LUNs. A source LUN can form multiple HyperClone
pairs with different target LUNs. A target LUN can be added to only one HyperClone pair.
2. Synchronization: Data is copied from a source LUN to a target LUN.
3. Reverse synchronization: If data on the source LUN needs to be restored, you can reversely
synchronize data from the target LUN to the source LUN.
4. Differential copy: The differential data can be synchronized from the source LUN to the target LUN based on the differential bitmap (see the sketch after this list).
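The sketch below (referenced in item 4 above) illustrates how a differential bitmap drives incremental synchronization. It is a simplified illustration with hypothetical names, not HyperClone code.

# Simplified differential-bitmap synchronization sketch (illustrative only).
class ClonePair:
    def __init__(self, source, target):
        self.source = source            # dict: LBA -> data on the source LUN
        self.target = target            # dict: LBA -> data on the target LUN
        self.diff_bitmap = set()        # LBAs written since the last synchronization

    def host_write(self, lba, data):
        self.source[lba] = data
        self.diff_bitmap.add(lba)       # record the change for later differential copy

    def synchronize(self):
        # Only differential data is copied; unchanged blocks are skipped.
        for lba in self.diff_bitmap:
            self.target[lba] = self.source[lba]
        self.diff_bitmap.clear()

    def reverse_synchronize(self):
        # Restore the source from the target (full copy in this simplified sketch).
        self.source.update(self.target)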
3.2.2.4 Reverse Synchronization
If a source LUN is damaged, data on a target LUN can be restored to the source LUN. All or
differential data can be copied to the source LUN.
When reverse synchronization starts, the system generates a snapshot for a target LUN and
synchronizes all of the snapshot data to a source LUN. For incremental reverse synchronization, the
system compares the data of the source and target LUNs, and only copies the differential data to the
source LUN.
The source and target LUNs can be read and written immediately after reverse synchronization
starts.
3.2.2.5 Restrictions on Feature Configuration
HyperClone is subject to restrictions when used together with other features of Huawei OceanStor Dorado storage systems.
3.2.2.6 Application Scenarios
HyperClone is widely used. It can serve as a backup source, as a data mining source, and as a
checkpoint for application status. After the synchronization is complete, the source and target LUNs
of HyperClone are physically isolated. Operations on the source and target LUNs do not affect each
other.
Application Scenario 1: Data Backup and Restoration
HyperClone generates one or multiple copies of source data to achieve point-in-time backup, which
can be used to restore the source LUN data in the event of data corruption. Using a target LUN allows data to be backed up online and restored more quickly.
Application Scenario 2: Data Analysis and Reproduction
Data analysis examines large amounts of data to extract useful information, draw conclusions, and support decision-making. The analysis services use data on target LUNs to prevent
the analysis and production services from competing for source LUN resources, ensuring system
performance.
Data reproduction (HyperClone) can create multiple copies of the same source LUN for multiple
target LUNs.
3.2.2.7 Configuration Process
Configure HyperClone on DeviceManager.
Create a clone to copy data to a target LUN.
Create a protection group to protect data. Before this operation, you need to create the objects to be protected (LUNs or LUN groups). A protection group is a collection of one LUN group or multiple LUNs.
Create a clone consistency group to facilitate unified operations on clones, improve efficiency, and ensure that the data of all clones in the group is consistent at the same point in time.
3.2.3 HyperReplication
3.2.3.1 Overview
As digitization advances in various industries, data has become critical to the efficient operation of
enterprises, and users impose increasingly demanding requirements on the stability of storage
systems. Although many enterprises have highly stable storage systems, it is a big challenge for them
to ensure data restoration from damage caused by natural disasters. To ensure continuity,
recoverability, and high availability of service data, remote DR solutions emerge. The
HyperReplication technology is one of the key technologies used in remote DR solutions.
HyperReplication is Huawei's remote replication feature. It provides a flexible and powerful data replication function that facilitates remote data backup and restoration, continuous protection of service data, and disaster recovery.
A primary site is a production center that includes primary storage systems, application servers, and
links.
A secondary site is a backup center that includes secondary storage systems, application servers, and
links.
HyperReplication supports the following two modes:
⚫ Synchronous remote replication between LUNs: Data is synchronized between primary and
secondary LUNs in real time. No data is lost when a disaster occurs. However, the performance
of production services is affected by the latency of the data transmission between primary and
secondary LUNs.
⚫ Asynchronous remote replication between LUNs: Data is periodically synchronized between
primary and secondary LUNs. The performance of production services is not affected by the
latency of the data transmission between primary and secondary LUNs. If a disaster occurs,
some data may be lost.
3.2.3.2 Introduction to DR and Backup
When two data centers (DCs) use HyperReplication, they work in active/standby mode. The
production center is in the service running status, and the DR center is in the non-service running
status.
For active/standby DR, if a device in DC A is faulty or even if the entire DC A is faulty, services are
automatically switched to DC B.
For backup, DC B backs up only data in DC A and does not carry services when DC A is faulty.
3.2.3.3 Relevant Concepts
⚫ Pair: A pair refers to the data replication relationship between a primary LUN and a secondary
LUN. The primary LUN and secondary LUN of a pair must belong to different storage systems.
⚫ Data status: HyperReplication identifies the data status of the current pair based on the data
difference between primary and secondary LUNs. When a disaster occurs, HyperReplication
determines whether to allow a primary/secondary switchover for a pair based on the data
status of the pair. There are two types of pair data status: complete and incomplete.
⚫ Writable secondary LUN: Data delivered by a host can be written to a secondary LUN. A
secondary LUN can be set to writable in two scenarios:
➢ A primary LUN fails and the HyperReplication links are disconnected. In this case, a
secondary LUN can be set to writable in the secondary storage system.
➢ A primary LUN fails but the HyperReplication links are in normal state. The pair must be
split before you enable the secondary LUN to be writable in the primary or secondary
storage system.
⚫ Consistency group: A collection of pairs whose services are associated. For example, the primary
storage system has three primary LUNs, which respectively store service data, log, and change
tracking information of a database. If data on any of the three LUNs is invalid, all data on the
three LUNs becomes invalid. The pairs to which the three LUNs belong form a consistency
group.
⚫ Synchronization: Data is replicated from a primary LUN to a secondary LUN. HyperReplication
involves initial synchronization and incremental synchronization.
⚫ Split: Data synchronization between primary and secondary LUNs is suspended. After splitting,
there is still the pair relationship between the primary LUN and the secondary LUN. Hosts'
access permission for both the LUNs remains unchanged. At some time, users may not want to
copy data from the primary LUN to the secondary LUN. For example, if the bandwidth is
insufficient to support critical services, you need to suspend the data synchronization of
HyperReplication over the links. In this case, you can perform the split operation to suspend
data synchronization.
⚫ Primary/secondary switchover: A process during which primary and secondary LUNs in a pair
are switched over. This process changes the primary/secondary relationship of LUNs in a
HyperReplication pair.
3.2.3.4 Phases
HyperReplication involves the following phases: creating a HyperReplication relationship,
synchronizing data, switching over services, and restoring data.
1. Create a HyperReplication pair.
2. Synchronize all data manually or automatically from the primary LUN to the secondary LUN of
the HyperReplication pair. In addition, periodically synchronize incremental data on the primary
LUN to the secondary LUN.
3. Check the data status of the HyperReplication pair and the read/write properties of the
secondary LUN to determine whether a primary/secondary switchover can be performed. Then
perform a primary/secondary switchover to form a new HyperReplication pair.
4. Synchronize data from the secondary storage system to the primary storage system. Then
perform a primary/secondary switchover to restore to the original pair relationship.
1. Running status of a pair
You can perform synchronization, splitting, and primary/secondary switchover operations on a
HyperReplication pair based on its running status. After performing an operation, you can view
the running status of the pair to check whether the operation is successful.
2. Working principles of asynchronous remote replication
Asynchronous remote replication periodically replicates data from the primary storage system
to the secondary storage system.
Asynchronous remote replication of Huawei OceanStor storage systems adopts the innovative
multi-time-point caching technology.
3. HyperReplication service switchover
If a primary site suffers a disaster, a secondary site can quickly take over its services to protect
service continuity. The RPO and RTO indicators must be considered during service switchover.
Requirements for running services on the secondary storage system:
➢ Before a disaster occurs, data in a primary LUN is consistent with that in a secondary LUN.
If data in the secondary LUN is incomplete, services may fail to be switched.
➢ Services on the production host have also been configured on the standby host.
➢ The secondary storage system allows a host to access a LUN in a LUN group mapped to the
host.
If a disaster occurs and the primary site is invalid, the HyperReplication links between primary
and secondary LUNs are down. In this case, the administrator needs to manually set the
read/write permissions of the secondary LUN to writable mode to implement service
switchover.
4. Data restoration
After a primary site fails, a secondary site temporarily takes over services of the primary site.
When the primary site recovers, services are switched back.
After the primary site recovers from a disaster, it is required to rebuild a HyperReplication
relationship between primary and secondary storage systems and use data on the secondary
site to restore data on the primary site.
5. Function of a consistency group
Users can perform synchronization, splitting, and primary/secondary switchover operations on a
single HyperReplication pair or manage multiple HyperReplication pairs by using a consistency
group. If associated LUNs have been added to a HyperReplication consistency group, the
consistency group can effectively prevent data loss.
3.2.3.5 Application Scenarios
HyperReplication is used for data DR and backup. The typical application scenarios include central
DR and backup, geo-redundancy, and realizing DR with BCManager eReplication.
Different HyperReplication modes apply to different application scenarios.
Asynchronous remote replication applies to backup and disaster recovery scenarios where the
network bandwidth is limited or a primary site is far from a secondary site (for example, across
countries or regions).
3.2.3.6 Configuration Process
You can set up a pair relationship between primary and secondary resources to synchronize data.
Unless otherwise specified, the operations in the slide can be performed on either the primary or
the secondary storage device. If you want to perform the operations on the secondary storage
device, perform a primary/secondary switchover first.
3.2.4 HyperMetro
3.2.4.1 Overview
HyperMetro is Huawei's active-active storage solution. Two DCs enabled with HyperMetro back up
each other and both carry services. If a device in a DC is faulty or the entire DC is faulty,
the other DC will automatically take over services, solving the switchover problems of traditional DR
centers. This ensures high data reliability and service continuity, and improves the resource
utilization of the storage system.
3.2.4.2 Working Principles
1. Local DC deployment: In most cases, hosts are deployed in different equipment rooms in the
same industrial park. Hosts are deployed in cluster mode. Hosts and storage devices
communicate with each other through switches. Fibre Channel switches and IP switches are
supported. In addition, dual-write mirroring channels are deployed between the storage
systems to ensure the HyperMetro services are running correctly.
2. Cross-DC deployment: Generally, hosts are deployed in two DCs in the same city or adjacent
cities. The physical distance between the two centers is within 300 km. Both are running and
can carry the same services at the same time, improving the overall service capability and
system resource utilization of the DCs. If one DC is faulty, services are automatically switched to
the other one.
In cross-DC deployments involving long-distance transmission (a minimum of 80 km for IP
networking and 25 km for Fibre Channel networking), dense wavelength division multiplexing
(DWDM) devices must be used to ensure a low transmission latency. In addition, HyperMetro
mirroring channels are deployed between the storage systems to ensure the HyperMetro services
are running correctly.
The HyperMetro solution has the following characteristics:
⚫ The data dual-write technology ensures storage redundancy. No data is lost if there is only one
storage system running or the production center fails. Services are switched over quickly,
maximizing customer service continuity. This solution meets the service requirements of RTO =
0 and RPO = 0.
⚫ HyperMetro and SmartVirtualization can be used together to support heterogeneous storage
and consolidate resources on the network layer to protect the existing investment of the
customer.
⚫ This solution can be smoothly upgraded to the 3DC solution with HyperReplication.
Based on the preceding features, the HyperMetro solution can be widely used in industries such as
healthcare, finance, and social security.
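A minimal sketch of the dual-write idea follows, assuming two generic storage systems; the classes are hypothetical, and real HyperMetro also involves locking, arbitration, and failure handling. The point is that a host write is acknowledged only after both copies are written, which is what yields RPO = 0.

# Illustrative dual-write sketch for an active-active pair (hypothetical classes).
class StorageSystem:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, lba, value):
        self.data[lba] = value
        return True

class ActiveActivePair:
    def __init__(self, array_a, array_b):
        self.array_a = array_a
        self.array_b = array_b

    def host_write(self, lba, value):
        # Dual write over the mirroring channel: both copies must succeed
        # before the host receives a write-complete response (RPO = 0).
        ok_a = self.array_a.write(lba, value)
        ok_b = self.array_b.write(lba, value)
        return ok_a and ok_b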
Generally, backup refers to data backup or system backup, while DR refers to data backup or
application backup across equipment rooms. Backup is implemented using backup software,
whereas DR is implemented using replication or mirroring software. The differences between the
two are as follows:
1. DR is designed to protect data against natural disasters, such as fires and earthquakes. Therefore, a backup center must be set up at a certain distance from the production center. In contrast, data backup is performed within a data center.
2. A DR system not only protects data but also guarantees business continuity. In contrast, data
backup only focuses on data security.
3. DR protects data integrity. In contrast, backup can only help recover data from a point in time
when a backup task is performed.
4. DR is performed in online mode while backup is performed in offline mode.
5. Data at the two sites of a DR system is kept consistent in real time, whereas backup data is only as current as the most recent backup.
6. When a fault occurs, a DR switchover process in a DR system lasts seconds to minutes, while a
backup system takes hours and maybe even dozens of hours to recover data.
Backup and archiving systems are designed to protect data in different ways and the combination of
the two systems will provide more effective data protection. Backup is designed to protect data by
storing data copies. Archiving is designed to protect data by organizing and storing data for a long
term in a data management manner. In other words, backup can be considered as short-term
retention of data copies, while archiving can be considered as long-term retention of files. In
practice, the original data is usually not deleted after it is backed up, but it can be deleted after it is archived because swift access to it is no longer needed. Backup and
archiving work together to better protect data.
4.1.2 Architecture
4.1.2.1 Components
A backup system typically consists of three components: backup software, backup media, and
backup server.
The backup software is the core of a backup system. It is used for creating backup policies, managing media, and creating and managing copies of production data on storage media. Some backup software can be upgraded with more functions, such as protection, backup, archiving, and recovery.
Backup media include tape libraries, disk arrays, and virtual tape libraries. A virtual tape library is essentially a disk array that presents its disk storage as a tape library. Virtual tape libraries remain compatible with tape backup management software and conventional backup processes while offering far better availability and reliability than mechanical tapes.
The backup server provides services for executing backup policies. The backup software resides and
runs on the backup server. Generally, a backup software client agent needs to be installed on the
service host to be backed up.
Three elements of a backup system are Backup Window (BW), Recovery Point Objective (RPO), and
Recovery Time Objective (RTO).
BW indicates a duration of time allowed for backing up the service data in a service system without
affecting the normal operation of the service system.
RPO measures how recent the backup data used for DR switchover is. A smaller RPO means less data is lost.
RTO refers to the acceptable duration of time, and the service level, within which a business process must be restored in order to minimize the impact of interruption on services.
4.1.2.2 Backup Solution Panorama
Huawei backup solutions include all-in-one, centralized, and cloud backup solutions.
Huawei Data Protection Appliance is a data protection and management product that integrates the
backup software, backup server, and backup storage. With the distributed architecture, Huawei Data
Protection Appliance supports the linear increase in both performance and capacity. Only one
system needs to be deployed to protect, construct, and manage user data and applications. This
helps users better protect data, save data protection investment, and simplify the data management
process. While excelling in a wide range of scenarios, Huawei Data Protection Appliance is suited for
industries such as government, finance, carrier, healthcare, and manufacturing.
The Data Protection Appliance provides a graphical management system, which facilitates users to
manage and maintain the software and hardware of a backup system in a centralized manner.
Centralized backup uses a backup management node to manage local and remote data centers and schedule backup tasks in a centralized manner, remarkably simplifying the operation of the backup system and giving users an overall view from which to manage and control it.
Centralized backup has the following advantages:
1. Enables centralized management of backup and recovery tasks of data from a variety of
applications, to form a unified management policy.
2. Integrates backup resources, to optimize utilization of backup resources.
3. Provides flexible scalability by allowing tapes, clients, and tape libraries to be added as needed.
4. Simplifies management by allowing fewer management engineers to manage more devices and
systems.
Cloud backup involves backing up data from a local production center to a data center (a central
data center of an enterprise or a data center provided by a service provider) using a standard
network protocol over a WAN. Cloud backup is based on services, accessible anywhere, flexible,
secure, and can be shared and used on demand. Cloud backup emerges as a brand new backup
service based on broadband Internet and large storage capacities. In conclusion, cloud backup
provides data storage and backup services by leveraging a variety of functions, such as cluster
applications, grid technologies, and distributed file systems, and integrating a variety of storage
devices across the network through application software.
LAN-Based
Strengths:
⚫ The backup system and the application system are independent of each other, conserving
hardware resources of application servers during backup.
Weaknesses:
⚫ Additional backup servers increase hardware costs.
⚫ Backup agents adversely affect the performance of application servers.
⚫ Backup data is transmitted over a LAN, which adversely affects network performance.
⚫ Backup services must be separately maintained, complicating management and maintenance
operations.
⚫ Users must be highly proficient at processing backup services.
LAN-Free
Control flows are transmitted over a LAN, but data flows are not. LAN-Free backup transmits data
over a SAN instead of a LAN. The server that needs to be backed up is connected to backup media
over a SAN. When triggered by a LAN-Free backup client, the media server reads the data that needs
to be backed up and backs up the same data to the shared backup media.
Direction of backup data flows: The backup server sends a control flow over a LAN to an application
server where an agent is installed. The application server responds to the request and reads the
production data. Then, the media server reads the data from the application server and transmits
the data to the backup media. The backup operation is complete.
Strengths:
⚫ Backup data is transmitted without using LAN resources, significantly improving backup
performance while maintaining high network performance.
Weaknesses:
⚫ Backup agents adversely affect the performance of application servers.
⚫ LAN-Free backup requires a high budget.
⚫ Devices must meet certain requirements.
Server-Free
Server-Free backup has many strengths similar to those of LAN-Free backup. The source device, target device, and SAN device are the main components of the backup data channel. The server is still involved in the backup process but handles far less work: like a traffic officer, it only issues commands and no longer acts as the main backup data channel, so it does not carry the data itself. Control flows are transmitted over a LAN, but data flows are not.
Direction of backup data flows: Backup data is transmitted over an independent network without
passing through a production server.
Strengths:
⚫ Backup data flows do not consume LAN resources and do not affect network performance.
⚫ Services running on hosts remain nearly unaffected.
⚫ Backup performance is excellent.
Weaknesses:
⚫ Server-Free backup requires a high budget.
⚫ Devices must meet strict requirements.
Server-Less
Server-Less backup uses the Network Data Management Protocol (NDMP). NDMP is a standard
network backup protocol. It supports communications between intelligent data storage devices,
tape libraries, and backup applications. After a server sends an NDMP command to a storage device
that supports the NDMP protocol, the storage device can directly send the data to other devices
without passing through a host.
4.1.4.2 Deduplication
Digital transformations of enterprises have intensified the explosive growth of service data. The total
amount of backup data that needs to be protected is also increasing sharply. In addition, more and
more duplicate data is being generated from backup and archiving operations. Mass redundant data
consumes a lot of storage and bandwidth resources and leads to issues like long backup windows,
which further affect the availability of service systems.
Huawei Data Protection Appliance supports source-side and parallel deduplication. Deduplication is
performed before backup data is transmitted to storage media, greatly improving backup
performance.
Source-Side Deduplication
Data or files are sliced into blocks using an intelligent content-based deduplication algorithm. Fingerprints are then created for the data blocks by hashing and looked up in the fingerprint library. If an identical fingerprint exists, the same block is already stored on the media server; the existing block is referenced instead of being transferred again, which preserves backup capacity and bandwidth resources and streamlines data transfer and storage.
Technical principles:
1. Creates a fingerprint for a data block by hashing.
2. Queries whether the fingerprint exists in the fingerprint library of the Data Protection
Appliance. If yes, it indicates that this data block is duplicate and does not need to be sent to
the Data Protection Appliance. If no, the data block will be sent to the Data Protection
Appliance and written to the backup storage pool. Then, the fingerprint of this data block is
recorded in the deduplication fingerprint library.
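A simplified Python sketch of source-side deduplication under these principles follows; fixed-size chunking and SHA-256 are assumptions made to keep the example short, not the appliance's actual algorithm.

# Simplified source-side deduplication sketch (illustrative only).
import hashlib

CHUNK_SIZE = 4096
fingerprint_library = {}            # fingerprint -> stored block (on the media server)

def backup(data: bytes):
    unique_blocks_sent = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        block = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()   # step 1: hash the block
        if fingerprint not in fingerprint_library:        # step 2: query the library
            fingerprint_library[fingerprint] = block      # send and record the new block
            unique_blocks_sent += 1
        # Duplicate blocks are referenced by fingerprint and not transmitted.
    return unique_blocks_sent

sent = backup(b"A" * 8192 + b"B" * 4096)   # two identical "A" blocks, one "B" block
print(sent)                                # 2: only unique blocks are transferred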
Parallel Deduplication
Most conventional deduplication modes are based on a single node and are prone to inefficient data
access, poor processing performance, and insufficient storage capacity in the era of big data.
Huawei Data Protection Appliance uses the parallel deduplication technology by building a
deduplication fingerprint library on multiple nodes and distributing fingerprints on multiple nodes in
parallel. This effectively resolves the performance and storage capacity problems in single-node
solutions.
Technical principles:
After fingerprints are calculated for data blocks, the system uses a grouping algorithm to map each fingerprint to a specific server node, so that fingerprints are evenly distributed across the nodes. The system can then query the fingerprints on different server nodes at the same time, achieving parallel deduplication.
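A small sketch of the grouping idea, assuming a modulo-based placement function; real systems use more elaborate distribution and membership algorithms.

# Sketch of distributing fingerprint lookups across nodes for parallel deduplication.
NODE_COUNT = 4
node_fingerprint_libraries = [set() for _ in range(NODE_COUNT)]

def owner_node(fingerprint: str) -> int:
    # Grouping algorithm: map a hex fingerprint to the node that owns it so that
    # fingerprints are spread evenly and can be queried in parallel.
    return int(fingerprint, 16) % NODE_COUNT

def is_duplicate(fingerprint: str) -> bool:
    node = owner_node(fingerprint)
    if fingerprint in node_fingerprint_libraries[node]:
        return True
    node_fingerprint_libraries[node].add(fingerprint)
    return False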
With fingerprint libraries, deduplicated data can be stored sequentially in the same space. This reduces the time needed to query all fingerprints during each global deduplication, makes better use of the storage read cache, minimizes disk seeks caused by random reads, and improves recovery efficiency.
4.1.4.3 Backup Modes
Snapshot Backup
Snapshot backup supports backup using the snapshot function of a storage system and agent-based
backup.
Fast and recoverable:
⚫ Enables you to browse backup information and quickly recover the selected objects.
⚫ Consolidates incremental copies into a full copy in the background to quickly recover data.
⚫ Recovers data from hardware snapshots and performs fine-grained recovery using snapshots.
Recovers copies:
Storage array protection:
⚫ Native format
⚫ Automatic storage detection
⚫ Full integration (no script)
⚫ Snapshot support
Storage array recovery:
⚫ Enables you to recover, clone, or mount volumes.
⚫ Enables you to copy data back.
Standard Backup
Standard backup is a scheduled data protection mechanism. A backup task is automatically initiated at the specified time according to the backup policy and plan; it reads the data to be protected and writes it to the backup media.
Working principles:
The standard backup process consists of three steps:
Figure 4-1
1. Reads data to be protected through the backup client (agent client). Based on different
applications, the agent client can be deployed on the production server (agent-based backup)
or can be the agent client built in Huawei Data Protection Appliance (agent-free backup).
2. Transmits the data from the production system to the Data Protection Appliance over the network (TCP).
3. The Data Protection Appliance receives data and saves it to the backup storage.
For different backup modes, such as full backup, incremental backup, permanent incremental backup, and differential backup, data is read and transmitted in different ways: either all data is transmitted, or only unique data is transmitted when deduplication is enabled.
When remote DR is required, remote replication allows replication of backup data to remote data
centers.
Continuous Backup
Continuous backup is a process of continuously backing up data on production hosts to backup
media. Continuous backup is based on the block-level continuous data protection technology. A
backup agent client is installed on production hosts. Data on production hosts is continuously
backed up to the snapshot storage pool of the internal storage system of the Data Protection
Appliance and is stored in the native format. After certain conditions are met, snapshots are created
in the snapshot storage pool to manage data at multiple points in time.
Figure 4-2
1. The snapshot storage pool allocates the base volume.
2. The agent client for continuous backup connects to the server of the Data Protection Appliance.
3. The bypass monitoring drive in a partition of the production host continuously captures data
changes and caches them in the memory pool.
4. The agent client for continuous backup continuously transfers data to a storage device in the
snapshot storage pool of the Data Protection Appliance.
5. Source data in the partition of the production host is written to the base volume.
6. Data changes on the production host are written to the log volume first, and then are written to
the base volume storing the source data.
7. Snapshots of the base volume are managed based on the data retention policy for continuous
backup.
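A hedged sketch of this flow with hypothetical structures: captured changes go to a log volume first, are later applied to the base volume, and snapshots of the base volume provide the recoverable points in time.

# Illustrative continuous-backup (CDP-style) data flow sketch.
class ContinuousBackupStore:
    def __init__(self, initial_data):
        self.base_volume = dict(initial_data)   # copy of the production source data
        self.log_volume = []                     # captured change records (lba, data)
        self.snapshots = []                      # point-in-time copies of the base volume

    def capture_change(self, lba, data):
        # Changes are written to the log volume first (step 6, first half).
        self.log_volume.append((lba, data))

    def apply_log(self):
        # Logged changes are then written to the base volume (step 6, second half).
        for lba, data in self.log_volume:
            self.base_volume[lba] = data
        self.log_volume.clear()

    def take_snapshot(self):
        # Snapshots of the base volume retain multiple points in time (step 7).
        self.snapshots.append(dict(self.base_volume))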
Advanced Backup
The advanced backup function of Huawei Data Protection Appliance effectively combines years of
experience in backup and DR and the independently developed copy data storage system to ensure
application data consistency. The advanced backup function helps implement policy-based
automation and DR, provide automation tools for developers, support heterogeneous production
storage, and implement real copy data management.
Working principles:
Capture of production data: Data is captured in the native format. Format conversion is not
required. Data is accessible upon being mounted. SLA policy can be customized based on
applications. Retention duration, RPO, RTO, and data storage locations are intuitively displayed.
Copy Management
Permanent incremental backup: Initial full backup and N incremental backups are performed. A full
copy is generated at each incremental backup point in time. Damage to a copy at one incremental backup point in time does not impede recovery from any other point in time.
No rollback: Point-in-time copies created through virtual clone can reference both the source data and the current incremental data, and can be used directly for recovery.
Copy Access and Use
No data movement: Data is mounted in minutes, and data volume does not affect recovery
efficiency.
4.1.5 Applications
Databases
Databases are critical service applications in production systems. The native backup function of
databases relies on complicated manual operations. In addition, various databases on different
platforms need protection, which requires a broad compatibility of backup products.
The Data Protection Appliance provides a graphical wizard. Users do not need to manually execute
backup and restoration scripts, which simplifies backup and recovery operations. Database backup
process is as follows:
Install a backup client agent on the production server to be protected and connect the client agent
to the management console. The backup client agent identifies the database data on the production
server, reads the files and data from the production server through the backup API, and transfers the
same files and data to the storage media of the Data Protection Appliance to complete the backup.
The management console of the Data Protection Appliance sends control information to the client
and the Data Protection Appliance server and accordingly, manages the execution of a backup task.
The backup process: The backup client agent invokes the database's backup API to read data from the database, performs deduplication or encryption, and then sends the data to the Data Protection Appliance to complete the backup.
The recovery process: The management console sends a recovery command to the backup client agent on the production server. The backup client agent reads data from the backup server and sends it to the database's recovery API to complete the recovery.
The Data Protection Appliance connects to a database through a dedicated backup API, which varies with the database. For example, Oracle uses the RMAN interface and SQL Server uses the VDI interface.
Virtualization Platforms
The popularization of virtualization has increased the confidence of enterprises in storing their core
data in a virtual environment. Therefore, enterprises are in urgent need of data protection in a
virtual environment, in particular, data backup and recovery efficiency is a major concern.
The Data Protection Appliance provides a comprehensive and pertinent virtualization platform
protection solution which provides the following benefits:
⚫ Mass virtual data protection to improve backup efficiency.
⚫ Unified protection for both physical and virtual environments to simplify O&M.
⚫ Flexible recovery methods to avoid service interruptions.
⚫ Agent-free backup to minimize usage of host resources and maximize production performance.
The backup process is as follows:
1. Create a VM snapshot.
2. Back up VM data, including the VM configuration information and the data on virtual disks. During backup, the CBT technology can be used to obtain valid data blocks or incremental data blocks on VM disks. In a full backup operation, valid data blocks on virtual disks are obtained. In an incremental backup operation, changed data blocks on the VM are obtained.
3. Delete the VM snapshot created in step 1.
The recovery process is as follows:
1. For recovery to a new VM, create a VM based on the configuration of the original VM. For recovery by overwriting a VM, manually configure the VM to be overwritten.
2. The system reads the disk block data at the corresponding point in time from the media server and writes the data to the disks of the VM prepared in step 1.
When FusionCompute VMs use FusionStorage, the backup and recovery processes differ from those of FusionCompute VMs using virtualized storage. In FusionStorage scenarios, VM disks correspond to LUN volumes on FusionStorage, and the differential bitmap volume provided by FusionStorage is used to obtain changed data blocks on virtual disks.
File Systems
The file system backup module of the Data Protection Appliance can back up unstructured file
systems. It has the following features:
⚫ Backup types: full backup, incremental backup, and permanent incremental backup.
⚫ Backup and recovery granularity: single file, folder, entire disk
⚫ Block-level deduplication: Reduces the amount of backup data to be transmitted, shortening the
backup window and conserving network resources and storage space consumed by backup data
transmission.
⚫ Recovery location: Allows a file system to be recovered to the same location on the original
host, a specified location on the original host, or a different host.
⚫ Incremental backup: Incremental backup is performed on a per-file basis. The system compares each file's last modification time with the time of the last backup to determine whether the file needs to be backed up, and backs up only the files that have changed (see the sketch after this list).
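A minimal sketch of the per-file selection logic, assuming a comparison of file modification time against the last backup time; the product's actual criteria may differ.

# Per-file incremental backup selection (illustrative only).
import os, time

def files_to_back_up(paths, last_backup_time):
    changed = []
    for path in paths:
        if os.path.getmtime(path) > last_backup_time:   # modified since the last backup
            changed.append(path)
    return changed

# Example: select files changed in the last 24 hours for the incremental job.
# selected = files_to_back_up(["/data/a.txt", "/data/b.txt"], time.time() - 86400)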
For file system backup, four filtering modes are provided to filter backup data sources, helping users
quickly select files to be backed up.
Backup process:
1. The client deployed on the service production system reads file data to be backed up.
2. The client transmits the data over the network.
3. The Data Protection Appliance receives and stores the data on physical media. The backup
operation is complete.
4. Host status check (Windows and Linux)
Windows:
After installing the client, press Windows+R. In the Run dialog box that is displayed, enter
services.msc to open the service management window. Then, check whether the client service is
started normally. If the client service is started normally, check whether the client is connected to
the server. If connected, you can create a standard backup plan for a file system.
Linux:
The requirements for backing up a Linux file system are similar to those for backing up a Windows
file system. Check the host service or process. If the client is connected to the server, you can create
a standard backup plan for a file system. The check commands used in CentOS 7 are as follows:
systemctl status HWClientService.service
ps -ef|grep esf
Operating Systems
The backup process is as follows:
1. Install a client to obtain the data about an operating system.
2. For Windows, invoke the VSS interface to create a snapshot for the volume where the operating
system resides. For Linux, select the data source to be backed up.
3. Read the data of the volume where the operating system resides and back up the same data to
the storage media in the Data Protection Appliance.
4. The backup operation is complete (for Windows, delete the snapshot).
Recovery process:
1. Load the WinPE or LiveCD to boot the recovery environment. For Linux, install a client.
2. For Windows, manually partition the disk.
3. Recover the operating system data from the storage media to the specified system volume.
4. For Windows, use the system API to load the driver, modify the registry, and rectify blue screen of death (BSOD) errors. For
Linux, modify the configuration file.
5. Reboot the operating system upon completion of the recovery operation.
HA requires redundant servers to form a cluster to run applications and services. HA can be
categorized into the following types:
Active/Passive HA:
A cluster consists of only two nodes (an active node and a standby node). In this configuration, the system uses the active and standby machines to provide services, but services run only on the active device.
When the active device is faulty, the services on the standby device are started to replace the
services provided by the active device.
Typically, the CRM software such as Pacemaker can be used to control the switchover between the
active and standby devices and provide a virtual IP address to provide services.
Active/Active HA:
If a cluster consists of only two active nodes, it is called active-active. If the cluster has multiple
nodes, it is called multi-active.
In this configuration, the system runs the same load on all servers in the cluster.
Take the database as an example. The update of an instance will be synchronized to all instances.
In this configuration, load balancing software, such as HAProxy, is used to provide virtual IP
addresses for services.
Pacemaker is a cluster resource manager. It uses the messaging and membership capabilities provided by the preferred cluster infrastructure (OpenAIS or Heartbeat) to detect node and resource faults, achieving high availability of cluster services (also called resources).
HAProxy is free and open-source software written in C. It provides high availability, load balancing, and proxying for TCP- and HTTP-based applications. HAProxy is especially suitable for heavily loaded web sites that require session persistence.
A disaster is an unexpected event (caused by human errors or natural factors) that results in severe
faults or breakdown of the system in one data center. In this case, services may be interrupted or
become unacceptable. If the system unavailability reaches a certain level at a specific time, the
system must be switched to the standby site.
Disaster recovery (DR) refers to the capability of recovering data, applications, and services in data
centers at different locations when the production center is damaged by a disaster.
In addition to the production site, a redundancy site is set up. When a disaster occurs and the
production site is damaged, the redundancy site can take over services from the production site to
ensure service continuity. To achieve higher availability, many users even set up multiple redundant
sites.
Main indicators for measuring a DR system
Recovery Point Objective (RPO) indicates the maximum amount of data that can be lost when a
disaster occurs.
Recovery Time Objective (RTO) indicates the time required for system recovery.
The smaller the RPO and RTO, the higher the system availability, and the larger the investment for
users.
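A simple worked example with hypothetical figures: if replication runs every 15 minutes and recovery at the DR site takes 30 minutes, the worst-case RPO is 15 minutes and the RTO is 30 minutes.

# Illustration of RPO and RTO with hypothetical figures.
replication_interval_min = 15
recovery_duration_min = 30

worst_case_rpo_min = replication_interval_min   # data written since the last cycle may be lost
rto_min = recovery_duration_min                 # time until the business process is restored
print(worst_case_rpo_min, rto_min)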
4.2.1.2 DR System Level
Disaster recovery is an important technical application for enterprises and plays an important role in
enterprise data security. When it comes to disaster recovery, many CIOs put remote application-
level disaster recovery in the first place. They also emphasize the construction of a remote disaster
recovery system with zero data loss and automatic application switchover at the highest level. This is
actually a misconception. There is no doubt about the importance of disaster recovery. However,
disaster recovery does not necessarily mean that application-level disaster recovery must be built.
The most important thing is to select a proper disaster recovery system based on actual
requirements.
Generally speaking, disaster backup is classified into three levels: data level, application level, and
service level. The data level and application level are within the scope of the IT system. The service
level takes the service factors outside the IT system into consideration, including the standby office
location and office personnel.
The data-level disaster recovery focuses on protecting the data from loss or damage after a disaster
occurs. Low-level data-level disaster recovery can be implemented by manually saving backup data
to a remote place. For example, periodically transporting backup tapes to a remote place is one of
the methods. The advanced data disaster recovery solution uses the network-based data replication
tool to implement asynchronous or synchronous data transmission between the production center
and the disaster recovery center. For example, the data replication function based on disk arrays is
used.
Application-level DR creates hosts and applications in the DR site based on the data-level DR. The
support system consists of the data backup system, standby data processing system, and standby
network system. Application-level DR provides the application takeover capability. That is, when the
production center is faulty, applications can be taken over by the DR center to minimize the system
downtime and improve service continuity.
SHARE, an IT information organization initiated by IBM in 1955, released the disaster recovery
standard SHARE 78 at the 78th conference in 1992. SHARE 78 has been widely recognized in the
world.
SHARE 78 divides disaster recovery into eight levels:
Backup or recovery scope
Status of a disaster recovery plan
Distance between the application location and the backup location
Connection between the application location and backup location
Transmission between the two locations
Data allowed to be lost
Backup data update
Ability of a backup location to start a backup job
The definition of remote disaster recovery is classified into seven levels:
Backup and recovery of local data
Access mode of batch storage and read
Access mode of batch storage and read + hot backup location
Network connection
Backup location of the working status
Dual online storage
Zero data loss
In addition, ISO 27001, released by the International Organization for Standardization (ISO), requires that related data and files be stored for at least one to five years.
4.2.1.3 Panorama of Huawei Business Continuity and Disaster Recovery Solution
Huawei Business Continuity and Disaster Recovery (BC&DR) Solution is designed to provide business
continuity assurance and data protection for enterprise customers. Huawei provides four major DR
solutions covering the local production center, intra-city DR center, and remote DR center. In
addition, Huawei provides professional DR consulting services for customers' service systems to
ensure service continuity and data protection.
Local HA solution: ensures high availability of key services in the data center and prevents service
interruption and data loss caused by single-component faults.
Active-passive DR solution: intra-city and remote DR are supported. When a disaster occurs, services
in the DR center can be quickly recovered and provide services for external systems.
Active-Active data center solution: In intra-city DR, load of a critical service is balanced between two
data centers, ensuring zero service interruption and data loss when a data center malfunctions.
Geo-redundant DR solution: defends against data center-level disasters and regional disasters and
provides higher service continuity for mission-critical services. Generally, the intra-city
active/standby + remote active/standby solution or intra-city active-active + remote active/standby
solution is used.
In active-active mode, all I/O paths can access active-active LUNs to achieve load balancing and
seamless failover.
Huawei's active-active data center solution adopts the active-active architecture and combines the industry-leading HyperMetro functions with web, database cluster, load balancing, transmission, and network components to provide customers with an end-to-end active-active data center solution within 100 km. Even if a device or an entire data center fails, services are not affected and are switched over automatically, ensuring service continuity.
4.2.2.2 New DR Mode Evolution in Cloud Computing
To help enterprise customers build high-quality IT systems and meet service development
requirements, Huawei provides professional services in terms of storage, cloud computing, and
servers based on IT products.
At the storage layer, Huawei provides professional storage data migration, disaster recovery,
backup, and virtualization takeover services to meet enterprise customers' requirements for storage
replacement, data protection, and unified storage management.
In terms of cloud computing, Huawei provides professional services, such as cloud planning and
design, FusionSphere solution implementation, FusionCloud desktop solution implementation,
FusionSphere service migration, big data planning, and big data solution implementation, to meet
enterprise customers' requirements on virtualization planning and design, implementation,
migration, and big data planning and implementation.
At the data center layer, Huawei provides professional data center consolidation services to meet
customers' requirements for data center L1 and L2, data protection, data center planning, and data
migration. L1 is the infrastructure layer, including the floor layout, power system, cooling system,
cabling system, fire extinguishing system, and physical security. L2 is the IT infrastructure layer,
which uses the cloud computing system as the core, including computing, network, storage, security,
service continuity, disaster recovery, and backup.
The Huawei enterprise DR and backup service provides multiple service products, including storage
data migration, DR, backup, VM migration, and cloud solution implementation services, covering
multiple industries such as government, energy, finance, and education. Huawei provides
professional service solutions with industry-specific characteristics to meet enterprise customers'
requirements for IT infrastructure update, data protection, and technological transformation.
Based on customer requirements and service lifecycle, the Huawei enterprise DR and backup service
provides customers with one-stop professional services, including project management, planning,
design, integration test, integration implementation, integration verification, and optimization.
In addition, professional and diversified tools are used to quickly collect and analyze project
information, design and implement solutions, and customize and deliver the most appropriate
professional service solutions for customers.
In this mode, data written by a production host is stored only to the primary LUN, and the difference
between the primary and secondary LUNs is recorded by the differential log. If you want to ensure
data consistency between the primary and secondary LUNs, you can manually start a
synchronization process. During the synchronization process, data blocks marked as differential in
the differential log are incrementally copied from the primary LUN to the secondary LUN. The I/O
processing principle is similar to that of initial synchronization.
SAN Asynchronous Replication Principle
A time segment refers to a logical space in the cache for writing data within a period of time (with no limit on the amount of data).
In the low RPO scenario, the asynchronous remote replication period is short. The cache of an
OceanStor storage system can store all data in multiple time segments. However, if the host or DR
bandwidth is abnormal and the replication period is prolonged or interrupted, data in the cache is
automatically written into disks based on the disk flushing policy for consistency protection. Upon
replication, the data is read from disks.
Based on the replication period (which is user-defined and ranges from 3 seconds to 1440 minutes),
the system automatically starts a synchronization procedure for incrementally synchronizing data
from the primary site to the standby site (If the synchronization type is set to manual, users need to
manually trigger synchronization). At the start of a replication period, the system generates time
segments TPN+1 and TPX+1 separately in the caches of the primary LUN (LUN A) and the standby
LUN (LUN B).
The primary site receives a write request from the production host.
The primary site writes the data involved in the write request into time segment TPN+1 in the cache
of LUN A and immediately returns the write complete response to the host.
During data synchronization, the system reads data generated in the previous replication period in
time segment TPN in the cache of LUN A, transmits the data to the standby site, and writes the data
into time segment TPX+1 in the cache of LUN B. If the usage of LUN A's cache reaches a certain
threshold, the system automatically writes data into disks. In this case, a snapshot is generated on
disks for the data in time segment TPN. During data synchronization, the system reads the data from
the snapshot on disks and replicates the data to LUN B.
After the data synchronization is complete, the system writes data in time segments TPN and TPX+1
separately in the caches of LUN A and LUN B into disks based on the disk flushing policy (snapshots
are automatically deleted), and waits for the next replication period.
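The following simplified sketch illustrates the time-segment cycle with hypothetical classes; cache flushing, the snapshot fallback, and failure handling described above are omitted.

# Period-based asynchronous replication with time segments (illustrative only).
class AsyncReplicationPair:
    def __init__(self):
        self.primary_disk = {}        # LUN A data
        self.secondary_disk = {}      # LUN B data
        self.current_tp = {}          # TP(N+1): data written in the current period

    def host_write(self, lba, data):
        # Writes land in the current time segment and are acknowledged immediately.
        self.current_tp[lba] = data
        self.primary_disk[lba] = data

    def start_new_period(self):
        # At the start of a period, freeze the current segment as TP(N) and open a
        # new one; TP(N) is then transmitted incrementally to the secondary site.
        tp_n, self.current_tp = self.current_tp, {}
        for lba, data in tp_n.items():
            self.secondary_disk[lba] = data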
Switchover:
A primary/secondary switchover can be performed for a synchronous remote replication pair when the pair is in the normal state. In the split state, the switchover can be performed only after the secondary LUN is set to writable.
For an asynchronous remote replication pair, the switchover can be performed only when the pair is in the split state and the secondary LUN has been set to writable.
NAS Asynchronous Replication Principle
At the beginning of each period, the file system asynchronous remote replication creates a snapshot
for the primary file system. Based on the incremental information generated from the time when
the replication in the previous period is complete to the time when the current period starts, the file
system asynchronous remote replication reads the snapshot data and replicates the data to the
secondary file system. After the incremental replication is complete, the data in the secondary file
system is the same as that in the primary file system, and a data consistency point is formed on the secondary file system.
DeviceManager supports multiple operating systems and browsers. For details about the
compatibility information, visit Huawei Storage Interoperability Navigator.
The maintenance terminal communicates with the storage system properly.
The super administrator can log in to the storage system using this authentication mode only.
Before logging in to DeviceManager as a Lightweight Directory Access Protocol (LDAP) domain user,
first configure the LDAP domain server, and then configure parameters on the storage system to add
it into the LDAP domain, and finally create an LDAP domain user.
By default, DeviceManager allows 32 users to log in concurrently.
A storage system provides built-in roles and supports customized roles.
Built-in roles are preset in the system with specific permissions shown in the table. Built-in roles
include the super administrator, administrator, and read-only user.
Permissions of user-defined roles can be configured based on actual requirements.
To support permission control in multi-tenant scenarios, the storage system divides built-in roles
into two groups: system group and tenant group. Specifically, the differences between the system
group and tenant group are as follows:
Tenant group: roles in this group are used only in the tenant view (view that can be operated after
you log in to DeviceManager using a tenant account).
System group: roles belonging to this group are used only in the system view (view that can be
operated after you log in to DeviceManager using a system group account).
ITIL is widely recognized and used around the world. Many famous multinational companies, such as IBM, HP, Microsoft, P&G, and HSBC, are
active practitioners of ITIL. As the industry is gradually changing from technology-oriented to service-
oriented, enterprises' requirements for IT service management are also increasing, which greatly
helps standardize IT processes, keep IT processes' pace with business, and improve processing
efficiency.
ITIL has strong support from the UK, other European countries, North America, New Zealand, and Australia. Whether an enterprise has adopted ITIL is regarded as a key indicator for determining whether a supplier or outsourcing service contractor is qualified for bidding.
Information Collection
The information to be collected includes basic information, fault information, storage device
information, networking information, and application server information.
Customer information: provides the contact person and contact details.