HCIA-Storage
Learning Guide
V4.5
and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their
respective holders.
Notice
The purchased products, services and features are stipulated by the contract made between
Huawei and the customer. All or part of the products, services and features described in this
document may not be within the purchase scope or the usage scope. Unless otherwise specified
in the contract, all statements, information, and recommendations in this document are provided
"AS IS" without warranties, guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made
in the preparation of this document to ensure accuracy of the contents, but all statements,
information, and recommendations in this document do not constitute a warranty of any kind,
express or implied.
Traditional storage is composed of disks. In 1956, IBM invented the world's first hard disk drive,
which used 50 x 24-inch platters with a capacity of only 5 MB. It was as big as two refrigerators and
weighed more than a ton. It was used in the industrial field at that time and was independent of the
mainframe.
External storage is also called direct attached storage. Its earliest form is Just a Bunch of Disks
(JBOD), which simply combines disks and presents them to hosts as a set of independent disks. It
only increases capacity and cannot ensure data security.
The disks deployed in servers have the following disadvantages: limited slots and insufficient
capacity; poor reliability, as data is stored on independent disks; disks becoming the system
performance bottleneck; low storage space utilization; and scattered data, as data is stored on
different servers.
JBOD solves the problem of limited slots to a certain extent, and RAID technology improves
reliability and performance. External storage gradually developed into storage arrays with controllers.
The controllers contain cache and support the RAID function. In addition, dedicated
management software can be configured. A storage array is presented to hosts as a single large,
high-performance, redundant disk.
DAS has the characteristics of scattered data and low storage space utilization.
As the amount of data in society increases explosively, data storage must provide flexible data
sharing, high resource utilization, and extended transmission distances. The emergence of networks
infuses new vitality into storage.
SAN: establishes a network between storage devices and servers to provide block storage services.
NAS: builds networks between servers and storage devices with file systems to provide file storage
services.
Since 2011, unified storage that supports both SAN and NAS protocols has been a popular choice.
Storage convergence sets a new trend: converged NAS and SAN. This convergence provides both
database and file sharing services, simplifying storage management and improving storage utilization.
SAN is a typical storage network. It first emerged as FC SAN, which uses a Fibre Channel network to
transmit data, and later added support for IP SAN.
Distributed storage uses general-purpose server hardware to build storage resource pools and is
applicable to cloud computing scenarios. Physical resources are organized using software to form a
high-performance logical storage pool, ensuring reliability and providing multiple storage services.
Generally, distributed storage scatters data to multiple independent storage servers in a scalable
system structure. It uses those storage servers to share storage loads and location servers to locate
storage information. Distributed storage architecture has the following characteristics: universal
hardware, unified architecture, and storage-network decoupling; linear expansion of performance
and capacity, up to thousands of nodes; elastic resource scaling and high resource utilization.
Storage virtualization consolidates the storage devices into logical resources, thereby providing
comprehensive and unified storage services. Unified functions are provided regardless of different
storage forms and device types.
The cloud storage system combines multiple storage devices, applications, and services. It uses
highly virtualized multi-tenant infrastructure to provide scalable storage resources for enterprises.
Those storage resources can be dynamically configured based on organization requirements.
Cloud storage is a concept derived from cloud computing, and is a new network storage technology.
Based on functions such as cluster applications, network technologies, and distributed file systems, a
cloud storage system uses application software to enable various types of storage devices on
networks to work together, providing data storage and service access externally. When a cloud
computing system stores and manages a huge amount of data, the system requires a matched
number of storage devices. In this way, the cloud computing system turns into a cloud storage
system. Therefore, we can regard a cloud storage system as a cloud computing system with data
storage and management as its core. In a word, cloud storage is an emerging solution that
consolidates storage resources on the cloud for people to access. Users can access data on the cloud
anytime, anywhere, through any networked device.
1.1.3.2 Storage Media
History of HDDs:
⚫ From 1970 to 1991, the storage density of disk platters increased by 25% to 30% annually.
⚫ Starting from 1991, the annual increase rate of storage density surged to 60% to 80%.
⚫ Since 1997, the annual increase rate rocketed up to 100% and even 200%, thanks to IBM's giant
magnetoresistive (GMR) head technology, which further improved disk head sensitivity and
storage density.
⚫ IBM 1301: used air-bearing heads to eliminate friction and its capacity reached 28 MB.
⚫ IBM 3340: was a pre-installed box unit with a capacity of 30 MB. It was also called "Winchester"
disk drive, named after the Winchester 30-30 rifle because it was planned to run on two 30 MB
spindles.
⚫ In 1992, 1.8-inch HDDs were invented.
History of SSDs:
⚫ Invented by Dawon Kahng and Simon Min Sze in 1967, the floating gate transistor has become
the basis of NAND flash technology. If you are familiar with MOSFETs, you'll find that this
transistor is similar to a MOSFET except for an additional floating gate in the middle, which is
how it got its name. The floating gate is wrapped in high-impedance materials and insulated
above and below, so it preserves the charges that enter it through the quantum tunneling effect.
⚫ In 1976, Dataram sold SSDs called Bulk Core. The SSD had the capacity of 2 MB (which was very
large at that time), and used eight large circuit boards, each board with eighteen 256 KB RAMs.
⚫ At the end of the 1990s, some vendors began to use the flash medium to manufacture SSDs. In
1997, Altec Computer Systeme launched a parallel SCSI flash SSD. In 1999, BiTMICRO released
an 18-GB flash SSD. Since then, flash SSD has gradually replaced RAM SSD and become the
mainstream product of the SSD market. The flash memory can store data even in the event of
power failure, which is similar to the HDD.
⚫ In May 2005, Samsung Electronics announced its entry into the SSD market, the first IT giant
entering this market. It is also the first SSD vendor that is widely recognized today.
⚫ In 2006, NextCom began to use SSDs in its laptops. Samsung launched a 32 GB SSD. According
to Samsung, the SSD market was worth 1.3 billion USD in 2007 and reached 4.5 billion USD in
2010. In September, Samsung launched the PRAM SSD, another SSD technology that used PRAM
as the storage medium, hoping to replace NOR flash memory. In November, Microsoft's
Windows Vista was released as the first PC operating system to support SSD-specific features.
⚫ In 2009, the capacity of SSDs caught up with that of HDDs. PureSilicon's 2.5-inch SSD provided 1
TB of capacity and consisted of 128 pieces of 64 Gbit MLC NAND memory. Finally, an SSD provided the
same capacity as an HDD of the same size. This is very important. HDD vendors once believed that
the HDD capacity could be easily increased by increasing the disk density with low costs.
However, the SSD capacity could be doubled only when the internal chips were doubled, which
was difficult. However, the MLC SSD proves that it is possible to double the capacity by storing
more bits in one cell. In addition, the SSD performance is much higher than that of HDD. The
SSD has the read bandwidth of 240 MB/s, write bandwidth of 215 MB/s, read latency less than
100 microseconds, 50,000 read IOPS, and 10,000 write IOPS. HDD vendors are facing a huge
threat.
SSD flash chips have evolved from SLC, which stores one bit per cell, to MLC with two bits, TLC with
three bits, and now QLC with four bits per cell.
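To make the progression concrete, the short sketch below (an illustration, not from the guide) shows how each additional bit per cell doubles the number of charge states a cell must distinguish, which is why endurance and write speed tend to drop from SLC to QLC.

```python
# Minimal sketch: each additional bit per cell doubles the number of
# distinct charge levels a NAND cell must hold and distinguish.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}  # bits per cell

for name, bits in CELL_TYPES.items():
    states = 2 ** bits  # distinguishable voltage states per cell
    print(f"{name}: {bits} bit(s)/cell -> {states} voltage states")
```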
1.1.3.3 Interface Protocols
Interface protocols refer to the communication modes and requirements that interfaces for
exchanging information must comply with.
Interfaces are used to transfer data between disk cache and host memory. Different disk interfaces
determine the connection speed between disks and controllers.
During the development of storage protocols, the data transmission rate is increasing. As storage
media evolves from HDDs to SSDs, the protocol develops from SCSI to NVMe, including the PCIe-
based NVMe protocol and NVMe over Fabrics (NVMe-oF) protocol to connect host networks.
NVMe-oF uses ultra-low-latency transmission protocols such as remote direct memory access
(RDMA) to remotely access SSDs, resolving the trade-off between performance, functionality, and
capacity during scale-out of next-generation data centers.
Released in 2016, the NVMe-oF specification supported both Fibre Channel and RDMA transports.
The RDMA-based transports include InfiniBand, RDMA over Converged Ethernet (RoCE), and the
Internet Wide Area RDMA Protocol (iWARP).
In November 2018, TCP was added as a transport option (NVMe/TCP), which was incorporated into
the NVMe-oF 1.1 specification.
NVMe is an SSD controller interface standard. It is designed for PCIe interface-based SSDs and aims
to maximize flash memory performance. It can provide intensive computing capabilities for
enterprise-class workloads in data-intensive industries, such as life sciences, financial services,
multimedia, and entertainment.
NVMe SSDs are commonly used in databases. Featuring high speed and low latency, NVMe can be
used for file systems and all-flash storage arrays to achieve excellent read/write performance. The
all-flash storage system using NVMe SSDs provides efficient storage, network switching, and
metadata communication.
The entire human society is rapidly evolving into an intelligent society. During this process, the data
volume is growing explosively. The average mobile data usage per user per day is over 1 GB. During
the training of autonomous vehicles, each vehicle generates 64 TB data every day. According to
Huawei's Global Industry Vision 2025, the amount of global data will increase from 33 ZB in 2018 to
180 ZB in 2025. Data is becoming a core business asset of enterprises and even countries. The smart
government, smart finance, and smart factory built based on effective data utilization greatly
improve the efficiency of the entire society. More and more enterprises have realized that data
infrastructure is the key to intelligent success, and storage is the core foundation of data
infrastructure. In the past, we classified storage systems based on new technology hotspots,
technical architecture, and storage media. As the economy and society transform from digitalization
to intelligence, we tend to call this new type of storage the storage of the intelligence era.
It has several trends:
First, intelligence, classified by Huawei as Storage for AI and AI in Storage. Storage for AI indicates
that in the future, storage will better support enterprises in AI training and applications. AI in
Storage means that storage systems use AI technologies and integrate AI into storage lifecycle
management to provide outstanding storage management, performance, efficiency, and stability.
Second, storage arrays will transform, for example, towards all-flash storage arrays. In the future,
more and more applications will require low latency, high reliability, and low TCO, and all-flash
storage arrays will be a good choice. Although new storage media will emerge to compete, all-flash
storage will be the mainstream storage medium in the future. Today, all-flash storage is still not
the mainstream in the storage market.
The third trend is distributed storage. In the 5G intelligent era, high-performance application
scenarios such as AI, HPC, and autonomous driving and the generated massive amount of data
require distributed storage devices. With dedicated hardware, they can provide efficient, cost-
effective, and EB-level large-capacity storage. Distributed storage is facing the challenges of
intensification and large-scale expansion, as well as the possible changes of chips and algorithms in
the future. Scientists are attempting to use chip, algorithm, and bus technologies to break the barriers of
the von Neumann architecture, to give the underlying data infrastructure more computing power
and efficient, low-cost storage media, and to narrow the gap between storage and computing.
These challenges call for dedicated storage hardware. Concepts similar to Memory Fabric are also
bringing changes to the storage architecture.
The last trend is convergence. In the future, storage will be integrated with the data infrastructure to
support heterogeneous chip computing, streamline diversified protocols, and collaborate with data
processing and big data analytics to reduce data processing costs and improve efficiency. For
example, compared with the storage provided by general-purpose servers, the integration of data
and storage will lower the TCO because data processing is offloaded from servers to storage. Object,
big data, and other protocols are converged and interoperate to implement migration-free big data.
Such convergence greatly affects the design of storage systems and is the key to improving storage
efficiency.
1.1.4.3 Data Storage Trend
In the intelligence era, we must focus on innovation in hardware, protocols, and technologies. From
IBM mainframes to x86 and then to virtualization, all-flash storage media and all-IP network
protocols have become a major trend.
In the intelligence era, Huawei Cache Coherence System (HCCS) and Compute Express Link (CXL) are
designed based on ultra-fast new interconnection protocols, helping to implement high-speed
interconnection between heterogeneous processors of CPUs and neural processing units (NPUs).
RoCE and NVMe support high-speed data transmission and containerization technologies. In
addition, new hardware and technologies provide abundant choices for data storage. The Memory
Fabric architecture implements memory resource pooling with all-flash + storage class memory
(SCM) and provides microsecond-level data processing performance. SCM media include Optane,
MRAM, ReRAM, FRAM, and Fast NAND. In terms of reliability, system reconstruction and data
migration are involved. As the chip-level design of all-flash storage advances, upper-layer
applications will be unaware of the underlying storage hardware.
Currently, the access performance of SSDs has been improved by 100-fold compared with that of
HDDs. For NVMe SSDs, the access performance is 10,000 times higher than that of HDDs. While the
latency of storage media has been greatly reduced, the ratio of network latency to the total latency
has rocketed from less than 5% to about 65%. That is to say, in more than half of the time, storage
media is idle, waiting for the network communication. How to reduce network latency is the key to
improving input/output operations per second (IOPS).
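As a rough illustration of this shift, the sketch below uses assumed latency figures (about 5 ms of media latency for an HDD, 0.1 ms for an NVMe SSD, and a fixed 0.2 ms network leg; none of these numbers come from this guide) to show how the network's share of end-to-end latency grows once media latency collapses, and how total latency bounds the IOPS achievable per outstanding request.

```python
# Illustrative sketch with assumed latencies (not figures from this guide):
# once media latency collapses, a fixed network latency dominates the total.
def latency_share(media_ms: float, network_ms: float) -> tuple[float, float]:
    total = media_ms + network_ms
    return total, network_ms / total * 100  # total latency, network share (%)

for label, media in [("HDD", 5.0), ("NVMe SSD", 0.1)]:
    total, share = latency_share(media, network_ms=0.2)
    iops_per_request = 1000 / total  # IOPS for one outstanding request
    print(f"{label}: total {total:.2f} ms, network share {share:.0f}%, "
          f"~{iops_per_request:.0f} IOPS per outstanding request")
```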
Kalff, F., Rebergen, M., Fahrenfort, E. et al. A kilobyte rewritable atomic memory. Nature Nanotech
11, 926–929 (2016). https://doi.org/10.1038/nnano.2016.131
Because an atom is so small, the capacity of atomic storage will be much larger than that of the
existing storage medium in the same size. With the development of science and technology in recent
years, Feynman's idea has become a reality. To pay tribute to Feynman's great idea, some research
teams wrote his lecture into atomic memory. Although the idea of atomic storage is incredible and
its implementation is becoming possible, atomic memory has strict requirements on the operating
environment. Atoms are always moving, and even the atoms inside solids vibrate at ambient
temperature, so it is difficult to keep them in an ordered state under normal conditions. Atomic
storage can therefore only be used at low temperatures, in liquid nitrogen, or in a vacuum.
If both DNA storage and atomic storage are intended to reduce the size of storage and increase the
capacity of storage, quantum storage is designed to improve performance and running speed.
After years of research, both the storage efficiency and the lifespan of quantum memory have
improved, but it is still difficult to put quantum memory into practice. Quantum memory suffers
from low efficiency, high noise, a short lifespan, and difficulty operating at room temperature. Only
by solving these problems can quantum memory be brought to market.
Elements in a quantum state are easily lost due to the influence of the external environment. In
addition, it is difficult to ensure 100% accuracy when preparing quantum states and performing
quantum operations.
References:
Wang, Y., Li, J., Zhang, S. et al. Efficient quantum memory for single-photon polarization qubits. Nat.
Photonics 13, 346–351 (2019). https://doi.org/10.1038/s41566-019-0368-8
Dou Jian-Peng, Li Hang, Pang Xiao-Ling, Zhang Chao-Ni, Yang Tian-Huai, Jin Xian-Min. Research
progress of quantum memory. Acta Physica Sinica, 2019, 68(3): 030307. doi:
10.7498/aps.68.20190039
⚫ Coffer disks store user data, system configurations, logs, and dirty data in the cache to protect
against unexpected power outages.
➢ Built-in coffer disk: Each controller of Huawei OceanStor Dorado V6 has one or two built-in
SSDs as coffer disks. See the product documentation for more details.
➢ External coffer disk: The storage system automatically selects four disks as coffer disks.
Each coffer disk provides 2 GB space to form a RAID 1 group. The remaining space can
store service data. If a coffer disk is faulty, the system automatically replaces the faulty
coffer disk with a normal disk for redundancy.
⚫ Power module: The controller enclosure employs an AC power module for its normal
operations.
➢ A 4 U controller enclosure has four power modules (PSU 0, PSU 1, PSU 2, and PSU 3). PSU 0
and PSU 1 form a power plane to power controllers A and C and provide mutual
redundancy. PSU 2 and PSU 3 form the other power plane to power controllers B and D
and provide mutual redundancy. It is recommended that you connect PSU 0 and PSU 2 to
one PDU and PSU 1 and PSU 3 to another PDU for maximum reliability.
➢ A 2 U controller enclosure has two power modules (PSU 0 and PSU 1) to power controllers
A and B. The two power modules form a power plane and provide mutual redundancy.
Connect PSU 0 and PSU 1 to different PDUs for maximum reliability.
2.1.4 HDD
2.1.4.1 HDD Structure
⚫ A platter is coated with magnetic materials on both surfaces with polarized magnetic grains to
represent a binary information unit, or bit.
⚫ A read/write head reads and writes data for platters. It changes the polarities of magnetic grains
on the platter surface to save data.
⚫ The actuator arm moves the read/write head to the specified position.
⚫ The spindle has a motor and bearing underneath. It rotates the specified position on the platter
to the read/write head.
⚫ The control circuit controls the speed of the platter and movement of the actuator arm, and
delivers commands to the head.
2.1.4.2 HDD Design
Each disk platter has two read/write heads to read and write data on the two surfaces of the platter.
Airflow prevents the head from touching the platter, so the head can move between tracks at a high
speed. A long distance between the head and the platter results in weak signals, and a short distance
may cause the head to rub against the platter surface. The platter surface must therefore be smooth
and flat. Any foreign matter or dust will shorten the distance and cause the head to rub against the
magnetic surface. This will result in permanent data corruption.
Working principles:
⚫ The read/write head starts in the landing zone near the platter spindle.
⚫ The spindle connects to all of the platters and a motor. The spindle motor rotates at a constant
speed to drive the platters.
⚫ When the spindle rotates, there is a small gap between the head and the platter. This is called
the flying height of the head.
⚫ The head is attached to the end of the actuator arm, which drives the head to the specified
position above the platter where data needs to be written or read.
⚫ The head reads and writes data in binary format on the platter surface. The read data is stored
in the disk's cache and then transmitted to the program.
⚫ In long-distance transmission, the time for data on each line to reach the peer end varies due to
wire resistance or other factors. The next transmission can be initiated only after data on all
lines has reached the peer end.
⚫ High transmission frequency causes serious circuit oscillation and generates interference
between the lines. The frequency of parallel transmission must therefore be carefully set.
Serial transmission:
⚫ Serial transmission is less efficient than parallel transmission, but is generally faster with
potential increases in transmission speed from increasing the transmission frequency.
⚫ Serial transmission is used for long-distance transmission. The PCIe interface is a typical
example of serial transmission. The transmission rate of a single lane is up to 2.5 Gbit/s.
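Assuming the 2.5 Gbit/s figure refers to a first-generation PCIe lane with 8-bit/10-bit encoding (an assumption, not stated above), a quick calculation sketches the usable payload bandwidth per lane.

```python
# Sketch: effective per-lane payload rate for a 2.5 Gbit/s serial link
# that uses 8b/10b encoding (as PCIe 1.0 and early Fibre Channel do).
line_rate_gbps = 2.5            # raw signalling rate per lane (Gbit/s)
encoding_efficiency = 8 / 10    # 8b/10b: 8 data bits carried per 10 line bits

payload_gbps = line_rate_gbps * encoding_efficiency   # 2.0 Gbit/s
payload_mbytes = payload_gbps * 1000 / 8              # ~250 MB/s
print(f"~{payload_gbps} Gbit/s payload, ~{payload_mbytes:.0f} MB/s per lane")
```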
2.1.4.10 Disk Ports
Disks are classified into IDE, SCSI, SATA, SAS, and Fibre Channel disks by port. These disks also differ
in their mechanical bases.
IDE and SATA disks use the ATA mechanical base and are suitable for single-task processing.
SCSI, SAS, and Fibre Channel disks use the SCSI mechanical base and are suitable for multi-task
processing.
Comparison:
⚫ SCSI disks provide faster processing than ATA disks under high data throughput.
⚫ ATA disks overheat during multi-task processing due to the frequent movement of the
read/write head.
⚫ SCSI disks provide higher reliability than ATA disks.
IDE disk port:
⚫ Multiple ATA versions have been released, including ATA-1 (IDE), ATA-2 (Enhanced IDE/Fast ATA),
ATA-3 (Fast ATA-2), ATA-4 (ATA33), ATA-5 (ATA66), ATA-6 (ATA100), and ATA-7 (ATA133).
⚫ ATA ports have several advantages and disadvantages:
➢ Their strengths are their low price and good compatibility.
➢ Their disadvantages are their low speed, limited applications, and strict restrictions on
cable length.
➢ The transmission rate of the PATA port is also inadequate for current user needs.
SATA port:
⚫ During data transmission, the data and signal lines are separated and use independent
transmission clock frequency. The transmission rate of SATA is 30 times that of PATA.
⚫ Advantages:
➢ A SATA port generally has 7+15 pins, uses a single channel, and transmits data faster than
ATA.
➢ SATA uses the cyclic redundancy check (CRC) for instructions and data packets to ensure
data transmission reliability.
➢ SATA surpasses ATA in interference protection.
SCSI port:
⚫ SCSI disks were developed to replace IDE disks to provide higher rotation speed and
transmission rate. SCSI was originally a bus-type interface and worked independently of the
system bus.
⚫ Advantages:
➢ It is applicable to a wide range of devices. One SCSI controller card can connect to 15
devices simultaneously.
➢ It provides high performance with multi-task processing, low CPU usage, fast rotation
speed, and a high transmission rate.
➢ SCSI disks support diverse applications as external or built-in components with hot-
swappable replacement.
⚫ Disadvantages:
➢ High cost and complex installation and configuration.
SAS port:
⚫ SAS is similar to SATA in its use of a serial architecture for a high transmission rate and
streamlined internal space with shorter internal connections.
⚫ SAS improves the efficiency, availability, and scalability of the storage system. It is backward
compatible with SATA for the physical and protocol layers.
⚫ Advantages:
➢ SAS is superior to SCSI in its transmission rate, anti-interference, and longer connection
distances.
⚫ Disadvantages:
➢ SAS disks are more expensive.
Fibre Channel port:
⚫ Fibre Channel was originally designed for network transmission rather than disk ports. It has
gradually been applied to disk systems in pursuit of higher speed.
⚫ Advantages:
➢ Easy to upgrade. Supports optical fiber cables with a length over 10 km.
➢ Large bandwidth
➢ Strong universality
⚫ Disadvantages:
➢ High cost
➢ Complex to build
2.1.5 SSD
2.1.5.1 SSD Overview
Traditional disks use magnetic materials to store data, but SSDs use NAND flash with cells as storage
units. NAND flash is a non-volatile random access storage medium that can retain stored data after
the power is turned off. It quickly and compactly stores digital information.
SSDs eliminate high-speed rotational components for higher performance, lower power
consumption, and zero noise.
SSDs do not have mechanical parts, but this does not mean that they have an infinite life cycle.
Because NAND flash is a non-volatile medium, original data must be erased before new data can be
written. However, there is a limit to how many times each cell can be erased. Once the limit is
reached, data reads and writes become invalid on that cell.
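A commonly used back-of-the-envelope estimate of SSD endurance follows from this erase limit: total bytes that can be written ≈ capacity × rated program/erase cycles ÷ write amplification. The sketch below uses assumed values (a 1 TB drive, 3,000 P/E cycles, a write amplification factor of 2) purely for illustration; they are not figures from this guide.

```python
# Rough endurance estimate (a common rule of thumb, not a vendor formula):
# TBW ≈ capacity * rated P/E cycles / write amplification factor (WAF).
def estimate_tbw(capacity_tb: float, pe_cycles: int, waf: float) -> float:
    return capacity_tb * pe_cycles / waf

# Assumed example: 1 TB TLC drive, 3,000 P/E cycles, WAF of 2.
print(f"Estimated endurance: {estimate_tbw(1.0, 3000, 2.0):.0f} TBW")
```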
⚫ Every 768 pages form a block. Every 1478 blocks form a plane.
⚫ A flash chip consists of two planes, with one storing blocks with odd sequence numbers and the
other storing even sequence numbers. The two planes can be operated concurrently.
ECC must be performed on the data stored in NAND flash, so a NAND flash page is not exactly 16 KB;
it carries an extra group of bytes. For example, a nominal 16 KB page actually contains 16,384 +
1,952 bytes: the 16,384 bytes store data, and the 1,952 bytes store the ECC check codes for that
data.
2.1.5.6 Address Mapping Management
The logical block address (LBA) may refer to an address of a data block or the data block that the
address indicates.
PBA: physical block address
The host accesses the SSD through the LBA. Each LBA generally represents a sector of 512 bytes. The
host OS accesses the SSD in units of 4 KB. The basic unit for the host to access the SSD is called host
page.
The flash page of an SSD is the basic unit for the SSD controller to access the flash chip, which is also
called the physical page. Each time the host writes a host page, the SSD controller writes it to a
physical page and records their mapping relationship.
When the host reads a host page, the SSD finds the requested data according to the mapping
relationship.
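A minimal sketch of this mapping idea follows. It is an illustration only (a real flash translation layer also manages blocks, wear leveling, and garbage collection): the controller keeps a table from host pages to flash pages, and every rewrite allocates a new flash page and marks the old one invalid, exactly as described in the write process below.

```python
# Minimal sketch of the flash translation layer (FTL) mapping idea:
# host pages are remapped to new flash pages on every write; the old
# flash page is not overwritten, it simply becomes invalid (garbage).
class SimpleFTL:
    def __init__(self) -> None:
        self.mapping: dict[int, int] = {}   # host page -> flash page
        self.invalid_pages: set[int] = set()
        self.next_free_page = 0

    def write(self, host_page: int) -> int:
        old = self.mapping.get(host_page)
        if old is not None:
            self.invalid_pages.add(old)     # old data becomes garbage
        new_page = self.next_free_page      # always write to a fresh page
        self.next_free_page += 1
        self.mapping[host_page] = new_page
        return new_page

    def read(self, host_page: int) -> int:
        return self.mapping[host_page]      # look up the physical page

ftl = SimpleFTL()
ftl.write(7)                                # host page A -> flash page 0
ftl.write(7)                                # rewrite: A -> page 1, page 0 is garbage
print(ftl.read(7), ftl.invalid_pages)       # 1 {0}
```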
2.1.5.7 SSD Read and Write Process
SSD write process:
⚫ The SSD controller connects to eight flash dies through eight channels. For better explanation,
the figure shows only one block in each die. Each 4 KB square in the blocks represents a page.
➢ The host writes 4 kilobytes to the block of channel 0 to occupy one page.
➢ The host continues to write 16 kilobytes. This example shows 4 kilobytes being written to
each block of channels 1 through 4.
➢ The host continues to write data to the blocks until all blocks are full.
⚫ When the blocks on all channels are full, the SSD controller selects a new block to write data in
the same way.
⚫ Green indicates valid data and red indicates invalid data. Unnecessary data in the blocks
becomes aged or invalid, and its mapping relationship is replaced.
⚫ For example, host page A was originally stored in flash page X, and the mapping relationship
was A to X. Later, the host rewrites the host page. Flash memory does not overwrite data, so the
SSD writes the new data to a new page Y, establishes the new mapping relationship of A to Y,
and cancels the original mapping relationship. The data in page X becomes aged and invalid,
which is also known as garbage data.
⚫ The host continues to write data to the SSD until it is full. In this case, the host cannot write
more data unless the garbage data is cleared.
SSD read process:
⚫ An 8-fold increase in read speed depends on whether the read data is evenly distributed in the
blocks of each channel. If the 32 KB data is stored in the blocks of channels 1 through 4, the
read speed can only support a 4-fold improvement at most. That is why smaller files are
transmitted at a slower rate.
⚫ The other is parity. Parity data is additional information calculated using user data. For a RAID
array that uses parity, an additional parity disk is required. The XOR (symbol: ⊕) algorithm is
used for parity.
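The following minimal sketch (illustrative block contents, not a real RAID implementation) shows how XOR parity works: the parity block is the XOR of the data blocks, and XOR-ing the parity with the surviving blocks rebuilds any single lost block.

```python
# Minimal XOR-parity sketch: parity = D0 ^ D1 ^ D2, and any one lost
# block can be rebuilt by XOR-ing the parity with the surviving blocks.
def xor_blocks(*blocks: bytes) -> bytes:
    result = bytearray(blocks[0])
    for blk in blocks[1:]:
        for i, b in enumerate(blk):
            result[i] ^= b
    return bytes(result)

d0, d1, d2 = b"\x11" * 4, b"\x22" * 4, b"\x44" * 4
parity = xor_blocks(d0, d1, d2)

# Simulate losing d1 and rebuilding it from the parity and the other blocks.
rebuilt_d1 = xor_blocks(parity, d0, d2)
assert rebuilt_d1 == d1
```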
2.2.1.2 RAID 0
RAID 0, also referred to as striping, provides the best storage performance among all RAID levels.
RAID 0 uses the striping technology to distribute data to all disks in a RAID array.
RAID 1, in contrast, mirrors data: the amount of data stored in a RAID 1 array is only equal to the
capacity of a single disk, and a copy of the data is retained on another disk. That is, each gigabyte of
data requires 2 gigabytes of disk space. Therefore, a RAID 1 array consisting of two disks has a space
utilization of 50%.
Currently, RAID 6 is implemented in different ways. Different methods are used for obtaining parity
data.
RAID 6 DP
⚫ The second parity data block is obtained by performing an XOR operation on diagonal data
blocks in the array. The process of selecting data blocks is relatively complex: DP 0 is obtained by
XORing D 0 (disk 1, stripe 0), D 5 (disk 2, stripe 1), D 10 (disk 3, stripe 2), and D 15 (disk 4, stripe
3); DP 1 is obtained by XORing D 1 (disk 2, stripe 0), D 6 (disk 3, stripe 1), D 11 (disk 4, stripe 2),
and P 3 (first parity disk, stripe 3); DP 2 is obtained by XORing D 2 (disk 3, stripe 0), D 7 (disk 4,
stripe 1), P 2 (first parity disk, stripe 2), and D 12 (disk 1, stripe 3). Therefore, DP 0 = D 0 ⊕ D 5 ⊕
D 10 ⊕ D 15, DP 1 = D 1 ⊕ D 6 ⊕ D 11 ⊕ P 3, and so on (see the sketch after this list).
⚫ A RAID 6 array tolerates failures of up to two disks.
⚫ A RAID 6 array provides relatively poor performance regardless of whether DP or P+Q is
implemented. Therefore, RAID 6 applies to the following two scenarios:
➢ Data is critical and should consistently remain online and available.
➢ Large-capacity (generally > 2 TB) disks are used. The reconstruction of a large-capacity disk
takes a long time, and data would be inaccessible for a long time if two disks failed at the
same time. A RAID 6 array tolerates the failure of another disk during the reconstruction of
one disk. Some enterprises therefore prefer a dual-redundancy RAID array for their large-
capacity disks.
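The sketch below reproduces the diagonal-parity rule described in the list above, assuming the 4-data-disk, 4-stripe layout implied by the example (data block D(4s + d) sits on data disk d of stripe s, and each stripe also has one row-parity block P(s)). It illustrates the RAID 6 DP idea, not Huawei's implementation.

```python
# Sketch of the RAID 6 DP (diagonal parity) example above, assuming a
# 4-data-disk / 4-stripe layout where D(4*s + d) = data[s][d], plus one
# row-parity column P(s) per stripe.
def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(blocks[0])
    for blk in blocks[1:]:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

BLOCK = 4  # illustrative block size in bytes
data = [[bytes([16 * s + d]) * BLOCK for d in range(4)] for s in range(4)]

# Row parity: P(s) = XOR of the four data blocks in stripe s.
row_parity = [xor_blocks(*data[s]) for s in range(4)]

# Diagonal parity: move one column to the right per stripe, wrapping through
# the row-parity column, e.g. DP0 = D0^D5^D10^D15 and DP1 = D1^D6^D11^P3.
rows = [data[s] + [row_parity[s]] for s in range(4)]      # 5 columns per stripe
diag_parity = [xor_blocks(*[rows[s][(k + s) % 5] for s in range(4)])
               for k in range(3)]
```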
2.2.1.7 RAID 10
For most enterprises, RAID 0 is not really a practical choice, while RAID 1 is limited by disk capacity
utilization. RAID 10 provides the optimal solution by combining RAID 1 and RAID 0. In particular,
RAID 10 provides superior performance because random writes incur no parity-related write penalty.
A RAID 10 array consists of an even number of disks. User data is written to half of the disks and
mirror copies of user data are retained in the other half of disks. Mirroring is performed based on
stripes.
If one disk in each of the two RAID 1 sub-arrays fails (such as disk 2 and disk 4), access to data in the
RAID 10 array remains normal, because intact copies of the data on the faulty disks 2 and 4 are
retained on the other two disks (disk 3 and disk 1). However, if both disks in the same RAID 1
sub-array (such as disks 1 and 2) fail at the same time, data will be inaccessible.
Theoretically, RAID 10 tolerates failures of half of the physical disks. However, in the worst case,
failures of two disks in the same sub-array may also cause data loss. Generally, RAID 10 protects data
against the failure of a single disk.
2.2.1.8 RAID 50
RAID 50 combines RAID 0 and RAID 5. Two RAID 5 sub-arrays form a RAID 0 array. The two RAID 5
sub-arrays are independent of each other. A RAID 50 array requires at least six disks because a RAID
5 sub-array requires at least three disks.
disk is associated with a storage tier. For example, SSDs are associated with the high
performance tier, SAS disks are associated with the performance tier, and NL-SAS disks are
associated with the capacity tier. A storage tier would not exist if there are no disks of the
corresponding type in a disk domain. A disk domain separates an array of disks from another
array of disks for fully isolating faults and maintaining independent performance and storage
resources. RAID levels are not specified when a disk domain is created. That is, data redundancy
protection methods are not specified. Actually, RAID 2.0+ provides more flexible and specific
data redundancy protection methods. The storage space formed by disks in a disk domain is
divided into storage pools of a smaller granularity and hot spare space shared among storage
tiers. The system automatically sets the hot spare space based on the hot spare policy (high,
low, or none) set by an administrator for the disk domain and the number of disks at each
storage tier in the disk domain. In a traditional RAID array, an administrator must specify a
disk as the hot spare disk.
2. Storage Pool and Storage Tier
A storage pool is a storage resource container. The storage resources used by application
servers are all from storage pools.
A storage tier is a collection of storage media providing the same performance level in a storage
pool. Different storage tiers manage storage media of different performance levels and provide
storage space for applications that have different performance requirements.
A storage pool created based on a specified disk domain dynamically allocates CKs from the disk
domain to form CKGs according to the RAID policy of each storage tier for providing storage
resources with RAID protection to applications.
A storage pool can be divided into multiple tiers based on disk types.
When creating a storage pool, a user is allowed to specify a storage tier and related RAID policy
and capacity for the storage pool.
OceanStor storage systems support RAID 1, RAID 10, RAID 3, RAID 5, RAID 50, and RAID 6 and
related RAID policies.
The capacity tier consists of large-capacity SATA and NL-SAS disks. DP RAID 6 is recommended.
3. Disk Group
An OceanStor storage system automatically divides disks of each type in each disk domain into
one or more disk groups (DGs) according to disk quantity.
One DG consists of disks of only one type.
CKs in a CKG are allocated from different disks in a DG.
DGs are internal objects automatically configured by OceanStor storage systems and typically
used for fault isolation. DGs are not presented externally.
4. Logical Drive
A logical drive (LD) is a disk that is managed by a storage system and corresponds to a physical
disk.
5. CK
A chunk (CK) is a disk space of a specified size allocated from a storage pool. It is the basic unit
of a RAID array.
6. CKG
A chunk group (CKG) is a logical storage unit that consists of CKs from different disks in the same
DG based on the RAID algorithm. It is the minimum unit for allocating resources from a disk
domain to a storage pool.
All CKs in a CKG are allocated from the disks in the same DG. A CKG has RAID attributes, which
are actually configured for corresponding storage tiers. CKs and CKGs are internal objects
automatically configured by storage systems. They are not presented externally.
7. Extent
Each CKG is divided into logical storage spaces of a specific and adjustable size called extents.
Extent is the minimum unit (granularity) for migration and statistics of hot data. It is also the
minimum unit for space application and release in a storage pool.
An extent belongs to a volume or LUN. A user can set the extent size when creating a storage
pool. After that, the extent size cannot be changed. Different storage pools may consist of
extents of different sizes, but one storage pool must consist of extents of the same size.
8. Grain
When a thin LUN is created, extents are divided into 64 KB blocks which are called grains. A thin
LUN allocates storage space by grains. Logical block addresses (LBAs) in a grain are consecutive.
Grains are mapped to thin LUNs. A thick LUN does not involve grains.
9. Volume and LUN
A volume is an internal management object in a storage system.
A LUN is a storage unit that can be directly mapped to a host for data reads and writes. A LUN is
the external embodiment of a volume.
A volume organizes all extents and grains of a LUN and applies for and releases extents to
increase and decrease the actual space used by the volume.
In a typical 4:2 RAID 6 array, the capacity utilization is about 67%. The capacity utilization of a
Huawei OceanStor all-flash storage system with 25 disks is improved by 20% on this basis.
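The sketch below models, in simplified form, the objects described above: CKs taken from different disks are bound into a CKG by a RAID policy, and the same arithmetic gives the 4:2 RAID 6 utilization mentioned here. The names and sizes are illustrative simplifications, not Huawei's internal data structures.

```python
# Illustrative model of the RAID 2.0+ objects described above; the names
# and sizes are simplified, not Huawei's internal data structures.
from dataclasses import dataclass

@dataclass
class Chunk:            # CK: fixed-size slice of one physical disk
    disk_id: int
    size_mb: int = 64

@dataclass
class ChunkGroup:       # CKG: CKs from different disks, bound by a RAID policy
    chunks: list
    data_chunks: int
    parity_chunks: int

    def utilization(self) -> float:
        return self.data_chunks / (self.data_chunks + self.parity_chunks)

# A 4+2 RAID 6 CKG built from CKs on six different disks.
ckg = ChunkGroup(chunks=[Chunk(disk_id=i) for i in range(6)],
                 data_chunks=4, parity_chunks=2)
print(f"RAID 6 (4+2) capacity utilization: {ckg.utilization():.0%}")   # ~67%
```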
⚫ The AIX architecture is structured in three layers: SCSI device driver, SCSI middle layer, and SCSI
adaptation driver.
2.3.1.4 SCSI Target Model
Based on the SCSI architecture, a target is divided into three layers: port layer, middle layer, and
device layer.
⚫ A PORT model in a target packages or unpackages SCSI instructions on links. For example, a
PORT can package instructions into FCP, iSCSI, or SAS formats, or unpackage instructions from
those formats.
⚫ A device model in a target serves as a SCSI instruction analyzer. It tells the initiator what device
the current LUN is by processing INQUIRY commands, and processes I/Os through READ/WRITE
commands.
⚫ The middle layer of a target maintains models such as LUN space, task set, and task (command).
There are two ways to maintain LUN space. One is to maintain a global LUN for all PORTs, and
the other is to maintain a LUN space for each PORT.
2.3.1.5 SCSI Protocol and Storage System
The SCSI protocol is the basic protocol used for communication between hosts and storage devices.
The controller sends a signal to the bus processor requesting to use the bus. After the request is
accepted, the controller's high-speed cache sends data. During this process, the bus is occupied by
the controller and other devices connected to the same bus cannot use it. However, the bus
processor can interrupt the data transfer at any time and allow other devices to use the bus for
operations of a higher priority.
A SCSI controller is like a small CPU with its own command set and cache. The special SCSI bus
architecture can dynamically allocate resources to tasks run by multiple devices in a computer. In
this way, multiple tasks can be processed at the same time.
2.3.1.6 SCSI Protocol Addressing
A traditional SCSI controller is connected to a single bus, so only one bus ID is allocated. An
enterprise-level server may be configured with multiple SCSI controllers, so there may be multiple
SCSI buses. In a storage network, each FC HBA or iSCSI network adapter is connected to a bus. A bus
ID must therefore be allocated to each bus to distinguish between them.
To address devices connected to a SCSI bus, SCSI device IDs and LUNs are used. Each device on the
SCSI bus must have a unique device ID. The HBA on the server also has its own device ID: 7. Each bus,
including the bus adapter, supports a maximum of 8 or 16 device IDs. The device ID is used to
address devices and identify the priority of the devices on the bus.
Each storage device may include sub-devices, such as virtual disks and tape drives. So LUN IDs are
used to address sub-devices in a storage device.
A ternary description (bus ID, target device ID, and LUN ID) is used to identify a SCSI target.
The iSCSI protocol encapsulates SCSI commands and block data into TCP packets for transmission
over IP networks. As the transport layer protocol of SCSI, iSCSI uses mature IP network technologies
to implement and extend SAN. The SCSI protocol layer generates CDBs and sends the CDBs to the
iSCSI protocol layer. The iSCSI protocol layer then encapsulates the CDBs into PDUs and transmits
the PDUs over an IP network.
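The toy sketch below illustrates only the nesting order described above (a SCSI CDB inside an iSCSI PDU, carried as the payload of a TCP segment and then an IP packet). The "headers" are placeholder length-prefixed tags, not the real protocol formats.

```python
# Toy illustration of the encapsulation order (CDB -> iSCSI PDU -> TCP -> IP);
# the "headers" here are placeholders, not the real protocol formats.
import struct

def wrap(layer_tag: bytes, payload: bytes) -> bytes:
    # Prefix the payload with a 2-byte tag and a 2-byte length field.
    return layer_tag + struct.pack(">H", len(payload)) + payload

scsi_cdb = bytes([0x28]) + bytes(9)          # READ(10) opcode + dummy fields
iscsi_pdu = wrap(b"IS", scsi_cdb)            # iSCSI PDU carrying the CDB
tcp_segment = wrap(b"TC", iscsi_pdu)         # PDU as TCP payload
ip_packet = wrap(b"IP", tcp_segment)         # segment as IP payload
print(len(ip_packet), "bytes on the wire in this toy example")
```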
2.3.2.2 iSCSI Initiator and Target
The iSCSI communication system inherits some of SCSI's features. The iSCSI communication involves
an initiator that sends I/O requests and a target that responds to the I/O requests and executes I/O
operations. After a connection is set up between the initiator and target, the target controls the
entire process as the primary device.
⚫ There are three types of iSCSI initiators: software-based initiator driver, hardware-based TCP
offload engine (TOE) NIC, and iSCSI HBA. Their performance increases in that order.
⚫ An iSCSI target is usually an iSCSI disk array or iSCSI tape library.
The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators and targets. All
iSCSI nodes are identified by their iSCSI names. This method distinguishes iSCSI names from host
names.
2.3.2.3 iSCSI Architecture
In an iSCSI system, a user sends a data read or write command on a SCSI storage device. The
operating system converts this request into one or multiple SCSI instructions and sends the
instructions to the target SCSI controller card. The iSCSI node encapsulates the instructions and data
into an iSCSI packet and sends the packet to the TCP/IP layer, where the packet is encapsulated into
an IP packet to be transmitted over a network. You can also encrypt the SCSI instructions for
transmission over an insecure network.
Data packets can be transmitted over a LAN or the Internet. The receiving storage controller
restructures the data packets and sends the SCSI control commands and data in the iSCSI packets to
corresponding disks. The disks execute the operation requested by the host or application. For a
data request, data will be read from the disks and sent to the host. The process is completely
transparent to users. Though SCSI instruction execution and data preparation can be implemented
by the network controller software using TCP/IP, the host would have to expend a lot of CPU resources
to process the SCSI instructions and data. If these transactions are processed by dedicated devices, the
impact on system performance will be reduced to a minimum. An iSCSI adapter combines the
functions of an NIC and an HBA. The iSCSI adapter obtains data by blocks, classifies and processes
data using the TCP/IP processing engine, and sends IP data packets over an IP network. In this way,
users can create IP SANs without compromising server performance.
2.3.2.4 FC Protocol
FC can be referred to as the FC protocol, FC network, or FC interconnection. As FC delivers high
performance, it is becoming more commonly used for front-end host access on point-to-point and
switch-based networks. Like TCP/IP, the FC protocol suite also includes concepts from the TCP/IP
protocol suite and the Ethernet, such as FC switching, FC switch, FC routing, FC router, and SPF
routing algorithm.
FC protocol structure:
⚫ FC-0: defines physical connections and selects different physical media and data rates for
protocol operations. This maximizes system flexibility and allows for existing cables and different
technologies to be used to meet the requirements of different systems. Copper cables and
optical cables are commonly used.
⚫ FC-1: defines the 8-bit/10-bit transmission encoding used to balance the transmitted bit
stream. The encoding also serves as a mechanism to transfer data and detect errors. The
excellent transfer capability of 8-bit/10-bit encoding helps reduce component design costs and
ensures optimum transfer density for better clock recovery. Note: 8-bit/10-bit encoding is also
used by IBM ESCON.
⚫ FC-2: includes the following items for sending data over the network:
➢ How data should be split into small frames
➢ How much data should be sent at a time (flow control)
➢ Where frames should be sent (including defining service levels based on applications)
⚫ FC-3: defines advanced functions such as striping (data is transferred through multiple
channels), multicast (one message is sent to multiple targets), and group query (multiple ports
are mapped to one node). When FC-2 defines functions for a single port, FC-3 can define
functions across ports.
⚫ FC-4: maps upper-layer protocols, such as IP, SCSI, and ATM, onto Fibre Channel. The SCSI
mapping is one subset of the FC protocol suite.
Like the Ethernet, FC provides the following network topologies:
⚫ Point-to-point:
➢ The simplest topology that allows direct communication between two nodes (usually a
storage device and a server).
⚫ FC-AL:
➢ Similar to the Ethernet shared bus topology but is in arbitrated loop mode rather than bus
connection mode. Each device is connected to another device end to end to form a loop.
➢ Data frames are transmitted hop by hop in the arbitrated loop and the data frames can be
transmitted only in one direction at any time. As shown in the figure, node A needs to
communicate with node H. After node A wins the arbitration, it sends data frames to node
H. However, the data frames are transmitted clockwise in the sequence of B-C-D-E-F-G-H,
which is inefficient.
Figure 2-9
⚫ Fabric:
➢ Similar to an Ethernet switching topology, a fabric topology is a mesh switching matrix.
➢ The forwarding efficiency is much greater than in FC-AL.
➢ FC devices are connected to fabric switches through optical fibres or copper cables to
implement point-to-point communication between nodes.
FC frees the workstation from the management of every port. Each port manages its own
point-to-point connection to the fabric, and other fabric functions are implemented by FC
switches. On an FC network, there are seven types of ports.
loss is not allowed. To ensure that FCoE runs properly on an Ethernet network, the Ethernet needs
to be enhanced to prevent packet loss. The enhanced Ethernet is called Converged Enhanced
Ethernet (CEE).
A SAS controller can directly control SATA disks. However, SAS disks cannot be used in a SATA
environment, because a SATA controller cannot control SAS disks.
At the protocol layer, SAS includes three types of protocols that are used for data transmission of
different devices.
⚫ The serial SCSI protocol (SSP) is used to transmit SCSI commands.
⚫ The SCSI management protocol (SMP) is used to maintain and manage connected devices.
⚫ The SATA channel protocol (STP) is used for data transmission between SAS and SATA.
When the three protocols operate cooperatively, SAS can be used with SATA and some SCSI devices.
The PCIe protocol features point-to-point connection, high reliability, tree networking, full duplex,
and frame-structure-based transmission.
PCIe protocol layers include the physical layer, data link layer, transaction layer, and application
layer.
⚫ The physical layer in a PCIe bus architecture determines the physical features of the bus. In
future, the performance of a PCIe bus can be further improved by increasing the speed or
changing the encoding or decoding mode. Such changes only affect the physical layer,
facilitating upgrades.
⚫ The data link layer ensures the correctness and reliability of data packets transmitted over a
PCIe bus. It checks whether the data packet encapsulation is complete and correct, adds the
sequence number and CRC code to the data, and uses the ack/nack handshake protocol for
error detection and correction.
⚫ The transaction layer receives read and write requests from the software layer, creates request
packets, and transmits them to the data link layer. This type of packet is called a transaction
layer packet (TLP). The transaction layer also receives response packets coming up from the
data link layer, associates them with the related software requests, and transmits them to the
software layer for processing.
⚫ The application layer is designed by users based on actual needs. Other layers must comply with
the protocol requirements.
2.3.4.2 NVMe Protocol
NVMe is short for Non-Volatile Memory Express. The NVMe standard is oriented to PCIe SSDs. Direct
connection from the native PCIe channel to the CPU can avoid the latency caused by communication
between the external controller (PCH) of the SATA and SAS interface and the CPU.
In terms of the entire storage process, NVMe serves not only as a logical device interface but also as
a command set and protocol specification. It exploits the low latency and parallelism of PCIe
channels, together with the parallelism of contemporary processors, platforms, and applications, to
greatly improve the read and write performance of SSDs at controllable cost, and it avoids the
latency introduced by the Advanced Host Controller Interface (AHCI), which constrained SSD
performance in the SATA era.
NVMe protocol stack:
⚫ In terms of the transmission path, I/Os of a SAS all-flash array are transmitted from the front-
end server to the CPU through the FC/IP front-end interface protocol of a storage device. They
are then transmitted to a SAS chip, a SAS expander, and finally a SAS SSD through PCIe links and
switches.
⚫ The Huawei NVMe-based all-flash storage system supports end-to-end NVMe. Data I/Os are
transmitted from a front-end server to the CPU through a storage device's FC-NVMe/NVMe
Over RDMA front-end interface protocol. Back-end data is transmitted directly to NVMe-based
SSDs through 100 Gbit/s RDMA. The CPU of the NVMe-based all-flash storage system appears to
communicate directly with NVMe SSDs via a shorter transmission path, providing higher
transmission efficiency and a lower transmission latency.
⚫ In terms of software protocol parsing, SAS- and NVMe-based all-flash storage systems differ
greatly in protocol interaction for data writes. If the SAS back-end SCSI protocol is used, four
protocol interactions are required for a complete data write operation. Huawei NVMe-based all-
flash storage systems require only two protocol interactions, making them twice as efficient as
SAS-based all-flash storage systems in terms of processing write requests.
Advantages of NVMe:
⚫ Low latency: Data is not read from registers when commands are executed, resulting in a low
I/O latency.
⚫ High bandwidth: A PCIe x4 link can provide close to 4 GB/s of throughput for a single drive (see
the calculation sketch after this list).
⚫ High IOPS: NVMe increases the maximum queue depth from 32 to 64,000. The IOPS of SSDs is
also greatly improved.
⚫ Low power consumption: The automatic switchover between power consumption modes and
dynamic power management greatly reduce power consumption.
⚫ Wide driver applicability: The driver applicability problem between different PCIe SSDs is solved.
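Assuming the bandwidth bullet refers to a PCIe 3.0 x4 link (8 GT/s per lane with 128b/130b encoding; the generation is an assumption, not stated above), the usable bandwidth works out to roughly 4 GB/s.

```python
# Sketch: usable bandwidth of a PCIe 3.0 x4 link (an assumed configuration).
lanes = 4
raw_gt_per_s = 8.0               # PCIe 3.0 signalling rate per lane
encoding_efficiency = 128 / 130  # 128b/130b line coding

gbytes_per_s = lanes * raw_gt_per_s * encoding_efficiency / 8
print(f"~{gbytes_per_s:.1f} GB/s usable bandwidth")   # ~3.9 GB/s
```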
Huawei OceanStor Dorado all-flash storage systems use NVMe-oF to implement SSD resource
sharing, and provide 32 Gbit/s FC-NVMe and NVMe over 100 Gbit/s RDMA networking designs. In
this way, the same network protocol is used for front-end network connection, back-end disk
enclosure connection, and scale-out controller interconnection.
RDMA uses related hardware and network technologies to enable NICs of servers to directly read
memory, achieving high bandwidth, low latency, and low resource consumption. However, the
RDMA-dedicated IB network architecture is incompatible with a live network, resulting in high costs.
RoCE effectively solves this problem. RoCE is a network protocol that uses the Ethernet to carry
RDMA. There are two versions of RoCE. RoCEv1 is a link layer protocol and cannot be used in
different broadcast domains. RoCEv2 is a network layer protocol and can implement routing
functions.
(switch). The NIC should support iWARP (if CPU offload is used). Otherwise, the entire iWARP stack
can be implemented in software, and most RDMA performance advantages are lost.
2.3.5.2 IB Protocol
IB technology is specifically designed for server connections and is widely used for communication
between servers (for example, replication and distributed working), between a server and a storage
device (for example, SAN and DAS), and between a server and a network (for example, LAN, WAN,
and the Internet).
IB defines a set of devices used for system communication, including channel adapters, switches,
and routers used to connect to other devices, such as host channel adapters (HCAs) and target
channel adapters (TCAs). The IB protocol has the following features:
⚫ Standard-based protocol: IB was designed by the InfiniBand Trade Association, which was
founded in 1999 and comprised 225 companies. Main members of the association include
Agilent, Dell, HP, IBM, InfiniSwitch, Intel, Mellanox, Network Appliance, and Sun Microsystems.
More than 100 other members help develop and promote the standard.
⚫ Speed: IB provides high speeds.
⚫ Memory: Servers that support IB use HCAs to convert the IB protocol to the PCI-X or PCI-Xpress
bus inside the server. The HCA supports RDMA and is also called kernel bypass. RDMA fits
clusters well. It uses a virtual addressing solution to let a server identify and use memory
resources from other servers without involving any operating system kernels.
⚫ RDMA helps implement transport offload. The transport offload function transfers data packet
routing from the OS to the chip level, reducing the service load of the processor. An 80 GHz
processor is required to process data at a transmission speed of 10 Gbit/s in the OS.
The IB system includes channel adapters (CAs), switches, routers, repeaters, and the links connecting
them. CAs include HCAs and TCAs.
⚫ An HCA is used to connect a host processor to the IB structure.
⚫ A TCA is used to connect I/O adapters to the IB structure.
IB in storage: The IB front-end network is used to exchange data with customers. Data is transmitted
based on the IPoIB protocol. The IB back-end network is used for data interaction between nodes in
a storage device. The RPC module uses RDMA to synchronize data between nodes.
IB layers include the application layer, transport layer, network layer, link layer, and physical layer.
The functions of each layer are described as follows:
⚫ Transport layer: responsible for in-order distribution and segmentation of packets, channel
multiplexing, and data transmission. It also sends, receives, and reassembles data packet
segments.
⚫ Network layer: provides a mechanism for routing packets from one substructure to another.
Each routing packet of the source and destination nodes has a global routing header (GRH) and
a 128-bit IPv6 address. A standard global 64-bit identifier is also embedded at the network layer
and this identifier is unique in all subnets. Through the exchange of such identifier values, data
can be transmitted across multiple subnets.
⚫ Link layer: provides such functions as packet design, point-to-point connection, and packet
switching in the local subsystems. At the packet communication level, two special packet types
are specified: data transmission and network management packets. The network management
packet provides functions like operation control, subnet indication, and fault tolerance for
device enumeration. The data transmission packet is used for data transmission. The maximum
size of each packet is 4 KB. In each specific device subnet, the direction and exchange of each
packet are implemented by a local subnet manager with a 16-bit identifier address.
⚫ Physical layer: defines connections at three rates: 1X, 4X, and 12X. The signal transmission rates
are 2.5 Gbit/s, 10 Gbit/s, and 30 Gbit/s, respectively. IBA therefore allows multiple connections
to reach a speed of up to 30 Gbit/s. Because full-duplex serial communication is used, a 1X
bidirectional connection requires only four signal wires, and a 12X connection requires only 48.
The NDMP protocol is designed for the data backup system of NAS devices. It enables NAS devices to
send data directly to the connected disk devices or the backup servers on the network for backup,
without any backup client agent being required.
There are two networking modes for NDMP:
⚫ On a 2-way network, backup media is connected directly to a NAS storage system instead of to a
backup server. In a backup process, the backup server sends a backup command to the NAS
storage system through the Ethernet. The system then directly backs up data to the tape library
it is connected to.
➢ In the NDMP 2-way backup mode, data flows are transmitted directly to backup media,
greatly improving the transmission performance and reducing server resource usage.
However, a tape library is connected to a NAS storage device, so the tape library can back
up data only for the NAS storage device to which it is connected.
➢ Tape libraries are expensive. To enable different NAS storage devices to share tape
devices, NDMP also supports the 3-way backup mode.
⚫ In the 3-way backup mode, a NAS storage system can transfer backup data to a NAS storage
device connected to a tape library through a dedicated backup network. Then, the storage
device backs up the data to the tape library.
Both controllers work at the same time. Each connects to all back-end buses, but each bus is
managed by only one controller, so each controller manages half of the back-end buses. If one
controller is faulty, the other takes over all buses. This is more efficient than the Active-Standby
mode.
Mid-range Storage Architecture Evolution:
⚫ Mid-range storage systems always use an independent dual-controller architecture. Controllers
are usually of modular hardware.
⚫ The evolution of mid-range storage mainly focuses on the rate of host interfaces and disk
interfaces, and the number of ports.
⚫ The common form factor is the convergence of SAN and NAS storage services.
Multi-controller Storage:
⚫ Most mission-critical storage systems use multi-controller architecture.
⚫ The main architecture models are as follows:
➢ Bus architecture
➢ Hi-Star architecture
➢ Direct-connection architecture
➢ Virtual matrix architecture
Mission-critical storage architecture evolution:
⚫ In 1990, EMC launched Symmetrix, a full bus architecture. A parallel bus connected front-end
interface modules, cache modules, and back-end disk interface modules for data and signal
exchange in time-division multiplexing mode.
⚫ In 2000, HDS adopted the switching architecture for Lightning 9900 products. Front-end
interface modules, cache modules, and back-end disk interface modules were connected on two
redundant switched networks, increasing communication channels to dozens of times more
than that of the bus architecture. The internal bus was no longer a performance bottleneck.
⚫ In 2003, EMC launched the DMX series based on a full mesh architecture. All modules were
connected in point-to-point mode, theoretically providing higher internal bandwidth, but this
added system complexity and limited scalability.
⚫ In 2009, to reduce hardware development costs, EMC launched the distributed switching
architecture by connecting a separated switch module to the tightly coupled dual-controller of
mid-range storage systems. This achieved a balance between costs and scalability.
⚫ In 2012, Huawei launched the Huawei OceanStor 18000 series, a mission-critical storage
product also based on distributed switching architecture.
Storage Software Technology Evolution:
A storage system combines unreliable, low-performance disks into high-reliability, high-performance
storage through effective management. Storage systems also provide data sharing, easy management,
and convenient data protection. Storage system software has evolved from basic RAID and cache, to
data protection features such as snapshot and replication, then to dynamic resource management
that improves data management efficiency, and to deduplication and tiered storage that improve
storage efficiency.
Distributed Storage Architecture:
⚫ A distributed storage system organizes local HDDs and SSDs of general-purpose servers into a
large-scale storage resource pool, and then distributes data to multiple data storage servers.
⚫ Huawei's current distributed storage follows the approach pioneered by Google: a distributed file
system is built across multiple servers, and storage services are implemented on top of that file system.
⚫ Most storage nodes are general-purpose servers. Huawei OceanStor 100D is compatible with
multiple general-purpose x86 servers and Arm servers.
➢ Protocol: storage protocol layer. The block, object, HDFS, and file services support local
mounting access over iSCSI or VSC, S3/Swift access, HDFS access, and NFS access
respectively.
➢ VBS: block access layer of FusionStorage Block. User I/Os are delivered to VBS over iSCSI or
SCSI.
➢ EDS-B: provides block services with enterprise features, and receives and processes I/Os
from VBS.
➢ EDS-F: provides the HDFS service.
➢ Metadata Controller (MDC): The metadata control device controls distributed cluster node
status, data distribution rules, and data rebuilding rules.
➢ Object Storage Device (OSD): the component that stores user data in the distributed cluster.
➢ Cluster Manager (CM): manages cluster information.
⚫ Port consistency: In a loop, the EXP (P1) port of an upper-level disk enclosure is connected to
the PRI (P0) port of a lower-level disk enclosure.
⚫ Dual-plane networking: Expansion board A connects to controller A, while expansion board B
connects to controller B.
⚫ Symmetric networking: On controllers A and B, symmetric ports and slots are connected to the
same disk enclosure.
⚫ Forward connection networking: Both expansion modules A and B use forward connection.
⚫ Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed the upper
limit.
IP scale-out is used for Huawei OceanStor V3 and V5 entry-level and mid-range series, Huawei
OceanStor V5 Kunpeng series, and Huawei OceanStor Dorado V6 series. IP scale-out integrates
TCP/IP, Remote Direct Memory Access (RDMA), and Internet Wide Area RDMA Protocol (iWARP) to
implement service switching between controllers, which complies with the all-IP trend of the data
center network.
PCIe scale-out is used for Huawei OceanStor 18000 V3 and V5 series, and Huawei OceanStor Dorado
V3 series. PCIe scale-out integrates PCIe channels and the RDMA technology to implement service
switching between controllers.
PCIe scale-out: features high bandwidth and low latency.
IP scale-out: employs standard data center technologies (such as ETH, TCP/IP, and iWARP) and
infrastructure, and boosts the development of Huawei's proprietary chips for entry-level and mid-
range products.
Next, let's move on to I/O read and write processes of the host. The scenarios are as follows:
⚫ Local Write Process
➢ A host delivers write I/Os to engine 0.
➢ Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message indicating that data is written successfully.
➢ Engine 0 flushes dirty data onto a disk. If the target disk is managed by the local engine,
engine 0 directly delivers the write I/Os.
➢ If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (engine 1
for example) where the disk resides.
➢ Engine 1 writes dirty data onto disks.
⚫ Non-local Write Process
➢ A host delivers write I/Os to engine 2.
➢ After detecting that the LUN is owned by engine 0, engine 2 transfers the write I/Os to
engine 0.
➢ Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message to engine 2, indicating that data is written successfully.
➢ Engine 2 returns the write success message to the host.
➢ Engine 0 flushes dirty data onto a disk. If the target disk is managed by the local engine,
engine 0 directly delivers the write I/Os.
➢ If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (engine 1
for example) where the disk resides.
➢ Engine 1 writes dirty data onto disks.
⚫ Local Read Process
➢ Upon reception of host I/Os, the FIM directly distributes the I/Os to appropriate
controllers.
⚫ Full interconnection among controllers
➢ Controllers in a controller enclosure are connected by 100 Gbit/s (40 Gbit/s for Dorado
3000 V6) RDMA links on the backplane.
➢ For scale-out to multiple controller enclosures, any two controllers can be directly
connected to avoid data forwarding.
⚫ Back-end full interconnection
➢ Dorado 8000 and 18000 V6 support BIMs, which allow a smart disk enclosure to be
connected to two controller enclosures and accessed by eight controllers simultaneously.
This technique, together with continuous mirroring, allows the system to tolerate failure of
7 out of 8 controllers.
➢ Dorado 3000, 5000, and 6000 V6 do not support BIMs. Disk enclosures connected to
Dorado 3000, 5000, and 6000 V6 can be accessed by only one controller enclosure.
Continuous mirroring is not supported.
The storage system supports three types of disk enclosures: SAS, smart SAS, and smart NVMe.
Currently, they cannot be used together on one storage system. Smart SAS and smart NVMe disk
enclosures use the same networking mode. In this mode, a controller enclosure uses the shared 2-
port 100 Gbit/s RDMA interface module to connect to a disk enclosure. Each interface module
connects to the four controllers in the controller enclosure through PCIe 3.0 x16. In this way, each
disk enclosure can be simultaneously accessed by all four controllers, achieving full interconnection
between the disk enclosure and the four controllers. A smart disk enclosure has two groups of uplink
ports and can connect to two controller enclosures at the same time. This allows the two controller
enclosures (eight controllers) to simultaneously access a disk enclosure, implementing full
interconnection between the disk enclosure and eight controllers. When full interconnection
between disk enclosures and eight controllers is implemented, the system can use continuous
mirroring to tolerate failure of 7 out of 8 controllers without service interruption.
Huawei storage provides E2E global resource sharing:
⚫ Symmetric architecture
➢ All products support host access in active-active mode. Requests can be evenly distributed
to each front-end link.
➢ They eliminate LUN ownership by controllers, making LUNs easier to use and balancing
loads. They accomplish this by dividing a LUN into multiple slices that are then evenly
distributed to all controllers using the DHT algorithm (see the sketch after this list).
➢ Mission-critical products reduce latency with intelligent FIMs that divide host I/Os into LUN
slices and send the requests to their target controllers.
⚫ Shared port
➢ A single port is shared by four controllers in a controller enclosure.
➢ Loads are balanced without host multipathing.
⚫ Global cache
➢ The system directly writes received I/Os (in one or two slices) to the cache of the
corresponding controller and sends an acknowledgement to the host.
➢ The intelligent read cache of all controllers participates in prefetch and cache hit of all LUN
data and metadata.
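The slice-based distribution described above can be illustrated with a minimal Python sketch. The 64 MB slice size and the hash over LUN ID and LBA follow the description in this guide; the hash function and virtual node count used here are simplifications chosen for illustration, not Huawei's actual implementation.

import hashlib

SLICE_SIZE = 64 * 1024 * 1024     # 64 MB slices, as described above
NUM_VIRTUAL_NODES = 8             # assumed virtual node count for illustration

def virtual_node_for(lun_id: int, lba: int) -> int:
    """Map an I/O to a virtual node by hashing (LUN ID, slice index)."""
    slice_index = lba // SLICE_SIZE
    key = f"{lun_id}:{slice_index}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_VIRTUAL_NODES

# Two LBAs inside the same 64 MB slice land on the same virtual node.
print(virtual_node_for(lun_id=3, lba=0), virtual_node_for(lun_id=3, lba=4096))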
FIMs of Huawei OceanStor Dorado 8000 and 18000 V6 series storage adopt Huawei-developed
Hi1822 chip to connect to all controllers in a controller enclosure via four internal links and each
front-end port provides a communication link for the host. If any controller restarts during an
upgrade, services are seamlessly switched to the other controller without impacting hosts and
interrupting links. The host is unaware of controller faults. Switchover is completed within 1 second.
The FIM has the following features:
⚫ Failure of a controller will not disconnect the front-end link, and the host is unaware of the
controller failure.
⚫ When a controller fails, the PCIe link between the FIM and that controller is disconnected,
allowing the FIM to detect the controller failure.
⚫ Service switchover is performed between the controllers, and the FIM redistributes host
requests to other controllers.
⚫ The switchover time is about 1 second, which is much shorter than switchover performed by
multipathing software (10-30s).
In global cache mode, host data is directly written into linear space logs, and the logs directly copy
the host data to the memory of multiple controllers using RDMA based on a preset copy policy. The
global cache consists of two parts:
⚫ Global memory: memory of all controllers (four controllers in the figure). This is managed in a
unified memory address, and provides linear address space for the upper layer based on a
redundancy configuration policy.
⚫ WAL: the write cache for new writes, organized as a log (write-ahead log)
The global pool uses RAID 2.0+, full-strip write of new data, and shared RAID groups between
multiple strips.
Another feature is back-end sharing, which includes sharing of back-end interface modules within an
enclosure and cross-controller enclosure sharing of back-end disk enclosures.
Active-Active Architecture with Full Load Balancing:
⚫ Even distribution of unhomed LUNs
➢ Data on LUNs is divided into 64 MB slices. The slices are distributed to different virtual
nodes based on the hash result (LUN ID + LBA).
⚫ Front-end load balancing
➢ UltraPath selects appropriate physical links to send each slice to the corresponding virtual
node.
➢ The front-end interconnect I/O modules forward the slices to the corresponding virtual
nodes.
➢ Front-end: If there is no UltraPath or FIM, the controllers forward I/Os to the
corresponding virtual nodes.
⚫ Global write cache load balancing
➢ The data volume is balanced.
➢ Data hotspots are balanced.
⚫ Global storage pool load balancing
➢ Usage of disks is balanced.
➢ The wear degree and lifecycle of disks are balanced.
➢ Data is evenly distributed.
➢ Hotspot data is balanced.
⚫ Three cache copies
For a smart disk array, the controller provides RAID and large-capacity cache, enables the disk array
to have multiple functions, and is equipped with dedicated management software.
2.5.2 NAS
Enterprises need to store a large amount of data and share the data through a network. Therefore,
network-attached storage (NAS) is a good choice. NAS connects storage devices to the live network
and provides data and file services.
For a server or host, NAS is an external device and can be flexibly deployed through the network. In
addition, NAS provides file-level sharing rather than block-level sharing, which makes it easier for
clients to access NAS over the network. UNIX and Microsoft Windows users can seamlessly share
data through NAS or File Transfer Protocol (FTP). When NAS sharing is used, UNIX uses NFS and
Windows uses CIFS.
NAS has the following characteristics:
⚫ NAS provides storage resources through file-level data access and sharing, enabling users to
quickly share files with minimum storage management costs.
⚫ NAS is a preferred file sharing storage solution that does not require multiple file servers.
⚫ NAS also helps eliminate bottlenecks in user access to general-purpose servers.
⚫ NAS uses network and file sharing protocols for archiving and storage. These protocols include
TCP/IP for data transmission as well as CIFS and NFS for providing remote file services.
A general-purpose server can be used to carry any application and run a general-purpose operating
system. Unlike general-purpose servers, NAS is dedicated to file services and provides file sharing
services for other operating systems using open standard protocols. NAS devices are optimized
based on general-purpose servers in aspects such as file service functions, storage, and retrieval. To
improve the high availability of NAS devices, some NAS vendors also support the NAS clustering
function.
The components of a NAS device are as follows:
⚫ NAS engine (CPU and memory)
⚫ One or more NICs that provide network connections, for example, GE NIC and 10GE NIC.
⚫ An optimized operating system for NAS function management
⚫ NFS and CIFS protocols
⚫ Disk resources that use industry-standard storage protocols, such as ATA, SCSI, and Fibre
Channel
NAS protocols include NFS, CIFS, FTP, HTTP, and NDMP.
⚫ NFS is a traditional file sharing protocol in the UNIX environment. It is a stateless protocol. If a
fault occurs, NFS connections can be automatically recovered.
⚫ CIFS is a traditional file sharing protocol in the Microsoft environment. It is a stateful protocol
based on the Server Message Block (SMB) protocol. If a fault occurs, CIFS connections cannot be
automatically recovered. CIFS is integrated into the operating system and does not require
additional software. Moreover, CIFS sends only a small amount of redundant information, so it
has higher transmission efficiency than NFS.
⚫ FTP is one of the protocols in the TCP/IP protocol suite. It consists of two parts: FTP server and
FTP client. The FTP server is used to store files. Users can use the FTP client to access resources
on the FTP server through FTP.
⚫ Hypertext Transfer Protocol (HTTP) is an application-layer protocol used to transfer hypermedia
documents (such as HTML). It is designed for communication between a Web browser and a
Web server, but can also be used for other purposes.
⚫ Network Data Management Protocol (NDMP) provides an open standard for NAS network
backup. NDMP enables data to be directly written to tapes without being backed up by backup
servers, improving the speed and efficiency of NAS data protection.
Working principles of NFS: Like other file sharing protocols, NFS also uses the C/S architecture.
However, NFS provides only the basic file processing function and does not provide any TCP/IP data
transmission function. The TCP/IP data transmission function can be implemented only by using the
Remote Procedure Call (RPC) protocol. NFS file systems are completely transparent to clients.
Accessing files or directories in an NFS file system is the same as accessing local files or directories.
One program can use RPC to request a service from a program located in another computer over a
network without having to understand the underlying network protocols. RPC assumes the existence
of a transmission protocol such as Transmission Control Protocol (TCP) or User Datagram Protocol
(UDP) to carry the message data between communicating programs. In the OSI network
communication model, RPC traverses the transport layer and application layer. RPC simplifies
development of applications.
RPC works based on the client/server model. The requester is a client, and the service provider is a
server. The client sends a call request with parameters to the RPC server and waits for a response.
On the server side, the process remains in a sleep state until the call request arrives. Upon receipt of
the call request, the server obtains the process parameters, outputs the calculation results, and
sends the response to the client. Then, the server waits for the next call request. The client receives
the response and obtains call results.
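As a minimal illustration of the RPC request/response model described above (not the ONC RPC protocol actually used by NFS), the following Python sketch uses the standard library's XML-RPC modules: the server registers a procedure, and the client calls it as if it were local. The procedure name and port are arbitrary examples.

import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def stat_file(name):
    """A trivial remote procedure that pretends to return file metadata."""
    return {"name": name, "size": 4096}

# Server side: register the procedure and wait for call requests.
server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
server.register_function(stat_file)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send a call request with parameters and wait for the response.
client = ServerProxy("http://127.0.0.1:8000")
print(client.stat_file("report.txt"))    # {'name': 'report.txt', 'size': 4096}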
One of the typical applications of NFS is using the NFS server as internal shared storage in cloud
computing. The NFS client is optimized based on cloud computing to provide better performance
and reliability. Cloud virtualization software (such as VMware) optimizes the NFS client, so that the
VM storage space can be created on the shared space of the NFS server.
Working principles of CIFS: CIFS runs on top of TCP/IP and allows Windows computers to access files
on UNIX computers over a network.
The CIFS protocol applies to file sharing. Two typical application scenarios are as follows:
⚫ File sharing service
➢ CIFS is commonly used in file sharing service scenarios such as enterprise file sharing.
⚫ Hyper-V VM application scenario
➢ SMB can be used to share mirrors of Hyper-V virtual machines promoted by Microsoft. In
this scenario, the failover feature of SMB 3.0 is required to ensure service continuity upon
a node failure and to ensure the reliability of VMs.
2.5.3 SAN
2.5.3.1 IP SAN Technologies
NIC + Initiator software: Host devices such as servers and workstations use standard NICs to connect
to Ethernet switches. iSCSI storage devices are also connected to the Ethernet switches or to the
NICs of the hosts. The initiator software installed on hosts virtualizes NICs into iSCSI cards. The iSCSI
cards are used to receive and transmit iSCSI data packets, implementing iSCSI and TCP/IP
transmission between the hosts and iSCSI devices. This mode uses standard Ethernet NICs and
switches, eliminating the need for adding other adapters. Therefore, this mode is the most cost-
effective. However, the mode occupies host resources when converting iSCSI packets into TCP/IP
packets, increasing host operation overheads and degrading system performance. The NIC + initiator
software mode is applicable to scenarios that require the relatively low I/O and bandwidth
performance for data access.
TOE NIC + initiator software: The TOE NIC processes the functions of the TCP/IP protocol layer, and
the host processes the functions of the iSCSI protocol layer. Therefore, the TOE NIC significantly
improves the data transmission rate. Compared with the pure software mode, this mode reduces
host operation overheads and requires minimal network construction expenditure. This is a trade-off
solution.
iSCSI HBA:
⚫ An iSCSI HBA is installed on the host to implement efficient data exchange between the host
and the switch and between the host and the storage device. Functions of the iSCSI protocol
layer and TCP/IP protocol stack are handled by the host HBA, occupying the least CPU resources.
This mode delivers the best data transmission performance but requires high expenditure.
⚫ The iSCSI communication system inherits part of SCSI's features. The iSCSI communication
involves an initiator that sends I/O requests and a target that responds to the I/O requests and
executes I/O operations. After a connection is set up between the initiator and target, the target
controls the entire process as the primary device. The target includes the iSCSI disk array and
iSCSI tape library.
⚫ The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators and
targets. All iSCSI nodes are identified by their iSCSI names. In this way, iSCSI names are
distinguished from host names.
⚫ iSCSI uses iSCSI Qualified Name (IQN) to identify initiators and targets. Addresses change with
the relocation of initiator or target devices, but their names remain unchanged. When setting
up a connection, an initiator sends a request. After the target receives the request, it checks
whether the iSCSI name contained in the request is consistent with that bound with the target.
If the iSCSI names are consistent, the connection is set up. Each iSCSI node has a unique iSCSI
name. One iSCSI name can be used in the connections from one initiator to multiple targets.
Multiple iSCSI names can be used in the connections from one target to multiple initiators.
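The IQN format mentioned above follows a well-known pattern (iqn.<year>-<month>.<reversed domain>[:<unique string>]). The following Python sketch checks that pattern with a simplified regular expression; it is an illustration, not a complete validator.

import re

# Simplified check of the iSCSI Qualified Name (IQN) format.
IQN_PATTERN = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.-]+(:.+)?$", re.IGNORECASE)

def is_valid_iqn(name: str) -> bool:
    return bool(IQN_PATTERN.match(name))

print(is_valid_iqn("iqn.2012-01.com.example:storage.tgt1"))   # True
print(is_valid_iqn("eui.0123456789abcdef"))                   # False (EUI-64, not IQN)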
Logical ports are created based on bond ports, VLAN ports, or Ethernet ports. Logical ports are
virtual ports that carry host services. A unique IP address is allocated to each logical port for carrying
its services.
⚫ Bond port: To improve reliability of paths for accessing file systems and increase bandwidth, you
can bond multiple Ethernet ports on the same interface module to form a bond port.
⚫ VLAN: VLANs logically divide the physical Ethernet ports or bond ports of a storage system into
multiple broadcast domains. On a VLAN, when service data is being sent or received, a VLAN ID
is configured for the data so that the networks and services of VLANs are isolated, further
ensuring service data security and reliability.
⚫ Ethernet port: Physical Ethernet ports on an interface module of a storage system. Bond ports,
VLANs, and logical ports are created based on Ethernet ports.
IP address failover: A logical IP address fails over from a faulty port to an available port. In this
way, services are switched from the faulty port to the available port without interruption. The
faulty port takes over services back after it recovers. This task can be completed automatically
or manually. IP address failover applies to IP SAN and NAS.
During the IP address failover, services are switched from the faulty port to an available port,
ensuring service continuity and improving the reliability of paths for accessing file systems. Users are
not aware of this process.
The essence of IP address failover is a service switchover between ports. The ports can be Ethernet
ports, bond ports, or VLAN ports.
⚫ Ethernet port–based IP address failover: To improve the reliability of paths for accessing file
systems, you can create logical ports based on Ethernet ports.
Figure 2-10 Ethernet port-based IP address failover
➢ Host services are running on logical port A of Ethernet port A. The corresponding IP
address is "a". Ethernet port A fails and thereby cannot provide services. After IP address
failover is enabled, the storage system will automatically locate available Ethernet port B,
delete the configuration of logical port A that corresponds to Ethernet port A, and create
and configure logical port A on Ethernet port B. In this way, host services are quickly
switched to logical port A on Ethernet port B. The service switchover is executed quickly.
Users are not aware of this process.
⚫ Bond port-based IP address failover: To improve the reliability of paths for accessing file
systems, you can bond multiple Ethernet ports to form a bond port. When an Ethernet port that
is used to create the bond port fails, services are still running on the bond port. The IP address
fails over only when all Ethernet ports that are used to create the bond port fail.
Figure 2-11 Bond port-based IP address failover
➢ Multiple Ethernet ports are bonded to form bond port A. Logical port A created based on
bond port A can provide high-speed data transmission. When both Ethernet ports A and B
fail due to various causes, the storage system will automatically locate bond port B, delete
logical port A, and create the same logical port A on bond port B. In this way, services are
switched from bond port A to bond port B. After Ethernet ports A and B recover, services
will be switched back to bond port A if failback is enabled. The service switchover is
executed quickly, and users are not aware of this process.
⚫ VLAN-based IP address failover: You can create VLANs to isolate different services.
➢ To implement VLAN-based IP address failover, you must create VLANs, allocate a unique ID
to each VLAN, and use the VLANs to isolate different services. When an Ethernet port on a
VLAN fails, the storage system will automatically locate an available Ethernet port with the
same VLAN ID and switch services to the available Ethernet port. After the faulty port
recovers, it takes over the services.
➢ VLAN names, such as VLAN A and VLAN B, are automatically generated when VLANs are
created. The actual VLAN names depend on the storage system version.
➢ Ethernet ports and their corresponding switch ports are divided into multiple VLANs, and
different IDs are allocated to the VLANs. The VLANs are used to isolate different services.
VLAN A is created on Ethernet port A, and the VLAN ID is 1. Logical port A that is created
based on VLAN A can be used to isolate services. When Ethernet port A fails due to various
causes, the storage system will automatically locate VLAN B and the port whose VLAN ID is
1, delete logical port A, and create the same logical port A based on VLAN B. In this way,
the port where services are running is switched to VLAN B. After Ethernet port A recovers,
the port where services are running will be switched back to VLAN A if failback is enabled.
➢ An Ethernet port can belong to multiple VLANs. When the Ethernet port fails, all VLANs will
fail. Services must be switched to ports of other available VLANs. The service switchover is
executed quickly, and users are not aware of this process.
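A minimal Python sketch of the failover decision described above: when the home port fails, the logical port is deleted and re-created on another healthy port carrying the same VLAN ID. The port names and data structures are hypothetical; real storage systems perform this internally.

# Hypothetical model of VLAN-based IP address failover.
ports = {
    "ethA": {"healthy": False, "vlan_id": 1},
    "ethB": {"healthy": True,  "vlan_id": 1},
    "ethC": {"healthy": True,  "vlan_id": 2},
}
logical_port = {"name": "lportA", "ip": "192.168.1.10", "home": "ethA", "vlan_id": 1}

def fail_over(lport, ports):
    """Move the logical port to a healthy port with the same VLAN ID."""
    if ports[lport["home"]]["healthy"]:
        return lport                         # home port is fine, nothing to do
    for name, port in ports.items():
        if port["healthy"] and port["vlan_id"] == lport["vlan_id"]:
            lport["home"] = name             # delete and re-create on the new port
            break
    return lport

print(fail_over(logical_port, ports)["home"])    # ethB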
2.5.3.2 FC SAN Technologies
FC HBA: The FC HBA converts SCSI packets into Fibre Channel packets, which does not occupy host
resources.
Here are some key concepts in Fibre Channel networking:
⚫ Fibre Channel Routing (FCR) provides connectivity to devices in different fabrics without
merging the fabrics. Different from E_Port cascading of common switches, after switches are
connected through an FCR switch, the two fabric networks are not converged and are still two
independent fabrics. The link switch between two fabrics functions as a router.
⚫ FC Router: a switch running the FC-FC routing service.
⚫ EX_Port: a type of port that functions like an E_Port, but does not propagate fabric services or
routing topology information from one fabric to another.
⚫ Backbone fabric: fabric of a switch running the Fibre Channel router service.
⚫ Edge fabric: fabric that connects a Fibre Channel router.
⚫ Inter fabric link (IFL): the link between an E_Port and an EX_Port, or between a VE_Port and a VEX_Port.
Another important concept is zoning. A zone is a set of ports or devices that communicate with each
other. A zone member can only access other members of the same zone. A device can reside in
multiple zones. You can configure basic zones to control the access permission of each device or
port. Moreover, you can set traffic isolation zones. When there are multiple ISLs (E_Ports), an ISL
only transmits the traffic destined for ports that reside in the same traffic isolation zone.
2.5.3.3 Comparison Between IP SAN and FC SAN
First, let's look back on the concept of SAN.
⚫ Protocol: Fibre Channel/iSCSI. The SAN architectures that use the two protocols are FC SAN and
IP SAN.
⚫ Raw device access: suitable for traditional database access.
⚫ Dependence on the application host to provide file access. Share access requires the support of
cluster software, which causes high overheads in processing access conflicts, resulting in poor
performance. In addition, it is difficult to support sharing in heterogeneous environments.
⚫ High performance, high bandwidth, and low latency, but high cost and poor scalability
Then, let's compare FC SAN and IP SAN.
⚫ To solve the poor scalability issue of DAS, storage devices can be networked using FC SAN to
support connection to more than 100 servers.
⚫ IP SAN is designed to address the management and cost challenges of FC SAN. IP SAN requires
only a few hardware configurations and the hardware is widely used. Therefore, the cost of IP
SAN is much lower than that of FC SAN. Most hosts have been configured with appropriate NICs
and switches, which are also suitable (although not perfect) for iSCSI transmission. High-
performance IP SAN requires dedicated iSCSI HBAs and high-end switches.
High reliability:
⚫ Multiple data protection technologies, such as power failure protection, data pre-copy, coffer
disk, and bad sector repair.
High availability:
⚫ Multiple advanced data protection technologies, such as snapshot, LUN copy, remote
replication, clone, volume mirroring, and active-active, and support for the NDMP protocol.
Intelligence and high efficiency:
⚫ Various control and management functions, such as SmartTier, SmartQoS, and SmartThin,
providing refined control and management.
⚫ DeviceManager supports GUI-based operation and management.
⚫ eService provides self-service intelligent O&M.
2.6.2.2 Product Form
A storage system consists of controller enclosures and disk enclosures, providing customers with an
intelligent storage platform that features high reliability, high performance, and large capacity.
Different types of controller enclosures and disk enclosures are configured for different models.
2.6.2.3 Convergence of SAN and NAS
Convergence of SAN and NAS storage technologies: One storage system supports both SAN and NAS
services at the same time and allows SAN and NAS services to share storage device resources. Hosts
can access any LUN or file system through the front-end port of any controller. During the entire
data life cycle, hot data gradually becomes cold data. If cold data occupies the cache or SSDs for a
long time, valuable resources will be wasted, and the long-term performance of the storage system
will be affected. The storage system uses the intelligent storage tiering technology to flexibly
allocate data storage media in the background.
The intelligent tiering technology needs to be deployed on a device with different media types. Data
is monitored in real time. Data that is not accessed for a long time is marked as cold data and is
gradually transferred from high-performance media to low-speed media, ensuring that service
response from devices does not slow down. After being activated, cold data can be quickly moved to
high-performance media, ensuring stable system performance.
Migration policies can be manually or automatically triggered.
2.6.2.4 Support for Multiple Service Scenarios
Huawei OceanStor hybrid flash storage system integrates SAN and NAS and supports multiple
storage protocols. It is used in a wide range of general-purpose scenarios, including but not limited
to government, finance, telecoms, manufacturing, backup, and DR.
2.6.2.5 Application Scenario – Active-Active Data Centers
Load balancing among controllers
RPO = 0 and RTO ≈ 0 for mission-critical services
Convergence of SAN and NAS: SAN and NAS active-active services can be deployed on the same
device. If a single controller is faulty, local switchover is supported.
Solutions that ensure uninterrupted service running for customers
The active-active solution can be used in industries such as healthcare, finance, and social security.
environments. It improves storage deployment, expansion, and operation and maintenance (O&M)
efficiency using general-purpose servers. Typical scenarios include Internet-finance channel access
clouds, development and testing clouds, cloud-based services, B2B cloud resource pools in carriers'
BOM domains, and e-Government clouds.
Mission-critical database
Huawei OceanStor 100D delivers enterprise-grade capabilities, such as distributed active-active
storage and consistent low latency, to ensure efficient and stable running of data warehouses and
mission-critical databases, including online analytical processing (OLAP) and online transaction
processing (OLTP).
Big data analytics
OceanStor 100D provides an industry-leading decoupled storage-compute solution for big data,
which integrates traditional data silos and builds a unified big data resource pool for enterprises. It
also leverages enterprise-grade capabilities, such as elastic large-ratio erasure coding (EC) and on-
demand deployment and expansion of decoupled compute and storage resources, to improve big
data service efficiency and reduce TCO. Typical scenarios include big data analytics for finance,
carriers (log retention), and governments.
Content storage and backup archiving
OceanStor 100D provides high-performance and highly reliable object storage resource pools to
meet large throughput, frequent access to hotspot data, as well as long-term storage and online
access requirements of real-time online services such as Internet data, online audio/video, and
enterprise web disks. Typical scenarios include storage, backup, and archiving of financial electronic
check images, audio and video recordings, medical images, government and enterprise electronic
documents, and Internet of Vehicles (IoV).
Blade server: for example, the Huawei E9000, which integrates computing, storage, and network resources.
The E9000 provides 12 U space for installing Huawei E9000 series server blades, storage nodes, and
capacity expansion nodes.
High-density server: for example, the Huawei X6800 and X6000 with their high node densities. The Huawei
X6800 is a 4 U server with four nodes, and the Huawei X6000 is a 2 U server with four nodes.
Rack server: for example, the Huawei RH series or TaiShan servers. The TaiShan 2280 is a 2 U 2-socket rack
server. It features high-performance computing, large-capacity storage, low power consumption,
easy management, and easy deployment and is designed for Internet, distributed storage, cloud
computing, big data, and enterprise services.
2.6.4.3 Storage System
The distributed storage system is distributed block storage software installed on servers. It
virtualizes the local disks of these servers into a storage resource pool and provides block
storage services.
2.6.4.4 Software Architecture
The hardware for the FusionCube solution includes servers, switches, SSDs, and uninterruptible
power supply (UPS) resources.
The software for the FusionCube solution includes FusionCube Builder, FusionCube Center, DR and
backup software, and distributed storage software.
2.6.4.5 Application Scenario – Edge Data Center Service Scenario
Plug-and-play: Full-stack edge data center, delivery as an integrated cabinet, zero onsite
configuration, and plug-and-play.
Attendance-free: Unified and centralized management of edge data centers, requiring no dedicated
personnel to be in attendance and significantly reducing O&M costs.
Edge-cloud synergy: Enterprise application ecosystem built on the cloud to quickly deliver
applications from the central data center to edge data centers, accelerating customer service
innovation.
into a thin LUN, the read or write request must be redirected to the actual storage area based on the
mapping relationship between the actual storage area and logical storage area.
A mapping table: This table is used to record the mapping between an actual storage area and a
logical storage area. A mapping table is dynamically updated during the write process and is queried
during the read process.
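The mapping table described above can be modeled with a short Python sketch: physical space is allocated only when a logical area is first written, and reads of unallocated areas return nothing. The grain size and data structures are hypothetical simplifications.

GRAIN = 8192                      # assumed allocation granularity in bytes

class ThinLUN:
    def __init__(self):
        self.mapping = {}         # logical grain index -> physical grain index
        self.next_physical = 0    # next free grain in the storage pool

    def write(self, logical_index, data):
        if logical_index not in self.mapping:        # allocate on first write
            self.mapping[logical_index] = self.next_physical
            self.next_physical += 1
        return self.mapping[logical_index]           # where the data actually lands

    def read(self, logical_index):
        # Unallocated areas consume no space; there is nothing to redirect to yet.
        return self.mapping.get(logical_index)

lun = ThinLUN()
print(lun.write(100, b"x" * GRAIN))   # 0: first grain allocated from the pool
print(lun.read(5))                    # None: never written, no space consumed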
3.1.1.4 Application Scenarios
SmartThin allocates storage space on demand. The storage system allocates space to application
servers as needed within a specific quota threshold, eliminating storage resource waste.
SmartThin can be used in the following scenarios:
⚫ SmartThin expands the capacity of banking transaction systems online without interrupting
ongoing services.
⚫ SmartThin dynamically allocates physical storage space on demand to email services and online
storage services.
⚫ SmartThin allows different services provided by a carrier to compete for physical storage space,
optimizing storage configurations.
3.1.1.5 Configuration Process
To use thin LUNs, you need to import and activate the license file of SmartThin on your storage
device.
After a thin LUN is created, if an alarm is displayed indicating that the storage pool has no available
space, you are advised to expand the storage pool as soon as possible. Otherwise, the thin LUN may
enter the write through mode, causing performance deterioration.
3.1.2 SmartTier
3.1.2.1 Overview
SmartTier is also called intelligent storage tiering. It provides the intelligent data storage
management function that automatically matches data to the storage media best suited to that type
of data by analyzing data activities.
SmartTier migrates hot data to storage media with high performance (such as SSDs) and moves idle
data to more cost-effective storage media (such as NL-SAS disks) with more capacity. This provides
hot data with quick response and high input/output operations per second (IOPS), thereby
improving the performance of the storage system.
3.1.2.2 Storage Tiers
In a storage pool, a storage tier is a collection of storage media that all deliver the same level of
performance. SmartTier divides disks into high-performance, performance, and capacity tiers based
on their performance levels. Each storage tier is comprised of the same type of disks and uses the
same RAID level.
(1) High-performance tier
Disk type: SSDs
Disk characteristics: SSDs deliver high IOPS and respond to I/O requests quickly. However, the cost
per unit of storage capacity is high.
Application characteristics: Applications with intensive random access requests are often deployed
at this tier.
Data characteristics: It carries the most active data (hot data).
(2) Performance tier
3.1.3 SmartQoS
3.1.3.1 Overview
SmartQoS dynamically allocates storage resources to meet certain performance goals of specified
applications.
As storage technologies develop, a storage system is capable of providing larger capacities.
Accordingly, a growing number of users choose to deploy multiple applications on one storage
device. Different applications contend for system bandwidth and Input/Output Operations Per
Second (IOPS) resources, compromising the performance of critical applications.
SmartQoS helps users properly use storage system resources and ensures high performance of
critical services.
SmartQoS enables users to set performance indicators like IOPS or bandwidth for certain
applications. The storage system dynamically allocates system resources to meet QoS requirements
of certain applications based on specified performance goals. It gives priority to certain applications
with demanding QoS requirements.
3.1.3.2 I/O Priority Scheduling
The I/O priority scheduling technology of SmartQoS is based on the priorities of LUNs, file systems,
or snapshots. Their priorities are determined by users based on the importance of services deployed
on the LUNs, file systems, or snapshots.
Each LUN or file system has a priority, which is configured by a user and saved in a storage system.
When an I/O request enters the storage system, the storage system gives a priority to the I/O
request based on the priority of the LUN, file system, or snapshot that will process the I/O request.
Then the I/O carries the priority throughout this processing procedure. When system resources are
insufficient, the system preferentially processes high-priority I/Os to improve the performance of
high-priority LUNs, file systems, or snapshots.
I/O priority scheduling is to schedule storage system resources, including CPU compute time and
cache resources.
3.1.3.3 I/O Traffic Control
SmartQoS traffic control consists of I/O class queue management, token distribution, and dequeue
control of controlled objects.
I/O traffic control uses a token-based mechanism. When a user sets an upper limit for the performance
of a traffic control group, that limit is converted into a corresponding number of tokens. If the IOPS
is limited, each I/O operation consumes one token; if the bandwidth is limited, tokens are allocated
per sector of data transferred.
The storage system adjusts the number of tokens in an I/O queue based on the priority of a LUN, file
system, or snapshot. The more tokens a LUN, file system, or snapshot I/O queue has, the more
resources the system allocates to the LUN, file system, or snapshot I/O queue. The storage system
preferentially processes I/O requests in the I/O queue of the LUN, file system, or snapshot.
For example, if a user enables a SmartQoS policy for two LUNs in a storage system and sets
performance objectives in the SmartQoS policy, the storage system can limit the system resources
allocated to the LUNs to reserve more resources for high-priority LUNs. The performance goals
measured by the SmartQoS are bandwidth, IOPS, and latency.
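The token-based traffic control described above behaves like a classic token bucket: tokens accrue at the configured limit, and an I/O proceeds only if enough tokens are available. The following Python sketch is a generic illustration of that idea, not Huawei's implementation; the rate and burst values are arbitrary.

import time

class TokenBucket:
    """Generic token bucket: tokens accrue at `rate` per second up to `burst`."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost      # an IOPS limit costs 1 token per I/O;
            return True              # a bandwidth limit would cost tokens per sector
        return False

limiter = TokenBucket(rate=1000, burst=1000)        # e.g. a 1000 IOPS upper limit
print(sum(limiter.allow() for _ in range(1500)))    # roughly 1000 I/Os admitted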
3.1.3.4 Application Scenarios
SmartQoS dynamically allocates storage resources to ensure performance for critical services and
high-priority users.
1. Ensuring the performance for critical services
If OLTP and archive backup services are concurrently running on the same storage device, both
services need sufficient system resources.
(1) Online Transaction Processing (OLTP) service is a key service and has high requirements on
real-time performance.
(2) Backup service has a large amount of data with fewer requirements on latency.
SmartQoS specifies the performance objectives for different services to ensure the performance
of critical services. You can use either of the following methods to ensure the performance of
critical services: Set I/O priorities to meet the high priority requirements of OLTP services, and
create SmartQoS traffic control policies to meet service requirements.
2. Ensuring the service performance for high-level users
For cost reduction, some users will not build their dedicated storage systems independently.
They prefer to run their storage applications on the storage platforms offered by storage
resource providers. This lowers the total cost of ownership (TCO) and ensures the service
continuity. On such shared storage platforms, services of different types and features contend
for storage resources, so the high-priority users may fail to obtain their desired storage
resources.
SmartQoS creates SmartQoS policies and sets I/O priorities for different subscribers. In this way,
when resources are insufficient, services of high-priority subscribers can be preferentially
processed and their service quality requirements can be met.
3.1.3.5 Configuration Process
Before configuring SmartQoS, read the configuration process and check its license file. The service
monitoring function provided by the storage system can be used to obtain the I/O characteristics of
LUNs or file systems and use them as the basis of SmartQoS policies. SmartQoS policies adjust and
control applications to ensure continuity of critical services.
3.1.4 SmartDedupe
3.1.4.1 Overview
SmartDedupe eliminates redundant data from a storage system. This deduplication technology
reduces the amount of physical storage capacity occupied by data to release more storage capacity
for increasing services.
Huawei OceanStor storage systems provide inline deduplication to deduplicate the data that is newly
written into the storage systems.
Inline deduplication deletes duplicate data before the data is written into disks.
Similarity-based (post-process) deduplication analyzes data that has already been written to disks
based on similarity fingerprints to find duplicate and similar data blocks, and then deduplicates them.
3.1.4.2 Inline Deduplication Working Principles
Deduplication data block size specifies the deduplication granularity.
Fingerprint is a fixed-length binary value that represents a data block. OceanStor Dorado V6 uses a
weak hash algorithm to calculate a fingerprint for any data block. The storage system saves the
mapping relationship between the fingerprints and storage locations of all data blocks in a
fingerprint library.
Inline deduplication working principles
1. The storage system uses a weak hash algorithm to calculate a fingerprint for any data block that
is newly written into the storage system.
2. The storage system checks whether the fingerprint for the new data block is consistent with the
fingerprint in the fingerprint library.
➢ If yes, a byte-by-byte comparison is performed by the storage system.
➢ If no, the storage system determines that the data newly written is a new data block,
writes the new data block to a disk, and records the mapping between the fingerprint and
storage location of the data block in the fingerprint library.
3. The storage system performs a byte-by-byte comparison to check whether the new data block
is consistent with the existing one.
➢ If yes, the storage system determines that the new data block is the same as the existing
one, deletes the data block, and directs the fingerprint and storage location mapping of the
data block to the original data block in the fingerprint library.
➢ If no, the storage system writes the data block to a disk. The processing procedure is the
same as that when the deduplication function is disabled.
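The inline deduplication flow above can be condensed into a small Python sketch. An in-memory dict stands in for the fingerprint library, SHA-1 stands in for the unspecified weak hash, and the byte-by-byte comparison step is kept; none of this reflects the product's internal data structures.

import hashlib

fingerprints = {}    # fingerprint -> storage location (the "fingerprint library")
storage = {}         # storage location -> data block
next_loc = 0

def write_block(data: bytes) -> int:
    """Inline deduplication: return the location where `data` ends up stored."""
    global next_loc
    fp = hashlib.sha1(data).digest()         # stands in for the weak hash
    if fp in fingerprints:
        loc = fingerprints[fp]
        if storage[loc] == data:             # byte-by-byte comparison
            return loc                       # duplicate: point to the existing block
    loc = next_loc
    next_loc += 1
    storage[loc] = data                      # new block: write it to "disk"
    fingerprints[fp] = loc                   # record fingerprint -> location
    return loc

a = write_block(b"A" * 8192)
b = write_block(b"A" * 8192)
print(a == b, len(storage))                  # True 1 -> only one copy is stored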
3.1.4.3 Post-processing Similarity Deduplication Working Principles
An opportunity table saves data blocks' fingerprint and location information for identifying hot
fingerprints.
The process of deleting similar duplicate data is as follows:
1. Write data and calculate the fingerprint and write it to the opportunity table.
A storage system divides the newly written data into blocks, uses the similar fingerprint
algorithm to calculate the similar fingerprint of the newly written data block, writes the data
block to a disk, and writes the fingerprint and location information of the data block to the
opportunity table.
2. After data is written, the storage system periodically performs similarity deduplication.
(1) The storage system periodically checks whether there are similar fingerprints in the
opportunity table.
➢ If yes, it performs byte-by-byte comparison.
➢ If no, it continues the periodic check.
(2) The storage system performs a byte-by-byte comparison to check whether similar blocks
are actually the same.
➢ If they are the same, the storage system determines that the new data block is the
same as the original data block. The storage system deletes the data block, and maps
its fingerprint and storage location to the remaining data block.
➢ If they are just similar, the system performs differential compression on the data blocks,
records their fingerprints in the fingerprint library, updates the metadata of the data blocks
with the fingerprints, and reclaims the space of the redundant data blocks.
3.1.4.4 Application Scenarios
In Virtual Desktop Infrastructure (VDI) applications, users create multiple virtual images on a physical
storage device. These images have a large amount of duplicate data. As the amount of duplicate
data increases, the storage system struggles to keep up with service requirements. SmartDedupe
can delete duplicate data among images to release storage resources and store more service data.
3.1.4.5 Configuration Process
When creating a LUN, you need to select an application type. The deduplication function of the
application type has been configured by default. You can run commands to view the deduplication
and compression status of each application type.
3.1.5 SmartCompression
3.1.5.1 Overview
SmartCompression reorganizes data to save space and improves the data transfer, processing, and
storage efficiency without losing any data. OceanStor Dorado V6 series storage systems support
inline compression and post-process compression.
Inline compression: The system deduplicates and compresses data before writing it to disks. User
data is processed in real time.
Post-process compression: Data is written to disks in advance and then read and compressed when
the system is idle.
3.1.5.2 Working Principles
The LZ77 algorithm is a lossless compression algorithm. It replaces repeated data with a reference
to an identical string that has already been encoded, so compression is achieved by exploiting the
repetition between the current data and previously encoded data.
LZ77 uses a sliding window to implement this. The scan head moves through the string from the
beginning, and a sliding window of length N precedes the scan head. If the string at the scan head
matches a string in the window, the pair (offset in the window, longest match length) is used to
replace the repeated string.
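To make the sliding-window idea concrete, here is a toy LZ77-style encoder and decoder in Python. The window size and the (offset, length, next byte) token format are simplified for illustration and are not the exact variant used by the storage system.

def lz77_encode(data: bytes, window: int = 32):
    """Toy LZ77 encoder: emit (offset, length, next_byte) tokens."""
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):     # search the sliding window
            length = 0
            while (i + length < len(data) and
                   data[j + length] == data[i + length] and length < 255):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_byte = data[i + best_len] if i + best_len < len(data) else None
        tokens.append((best_off, best_len, next_byte))
        i += best_len + 1
    return tokens

def lz77_decode(tokens):
    out = bytearray()
    for off, length, nxt in tokens:
        for _ in range(length):
            out.append(out[-off])                  # copy from the sliding window
        if nxt is not None:
            out.append(nxt)
    return bytes(out)

data = b"abababababx"
tokens = lz77_encode(data)
assert lz77_decode(tokens) == data                 # lossless round trip
print(tokens)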
3.1.5.3 Application Scenarios
More CPU resources will be occupied as the volume of data that is compressed by the storage
system increases.
1. Databases: Data compression is an ideal choice for databases. Many users would be happy to
sacrifice a little performance to recover over 65% of their storage capacity.
2. File services: Data compression is also applied to file services. Peak hours occupy half of the
total service time and the dataset compression ratio of the system is 50%. In this scenario,
SmartCompression slightly decreases the IOPS.
3. Engineering data and seismic geological data: The data has similar requirements to database
backups. This type of data is stored in the same format, but there is not as much duplicate data.
Such data can be compressed to save the storage space.
SmartDedupe + SmartCompression
Deduplication and compression can be used together to save more space. SmartDedupe can be
combined with SmartCompression for data testing or development systems, storage systems with a
file service enabled, and for engineering data systems.
3.1.5.4 Configuration Process
The SmartCompression configuration process includes checking the license and enabling
SmartCompression. Checking the license is to ensure the availability of SmartCompression.
3.1.6 SmartMigration
3.1.6.1 Overview
SmartMigration is a key service migration technology. Service data can be migrated within a storage
system and between different storage systems.
"Consistent" means that after the service migration is complete, all of the service data has been
replicated from a source LUN to a target LUN.
3.1.6.2 Working Principles
SmartMigration synchronizes and splits service data to migrate all data from a source LUN to a target
LUN.
3.1.6.3 SmartMigration Service Data Synchronization
Pair: In SmartMigration, a pair is a source LUN and the target LUN that the data will be migrated to.
A pair can have only one source LUN and one target LUN.
LM modules manage SmartMigration in a storage system.
Dual-write writes changed data to both the source and target LUNs during service data
migration.
A LOG records data changes on the source LUN and is used to determine whether the changed data
must be concurrently written to the target LUN using the dual-write technology.
A data change log (DCL) records differential data that fails to be written to the target LUN during the
data change synchronization.
Service data synchronization between a source LUN and a target LUN includes initial synchronization
and data change synchronization. The two synchronization modes are independent and can be
performed at the same time to ensure that service data changes on the host are synchronized to the
source LUN and the target LUN.
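A minimal Python sketch of dual-write with a data change log (DCL), as described above: a host write during migration goes to both LUNs, and any write that fails to reach the target LUN is recorded so it can be synchronized later. The grain-level model and names are hypothetical.

# Hypothetical model of dual-write plus a data change log (DCL).
dcl = set()                        # grains whose write to the target LUN failed

def host_write(grain, data, source, target, target_ok=True):
    """During migration, writes go to both LUNs; failures are recorded in the DCL."""
    source[grain] = data
    if target_ok:
        target[grain] = data
    else:
        dcl.add(grain)             # differential data to re-synchronize later

source, target = {}, {}
host_write(0, "new", source, target)
host_write(1, "new", source, target, target_ok=False)
print(dcl)                         # {1}: grain to copy before the pair is split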
3.1.6.4 SmartMigration Pair Splitting
Splitting is performed on a single pair. The splitting process includes stopping service data
synchronization between the source and target LUNs in a pair to exchange LUN information, and
removing the data migration relationship after the exchange is complete.
During splitting, host services are briefly suspended. After information is exchanged, services are
delivered to the target LUN. The service migration is transparent to users.
1. LUN information exchange
Before a target LUN can take over services from a source LUN, the two LUNs must synchronize
and then exchange information.
2. Pair splitting
Pair splitting: Data migration relationship between a source LUN and a target LUN is removed
after LUN information is exchanged.
The consistency splitting of SmartMigration means that multiple pairs exchange LUN
information at the same time and concurrently remove pair relationships after the information
exchange is complete, ensuring data consistency at any point in time before and after
pairs are split.
3.1.6.5 Configuration Process
The configuration processes of SmartMigration in a storage system include checking the license file,
creating a SmartMigration task, and splitting a SmartMigration task.
A target LUN stores a point-in-time data duplicate of the source LUN only after the pair is split. The
data duplicate can be used to recover the source LUN in the event that the source LUN is damaged.
In addition, the data duplicate is accessible in scenarios such as application testing and data analysis.
3.2.1 HyperSnap
3.2.1.1 Overview
HyperSnap is a consistent copy of the source data at a specific point in time. It is an available
copy of a specified data set and contains a static image of the source data at the point in time
when the copy was created.
3.2.1.2 Working Principles
Redirect on write (ROW) technology is the core technology for snapshot. In the overwrite scenario, a
new space is allocated to new data. After the write operation is successful, the original space is
released.
Data organization: LUNs created in a storage pool of OceanStor Dorado V6 consist of data and
metadata volumes.
⚫ A metadata volume records the data organization information (logical block address (LBA),
version, and clone ID) and data attributes. A metadata volume is organized in a tree structure.
Logical block address (LBA) indicates the address of a logical block. Version corresponds to the
point in time of a snapshot, and Clone ID is the number of data copies.
⚫ Data volume stores data written to a LUN.
Source volume stores the source data for which a snapshot is to be created. It is presented as a LUN
to users.
Snapshot volume is a logical data copy generated after a snapshot is created for a source volume. It
is presented as a snapshot LUN to users.
Inactive: A snapshot is in the inactive state. In this state, the snapshot is unavailable and can be used
only after being activated.
3.2.1.3 Lossless Performance
The process of activating a snapshot is to save a data state of a source object at the time when it is
activated. Specific operations include creating a mirror for the source object of the snapshot and
associating the mirror with an activated snapshot. The principle of creating a mirror for the source
object is as follows: Common LUNs and snapshots in the OceanStor Dorado V6 all-flash storage
system use the ROW-based read and write mode. In ROW-based read and write mode, each time the
host writes new data, the system reallocates space to store the new data and updates the LUN
mapping table to point to the new data space. As shown in the slide, L0 to L4 are logical addresses,
P0 to P8 are physical addresses, and A to I are data.
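The following minimal Python sketch illustrates the ROW idea described above. It is illustrative only; the class and method names (for example, SimpleRowLun) are hypothetical and do not reflect OceanStor internals. Each write goes to newly allocated space, only the mapping table is updated, and activating a snapshot simply freezes a copy of the mapping table.

# Minimal redirect-on-write (ROW) sketch. Illustrative only.
class SimpleRowLun:
    def __init__(self):
        self.next_free = 0          # next free physical block address
        self.mapping = {}           # LUN mapping table: LBA -> physical address
        self.physical = {}          # physical space: address -> data
        self.snapshots = []         # each snapshot is a frozen copy of the mapping table

    def _allocate(self, data):
        addr = self.next_free
        self.next_free += 1
        self.physical[addr] = data
        return addr

    def write(self, lba, data):
        """Every write (new data or overwrite) goes to newly allocated space."""
        old = self.mapping.get(lba)
        self.mapping[lba] = self._allocate(data)
        # Release the old space only if no snapshot still references it.
        if old is not None and not any(s.get(lba) == old for s in self.snapshots):
            del self.physical[old]

    def read(self, lba):
        return self.physical[self.mapping[lba]]

    def create_snapshot(self):
        """Activating a snapshot just freezes the current mapping table."""
        self.snapshots.append(dict(self.mapping))
        return len(self.snapshots) - 1

    def read_snapshot(self, snap_id, lba):
        return self.physical[self.snapshots[snap_id][lba]]

lun = SimpleRowLun()
lun.write(0, "A")
snap = lun.create_snapshot()   # point-in-time image of the source LUN
lun.write(0, "B")              # the overwrite is redirected to new space
assert lun.read(0) == "B" and lun.read_snapshot(snap, 0) == "A"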
3.2.1.4 Snapshot Rollback
When required, the rollback function can immediately restore the data in a storage system to the
state at the snapshot point in time, offsetting data damage or data loss caused by misoperations or
viruses after the snapshot point in time.
Rollback copies the snapshot data back to a target object (a source LUN or another snapshot). Once a rollback starts, the target object can immediately present the snapshot data. To make the target object available immediately after the rollback starts, the system performs the following operations:
⚫ Read redirection: reads that hit data not yet rolled back on the target object are redirected to the snapshot.
⚫ Rollback before write: before new data is written to the target object, the corresponding snapshot data is rolled back to that location first.
Stopping a rollback only stops the data copy process; it cannot restore the data of the target object to the state before the rollback. Therefore, you are advised to create a snapshot for data protection before starting a rollback.
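A hedged sketch of the two rollback mechanisms follows, assuming a simple block map; the names are hypothetical and real storage systems handle this at a much lower level. Reads of not-yet-rolled-back blocks are redirected to the snapshot, and a host write first rolls back the affected block so the background copy cannot later overwrite the new data.

# Illustrative snapshot rollback sketch (hypothetical names, not OceanStor code).
class RollbackTarget:
    def __init__(self, target_blocks, snapshot_blocks):
        self.target = dict(target_blocks)      # current target object data
        self.snapshot = dict(snapshot_blocks)  # point-in-time snapshot data
        self.rolled_back = set()               # LBAs already copied back

    def _rollback_block(self, lba):
        if lba not in self.rolled_back:
            if lba in self.snapshot:
                self.target[lba] = self.snapshot[lba]
            self.rolled_back.add(lba)

    def read(self, lba):
        # Read redirection: blocks not yet rolled back are served from the snapshot.
        if lba not in self.rolled_back and lba in self.snapshot:
            return self.snapshot[lba]
        return self.target.get(lba)

    def write(self, lba, data):
        # Rollback before write: roll the block back first, then apply the host data.
        self._rollback_block(lba)
        self.target[lba] = data

    def background_rollback(self):
        # The background copy process skips blocks that were already handled.
        for lba in list(self.snapshot):
            self._rollback_block(lba)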
3.2.2 HyperClone
3.2.2.1 Overview
HyperClone allows you to obtain full copies of LUNs without interrupting host services. These copies
can be used for data backup and restoration, data reproduction, and data analysis.
3.2.2.2 Working Principles
HyperClone provides a full copy of the source LUN's data at the synchronization start time. The
target LUN can be read and written immediately, without waiting for the copy process to complete.
The source and target LUNs are physically isolated. Operations on the member LUNs do not affect
each other. When data on the source LUN is damaged, data can be reversely synchronized from the
target LUN to the source LUN. A differential bitmap records the data written to the source and
target LUNs to support subsequent incremental synchronization.
3.2.2.3 Synchronization
When a HyperClone pair starts synchronization, the system generates an instant snapshot for the
source LUN, synchronizes the snapshot data to the target LUN, and records subsequent write
operations in a differential table.
When synchronization is performed again, the system compares the data of the source and target
LUNs, and only synchronizes the differential data to the target LUN. The data written to the target
LUN between the two synchronizations will be overwritten. Before synchronization, users can create
a snapshot for a target LUN to retain its data changes.
Relevant concepts:
1. Pair: In HyperClone, a pair has one source LUN and one target LUN. A pair is a mirror
relationship between the source and target LUNs. A source LUN can form multiple HyperClone
pairs with different target LUNs. A target LUN can be added to only one HyperClone pair.
2. Synchronization: Data is copied from a source LUN to a target LUN.
3. Reverse synchronization: If data on the source LUN needs to be restored, you can reversely
synchronize data from the target LUN to the source LUN.
4. Differential copy: The differential data can be synchronized from the source LUN to the target LUN based on the differential bitmap (see the sketch after this list).
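The sketch below (referenced in item 4 above) illustrates how a differential bitmap drives incremental synchronization. It is a simplified illustration with hypothetical names, not HyperClone code.

# Simplified differential-bitmap synchronization sketch (illustrative only).
class ClonePair:
    def __init__(self, source, target):
        self.source = source            # dict: LBA -> data on the source LUN
        self.target = target            # dict: LBA -> data on the target LUN
        self.diff_bitmap = set()        # LBAs written since the last synchronization

    def host_write(self, lba, data):
        self.source[lba] = data
        self.diff_bitmap.add(lba)       # record the change for later differential copy

    def synchronize(self):
        # Only differential data is copied; unchanged blocks are skipped.
        for lba in self.diff_bitmap:
            self.target[lba] = self.source[lba]
        self.diff_bitmap.clear()

    def reverse_synchronize(self):
        # Restore the source from the target (full copy in this simplified sketch).
        self.source.update(self.target)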
3.2.2.4 Reverse Synchronization
If a source LUN is damaged, data on a target LUN can be restored to the source LUN. All or
differential data can be copied to the source LUN.
When reverse synchronization starts, the system generates a snapshot for a target LUN and
synchronizes all of the snapshot data to a source LUN. For incremental reverse synchronization, the
system compares the data of the source and target LUNs, and only copies the differential data to the
source LUN.
The source and target LUNs can be read and written immediately after reverse synchronization
starts.
3.2.2.5 Restrictions on Feature Configuration
HyperClone is subject to restrictions when used together with other features of Huawei OceanStor Dorado storage systems.
3.2.2.6 Application Scenarios
HyperClone is widely used. It can serve as a backup source, as a data mining source, and as a
checkpoint for application status. After the synchronization is complete, the source and target LUNs
of HyperClone are physically isolated. Operations on the source and target LUNs do not affect each
other.
Application Scenario 1: Data Backup and Restoration
HyperClone generates one or multiple copies of source data to achieve point-in-time backup, which
can be used to restore the source LUN data in the event of data corruption. Using a target LUN allows data to be backed up online and restored more quickly.
Application Scenario 2: Data Analysis and Reproduction
Data analysis examines large amounts of data to extract useful information, draw conclusions, and support decision-making. The analysis services use data on target LUNs to prevent
the analysis and production services from competing for source LUN resources, ensuring system
performance.
Data reproduction (HyperClone) can create multiple copies of the same source LUN for multiple
target LUNs.
3.2.2.7 Configuration Process
Configure HyperClone on DeviceManager.
Create a clone to copy data to a target LUN.
Create a protection group to protect data. Before this operation, you need to create the objects to be protected (LUNs or LUN groups). A protection group is a collection of one LUN group or multiple LUNs.
Create a clone consistency group to facilitate unified operations on clones, improve efficiency, and ensure that the data of all clones in the group is consistent at the same point in time.
3.2.3 HyperReplication
3.2.3.1 Overview
As digitization advances in various industries, data has become critical to the efficient operation of
enterprises, and users impose increasingly demanding requirements on the stability of storage
systems. Although many enterprises have highly stable storage systems, it is a big challenge for them
to ensure data restoration from damage caused by natural disasters. To ensure continuity,
recoverability, and high availability of service data, remote DR solutions emerge. The
HyperReplication technology is one of the key technologies used in remote DR solutions.
HyperReplication is Huawei's remote replication feature. It provides a flexible and powerful data replication function that facilitates remote data backup and restoration, continuous protection of service data, and disaster recovery.
A primary site is a production center that includes primary storage systems, application servers, and
links.
A secondary site is a backup center that includes secondary storage systems, application servers, and
links.
HyperReplication supports the following two modes:
⚫ Synchronous remote replication between LUNs: Data is synchronized between primary and
secondary LUNs in real time. No data is lost when a disaster occurs. However, the performance
of production services is affected by the latency of the data transmission between primary and
secondary LUNs.
⚫ Asynchronous remote replication between LUNs: Data is periodically synchronized between
primary and secondary LUNs. The performance of production services is not affected by the
latency of the data transmission between primary and secondary LUNs. If a disaster occurs,
some data may be lost.
3.2.3.2 Introduction to DR and Backup
When two data centers (DCs) use HyperReplication, they work in active/standby mode. The
production center is in the service running status, and the DR center is in the non-service running
status.
For active/standby DR, if a device in DC A is faulty or even if the entire DC A is faulty, services are
automatically switched to DC B.
For backup, DC B backs up only data in DC A and does not carry services when DC A is faulty.
3.2.3.3 Relevant Concepts
⚫ Pair: A pair refers to the data replication relationship between a primary LUN and a secondary
LUN. The primary LUN and secondary LUN of a pair must belong to different storage systems.
⚫ Data status: HyperReplication identifies the data status of the current pair based on the data
difference between primary and secondary LUNs. When a disaster occurs, HyperReplication
determines whether to allow a primary/secondary switchover for a pair based on the data
status of the pair. There are two types of pair data status: complete and incomplete.
⚫ Writable secondary LUN: Data delivered by a host can be written to a secondary LUN. A
secondary LUN can be set to writable in two scenarios:
➢ A primary LUN fails and the HyperReplication links are disconnected. In this case, a
secondary LUN can be set to writable in the secondary storage system.
➢ A primary LUN fails but the HyperReplication links are in normal state. The pair must be
split before you enable the secondary LUN to be writable in the primary or secondary
storage system.
⚫ Consistency group: A collection of pairs whose services are associated. For example, the primary
storage system has three primary LUNs, which respectively store service data, log, and change
tracking information of a database. If data on any of the three LUNs is invalid, all data on the
three LUNs becomes invalid. The pairs to which the three LUNs belong form a consistency
group.
⚫ Synchronization: Data is replicated from a primary LUN to a secondary LUN. HyperReplication
involves initial synchronization and incremental synchronization.
⚫ Split: Data synchronization between primary and secondary LUNs is suspended. After splitting,
there is still the pair relationship between the primary LUN and the secondary LUN. Hosts'
access permission for both the LUNs remains unchanged. At some time, users may not want to
copy data from the primary LUN to the secondary LUN. For example, if the bandwidth is
insufficient to support critical services, you need to suspend the data synchronization of
HyperReplication over the links. In this case, you can perform the split operation to suspend
data synchronization.
⚫ Primary/secondary switchover: A process during which primary and secondary LUNs in a pair
are switched over. This process changes the primary/secondary relationship of LUNs in a
HyperReplication pair.
3.2.3.4 Phases
HyperReplication involves the following phases: creating a HyperReplication relationship,
synchronizing data, switching over services, and restoring data.
1. Create a HyperReplication pair.
2. Synchronize all data manually or automatically from the primary LUN to the secondary LUN of
the HyperReplication pair. In addition, periodically synchronize incremental data on the primary
LUN to the secondary LUN.
3. Check the data status of the HyperReplication pair and the read/write properties of the
secondary LUN to determine whether a primary/secondary switchover can be performed. Then
perform a primary/secondary switchover to form a new HyperReplication pair.
4. Synchronize data from the secondary storage system to the primary storage system. Then
perform a primary/secondary switchover to restore to the original pair relationship.
1. Running status of a pair
You can perform synchronization, splitting, and primary/secondary switchover operations on a
HyperReplication pair based on its running status. After performing an operation, you can view
the running status of the pair to check whether the operation is successful.
2. Working principles of asynchronous remote replication
Asynchronous remote replication periodically replicates data from the primary storage system
to the secondary storage system.
Asynchronous remote replication of Huawei OceanStor storage systems adopts the innovative
multi-time-point caching technology.
3. HyperReplication service switchover
If a primary site suffers a disaster, a secondary site can quickly take over its services to protect
service continuity. The RPO and RTO indicators must be considered during service switchover.
Requirements for running services on the secondary storage system:
➢ Before a disaster occurs, data in a primary LUN is consistent with that in a secondary LUN.
If data in the secondary LUN is incomplete, services may fail to be switched.
➢ Services on the production host have also been configured on the standby host.
➢ The secondary storage system allows a host to access a LUN in a LUN group mapped to the
host.
If a disaster occurs and the primary site is invalid, the HyperReplication links between primary
and secondary LUNs are down. In this case, the administrator needs to manually set the
read/write permissions of the secondary LUN to writable mode to implement service
switchover.
4. Data restoration
After a primary site fails, a secondary site temporarily takes over services of the primary site.
When the primary site recovers, services are switched back.
After the primary site recovers from a disaster, it is required to rebuild a HyperReplication
relationship between primary and secondary storage systems and use data on the secondary
site to restore data on the primary site.
5. Function of a consistency group
Users can perform synchronization, splitting, and primary/secondary switchover operations on a
single HyperReplication pair or manage multiple HyperReplication pairs by using a consistency
group. If associated LUNs have been added to a HyperReplication consistency group, the
consistency group can effectively prevent data loss.
3.2.3.5 Application Scenarios
HyperReplication is used for data DR and backup. The typical application scenarios include central
DR and backup, geo-redundancy, and realizing DR with BCManager eReplication.
Different HyperReplication modes apply to different application scenarios.
Asynchronous remote replication applies to backup and disaster recovery scenarios where the
network bandwidth is limited or a primary site is far from a secondary site (for example, across
countries or regions).
3.2.3.6 Configuration Process
You can set up a pair relationship between primary and secondary resources to synchronize data.
Unless otherwise specified, the operations in the slide can be performed on either the primary or
the secondary storage device. If you want to perform the operations on the secondary storage
device, perform a primary/secondary switchover first.
3.2.4 HyperMetro
3.2.4.1 Overview
HyperMetro is Huawei's active-active storage solution. Two DCs enabled with HyperMetro back up
each other and both carry services. If a device in a DC is faulty or the entire DC is faulty,
the other DC will automatically take over services, solving the switchover problems of traditional DR
centers. This ensures high data reliability and service continuity, and improves the resource
utilization of the storage system.
3.2.4.2 Working Principles
1. Local DC deployment: In most cases, hosts are deployed in different equipment rooms in the
same industrial park. Hosts are deployed in cluster mode. Hosts and storage devices
communicate with each other through switches. Fibre Channel switches and IP switches are
supported. In addition, dual-write mirroring channels are deployed between the storage
systems to ensure the HyperMetro services are running correctly.
2. Cross-DC deployment: Generally, hosts are deployed in two DCs in the same city or adjacent
cities. The physical distance between the two centers is within 300 km. Both are running and
can carry the same services at the same time, improving the overall service capability and
system resource utilization of the DCs. If one DC is faulty, services are automatically switched to
the other one.
In cross-DC deployments involving long-distance transmission (a minimum of 80 km for IP
networking and 25 km for Fibre Channel networking), dense wavelength division multiplexing
(DWDM) devices must be used to ensure a low transmission latency. In addition, HyperMetro
mirroring channels are deployed between the storage systems to ensure the HyperMetro services
are running correctly.
The HyperMetro solution has the following characteristics:
⚫ The data dual-write technology ensures storage redundancy. No data is lost if there is only one
storage system running or the production center fails. Services are switched over quickly,
maximizing customer service continuity. This solution meets the service requirements of RTO =
0 and RPO = 0.
⚫ HyperMetro and SmartVirtualization can be used together to support heterogeneous storage
and consolidate resources on the network layer to protect the existing investment of the
customer.
⚫ This solution can be smoothly upgraded to the 3DC solution with HyperReplication.
Based on the preceding features, the HyperMetro solution can be widely used in industries such as
healthcare, finance, and social security.
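A minimal sketch of the dual-write idea follows, assuming two generic storage systems; the classes are hypothetical, and real HyperMetro also involves locking, arbitration, and failure handling. The point is that a host write is acknowledged only after both copies are written, which is what yields RPO = 0.

# Illustrative dual-write sketch for an active-active pair (hypothetical classes).
class StorageSystem:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, lba, value):
        self.data[lba] = value
        return True

class ActiveActivePair:
    def __init__(self, array_a, array_b):
        self.array_a = array_a
        self.array_b = array_b

    def host_write(self, lba, value):
        # Dual write over the mirroring channel: both copies must succeed
        # before the host receives a write-complete response (RPO = 0).
        ok_a = self.array_a.write(lba, value)
        ok_b = self.array_b.write(lba, value)
        return ok_a and ok_b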
Generally, backup refers to data backup or system backup, while DR refers to data backup or
application backup across equipment rooms. Backup is implemented using backup software,
whereas DR is implemented using replication or mirroring software. The differences between the
two are as follows:
1. DR is designed to protect data against natural disasters, such as fires and earthquakes. Therefore, a backup center must be set up at a certain distance from the production center. In contrast, data backup is performed within a data center.
2. A DR system not only protects data but also guarantees business continuity. In contrast, data
backup only focuses on data security.
3. DR protects data integrity. In contrast, backup can only help recover data from a point in time
when a backup task is performed.
4. DR is performed in online mode while backup is performed in offline mode.
5. Data at the two sites of a DR system is kept consistent in real time, whereas backup data is only as current as the most recent backup.
6. When a fault occurs, a DR switchover process in a DR system lasts seconds to minutes, while a
backup system takes hours and maybe even dozens of hours to recover data.
Backup and archiving systems are designed to protect data in different ways and the combination of
the two systems will provide more effective data protection. Backup is designed to protect data by
storing data copies. Archiving is designed to protect data by organizing and storing data for a long
term in a data management manner. In other words, backup can be considered as short-term
retention of data copies, while archiving can be considered as long-term retention of files. In
practice, the original data is usually not deleted after it is backed up, but it can be deleted after it is archived because swift access to it is no longer needed. Backup and
archiving work together to better protect data.
4.1.2 Architecture
4.1.2.1 Components
A backup system typically consists of three components: backup software, backup media, and
backup server.
The backup software is the core of a backup system. It is used for creating backup policies, managing media, and creating and managing copies of production data on storage media. Some backup software can be upgraded with more functions, such as protection, backup, archiving, and recovery.
Backup media include tape libraries, disk arrays, and virtual tape libraries. A virtual tape library is essentially a disk array that presents its disk storage as a tape library. Virtual tape libraries remain compatible with tape backup management software and conventional backup processes while offering far better availability and reliability than mechanical tapes.
The backup server provides services for executing backup policies. The backup software resides and
runs on the backup server. Generally, a backup software client agent needs to be installed on the
service host to be backed up.
Three elements of a backup system are Backup Window (BW), Recovery Point Objective (RPO), and
Recovery Time Objective (RTO).
BW indicates a duration of time allowed for backing up the service data in a service system without
affecting the normal operation of the service system.
RPO measures how recent the backup data used for DR switchover is. A smaller RPO means less data is lost.
RTO refers to the acceptable duration of time, and the service level, within which a business process must be restored in order to minimize the impact of interruption on services.
4.1.2.2 Backup Solution Panorama
Huawei backup solutions include all-in-one, centralized, and cloud backup solutions.
Huawei Data Protection Appliance is a data protection and management product that integrates the
backup software, backup server, and backup storage. With the distributed architecture, Huawei Data
Protection Appliance supports the linear increase in both performance and capacity. Only one
system needs to be deployed to protect, construct, and manage user data and applications. This
helps users better protect data, save data protection investment, and simplify the data management
process. While excelling in a wide range of scenarios, Huawei Data Protection Appliance is suited for
industries such as government, finance, carrier, healthcare, and manufacturing.
The Data Protection Appliance provides a graphical management system, which facilitates users to
manage and maintain the software and hardware of a backup system in a centralized manner.
Centralized backup uses a backup management node to manage local and remote data centers and schedule backup tasks in a centralized manner, remarkably simplifying the operation of the backup system and giving users an overall view from which to manage and control it.
Centralized backup has the following advantages:
1. Enables centralized management of backup and recovery tasks of data from a variety of
applications, to form a unified management policy.
2. Integrates backup resources, to optimize utilization of backup resources.
3. Provides flexible scalability by allowing tapes, clients, and tape libraries to be added as needed.
4. Simplifies management by allowing fewer management engineers to manage more devices and
systems.
Cloud backup involves backing up data from a local production center to a data center (a central
data center of an enterprise or a data center provided by a service provider) using a standard
network protocol over a WAN. Cloud backup is based on services, accessible anywhere, flexible,
secure, and can be shared and used on demand. Cloud backup emerges as a brand new backup
service based on broadband Internet and large storage capacities. In conclusion, cloud backup
provides data storage and backup services by leveraging a variety of functions, such as cluster
applications, grid technologies, and distributed file systems, and integrating a variety of storage
devices across the network through application software.
LAN-Based
Strengths:
⚫ The backup system and the application system are independent of each other, conserving
hardware resources of application servers during backup.
Weaknesses:
⚫ Additional backup servers increase hardware costs.
⚫ Backup agents adversely affect the performance of application servers.
⚫ Backup data is transmitted over a LAN, which adversely affects network performance.
⚫ Backup services must be separately maintained, complicating management and maintenance
operations.
⚫ Users must be highly proficient at processing backup services.
LAN-Free
Control flows are transmitted over a LAN, but data flows are not. LAN-Free backup transmits data
over a SAN instead of a LAN. The server that needs to be backed up is connected to backup media
over a SAN. When triggered by a LAN-Free backup client, the media server reads the data that needs
to be backed up and backs up the same data to the shared backup media.
Direction of backup data flows: The backup server sends a control flow over a LAN to an application
server where an agent is installed. The application server responds to the request and reads the
production data. Then, the media server reads the data from the application server and transmits
the data to the backup media. The backup operation is complete.
Strengths:
⚫ Backup data is transmitted without using LAN resources, significantly improving backup
performance while maintaining high network performance.
Weaknesses:
⚫ Backup agents adversely affect the performance of application servers.
⚫ LAN-Free backup requires a high budget.
⚫ Devices must meet certain requirements.
Server-Free
Server-Free backup has many strengths similar to those of LAN-Free backup. The source device, target device, and SAN device are the main components of the backup data channel. The server is still involved in the backup process but handles far less work: like a traffic officer, it only issues commands and no longer acts as the main backup data channel, so it does not carry the data itself. Control flows are transmitted over a LAN, but data flows are not.
Direction of backup data flows: Backup data is transmitted over an independent network without
passing through a production server.
Strengths:
⚫ Backup data flows do not consume LAN resources and do not affect network performance.
⚫ Services running on hosts remain nearly unaffected.
⚫ Backup performance is excellent.
Weaknesses:
⚫ Server-Free backup requires a high budget.
⚫ Devices must meet strict requirements.
Server-Less
Server-Less backup uses the Network Data Management Protocol (NDMP). NDMP is a standard
network backup protocol. It supports communications between intelligent data storage devices,
tape libraries, and backup applications. After a server sends an NDMP command to a storage device
that supports the NDMP protocol, the storage device can directly send the data to other devices
without passing through a host.
4.1.4.2 Deduplication
Digital transformations of enterprises have intensified the explosive growth of service data. The total
amount of backup data that needs to be protected is also increasing sharply. In addition, more and
more duplicate data is being generated from backup and archiving operations. Mass redundant data
consumes a lot of storage and bandwidth resources and leads to issues like long backup windows,
which further affect the availability of service systems.
Huawei Data Protection Appliance supports source-side and parallel deduplication. Deduplication is
performed before backup data is transmitted to storage media, greatly improving backup
performance.
Source-Side Deduplication
Data or files are sliced into blocks using an intelligent content-based deduplication algorithm. Fingerprints are then created for the data blocks by hashing and looked up in the fingerprint library. If an identical fingerprint exists, the same block is already stored on the media server; the existing block is referenced instead of being transferred again, which preserves backup capacity and bandwidth resources and streamlines data transfer and storage.
Technical principles:
1. Creates a fingerprint for a data block by hashing.
2. Queries whether the fingerprint exists in the fingerprint library of the Data Protection
Appliance. If yes, it indicates that this data block is duplicate and does not need to be sent to
the Data Protection Appliance. If no, the data block will be sent to the Data Protection
Appliance and written to the backup storage pool. Then, the fingerprint of this data block is
recorded in the deduplication fingerprint library.
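A simplified Python sketch of source-side deduplication under these principles follows; fixed-size chunking and SHA-256 are assumptions made to keep the example short, not the appliance's actual algorithm.

# Simplified source-side deduplication sketch (illustrative only).
import hashlib

CHUNK_SIZE = 4096
fingerprint_library = {}            # fingerprint -> stored block (on the media server)

def backup(data: bytes):
    unique_blocks_sent = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        block = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()   # step 1: hash the block
        if fingerprint not in fingerprint_library:        # step 2: query the library
            fingerprint_library[fingerprint] = block      # send and record the new block
            unique_blocks_sent += 1
        # Duplicate blocks are referenced by fingerprint and not transmitted.
    return unique_blocks_sent

sent = backup(b"A" * 8192 + b"B" * 4096)   # two identical "A" blocks, one "B" block
print(sent)                                # 2: only unique blocks are transferred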
Parallel Deduplication
Most conventional deduplication modes are based on a single node and are prone to inefficient data
access, poor processing performance, and insufficient storage capacity in the era of big data.
Huawei Data Protection Appliance uses the parallel deduplication technology by building a
deduplication fingerprint library on multiple nodes and distributing fingerprints on multiple nodes in
parallel. This effectively resolves the performance and storage capacity problems in single-node
solutions.
Technical principles:
After fingerprints are calculated for data blocks, the system uses a grouping algorithm to map each fingerprint to a specific server node, so that fingerprints are evenly distributed across the nodes. The system can then query the fingerprints on different server nodes at the same time, achieving parallel deduplication.
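A small sketch of the grouping idea, assuming a modulo-based placement function; real systems use more elaborate distribution and membership algorithms.

# Sketch of distributing fingerprint lookups across nodes for parallel deduplication.
NODE_COUNT = 4
node_fingerprint_libraries = [set() for _ in range(NODE_COUNT)]

def owner_node(fingerprint: str) -> int:
    # Grouping algorithm: map a hex fingerprint to the node that owns it so that
    # fingerprints are spread evenly and can be queried in parallel.
    return int(fingerprint, 16) % NODE_COUNT

def is_duplicate(fingerprint: str) -> bool:
    node = owner_node(fingerprint)
    if fingerprint in node_fingerprint_libraries[node]:
        return True
    node_fingerprint_libraries[node].add(fingerprint)
    return False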
With fingerprint libraries, deduplicated data can be stored sequentially in the same space. This reduces the time needed to query all fingerprints during each global deduplication, makes better use of the storage read cache, minimizes disk seeks caused by random reads, and improves recovery efficiency.
4.1.4.3 Backup Modes
Snapshot Backup
Snapshot backup supports backup using the snapshot function of a storage system and agent-based
backup.
Fast and recoverable:
⚫ Enables you to browse backup information and quickly recover the selected objects.
⚫ Consolidates incremental copies into a full copy in the background to quickly recover data.
⚫ Recovers data from hardware snapshots and performs fine-grained recovery using snapshots.
Recovers copies:
Storage array protection:
⚫ Native format
⚫ Automatic storage detection
⚫ Full integration (no script)
⚫ Snapshot support
Storage array recovery:
⚫ Enables you to recover, clone, or mount volumes.
⚫ Enables you to copy data back.
Standard Backup
Standard backup is a scheduled data protection mechanism. A backup task is automatically initiated at the specified time according to the backup policy and plan; it reads the data to be protected and writes it to the backup media.
Working principles:
The standard backup process consists of three steps:
Figure 4-1
1. Reads data to be protected through the backup client (agent client). Based on different
applications, the agent client can be deployed on the production server (agent-based backup)
or can be the agent client built in Huawei Data Protection Appliance (agent-free backup).
2. Transmits the data from the production system to the Data Protection Appliance over the network (TCP).
3. The Data Protection Appliance receives data and saves it to the backup storage.
For different backup modes, such as full backup, incremental backup, permanent incremental backup, and differential backup, data is read and transmitted in different ways: either all data is transmitted, or only unique data is transmitted when deduplication is enabled.
When remote DR is required, remote replication allows replication of backup data to remote data
centers.
Continuous Backup
Continuous backup is a process of continuously backing up data on production hosts to backup
media. Continuous backup is based on the block-level continuous data protection technology. A
backup agent client is installed on production hosts. Data on production hosts is continuously
backed up to the snapshot storage pool of the internal storage system of the Data Protection
Appliance and is stored in the native format. After certain conditions are met, snapshots are created
in the snapshot storage pool to manage data at multiple points in time.
Figure 4-2
1. The snapshot storage pool allocates the base volume.
2. The agent client for continuous backup connects to the server of the Data Protection Appliance.
3. The bypass monitoring drive in a partition of the production host continuously captures data
changes and caches them in the memory pool.
4. The agent client for continuous backup continuously transfers data to a storage device in the
snapshot storage pool of the Data Protection Appliance.
5. Source data in the partition of the production host is written to the base volume.
6. Data changes on the production host are written to the log volume first, and then are written to
the base volume storing the source data.
7. Snapshots of the base volume are managed based on the data retention policy for continuous
backup.
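A hedged sketch of this flow with hypothetical structures: captured changes go to a log volume first, are later applied to the base volume, and snapshots of the base volume provide the recoverable points in time.

# Illustrative continuous-backup (CDP-style) data flow sketch.
class ContinuousBackupStore:
    def __init__(self, initial_data):
        self.base_volume = dict(initial_data)   # copy of the production source data
        self.log_volume = []                     # captured change records (lba, data)
        self.snapshots = []                      # point-in-time copies of the base volume

    def capture_change(self, lba, data):
        # Changes are written to the log volume first (step 6, first half).
        self.log_volume.append((lba, data))

    def apply_log(self):
        # Logged changes are then written to the base volume (step 6, second half).
        for lba, data in self.log_volume:
            self.base_volume[lba] = data
        self.log_volume.clear()

    def take_snapshot(self):
        # Snapshots of the base volume retain multiple points in time (step 7).
        self.snapshots.append(dict(self.base_volume))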
Advanced Backup
The advanced backup function of Huawei Data Protection Appliance effectively combines years of
experience in backup and DR and the independently developed copy data storage system to ensure
application data consistency. The advanced backup function helps implement policy-based
automation and DR, provide automation tools for developers, support heterogeneous production
storage, and implement real copy data management.
Working principles:
Capture of production data: Data is captured in the native format. Format conversion is not
required. Data is accessible upon being mounted. SLA policy can be customized based on
applications. Retention duration, RPO, RTO, and data storage locations are intuitively displayed.
Copy Management
Permanent incremental backup: Initial full backup and N incremental backups are performed. A full
copy is generated at each incremental backup point in time. Damage to a copy at one incremental backup point in time does not impede recovery from any other point in time.
No rollback: Point-in-time copies created through virtual clone can reference both the source data and the current incremental data, and can be used directly for recovery.
Copy Access and Use
No data movement: Data is mounted in minutes, and data volume does not affect recovery
efficiency.
4.1.5 Applications
Databases
Databases are critical service applications in production systems. The native backup function of
databases relies on complicated manual operations. In addition, various databases on different
platforms need protection, which requires a broad compatibility of backup products.
The Data Protection Appliance provides a graphical wizard. Users do not need to manually execute
backup and restoration scripts, which simplifies backup and recovery operations. Database backup
process is as follows:
Install a backup client agent on the production server to be protected and connect the client agent
to the management console. The backup client agent identifies the database data on the production
server, reads the files and data from the production server through the backup API, and transfers the
same files and data to the storage media of the Data Protection Appliance to complete the backup.
The management console of the Data Protection Appliance sends control information to the client
and the Data Protection Appliance server and accordingly, manages the execution of a backup task.
The backup process: The backup client agent invokes the database's backup API to read data from the database, performs deduplication or encryption, and then sends the data to the Data Protection Appliance to complete the backup.
The recovery process: The management console sends a recovery command to the backup client agent on the production server. The backup client agent reads data from the backup server and sends it to the database's recovery API to complete the recovery.
The Data Protection Appliance connects to a database through a dedicated backup API, which varies with the database. For example, Oracle uses the RMAN interface and SQL Server uses the VDI interface.
Virtualization Platforms
The popularization of virtualization has increased the confidence of enterprises in storing their core
data in a virtual environment. Therefore, enterprises are in urgent need of data protection in a
virtual environment, in particular, data backup and recovery efficiency is a major concern.
The Data Protection Appliance provides a comprehensive and pertinent virtualization platform
protection solution which provides the following benefits:
⚫ Mass virtual data protection to improve backup efficiency.
⚫ Unified protection for both physical and virtual environments to simplify O&M.
⚫ Flexible recovery methods to avoid service interruptions.
⚫ Agent-free backup to minimize usage of host resources and maximize production performance.
The backup process is as follows:
1. Create a VM snapshot.
2. Back up VM data, including the VM configuration information and the data on virtual disks. During backup, the CBT technology can be used to obtain valid data blocks or incremental data blocks on VM disks. In a full backup operation, valid data blocks on virtual disks are obtained. In an incremental backup operation, changed data blocks on the VM are obtained.
3. Delete the VM snapshot created in step 1.
The recovery process is as follows:
1. For recovery to a new VM, create a VM based on the configuration of the original VM. For recovery by overwriting a VM, manually configure the VM to be overwritten.
2. The system reads the disk block data at the corresponding point in time from the media server and writes the data to the disks of the VM prepared in step 1.
When FusionCompute VMs use FusionStorage, the backup and recovery processes differ from those of FusionCompute VMs using virtualized storage. In FusionStorage scenarios, VM disks correspond to LUN volumes on FusionStorage, and the differential bitmap volume provided by FusionStorage is used to obtain changed data blocks on virtual disks.
File Systems
The file system backup module of the Data Protection Appliance can back up unstructured file
systems. It has the following features:
⚫ Backup types: full backup, incremental backup, and permanent incremental backup.
⚫ Backup and recovery granularity: single file, folder, entire disk
⚫ Block-level deduplication: Reduces the amount of backup data to be transmitted, shortening the
backup window and conserving network resources and storage space consumed by backup data
transmission.
⚫ Recovery location: Allows a file system to be recovered to the same location on the original
host, a specified location on the original host, or a different host.
⚫ Incremental backup: Incremental backup is performed on a per-file basis. The system compares each file's last modification time with the time of the last backup to determine whether the file needs to be backed up, and backs up only the files that have changed (see the sketch after this list).
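A minimal sketch of the per-file selection logic, assuming a comparison of file modification time against the last backup time; the product's actual criteria may differ.

# Per-file incremental backup selection (illustrative only).
import os, time

def files_to_back_up(paths, last_backup_time):
    changed = []
    for path in paths:
        if os.path.getmtime(path) > last_backup_time:   # modified since the last backup
            changed.append(path)
    return changed

# Example: select files changed in the last 24 hours for the incremental job.
# selected = files_to_back_up(["/data/a.txt", "/data/b.txt"], time.time() - 86400)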
For file system backup, four filtering modes are provided to filter backup data sources, helping users
quickly select files to be backed up.
Backup process:
1. The client deployed on the service production system reads file data to be backed up.
2. The client transmits the data over the network.
3. The Data Protection Appliance receives and stores the data on physical media. The backup
operation is complete.
4. Host status check (Windows and Linux)
Windows:
After installing the client, press Windows+R. In the Run dialog box that is displayed, enter
services.msc to open the service management window. Then, check whether the client service is
started normally. If the client service is started normally, check whether the client is connected to
the server. If connected, you can create a standard backup plan for a file system.
Linux:
The requirements for backing up a Linux file system are similar to those for backing up a Windows
file system. Check the host service or process. If the client is connected to the server, you can create
a standard backup plan for a file system. The check commands used in CentOS 7 are as follows:
systemctl status HWClientService.service
ps -ef|grep esf
Operating Systems
The backup process is as follows:
1. Install a client to obtain the data about an operating system.
2. For Windows, invoke the VSS interface to create a snapshot for the volume where the operating
system resides. For Linux, select the data source to be backed up.
3. Read the data of the volume where the operating system resides and back up the same data to
the storage media in the Data Protection Appliance.
4. The backup operation is complete (for Windows, delete the snapshot).
Recovery process:
1. Load the WinPE or LiveCD to boot the recovery environment. For Linux, install a client.
2. For Windows, manually partition the disk.
3. Recover the operating system data from the storage media to the specified system volume.
4. For Windows, use the system API to load the driver, modify the registry, and rectify blue screen of death (BSOD) errors. For
Linux, modify the configuration file.
5. Reboot the operating system upon completion of the recovery operation.
HA requires redundant servers to form a cluster to run applications and services. HA can be
categorized into the following types:
Active/Passive HA:
A cluster consists of only two nodes (an active node and a standby node). In this configuration, the system uses the active and standby machines to provide services, but services run only on the active device.
When the active device is faulty, the services on the standby device are started to replace the
services provided by the active device.
Typically, the CRM software such as Pacemaker can be used to control the switchover between the
active and standby devices and provide a virtual IP address to provide services.
Active/Active HA:
If a cluster consists of only two active nodes, it is called active-active. If the cluster has multiple
nodes, it is called multi-active.
In this configuration, the system runs the same load on all servers in the cluster.
Take the database as an example. The update of an instance will be synchronized to all instances.
In this configuration, load balancing software, such as HAProxy, is used to provide virtual IP
addresses for services.
Pacemaker is a cluster resource manager. It uses the messaging and membership capabilities provided by the preferred cluster infrastructure (OpenAIS or Heartbeat) to detect node and resource faults, achieving high availability of cluster services (also called resources).
HAProxy is free and open-source software written in C. It provides high availability, load balancing, and proxying for TCP- and HTTP-based applications. HAProxy is especially suitable for heavily loaded web sites that require session persistence.
A disaster is an unexpected event (caused by human errors or natural factors) that results in severe
faults or breakdown of the system in one data center. In this case, services may be interrupted or
become unacceptable. If the system unavailability reaches a certain level at a specific time, the
system must be switched to the standby site.
Disaster recovery (DR) refers to the capability of recovering data, applications, and services in data
centers at different locations when the production center is damaged by a disaster.
In addition to the production site, a redundancy site is set up. When a disaster occurs and the
production site is damaged, the redundancy site can take over services from the production site to
ensure service continuity. To achieve higher availability, many users even set up multiple redundant
sites.
Main indicators for measuring a DR system
Recovery Point Objective (RPO) indicates the maximum amount of data that can be lost when a
disaster occurs.
Recovery Time Objective (RTO) indicates the time required for system recovery.
The smaller the RPO and RTO, the higher the system availability, and the larger the investment for
users.
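A simple worked example with hypothetical figures: if replication runs every 15 minutes and recovery at the DR site takes 30 minutes, the worst-case RPO is 15 minutes and the RTO is 30 minutes.

# Illustration of RPO and RTO with hypothetical figures.
replication_interval_min = 15
recovery_duration_min = 30

worst_case_rpo_min = replication_interval_min   # data written since the last cycle may be lost
rto_min = recovery_duration_min                 # time until the business process is restored
print(worst_case_rpo_min, rto_min)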
4.2.1.2 DR System Level
Disaster recovery is an important technical application for enterprises and plays an important role in
enterprise data security. When it comes to disaster recovery, many CIOs put remote application-
level disaster recovery in the first place. They also emphasize the construction of a remote disaster
recovery system with zero data loss and automatic application switchover at the highest level. This is
actually a misconception. There is no doubt about the importance of disaster recovery. However,
disaster recovery does not necessarily mean that application-level disaster recovery must be built.
The most important thing is to select a proper disaster recovery system based on actual
requirements.
Generally speaking, disaster backup is classified into three levels: data level, application level, and
service level. The data level and application level are within the scope of the IT system. The service
level takes the service factors outside the IT system into consideration, including the standby office
location and office personnel.
The data-level disaster recovery focuses on protecting the data from loss or damage after a disaster
occurs. Low-level data-level disaster recovery can be implemented by manually saving backup data
to a remote place. For example, periodically transporting backup tapes to a remote place is one of
the methods. The advanced data disaster recovery solution uses the network-based data replication
tool to implement asynchronous or synchronous data transmission between the production center
and the disaster recovery center. For example, the data replication function based on disk arrays is
used.
Application-level DR creates hosts and applications in the DR site based on the data-level DR. The
support system consists of the data backup system, standby data processing system, and standby
network system. Application-level DR provides the application takeover capability. That is, when the
production center is faulty, applications can be taken over by the DR center to minimize the system
downtime and improve service continuity.
SHARE, an IT information organization initiated by IBM in 1955, released the disaster recovery
standard SHARE 78 at the 78th conference in 1992. SHARE 78 has been widely recognized in the
world.
SHARE 78 divides disaster recovery into eight levels:
Backup or recovery scope
Status of a disaster recovery plan
Distance between the application location and the backup location
Connection between the application location and backup location
Transmission between the two locations
Data allowed to be lost
Backup data update
Ability of a backup location to start a backup job
The definition of remote disaster recovery is classified into seven levels:
Backup and recovery of local data
Access mode of batch storage and read
Access mode of batch storage and read + hot backup location
Network connection
Backup location of the working status
Dual online storage
Zero data loss
In addition, ISO 27001, released by the International Organization for Standardization (ISO), requires that related data and files be stored for at least one to five years.
4.2.1.3 Panorama of Huawei Business Continuity and Disaster Recovery Solution
Huawei Business Continuity and Disaster Recovery (BC&DR) Solution is designed to provide business
continuity assurance and data protection for enterprise customers. Huawei provides four major DR
solutions covering the local production center, intra-city DR center, and remote DR center. In
addition, Huawei provides professional DR consulting services for customers' service systems to
ensure service continuity and data protection.
Local HA solution: ensures high availability of key services in the data center and prevents service
interruption and data loss caused by single-component faults.
Active-passive DR solution: intra-city and remote DR are supported. When a disaster occurs, services
in the DR center can be quickly recovered and provide services for external systems.
Active-Active data center solution: In intra-city DR, load of a critical service is balanced between two
data centers, ensuring zero service interruption and data loss when a data center malfunctions.
Geo-redundant DR solution: defends against data center-level disasters and regional disasters and
provides higher service continuity for mission-critical services. Generally, the intra-city
active/standby + remote active/standby solution or intra-city active-active + remote active/standby
solution is used.
In active-active mode, all I/O paths can access active-active LUNs to achieve load balancing and
seamless failover.
Huawei's active-active data center solution adopts the active-active architecture and combines the industry-leading HyperMetro functions with web, database cluster, load balancing, transmission, and network components to provide customers with an end-to-end active-active data center solution within 100 km. Even if a device or an entire data center fails, services are not affected and are switched over automatically, ensuring service continuity.
4.2.2.2 New DR Mode Evolution in Cloud Computing
To help enterprise customers build high-quality IT systems and meet service development
requirements, Huawei provides professional services in terms of storage, cloud computing, and
servers based on IT products.
At the storage layer, Huawei provides professional storage data migration, disaster recovery,
backup, and virtualization takeover services to meet enterprise customers' requirements for storage
replacement, data protection, and unified storage management.
In terms of cloud computing, Huawei provides professional services, such as cloud planning and
design, FusionSphere solution implementation, FusionCloud desktop solution implementation,
FusionSphere service migration, big data planning, and big data solution implementation, to meet
enterprise customers' requirements on virtualization planning and design, implementation,
migration, and big data planning and implementation.
At the data center layer, Huawei provides professional data center consolidation services to meet
customers' requirements for data center L1 and L2, data protection, data center planning, and data
migration. L1 is the infrastructure layer, including the floor layout, power system, cooling system,
cabling system, fire extinguishing system, and physical security. L2 is the IT infrastructure layer,
which uses the cloud computing system as the core, including computing, network, storage, security,
service continuity, disaster recovery, and backup.
The Huawei enterprise DR and backup service provides multiple service products, including storage
data migration, DR, backup, VM migration, and cloud solution implementation services, covering
multiple industries such as government, energy, finance, and education. Huawei provides
professional service solutions with industry-specific characteristics to meet enterprise customers'
requirements for IT infrastructure update, data protection, and technological transformation.
Based on customer requirements and service lifecycle, the Huawei enterprise DR and backup service
provides customers with one-stop professional services, including project management, planning,
design, integration test, integration implementation, integration verification, and optimization.
In addition, professional and diversified tools are used to quickly collect and analyze project
information, design and implement solutions, and customize and deliver the most appropriate
professional service solutions for customers.
In this mode, data written by a production host is stored only to the primary LUN, and the difference
between the primary and secondary LUNs is recorded by the differential log. If you want to ensure
data consistency between the primary and secondary LUNs, you can manually start a
synchronization process. During the synchronization process, data blocks marked as differential in
the differential log are incrementally copied from the primary LUN to the secondary LUN. The I/O
processing principle is similar to that of initial synchronization.
SAN Asynchronous Replication Principle
A time segment refers to a logical space in the cache for writing data within a period of time (with no limit on the amount of data).
In the low RPO scenario, the asynchronous remote replication period is short. The cache of an
OceanStor storage system can store all data in multiple time segments. However, if the host or DR
bandwidth is abnormal and the replication period is prolonged or interrupted, data in the cache is
automatically written into disks based on the disk flushing policy for consistency protection. Upon
replication, the data is read from disks.
Based on the replication period (which is user-defined and ranges from 3 seconds to 1440 minutes),
the system automatically starts a synchronization procedure for incrementally synchronizing data
from the primary site to the standby site (If the synchronization type is set to manual, users need to
manually trigger synchronization). At the start of a replication period, the system generates time
segments TPN+1 and TPX+1 separately in the caches of the primary LUN (LUN A) and the standby
LUN (LUN B).
The primary site receives a write request from the production host.
The primary site writes the data involved in the write request into time segment TPN+1 in the cache
of LUN A and immediately returns the write complete response to the host.
During data synchronization, the system reads data generated in the previous replication period in
time segment TPN in the cache of LUN A, transmits the data to the standby site, and writes the data
into time segment TPX+1 in the cache of LUN B. If the usage of LUN A's cache reaches a certain
threshold, the system automatically writes data into disks. In this case, a snapshot is generated on
disks for the data in time segment TPN. During data synchronization, the system reads the data from
the snapshot on disks and replicates the data to LUN B.
After the data synchronization is complete, the system writes data in time segments TPN and TPX+1
separately in the caches of LUN A and LUN B into disks based on the disk flushing policy (snapshots
are automatically deleted), and waits for the next replication period.
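The following simplified sketch illustrates the time-segment cycle with hypothetical classes; cache flushing, the snapshot fallback, and failure handling described above are omitted.

# Period-based asynchronous replication with time segments (illustrative only).
class AsyncReplicationPair:
    def __init__(self):
        self.primary_disk = {}        # LUN A data
        self.secondary_disk = {}      # LUN B data
        self.current_tp = {}          # TP(N+1): data written in the current period

    def host_write(self, lba, data):
        # Writes land in the current time segment and are acknowledged immediately.
        self.current_tp[lba] = data
        self.primary_disk[lba] = data

    def start_new_period(self):
        # At the start of a period, freeze the current segment as TP(N) and open a
        # new one; TP(N) is then transmitted incrementally to the secondary site.
        tp_n, self.current_tp = self.current_tp, {}
        for lba, data in tp_n.items():
            self.secondary_disk[lba] = data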
Switchover:
A primary/secondary switchover can be performed for a synchronous remote replication pair when the pair is in the normal state. In the split state, the switchover can be performed only after the secondary LUN is set to writable.
For an asynchronous remote replication pair, the switchover can be performed only when the pair is in the split state and the secondary LUN has been set to writable.
NAS Asynchronous Replication Principle
At the beginning of each period, the file system asynchronous remote replication creates a snapshot
for the primary file system. Based on the incremental information generated from the time when
the replication in the previous period is complete to the time when the current period starts, the file
system asynchronous remote replication reads the snapshot data and replicates the data to the
secondary file system. After the incremental replication is complete, the data in the secondary file
system is the same as that in the primary file system, and a data consistency point is formed on the secondary file system.
DeviceManager supports multiple operating systems and browsers. For details about the
compatibility information, visit Huawei Storage Interoperability Navigator.
The maintenance terminal communicates with the storage system properly.
The super administrator can log in to the storage system using this authentication mode only.
Before logging in to DeviceManager as a Lightweight Directory Access Protocol (LDAP) domain user,
first configure the LDAP domain server, and then configure parameters on the storage system to add
it into the LDAP domain, and finally create an LDAP domain user.
By default, DeviceManager allows 32 users to log in concurrently.
A storage system provides built-in roles and supports customized roles.
Built-in roles are preset in the system with specific permissions shown in the table. Built-in roles
include the super administrator, administrator, and read-only user.
Permissions of user-defined roles can be configured based on actual requirements.
To support permission control in multi-tenant scenarios, the storage system divides built-in roles
into two groups: system group and tenant group. Specifically, the differences between the system
group and tenant group are as follows:
Tenant group: roles in this group are used only in the tenant view (view that can be operated after
you log in to DeviceManager using a tenant account).
System group: roles belonging to this group are used only in the system view (view that can be
operated after you log in to DeviceManager using a system group account).
ITIL is widely recognized and used around the world. Many famous multinational companies, such as IBM, HP, Microsoft, P&G, and HSBC, are
active practitioners of ITIL. As the industry is gradually changing from technology-oriented to service-
oriented, enterprises' requirements for IT service management are also increasing, which greatly
helps standardize IT processes, keep IT processes' pace with business, and improve processing
efficiency.
ITIL has strong support from the UK, other European countries, North America, New Zealand, and Australia. Whether an enterprise has adopted ITIL is regarded as a key indicator for determining whether a supplier or outsourcing service contractor is qualified for bidding.
Information Collection
The information to be collected includes basic information, fault information, storage device
information, networking information, and application server information.
Customer information: provides the contact person and contact details.