HCIA-Storage Learning Guide: Huawei Storage Certification Training
HCIA-Storage
Learning Guide
V4.5
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of
their respective holders.
Notice
The purchased products, services and features are stipulated by the contract made
between Huawei and the customer. All or part of the products, services and features
described in this document may not be within the purchase scope or the usage scope.
Unless otherwise specified in the contract, all statements, information, and
recommendations in this document are provided "AS IS" without warranties,
guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has
been made in the preparation of this document to ensure accuracy of the contents, but
all statements, information, and recommendations in this document do not constitute
a warranty of any kind, express or implied.
Website: http://e.huawei.com
Input: inputs data in a specific format, which depends on the processing mechanism.
For example, when a computer is used, the input data can be recorded on several
types of media, such as disks and tapes.
Processing: performs actions on the input data to extract more value from it. For
example, time card hours are processed to calculate payroll, or sales orders are
processed to generate sales reports.
Output: generates and outputs the processing result. The form of the output data
depends on the data use. For example, the output data can be an employee's salary.
Data is a record that reflects the attributes of an object and is the specific form that
carries information. Data becomes information after being processed, and information
needs to be digitized into data for storage and transmission.
increased by increasing the disk density at low cost. However, SSD capacity could be
doubled only by doubling the number of internal chips, which was difficult.
The MLC SSD proved that it is possible to double the capacity by storing
more bits in one cell. In addition, SSD performance is much higher than that of an
HDD. An SSD offers a read bandwidth of 240 MB/s, a write bandwidth of 215 MB/s,
a read latency of less than 100 microseconds, 50,000 read IOPS, and 10,000 write IOPS.
HDD vendors are facing a huge threat.
SSD flash chips have evolved from SLC, with one cell storing one bit, to MLC with two bits
and TLC with three bits, and have now developed into QLC with one cell storing four bits.
provide more computing power for the underlying data infrastructure, deliver efficient
and low-cost storage media, and narrow the gap between storage and computing. These
problems need to be solved by dedicated storage hardware. Concepts similar to
Memory Fabric also bring changes to the storage architecture.
The last trend is convergence. In the future, storage will be integrated with the data
infrastructure to support heterogeneous chip computing, streamline diversified protocols,
and collaborate with data processing and big data analytics to reduce data processing
costs and improve efficiency. For example, compared with the storage provided by
general-purpose servers, the integration of data and storage will lower the TCO because
data processing is offloaded from servers to storage. Object, big data, and other protocols
are converged and interoperate to implement migration-free big data. Such convergence
greatly affects the design of storage systems and is the key to improving storage
efficiency.
As science and technology develop, disk capacity is increasing and disk size is becoming
smaller. When it comes to storing information, however, a hard disk is still very large
compared with genes, yet the amount of information it stores is far less. Therefore,
scientists have started to use DNA to store data. At first, a few teams tried to write data
into the genomes of living cells, but the approach has a couple of disadvantages. Cells
replicate, introducing new mutations over time that can change the data. Moreover, cells
die, which means the data is lost. Later, teams attempted to store data using artificially
synthesized DNA, which is free from cells. Although DNA storage density is now high
enough that a small amount of artificial DNA can store a large amount of data, reading
and writing the data are not efficient. In addition, synthesizing DNA molecules is
expensive. However, it can be predicted that the cost will fall as gene sequencing
technologies develop.
References:
Bohannon, J. (2012). DNA: The Ultimate Hard Drive. Science. Retrieved from
https://www.sciencemag.org/news/2012/08/dna-ultimate-hard-drive
Akram, F., Haq, I. U., Ali, H., & Laghari, A. T. (2018). Trends to store digital data in DNA:
an overview. Molecular Biology Reports, 45(5), 1479–1490. doi:10.1007/s11033-018-4280-y
Although atomic storage has a short history as a technology, it is not a new concept.
As early as December 1959, physicist Richard Feynman gave a lecture at the annual
American Physical Society meeting at Caltech, "There's Plenty of Room at the Bottom: An
Invitation to Enter a New Field of Physics." In this lecture, Feynman considered the
possibility of using individual atoms as basic units for information storage.
In July 2016, researchers from Delft University of Technology in the Netherlands published
a paper in Nature Nanotechnology. They used chlorine atoms on copper plates to store 1
kilobyte of rewritable data. For now, however, the memory can only operate in a
highly clean vacuum environment or in a liquid nitrogen environment at a temperature
of minus 196°C (77 K).
References:
Erwin, S. A picture worth a thousand bytes. Nature Nanotech 11, 919–920 (2016).
https://doi.org/10.1038/nnano.2016.141
Kalff, F., Rebergen, M., Fahrenfort, E. et al. A kilobyte rewritable atomic memory. Nature
Nanotech 11, 926–929 (2016). https://doi.org/10.1038/nnano.2016.131
Because an atom is so small, the capacity of atomic storage will be much larger than that
of existing storage media of the same size. With the development of science and
technology in recent years, Feynman's idea has become a reality. To pay tribute to
Feynman's great idea, some research teams wrote his lecture into atomic memory.
Although the idea of atomic storage is incredible and its implementation is becoming
possible, atomic memory has strict requirements on the operating environment. Atoms
are in constant motion, and even the atoms inside solids vibrate in the ambient
environment, so it is difficult to keep them in an ordered state under normal conditions.
Atomic storage can only be used at low temperatures, in liquid nitrogen, or in vacuum
conditions.
While DNA storage and atomic storage are intended to reduce the size of storage and
increase its capacity, quantum storage is designed to improve performance and running
speed.
After years of research, both the storage efficiency and the lifecycle of quantum memory
have improved, but it is still difficult to put quantum memory into practice.
Quantum memory suffers from inefficiency, high noise, a short lifespan, and difficulty
operating at room temperature. Only by solving these problems can quantum memory be
brought to market.
Elements in the quantum state are easily lost due to the influence of the external
environment. In addition, it is difficult to guarantee 100% accuracy when preparing
quantum states and performing quantum operations.
References:
Wang, Y., Li, J., Zhang, S. et al. Efficient quantum memory for single-photon polarization
qubits. Nat. Photonics 13, 346–351 (2019). https://doi.org/10.1038/s41566-019-0368-8
Dou Jian-Peng, Li Hang, Pang Xiao-Ling, Zhang Chao-Ni, Yang Tian-Huai, Jin Xian-Min.
Research progress of quantum memory. Acta Physica Sinica, 2019, 68(3): 030307. doi:
10.7498/aps.68.20190039
Back-end (BE) ports connect a controller enclosure to a disk enclosure and provide
disks with channels for reading and writing data.
A cache is a memory chip on a disk controller. It provides fast data access and is a
buffer between the internal storage and external interfaces.
An engine is a core component of a development program or system on an
electronic platform. It usually provides support for programs or a set of systems.
Coffer disks store user data, system configurations, logs, and dirty data in the cache
to protect against unexpected power outages.
Built-in coffer disk: Each controller of Huawei OceanStor Dorado V6 has one or
two built-in SSDs as coffer disks. See the product documentation for more
details.
External coffer disk: The storage system automatically selects four disks as coffer
disks. Each coffer disk provides 2 GB space to form a RAID 1 group. The
remaining space can store service data. If a coffer disk is faulty, the system
automatically replaces the faulty coffer disk with a normal disk for redundancy.
Power module: The controller enclosure employs an AC power module for its normal
operations.
A 4 U controller enclosure has four power modules (PSU 0, PSU 1, PSU 2, and
PSU 3). PSU 0 and PSU 1 form a power plane to power controllers A and C and
provide mutual redundancy. PSU 2 and PSU 3 form the other power plane to
power controllers B and D and provide mutual redundancy. It is recommended
that you connect PSU 0 and PSU 2 to one PDU and PSU 1 and PSU 3 to another
PDU for maximum reliability.
A 2 U controller enclosure has two power modules (PSU 0 and PSU 1) to power
controllers A and B. The two power modules form a power plane and provide
mutual redundancy. Connect PSU 0 and PSU 1 to different PDUs for maximum
reliability.
2.1.3.2 CE Switch
Huawei CloudEngine series fixed switches are next-generation Ethernet switches for data
centers and provide high performance, high port density, and low latency. The switches
use a flexible front-to-rear or rear-to-front design for airflow and support IP SANs and
distributed storage networks.
2.1.4 HDD
2.1.4.1 HDD Structure
A platter is coated with magnetic material on both surfaces; the polarity of the magnetic
grains represents a binary information unit, or bit.
A read/write head reads and writes data for platters. It changes the polarities of
magnetic grains on the platter surface to save data.
The actuator arm moves the read/write head to the specified position.
The spindle has a motor and bearing underneath. It rotates the specified position on
the platter to the read/write head.
The control circuit controls the speed of the platter and movement of the actuator
arm, and delivers commands to the head.
Tracks may vary in the number of sectors. A sector can generally store 512 bytes of
user data, but some disks can be formatted into even larger sectors of 4 KB.
where the head does not need to change the track or read a specified sector, but
reads and writes all sectors sequentially and cyclically on one track.
External transfer rate is also called burst data transfer rate or interface transfer rate.
It refers to the data transfer rate between the system bus and the disk buffer and
depends on the disk port type and buffer size.
Serial transmission transfers less data per clock cycle than parallel transmission, but is
generally faster because the transmission frequency can be increased further.
Serial transmission is used for long-distance transmission. The PCIe interface is a typical
example of serial transmission: the transmission rate of a single lane is up to 2.5 Gbit/s.
Advantages:
It is applicable to a wide range of devices. One SCSI controller card can connect
to 15 devices simultaneously.
It provides high performance with multi-task processing, low CPU usage, fast
rotation speed, and a high transmission rate.
SCSI disks support diverse applications as external or built-in components with
hot-swappable replacement.
Disadvantages:
High cost and complex installation and configuration.
SAS port:
SAS is similar to SATA in its use of a serial architecture for a high transmission rate
and streamlined internal space with shorter internal connections.
SAS improves the efficiency, availability, and scalability of the storage system. It is
backward compatible with SATA for the physical and protocol layers.
Advantages:
SAS is superior to SCSI in its transmission rate, anti-interference, and longer
connection distances.
Disadvantages:
SAS disks are more expensive.
Fibre Channel port:
Fibre Channel was originally designed for network transmission rather than disk
ports. It has gradually been applied to disk systems in pursuit of higher speed.
Advantages:
Easy to upgrade. Supports optical fiber cables with a length over 10 km.
Large bandwidth
Strong universality
Disadvantages:
High cost
Complex to build
2.1.5 SSD
2.1.5.1 SSD Overview
Traditional disks use magnetic materials to store data, but SSDs use NAND flash with
cells as storage units. NAND flash is a non-volatile random access storage medium that
can retain stored data after the power is turned off. It quickly and compactly stores
digital information.
SSDs eliminate high-speed rotational components for higher performance, lower power
consumption, and zero noise.
SSDs do not have mechanical parts, but this does not mean that they have an infinite life
cycle. Because NAND flash is a non-volatile medium, original data must be erased before
new data can be written. However, there is a limit to how many times each cell can be
erased. Once the limit is reached, data reads and writes become invalid on that cell.
These four types of cells have similar costs but store different amounts of data.
Originally, the capacity of an SSD was only 64 GB or smaller. Now, a TLC SSD can store
up to 2 TB of data. However, each cell type has a different life cycle, resulting in different
SSD reliability. The life cycle is also an important factor in selecting SSDs.
For example, host page A was originally stored in flash page X, and the mapping
relationship was A to X. Later, the host rewrites the host page. Flash memory does
not overwrite data, so the SSD writes the new data to a new page Y, establishes the
new mapping relationship of A to Y, and cancels the original mapping relationship.
The data in page X becomes aged and invalid, which is also known as garbage data.
The host continues to write data to the SSD until it is full. In this case, the host
cannot write more data unless the garbage data is cleared.
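The page-remapping behavior described above can be summarized in a minimal sketch. The class and field names below are illustrative assumptions, not the implementation of any actual SSD controller:

```python
# Minimal sketch of flash-translation-layer (FTL) remapping on rewrite (illustrative only).

class SimpleFTL:
    def __init__(self, num_flash_pages):
        self.mapping = {}                          # host page -> flash page
        self.free_pages = list(range(num_flash_pages))
        self.garbage_pages = set()                 # pages holding stale (invalid) data

    def write(self, host_page):
        if not self.free_pages:
            raise RuntimeError("SSD full: garbage collection needed before new writes")
        new_page = self.free_pages.pop(0)          # flash never overwrites a page in place
        old_page = self.mapping.get(host_page)
        if old_page is not None:
            self.garbage_pages.add(old_page)       # the old copy becomes garbage data
        self.mapping[host_page] = new_page         # new mapping, e.g. A -> Y
        return new_page

ftl = SimpleFTL(num_flash_pages=4)
x = ftl.write("A")     # host page A is stored in flash page X
y = ftl.write("A")     # rewrite: data goes to a new page Y, and X becomes garbage
print(x in ftl.garbage_pages, ftl.mapping["A"] == y)   # True True
```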
SSD read process:
An 8-fold increase in read speed depends on whether the read data is evenly
distributed in the blocks of each channel. If the 32 KB data is stored in the blocks of
channels 1 through 4, the read speed can only support a 4-fold improvement at
most. That is why smaller files are transmitted at a slower rate.
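As a rough illustration of why the speedup depends on how the data is spread, the sketch below estimates the parallel read factor as the number of distinct channels holding part of the request (an 8-channel SSD with 4 KB flash pages is assumed):

```python
# Rough illustration: the read speedup is bounded by the number of distinct channels
# that actually hold part of the requested data (assumed: 8 channels, 4 KB pages).

def read_speedup(pages_by_channel):
    """pages_by_channel: one channel ID per 4 KB page of the request."""
    return len(set(pages_by_channel))

# A 32 KB request is 8 pages. Spread over all 8 channels -> up to an 8-fold speedup.
print(read_speedup([0, 1, 2, 3, 4, 5, 6, 7]))   # 8
# The same 32 KB stored only in channels 1 through 4 -> at most a 4-fold improvement.
print(read_speedup([1, 2, 3, 4, 1, 2, 3, 4]))   # 4
```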
Gbit/s Fibre Channel ports. SmartIO interface modules connect storage devices to
application servers.
The optical module rate must match the rate on the interface module label. Otherwise,
the storage system will report an alarm and the port will become unavailable.
Combines multiple physical disks into one logical disk array to provide larger storage
capacity.
Divides data into blocks and concurrently writes/reads data to/from multiple disks to
improve disk access efficiency.
Provides mirroring or parity for fault tolerance.
Hardware RAID and software RAID can be implemented in storage devices.
Hardware RAID uses a dedicated RAID adapter, disk controller, or storage processor.
The RAID controller has a built-in processor, I/O processor, and memory to improve
resource utilization and data transmission speed. The RAID controller manages
routes and buffers, and controls data flows between the host and the RAID array.
Hardware RAID is usually used in servers.
Software RAID has no built-in processor or I/O processor but relies on a host
processor. Therefore, a low-speed CPU cannot meet the requirements for RAID
implementation. Software RAID is typically used in enterprise-class storage devices.
Disk striping: Space in each disk is divided into multiple strips of a specific size. Data is
also divided into blocks based on strip size when data is being written.
Strip: A strip consists of one or more consecutive sectors in a disk, and multiple strips
form a stripe.
Stripe: A stripe consists of strips of the same location or ID on multiple disks in the
same array.
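With these definitions, the disk, stripe, and offset holding a given logical address can be derived as in the sketch below (the strip size and disk count are illustrative values, not a specific product's layout):

```python
# Sketch: map a logical byte address to (disk, stripe, offset) in a striped array.
# Parameters are illustrative only.

STRIP_SIZE = 64 * 1024        # bytes per strip
NUM_DISKS = 4                 # member disks (no parity in this example)

def locate(address):
    strip_index = address // STRIP_SIZE      # which strip overall, counted across disks
    disk = strip_index % NUM_DISKS           # strips are written to disks in round-robin order
    stripe = strip_index // NUM_DISKS        # strips at the same position form a stripe
    offset = address % STRIP_SIZE            # offset inside the strip
    return disk, stripe, offset

print(locate(0))            # (0, 0, 0)
print(locate(200 * 1024))   # (3, 0, 8192): the fourth strip, on disk 3 in stripe 0
```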
RAID generally provides two methods for data protection.
One is storing data copies on another redundant disk to improve data reliability and
read performance.
The other is parity. Parity data is additional information calculated using user data.
For a RAID array that uses parity, an additional parity disk is required. The XOR
(symbol: ⊕) algorithm is used for parity.
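The XOR parity method can be demonstrated directly: the parity strip is the XOR of the data strips, and any single lost strip can be rebuilt by XOR-ing the surviving strips with the parity. A short sketch with made-up strip contents:

```python
# Sketch of XOR parity: parity = d0 XOR d1 XOR d2; any one lost strip is recoverable.
from functools import reduce

def xor_parity(strips):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

d0, d1, d2 = b"\x0f\xaa", b"\xf0\x55", b"\x33\xcc"
p = xor_parity([d0, d1, d2])            # parity strip written to the parity disk

# The disk holding d1 fails: rebuild d1 from the surviving strips and the parity.
rebuilt_d1 = xor_parity([d0, d2, p])
print(rebuilt_d1 == d1)                 # True
```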
2.2.1.2 RAID 0
RAID 0, also referred to as striping, provides the best storage performance among all
RAID levels. RAID 0 uses the striping technology to distribute data to all disks in a RAID
array.
2.2.1.3 RAID 1
RAID 1, also referred to as mirroring, maximizes data security. A RAID 1 array uses two
identical disks including one mirror disk. When data is written to a disk, a copy of the
same data is stored in the mirror disk. When the source (physical) disk fails, the mirror
disk takes over services from the source disk to maintain service continuity. The mirror
disk is used as a backup to provide high data reliability.
The amount of data stored in a RAID 1 array is only equal to the capacity of a single disk,
because a copy of the data is retained on the other disk. That is, each gigabyte of data
requires 2 gigabytes of disk space. Therefore, a RAID 1 array consisting of two disks has a
space utilization of 50%.
2.2.1.4 RAID 3
RAID 3 is similar to RAID 0 but uses dedicated parity stripes. In a RAID 3 array, a
dedicated disk (parity disk) is used to store the parity data of strips in other disks in the
same stripe. If incorrect data is detected or a disk fails, data in the faulty disk can be
recovered using the parity data. RAID 3 applies to data-intensive or single-user
environments where data blocks need to be continuously accessed for a long time. RAID
3 writes data to all member data disks. However, when new data is written to any disk,
RAID 3 recalculates and rewrites parity data. Therefore, when a large amount of data
from an application is written, the parity disk in a RAID 3 array needs to process heavy
workloads. Parity operations have certain impact on the read and write performance of a
RAID 3 array. In addition, the parity disk is subject to the highest failure rate in a RAID 3
array due to heavy workloads. A write penalty occurs when just a small amount of data is
written to multiple disks, which does not improve disk performance as compared with
data writes to a single disk.
2.2.1.5 RAID 5
RAID 5 is improved based on RAID 3 and consists of striping and parity. In a RAID 5 array,
data is written to disks by striping. In a RAID 5 array, the parity data of different strips is
distributed among member disks instead of a parity disk.
Similar to RAID 3, a write penalty occurs when just a small amount of data is written.
2.2.1.6 RAID 6
Data protection mechanisms of all RAID arrays previously discussed considered only
failures of individual disks (excluding RAID 0). The time required for reconstruction
increases along with the growth of disk capacities. It may take several days instead of
hours to reconstruct a RAID 5 array consisting of large-capacity disks. During the
reconstruction, the array is in the degraded state, and the failure of any additional disk
will cause the array to be faulty and data to be lost. This is why some organizations or
units need a dual-redundancy system. In other words, a RAID array should tolerate
failures of up to two disks while maintaining normal access to data. Such dual-
redundancy data protection can be implemented in the following ways:
The first one is multi-mirroring. Multi-mirroring is a method of storing multiple
copies of a data block in redundant disks when the data block is stored in the
primary disk. This means heavy overheads.
The second one is a RAID 6 array. A RAID 6 array protects data by tolerating failures
of up to two disks even at the same time.
The formal name of RAID 6 is distributed double-parity (DP) RAID. It is essentially an
improved RAID 5 and also consists of striping and distributed parity. RAID 6 supports
double parity, which means that the array remains accessible even if two disks fail at the
same time.
2.2.1.7 RAID 10
For most enterprises, RAID 0 is not really a practical choice, while RAID 1 is limited by
disk capacity utilization. RAID 10 provides the optimal solution by combining RAID 1 and
RAID 0. In particular, RAID 10 provides superior performance by eliminating write penalty
in random writes.
A RAID 10 array consists of an even number of disks. User data is written to half of the
disks and mirror copies of user data are retained in the other half of disks. Mirroring is
performed based on stripes.
2.2.1.8 RAID 50
RAID 50 combines RAID 0 and RAID 5. Two RAID 5 sub-arrays form a RAID 0 array. The
two RAID 5 sub-arrays are independent of each other. A RAID 50 array requires at least
six disks because a RAID 5 sub-array requires at least three disks.
specified. Actually, RAID 2.0+ provides more flexible and specific data redundancy
protection methods. The storage space formed by disks in a disk domain is divided
into storage pools of a smaller granularity and hot spare space shared among
storage tiers. The system automatically sets the hot spare space based on the hot
spare policy (high, low, or none) set by an administrator for the disk domain and the
number of disks at each storage tier in the disk domain. In a traditional RAID array,
an administrator must specify a disk as the hot spare disk.
2. Storage Pool and Storage Tier
A storage pool is a storage resource container. The storage resources used by
application servers are all from storage pools.
A storage tier is a collection of storage media providing the same performance level
in a storage pool. Different storage tiers manage storage media of different
performance levels and provide storage space for applications that have different
performance requirements.
A storage pool created based on a specified disk domain dynamically allocates CKs
from the disk domain to form CKGs according to the RAID policy of each storage tier
for providing storage resources with RAID protection to applications.
A storage pool can be divided into multiple tiers based on disk types.
When creating a storage pool, a user is allowed to specify a storage tier and related
RAID policy and capacity for the storage pool.
OceanStor storage systems support RAID 1, RAID 10, RAID 3, RAID 5, RAID 50, and
RAID 6 and related RAID policies.
The capacity tier consists of large-capacity SATA and NL-SAS disks. DP RAID 6 is
recommended.
3. Disk Group
An OceanStor storage system automatically divides disks of each type in each disk
domain into one or more disk groups (DGs) according to disk quantity.
One DG consists of disks of only one type.
CKs in a CKG are allocated from different disks in a DG.
DGs are internal objects automatically configured by OceanStor storage systems and
typically used for fault isolation. DGs are not presented externally.
4. Logical Drive
A logical drive (LD) is a disk that is managed by a storage system and corresponds to
a physical disk.
5. CK
A chunk (CK) is a block of disk space of a specified size allocated from a disk in the disk
domain. It is the basic unit of a CKG.
6. CKG
A chunk group (CKG) is a logical storage unit that consists of CKs from different
disks in the same DG based on the RAID algorithm. It is the minimum unit for
allocating resources from a disk domain to a storage pool.
All CKs in a CKG are allocated from the disks in the same DG. A CKG has RAID
attributes, which are actually configured for corresponding storage tiers. CKs and
CKGs are internal objects automatically configured by storage systems. They are not
presented externally.
7. Extent
Each CKG is divided into logical storage spaces of a specific and adjustable size called
extents. Extent is the minimum unit (granularity) for migration and statistics of hot
data. It is also the minimum unit for space application and release in a storage pool.
An extent belongs to a volume or LUN. A user can set the extent size when creating
a storage pool. After that, the extent size cannot be changed. Different storage pools
may consist of extents of different sizes, but one storage pool must consist of extents
of the same size.
8. Grain
When a thin LUN is created, extents are divided into 64 KB blocks which are called
grains. A thin LUN allocates storage space by grains. Logical block addresses (LBAs)
in a grain are consecutive.
Grains are mapped to thin LUNs. A thick LUN does not involve grains.
9. Volume and LUN
A volume is an internal management object in a storage system.
A LUN is a storage unit that can be directly mapped to a host for data reads and
writes. A LUN is the external embodiment of a volume.
A volume organizes all extents and grains of a LUN and applies for and releases
extents to increase and decrease the actual space used by the volume.
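How these objects relate can be sketched as follows. The extent and grain sizes and the allocation logic below are simplified assumptions for illustration, not the storage system's real allocator:

```python
# Simplified sketch: a thin LUN allocates 64 KB grains out of extents that its volume
# obtains from the storage pool. Sizes and names are illustrative assumptions.

EXTENT_SIZE = 4 * 1024 * 1024    # fixed when the storage pool is created
GRAIN_SIZE = 64 * 1024           # thin LUNs allocate space by grains
GRAINS_PER_EXTENT = EXTENT_SIZE // GRAIN_SIZE

class ThinLUN:
    def __init__(self):
        self.grain_map = {}      # logical grain index -> (extent id, grain slot in extent)
        self.extent_count = 0    # extents granted to this LUN's volume so far
        self.next_slot = GRAINS_PER_EXTENT   # force an extent request on the first write

    def write(self, address):
        grain_index = address // GRAIN_SIZE
        if grain_index in self.grain_map:
            return self.grain_map[grain_index]        # grain already allocated
        if self.next_slot >= GRAINS_PER_EXTENT:
            self.extent_count += 1                    # apply for another extent from the pool
            self.next_slot = 0
        location = (self.extent_count - 1, self.next_slot)
        self.grain_map[grain_index] = location
        self.next_slot += 1
        return location

lun = ThinLUN()
print(lun.write(0))                  # (0, 0): first grain of the first extent
print(lun.write(10 * GRAIN_SIZE))    # (0, 1): space is allocated only as data is written
```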
2.2.3.2 RAID-TP
RAID protection is essential to a storage system for consistently high reliability and
performance. However, the reliability of RAID protection is challenged by uncontrollable
RAID array construction time due to drastic increase in capacity.
RAID-TP achieves optimal performance, reliability, and capacity utilization.
Customers have to purchase disks of larger capacity to replace existing disks for system
upgrades. In such a case, one system may consist of disks of different capacities. How can
the optimal capacity utilization be maintained in a system that uses a mix of disks with
different capacities?
RAID-TP uses Huawei's optimized FlexEC algorithm that allows the system to tolerate
failures of up to three disks, improving reliability while allowing a longer reconstruction
time window.
RAID-TP with FlexEC algorithm reduces the amount of data read from a single disk by
70%, as compared with traditional RAID, minimizing the impact on system performance.
In a typical 4:2 RAID 6 array, the capacity utilization is about 67%. The capacity
utilization of a Huawei OceanStor all-flash storage system with 25 disks is improved by
20% on this basis.
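These utilization figures follow directly from the ratio of data disks to total disks, as the short calculation below shows (the 25-disk grouping used in the second line is an illustrative assumption):

```python
# Capacity utilization of a parity-protected group = data disks / total disks.

def utilization(data_disks, parity_disks):
    return data_disks / (data_disks + parity_disks)

print(round(utilization(4, 2), 2))    # 0.67 -> a typical 4:2 RAID 6 array
print(round(utilization(22, 3), 2))   # 0.88 -> e.g. 25 disks with triple parity (illustrative)
```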
target responds to the SCSI requests, provides services through LUNs, and provides a task
management function.
2.3.2.4 FC Protocol
FC can be referred to as the FC protocol, FC network, or FC interconnection. As FC
delivers high performance, it is becoming more commonly used for front-end host access
on point-to-point and switch-based networks. Like TCP/IP, the FC protocol suite also
includes concepts from the TCP/IP protocol suite and the Ethernet, such as FC switching,
FC switch, FC routing, FC router, and SPF routing algorithm.
FC protocol structure:
FC-0: defines physical connections and selects different physical media and data
rates for protocol operations. This maximizes system flexibility and allows for existing
cables and different technologies to be used to meet the requirements of different
systems. Copper cables and optical cables are commonly used.
FC-1: defines the 8b/10b transmission encoding used to balance the transmitted bit
stream. The encoding also serves as a mechanism for transferring data and detecting
errors. 8b/10b encoding helps reduce component design costs and ensures optimum
transition density for better clock recovery. Note: 8b/10b encoding is also used by IBM
ESCON.
FC-2: includes the following items for sending data over the network:
How data should be split into small frames
How much data should be sent at a time (flow control)
Where frames should be sent (including defining service levels based on
applications)
FC-3: defines advanced functions such as striping (data is transferred through
multiple channels), multicast (one message is sent to multiple targets), and group
query (multiple ports are mapped to one node). Whereas FC-2 defines functions for a
single port, FC-3 defines functions that span multiple ports.
Figure 2-9
Fabric:
Similar to an Ethernet switching topology, a fabric topology is a mesh switching
matrix.
The forwarding efficiency is much greater than in FC-AL.
FC devices are connected to fabric switches through optical fibres or copper
cables to implement point-to-point communication between nodes.
FC frees the workstation from the management of every port. Each port
manages its own point-to-point connection to the fabric, and other fabric
functions are implemented by FC switches. On an FC network, there are seven
types of ports.
Device (node) port:
N_Port: Node port. A fabric device can be directly attached.
NL_Port: Node loop port. A device can be attached to a loop.
Switch port:
E_Port: Expansion port (connecting switches).
Ethernet network, the Ethernet needs to be enhanced to prevent packet loss. The
enhanced Ethernet is called Converged Enhanced Ethernet (CEE).
Port layer: describes the interfaces of the link layer and transport layer, including
how to request, interrupt, and set up connections.
Transport layer: defines how the transmitted commands, status, and data are
encapsulated into SAS frames and how SAS frames are decomposed.
Application layer: describes how to use SAS in different types of applications.
SAS has the following characteristics:
SAS uses the full-duplex (bidirectional) communication mode. The traditional parallel
SCSI can communicate only in one direction. When a device receives a data packet
from the parallel SCSI and needs to respond, a new SCSI communication link needs
to be set up after the previous link is disconnected. However, each SAS cable
contains two input cables and two output cables. This way, SAS can read and write
data at the same time, improving the data throughput efficiency.
Compared with SCSI, SAS has the following advantages:
As it uses the serial communication mode, SAS provides higher throughput and
may deliver higher performance in the future.
Four narrow ports can be bound as a wide link port to provide higher
throughput.
Scalability of SAS
SAS uses expanders to expand interfaces. One SAS domain supports a maximum of
16,384 disk devices.
A SAS expander is an interconnection device in a SAS domain. Similar to an Ethernet
switch, a SAS expander enables an increased number of devices to be connected in a
SAS domain, and reduces the cost in HBAs. Each expander can connect to a
maximum of 128 terminals or expanders. The main components in a SAS domain are
SAS expanders, terminal devices, and connection devices (or SAS connection cables).
A SAS expander is equipped with a routing table that tracks the addresses of all
SAS drives.
A terminal device can be an initiator (usually a SAS HBA) or a target (a SAS or
SATA disk, or an HBA in target mode).
Loops cannot be formed in a SAS domain. This ensures terminal devices can be
detected.
In reality, the number of terminal devices connected to an expander is far fewer than
128 due to bandwidth constraints.
Cable Connection Principles of SAS
Most storage device vendors use SAS cables to connect disk enclosures to controller
enclosures or connect disk enclosures. A SAS cable bundles four independent
channels (narrow ports) into a wide port to provide higher bandwidth. The four
independent channels provide 12 Gbit/s each, so a wide port can provide 48 Gbit/s of
bandwidth. To ensure that the data volume on a SAS cable does not exceed the
maximum bandwidth of the SAS cable, the total number of disks connected to a SAS
loop must be limited.
For a Huawei storage device, the maximum number of disks supported in a loop is 168.
That is, a loop comprises a maximum of seven disk enclosures, each with 24 disk slots.
However, all disks in the loop must be traditional SAS disks. As SSDs are becoming
more common, one must consider that SSDs deliver much higher transmission
speeds than SAS disks. Therefore, for SSDs, a maximum of 96 disks are supported in
a loop: four disk enclosures, each with 24 disk slots, form a loop.
A SAS cable is called a mini SAS cable when the speed of a single channel is 6 Gbit/s,
and a SAS cable is called a high-density mini SAS cable when the speed is increased
to 12 Gbit/s.
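The bandwidth and disk-count figures above can be checked with a quick calculation (a sketch; actual loop sizing also depends on the disk and enclosure models used):

```python
# Quick check of SAS wide-port bandwidth and loop sizing (illustrative).

LANE_GBPS = 12              # speed of one SAS channel (high-density mini SAS)
LANES_PER_WIDE_PORT = 4     # four narrow channels bundled into one wide port

print(LANE_GBPS * LANES_PER_WIDE_PORT)   # 48 Gbit/s per wide port

print(7 * 24)   # 168 -> maximum SAS disks in a loop (7 enclosures x 24 slots each)
print(4 * 24)   # 96  -> maximum SSDs in a loop (4 enclosures x 24 slots each)
```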
Compared with the traditional PCI bus, PCIe has the following advantages:
Dual channels, high bandwidth, and a fast transmission rate: A transmission mode
(RX and TX are separated) similar to the full-duplex mode is implemented. In
addition, a higher transmission rate can be provided. The first-generation PCIe X1
provides 2.5 Gbit/s, the second generation provides 5 Gbit/s, the PCIe 3.0 provides 8
Gbit/s, the PCIe 4.0 provides 16 Gbit/s, and the PCIe 5.0 provides up to 32 Gbit/s.
Compatibility: PCIe is compatible with PCI at the software layer but has upgraded
software.
Ease-of-use: Hot swap is supported. A PCIe bus interface slot contains hot swap
detection signals, supporting hot swap and hot plug.
Error processing and reporting: A PCIe bus uses a layered structure, in which the
software layer can process and report errors.
Virtual channels of each physical connection: Each physical channel supports multiple
virtual channels (in theory, eight virtual channels are supported for independent
communication control), thereby supporting QoS of each virtual channel and
achieving high-quality traffic control.
Reduced I/Os, board-level space, and crosstalk: A typical PCI bus data line requires at
least 50 I/O resources, while a PCIe x1 link requires only four. Fewer I/Os save
board-level space and allow greater spacing between I/Os, thereby reducing crosstalk.
Why PCIe? PCIe is future-oriented, and higher throughput can be achieved in the future.
PCIe provides increasing throughput using the latest technologies, and the transition
from PCI to PCIe is simplified because compatibility with PCI software is guaranteed
through layered protocols and drivers. The PCIe protocol features point-to-point
connection, high reliability, tree networking, full duplex, and frame-structure-based
transmission.
PCIe protocol layers include the physical layer, data link layer, transaction layer, and
application layer.
The physical layer in a PCIe bus architecture determines the physical features of the
bus. In future, the performance of a PCIe bus can be further improved by increasing
the speed or changing the encoding or decoding mode. Such changes only affect the
physical layer, facilitating upgrades.
The data link layer ensures the correctness and reliability of data packets transmitted
over a PCIe bus. It checks whether the data packet encapsulation is complete and
correct, adds the sequence number and CRC code to the data, and uses the ack/nack
handshake protocol for error detection and correction.
The transaction layer receives read and write requests from the software layer, creates
request packets, and transmits them to the data link layer. This type of packet is called
a transaction layer packet (TLP). The transaction layer also receives data link layer
packets (DLLPs) from the data link layer, associates them with the related software
requests, and transmits them to the software layer for processing.
The application layer is designed by users based on actual needs. Other layers must
comply with the protocol requirements.
RDMA uses related hardware and network technologies to enable NICs of servers to
directly read memory, achieving high bandwidth, low latency, and low resource
consumption. However, the RDMA-dedicated IB network architecture is incompatible with
a live network, resulting in high costs. RoCE effectively solves this problem. RoCE is a
network protocol that uses the Ethernet to carry RDMA. There are two versions of RoCE.
RoCEv1 is a link layer protocol and cannot be used in different broadcast domains.
RoCEv2 is a network layer protocol and can implement routing functions.
2.3.5.2 IB Protocol
IB technology is specifically designed for server connections and is widely used for
communication between servers (for example, replication and distributed working),
between a server and a storage device (for example, SAN and DAS), and between a
server and a network (for example, LAN, WAN, and the Internet).
IB defines a set of devices used for system communication, including channel adapters,
switches, and routers used to connect to other devices, such as host channel adapters
(HCAs) and target channel adapters (TCAs). The IB protocol has the following features:
Standard-based protocol: IB was designed by the InfiniBand Trade Association, which
was founded in 1999 and comprised 225 companies. Main members of the
association include Agilent, Dell, HP, IBM, InfiniSwitch, Intel, Mellanox, Network
Appliance, and Sun Microsystems. More than 100 other members help develop and
promote the standard.
Speed: IB provides high speeds.
Memory: Servers that support IB use HCAs to convert the IB protocol to the PCI-X or
PCI Express bus inside the server. The HCA supports RDMA, which is also called kernel
bypass. RDMA fits clusters well. It uses a virtual addressing scheme that lets a server
identify and use memory resources of other servers without involving any operating
system kernel.
RDMA helps implement transport offload. The transport offload function transfers
data packet routing from the OS to the chip level, reducing the service load of the
processor. An 80 GHz processor is required to process data at a transmission speed
of 10 Gbit/s in the OS.
The IB system includes CAs, switches, routers, repeaters, and the links that connect them.
CAs include HCAs and TCAs.
An HCA is used to connect a host processor to the IB structure.
A TCA is used to connect I/O adapters to the IB structure.
IB in storage: The IB front-end network is used to exchange data with customers. Data is
transmitted based on the IPoIB protocol. The IB back-end network is used for data
interaction between nodes in a storage device. The RPC module uses RDMA to
synchronize data between nodes.
IB layers include the application layer, transport layer, network layer, link layer, and
physical layer. The functions of each layer are described as follows:
Transport layer: responsible for in-order distribution and segmentation of packets,
channel multiplexing, and data transmission. It also sends, receives, and reassembles
data packet segments.
Network layer: provides a mechanism for routing packets from one substructure to
another. Each routing packet of the source and destination nodes has a global
routing header (GRH) and a 128-bit IPv6 address. A standard global 64-bit identifier
is also embedded at the network layer and this identifier is unique in all subnets.
Through the exchange of such identifier values, data can be transmitted across
multiple subnets.
Link layer: provides such functions as packet design, point-to-point connection, and
packet switching in the local subsystems. At the packet communication level, two
special packet types are specified: data transmission and network management
packets. The network management packet provides functions like operation control,
subnet indication, and fault tolerance for device enumeration. The data transmission
packet is used for data transmission. The maximum size of each packet is 4 KB. In
each specific device subnet, the direction and exchange of each packet are
implemented by a local subnet manager with a 16-bit identifier address.
Physical layer: defines connections at three rates: 1X, 4X, and 12X. The signal
transmission rates are 2.5 Gbit/s, 10 Gbit/s, and 30 Gbit/s, respectively. IBA therefore
allows multiple connections to reach a speed of up to 30 Gbit/s. Because full-duplex
serial communication is used, a 1X bidirectional connection requires only four signal
wires, and the 12X mode requires only 48.
A NAS device is a closed storage system. The Client Agent of the backup software
can only be installed on the production system instead of the NAS device. In the
traditional network backup process, data is read from a NAS device through the CIFS
or NFS sharing protocol, and then transferred to a backup server over a network.
Such a mode occupies network, production system and backup server resources,
resulting in poor performance and an inability to meet the requirements for backing
up a large amount of data.
The NDMP protocol is designed for the data backup system of NAS devices. It enables
NAS devices to send data directly to the connected disk devices or the backup servers on
the network for backup, without any backup client agent being required.
There are two networking modes for NDMP:
On a 2-way network, backup media is connected directly to a NAS storage system
instead of to a backup server. In a backup process, the backup server sends a backup
command to the NAS storage system through the Ethernet. The system then directly
backs up data to the tape library it is connected to.
In the NDMP 2-way backup mode, data flows are transmitted directly to backup
media, greatly improving the transmission performance and reducing server
resource usage. However, a tape library is connected to a NAS storage device, so
the tape library can back up data only for the NAS storage device to which it is
connected.
Tape libraries are expensive. To enable different NAS storage devices to share
tape devices, NDMP also supports the 3-way backup mode.
In the 3-way backup mode, a NAS storage system can transfer backup data to a NAS
storage device connected to a tape library through a dedicated backup network.
Then, the storage device backs up the data to the tape library.
Dual-controller Storage:
Currently, dual-controller architecture is mainly used in mainstream entry-level and
mid-range storage systems.
There are two working modes: Active-Standby and Active-Active.
Active-Standby
This is also called high availability (HA). Only one controller is working at a time,
while the other waits, synchronizes data, and monitors services. If the active
controller fails, the standby controller takes over its services. The active controller is
powered off or restarted before the takeover to prevent split-brain; its buses are
released so that the standby controller can take over the back-end and front-end
buses.
Active-Active
Two controllers are working at the same time. Each connects to all back-end
buses, but each bus is managed by only one controller. Each controller manages
half of all back-end buses. If one controller is faulty, the other takes over all
buses. This is more efficient than Active-Standby.
Mid-range Storage Architecture Evolution:
Mid-range storage systems always use an independent dual-controller architecture.
Controllers are usually of modular hardware.
The evolution of mid-range storage mainly focuses on the rate of host interfaces and
disk interfaces, and the number of ports.
A common feature is the convergence of SAN and NAS storage services.
Multi-controller Storage:
Most mission-critical storage systems use multi-controller architecture.
The main architecture models are as follows:
Bus architecture
Hi-Star architecture
Direct-connection architecture
Virtual matrix architecture
Mission-critical storage architecture evolution:
In 1990, EMC launched Symmetrix, a full bus architecture. A parallel bus connected
front-end interface modules, cache modules, and back-end disk interface modules
for data and signal exchange in time-division multiplexing mode.
In 2000, HDS adopted the switching architecture for Lightning 9900 products. Front-
end interface modules, cache modules, and back-end disk interface modules were
connected on two redundant switched networks, increasing communication channels
to dozens of times more than that of the bus architecture. The internal bus was no
longer a performance bottleneck.
In 2003, EMC launched the DMX series based on a full mesh architecture. All modules
were connected in point-to-point mode, providing theoretically larger internal
bandwidth but adding system complexity and limiting scalability.
Scale-up:
This traditional vertical expansion architecture continuously adds storage disks into
the existing storage systems to meet demands.
Advantage: simple operation at the initial stage
Disadvantage: As the storage system scale increases, resource increase reaches a
bottleneck.
Scale-out:
This horizontal expansion architecture adds controllers to meet demands.
Advantage: As the scale increases, the unit price decreases and the efficiency is
improved.
Disadvantage: The complexity of software and management increases.
Huawei SAS disk enclosure is used as an example.
Port consistency: In a loop, the EXP port of an upper-level disk enclosure is connected
to the PRI port of a lower-level disk enclosure.
Dual-plane networking: Expansion module A connects to controller A, while
expansion module B connects to controller B.
Symmetric networking: On controllers A and B, symmetric ports and slots are
connected to the same disk enclosure.
Forward and backward connection networking: Expansion module A uses forward
connection, while expansion module B uses backward connection.
Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed
the upper limit.
Huawei smart disk enclosure is used as an example.
Port consistency: In a loop, the EXP (P1) port of an upper-level disk enclosure is
connected to the PRI (P0) port of a lower-level disk enclosure.
Dual-plane networking: Expansion board A connects to controller A, while expansion
board B connects to controller B.
Symmetric networking: On controllers A and B, symmetric ports and slots are
connected to the same disk enclosure.
Forward connection networking: Both expansion modules A and B use forward
connection.
Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed
the upper limit.
IP scale-out is used for Huawei OceanStor V3 and V5 entry-level and mid-range series,
Huawei OceanStor V5 Kunpeng series, and Huawei OceanStor Dorado V6 series. IP scale-
out integrates TCP/IP, Remote Direct Memory Access (RDMA), and Internet Wide Area
RDMA Protocol (iWARP) to implement service switching between controllers, which
complies with the all-IP trend of the data center network.
PCIe scale-out is used for Huawei OceanStor 18000 V3 and V5 series, and Huawei
OceanStor Dorado V3 series. PCIe scale-out integrates PCIe channels and the RDMA
technology to implement service switching between controllers.
PCIe scale-out: features high bandwidth and low latency.
IP scale-out: employs standard data center technologies (such as ETH, TCP/IP, and
iWARP) and infrastructure, and boosts the development of Huawei's proprietary chips for
entry-level and mid-range products.
Next, let's move on to I/O read and write processes of the host. The scenarios are as
follows:
Local Write Process
A host delivers write I/Os to engine 0.
Engine 0 writes the data into the local cache, implements mirror protection, and
returns a message indicating that data is written successfully.
Engine 0 flushes dirty data onto a disk. If the target disk is on the local
computer, engine 0 directly delivers the write I/Os.
If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
Engine 1 writes dirty data onto disks.
Non-local Write Process
A host delivers write I/Os to engine 2.
After detecting that the LUN is owned by engine 0, engine 2 transfers the write
I/Os to engine 0.
Engine 0 writes the data into the local cache, implements mirror protection, and
returns a message to engine 2, indicating that data is written successfully.
Engine 2 returns the write success message to the host.
Engine 0 flushes dirty data onto a disk. If the target disk is on the local
computer, engine 0 directly delivers the write I/Os.
If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
Engine 1 writes dirty data onto disks.
Local Read Process
A host delivers read I/Os to engine 0.
If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to the
host.
If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from
the disk. If the target disk is on the local computer, engine 0 reads data from the
disk.
After the read I/Os are hit locally, engine 0 returns the data to the host.
If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
Engine 1 reads data from the disk.
Engine 1 accomplishes the data read.
Engine 1 returns the data to engine 0 and then engine 0 returns the data to the
host.
Non-local Read Process
The LUN is not owned by the engine that delivers read I/Os, and the host
delivers the read I/Os to engine 2.
After detecting that the LUN is owned by engine 0, engine 2 transfers the read
I/Os to engine 0.
If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to
engine 2.
Engine 2 returns the data to the host.
If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from
the disk. If the target disk is on the local computer, engine 0 reads data from the
disk.
After the read I/Os are hit locally, engine 0 returns the data to engine 2 and
then engine 2 returns the data to the host.
If the target disk is on a remote device, engine 0 transfers the I/Os to engine 1
where the disk resides.
Engine 1 reads data from the disk.
Engine 1 completes the data read.
Engine 1 returns the data to engine 0, engine 0 returns the data to engine 2, and
then engine 2 returns the data to the host.
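The common pattern in the four processes above is that an I/O is first forwarded to the engine that owns the LUN, and then, if necessary, to the engine where the target disk resides. A compressed sketch of this forwarding decision (the function and engine names are hypothetical, not the actual controller code):

```python
# Sketch of I/O forwarding by LUN ownership (illustrative only).

def io_path(receiving_engine, owning_engine, disk_engine):
    path = [receiving_engine]
    if receiving_engine != owning_engine:
        path.append(owning_engine)       # non-local I/O: forward to the LUN's owning engine
    # The owner caches and mirrors writes (or checks its cache for reads) and acknowledges;
    # it involves the engine where the target disk resides when flushing or on a cache miss.
    if owning_engine != disk_engine:
        path.append(disk_engine)
    return path

print(io_path("engine 2", "engine 0", "engine 1"))   # ['engine 2', 'engine 0', 'engine 1']
print(io_path("engine 0", "engine 0", "engine 0"))   # ['engine 0'] -> fully local I/O
```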
internal links, and each front-end port provides a communication link for the host. If any
controller restarts during an upgrade, services are seamlessly switched to the other
controller without impacting hosts or interrupting links. The host is unaware of
controller faults. Switchover is completed within 1 second.
The FIM has the following features:
Failure of a controller will not disconnect the front-end link, and the host is unaware
of the controller failure.
The PCIe link between the FIM and the controller is disconnected, and the FIM
detects the controller failure.
Service switchover is performed between the controllers, and the FIM redistributes
host requests to other controllers.
The switchover time is about 1 second, which is much shorter than switchover
performed by multipathing software (10-30s).
In global cache mode, host data is directly written into linear space logs, and the logs
directly copy the host data to the memory of multiple controllers using RDMA based on a
preset copy policy. The global cache consists of two parts:
Global memory: memory of all controllers (four controllers in the figure). This is
managed in a unified memory address, and provides linear address space for the
upper layer based on a redundancy configuration policy.
WAL: new write cache of the log type
The global pool uses RAID 2.0+, full-strip write of new data, and shared RAID groups
between multiple strips.
Another feature is back-end sharing, which includes sharing of back-end interface
modules within an enclosure and cross-controller enclosure sharing of back-end disk
enclosures.
Active-Active Architecture with Full Load Balancing:
Even distribution of unhomed LUNs
Data on LUNs is divided into 64 MB slices. The slices are distributed to different
virtual nodes based on the hash result (LUN ID + LBA).
Front-end load balancing
UltraPath selects appropriate physical links to send each slice to the
corresponding virtual node.
The front-end interconnect I/O modules forward the slices to the corresponding
virtual nodes.
Front-end: If there is no UltraPath or FIM, the controllers forward I/Os to the
corresponding virtual nodes.
Global write cache load balancing
The data volume is balanced.
Data hotspots are balanced.
Global storage pool load balancing
Usage of disks is balanced.
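The slice-based distribution described above can be sketched as follows. The CRC32 hash and the eight virtual nodes are illustrative assumptions; only the 64 MB slice size comes from the description above:

```python
# Sketch: distribute 64 MB LUN slices across virtual nodes by hashing LUN ID + LBA.
import zlib

SLICE_SIZE = 64 * 1024 * 1024    # data on a LUN is divided into 64 MB slices
NUM_VNODES = 8                   # illustrative virtual node count

def vnode_for(lun_id, lba):
    slice_start = (lba // SLICE_SIZE) * SLICE_SIZE
    key = f"{lun_id}:{slice_start}".encode()
    return zlib.crc32(key) % NUM_VNODES

# All addresses inside one 64 MB slice land on the same virtual node,
# while different slices spread across the virtual nodes.
print(vnode_for(1, 0) == vnode_for(1, 10 * 1024 * 1024))     # True
print({vnode_for(1, n * SLICE_SIZE) for n in range(16)})     # multiple distinct nodes
```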
2.5.2 NAS
Enterprises need to store a large amount of data and share the data through a network.
Therefore, network-attached storage (NAS) is a good choice. NAS connects storage
devices to the live network and provides data and file services.
For a server or host, NAS is an external device and can be flexibly deployed through the
network. In addition, NAS provides file-level sharing rather than block-level sharing,
which makes it easier for clients to access NAS over the network. UNIX and Microsoft
Windows users can seamlessly share data through NAS or File Transfer Protocol (FTP).
When NAS sharing is used, UNIX uses NFS and Windows uses CIFS.
NAS has the following characteristics:
NAS provides storage resources through file-level data access and sharing, enabling
users to quickly share files with minimum storage management costs.
NAS is a preferred file sharing storage solution that does not require multiple file
servers.
NAS also helps eliminate bottlenecks in user access to general-purpose servers.
NAS uses network and file sharing protocols for archiving and storage. These
protocols include TCP/IP for data transmission as well as CIFS and NFS for providing
remote file services.
A general-purpose server can be used to carry any application and run a general-purpose
operating system. Unlike general-purpose servers, NAS is dedicated to file services and
provides file sharing services for other operating systems using open standard protocols.
NAS devices are optimized based on general-purpose servers in aspects such as file
service functions, storage, and retrieval. To improve the high availability of NAS devices,
some NAS vendors also support the NAS clustering function.
The components of a NAS device are as follows:
NAS engine (CPU and memory)
One or more NICs that provide network connections, for example, GE NIC and 10GE
NIC.
optimizes the NFS client, so that the VM storage space can be created on the shared
space of the NFS server.
Working principles of CIFS: CIFS runs on top of TCP/IP and allows Windows computers to
access files on UNIX computers over a network.
The CIFS protocol applies to file sharing. Two typical application scenarios are as follows:
File sharing service
CIFS is commonly used in file sharing service scenarios such as enterprise file
sharing.
Hyper-V VM application scenario
SMB can be used to share the images of Hyper-V virtual machines promoted by
Microsoft. In this scenario, the failover feature of SMB 3.0 is required to ensure
service continuity upon a node failure and to ensure the reliability of VMs.
2.5.3 SAN
2.5.3.1 IP SAN Technologies
NIC + Initiator software: Host devices such as servers and workstations use standard NICs
to connect to Ethernet switches. iSCSI storage devices are also connected to the Ethernet
switches or to the NICs of the hosts. The initiator software installed on hosts virtualizes
NICs into iSCSI cards. The iSCSI cards are used to receive and transmit iSCSI data packets,
implementing iSCSI and TCP/IP transmission between the hosts and iSCSI devices. This
mode uses standard Ethernet NICs and switches, eliminating the need for adding other
adapters. Therefore, this mode is the most cost-effective. However, the mode occupies
host resources when converting iSCSI packets into TCP/IP packets, increasing host
operation overheads and degrading system performance. The NIC + initiator software
mode is applicable to scenarios that require relatively low I/O and bandwidth
performance for data access.
TOE NIC + initiator software: The TOE NIC processes the functions of the TCP/IP protocol
layer, and the host processes the functions of the iSCSI protocol layer. Therefore, the TOE
NIC significantly improves the data transmission rate. Compared with the pure software
mode, this mode reduces host operation overheads and requires minimal network
construction expenditure. This is a trade-off solution.
iSCSI HBA:
An iSCSI HBA is installed on the host to implement efficient data exchange between
the host and the switch and between the host and the storage device. Functions of
the iSCSI protocol layer and TCP/IP protocol stack are handled by the host HBA,
occupying the least CPU resources. This mode delivers the best data transmission
performance but requires high expenditure.
The iSCSI communication system inherits part of SCSI's features. The iSCSI
communication involves an initiator that sends I/O requests and a target that
responds to the I/O requests and executes I/O operations. After a connection is set
up between the initiator and target, the target controls the entire process as the
primary device. The target includes the iSCSI disk array and iSCSI tape library.
The iSCSI protocol defines a set of naming and addressing methods for iSCSI
initiators and targets. All iSCSI nodes are identified by their iSCSI names. In this way,
iSCSI names are distinguished from host names.
iSCSI uses iSCSI Qualified Name (IQN) to identify initiators and targets. Addresses
change with the relocation of initiator or target devices, but their names remain
unchanged. When setting up a connection, an initiator sends a request. After the
target receives the request, it checks whether the iSCSI name contained in the
request is consistent with that bound with the target. If the iSCSI names are
consistent, the connection is set up. Each iSCSI node has a unique iSCSI name. One
iSCSI name can be used in the connections from one initiator to multiple targets.
Multiple iSCSI names can be used in the connections from one target to multiple
initiators.
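As a simple illustration of the IQN layout (iqn.yyyy-mm.reversed-domain:optional-identifier), the following Python sketch checks whether a name follows that basic structure. The example names are hypothetical and are not taken from this guide.

```python
import re

# Basic IQN layout: "iqn." + the year-month the naming authority registered its
# domain + the reversed domain name + an optional ":identifier" suffix.
IQN_PATTERN = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.-]+(:.+)?$")

def is_valid_iqn(name: str) -> bool:
    """Return True if the string follows the basic IQN naming layout."""
    return bool(IQN_PATTERN.match(name.lower()))

# Hypothetical initiator and target names, for illustration only.
print(is_valid_iqn("iqn.1991-05.com.example:client01"))        # True
print(is_valid_iqn("iqn.2006-08.com.example:storage.lun0"))    # True
print(is_valid_iqn("eui.02004567a425678d"))                    # False (EUI-64 format, not IQN)
```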
Logical ports are created based on bond ports, VLAN ports, or Ethernet ports. Logical
ports are virtual ports that carry host services. A unique IP address is allocated to each
logical port for carrying its services.
Bond port: To improve reliability of paths for accessing file systems and increase
bandwidth, you can bond multiple Ethernet ports on the same interface module to
form a bond port.
VLAN: VLANs logically divide the physical Ethernet ports or bond ports of a storage system into multiple broadcast domains. When service data is sent or received on a VLAN, the data is tagged with a VLAN ID so that the networks and services of different VLANs are isolated from each other, improving service data security and reliability.
Ethernet port: Physical Ethernet ports on an interface module of a storage system.
Bond ports, VLANs, and logical ports are created based on Ethernet ports.
IP address failover: A logical IP address fails over from a faulty port to an available port, so that services are switched from the faulty port to the available port without interruption. After the faulty port recovers, services can be switched back to it, either automatically or manually. IP address failover applies to IP SAN and NAS.
During the IP address failover, services are switched from the faulty port to an available
port, ensuring service continuity and improving the reliability of paths for accessing file
systems. Users are not aware of this process.
The essence of IP address failover is a service switchover between ports. The ports can be
Ethernet ports, bond ports, or VLAN ports.
Ethernet port–based IP address failover: To improve the reliability of paths for
accessing file systems, you can create logical ports based on Ethernet ports.
Figure 2-10 Ethernet port-based IP address failover
Host services are running on logical port A of Ethernet port A. The corresponding
IP address is "a". Ethernet port A fails and thereby cannot provide services. After
IP address failover is enabled, the storage system will automatically locate
available Ethernet port B, delete the configuration of logical port A that
corresponds to Ethernet port A, and create and configure logical port A on
Ethernet port B. In this way, host services are quickly switched to logical port A
on Ethernet port B. The service switchover is executed quickly. Users are not
aware of this process.
Bond port-based IP address failover: To improve the reliability of paths for accessing
file systems, you can bond multiple Ethernet ports to form a bond port. When an
Ethernet port that is used to create the bond port fails, services are still running on
the bond port. The IP address fails over only when all Ethernet ports that are used to
create the bond port fail.
Figure 2-11 Bond port-based IP address failover
Multiple Ethernet ports are bonded to form bond port A. Logical port A created
based on bond port A can provide high-speed data transmission. When both
Ethernet ports A and B fail due to various causes, the storage system will
automatically locate bond port B, delete logical port A, and create the same
logical port A on bond port B. In this way, services are switched from bond port
A to bond port B. After Ethernet ports A and B recover, services will be switched back to bond port A.
control the access permission of each device or port. Moreover, you can set traffic
isolation zones. When there are multiple ISLs (E_Ports), an ISL only transmits the traffic
destined for ports that reside in the same traffic isolation zone.
Arbitration plane: communicates with the HyperMetro quorum server. This plane is
planned only when the HyperMetro function is planned for the block service.
The key software components and their functions are described as follows:
FSM: a management process of Huawei distributed storage that provides operation
and maintenance (O&M) functions, such as alarm management, monitoring, log
management, and configuration. It is recommended that this module be deployed on
two nodes in active/standby mode.
Virtual Block Service (VBS): a process that provides the distributed storage access
point service through SCSI or iSCSI interfaces and enables application servers to
access distributed storage resources
Object Storage Device (OSD): a component of Huawei distributed storage for storing
user data in distributed clusters.
REP: data replication network
Enterprise Data Service (EDS): a component that processes I/O services sent from
VBS.
SSDs use NAND flash as a persistent storage medium. Compared with traditional HDDs, SSDs offer higher speeds, lower power consumption, and lower latency. They are also smaller, lighter, and shock-resistant.
High performance
All-SSD configuration boasts high IOPS and low latency.
Support for FlashLink® technologies, such as intelligent multi-core scheduling, efficient RAID, hot and cold data separation, and low latency.
High reliability
Component failure protection, dual-redundancy design, and active-active working
mode; SmartMatrix 3.0 full-mesh architecture for high efficiency, low latency, and
collaborative operations.
Dual-redundancy design, power-off protection, and coffer disk.
Advanced data protection technologies: HyperSnap, HyperReplication, HyperClone,
and HyperMetro.
RAID 2.0+ underlying virtualization.
High availability
Supports online replacement of components, such as controllers, power supplies,
interface modules, and disks.
Supports disk roaming, which enables the storage system to automatically identify
relocated disks and resume their services.
Centrally manages storage resources in third-party storage systems.
2.6.1.4 Components
The controller enclosure uses a modular design and consists of a system enclosure,
controllers (including fan modules), power modules, BBU modules, and disk modules.
centric strategies. Existing IT systems are under increasing pressure to improve. For
example, it takes several hours to process data and integrate data warehouses in the bill
and inventory systems of banks and large enterprises. As a result, services like operation
analysis and service queries cannot be obtained in a timely manner.
Solution:
Huawei's high-performance all-flash solution resolves these problems. High-end all-flash
storage systems are used to carry multiple core applications (services like the transaction
system database). The processing time is reduced by more than half, the response latency
is shortened, and the service efficiency is improved several times over.
4K/8K video, autonomous driving, and big data analytics are raising data storage demands.
Huawei OceanStor 100D is a scale-out distributed storage product that supports the business needs of both today and tomorrow. It provides elastic on-demand services powered by cloud infrastructure and carries both critical and emerging workloads.
Huawei OceanStor 100D provides ultra-large data storage resource pools
featuring on-demand resource provisioning and elastic capacity expansion in
virtualization and private cloud environments. It improves storage deployment,
expansion, and operation and maintenance (O&M) efficiency using general-purpose
servers. Typical scenarios include Internet-finance channel access clouds, development
and testing clouds, cloud-based services, B2B cloud resource pools in carriers' BOM
domains, and e-Government clouds.
Mission-critical database
Huawei OceanStor 100D delivers enterprise-grade capabilities, such as distributed active-
active storage and consistent low latency, to ensure efficient and stable running of data
warehouses and mission-critical databases, including online analytical processing (OLAP)
and online transaction processing (OLTP).
Big data analytics
OceanStor 100D provides an industry-leading decoupled storage-compute solution for
big data, which integrates traditional data silos and builds a unified big data resource
pool for enterprises. It also leverages enterprise-grade capabilities, such as elastic large-
ratio erasure coding (EC) and on-demand deployment and expansion of decoupled
compute and storage resources, to improve big data service efficiency and reduce TCO.
Typical scenarios include big data analytics for finance, carriers (log retention), and
governments.
Content storage and backup archiving
OceanStor 100D provides high-performance and highly reliable object storage resource
pools to meet large throughput, frequent access to hotspot data, as well as long-term
storage and online access requirements of real-time online services such as Internet data,
online audio/video, and enterprise web disks. Typical scenarios include storage, backup,
and archiving of financial electronic check images, audio and video recordings, medical
images, government and enterprise electronic documents, and Internet of Vehicles (IoV).
Capacity-on-write: Upon receiving a write request from a host, a thin LUN uses direct-on-time to check whether physical storage space has been allocated to the logical storage space addressed by the request. If not, a space allocation task is triggered, and space is allocated with the grain as the minimum granularity. Then data is written to the newly allocated physical storage space.
Direct-on-time: When capacity-on-write is used, the relationship between the actual storage area and the logical storage area of data is not calculated using a fixed formula, but is determined by mappings created on demand under the capacity-on-write principle. Therefore, when data is read from or written to a thin LUN, the read or write request must be redirected to the actual storage area based on the mapping between the actual storage area and the logical storage area.
Mapping table: records the mapping between actual storage areas and logical storage areas. The mapping table is dynamically updated during writes and queried during reads.
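The minimal Python sketch below illustrates how a mapping table can support capacity-on-write and direct-on-time. It is an assumption-based illustration only (the grain size and data structures are hypothetical, not the product's internal design): space is allocated at grain granularity on the first write, and reads are redirected through the mapping.

```python
GRAIN_SIZE = 64 * 1024  # hypothetical grain size in bytes

class ThinLUN:
    def __init__(self):
        self.mapping = {}        # mapping table: logical grain -> physical grain
        self.next_physical = 0   # next free grain in the physical storage pool

    def write(self, logical_offset: int) -> int:
        """Capacity-on-write: allocate physical space only on the first write."""
        grain = logical_offset // GRAIN_SIZE
        if grain not in self.mapping:
            self.mapping[grain] = self.next_physical
            self.next_physical += 1
        return self.mapping[grain]           # physical grain that receives the data

    def read(self, logical_offset: int):
        """Direct-on-time: redirect the request through the mapping table."""
        return self.mapping.get(logical_offset // GRAIN_SIZE)

lun = ThinLUN()
print(lun.write(0))            # allocates physical grain 0
print(lun.write(128 * 1024))   # allocates physical grain 1
print(lun.read(0))             # -> 0 (redirected via the mapping table)
```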
3.1.2 SmartTier
3.1.2.1 Overview
SmartTier is also called intelligent storage tiering. It provides the intelligent data storage
management function that automatically matches data to the storage media best suited
to that type of data by analyzing data activities.
SmartTier migrates hot data to storage media with high performance (such as SSDs) and
moves idle data to more cost-effective storage media (such as NL-SAS disks) with more
capacity. This provides hot data with quick response and high input/output operations
per second (IOPS), thereby improving the performance of the storage system.
The I/O monitoring module collects statistics on the access frequency of data blocks. The data analysis module ranks the activity levels of all
data blocks (in the same storage pool) in descending order and the hottest data blocks
are migrated first.
The data migration module implements data migration. SmartTier migrates data blocks
based on the rank and the migration policy. Data blocks with higher activity levels are migrated to higher tiers (usually the high-performance tier or performance tier), and data blocks with lower activity levels are migrated to lower tiers (usually the performance tier or capacity tier).
SmartTier needs extra space for data exchange when dynamically migrating data.
Therefore, a storage pool configured with SmartTier needs to reserve certain free space.
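The following Python sketch illustrates the ranking-and-migration idea only; the tier names, access counts, and capacity value are assumptions for illustration, not product parameters.

```python
def plan_migrations(block_heat: dict, ssd_block_capacity: int) -> dict:
    """Rank data blocks by access frequency (hottest first) and place the hottest
    blocks on the high-performance tier; the rest go to the capacity tier."""
    ranked = sorted(block_heat.items(), key=lambda kv: kv[1], reverse=True)
    return {block: ("high-performance (SSD)" if rank < ssd_block_capacity
                    else "capacity (NL-SAS)")
            for rank, (block, _count) in enumerate(ranked)}

# Hypothetical access counts collected during a monitoring period.
heat = {"blk0": 950, "blk1": 12, "blk2": 430, "blk3": 3}
print(plan_migrations(heat, ssd_block_capacity=2))
```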
3.1.3 SmartQoS
3.1.3.1 Overview
SmartQoS dynamically allocates storage resources to meet certain performance goals of
specified applications.
As storage technologies develop, a storage system is capable of providing larger
capacities. Accordingly, a growing number of users choose to deploy multiple applications
on one storage device. Different applications contend for system bandwidth and Input/Output Operations Per Second (IOPS) resources, compromising the performance of critical applications.
SmartQoS helps users properly use storage system resources and ensures high
performance of critical services.
SmartQoS enables users to set performance indicators like IOPS or bandwidth for certain
applications. The storage system dynamically allocates system resources to meet QoS
requirements of certain applications based on specified performance goals. It gives
priority to certain applications with demanding QoS requirements.
For example, if a user enables a SmartQoS policy for two LUNs in a storage system and
sets performance objectives in the SmartQoS policy, the storage system can limit the
system resources allocated to the LUNs to reserve more resources for high-priority LUNs.
The performance goals controlled by SmartQoS are bandwidth, IOPS, and latency.
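One generic way to cap the IOPS of a lower-priority LUN is a token bucket. The sketch below illustrates that idea only and is not the product's internal algorithm; the limit value is a hypothetical example.

```python
import time

class IOPSLimiter:
    """Simple token bucket: allow at most `iops_limit` I/Os per second."""
    def __init__(self, iops_limit: int):
        self.iops_limit = iops_limit
        self.tokens = float(iops_limit)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to the elapsed time, up to the limit.
        self.tokens = min(self.iops_limit,
                          self.tokens + (now - self.last) * self.iops_limit)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # this I/O must wait, leaving resources for higher-priority LUNs

limiter = IOPSLimiter(iops_limit=500)   # hypothetical cap for a low-priority LUN
print(limiter.allow())                  # True until the budget for this second is used up
```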
3.1.4 SmartDedupe
3.1.4.1 Overview
SmartDedupe eliminates redundant data from a storage system. This deduplication
technology reduces the amount of physical storage capacity occupied by data to release
more storage capacity for increasing services.
Huawei OceanStor storage systems provide inline deduplication to deduplicate the data
that is newly written into the storage systems.
Inline deduplication deletes duplicate data before the data is written into disks.
Similarity-based deduplication analyzes data that has already been written to disks, uses similarity fingerprints to identify duplicate and similar data blocks, and then deduplicates them.
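Conceptually, inline deduplication fingerprints each incoming block and stores only blocks whose fingerprints have not been seen before. The Python sketch below is a minimal illustration of that idea, not the product implementation; SHA-256 is used here only as an example fingerprint function.

```python
import hashlib

fingerprint_library = {}   # fingerprint -> location (index) of the stored block
stored_blocks = []         # stands in for the physical storage pool

def write_block(data: bytes) -> int:
    """Return the location of the block, storing it only if it is new."""
    fp = hashlib.sha256(data).hexdigest()     # fingerprint of the incoming block
    if fp in fingerprint_library:
        return fingerprint_library[fp]        # duplicate: reference the existing block
    stored_blocks.append(data)
    fingerprint_library[fp] = len(stored_blocks) - 1
    return fingerprint_library[fp]

print(write_block(b"A" * 4096))   # 0: new block is stored
print(write_block(b"A" * 4096))   # 0: duplicate, no additional space consumed
print(write_block(b"B" * 4096))   # 1: new block is stored
```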
3.1.5 SmartCompression
3.1.5.1 Overview
SmartCompression reorganizes data to save space and improves the data transfer,
processing, and storage efficiency without losing any data. OceanStor Dorado V6 series
storage systems support inline compression and post-process compression.
Inline compression: The system deduplicates and compresses data before writing it to
disks. User data is processed in real time.
Post-process compression: Data is written to disks in advance and then read and
compressed when the system is idle.
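The principle of lossless compression can be shown with a generic example: the data is reorganized into a smaller form and can always be restored exactly. The sketch below uses Python's zlib purely for illustration; it does not represent the compression algorithm actually used by the storage system.

```python
import zlib

original = b"storage " * 512             # highly repetitive sample data
compressed = zlib.compress(original)     # inline or post-process: same principle
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed))   # compressed form is much smaller
assert restored == original                   # no data is lost
```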
3.1.6 SmartMigration
3.1.6.1 Overview
SmartMigration is a key service migration technology. Service data can be migrated
within a storage system and between different storage systems.
"Consistent" means that after the service migration is complete, all of the service data
has been replicated from a source LUN to a target LUN.
Service data synchronization between a source LUN and a target LUN includes initial
synchronization and data change synchronization. The two synchronization modes are
independent and can be performed at the same time to ensure that service data changes
on the host are synchronized to the source LUN and the target LUN.
the new data space. As shown in the slide, L0 to L4 are logical addresses, P0 to P8 are
physical addresses, and A to I are data.
3.2.2 HyperClone
3.2.2.1 Overview
HyperClone allows you to obtain full copies of LUNs without interrupting host services.
These copies can be used for data backup and restoration, data reproduction, and data
analysis.
3.2.2.3 Synchronization
When a HyperClone pair starts synchronization, the system generates an instant snapshot
for the source LUN, synchronizes the snapshot data to the target LUN, and records
subsequent write operations in a differential table.
When synchronization is performed again, the system compares the data of the source
and target LUNs, and only synchronizes the differential data to the target LUN. The data
written to the target LUN between the two synchronizations will be overwritten. Before
synchronization, users can create a snapshot for a target LUN to retain its data changes.
Relevant concepts:
1. Pair: In HyperClone, a pair has one source LUN and one target LUN. A pair is a
mirror relationship between the source and target LUNs. A source LUN can form
multiple HyperClone pairs with different target LUNs. A target LUN can be added to
only one HyperClone pair.
2. Synchronization: Data is copied from a source LUN to a target LUN.
3. Reverse synchronization: If data on the source LUN needs to be restored, you can
reversely synchronize data from the target LUN to the source LUN.
4. Differential copy: The differential data can be synchronized from the source LUN to
the target LUN based on the differential bitmap.
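The differential-table idea can be pictured with the minimal Python sketch below; the LUN and block structures are illustrative assumptions, not the product's data layout.

```python
def synchronize(source: list, target: list, differential: set) -> None:
    """Copy only the blocks recorded in the differential table to the target LUN."""
    for block in differential:
        target[block] = source[block]
    differential.clear()

# Hypothetical LUNs modelled as lists of blocks.
source_lun = ["a", "b", "c", "d"]
target_lun = ["a", "b", "c", "d"]   # state right after the initial synchronization
diff_table = set()

source_lun[1] = "B"                 # a host write is recorded in the differential table
diff_table.add(1)

synchronize(source_lun, target_lun, diff_table)
print(target_lun)                   # ['a', 'B', 'c', 'd']
```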
3.2.3 HyperReplication
3.2.3.1 Overview
As digitization advances in various industries, data has become critical to the efficient
operation of enterprises, and users impose increasingly demanding requirements on the
stability of storage systems. Although many enterprises have highly stable storage
systems, it is a big challenge for them to ensure data restoration from damage caused by
natural disasters. To ensure continuity, recoverability, and high availability of service data,
remote DR solutions emerge. The HyperReplication technology is one of the key
technologies used in remote DR solutions.
HyperReplication is Huawei's remote replication feature. It provides a flexible and powerful data replication function that facilitates remote data backup and restoration, continuous support for service data, and disaster recovery.
A primary site is a production center that includes primary storage systems, application
servers, and links.
A secondary site is a backup center that includes secondary storage systems, application
servers, and links.
HyperReplication supports the following two modes:
Synchronous remote replication between LUNs: Data is synchronized between
primary and secondary LUNs in real time. No data is lost when a disaster occurs.
However, the performance of production services is affected by the latency of the
data transmission between primary and secondary LUNs.
Asynchronous remote replication between LUNs: Data is periodically
synchronized between primary and secondary LUNs. The performance of production
services is not affected by the latency of the data transmission between primary and
secondary LUNs. If a disaster occurs, some data may be lost.
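The trade-off between the two modes can be summarized in the small sketch below; the helper names and structures are assumptions for illustration, and real replication runs inside the storage controllers rather than on the host.

```python
def sync_write(primary: list, secondary: list, addr: int, data: str) -> str:
    """Synchronous mode: data reaches both LUNs before the host is acknowledged."""
    primary[addr] = data
    secondary[addr] = data        # no data loss in a disaster, but the replication
    return "ack"                  # link latency is added to every host write

def async_write(primary: list, pending: dict, addr: int, data: str) -> str:
    """Asynchronous mode: the host is acknowledged immediately."""
    primary[addr] = data
    pending[addr] = data          # change recorded for the next replication period
    return "ack"

def replicate_period(secondary: list, pending: dict) -> None:
    """Periodic synchronization; writes since the last period may be lost in a disaster."""
    for addr, data in pending.items():
        secondary[addr] = data
    pending.clear()

primary_lun, secondary_lun, pending = ["-"] * 4, ["-"] * 4, {}
async_write(primary_lun, pending, 2, "X")
replicate_period(secondary_lun, pending)
print(secondary_lun)              # ['-', '-', 'X', '-']
```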
3.2.3.4 Phases
HyperReplication involves the following phases: creating a HyperReplication relationship,
synchronizing data, switching over services, and restoring data.
1. Create a HyperReplication pair.
2. Synchronize all data manually or automatically from the primary LUN to the
secondary LUN of the HyperReplication pair. In addition, periodically synchronize
incremental data on the primary LUN to the secondary LUN.
3. Check the data status of the HyperReplication pair and the read/write properties of
the secondary LUN to determine whether a primary/secondary switchover can be performed.
3.2.4 HyperMetro
3.2.4.1 Overview
HyperMetro is Huawei's active-active storage solution. Two DCs enabled with
HyperMetro back up each other and both are carrying services. If a device is faulty in a
DC or if the entire center is faulty, the other DC will automatically take over services,
solving the switchover problems of traditional DR centers. This ensures high data
reliability and service continuity, and improves the resource utilization of the storage
system.
The data dual-write technology ensures storage redundancy. No data is lost if there
is only one storage system running or the production center fails. Services are
switched over quickly, maximizing customer service continuity. This solution meets
the service requirements of RTO = 0 and RPO = 0.
HyperMetro and SmartVirtualization can be used together to support heterogeneous
storage and consolidate resources on the network layer to protect the existing
investment of the customer.
This solution can be smoothly upgraded to the 3DC solution with HyperReplication.
Based on the preceding features, the HyperMetro solution can be widely used in
industries such as healthcare, finance, and social security.
offline mode. Therefore, archiving helps reduce costs and facilitate media storage. A file
archiving system can also store files based on file attributes. These attributes can be
author, modification date, or other customized tags. An archiving system stores files
together with their metadata and attributes. In addition, an archiving system provides the
data compression function. In conclusion, archiving involves storing backup data that will
no longer be frequently accessed or updated in offline mode for a long term, and
attaching the "archived" tag to these data according to specific attributes for future
search.
Generally, backup refers to data backup or system backup, while DR refers to data
backup or application backup across equipment rooms. Backup is implemented using
backup software, whereas DR is implemented using replication or mirroring software. The
differences between the two are as follows:
1. DR is designed for protecting data against natural disasters, such as fires and
earthquakes. Therefore, a backup center must be set in a place which is away from
the production center at a certain distance. In contrast, data backup is performed
within a data center.
2. A DR system not only protects data but also guarantees business continuity. In
contrast, data backup only focuses on data security.
3. DR protects data integrity. In contrast, backup can only help recover data from a
point in time when a backup task is performed.
4. DR is performed in online mode while backup is performed in offline mode.
5. Data at the two sites of a DR system is kept consistent in real time, whereas backup data lags behind the production data.
6. When a fault occurs, a DR switchover process in a DR system lasts seconds to
minutes, while a backup system takes hours and maybe even dozens of hours to
recover data.
Backup and archiving systems are designed to protect data in different ways and the
combination of the two systems will provide more effective data protection. Backup is
designed to protect data by storing data copies. Archiving is designed to protect data by
organizing and storing data for a long term in a data management manner. In other
words, backup can be considered as short-term retention of data copies, while archiving
can be considered as long-term retention of files. In practice, we do not delete an original
copy after it is backed up. However, it will be fine if we delete an original copy after it is
archived, as we might no longer need to access it swiftly. Backup and archiving work
together to better protect data.
4.1.2 Architecture
4.1.2.1 Components
A backup system typically consists of three components: backup software, backup media,
and backup server.
The backup software is used to create backup policies, manage media, and provide additional functions. It is the core of a backup system and is used for creating
and managing copies of production data stored on storage media. Some backup software
can be upgraded with more functions, such as protection, backup, archiving, and
recovery.
Backup media include tape libraries, disk arrays, and virtual tape libraries. A virtual tape
library is essentially a disk array, but it can virtualize a disk storage into a tape library.
Compared with physical tape libraries, virtual tape libraries remain compatible with tape backup management software and conventional backup processes while greatly improving availability and reliability.
The backup server provides services for executing backup policies. The backup software
resides and runs on the backup server. Generally, a backup software client agent needs to
be installed on the service host to be backed up.
Three elements of a backup system are Backup Window (BW), Recovery Point Objective
(RPO), and Recovery Time Objective (RTO).
BW indicates a duration of time allowed for backing up the service data in a service
system without affecting the normal operation of the service system.
RPO ensures that the latest possible backup data is used for a DR switchover and indicates the maximum amount of data that may be lost. A smaller RPO means less data loss.
RTO refers to an acceptable duration of time and a service level within which a business
process must be restored, in order to minimize the impact of interruption on services.
Cloud backup involves backing up data from a local production center to a data center (a
central data center of an enterprise or a data center provided by a service provider) using
a standard network protocol over a WAN. Cloud backup is based on services, accessible
anywhere, flexible, secure, and can be shared and used on demand. Cloud backup
emerges as a brand new backup service based on broadband Internet and large storage
capacities. In conclusion, cloud backup provides data storage and backup services by
leveraging a variety of functions, such as cluster applications, grid technologies, and
distributed file systems, and integrating a variety of storage devices across the network
through application software.
application server and transmits the data to the backup media. The backup operation is
complete.
Strengths:
Backup data is transmitted without using LAN resources, significantly improving
backup performance while maintaining high network performance.
Weaknesses:
Backup agents adversely affect the performance of application servers.
LAN-Free backup requires a high budget.
Devices must meet certain requirements.
Server-Free
Server-Free backup has many strengths similar to those of LAN-Free backup. The source
device, target device, and SAN device are main components of the backup data channel.
The server is still involved in the backup process, but its workload is much lighter: it no longer serves as the main backup data channel and, like a traffic officer, only issues commands instead of carrying the data itself. Control flows are transmitted over the LAN, but data flows are not.
Direction of backup data flows: Backup data is transmitted over an independent network
without passing through a production server.
Strengths:
Backup data flows do not consume LAN resources and do not affect network
performance.
Services running on hosts remain nearly unaffected.
Backup performance is excellent.
Weaknesses:
Server-Free backup requires a high budget.
Devices must meet strict requirements.
Server-Less
Server-Less backup uses the Network Data Management Protocol (NDMP). NDMP is a
standard network backup protocol. It supports communications between intelligent data
storage devices, tape libraries, and backup applications. After a server sends an NDMP
command to a storage device that supports the NDMP protocol, the storage device can
directly send the data to other devices without passing through a host.
4.1.4.2 Deduplication
Digital transformations of enterprises have intensified the explosive growth of service
data. The total amount of backup data that needs to be protected is also increasing
sharply. In addition, more and more duplicate data is being generated from backup and
archiving operations. Mass redundant data consumes a lot of storage and bandwidth
resources and leads to issues like long backup windows, which further affect the
availability of service systems.
Huawei Data Protection Appliance supports source-side and parallel deduplication.
Deduplication is performed before backup data is transmitted to storage media, greatly
improving backup performance.
Source-Side Deduplication
Data or files are sliced using an intelligent content-based deduplication algorithm. Fingerprints are then created for the data blocks by hashing and are queried against the fingerprint libraries. If identical fingerprints exist, the corresponding blocks are already stored on the media servers, so the existing blocks are referenced instead of being transferred again. This preserves backup capacity and bandwidth and streamlines data transfer and storage.
Technical principles:
1. Creates a fingerprint for a data block by hashing.
2. Queries whether the fingerprint exists in the fingerprint library of the Data Protection
Appliance. If yes, it indicates that this data block is duplicate and does not need to be
sent to the Data Protection Appliance. If no, the data block will be sent to the Data
Protection Appliance and written to the backup storage pool. Then, the fingerprint of
this data block is recorded in the deduplication fingerprint library.
Parallel Deduplication
Most conventional deduplication modes are based on a single node and are prone to
inefficient data access, poor processing performance, and insufficient storage capacity in
the era of big data.
Huawei Data Protection Appliance uses the parallel deduplication technology by building
a deduplication fingerprint library on multiple nodes and distributing fingerprints on
multiple nodes in parallel. This effectively resolves the performance and storage capacity
problems in single-node solutions.
Technical principles:
After fingerprints are calculated for data blocks, the system uses the grouping algorithm
to locate specific server nodes. Different fingerprints are evenly distributed on different
nodes. In this way, the system queries whether these fingerprints exist on different server
nodes, for parallel deduplication.
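A simple way to picture the grouping algorithm is to hash each fingerprint to a node; the modulo scheme and node names below are illustrative assumptions, not the appliance's actual algorithm.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # hypothetical deduplication server nodes

def owner_node(fingerprint: str) -> str:
    """Map a fingerprint to the node whose fingerprint library should hold it."""
    return NODES[int(fingerprint, 16) % len(NODES)]

fp = hashlib.sha256(b"sample data block").hexdigest()
print(fp[:16], "->", owner_node(fp))     # fingerprints spread evenly across the nodes
```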
With fingerprint libraries, recycled data can be stored sequentially in the same space. This reduces the time required to query all fingerprints during each global deduplication, maximizes the effect of the storage read cache, minimizes disk seeks caused by random reads, and improves recovery efficiency.
Figure 4-1
1. Reads data to be protected through the backup client (agent client). Based on
different applications, the agent client can be deployed on the production server
(agent-based backup) or can be the agent client built in Huawei Data Protection
Appliance (agent-free backup).
2. Reads data from a production system to the Data Protection Appliance over the
network (TCP).
3. The Data Protection Appliance receives data and saves it to the backup storage.
For different backup modes, such as full backup, incremental backup, permanent
incremental backup, and differential backup, data is read and transmitted in different
ways. All data is transmitted or only unique data is transmitted with deduplication.
When remote DR is required, remote replication allows replication of backup data to
remote data centers.
Continuous Backup
Continuous backup is a process of continuously backing up data on production hosts to
backup media. Continuous backup is based on the block-level continuous data protection
technology. A backup agent client is installed on production hosts. Data on production
hosts is continuously backed up to the snapshot storage pool of the internal storage
system of the Data Protection Appliance and is stored in the native format. After certain
conditions are met, snapshots are created in the snapshot storage pool to manage data
at multiple points in time.
Figure 4-2
1. The snapshot storage pool allocates the base volume.
2. The agent client for continuous backup connects to the server of the Data Protection
Appliance.
3. The bypass monitoring drive in a partition of the production host continuously
captures data changes and caches the same data changes to the memory pool.
4. The agent client for continuous backup continuously transfers data to a storage
device in the snapshot storage pool of the Data Protection Appliance.
5. Source data in the partition of the production host is written to the base volume.
6. Data changes on the production host are written to the log volume first, and then
are written to the base volume storing the source data.
7. Snapshots of the base volume are managed based on the data retention policy for
continuous backup.
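The base-volume-plus-log idea behind continuous backup can be pictured with the highly simplified Python sketch below; the structures and helper names are illustrative assumptions only.

```python
base_volume = {0: "A", 1: "B", 2: "C"}    # initial copy of the protected partition
log_volume = []                           # continuously captured data changes
snapshots = []                            # point-in-time views of the base volume

def capture_change(block: int, data: str) -> None:
    """The monitoring driver records every change in the log volume first."""
    log_volume.append((block, data))

def apply_log_and_snapshot() -> None:
    """Replay logged changes into the base volume, then retain a snapshot."""
    while log_volume:
        block, data = log_volume.pop(0)
        base_volume[block] = data
    snapshots.append(dict(base_volume))   # retained per the continuous backup policy

capture_change(1, "B1")
apply_log_and_snapshot()
print(snapshots[-1])                      # {0: 'A', 1: 'B1', 2: 'C'}
```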
Advanced Backup
The advanced backup function of Huawei Data Protection Appliance effectively combines
years of experience in backup and DR and the independently developed copy data
storage system to ensure application data consistency. The advanced backup function
helps implement policy-based automation and DR, provide automation tools for
developers, support heterogeneous production storage, and implement real copy data
management.
Working principles:
Capture of production data: Data is captured in the native format. Format conversion is
not required. Data is accessible upon being mounted. SLA policy can be customized based
on applications. Retention duration, RPO, RTO, and data storage locations are intuitively
displayed.
Copy Management
Permanent incremental backup: Initial full backup and N incremental backups are
performed. A full copy is generated at each incremental backup point in time. Damages
to a copy at an incremental backup point in time will not impede recovery from any
other point in time.
No rollback: Point-in-time copies created through virtual cloning can reference both the source data and the current incremental data and can be used directly for recovery.
Copy Access and Use
No data movement: Data is mounted in minutes, and data volume does not affect
recovery efficiency.
A virtual copy can be mounted to multiple hosts.
Data can be recovered from any point in time.
A host automatically takes over the original production applications after the virtual copy
is mounted.
4.1.5 Applications
Databases
Databases are critical service applications in production systems. The native backup
function of databases relies on complicated manual operations. In addition, various
databases on different platforms need protection, which requires a broad compatibility of
backup products.
The Data Protection Appliance provides a graphical wizard. Users do not need to
manually execute backup and restoration scripts, which simplifies backup and recovery
operations. The database backup process is as follows:
Install a backup client agent on the production server to be protected and connect the
client agent to the management console. The backup client agent identifies the database
data on the production server, reads the files and data from the production server
through the backup API, and transfers the same files and data to the storage media of
the Data Protection Appliance to complete the backup. The management console of the
Data Protection Appliance sends control information to the client and the Data
Protection Appliance server and accordingly, manages the execution of a backup task.
The backup process: The backup client agent invokes the backup API of the database to read its data, performs deduplication or encryption, and then sends the data to the Data Protection Appliance to complete the backup.
The recovery process: The management console sends a recovery command to the backup client agent on the production server. The backup client agent reads data from the backup server and passes it to the recovery API of the database to complete the recovery.
The Data Protection Appliance connects to a database through a dedicated backup API, which varies with the database. For example, Oracle uses the RMAN interface and SQL Server uses the VDI interface.
Virtualization Platforms
The popularization of virtualization has increased the confidence of enterprises in storing
their core data in a virtual environment. Therefore, enterprises are in urgent need of data
protection in a virtual environment, in particular, data backup and recovery efficiency is a
major concern.
According to statistics from international authorities, in 2004 the direct financial loss resulting from natural and human-induced disasters reached 123 billion US dollars worldwide.
In 2005, 400 catastrophes occurred worldwide and caused losses of more than 230 billion
US dollars.
In 2006, the financial loss caused directly by natural and human-induced disasters was
lower than expected at 48 billion US dollars.
The occurrence rate of natural disasters that can be measured was three times greater in
the 1990s than the 1960s, while the financial loss was nine times greater.
The huge losses caused by small-probability natural disasters cannot be ignored.
According to IDC, among the companies that experienced disasters in the ten years
before 2000, 55% collapsed when the disasters occurred, 29% collapsed within 2 years
after the disasters due to data loss, and only 16% survived.
High availability (HA) ensures that applications remain accessible when a single component of the local system fails, whether the fault lies in a service process, a physical facility, or IT software or hardware.
The best HA is when a machine in the data center breaks down, but the users using the
data center service are unaware of it. However, if a server in a data center breaks down,
it takes some time for services running on the server to fail over. As a result, customers
will be aware of the failure.
The key indicator of HA is availability. Its calculation formula is Availability = Uptime/(Uptime + Downtime) = 1 - Downtime/(Uptime + Downtime). Availability is usually expressed in "nines", each level corresponding to a maximum allowed downtime per year:
4 nines: 99.99% availability = 0.01% x 365 x 24 x 60 minutes ≈ 52.56 minutes of downtime per year
5 nines: 99.999% availability = 0.001% x 365 x 24 x 60 minutes ≈ 5.26 minutes of downtime per year
6 nines: 99.9999% availability = 0.0001% x 365 x 24 x 60 x 60 seconds ≈ 31.5 seconds of downtime per year
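The downtime figures above follow directly from the availability formula; the short Python sketch below recomputes them for any number of nines.

```python
def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year (in minutes) for a given availability level."""
    minutes_per_year = 365 * 24 * 60
    return (1 - availability) * minutes_per_year

for label, a in [("4 nines", 0.9999), ("5 nines", 0.99999), ("6 nines", 0.999999)]:
    print(f"{label}: {downtime_minutes_per_year(a):.2f} minutes/year")
# 4 nines: 52.56, 5 nines: 5.26, 6 nines: 0.53 minutes (about 31.5 seconds)
```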
For HA, shared storage is usually used. In this case, RPO = 0. In addition, the active/active
HA mode is used to ensure that the RTO is almost 0. If the active/passive HA mode is
used, the RTO needs to be reduced to the minimum.
HA requires redundant servers to form a cluster to run applications and services. HA can
be categorized into the following types:
Active/Passive HA:
A cluster consists of only two nodes (active and standby nodes). In this configuration, the
system uses the active and standby machines to provide services. The system provides
services only on the active device.
When the active device is faulty, the services on the standby device are started to replace
the services provided by the active device.
Typically, cluster resource manager (CRM) software such as Pacemaker is used to control the switchover between the active and standby devices and to provide a virtual IP address for services.
Active/Active HA:
If a cluster consists of only two active nodes, it is called active-active. If the cluster has
multiple nodes, it is called multi-active.
In this configuration, the system runs the same load on all servers in the cluster.
Take the database as an example. The update of an instance will be synchronized to all
instances.
In this configuration, load balancing software, such as HAProxy, is used to provide virtual
IP addresses for services.
Pacemaker is a cluster manager. It uses the messaging and membership capabilities provided by the underlying cluster infrastructure (OpenAIS or Heartbeat) to detect node and resource failures and recover from them, achieving high availability of cluster services (also called resources).
HAProxy is free, open-source software written in C that provides high availability, load balancing, and proxying for TCP- and HTTP-based applications. HAProxy is especially suitable for heavily loaded websites that require session persistence.
A disaster is an unexpected event (caused by human errors or natural factors) that
results in severe faults or breakdown of the system in one data center. In this case,
services may be interrupted or become unacceptable. If the system unavailability reaches
a certain level at a specific time, the system must be switched to the standby site.
Disaster recovery (DR) refers to the capability of recovering data, applications, and
services in data centers at different locations when the production center is damaged by
a disaster.
In addition to the production site, a redundancy site is set up. When a disaster occurs and
the production site is damaged, the redundancy site can take over services from the
production site to ensure service continuity. To achieve higher availability, many users
even set up multiple redundant sites.
Main indicators for measuring a DR system
Recovery Point Objective (RPO) indicates the maximum amount of data that can be lost
when a disaster occurs.
Recovery Time Objective (RTO) indicates the time required for system recovery.
The smaller the RPO and RTO, the higher the system availability, and the larger the
investment for users.
The data-level disaster recovery focuses on protecting the data from loss or damage after
a disaster occurs. Low-level data-level disaster recovery can be implemented by manually
saving backup data to a remote place. For example, periodically transporting backup
tapes to a remote place is one of the methods. The advanced data disaster recovery
solution uses the network-based data replication tool to implement asynchronous or
synchronous data transmission between the production center and the disaster recovery
center. For example, the data replication function based on disk arrays is used.
Application-level DR creates hosts and applications in the DR site based on the data-level
DR. The support system consists of the data backup system, standby data processing
system, and standby network system. Application-level DR provides the application
takeover capability. That is, when the production center is faulty, applications can be
taken over by the DR center to minimize the system downtime and improve service
continuity.
SHARE, an IT information organization initiated by IBM in 1955, released the disaster
recovery standard SHARE 78 at the 78th conference in 1992. SHARE 78 has been widely
recognized in the world.
SHARE 78 divides disaster recovery into eight levels:
Backup or recovery scope
Status of a disaster recovery plan
Distance between the application location and the backup location
Connection between the application location and backup location
Transmission between the two locations
Data allowed to be lost
Backup data update
Ability of a backup location to start a backup job
The definition of remote disaster recovery is classified into seven levels:
Backup and recovery of local data
Access mode of batch storage and read
Access mode of batch storage and read + hot backup location
Network connection
Backup location of the working status
Dual online storage
Zero data loss
In addition, ISO 27001 released by International Organization for Standardization (ISO)
requires that related data and files be stored for at least one to five years.
consulting services for customers' service systems to ensure service continuity and data
protection.
Local HA solution: ensures high availability of key services in the data center and
prevents service interruption and data loss caused by single-component faults.
Active-passive DR solution: intra-city and remote DR are supported. When a disaster
occurs, services in the DR center can be quickly recovered and provide services for
external systems.
Active-active data center solution: In intra-city DR, the load of a critical service is balanced between two data centers, ensuring zero service interruption and zero data loss when a data center malfunctions.
Geo-redundant DR solution: defends against data center-level disasters and regional
disasters and provides higher service continuity for mission-critical services. Generally, the
intra-city active/standby + remote active/standby solution or intra-city active-active +
remote active/standby solution is used.
Currently, data centers can work in either active-passive mode or active-active mode.
In active-passive mode, some services are processed mainly in data center A with hot standby in data center B, and other services are processed mainly in data center B with hot standby in data center A, achieving an approximately active-active effect.
In active-active mode, all I/O paths can access active-active LUNs to achieve load
balancing and seamless failover.
Huawei's active-active data center solution adopts an active-active architecture and combines the industry-leading HyperMetro functions with web, database cluster, load balancing, transmission, and network components to provide customers with an end-to-end active-active data center solution within 100 km. Even if a device or an entire data center fails, services are not affected and are switched over automatically, ensuring service continuity.
In addition, professional and diversified tools are used to quickly collect and analyze
project information, design and implement solutions, and customize and deliver the most
appropriate professional service solutions for customers.
The primary site writes the data involved in the write request into time segment TPN+1
in the cache of LUN A and immediately returns the write complete response to the host.
During data synchronization, the system reads data generated in the previous replication
period in time segment TPN in the cache of LUN A, transmits the data to the standby
site, and writes the data into time segment TPX+1 in the cache of LUN B. If the usage of
LUN A's cache reaches a certain threshold, the system automatically writes data into
disks. In this case, a snapshot is generated on disks for the data in time segment TPN.
During data synchronization, the system reads the data from the snapshot on disks and
replicates the data to LUN B.
After the data synchronization is complete, the system writes data in time segments TPN
and TPX+1 separately in the caches of LUN A and LUN B into disks based on the disk
flushing policy (snapshots are automatically deleted), and waits for the next replication
period.
Switchover:
A primary/secondary switchover can be performed for a synchronous remote replication
pair when the pair is in the normal state.
In the split state, a primary/secondary switchover can be performed only after the
secondary LUN is set to writable.
For an asynchronous remote replication pair, a primary/secondary switchover requires the pair to be in the split state and the secondary LUN to be set to writable.
NAS Asynchronous Replication Principle
At the beginning of each period, the file system asynchronous remote replication creates
a snapshot for the primary file system. Based on the incremental information generated
from the time when the replication in the previous period is complete to the time when
the current period starts, the file system asynchronous remote replication reads the
snapshot data and replicates the data to the secondary file system. After the incremental
replication is complete, the data in the secondary file system is the same as that in the
primary file system, data consistency points are formed on the secondary file system.
Remote replication between file systems is supported. Directory-to-directory or file-to-file
replication mode is not supported.
A file system can be included in only one replication task, but a replication task can
contain multiple file systems.
File systems support only one-to-one replication. A file system cannot serve as the
replication source and destination at the same time. Cascading replication and 3DC are
not supported.
The minimum unit of incremental replication is the file system block size (4 KB to 64 KB).
The minimum synchronization period of asynchronous replication is 5 minutes.
Resumable transfer is supported after an interruption.
The customer has a vSphere virtual data center and wants to build a new data center for
DR.
Low TCO and high return on investment (ROI)
Huawei's Solution
An IT system, including storage devices, services, networks, and virtualization platforms, is
deployed in the DR center.
Install Huawei UltraVR DR component in the production center and DR center.
ConsistentAgent is installed on host machines of VMs to implement application-level
protection for VMs.
Customer Benefits
No need for reconstruction of the live network architecture
Flexible configuration of DR policies and one-click recovery
DR rehearsal and switchback
Traditional IT only plays a supporting role, and now IT is a type of service. To achieve the
goals of reducing costs, increasing productivity, and improving service quality, ITIL has set
off a frenzy around the world. Many famous multinational companies, such as IBM, HP,
Microsoft, P&G, and HSBC are active practitioners of ITIL. As the industry is gradually
changing from technology-oriented to service-oriented, enterprises' requirements for IT
service management are also increasing, which greatly helps standardize IT processes,
keep IT processes aligned with the business, and improve processing efficiency.
ITIL has strong support from the UK, other countries in Europe, North America, New Zealand, and Australia. Whether an enterprise has adopted ITIL is regarded as a key indicator for determining whether a supplier or outsourcing service contractor is qualified for bidding.
Information Collection
The information to be collected includes basic information, fault information, storage
device information, networking information, and application server information.
Customer information: provides the contact person and contact details.
Time when a fault occurs: records the time when the fault occurred.
Hardware module configuration: records the configuration information about the hardware of the storage device.