Reliability Technical White Paper
Unless otherwise stated, the copyright and other related rights of this document belong to Sangfor. Without Sangfor's written consent, no person shall in any manner or form copy, extract, back up, modify, distribute, or translate into another language any part of this document. Sangfor reserves the final interpretation of this document and this statement, and the right to amend them.
Disclaimer
This document is for informational purposes only and is subject to change
without notice.
Sangfor Technologies Inc. has made every effort to ensure that its contents are
accurate and reliable at the time of writing this document, but Sangfor is not liable
for any loss or damage caused by omissions, inaccuracies or errors in this document.
Contact us
Service hotline: +60 12711 7129 (7511)
Sangfor aCloud converges computing, storage, networking, and other components to form a unified resource pool, reducing data center complexity. Sangfor rigorously controls product quality, striving to create a minimal, stable, and reliable high-performance hyper-converged solution.
Sangfor aCloud is a software-centric platform whose architecture is designed throughout for multi-level reliability.
To reduce single-point failure risk, a master control mode is used as the access point to manage the cluster: the platform automatically elects the master controller through an election algorithm. If the host of the master node fails, the platform automatically elects a new master node to preserve the stability and accessibility of the cluster. During the master node switchover, cluster data remains available because it is stored as multiple copies in the cluster file system, so the failure of any single node does not cause management data loss.
➢ Controller: provides management and control services for the entire cluster, such as user management and authentication, resource alarms, and backup. Only one master controller is active at a time; the controllers on the other nodes remain in Standby status.
The aCloud HCI solution has four network planes. Each network plane is redundantly connected, so the failure of a single link does not affect the stability of the hyper-converged management platform.
Business network: used for normal service access and publishing. The business network can implement link redundancy through dual-switch aggregation; static binding can be configured on the network ports serving as the service egress, and multiple service egresses can be set for virtual machines to select in the virtual network, ensuring high reliability of the business network.
The distributed virtual switch has a virtual switch instance on every host in the cluster. When one of the hosts goes offline, the traffic that passed through the virtual switch instance on that host is redirected and taken over by other hosts through virtual routing and virtual machine HA.
Storage network: carries the IO traffic of network-based data storage. It is set up as a private network to protect data security. No static binding or link aggregation needs to be configured on the switches: the aCloud platform implements the link aggregation function at the software level, so aSAN private network links remain redundant.
The platform forcibly retains the basic computing and RAM resources required for the platform itself to run, preventing too many system resources from being diverted by virtual machines; it reserves the required system resources according to the functional components that are enabled. To be able to resume services in the event of a host failure, aCloud also provides an HA resource reservation: this part of the resources is not allocated under normal circumstances and may be allocated only when a host fails and the HA mechanism kicks in. This prevents the entire aCloud platform from being invalidated after resources are over-utilized. For the HA mechanism, see "Chapter 3.2 Virtual Machine High Availability (HA)".
Users can monitor the running status of the platform and customize key indicators for intelligent monitoring and rapid fault location:
➢ Monitors key information such as virtual machine CPU, memory, IO, and internal status.
➢ Provides various alarm modes such as syslog, SNMP trap, email, and SMS, so users can learn of faults in a timely manner.
2.6. Watchdog
A system process may crash or deadlock due to an unknown error and stop providing external services. Such a process must be detected and recovered promptly.
To this end, a separate Watchdog daemon runs in the background of aCloud with the highest priority and monitors all aCloud system processes. Once a system process crashes or deadlocks, Watchdog forcibly intervenes to restart the process, resume business operations, and record status information about the fault for later analysis.
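As an illustration of this mechanism, the following minimal Python sketch shows a supervisor loop of this kind. The service names, liveness check, and restart command are assumptions for the example; the actual Watchdog implementation is not published.

    import subprocess
    import time

    # Hypothetical names for the system services being supervised.
    MONITORED_SERVICES = ["cluster-controller", "vm-manager", "storage-agent"]

    def is_alive(service: str) -> bool:
        # Liveness via pidof; a real watchdog would also probe a heartbeat
        # endpoint to catch deadlocked (running but unresponsive) processes.
        return subprocess.run(["pidof", service], capture_output=True).returncode == 0

    def restart(service: str) -> None:
        # Forcibly restart the faulty process and record the event for analysis.
        print(f"[watchdog] {service} unresponsive, restarting")
        subprocess.run(["systemctl", "restart", service], check=False)

    def watchdog_loop(interval: float = 5.0) -> None:
        while True:
            for service in MONITORED_SERVICES:
                if not is_alive(service):
                    restart(service)
            time.sleep(interval)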
In the event of a system crash, process deadlock, or abnormal reset, the hyper-converged platform preferentially restores the service to ensure business continuity, and provides black box technology to support fault location and handling.
The black box is mainly used to collect and store the kernel log and the diagnostic tool's diagnostic information from before the abnormal exit of the operating system on the management node and the compute nodes. After an operating system crash, maintenance personnel can export and analyze this data to locate the root cause. If a failure results in the loss of system configuration files, users can quickly restore the system from the preserved information.
3.1. VM Restart
When the application layer of the VM guest OS stops being scheduled (blue screen, freeze, etc.), the platform detects it through Sangfor vmtool installed in the virtual machine. Every few seconds, vmtool sends a heartbeat to the host where the virtual machine is running, and the host judges whether the application layer of the guest OS is healthy from the heartbeat and the network traffic status reported by the VM. If the application layer has not been scheduled for several minutes, the virtual machine is considered abnormal.
A virtual machine can become abnormal for many reasons: system blue screens caused by hardware drivers, pirated software, viruses, and the like; black screens of the business operating system caused by hard disk failure, driver errors, CPU overclocking, BIOS settings, software infection, and so on. In these cases, the hyper-converged platform provides an automatic restart mechanism for the affected virtual machines.
When the external environment is faulty (for example, the host network cable is unplugged or the host goes down), the platform polls the hosts where HA-enabled VMs are running every 5 s to detect whether the virtual machine state is abnormal. When the abnormal duration reaches the fault detection sensitivity set by the user (the shortest setting is 10 s), the HA virtual machine is switched to another host to ensure high availability of the service system, greatly shortening the service interruption caused by host failure. Resources are reserved in the entire cluster for the abnormal virtual machine to be pulled up, using the "2.3 Resource Reservation Guarantee" technology; if resources are still insufficient, virtual machines are recovered according to their priority.
3.3. VM Snapshot
When a virtual machine suffers a logical failure that causes a service abnormality, such as a failed change to the virtual machine (patching, new software installation, etc.), the hyper-converged platform provides virtual machine snapshot technology, which can quickly roll the machine back to the healthy service state it had at the snapshot time.
A virtual machine snapshot saves the state of a virtual machine at a certain point in time, so that the virtual machine can later be restored to exactly that state.
aCloud provides a virtual machine live migration mechanism to move virtual machines to other hosts without affecting service operation, ensuring that the service continues uninterrupted. During live migration, the full running state of the virtual machine is synchronized, including the memory, vCPU, disk, and peripheral register status; once synchronization completes, the computing resources occupied on the source host are released and the destination VM is started.
During the migration process, the resources of the destination physical host are checked, and the migration fails if they are insufficient; the target virtual network is also checked for consistency with the source (if it is not consistent, an alarm is generated and the user decides whether to continue). Three migration modes are supported:
1) Live migration within the cluster on shared storage: only the running data is synchronized; no virtual disk image copy is required;
2) Cross-storage live migration in the cluster: when the storage location changes, the platform first migrates the virtual machine's virtual disk image file and then synchronizes the running data;
3) Cross-cluster live migration: synchronizes both the virtual disk image files and the running data.
New virtual machines on aCloud use a unified vCPU type, so that a virtual machine does not depend on the physical CPU model (instruction set) and can be live-migrated across physical hosts with different CPU models.
Host maintenance mode builds on virtual machine live migration: the system first migrates the services running on the host entering maintenance mode to other hosts, ensuring that services are not affected during the maintenance or replacement process, so maintenance can be performed self-service. A host in single-host maintenance mode is in a frozen state and cannot read or write data; if a virtual machine could not be migrated, there could be a single point of data failure. Therefore, in host maintenance mode a virtual storage copy check is performed to ensure that every data copy on the host also has a copy on another host before the host is powered off.
DRS monitors the usage of the resource pools in the cluster and watches the entire cluster for cases where virtual machine service pressure grows so high that the performance of a physical host becomes insufficient to carry the normal operation of the service. The DRS function dynamically evaluates the resource status and migrates virtual machines from the overloaded server to servers with sufficient resources, ensuring the healthy running of the services in the cluster; it also prevents traffic from being switched back and forth by DRS, and the user can tune the scheduling policy.
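A simplified sketch of such a scheduling decision follows; the thresholds and the hysteresis gap are illustrative assumptions, not the product's actual values.

    OVERLOAD_THRESHOLD = 0.80   # assumed host-load level that triggers DRS
    HYSTERESIS_GAP = 0.15       # assumed gap that prevents ping-pong moves

    def drs_plan(host_load: dict):
        # Propose moving load from the busiest host to the idlest one, but
        # only when the busiest host is overloaded AND the gap is wide
        # enough that the migration will not immediately reverse.
        src = max(host_load, key=host_load.get)
        dst = min(host_load, key=host_load.get)
        if host_load[src] > OVERLOAD_THRESHOLD and \
           host_load[src] - host_load[dst] > HYSTERESIS_GAP:
            return src, dst
        return None

    # Example: host-A at 90% load, host-B at 40% -> migrate from A to B.
    assert drs_plan({"host-A": 0.90, "host-B": 0.40}) == ("host-A", "host-B")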
When virtual machine service pressure increases, the computing resources allocated when the user created the service may become insufficient to carry its stable operation. The hyper-converged platform provides a dynamic resource expansion function that monitors the memory and CPU usage of the virtual machine in real time. When the computing resources allocated to the virtual machine are about to reach a bottleneck and the computing resources of the running physical host are sufficient, resources are hot-added to the service virtual machine to keep the service running normally. When the resources of the running physical host are already overloaded, the hot-add operation is not performed, to avoid squeezing the resources of the other virtual machines on the host.
The resource usage bottleneck is customized by the user, including the CPU usage, the memory usage, and the duration for which the computing resource must stay at the bottleneck, ensuring that resources are allocated to the services that truly need them.
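A minimal sketch of the hot-add decision, with assumed threshold values standing in for the user-configured ones:

    CPU_THRESHOLD = 0.85     # assumed VM CPU usage treated as a bottleneck
    MEM_THRESHOLD = 0.90     # assumed VM memory usage treated as a bottleneck
    SUSTAIN_SECONDS = 300    # assumed duration the bottleneck must persist
    HOST_HEADROOM = 0.20     # host must keep this much free capacity

    def should_hot_add(vm_cpu, vm_mem, bottleneck_seconds, host_free_ratio):
        # Expand only if the VM has been at its bottleneck long enough AND
        # the host has headroom, so other VMs are not squeezed.
        at_bottleneck = vm_cpu >= CPU_THRESHOLD or vm_mem >= MEM_THRESHOLD
        sustained = bottleneck_seconds >= SUSTAIN_SECONDS
        host_ok = host_free_ratio >= HOST_HEADROOM
        return at_bottleneck and sustained and host_ok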
3.8. VM Priority
When the available resources of the cluster are limited (system resources are tight, a host is down, virtual machine HA is in progress, etc.), prioritization is required to keep important services running. The hyper-converged platform provides virtual machine priority settings so that administrators can rank virtual machines and ensure that higher-priority business always obtains resources first.
3.9. Recycle Bin
When an administrator mistakenly deletes resources such as virtual machines and virtual network devices, the platform makes it possible to retrieve the items that have not yet been completely deleted, providing the user with a buffer against erroneous delete operations.
A virtual device deleted by the user is temporarily placed in the recycle bin for a period of time. During this period, the disk space occupied by the deleted device is not released and the data is not deleted, so the device can still be retrieved; a deleted device that stays in the recycle bin for more than 30 days is permanently removed.
3.10. VM Anti-affinity
Some virtual machines have a mutually exclusive relationship, such as the multiple RAC node virtual machines of an Oracle RAC database. If these virtual machines are placed on one host, it is as if all the eggs were put in one basket: the service is compromised when that host fails. aCloud provides anti-affinity rules to guarantee that mutually exclusive virtual machines will not run on the same host. When one host is down, the virtual machines running on the other hosts in the cluster continue to run, ensuring business continuity. When DRS dynamic resource scheduling or an HA pull-up takes place, mutually exclusive virtual machines still follow the anti-affinity principle and are never placed on the same host.
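The placement rule can be expressed compactly; the sketch below uses hypothetical VM and host names to show the check a scheduler would apply.

    def violates_anti_affinity(vm, candidate_host, placements, groups):
        # True if placing `vm` on `candidate_host` would co-locate it with
        # another member of its anti-affinity group. `placements` maps
        # vm -> host; `groups` maps vm -> group id.
        group = groups.get(vm)
        if group is None:
            return False
        return any(groups.get(other) == group and host == candidate_host
                   for other, host in placements.items() if other != vm)

    # Two Oracle RAC nodes in one group must land on different hosts.
    groups = {"rac-node-1": "oracle-rac", "rac-node-2": "oracle-rac"}
    placements = {"rac-node-1": "host-A"}
    assert violates_anti_affinity("rac-node-2", "host-A", placements, groups)
    assert not violates_anti_affinity("rac-node-2", "host-B", placements, groups)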
aSAN is Sangfor's distributed storage layer, which uses virtualization technology to "pool" the local hard disks of the hosts in the cluster and provide NFS/iSCSI to the upper layer, allowing virtual machines to freely allocate storage according to their requirements.
4.2. Data Replica Based Protection
When hardware fails (hard disk damage, storage switch or storage network card failure, etc.) and the data on the failed host is lost or cannot be accessed, services nevertheless continue: business data is written to the storage pool in multiple copies, and the copies are distributed on different disks of different physical hosts, so the user data still has a functioning copy on other hosts. This ensures that data is not lost and services keep running normally.
Note: the multi-copy mechanism only addresses hardware-level faults; it does not address logic-level faults, for example corruption written by the upper-layer application itself.
When multiple copies are written inconsistently because of network problems or other reasons, and each copy considers itself to hold the valid data, the service cannot tell which copy is correct: a data split-brain occurs and normal operation is affected. To prevent this, data is stored as data copies plus an arbitration copy; the arbitration copy is used to determine which data copy is correct, and the service is told to use the correct copy, ensuring safe and stable operation.
The arbitration copy is a special copy: it holds only a small amount of parity data and occupies little storage space. The arbitration copy, like the data copies, must obey the host mutual-exclusion principle, so at least three storage hosts are needed to form data copies plus an arbitration copy. The core principle of the arbitration mechanism is that "the minority obeys the majority": when the number of copies (data copies + arbitration copy) accessible from the host where the virtual machine runs is less than half of the total number of copies, the virtual machine is prohibited from running on that host; conversely, the virtual machine may run on a host that can access the majority.
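The majority rule reduces to a one-line check. For the common layout of two data copies plus one arbitration copy, at least two of the three must be reachable:

    def can_run_vm(accessible_copies: int, total_copies: int) -> bool:
        # "The minority obeys the majority": run the VM on a host only if
        # it can reach more than half of all copies (data + arbitration).
        return accessible_copies > total_copies / 2

    assert can_run_vm(2, 3)        # data copy + arbitration copy reachable
    assert not can_run_vm(1, 3)    # one copy reachable: refuse, avoid split-brain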
When an HDD in the cluster is damaged and IO reads/writes fail, affecting the service, the hyper-converged platform provides hot spare disk protection: the system hot spare disk automatically takes over for the damaged HDD and starts working without any manual intervention. In scenarios where the host cluster is large and contains many hard disks, disk faults occur from time to time; the aCloud platform lets users stop worrying about data loss caused by hard disk damage that is not replaced in time.
4.5. IO QoS Protection
The platform provides an IO QoS mechanism through which users can guarantee the IO supply of important services. The service priority policy is: important virtual machine service IO > normal virtual machine service IO > other IO (backup, data reconstruction, etc.). The platform automatically checks the IO throughput load and the physical space occupied, and schedules IO so that high-priority services are served first while overall throughput is maximized.
When a hard disk approaches the end of its life or accumulates too many bad sectors, it is effectively in a sub-health state: although the disk is still recognized and can serve read and write operations, reads and writes may fail and data may even be lost. The platform provides a sub-health detection mechanism for hard disks to detect such disks in advance and avoid the impact of disk failure on the service.
Hard disk sub-health detection calls the smartctl and iostat commands to obtain status information about the disk and compares it with abnormality thresholds to determine whether the disk shows sub-health phenomena (such as a slow disk, IO stutter, or an exhausted PCIe SSD lifespan). It also filters the kernel log for IO call errors and the RAID card error logs. This helps users discover sub-health hard disks and replace them with healthy ones in time, keeping the disks in the cluster healthy. A sub-health hard disk is restricted from receiving new shards: its shards are silenced and accept no new data, and the data already on the sub-health disk is gradually migrated to healthy disks.
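As a rough illustration, the following sketch checks one SMART attribute via smartctl, using the bad-sector policy quoted later in section 6 (more than 10 bad sectors marks the disk sub-healthy); a real detector also weighs iostat latency, kernel logs, and RAID card errors.

    import re
    import subprocess

    BAD_SECTOR_LIMIT = 10   # per the HDD bad-sector policy in section 6

    def reallocated_sectors(device: str) -> int:
        # Parse the raw value of Reallocated_Sector_Ct from `smartctl -A`.
        out = subprocess.check_output(["smartctl", "-A", device], text=True)
        m = re.search(r"Reallocated_Sector_Ct.*?(\d+)\s*$", out, re.MULTILINE)
        return int(m.group(1)) if m else 0

    def is_sub_healthy(device: str) -> bool:
        return reallocated_sectors(device) > BAD_SECTOR_LIMIT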
After a hard disk enters the sub-health state and an alarm is generated, the user needs to perform a replacement operation. If a data synchronization task still needs to read data from the disk awaiting replacement, pulling the disk directly could cause a double fault and affect the service. In this case, the hard disk maintenance/isolation function can be used: before the system isolates the disk, the data is fully checked to ensure that the data on the disk has a healthy copy on another disk. After isolation, the disk no longer allows data to be read or written, ensuring that services are not affected while the disk is replaced.
Hard disks can also suffer errors that produce no warning at all, known as silent errors: only when the user later needs the data do they discover it has been corrupted, which can cause irreparable damage. Because there is no warning, the error may have occurred long before it is noticed, making the problem very serious. NetApp observed more than 1.5 million hard disk drives over 41 months and discovered more than 400,000 cases of silent data corruption, much of which the hardware RAID controllers did not detect.
To prevent erroneous user data being returned because of silent errors, the hyper-converged platform provides a verify engine and a checksum management module. As soon as user data enters the system, a checksum is generated as a "fingerprint" of the data and stored; thereafter, the checksum is used to verify the data whenever it is read.
A checksum algorithm has two main evaluation criteria: the speed at which the checksum is generated, and the collision rate and uniformity. The collision rate is the probability that two different pieces of data generate the same checksum. The platform uses the XXHash64 algorithm, which is faster and has a lower collision rate than CRC-32.
The checksum is generated in memory and is transferred and stored along with the data. When data is persisted to non-volatile storage such as disks and SSDs, the checksum must be persisted as well. This introduces additional write overhead, which the platform reduces by asynchronous flush-back of checksums, bypassing the critical IO path, and related IO optimizations.
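A minimal sketch of fingerprint-on-write, verify-on-read follows, using the open-source xxhash Python package for XXHash64; the platform's internal verify engine is not published, and the dict here stands in for the storage layer.

    import xxhash

    def fingerprint(data: bytes) -> int:
        # 64-bit XXHash64 digest used as the data's "fingerprint".
        return xxhash.xxh64(data).intdigest()

    def write_block(store: dict, block_id: str, data: bytes) -> None:
        # Persist the block together with its checksum.
        store[block_id] = (data, fingerprint(data))

    def read_block(store: dict, block_id: str) -> bytes:
        # Verify on read: a mismatch means silent corruption was caught,
        # and the block should be repaired from a healthy replica instead
        # of being returned to the user.
        data, checksum = store[block_id]
        if fingerprint(data) != checksum:
            raise IOError(f"silent corruption detected in block {block_id}")
        return data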
The platform periodically checks the status of the hard disks and the health of the copies. The healthy data is used as the source for replica reconstruction, ensuring the security of the cluster data.
When a data disk or cache disk is pulled out, it is taken offline. A data disk is considered faulty when service IO on it fails continuously, and a cache disk is considered faulty when service IO on it fails; in these cases the data reconstruction process is triggered.
The data reconstruction process uses the following technical measures to speed up the reconstruction:
1) Concurrency: data is read from multiple source hard disks and written to multiple destination hard disks in parallel;
2) Business awareness: the reconstruction program senses the IO of the upper-layer business, so reconstruction yields to service IO;
3) Priority: reconstruction is scheduled by the priority of the virtual machine, and when the space resources of the cluster are insufficient, an alarm notifies the user.
4.10. Fault Domain Isolation
Users can divide aSAN storage into different disk volumes, each constituting a fault domain. Within a fault domain, the copy mechanism and the rebuild mechanism of aSAN are confined to that domain and data is not rebuilt into other fault domains; faults within one fault domain do not spread to other fault domains, effectively containing fault propagation. A rack failure, for example, affects only the fault domain on that rack.
The "3.9 Recycle Bin" section explained that when a virtual device is completely removed, the disk space it occupied is freed and the equipment can no longer be retrieved. To further protect the reversibility of user operations, the aSAN virtual storage layer provides a data delayed-deletion mechanism through which deleted data can still be retrieved.
When an upper-layer service sends a data deletion instruction to the aSAN storage layer (such as the command to completely delete a virtual machine image), aSAN checks the remaining disk space. If the remaining space is sufficient, aSAN does not immediately clear and reclaim the space: the data is placed in a "to-be-deleted queue" while a successful-deletion result is returned to the upper layer, and the data is retained for a period of time (10 days by default), after which it is actually deleted.
If the remaining space of aSAN falls below 70% and there is data in the background waiting to be deleted, aSAN reclaims the to-be-deleted data starting with the oldest entries, without waiting for the timeout.
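The mechanism can be sketched as a small queue; the structures and the early-reclaim simplification below are illustrative (a real implementation would stop reclaiming once enough space is recovered).

    import time
    from collections import OrderedDict

    RETENTION_SECONDS = 10 * 24 * 3600   # default retention: 10 days
    FREE_SPACE_FLOOR = 0.70              # reclaim early when free space < 70%

    class DelayedDeleteQueue:
        def __init__(self):
            self.pending = OrderedDict()   # block_id -> enqueue time, oldest first

        def delete(self, block_id: str) -> None:
            # Acknowledge the delete immediately; keep the data around.
            self.pending[block_id] = time.time()

        def reclaim(self, free_space_ratio: float, purge) -> None:
            now = time.time()
            low_space = free_space_ratio < FREE_SPACE_FLOOR
            for block_id, queued_at in list(self.pending.items()):
                if low_space or now - queued_at >= RETENTION_SECONDS:
                    purge(block_id)            # actually free the space
                    del self.pending[block_id]
                else:
                    break                      # remaining entries are newer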
4.12. Data Self-Balancing
aSAN uses data balancing to ensure that, in all cases, data is distributed as evenly as possible across the hard disks in a storage volume, avoiding extreme data hotspots, putting the space and performance of newly added hard disks to use as soon as possible, and ensuring the hard disks of every host are utilized.
1) Planned balancing
Within a user-defined time window (such as 12 am to 7 am), when the capacity utilization of the hard drives within the storage volume differs greatly, data balancing is triggered on the heavily used disks, migrating part of their data to hard disks with low capacity usage.
Within the time frame planned by the user, aSAN's data balancing module scans all the hard disks in the storage volume. If the difference between the highest and lowest disk capacity usage in the volume exceeds a certain threshold (30% by default), balancing is triggered and continues until the usage difference between any two hard disks in the volume no longer exceeds the threshold. For example, after the user expands the storage volume, balancing is triggered within the planned window to migrate data onto the newly added hard disks.
2) Automatic balancing
Automatic balancing runs at any time without user intervention, to avoid the situation where one hard disk in the storage volume is full while another still has free space. When the space usage of any disk in the storage volume exceeds the risk threshold (90% by default), automatic balancing is triggered and runs until the difference between the highest and lowest disk capacity usage in the volume is below a certain threshold (3% by default).
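The two trigger conditions reduce to simple comparisons over per-disk usage; the default values below are the ones quoted in the text.

    PLANNED_DIFF_THRESHOLD = 0.30   # planned balancing: max-min usage gap
    RISK_THRESHOLD = 0.90           # automatic balancing: any disk above this
    TARGET_DIFF = 0.03              # automatic balancing stop condition

    def planned_balance_needed(disk_usage):
        # Triggered in the planned window when the usage gap is too wide.
        return max(disk_usage) - min(disk_usage) > PLANNED_DIFF_THRESHOLD

    def auto_balance_needed(disk_usage):
        # Triggered any time one disk crosses the risk threshold; balancing
        # then runs until the max-min gap falls below TARGET_DIFF.
        return max(disk_usage) > RISK_THRESHOLD

    # One disk at 92% while another sits at 40% triggers both policies.
    usage = [0.92, 0.40, 0.55]
    assert planned_balance_needed(usage) and auto_balance_needed(usage)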
2. Balance implementation
When the trigger condition is met, the system calculates, in units of data slices on the source hard disk, the destination hard disk where each slice will be stored; the destination must satisfy the following principles:
1) Host mutual exclusion: after migration, two copies of the same slice must not be located on the same host;
2) Optimal performance: the hard disk that still satisfies the optimal data distribution strategy after the slice migration is preferred.
During the balancing process, newly added or modified data for a slice is written simultaneously to the source and the target, that is, one extra copy is written. Before balancing ends, the balancing program performs a data check on the source and the target to ensure data consistency before and after balancing; after balancing completes, the source slices are removed and their space reclaimed.
The aNET network platform is built from decoupled sub-modules, so a fault in one sub-module affects only that module and does not spread into an overall failure of the aNET network platform. Its high-reliability design separates a management plane, a control plane with a "central controller", and a data forwarding plane: the control plane analyzes the network configuration issued by the management plane, the management agent delivers the configuration to the data forwarding plane, and the forwarding plane executes it directly without passing through intermediate layers.
The management plane master node is elected through the cluster module, and the cluster file system stores the data of each network node in a distributed manner. If the master node fails, aNET automatically elects a new master control node, and the new master node obtains the cluster network configuration data from the cluster file system, ensuring high reliability of the management plane.
5.1.2. Control Plane High Reliability
The control plane adopts the same centralized control scheme as the management plane: the cluster module elects the master control. Through status polling and the active reporting mechanism of the network node modules, the central controller can rebuild the current control state and keeps track of the real-time state of each computing and network node, ensuring high reliability of the control plane.
The data forwarding plane runs in the application layer. Unlike other cloud platforms whose forwarding planes run in the kernel, an abnormal forwarding plane does not crash the kernel, and the plane can be restored quickly by restarting the service, greatly reducing the impact on the reliability of the platform itself. The data forwarding plane also supports active/standby switchover within a single host: the standby process holds all the configuration information of the data forwarding plane, and when the main process exits abnormally, the standby process immediately becomes the master process and takes over all network forwarding services. Service is not interrupted, guaranteeing high reliability of the data forwarding plane on a single host.
A virtual switch instance exists on all hosts in the cluster. When one of the hosts goes offline, the traffic that passed through the virtual switch instance on that host follows the virtual routing and virtual machine HA to other hosts and is taken over by them. To the upper-layer application, the virtual machine stays attached to the same virtual switch: after live migration, HA, and similar events, the virtual switch seen by the virtual machine is unchanged and access relationships are unaffected, ensuring high reliability of the data forwarding plane across the hosts in the cluster.
5.3. vRouter
The virtual router in the aNET network layer is a centralized router: traffic forwarded at Layer 3 must pass through it, so Layer 3 forwarding is interrupted when the node where the router resides fails or the service network port it is connected to fails. The central controller therefore monitors the running status of the hosts and the status of the service network ports in real time; when a host is faulty or a service network port cannot communicate, the central controller calculates the affected virtual routers and automatically switches them to other working hosts to keep the service available.
5.4. Distributed Firewall aFW
Distributed firewall policies follow the virtual machine. When a VM is pulled up by HA on another host in the cluster to resume service, the virtual network management module quickly re-establishes the distributed firewall ACL policies associated with the VM on the host where the VM now runs, based on the stored policy configuration, so security protection remains effective after failover.
5.5. NFV Device Reliability
NFV devices are integrated into the aCloud platform in the form of virtual machines, and the system provides a dual-machine high availability solution for them. At the same time, the aNET network layer monitors the running status of each NFV device in real time across multiple dimensions (watchdog, disk IO, network traffic, and BFD detection). If an NFV device fails to work properly, the virtual router bypasses the associated policy route to ensure that the service is not interrupted.
When the virtual network is configured incorrectly or a network link is faulty, users can run connectivity detection. The endpoints to be probed are set through the interface; the control plane sends the probe route to the controller, and the controller coordinates the control agents on multiple nodes to perform the connectivity detection and report the results, clearly presenting the logical and physical network path of the entire probe on the UI and helping the user locate the fault quickly.
Ping detection is conducted between the VXLAN port IPs of all hosts. When ports cannot be pinged for over 5 s, an alarm is generated for VXLAN network failure and the connectivity status of the VXLAN is presented, helping the user quickly locate the VXLAN link failure. In the meantime, VXLAN jumbo frame detection is also supported for users who have enabled the VXLAN high performance mode.
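A host-side sketch of the mesh check, assuming the peer VXLAN port IPs are known; the 5 s wait mirrors the alarm rule above.

    import subprocess

    def vxlan_ping_check(peer_ips):
        # Ping each peer's VXLAN port IP; a peer unreachable within 5 s is
        # reported so an alarm can be raised for that link.
        unreachable = []
        for ip in peer_ips:
            ok = subprocess.run(["ping", "-c", "1", "-W", "5", ip],
                                capture_output=True).returncode == 0
            if not ok:
                unreachable.append(ip)
        return unreachable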
The aNET data forwarding plane regularly checks the packet transmission status of each network interface. When it detects that a network port has been unable to transmit packets for 30 consecutive seconds, the port is reset, ensuring that the port returns to normal use and user traffic recovers quickly.
6. Hardware Layer Reliability Design
Sangfor offers both hardware/software integrated appliance delivery and aCloud pure software delivery (on third-party servers). In both forms, the platform monitors the hardware layer to detect and avoid hardware problems in time.
The platform performs health checks on key components such as network cards, hard drives, memory, and RAID cards to facilitate the timely detection of anomalies. Test results are presented in a unified manner, and users can eliminate risks through operations guided by the alarm information and prompts.
These checks realize failure diagnosis of key components such as the CPU and memory.
CPU temperature monitoring: the HCI background checks the temperature of each physical core of the CPU every minute. When the CPU temperature abnormality persists past the set threshold (10 minutes), the platform raises an alarm, identifying the risk of CPU failure in advance and ensuring CPU reliability.
CPU frequency monitoring: the HCI background checks the CPU frequency every hour; when the CPU frequency drops abnormally, an alarm is raised.
ECC monitoring: memory that supports ECC (Error Checking and Correcting) is monitored in real time for UE errors (uncorrectable ECC errors, which can cause the device to go down or restart) and CE errors (correctable ECC errors, which do not affect continued use of the memory as long as the error count does not keep increasing). As memory geometries become smaller, leakage events occur more easily, and in recent years memory ECC errors have become more and more prominent; monitoring allows UE-class risks to be handled before they bring the device down.
Hard disk hot swap and RAID: the Sangfor hyper-converged appliance supports hard disk (SAS/SATA) hot swap and supports RAID 0, 1, 10 and multiple other RAID modes, guaranteeing high availability of the hard disks. It also supports an additional hot spare disk under the RAID configuration to further ensure the redundancy of the data disks, and supports reconstructing and rebalancing the data after a disk is replaced.
Comprehensive hard disk monitoring and fault avoidance ensure high reliability of the hard disks:
1) Offline monitoring: the platform monitors the hard disk status in real time and immediately alerts when a hard disk goes offline;
2) IO error monitoring: the platform checks the IO error entries in the dmesg information and immediately alerts when an error is found;
3) SSD life monitoring: the hyper-converged platform regularly uses the smartctl command to check the remaining life of each SSD; when the available life of an SSD falls below 10% of the disk's total life, an alarm is raised immediately;
4) HDD bad sector monitoring: aCloud uses the smartctl command to scan all physical hard disks according to the user's instruction, and an alarm is raised immediately when HDD bad sectors are found. If the number of bad sectors is no more than 10, a disk replacement suggestion is given; if it is more than 10, the hard disk is labelled as a sub-health disk, the disk is degraded, and its data is gradually migrated away;
5) Hard disk latency monitoring: according to the user's instruction, the platform tests the latency of random 4K reads at a queue depth of 32. When the latency exceeds 10 ms, an alarm is triggered immediately; when it exceeds 50 ms, an emergency alert is triggered immediately, the hard disk is set as a sub-health disk and downgraded, and its data is gradually moved off the disk;
6) Hard disk IOPS monitoring: according to the user's instruction, the platform tests the bare-disk IOPS with a 4K IO block size and alarms when the value is abnormally low; for example, when the IOPS of a 7200 rpm disk is less than 60, of a 10,000 rpm disk less than 100, or of a 15,000 rpm disk less than 140, an alarm is raised.
Network port connection mode detection: to ensure correct network operation, the platform checks the working mode of each network port through the ethtool command, ensuring that the actual port mode is consistent with the negotiated working mode. The platform also checks the deployment of the network ports to ensure that a port configured for a specific purpose can fulfil it, preventing low-level faults such as dropped network ports and unplugged network cables; if a network port is not deployed correctly, an alarm prompts the user;
Network port packet loss detection: to ensure the stability of the service network, the hyper-converged platform reads the NIC information and counts the packet loss of the NIC; when the packet loss rate reaches a dangerous value (exceeding the set threshold), an alarm is generated.
Network port rate detection: to ensure the bandwidth required by the running services, the hyper-converged platform detects the network port rate and alarms when it reaches a dangerous value; for example, if the port rate is less than gigabit, an alarm prompts the user;
Network port duplex detection: the platform checks the duplex mode of each port to ensure that services operate in full-duplex mode with high network quality;
RAID card abnormal status check: the HCI platform analyzes the health status of the RAID card by reading the RAID status information through system commands; if the RAID card has an error or anomaly, an alarm is raised prompting the user to handle it;
JBOD (non-RAID) mode check: to preserve the hot-swap capability of the hard disks, the hyper-converged platform performs RAID JBOD mode detection; if the mode is not configured as required, an alarm is raised.
Power supply redundancy: Sangfor hyper-converged appliances support 1+1 power supply redundancy and power hot swap; after one power supply fails, the system continues to operate without affecting the service.
When problems are found with the hardware or a virtual machine, the alarm information is displayed on the page with warning-level grouping, and users are notified by e-mail and text message to ensure alarms are received in a timely manner.
Administrators can set the most suitable alarm policy based on business needs, covering indicators such as high host memory usage, sustained high CPU load, and so on.
The multi-copy mechanism of the aCloud platform handles hardware-level single points of failure, making sure that services keep running when hardware fails. However, when a failure exceeds the redundancy of the aCloud platform (all of the multiple copies are damaged), or a logical error corrupts the data itself, backup is required.
The aCloud platform provides a fast backup function that combines a first-time full backup, subsequent incremental backups, and bitmap dirty-data marking, improving backup efficiency and reducing the impact of the backup process on the production environment. The procedure is:
1) First, perform a full backup (if there is already a full backup, directly perform
an incremental backup);
2) After the full backup, the service continuously writes new data (G and H) and marks it in the bitmap. The new data is written directly in place in the qcow2 file, and only the changed locations are carried incrementally into the next backup; after each backup ends, the bitmap is reset to 0 to begin tracking changes for the next one.
When the multiple disks of a virtual machine are related, fast backup also provides a multi-disk data consistency guarantee. For example, in database application scenarios (SQL Server, Oracle), the data disk and the log disk must be backed up at a consistent point in time; otherwise, when the backup is restored, the restored Oracle system will still be unavailable because of the inconsistency. aCloud fast backup ensures that the multiple disks holding the database data are restored consistently.
This design preserves performance and efficiency: new data is written directly in its original location, so no copy-on-write occurs and the mapping between the qcow2 file and the data locations does not become disordered, leaving the performance of the qcow2 image unaffected; meanwhile, the incremental backup method reduces the amount of data in each backup, increasing the backup speed.
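A sketch of bitmap dirty-block tracking follows; the block size and in-memory layout are illustrative assumptions, not the product's on-disk format.

    BLOCK_SIZE = 64 * 1024   # assumed tracking granularity

    class DirtyBitmap:
        def __init__(self, disk_size: int):
            n_blocks = disk_size // BLOCK_SIZE
            self.bits = bytearray((n_blocks + 7) // 8)

        def mark(self, offset: int) -> None:
            # Called on every guest write: flag the block containing `offset`.
            block = offset // BLOCK_SIZE
            self.bits[block // 8] |= 1 << (block % 8)

        def dirty_blocks(self):
            # Blocks changed since the last backup; only these are copied
            # by the next incremental backup.
            for block in range(len(self.bits) * 8):
                if self.bits[block // 8] & (1 << (block % 8)):
                    yield block

        def reset(self) -> None:
            # After a backup completes, reset the bitmap to 0 (per the text).
            self.bits = bytearray(len(self.bits))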
Scheduled backup can provide hourly-granularity protection, while CDP (continuous data protection) technology provides protection at the one-second to five-second level: it records every data change and can restore data with near-zero loss for ultimate protection.
Sangfor aCloud has deeply optimized CDP technology. Compared with traditional CDP software that hooks into the IO path at the driver layer, Sangfor integrates the CDP module with the qcow2 file layer, providing a CDP data protection solution that is lower cost, easier to deploy, and more reliable. The CDP module asynchronously copies IO from the main service to the CDP log storage repository and periodically generates RP points, so the CDP backup process does not affect normal service; fault isolation further ensures that a fault in the CDP module does not affect normal service either. The generated BP points and RP points are marked with a timestamp used to locate the moment to restore to.
Traditional CDP software inserts a "probe program" on the IO path; if the probe program itself is faulty or the storage the CDP depends on fails, the business is affected. Sangfor's CDP module works in bypass mode, so even if the CDP module is faulty, it does not cause the business to fail.
CDP also provides a consistency check across the data stored on multiple disks, to ensure that a multi-disk virtual machine is restored to a single consistent moment. For example, if the CDP storage spans three disks, each IO write forms an RP point marked with an ID; only RP points marked with the same ID on all three disks are consistent, and an RP point that is not consistent across the disks is not shown on the page and cannot be used to restore the virtual machine.
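The consistency rule is effectively a set intersection over the RP IDs logged per disk; the disk names and IDs below are illustrative.

    def consistent_rp_ids(rp_points_per_disk: dict) -> set:
        # Only RP IDs present on every disk of the VM are consistent and
        # may be offered on the page for restore.
        per_disk = [set(ids) for ids in rp_points_per_disk.values()]
        return set.intersection(*per_disk) if per_disk else set()

    # RP 42 was logged on all three disks and is restorable; 41 and 43 are not.
    rp_log = {"disk-a": [41, 42, 43], "disk-b": [41, 42], "disk-c": [42, 43]}
    assert consistent_rp_ids(rp_log) == {42}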
Sangfor aCloud helps users cope with server-room-level failures by providing a complete disaster recovery solution that does not depend on third-party software, reducing the complexity of the overall solution and making it simpler and more stable. The active/standby DR solution is mainly used for disaster recovery in the same city or across locations: the production center and the disaster recovery center run in active/standby mode, and when a disaster such as a fire strikes the production center, the disaster recovery center can quickly restore services, maximizing the protection of continuous business operation.
The Sangfor aCloud offsite disaster recovery solution realizes asynchronous data replication through the DR gateway aDR, which performs data backup and transfer: aDR calls the CDP backup API to perform local data backup for the protected VMs and transmits the data between the data centers, with unified management of the production center cluster and the disaster recovery center cluster.
The stretched cluster solution achieves zero RPO and second-level RTO recovery in the event of a data center failure: when a site fails, applications running on the stretched cluster seamlessly access the data copy at the other site, realizing inter-site high availability of the business.
As described in the section "4.2 Data Replica Based Protection", business data is written to the storage volume in multiple copies. After the hyper-converged platform is built into a stretched cluster, the multiple copies of the business data running in the stretched cluster are synchronously written to the two sites: an IO is acknowledged only after the writes at both sites are completed, and only then can the next IO be written, ensuring the consistency of the data copies. When the service runs normally, the local data copy is accessed preferentially; when the local data copy is inaccessible, the system switches to the copy in the remote data center. Therefore, when one data center fails, the virtual machine can be pulled up in the other data center by HA and access the remote copy, maximizing the protection of continuous business operation.
The virtual machines of an active-active service must run at both sites. For two virtual machines VM A and VM B, you can specify at creation time that VM A runs only at the main site and VM B only at the secondary site, ensuring that the service's virtual machines are distributed across the two sites. For example, in the Oracle RAC scenario, the two RAC nodes are each pinned to a particular server room and are mutually exclusive; when one server room fails, the RAC node in the other room continues to provide service.
The stretched cluster performs data consistency checks through the arbitration copy. For details, refer to "4.3 Data Arbitration Protection".
Block A1, Nanshan iPark,
Email: sales@sangfor.com