data but not obtaining massive data. In other words, if big data is compared to an
industry, the key to making profits from this industry is to improve the data
processing capability to increase data value.
Technically, big data is closely related to cloud computing. Big data must be
processed by a distributed computing system instead of a standalone computer.
Massive data is mined by a distributed system using distributed processing,
distributed database, cloud storage, and virtualization technologies of cloud
computing.
With the advent of the cloud era, big data attracts increasing attention. According
to the analyst team of Zhucloud, big data is generally used to describe large
amounts of unstructured and semi-structured data created by an enterprise.
Downloading such data into relational databases for analysis takes too much time
and money. Big data analysis is usually associated with cloud computing, because
real-time big data analysis requires a framework (such as MapReduce) to assign
tasks to dozens, hundreds, or even thousands of computers.
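As a minimal, single-machine sketch of the map/reduce idea mentioned above (a real MapReduce framework would distribute the map and reduce tasks across many computers; the documents and word counts here are hypothetical), the following Python snippet counts words:

    from collections import Counter
    from functools import reduce

    documents = ["big data needs distributed processing",
                 "cloud computing enables big data analysis"]

    # Map phase: each document is turned into partial word counts
    # (in a real framework, each call would run on a different machine).
    partial_counts = [Counter(doc.split()) for doc in documents]

    # Reduce phase: the partial results are merged into one final count.
    total = reduce(lambda a, b: a + b, partial_counts, Counter())
    print(total.most_common(3))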
An accurate positioning is the basis for a technology to play its role. Properly
positioning AI is the basis for us to understand and apply this technology. Huawei
agrees that AI, as a new general-purpose technology (GPT), is a combination of
technologies, just like the wheel and iron in ancient times, railways and electric
power in the 19th century, and automobiles, computers, and the Internet in the
20th century.
Through practice, Huawei has discovered that AI can not only replace human work,
but also automatically reduce production costs. This is the biggest difference
between AI and informatization, and is also the most valuable advantage of AI.
Through nearly half a century of development, the computing industry has
continuously changed society and reshaped other industries. The computing
industry itself is constantly evolving as well.
IBM's minicomputers are equipped with 4 to 8 POWER CPUs and can scale up to
16 or even 32 sockets. In recent years, IBM has launched projects such as
OpenPOWER and PowerLinux, marking its transition from a closed architecture
to an open architecture.
Dedicated computing platform: high costs and few applications, applicable to only
a few enterprises and a few apps
There are increasingly diversified compute demands on the cloud, edge, and
device sides in terms of performance, energy consumption, latency, and product
tolerance to extreme environments.
The server market is shifting its focus from virtualization and cloud computing to AI
computing, edge computing, and HPC. Traditional server vendors are facing a range
of growing compute demands, which pose challenges to computing architectures as
well as deployment, management, and O&M. Major challenge: how to break
through the compute bottleneck of traditional servers while reducing O&M costs?
In the atomic age, how can the physical limits of Moore's Law be broken? An 18-fold
annual increase in data volume against only a 10-fold increase in computing power
indicates a great deal of demand for heterogeneous computing. No single computing
architecture can process all types of data in all scenarios. Heterogeneous computing,
which combines multiple computing architectures such as CPUs, DSPs, GPUs, AI
chips, and FPGAs working in collaboration, is the optimal solution to meet the
requirements of service diversity (smartphones, smart home, IoT, and smart driving)
and data diversity (digits, text, pictures, videos, images, structured data, and
unstructured data).
Nvidia's CEO, Jen-Hsun Huang, believes the traditional version of Moore's Law is
dead.
According to our analysis, the popularization of AI in the industry faces the
following four challenges:
Professional talents and skills: The learning curve of the AI technology is long,
the entry barrier is high, and talent is scarce. Customers generally lack the AI
technical personnel needed to implement AI applications.
In the past 10 years, the physical world has gradually become digitalized.
Computing is becoming fully intelligent and moving towards the edge.
According to third-party statistics, the total annual data volume will reach 165
zettabytes by 2025, of which 50% will be generated at the edge, and the demand
for compute power at the edge will surge.
In addition, according to the GIV 2025 report, digitalization and intelligence will be
deepened in 2025. The data usage will reach 80%, the cloud adoption rate will
reach 85%, and the AI application rate will reach 86%.
Atlas 500 AI Edge Station for edge intelligence: Most mainstream edge products in
the industry focus more on networking capabilities and less on computing
capabilities. However, in video analytics and industrial interconnection scenarios,
customers need to process the data collected at the edge in real time to support
fast decision-making. The Atlas 500 can collaborate with the cloud, receive and
update algorithms pushed from the cloud, and quickly deploy new services.
The Atlas 500 AI Edge Station offers powerful performance. It is capable of real-
time data processing at the edge. A single device can provide 16 TOPS of INT8
processing capability with an ultra-low power consumption of less than 1 kWh per
day. The Atlas 500 integrates Wi-Fi and LTE wireless interfaces to support flexible
network access and data transmission.
Built for harsh edge deployment environments, whether in freezing Siberia or the
scorching Sahara, the Atlas 500 can operate stably from –40°C to +70°C.
Based on the five self-developed chipset families and two intelligent engines (for
intelligent management and acceleration), Huawei continuously offers customer-
oriented innovations, breaks through Moore's Law, improves management
efficiency, and deploys intelligent computing solutions across scenarios.
Digital ICs can contain anywhere from one to billions of logic gates, flip-flops,
multiplexers, and other circuits in a few square millimeters.
ICs can also combine analog and digital circuits on a single chip to create functions
such as analog-to-digital converters and digital-to-analog converters. Such mixed-
signal circuits offer smaller size and lower cost, but must carefully account for
signal interference.
The architectural designs of the CPU are Reduced instruction set computing (RISC)
and Complex instruction set computing (CISC). The x86 CPUs from AMD and Intel are
examples of CISC processors. Mainstream RISC CPUs are produced by ARM and IBM,
where ARM's are ARM architecture based and IBM adopts the PowerPC architecture.
Instruction system: RISC designers focus on frequently used instructions and try to
make them simple and efficient. Less common functions are usually implemented by
combining instructions, so RISC machines are less efficient when implementing
special functions; however, this can be compensated for by pipelining and
superscalar technology. The instruction system of the CISC computer is rich, with
instructions dedicated to different functions, so the efficiency of handling special
tasks is high.
Memory operation: RISC has restrictions on memory operations, simplifying
control. However, CISC has more memory operation instructions, and the
operations are direct.
Program: A RISC assembly language program generally needs a larger memory
space and is complex and difficult to design when implementing special functions.
CISC assembly language programming is relatively simple, and programs for
scientific computing and complex operations are relatively easy to design and
deliver higher efficiency.
CPU IC: The RISC CPU has a smaller number of unit circuits and therefore has a
smaller footprint and lower power consumption. The CISC CPU has ample circuit
units, and therefore has powerful functions, a larger footprint, and higher power
consumption.
Design period: RISC microprocessors are simple in structure, compact in layout,
short in design cycle, and easy to adopt the latest technology. The CISC
microprocessor has a complex structure and requires a long design period.
Usage: RISC microprocessors are simple in structure, regular in instructions, easy
to understand and use. The CISC microprocessor has a complex structure and
powerful functions, and is easy to implement special functions.
The mainstream computing architectures of data centers include x86, ARM, and
Power.
Under the tick-tock model, every microarchitecture change (tock) was followed by
a die shrink of the process technology (tick).
Tick–tock was a production model adopted in 2007 by Intel. (Intel released its Core
2 CPU in 2006.) Under this model, every "tick" represented a shrinking of the
process technology of the previous microarchitecture and every "tock" designated
a new microarchitecture.
Under the tick-tock scheme, Intel alternated between a "tick" and a "tock" roughly
every 12–18 months.
With each tick, Intel advances its manufacturing process technology in line with
Moore's Law. Each new process introduces higher transistor density and generally a
host of other advantages such as higher performance and lower power
consumption. During a tick, Intel retrofits its previous microarchitecture to the new
process, which inherently yields better performance and energy savings. At this
phase, only lightweight features and improvements are introduced.
With each tock, Intel uses the latest manufacturing process technology from their
"tick" to manufacture a newly designed microarchitecture. The new
microarchitecture is designed with the new process in mind and typically
introduces Intel's newest big features and functionalities. New instructions are
often added during this cycle stage.
Headquartered in Cambridge, the company was founded in November 1990 as
Advanced RISC Machines Ltd and structured as a joint venture between Acorn
Computers, Apple Computer (now Apple Inc.) and VLSI Technology.
ARM is also a CPU technology. Unlike Intel and AMD CPUs that adopt the CISC,
ARM's CPU uses RISC.
ARM does not manufacture chips. ARM's revenue comes entirely from IP licensing.
Supports 16-bit, 32-bit, and 64-bit instruction sets, and is compatible with
various application scenarios, from IoT and devices to the cloud.
Uses a large number of registers, where most data operations are performed
in registers, enabling faster command execution.
Disadvantages:
The differences of the GPU chips mainly lie in the GPU peripherals and industrial
quality.
Field-Programmable Gate Array (FPGA)
Example: The FPGA implements fast computing and nanosecond-level latency for
specific algorithms, such as fast Fourier transform and Smith-Waterman algorithm,
improving computing performance.
Major FPGA manufacturers include Xilinx, Altera, Lattice, and Microsemi. Xilinx and
Altera control 88% of the market. Almost all major FPGA manufacturers come from
the US.
FPGA is widely used in the aerospace, military, and telecom fields. In the telecom
field, in the era of dedicated telecom equipment, the FPGA was used to parse and
convert application network protocols thanks to its programming flexibility and
high performance. In the Network Function Virtualization (NFV)
phase, the FPGA improves the NE data plane performance by 3 to 5 times based
on the general-purpose server and Hypervisor and can be managed and
orchestrated by the OpenStack framework. In the cloud era, the FPGA has been
used as a basic IaaS resource to provide development and acceleration services in
the public cloud. AWS, Huawei, and BAT provide similar general-purpose services.
FPGA is more suitable for non-regular, multi-concurrency, compute-intensive, and
protocol parsing scenarios, such as video, gene sequencing, and network
acceleration.
High energy efficiency ratio: lower power consumption compared with CPU
and GPU
High compute power: 4096 FP16 MAC operations per clock cycle
ABD
F
Based on the new-generation intelligent servers, the computing system
integrates management, acceleration, heterogeneous computing, and AI chips into
the architecture of full-stack, all-scenario intelligent solutions covering the cloud,
edge, and device, implementing all-round intelligent acceleration for computing
services.
Before the emergence of the World Wide Web, servers came in the form of
minicomputers, midrange computers, and mainframe computers, which ran in
host/terminal mode, mostly on the Unix OS. Operators had to log in to the host
through a terminal.
In a LAN, machines operated through terminals are not exactly modern servers,
but the structure is similar to remote operation of modern servers.
Servers, as high-performance computers, can provide various services for
clients.
Under the control of the network OS, a server shares the hard disks, tapes,
printers, and expensive dedicated communication devices connected to the
server with customer sites on the network. A server also provides services such
as centralized computing, information release, and data management for
network users.
The von Neumann architecture, also known as the von Neumann model or
Princeton architecture, is a computer architecture that combines program
instruction memory and data memory. The term describes a design for a computing
device that implements a universal Turing machine and serves as a reference model
for sequential architectures.
A stored-program computer has the following features in terms of the
system structure:
The mainboard has high scalability and a range of slots. Therefore, this
kind of server is widely used and can meet common server application
requirements.
Blade server
Blade servers provide high availability and high density with several of
them installed in one chassis of standard height.
Rack server
RAM
Temporarily stores computing data in the CPU and data exchanged with
external memories such as hard disks.
When a computer is running, the CPU transfers the data that needs to be
computed to the RAM for computation, and then sends out the
computation result. The operating stability of the RAM also determines
the stability of the computer.
RAID
Higher performance
Fault tolerance
BIOS is a program stored on a ROM chip. The BIOS consists of the computer's basic
I/O programs, power-on self-test programs, and system auto-boot program. The
BIOS can read and write the detailed system settings stored in the CMOS. Its main
function is to provide the lowest-level, most direct hardware configuration and
control for a computer.
SSD accelerator card: NAND flash is the most important unit of an SSD. SSD
storage devices vary in their properties according to the number of bits stored
in each cell, with single bit cells (SLC) being generally the most reliable,
durable, fast, and expensive type, compared with 2 and 3 bit cells (MLC and
TLC), and finally quad bit cells (QLC).
System software includes the operating systems (such as BSD, DOS, Linux,
macOS, iOS, OS/2, QNX, Unix, and Windows) and basic tools.
Mainstream OSs:
According to the application fields: desktop OS, server OS, host OS, and
embedded OS
Mail:
ERP:
Web Server:
Typical HTTP servers are the Apache HTTP server of the Apache Software
Foundation, Internet Information Server (IIS) of Microsoft, and Google Web
Server (GWS).
H3C server products are classified into industrial standard servers and mission
critical servers. Industrial standard servers are further divided into H3C industrial
standard servers and HPE industrial standard servers.
H3C industrial standard servers include self-owned rack, tower, and blade
servers. The G3 server (corresponding to Huawei V5 server) is developed by
H3C, and the new brand is UniServer. The G2 servers (corresponding to Huawei
V3 servers) are all HPE OEM products, except the R4900 G2, which is proprietary.
The HPE industrial standard server is a full series, including rack, blade, tower,
and high-density servers.
The mission critical servers are all manufactured by HPE, including HPE Unix
servers and x86 mission critical servers based on the new-generation
minicomputer architecture. The servers are classified into dynamic servers, x86
servers for mission critical services, and fault-tolerant servers. The dynamic
servers and fault-tolerant servers use Intel Itanium processors.
Product & solution portfolio: intelligent servers (rack servers, high-density
servers, blade servers, and mission-critical servers), AI application computing
platform Atlas, and ARM servers.
1. C
3. Typical HTTP servers are the Apache HTTP server of the Apache Software
Foundation, Internet Information Server (IIS) of Microsoft, and Google Web
Server (GWS).
A server is a mainstream computing product developed in the 1990s. It can provide
network users with centralized computing, information release, and data
management services. In addition, a server can share the hard disks, tape drives,
printers, modems, and dedicated communication devices connected to it with
network users.
As an important node on the network, a server stores and processes 80% of the
data and information on the network. Therefore, it is called the soul of the network.
Functions of network terminal devices must be implemented through servers. In
other words, these devices are organized and led by servers.
B/S is short for Browser/Server. You need to install only one browser, such as
Netscape Navigator or Internet Explorer, on a client, and install the Oracle, Sybase,
Informix, or SQL Server database on the server. In this structure, the user interface
is implemented completely using the WWW browser. Some transaction logic is
implemented at the browser, but the main transaction logic is implemented at the
server. The browser exchanges data with the database through Web Server.
Service scale
Entry-level server: Entry-level servers are connected to limited terminals and
have poor stability, scalability, fault tolerance, and redundancy performance.
Therefore, entry-level servers apply only to small-sized enterprises that do
not have large-scale database data exchange, have small network traffic, and
do not need servers to run continuously for long periods. These servers do not
have many server features.
Workgroup server: Workgroup servers can connect to users of a workgroup
(about 50 clients). The servers have small network scales and low
performance. The servers meet small- and medium-scale network users'
requirements on data processing, file sharing, Internet access, and simple
database applications.
Department-level server: Department-level servers not only have all features
of workgroup servers, but also offer comprehensive management functions
for monitoring and managing circuits. The servers can monitor parameters
such as the temperature, voltage, fan, and chassis, so that management
personnel can learn server working status in a timely manner based on
standard server management software. In addition, most department-level
servers have excellent system scalability, so that the online system upgrade
can be supported when users' service volumes increase. This maximizes the
return on investment (ROI).
Enterprise-level server: Enterprise-level servers are high-end servers. An
enterprise-level server uses the symmetric CPU structure of at least four CPUs,
and provides independent dual-PCI channels, memory expansion board
design, and high memory bandwidth. It supports hot-swappable PSUs
and large-capacity hard disks, and provides super strong data processing
capability and high cluster performance.
Servers have been widely used in various fields, such as the telecom carrier,
Internet service provider (ISP)/Internet content provider (ICP), government, finance,
education, enterprise, and e-commerce. Servers can provide users with the file,
database, email, web, and File Transfer Protocol (FTP) services.
PCs have low computing capabilities (single processor) and their storage
capacity is small and hard to expand. A PC can be used by only one person at a
time in keyboard, mouse, and monitor mode. PCs work separately and run for
only several hours at a time. They do not have redundant components or
monitoring.
Similar to the PC structure, a server consists of the mainboard, CPU, hard disk,
memory, and system bus. Servers are customized based on specific
applications. With the development of the information technology and
network, users have higher requirements on the data processing capabilities
and security of information systems. A server differs from a PC in the
processing capabilities, stability, reliability, security, scalability, and
manageability.
Server hardware includes the CPU, mainboard, dual in-line memory module
(DIMM), hard disk, PCIe card, chassis, PSU, and fan.
The figure uses RH2288 V2, RH2288H V2, or RH1288 V2 as an example.
This slide focuses on the logical architecture of the server, the relationship
between modules, and how modules interact with each other.
The CPU connects to the DIMM using double data rate (DDR) signal cables. The
number of DIMMs and DIMM rate supported by the CPU vary depending on the
CPU specifications.
For example, each CPU of the RH2288H V2 server supports 12 DIMMs at a rate
of 1333 MHz or 1066 MHz.
The CPU is the core processing unit of a server, and a server, as an important
device on the network, needs to process a large number of access requests.
Servers must therefore have high throughput and robust stability and support
long-term running. The CPU is the brain of a computer and is the primary
indicator for measuring server performance.
The processing of the CPU can be divided into the following four stages: fetch,
decode, execute, and writeback. The CPU retrieves instructions from the memory
or cache, stores the instructions in an instruction register, and decodes and
executes the instructions.
CPU dominant frequency
The dominant frequency, also known as the clock speed, is measured in MHz or
GHz and indicates the frequency at which a CPU computes and processes data.
In most cases, the higher the dominant frequency, the faster the CPU processes
data.
The CPU dominant frequency is calculated using the following formula: CPU
dominant frequency = External frequency x Multiplication factor. The
dominant frequency has a certain relationship with the actual operation
speed, but it is not a simple linear relationship. The CPU computing speed
depends on the performance indicators of the CPU pipeline and bus.
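As a small sketch of the formula above (the external frequency and multiplication factor used here are hypothetical figures, not the specifications of any particular CPU):

    # CPU dominant frequency = external frequency x multiplication factor
    external_frequency_mhz = 100   # hypothetical external (base) frequency
    multiplication_factor = 30     # hypothetical multiplier
    dominant_frequency_ghz = external_frequency_mhz * multiplication_factor / 1000
    print(f"Dominant frequency: {dominant_frequency_ghz} GHz")   # 3.0 GHz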
CPU external frequency
The external frequency is the reference frequency of a CPU, measured in MHz.
The CPU external frequency determines the running speed of the mainboard.
Overclocking is the process of making a computer or component operate
faster than the clock frequency specified by the manufacturer. However,
overclocking is not allowed for the server CPU. The CPU computing speed
depends on the operating speed of the mainboard, and the CPU and
mainboard operate synchronously.
Bus frequency
The bus frequency is also known as the front side bus (FSB) frequency and
directly affects the speed of exchanging data between a CPU and a DIMM.
The data bandwidth is calculated using the following formula: Data
bandwidth = (Bus frequency x Data bit)/8. The maximum bandwidth of data
transmission depends on all transmitted data bits and transmission frequency.
For example, the Nocona CPU supports 64-bit, and the FSB frequency is 800
MHz. Therefore, the maximum bandwidth of data transmission is 6.4 GB/s.
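The Nocona example above can be checked with a short sketch of the bandwidth formula (all figures are taken from the text):

    # Data bandwidth = (bus frequency x data bits) / 8
    bus_frequency_mhz = 800   # FSB frequency in the example
    data_bits = 64            # bus width in the example
    bandwidth_mb_per_s = bus_frequency_mhz * data_bits / 8
    print(f"{bandwidth_mb_per_s / 1000} GB/s")   # 6.4 GB/s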
L1 cache
The L1 cache includes data cache and instruction cache. The capacity and
structure of the built-in L1 cache greatly affect the CPU performance. The
cache is composed of static RAM, which has a complex structure. If the CPU
chip area is not large, the capacity of the L1 cache cannot be too large.
Generally, the L1 cache capacity of a server CPU is 32 KB to 256 KB.
L2 cache
There are internal and external L2 caches. The speed of the internal L2 cache
is the same as that of the dominant frequency, but the speed of the external
L2 cache is only half that of the dominant frequency. The L2 cache capacity
affects the CPU performance. The larger the cache capacity is, the better the
CPU performance is. In the past, the maximum L2 cache capacity for home
computers was 512 KB, and that for laptops could reach 2 MB. The L2 cache
capacity for servers and workstations can reach 8 MB or higher.
L3 cache
The L3 cache can further reduce memory latency and improve CPU
performance when computing a large amount of data. Increasing L3 cache
can significantly improve server performance. A configuration with a larger L3
cache makes it more efficient to utilize physical memory resources, allowing
the disk I/O subsystem to process more data requests. A processor with
larger L3 cache can provide more efficient file system cache and a shorter
message and processor queue length.
Instruction sets
CISC instruction set: also called complex instruction set. In the CISC
microprocessor, instructions of a program and operations in each instruction
are executed in sequence.
RISC instruction set: This instruction set is developed based on the CISC
instruction system. Compared with the CPU that adopts the CISC, the CPU
that adopts the RISC simplifies the instruction system and improves the
parallel processing capability by using the superscalar and superpipelining
structure. The RISC instruction set is the development trend of high-
performance CPUs.
EPIC instruction set: This instruction set serves as an important step for Intel
processors to move to the RISC system. Intel uses the advanced and powerful
EPIC instruction set to develop the 64-bit OS-based IA-64 microprocessor
architecture.
Multi-core CPU
A multi-core CPU is a CPU with two or more independent computing engines
(cores). The development of multi-core technology stems from engineers'
recognition that simply increasing the speed of a single-core chip generates too
much heat without delivering corresponding performance improvements, as
previous processor products showed. A multi-core processor is a single chip (a
single silicon die) that can be inserted directly into a single processor socket, yet
the operating system treats each execution core as a discrete logical processor
with all its associated resources. By dividing tasks between the execution cores,
a multi-core processor performs more tasks in a given clock cycle.
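A rough sketch of dividing work between execution cores, using Python's standard process pool (the workload and the two-way split are hypothetical; it simply shows two cores each handling half of a CPU-bound task):

    from concurrent.futures import ProcessPoolExecutor

    def partial_sum(block):
        # CPU-bound work that one core can execute independently
        return sum(x * x for x in block)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        blocks = [data[:500_000], data[500_000:]]   # one block per core
        with ProcessPoolExecutor(max_workers=2) as pool:
            results = list(pool.map(partial_sum, blocks))
        print(sum(results))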
Storage, an important computer component, is used to store programs and
data. A computer can work normally only when storage is available. Storage is
classified into main memory and external storage by purpose. The main memory,
referred to as the internal storage (that is, memory), is the storage space that the
CPU can address directly and is made of semiconductor components. The memory
features a fast access rate.
As a main computer component, the memory stands in contrast to the external
storage. Programs, such as Windows OSs, typing software, and game software, are
generally installed on external storage such as hard disks, but they cannot run from
there directly. To use a program's functions, the program must be loaded into the
memory and executed there; in fact, we enter text or play a game in the memory.
Bookshelves and bookcases for storing books are like the external storage, while
the desk is like the memory. Permanent or large amounts of data are generally
stored in the external storage, while temporary or small amounts of data and
programs are stored in the memory. Memory performance affects the computer's
operating speed.
Common DIMM manufacturers
Three major dynamic random access memory (DRAM) vendors: Samsung, SK
Hynix, and Micron
Module vendors: Kingston and Ramaxel purchase DRAM chips to
manufacture DIMMs.
Dual-channel: The dual-channel architecture includes two independent and
complementary intelligent memory controllers, and the two memory controllers
can operate simultaneously without waiting time between each other, which
doubles the memory bandwidth.
Memory interleaving: This technology divides the main memory into two or more
sections, and the CPU can quickly address these sections without waiting. It is used
to organize the memory modules on the server mainboard to improve memory
transmission performance.
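A toy sketch of the interleaving idea (the mapping is an assumption for illustration: consecutive 64-byte cache lines alternate between two channels; real memory controllers use more elaborate schemes):

    LINE_SIZE = 64      # bytes per cache line (assumed)
    NUM_CHANNELS = 2    # dual-channel configuration

    def channel_of(address):
        # Consecutive cache lines land on alternating channels, so
        # sequential accesses keep both channels busy at the same time.
        return (address // LINE_SIZE) % NUM_CHANNELS

    for addr in (0, 64, 128, 192):
        print(hex(addr), "-> channel", channel_of(addr))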
Registered memory: Registered ECC SDRAM memory has two or three dedicated
integrated circuit chips, called register ICs. These chips improve current drive
capabilities and enable IA servers to support large-capacity memory.
Online spare memory:
When a multi-bit error occurs on the primary or extended memory, or a
physical memory fault occurs, the server continues to run.
The spare memory takes over the work of the faulty memory.
The spare memory area must be larger than or equal to the memory capacity
of any other area.
Memory mirroring:
Mirroring provides data protection for the system in the case of multi-bit
errors or physical memory faults to ensure normal system running.
Data is written to the memory areas of both mirrors at the same time but is
read from only one of them.
UDIMM: The address and control signals of the controller are directly sent to the
DIMM.
The server often uses UDIMM with a temperature sensor and the error
checking and correcting (ECC) function.
RDIMM: The address and control signals of the controller are sent to the DRAM
chip through the Register. The clock signals of the controller are sent to the DRAM
chip through the PLL.
LRDIMM supports more than eight ranks for a channel. This feature can
improve the system memory capacity.
What is a rank?
Answer: The bit width of the interface between the CPU and the DIMM is 64 bits.
Each memory chip is 4-bit or 8-bit wide. Therefore, multiple memory chips must be
combined to form a data collection that is 64-bit wide to interconnect with the
CPU. A rank is a 64-bit wide data area on a DIMM.
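A small sketch of the arithmetic behind a rank, using the 64-bit interface and the 4-bit/8-bit chip widths stated above (the ECC comment reflects common server DIMMs and is an assumption here):

    DATA_WIDTH = 64   # bit width of the CPU-DIMM interface, as stated above

    for chip_width in (4, 8):
        chips_per_rank = DATA_WIDTH // chip_width
        print(f"x{chip_width} chips: {chips_per_rank} chips form one rank")
    # x4 chips: 16 per rank, x8 chips: 8 per rank; ECC DIMMs carry extra
    # chips for the additional 8 check bits.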
Check details of the configuration on the webpage of Huawei Server Product
Memory Configuration Assistant:
http://support.huawei.com/onlinetoolweb/smca/?language=en
If there is only one DIMM, it must be inserted into slot 0 (the slot farthest
from the CPU) of the specified channel.
If you want to insert single-rank, dual-rank, and four-rank DIMMs in the form
of 2DPC, start from the farthest slot with the highest-rank DIMM.
Frequency (1333 -> 1600 -> 1866 -> 2133 -> 2400)
Solid-state drive (SSD): A hard disk made up of a solid-state electronic storage
chip array. An SSD consists of a control unit and a storage unit (flash or DRAM
chip). An SSD is identical to a common hard drive in interface specification and
definition, function, usage, and product shape and size. SSDs are widely used in
the fields of military, industrial control, video surveillance, network surveillance,
power generation, medical care, and aviation, as well as on vehicle-mounted
devices, network terminals, and navigation devices.
Strengths: fast in reading and writing, shockproof and anti-drop, low power
consumption, noise-free, wide range of operating temperatures, portable
Weaknesses: small capacity, limited lifecycle, expensive
Hybrid hard drive (HHD): A hybrid hard drive combines an HDD with an SSD,
using small-capacity flash memory chips to store frequently accessed files. The
magnetic disks remain the main storage medium, while the flash memory chips
serve as a buffer that holds frequently accessed files to reduce seek time and
improve efficiency.
Strengths: faster storage and retrieval of application data (such as word
processing files), faster system startup, lower power consumption, lower heat
generation, longer lifecycle, prolonged battery life for laptops and tablets,
lower working noise
Weaknesses: longer seek time of the hard disks, more frequent spin changes
of the hard disks, no data recovery possible if the flash memory module fails,
higher system hardware costs
A hard disk drive (HDD) is made up of one or more magnetic platters (made of
aluminum or glass), a magnetic head, a rotating shaft, a control motor, a magnetic
head controller, a data converter, an interface, and a cache.
In the early stage, hard disk ports included IDE and SCSI ports. With the
development of hard disk technologies, such ports are no longer used.
The mainstream hard disk interfaces include SATA, SAS, FC (not used by servers),
and PCIe. (Huawei servers mainly use SAS and SATA.)
SATA 1.0: 1.5 Gbit/s; SATA 2.0: 3 Gbit/s; SATA 3.0: 6 Gbit/s;
Rotational speed: The rotational speed is the number of rotations made by hard
disk platters per minute. The unit is revolutions per minute (rpm). In most cases,
the rotational speed of a hard disk reaches up to 5400 rpm or 7200 rpm. The hard
disk that uses the SCSI interface reaches up to 10,000-15,000 rpm.
Data transfer rate: The data transfer rate of a hard disk is the speed at which the
hard disk reads and writes data. It is measured in MB/s. The data transfer rate of a
hard disk consists of the internal data transfer rate and the external data transfer
rate.
The power consumption of SSDs is slightly lower than that of HDDs (except
PCIe SSDs). A 3.5-inch HDD consumes more power than a 2.5-inch HDD of
the same type, and an HDD with a larger capacity consumes more power. An
HDD with a higher rotational speed consumes more power. An SSD with a
larger capacity consumes more power.
The life of HDDs is limited only by the number of load/unload cycles. The life of
SSDs is determined by the NAND storage media. P/E is the erase life per cell. TLC
has the lowest P/E count and is not used in enterprise-level applications in most
cases. Mainstream SSDs use MLC and eMLC. Select an appropriate SSD
based on the read/write service requirements.
The mean time between failures (MTBF) of a 10K/15K SAS HDD is almost the
same as the MTBF of an SSD. An NL HDD has a relatively lower MTBF. MTBF
indicates the failure rate, and the failure rate of NL HDDs is higher. The cloud
disk MTBF is about 0.8 million hours, and its reliability is lower than that of NL
HDDs.
UBER indicates the uncorrectable bit error rate. The UBER of SSDs is two
orders of magnitude better than that of HDDs, so SSDs have stronger fault
recovery capability than HDDs.
HDDs have worse vibration resistance and impact resistance than SSDs.
In harsh environments, SSDs have an advantage over HDDs.
Redundant Array of Independent Disks (RAID) technology organizes multiple
independent disks into one logical disk, thereby improving disk read/write
performance and data security. Large-capacity disks are expensive. The basic idea
of RAID is to combine multiple small-capacity and inexpensive disks to obtain the
same capacity, performance, and reliability as expensive large-capacity disks at a
low cost. As disk costs and prices continued to decline, RAID could be built from
most disks, and being inexpensive was no longer the focus. Therefore, the RAID
Advisory Board (RAB) decided to replace "inexpensive" with "independent", and
RAID became the Redundant Array of Independent Disks. This was only a change
in the name, not a change in substance.
A related concept is the RAID level. The usable capacity depends on the RAID level
chosen, and capacity utilization varies with the RAID level.
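A minimal sketch of how usable capacity depends on the RAID level, using the standard capacity rules for the levels discussed in this section (disk count and size are hypothetical):

    def usable_capacity_tb(level, n_disks, disk_size_tb):
        # Standard capacity rules for common RAID levels
        if level == 0:
            return n_disks * disk_size_tb          # striping, no redundancy
        if level == 1:
            return disk_size_tb                    # full mirror
        if level == 5:
            return (n_disks - 1) * disk_size_tb    # one disk's worth of parity
        if level == 6:
            return (n_disks - 2) * disk_size_tb    # two disks' worth of parity
        if level == 10:
            return n_disks * disk_size_tb / 2      # striped mirrors
        raise ValueError("unsupported RAID level")

    for level in (0, 1, 5, 6, 10):
        print(f"RAID {level}: {usable_capacity_tb(level, 4, 4)} TB usable from 4 x 4 TB disks")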
Stripe: A stripe is formed by strips in the same positions (or with the same
numbers) on multiple disk drives in a disk array.
Application scenarios:
RAID 0: suitable for scenarios that require high read/write speed but low
security, such as graphics workstations
RAID 1: suitable for scenarios that require random data writes and high
security, such as servers, databases, and storage devices
RAID 1E: suitable for scenarios that require data transmission and high
security, such as video editing, large-scale database storage
RAID 5/6: suitable for scenarios that require random data transmission and
high security, such as financial and database storage
RAID 10: suitable for scenarios that have high requirements on random
read/write and security, such as banking and financial scenarios
Data parity: Redundant data is used to detect and rectify data errors. The
redundant data is usually calculated through Hamming check or XOR operations.
Data parity can greatly improve the reliability, performance, and error tolerance of
the disk arrays. However, the system needs to read data from multiple locations,
calculate, and compare data during the parity process, which affects system
performance. Each RAID level uses one or more of such technologies to achieve
different data reliability, availability, and I/O performance. You need to
comprehensively evaluate reliability, performance, and costs, and then select a
proper RAID level (or new level or type) or RAID mode based on system
requirements.
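As a small sketch of the XOR calculation mentioned above: the parity strip is the XOR of the data strips, so any single lost strip can be rebuilt from the remaining strips and the parity (the strip contents here are arbitrary example bytes):

    from functools import reduce

    # Three data strips and their XOR parity
    strips = [b"\x11\x22\x33", b"\x44\x55\x66", b"\x0a\x0b\x0c"]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

    # Simulate losing strip 1 and rebuilding it from the survivors plus parity
    survivors = [strips[0], strips[2], parity]
    rebuilt = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))
    assert rebuilt == strips[1]
    print("strip 1 rebuilt:", rebuilt.hex())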
The two solutions achieve data protection. For details on the principles, see the
next slide.
Battery:
When the system is powered off unexpectedly, data in the DDR memory is
retained.
The backup battery unit (BBU) supplies power to the DDR to ensure that the
self-refresh function of the DDR is normal.
The data is stored for a limited period, which is usually 48 hours to 72 hours.
During working, the battery needs to be discharged periodically, which
affects performance for about 4 hours to 9 hours.
The battery life is greatly affected by the environment.
SuperCAP:
After the system is powered off unexpectedly, data is transferred from the
DDR to the NAND flash in the flash card.
The supercapacitor supplies power to the controller, DDR, and flash card to
ensure that data can be transferred to the flash card.
After the data transfer is complete, the supercapacitor does not need to be
charged. Data can be stored permanently.
The supercapacitor can be charged or discharged in a short period, which has
little impact on system performance.
During operation, the capacitance decreases continuously, but it remains
sufficient for normal working over the entire life cycle.
Direct connection to hard disks:
Expander:
The Expander mode is often used for rack servers that focus on storage.
Software RAID does not provide the following functions:
Hot swap
PCI: used in the scenarios that require a high data transmission rate, such as
digital graphics, image and voice processing, and high-speed real-time data
collection and processing. The PCI bus can solve the problem about the low
data transmission rate of the original standard bus.
Application type: Based on the computer type, network cards can be classified into
workstation network cards and server network cards.
High rate: Servers are used to process big data computing and have
demanding requirements on the network card rate, such as 10 Gbit/s or
25 Gbit/s. Some high-performance servers require 100 Gbit/s.
Low CPU usage: If the CPU has to respond to network cards frequently, the
speed of processing other tasks decreases. Server network cards have built-
in control chips that can take over some tasks from the CPU, helping reduce
CPU overhead.
LOM is integrated in the PCH chip of the server mainboard and is not replaceable.
It does not occupy the PCIe slot of the server.
To improve compatibility of network cards, PCIe defines standard card sizes (form
factors). Vendors develop network cards based on these sizes so that the cards
can be installed in standard PCIe slots.
To make full use of the space, Huawei develops the non-standard flexible I/O card
dedicated to Huawei servers.
A mezzanine card is designed for blades only. It does not provide external physical
ports. All signals are transmitted through the blade backplane. Mezzanine cards of
different vendors cannot be interchanged.
A network card provides two types of physical ports: electrical port and optical
port.
Electrical port: It is the network port seen on a common PC. It is one type of
RJ45 port and connects with common network cables.
Optical port: It connects to an optical module. The port for housing the
optical module is called an optical cage.
Optical modules can be classified into SFP+, SFP28, and QSFP+ based on the
encapsulation mode. SFP+ and SFP28 have the same structure and are compatible
with each other. SFP28 supports a high rate of 25G, whereas SFP+ supports only
10G. The appearance of QSFP+ differs greatly from that of SFP+. QSFP+ supports
a rate higher than 40G.
The direct attach cable (DAC) is a direct copper cable. Its module head is
integrated with the cable, and no optical module needs to be configured. The
cable has a large attenuation. Generally, the cable length is 1 m, 3 m, or 5 m.
However, the cable is cheap and is the best solution for short-distance
transmission.
An active optical cable (AOC) functions as two optical modules + optical fibers and
is also an integrated cable. This type of cable features high data transmission
reliability but is expensive.
Currently, the mainstream chips used by Huawei network cards come from Intel,
Broadcom, Cavium, and Mellanox.
ATX standard: for entry-level servers or workstations. The output power ranges
from 125 W to 350 W. Generally, a 20-pin dual-row rectangular socket is used to
supply power to the mainboard. The power supply specification of the Pentium 4
processor platform is ATX12V, which adds a 4-pin 12 V power output to better
meet the power supply requirements of the Pentium 4 processor.
The Server System Infrastructure (SSI) standard is a power supply standard for IA
servers. It is formulated to standardize the power supply technology of servers,
reduce development costs, and extend the service life of the servers. The SSI
standard specifies the power supply specifications, backplane specifications,
chassis specifications, and heat dissipation system specifications of servers.
Power supply redundancy modes:
1+1: In this mode, each module provides 50% of the output power. When
one module is removed, the other provides 100% of the output power.
2+1: In this mode, three modules are required. Each module provides 1/3 of
the output power. When one module is removed, each of the other two
modules provides 50% of the output power.
Note: When the system power is large, three modules can be used to implement
2+1 redundancy, and two modules can be used to implement non-redundancy.
However, the system power must be less than the sum of the power of the two
modules minus the current equalization error. When the system power is less than
the power of one module, two modules can be used to implement 1+1. When the
system power is greater than the power of one module, 1+1 may cause overload.
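A rough sketch of the note above, checking which redundancy plan a given system power allows (the module rating and the current-equalization margin are hypothetical figures):

    MODULE_POWER_W = 800        # rated output of one PSU module (hypothetical)
    EQUALIZATION_ERROR_W = 50   # assumed current-equalization margin

    def psu_plan(system_power_w):
        if system_power_w < MODULE_POWER_W:
            return "2 modules, 1+1 redundancy"
        if system_power_w < 2 * MODULE_POWER_W - EQUALIZATION_ERROR_W:
            return "3 modules, 2+1 redundancy (or 2 modules, no redundancy)"
        return "more or larger modules required"

    for load_w in (600, 1200, 2000):
        print(load_w, "W ->", psu_plan(load_w))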
When the OS needs hardware to do some work, the BIOS controls the device
hardware to complete the work.
The BIOS functions above are not listed in any particular order.
CMOS and RTC are two key concepts related to BIOS.
The default RTC time is the local time set at the factory. When the time is modified
during OS installation or usage, the RTC time is synchronized automatically so that
the time remains continuous after a system power-off.
The RTC uses a physical crystal oscillator, which has some deviation. In scenarios
that require high time precision, the OS needs to synchronize time with an NTP
clock source periodically. For more about NTP, see https://en.wikipedia.org/wiki/NTP.
UEFI
UEFI is developed based on EFI 1.10. In 2005, Intel handed EFI over to the open
UEFI Forum for management. The major contributors to UEFI are Intel, Microsoft,
and AMI. UEFI uses modularity, dynamic linking, and C-language parameter-stack
passing to build the system, getting rid of the traditional, complex 16-bit assembly
code of BIOS.
A great advantage of UEFI is that it is as easy to use as a Windows interface.
With UEFI, a mouse can be used in addition to the keyboard, and the modules
for adjusting functions are similar to those of a Windows program. UEFI can be
considered a small-sized Windows-like system.
Functions
Larger disk capacity: The GPT partition format in the UEFI standard supports
hard disks with a size of over 100 TB and 100 primary partitions, which is
especially useful for Windows 7 users.
Higher performance: UEFI can run on any 64-bit processor, has a large
addressing capability, and delivers excellent performance. To put it simply, you
can attach more hardware and boot into Windows faster.
64-bit system: Starting from Vista SP1 and Windows Server 2003, all 64-bit
systems can be started through UEFI, whereas Windows XP and 32-bit
systems are started only through the compatible module of UEFI.
BMC is a small-sized OS independent of the server system. It is used for remote
management, monitoring, installation, and restart of servers. BMC is integrated on
the mainboard or is inserted into the mainboard through PCIe. BMC is presented
as a standard RJ45 network port with an independent IP address. For common
maintenance, use a browser and enter the IP address and port to log in to the
management interface. Server clusters use BMC commands to perform large-
scale unattended operations.
Highlights:
KVM: The KVM module receives video data from x86 systems over the video
graphics array (VGA) port. Then it compresses the video data and sends the
compressed data to a remote KVM client over the network. Besides, the KVM
module receives keyboard and mouse data from the remote KVM client.
Then it transmits the data to x86 systems by using a simulated USB keyboard
and mouse device.
Black box: The black box receives running track information from x86 systems
over the PCIe port and provides an interface for exporting the recorded
information.
LPC: iBMC communicates with x86 systems through LPC ports and supports
standard IPMI ports.
History
In 1998, Intel, Dell, HP, and NEC jointly proposed the IPMI specifications, which
enable remote monitoring of parameters such as temperature and voltage over
the network.
In 2001, the IPMI was upgraded from 1.0 to 1.5, and the PCI Management Bus
feature was added.
In 2004, Intel released IPMI 2.0 specifications, which are compatible with IPMI 1.0
and 1.5 specifications. The Console Redirection feature is added, and the server
can be managed remotely through the port, modem, and LAN. In addition, the
security, VLAN, and blade server support are enhanced.
The core of IPMI is a dedicated chip/controller BMC (server processor or
baseboard management controller), which does not depend on the processor,
BIOS, or operating system of the server. It is an independent agent-free
management subsystem that runs independently in the system. BMC can work as
long as the BMC and IPMI firmware are available. BMC is usually an independent
board installed on the server mainboard. Currently, the server mainboard supports
the IPMI. IPMI has a good autonomous feature, which overcomes the limitations of
the previous management mode based on the operating system. For example,
even when the OS does not respond or is not loaded, the server can still be
powered on or off, and information can still be extracted.
IPMI's Serial Over LAN (SOL) feature changes the transmission direction of the
local serial port during the IPMI session, thereby providing remote access to the
emergency management service, Windows dedicated management console, or
Linux serial console. This provides a standard way to remotely view the boots, OS
loader, or emergency management console to diagnose and fix server-related
problems. This is a vendor-independent way to diagnose and repair faults.
Users do not need to worry about the security of command transmission. The IPMI
enhanced authentication (based on the SHA-1 and HMAC) and encryption
(Advanced Encryption Standard and Arcfour) functions help implement secure
remote operations. The support for VLANs facilitates the configuration and
management of private networks and can be configured based on channels.
The converged infrastructure solution consists of servers, data storage devices,
network devices, IT infrastructure management, automation, and service process
software.
Three types of operating systems are server operating systems, desktop operating
systems, and embedded operating systems.
Server OSs include Windows, Linux, UNIX, and more. Each operating system has
different versions. We only need to know some common server OSs and versions.
What is the benefit of the server OS compared with the single-user OS?
The commonly used versions include 3.11, 3.12, 4.10, V4.11, and V5.0. The
mainstream version is NetWare 5, which supports all important desktop OSs (DOS,
Windows, OS/2, Unix, and Macintosh) and the IBM SAA environment. It provides a
high-performance integrated platform for enterprises and institutions that need
complex network computing using products from multiple vendors. NetWare is a
multi-task and multi-user network operating system. Its later versions provide the
system fault tolerance (SFT) capability. The open protocol technology (OPT) is used.
The combination of various protocols enables different types of workstations to
communicate with the public server. This technology meets the requirements of
users for communication between different types of networks, and implements
seamless communication between different networks. That is, various network
protocols are closely connected, which facilitates the communication with various
minicomputers and mainframe computers. NetWare does not require a dedicated
server. Any type of PC can be used. NetWare servers have better support for
diskless stations and games, and are often used in teaching networks and game
halls.
OpenStack is a community, a project, and a piece of open-source software that
provides an operating platform and tool set for deploying clouds.
It is an open-source cloud computing management platform project, and its major
components are combined to complete specific tasks. OpenStack supports almost
all types of cloud environments. The project objective is to provide a cloud
computing management platform which is easy, scalable, and standard.
OpenStack manages data center resources and simplifies resource allocation. It
manages the following resources:
Compute resources: OpenStack controls large pools of compute, storage, and
networking resources across the data center and manages them via the
OpenStack API. This gives administrators control and allows users to provision
resources through the web interface (see the sketch after this list).
Storage resources: Due to performance and price requirements, many
organizations cannot meet the requirements of traditional enterprise-class
storage technologies. Therefore, OpenStack can provide configurable object
storage or block storage functions based on user requirements.
Network resources: Nowadays, data centers involve a large number of
devices, including servers, network devices, storage devices, and security
devices. They will be divided into more virtual devices or virtual networks. As
a result, the number of IP addresses, route configurations, and security rules
will increase explosively. Traditional network management technologies do
not have high scalability and automation capabilities. Therefore, OpenStack
provides network and IP address management in plug-in, scalable, and API-
driven mode.
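A minimal sketch of inspecting these three resource types through the OpenStack API, assuming the openstacksdk Python library and a cloud named "mycloud" configured in clouds.yaml (both names are placeholders):

    import openstack

    # Connect using credentials defined in clouds.yaml (cloud name is a placeholder)
    conn = openstack.connect(cloud="mycloud")

    # Compute resources: list existing servers
    for server in conn.compute.servers():
        print(server.name, server.status)

    # Storage resources: list block-storage volumes
    for volume in conn.block_storage.volumes():
        print(volume.name, volume.size, "GB")

    # Network resources: list networks
    for network in conn.network.networks():
        print(network.name)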
Scalability and elasticity are the main design objectives;
Eventual consistency is accepted and applied wherever possible;
Exchange Server is a well-designed mail server product that provides all the
necessary email services. In addition to the conventional SMTP/POP protocol
services, it also supports the IMAP4, LDAP, and NNTP protocols. Exchange Server
comes in two editions: the standard edition includes Active Server, a network news
service, and a series of interfaces connecting to other mail systems; the enterprise
edition provides, in addition to the functions of the standard edition, an email
gateway for communicating with IBM OfficeVision, X.400, VM, and SNADS.
Exchange Server supports web-based email access.
Answers:
1. ABCD
2. T
A cluster is a type of parallel or distributed processing system. It consists of a
collection of interconnected stand-alone computers working together as a single,
integrated computing resource. These computers work together, run a series of
common applications, and present a single system image to users and
applications. Externally, the cluster is a single system and provides
unified services. Internally, computers in the cluster are physically connected by
using cables, and are logically connected by using cluster software. These
connections offer the computers load balancing and failover capabilities, which are
not possible on a single computer.
Advantages of a cluster:
Improved performance: Some computing-intensive applications require
powerful computing capabilities. In this case, a cluster is suggested.
Reduced cost: A computer cluster can deliver better performance at a lower
cost than a general computer.
Improved scalability: Conventionally, users had to replace their servers with
expensive, newer ones to upgrade the system capacity. With the cluster
technology, you only need to add new servers to the cluster.
Enhanced reliability: The cluster technology enables the system to continue
operating properly in the event of a failure, minimizing the system downtime
and improving the system reliability.
High scalability: Server clustering is highly scalable. As the demand and load
increase, more servers can be added to the cluster. In this configuration, multiple
servers execute the same application and database operations.
High availability (HA): HA refers to a system's ability to prevent system faults or
automatically recover from faults without any human intervention. By transferring
the applications on a faulty server to a backup server, the cluster system can
increase the uptime to 99.9% (a quick downtime calculation follows this list), thus
greatly minimizing the system downtime.
High manageability: The system administrator can remotely manage one or even a
group of clusters in the same way as managing a single-node system.
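To make the 99.9% figure above concrete, a quick sketch of the allowed downtime per year at several availability levels:

    HOURS_PER_YEAR = 365 * 24

    for availability in (0.99, 0.999, 0.9999):
        downtime_hours = HOURS_PER_YEAR * (1 - availability)
        print(f"{availability:.2%} uptime -> about {downtime_hours:.1f} hours of downtime per year")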
Clusters are classified into the following types:
The FPGA is generally used to build digital circuits. The logic and I/O blocks in the
FPGA can be reconfigured as required. It also offers static reprogramming and
online dynamic system reconfiguration, so that hardware functions can be
modified by programming, just like software. It is no exaggeration to say that the
FPGA can be used to implement any function of digital devices, ranging from high-
performance CPUs to 74-series circuits. The FPGA is like a piece of white paper or a
pile of building blocks, allowing engineers to design digital systems freely using
traditional schematic input methods or hardware description languages.
Major SSD providers:
Intel: Since 2009, Intel has occupied a large enterprise market share with its
SATA SSDs. However, the PCIe SSD launched in 2012 did not fare as well as
expected. Intel then became a market dominator with its NVMe SSDs. Its sales
revenue in 2015 was $1.44 billion.
Samsung: It holds a smaller share of the enterprise SSD market than Intel.
Data center SATA SSDs are its bread-and-butter products. Samsung provides
SAS SSDs as an OEM for EMC. In 2014, it stepped into the PCIe SSD market.
WD: It targets the high-end storage market. As a subsidiary of WD, HGST
uses Intel chips and sells WD SSDs (to customers including EMC, Dell, and
HP) in the SAS market.
SanDisk: In recent years, SanDisk has made great efforts in the enterprise
market, building on its large PC OEM sales volume. Its customers include
Dell, HP, and NetApp. After acquiring Fusion-io, the company failed to
integrate the acquisition and was finally taken over by WD.
Toshiba: In 2015, Toshiba started its line of PCIe SSDs, which deliver high
performance.
Let's look at the intelligent SSD controller chips. We started the R&D of SSD
controller chips 13 years ago and have developed four generations of chips and
seven generations of SSD products. The latest generation of the intelligent SSD
controller chip features 16 nm process, PCIe NVMe and SAS convergence, PCIe 3.0
& SAS 3.0, PCIe hot plug, intelligent acceleration, multi-stream, atomic write, QoS,
and super wear leveling algorithm, prolonging the service life by 20%.
The intelligent converged network interface card chip has been under development
since 2004 and is now in its third generation. The third-generation intelligent converged network
chip features 16 nm process, Ethernet and FC convergence, 25GE to 100GE Ethernet,
16G to 32G FC networks, 48 built-in programmable data forwarding cores, OVS and
RoCE v1/v2 protocol offload, 15 Mpps OVS forwarding performance, and SR-IOV.
Proprietary chip as the core and multi-protocol acceleration: Huawei iNIC adopts
the new-generation ASIC network controller (Hi1822) that supports 2 x 100G or 4 x
25G ETH ports and PCIe 3.0 x16 interfaces. It supports industry-leading 15 Mpps
OVS offload, setting the industry benchmark for Elastic Cloud Server (ECS) network
performance.
Computing acceleration with 15% of CPU resources offloaded: The Huawei iNIC
supports C programming and uses a proprietary programmable network engine to
accelerate cloud network and storage services and optimize infrastructure
utilization.
High reliability and in-service upgrade: The Huawei iNIC is available as a
half-height, half-length standard card, facilitating server deployment and O&M.
It also uses a low-power design: the iNIC can be deployed within a 15 W power
budget, which has no impact on the deployment of existing servers. As a result,
the iNIC can be deployed quickly to accelerate services and shorten the TTM of
customers' networks.
Device and service faults occur frequently as ICT systems grow in scale and
service capacity and devices become increasingly scattered. This places higher
demands on ICT O&M, requiring more spending on manpower, time, and funding.
According to Forrester, traditional O&M accounts for 70% of enterprise IT
spending.
Server management software: The management software is layered, and the two
underlying layers are the most important for server management. Standalone
management provides basic server management capabilities; without this layer,
servers cannot be managed. Competitors' BMCs are produced by OEMs and cannot
offer the same flexibility as Huawei's.
The upper layer is the centralized management software, that is, eSight. This
layer delivers value to customers and improves the efficiency of O&M personnel.
This level-1 centralized management layer needs the management support and
some in-band management features of the BMC layer.
Before introducing the BMC, you need to know what platform management is.
Platform management refers to the monitoring and control of the system
hardware. For example, the temperature, voltage, fan, and power supply of the
system are monitored so that adjustment can be made accordingly to ensure that
the system is in a healthy state. The platform management module monitors and
controls the system hardware, records hardware information and logs, and
prompts users to locate faults. The preceding functions can be integrated into a
controller, that is, the BMC.
The BMC is an independent subsystem that consists of its own CPU, a mini
operating system, and management software. Servers require a BMC, whereas
common PCs do not have one.
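As a concrete illustration of querying a BMC for platform management data, the sketch below reads basic system health over the DMTF Redfish REST API. Using Redfish is our assumption (the text does not name a management interface), and the BMC address and credentials are placeholders:

    # Minimal sketch: reading basic health from a BMC via the Redfish REST API.
    # Exact resource paths can vary slightly between BMC implementations.
    import requests

    BMC = "https://192.0.2.10"      # placeholder BMC address
    AUTH = ("admin", "password")    # placeholder credentials

    systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()
    for member in systems.get("Members", []):
        system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
        # PowerState and Status/Health are standard ComputerSystem properties.
        print(system.get("PowerState"), system.get("Status", {}).get("Health"))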
eSight is a new-generation ICT management system developed by Huawei for
enterprises. It can manage networks of different equipment suppliers in different
geographic areas. eSight manages IT devices, network devices, and terminals in a
unified manner. eSight supports integration with mainstream third-party
management systems. Built on our 20 years of experience, eSight is an ICT
lifecycle management system covering ICT installation, routine maintenance,
optimization, and upgrade.
Ansible advantages:
Ansible architecture:
MPP is an older architecture: parallel but tightly coupled, essentially a single
"big iron" system that typically runs only one OS. The cluster architecture, by
contrast, is loosely coupled and consists of a large number of small, independent
nodes. Currently, the best-known MPP system is IBM Blue Gene.
The network topology of an HPC cluster consists of the computing layer, network
layer, and storage layer. In addition to common compute nodes, GPU nodes and
fat nodes can also be deployed at the computing layer according to application
requirements. Moreover, management nodes and login nodes responsible for
scheduling cluster loads and jobs are also deployed at the computing layer. The
network layer consists of the computing network and management network for
high-speed cluster interconnection. The storage layer also consists of storage
management and login nodes. These nodes are connected to the shared storage
system through the high-speed storage network to provide high-bandwidth
storage I/O services.
Traditional HPC technologies and architectures have matured over years of
development. However, with the rise and rapid adoption of cloud computing, deep
learning, and big data technologies, HPC cloudification, HPDA, and deep learning
acceleration on GPU-based heterogeneous HPC are emerging and converging.
More and more customers are providing supercomputing services to users in the
form of cloud computing. Meanwhile, many small and medium-sized enterprises
cannot afford expensive HPC servers because of limited budgets. HPC cloud
services reduce CAPEX by charging users based on their requirements and actual
usage. This type of elastic service is very important.
The continuous improvement of computing capabilities and the emergence
of a large amount of available data drive the emergence and development of
deep learning. Therefore, the HPC technology can be used to develop a new-
generation of deep learning systems to boost the development of deep
learning.
HPDA is applied in the following fields: 1. Consumer behavior analysis and
search ranking analysis in Internet applications. 2. Medical care, logistics
analysis, and financial fraud detection in traditional industries. 3. Product
design and quality analysis in industrial applications.
In addition, with the advent of Sunway TaihuLight, the computing scale of HPC
clusters has reached the 100 PFLOPS magnitude, and EFLOPS computing has become
the next target of the HPC industry. This requires major breakthroughs in fields
such as computing architectures, network technologies, storage protocols,
compilation environments, and power consumption control.
An HPC system generally uses multiple processors on a standalone computer or
multiple computers in a cluster as computing resources. Multiple computers in a
cluster are operated as a single resource. HPC systems range from large-scale
clusters based on standard computers to those based on dedicated hardware.
In automobile, aviation, and chip manufacturing fields, HPC is used for CAE
simulation, which analyzes product mesh models with a large number of polygons.
The compute nodes need to communicate with each other frequently during the
computing process. To prevent processors from being idle and improve
computing efficiency, a low-latency and high-bandwidth network is required for
data transmission between a large number of compute nodes.
Each industry has its own HPC application software and different
requirements on cluster performance and configuration. These application
characteristics must be understood to provide optimal environment for
application deployment.
GPU compute nodes: use GPGPU cards for GPU computing acceleration
Three-plane networking:
Terms:
Up to 16 DDR4 DIMMs
100GE LOM
Reliability and operability design: The quick connectors remain undamaged after
200 removals and insertions in a test.
Smaller granularity
Intel Xeon Phi acceleration coprocessor chip
Theoretical system memory bandwidth:
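The formula itself is not stated in the text; a common rule of thumb (with example numbers that are not from the source) is: peak theoretical bandwidth ≈ number of memory channels × transfer rate × bytes per transfer. For example, 8 DDR4 channels at 2933 MT/s with 8-byte transfers give roughly 8 × 2933 × 8 ≈ 188 GB/s.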
The copy operation is the simplest. It reads a value from a memory unit and then
writes the value to another memory unit.
The scale operation reads a value from a memory unit, multiplies the value by a
factor, and then writes the result to another memory unit.
The add operation reads two values from memory, adds them, and writes the result
to another memory unit. The triad operation combines the copy, scale, and add
operations: it reads two values (a and b) from memory, multiplies one value by a
factor, adds the other value to the product (a + factor x b), and writes the
result to another memory unit.
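These four operations are the kernels of the well-known STREAM memory bandwidth benchmark (the benchmark itself is not named in the text). Below is a minimal NumPy sketch of the four kernels, for illustration only; a real bandwidth measurement implements and times these loops in C or Fortran:

    import numpy as np

    n = 10_000_000
    factor = 3.0
    a = np.random.rand(n)
    b = np.random.rand(n)

    copy_result  = a.copy()          # copy : read a value, write it elsewhere
    scale_result = factor * a        # scale: read, multiply by a factor, write
    add_result   = a + b             # add  : read two values, sum, write
    triad_result = a + factor * b    # triad: copy/scale/add combined (a + factor * b)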
The TCP/IP protocol stack has tens of microseconds of packet RX and TX latency
and causes high CPU usage, which has become the system bottleneck.
To resolve this problem, the Remote Direct Memory Access (RDMA) protocol was
developed to replace the traditional TCP/IP protocol stack. Compared with the
TCP/IP protocol stack, the RDMA protocol allows applications to directly read and
write NICs, greatly reducing the protocol processing time and CPU usage.
However, the RDMA protocol is sensitive to packet loss: a loss rate of just 1‰
(0.1%) can reduce RDMA throughput by 30%, so the protocol places stringent
requirements on the network's packet loss rate.
Traditional TCP/IP:
Data packets need to pass through the OS and other software layers, which
consumes a large amount of resources and memory bus bandwidth.
Large buffers
Packet loss
RDMA:
Zero copy: Data is directly copied from the network port to the application
memory.
The ARM+RoCE network solution eliminates the need for many external NICs and
switches, allowing more investment in computing resources.
Some application scenarios involve reading and writing a large amount of data
and require a large storage capacity and throughput. These application scenarios
include: seismic data processing and reservoir simulation in the oil industry,
meteorological and seismic prediction, satellite remote sensing and mapping,
astronomical image processing, and gene sequence comparison. The HPC cluster
shared storage system well addresses the requirements of these application
scenarios.
Lustre is an open-source, distributed, and parallel file system. It has the following
advantages: 1. Provides a single namespace. 2. Allows capacity and performance
expansion by adding nodes. 3. Supports online expansion. 4. Supports concurrent
read/write operations of multiple clients. 5. Uses distributed locks to ensure data
consistency.
The Lustre parallel file system allows multiple nodes in a cluster to read and write
the same file at the same time. This mechanism greatly improves the I/O
performance of file systems that support parallel I/O applications. It stripes data
across multiple storage arrays and integrates all storage servers and arrays. In
this way, Lustre builds a huge, scalable back-end storage pool out of low-cost
hardware.
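To picture the striping just described, the sketch below shows a round-robin mapping of fixed-size file stripes onto object storage targets (OSTs). The stripe size and stripe count are hypothetical values chosen for illustration; this shows the concept, not Lustre's implementation:

    STRIPE_SIZE = 1 << 20    # 1 MiB per stripe (assumed)
    STRIPE_COUNT = 4         # file striped across 4 OSTs (assumed)

    def ost_for_offset(file_offset: int) -> int:
        # Return the index of the OST holding the byte at this file offset.
        stripe_index = file_offset // STRIPE_SIZE
        return stripe_index % STRIPE_COUNT

    # The first 1 MiB lands on OST 0, the next on OST 1, and so on, which is why
    # several clients can read and write different parts of one file in parallel.
    print(ost_for_offset(0), ost_for_offset(5 * (1 << 20) + 7))   # -> 0 1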
GPFS allows all nodes in a cluster to access data in the same file and provides a
unified file storage space. Different from other cluster file systems, the GPFS file
system supports concurrent and high-speed file access for applications on multiple
nodes to achieve outstanding performance, especially when a large amount of
data is operated in sequence. Although typical GPFS applications are designed for
multiple nodes, the performance in single-node scenarios is also improved. GPFS is
ideal for application environments where centralized data access exceeds the
processing capability of the distributed file server.
OpenHPC is a comprehensive HPC software stack and a reference collection of
open-source HPC software components. Version 1.3.3 has passed comprehensive
testing on Huawei ARM servers.
Theoretically, if the source code is available, all HPC applications can be ported to
the ARM platform.
The Huawei HPC solution supports Linux OSs such as RHEL, CentOS, and SLES.
Underlying software for system status and performance monitoring is deployed on
these OSs to achieve efficient cluster resource management and job scheduling. In
addition, the Huawei HPC solution provides multiple parallel libraries, compilers,
mathematical libraries, and development tools to build an efficient parallel running
environment.
The compilers process and link source code to generate executable files. For
different hardware platforms, the compilers apply different compilation
parameters to optimize the source program for better execution efficiency; for
example, GCC-style compilers commonly use optimization flags such as -O3 and
architecture-specific flags such as -march=native (a generic example, not
specific to this solution).
HPC application software uses many common math algorithms. After long periods of
refinement and optimization, these algorithms have been standardized into math
libraries, and different open-source organizations and vendors provide their own
implementations. For example, the BLAS and LAPACK linear algebra interfaces are
implemented by libraries such as OpenBLAS and Intel MKL (a generic example, not
specific to this solution).
The HPC cluster parallel libraries are implementations of the MPI standard: jobs
are processed by multiple processes in parallel, and inter-process
synchronization is achieved through message passing.
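As a concrete example of message passing, the minimal sketch below uses mpi4py, one Python binding of the MPI standard. The choice of mpi4py is ours for illustration; HPC applications more commonly call MPI from C, C++, or Fortran:

    # Each process computes a partial sum; an MPI reduction combines the results.
    # Run with, for example: mpirun -np 4 python partial_sum.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()      # this process's ID
    size = comm.Get_size()      # total number of processes

    n = 1_000_000
    local_sum = sum(range(rank, n, size))                 # disjoint slice per rank

    total = comm.reduce(local_sum, op=MPI.SUM, root=0)    # message passing happens here
    if rank == 0:
        print("total =", total)   # same result regardless of the process count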
The parallel libraries, compilers, and math libraries are located above the OS and
under application software. They are collectively referred to as the parallel
environment. Parallel running of HPC applications depends on proper parallel
libraries, compilers, and math libraries. Optimizing parallel libraries, compilers, and
math libraries can improve application performance in the same hardware
environment.
Key: F, T, F, F, T
Exciting breakthroughs are happening every month, every week, even every day; a
single technology page in a newspaper is usually not enough to cover them.
AI technologies are advancing faster than Moore's Law, with a new generation of
technologies arriving every year.
Massive data generated on edge devices needs to be processed locally and in real
time, driving new demands such as privacy, security, high bandwidth, and low
costs.
Though we may not realize it, the era of edge computing has arrived.
In the intelligent edge era, local devices will have strong computing power to
tackle future challenges.
Edge-cloud integration will open up scattered data silos to enable data flow and
Internet of Everything.
The number of global street lamps will reach 350 million in 2025. These street
lamps are connected to billions of cameras and various environmental sensors.
Each street lamp generates several gigabytes of data every day, which needs to be
analyzed, processed, and stored.
Edge devices are transformed so that they can carry AI. In addition, AI capabilities
on the cloud are also transformed to lightweight systems so that they can adapt to
the software environment, hardware environment, and usage scenarios of edge
devices.
Edge devices have the following characteristics:
Small memory capacity and weak computing power
Requires miniaturized models.
Requires applications that can be quickly started and loaded.
Does not support multithreading.
Cloud computing:
Cloud computing is a computing mode that uses the Internet to share
resources such as computing devices, storage devices, and applications
anytime, anywhere, and on demand.
Fog computing:
According to Cisco's definition, fog computing is a distributed computing
infrastructure oriented to the Internet of Things (IoT). It extends computing
power and data analysis applications to the network edge, enabling
customers to analyze and manage data locally and thereby obtain real-time
results through the connections.
MCC:
Mobile cloud computing (MCC) integrates cloud computing, mobile
computing, and wireless application communication technologies to improve
service quality for mobile users and provide new service opportunities for
network operators and cloud service providers.
MEC:
Mobile edge computing (MEC) is considered a key factor in the evolution of the
cellular base station model. It combines edge servers with cellular base
stations and can operate either connected to or disconnected from remote cloud
data centers.
Cloud computing is centralized and far away from terminal devices such as
cameras and sensors. Deploying computing power on the cloud will cause
problems such as high network latency, network congestion, and service quality
deterioration, which cannot satisfy the requirements of real-time applications.
However, terminal devices usually have limited computing power compared with
the cloud. Edge computing well addresses this problem by extending computing
power from the cloud to edge nodes near terminal devices.
With the intelligent edge in place, edge nodes are managed so that cloud
applications can be extended to the edge. Data on the edge and data on the cloud
work together to support remote management, data processing, analysis,
decision-making, and intelligence. Meanwhile, unified O&M capabilities such as
device/application monitoring and log collection are provided on the cloud,
building a cloud-edge-synergy edge computing solution for enterprises.
Key technologies:
The framework and software stack of intelligent edge computing consist of the
following parts: 1. Hardware acceleration at the bottom layer. 2. Localized,
miniaturized, and lightweight intelligence at the middle layer. 3. Cloud-edge
synergy such as capability delegation, desensitized data upload, and device
management at the upper layer.
Wind, sun, rain, dust, high and low temperature, maintenance difficulties, and low
power consumption
Facial recognition
Edge side:
The recommended edge hardware is Atlas 300 or Atlas 500 (with GPUs).
IEF pushes edge facial recognition, customer flow monitoring, and heat map
applications for deployment.
Low latency: Images uploaded by cameras are processed quickly and locally.
Model training on the cloud: Models are automatically trained and the
Ascend chips are supported.
Optical character recognition (OCR) for finance and logistics
Terminal camera:
Edge side:
Intelligent edge: The cloud centrally pushes edge slicing applications and
manages the entire lifecycle.
Finance and logistics OCR
Industry, benchmark customer, and scenario
Terminal-side HD cameras:
Infrared photography (4 MB to 5 MB)
Production line cell image obtaining
Edge side:
The recommended edge hardware is Atlas 300 or Atlas 500 (with GPUs).
IEF pushes the visual quality inspection model to the edge for deployment.
IEF manages the application lifecycle (with the algorithm iteratively
optimized).
IEF manages containers and edge hardware.
Strengths and benefits of the cloud-edge synergy solution:
Low latency: The model is run locally and the latency of single-model
processing is less than 2s.
Quality inspection: The quality inspection accuracy rate is 100%. The small
image processing latency is 100 ms and will be improved to 60 ms in the
future.
Edge cloud synergy: Edge applications and devices are scheduled and
managed centrally.
Model training on the cloud: Models are automatically trained and the
Ascend chips are supported.
This chip has the strongest inference capability in the industry. A single chip
supports AI inference and analysis for 16 channels of HD videos.
The accelerator card integrates four intelligent chips and can process 64 channels
of videos independently.
The two products consume very little power and are ideal for edge devices and
edge cloud data centers.
This is an edge server used for intelligent analysis and inference.
It has the following features: high density, energy saving, ultra-large storage, and
ultra-high computing power. One such server is equal to multiple common servers.
It is ideal for edge nodes where the conditions of the installation environment are
limited.
This server supports intelligent analysis for 256 channels of videos of people,
vehicles, and other objects.
Size of an STB
Deep learning: abstracts the human brain from an information-processing
perspective to build simple models that are connected into networks in different ways.
Machine learning: allows machines (computers) to learn knowledge.
Deep learning is a newer branch of machine learning, proposed by Hinton and
other scholars in 2006. It is derived from multi-layer neural networks, and its
essence is to combine feature representation with learning. Deep learning trades
interpretability for learning effectiveness.
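To make "multi-layer neural network" concrete, here is a minimal two-layer forward pass in NumPy. This is a generic illustration of the structure only, with random placeholder weights and no training, and is not a Huawei or MindSpore implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    x  = rng.normal(size=(1, 8))       # one input sample with 8 features
    W1 = rng.normal(size=(8, 16))      # layer 1: 8 features -> 16 hidden units
    W2 = rng.normal(size=(16, 4))      # layer 2: 16 hidden units -> 4 outputs

    hidden = np.maximum(0, x @ W1)     # ReLU: the learned feature representation
    logits = hidden @ W2               # output layer
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the 4 outputs
    print(probs)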
The Atlas 800 is preconfigured with the AI system and can be used immediately
out of the box. Customers can focus more on business scenarios without worrying
about the complexity of infrastructure. For example, in a bank, a large number of
credit card applications are processed every day. Generally, a bank specialist can
handle only 50 applications each day. With Atlas 800, a bank specialist can handle
more than 1,200 applications a day.
The PC running MindSpore Studio is connected to the Atlas 200 DK through a USB
port or network port. The Atlas 200 DK consists of the Hi3559C multimedia
processing chip and the Atlas 200 AI accelerator module.
Content processing:
80% of data processed by Internet data centers is unstructured. This issue will
be particularly prominent after 5G is popularized.
Precision marketing:
New retail:
The number of stores connected to Alibaba Ling Shou Tong has exceeded 1
million.
The number of Suning Xiaodian stores reached about 4,000 by the end of
2018.
Vehicle-mounted AI:
Intelligent customer service is applied in the 10086 voice navigation system and
taobao.com as an intelligent e-commerce channel. Intelligent assistants also help
agents quickly understand customers' demands.
Pain points:
Manual handling accuracy and efficiency decrease as working hours accumulate.
The labor cost is high, especially during peak periods of the logistics industry.
The Atlas 200 and Atlas 500 are used to intelligently reconstruct cameras and
access control systems, enabling the surveillance system to provide functions such
as VIP identification, blacklist identification, and conflict warning. The security and
user experience of financial institutions are improved.
This is Huawei's full-stack all-scenario AI portfolio.
MindSpore: unified training and inference framework for the device, edge,
and cloud (independent or collaborative)
China has over 1.6 million kilometers of high-voltage transmission lines and
over 4 million transmission towers and poles. The stability of power grids is
vital to the development of the country and people's livelihoods.
For reliable power supply, transmission lines need regular inspections. The
traditional method is risky and consumes a lot of labor and resources.
Solution:
The Atlas 200 enables real-time surveillance, analysis, and risk warning at the
front end, improving timeliness, reducing manual workload, and increasing
accuracy.
The Atlas 200 has a low-power design. The entire camera consumes 8 W,
runs on solar power, and is maintenance-free throughout its lifecycle.
Business challenges:
The customer is about to deploy metro line 17 and wants to build a facial
recognition system. The system will serve purposes such as risk warning
against large passenger traffic, specific personnel identification and
monitoring, passenger behavior identification and tracking, and accurate
investigation for criminal cases.
The system deployment environment has many restrictions and requires the
devices to be space-saving and support future capacity expansion.
Simplifies preventive maintenance and improves its efficiency.
Locates faults in real time and dispatches resources to handle problems and
control losses.
Business challenges: