
HP Advanced Memory Error Detection Technology

Technology brief

Introduction
SDRAM technology
Memory errors
  Traditional memory error classifications
  Correctable and uncorrectable errors
Why memory errors are increasing
  Server memory capacity is increasing
  DRAM technology is changing
HP Advanced Memory Error Detection Technology
  Enhancements
  Advantages
  HP ProLiant servers supported
Conclusion
For more information





Introduction
Across the industry, memory errors have increased significantly due to the growth in overall server
memory capacity and the increase in the number of bits per DRAM chip. Uncorrectable memory
errors can cause applications and operating systems to crash, so they are costly in terms of downtime
and repairs.
Over the past 18 years, HP has introduced several memory technology innovations to ensure data
reliability and protection. In 1999, we introduced the Pre-Failure Alert notification system to monitor
and predict potential problems with critical components such as system memory modules (DIMMs).
The notification system sends an alert to a system administrator when a DIMM exceeds a predefined
threshold for correctable memory errors. This lets the administrator schedule server maintenance to
replace a DIMM that may fail, avoiding unexpected interruption of business operations.
In the ProLiant System ROM upgrade (version May 2011 or later), we have enhanced protection with
HP Advanced Memory Error Detection Technology. This innovation seeks out specific defects that
either cause performance degradation or significantly increase the probability of an uncorrectable
(non-recoverable) memory condition. By improving the prediction of non-recoverable memory events,
this technology prevents unnecessary DIMM replacements and increases server uptime.
This paper details the enhancements in and advantages of HP Advanced Memory Error Detection
Technology. It begins with a description of Synchronous DRAM (SDRAM) technology and memory
errors, and it explains why memory errors are occurring more frequently.
SDRAM technology
A standard Error Correction Code (ECC) DDR3 DIMM delivers 72 bits at a time to a memory bus. The
72-bit data block (a 64-bit data word and 8 bits of ECC) is called a rank. As shown in Figure 1,
one rank consists of data from nine DRAM chips that provide 8 bits each (called x8 or "by 8" chips)
or 18 DRAM chips that provide 4 bits each (x4 chips). DIMMs are classified as single-rank, dual-rank,
or quad-rank (not shown). Quad-ranked DIMMs can have 72 x4 DRAM chips or 36 x8 DRAM chips,
including ECC chips. Memory manufacturers use multiple ranks to increase the capacity of DIMMs per
memory channel. Today, a quad-ranked DDR3 DIMM with 4 Gb DRAM chips has a usable capacity
of 32 GB.
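
The chip counts and the 32 GB figure above follow from simple arithmetic. The short Python sketch below reproduces them; the function and constant names are illustrative and do not come from any HP tool.

```python
# Rank and capacity arithmetic for an ECC DDR3 DIMM (illustrative names only).
RANK_WIDTH_BITS = 72   # 64 data bits + 8 ECC bits, as described in the text
DATA_WIDTH_BITS = 64


def chips_per_rank(chip_width_bits: int) -> int:
    """Number of DRAM chips needed to supply one 72-bit rank."""
    return RANK_WIDTH_BITS // chip_width_bits


def dimm_capacity_gb(ranks: int, chip_width_bits: int, chip_density_gbit: int):
    """Return (total chips, raw GB including ECC, usable GB)."""
    chips = ranks * chips_per_rank(chip_width_bits)
    raw_gb = chips * chip_density_gbit / 8          # gigabits -> gigabytes
    usable_gb = raw_gb * DATA_WIDTH_BITS / RANK_WIDTH_BITS
    return chips, raw_gb, usable_gb


print(chips_per_rank(8))           # 9 chips for a x8 rank
print(chips_per_rank(4))           # 18 chips for a x4 rank
print(dimm_capacity_gb(4, 4, 4))   # (72, 36.0, 32.0) for a quad-rank x4 DIMM with 4 Gb chips
```

The difference between the raw 36 GB and the usable 32 GB is the capacity consumed by the ECC chips.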

Figure 1: Single-sided and double-sided SDRAM DIMMs and corresponding DIMM rank




Each DDR3 DRAM chip contains billions of memory cells ordered in eight banks (arrays) of rows and
columns. Using a 2 Gb x4 DRAM chip as an example, each bank contains 2^10 (1,024) rows and 2^16
(65,536) columns, totaling more than 256 million cells per bank and more than 2 billion cells per
chip. Each memory cell contains a circuit with a transistor and a capacitor that stores an electrical
charge. The charge state of the capacitor represents binary information: a 1 or 0 data bit.
Capacitors can only store a charge briefly, so they must be recharged (refreshed) thousands of times
per second. The operating voltage of the DIMM determines the level of the electrical charge.
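
As a rough check of the cell counts quoted for the 2 Gb x4 chip, the sketch below multiplies out the geometry. It assumes that each row/column location holds one 4-bit symbol (one bit per data pin of a x4 chip). That assumption is not stated in the text, but it is what makes the row and column counts agree with the quoted totals.

```python
# Cell-count arithmetic for the 2 Gb x4 example chip.
ROWS_PER_BANK = 2 ** 10       # 1,024 rows, from the text
COLUMNS_PER_BANK = 2 ** 16    # 65,536 columns, from the text
BITS_PER_LOCATION = 4         # assumption: one 4-bit symbol per row/column location
BANKS_PER_CHIP = 8            # eight banks, from the text

cells_per_bank = ROWS_PER_BANK * COLUMNS_PER_BANK * BITS_PER_LOCATION
cells_per_chip = cells_per_bank * BANKS_PER_CHIP

print(f"{cells_per_bank:,}")  # 268,435,456 (more than 256 million)
print(f"{cells_per_chip:,}")  # 2,147,483,648 (more than 2 billion, i.e. 2 Gb)
```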
As shown in the read operation in Figure 2, the memory controller sends the address signals (bank,
row, and column) that specify the location of the target DRAM cell. In the designated bank, the row
decoder activates the row (word line) and the column decoder activates the column (bit line). Next,
the capacitor in the target cell sends its stored charge through the bit line to the sense amplifier.
Because the stored charges are very small, the sense amplifiers detect and amplify each charge
before sending the data to the I/O buffer. The sense amplifiers are also responsible for restoring
capacitors to their original state after reading the data. The 4 or 8 bits of data, called a data symbol,
then go through the output data pins to the memory bus.
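
The read sequence described above can be sketched as a toy software model. This is only an illustration of the steps (row activation, sensing, restore, symbol output); the class names and the tiny array sizes are hypothetical and say nothing about real DRAM timing or circuitry.

```python
class ToyBank:
    """A tiny DRAM bank: rows x columns of 4-bit symbols (sizes are illustrative)."""

    def __init__(self, rows: int, columns: int) -> None:
        self.cells = [[0] * columns for _ in range(rows)]

    def read(self, row: int, column: int) -> int:
        # Row decoder activates the word line: the whole row drains into the sense amplifiers.
        sensed_row = self.cells[row]
        self.cells[row] = [None] * len(sensed_row)   # the stored charge is consumed by the read
        # Column decoder selects which sensed symbol goes to the I/O buffer.
        symbol = sensed_row[column]
        # Sense amplifiers restore the capacitors to their original state.
        self.cells[row] = sensed_row
        return symbol


class ToyChip:
    """Eight banks addressed by (bank, row, column), mirroring the address signals in the text."""

    def __init__(self, banks: int = 8, rows: int = 4, columns: int = 8) -> None:
        self.banks = [ToyBank(rows, columns) for _ in range(banks)]

    def write(self, bank: int, row: int, column: int, symbol: int) -> None:
        self.banks[bank].cells[row][column] = symbol & 0xF   # keep only 4 bits (x4 chip)

    def read(self, bank: int, row: int, column: int) -> int:
        return self.banks[bank].read(row, column)


chip = ToyChip()
chip.write(bank=2, row=1, column=5, symbol=0b1011)
print(chip.read(bank=2, row=1, column=5))   # 11
print(chip.read(bank=2, row=1, column=5))   # 11 again, because the row was restored
```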

Figure 2: Representation of DIMM, chip, bank, cell hierarchy per rank


Memory errors
Several events or conditions can cause errors in individual memory cells, in multiple cells in different
rows (a column failure), or in multiple cells in different columns (a row failure). For example, a defect
in a word line or bit line can prevent part of a row or column from receiving a signal, resulting in a
row or column failure. A row or column failure can also result from a failure in the row decoder or
column decoder circuitry. A defect in a sense amplifier can also cause a column failure. Additionally,
several phenomena, called noise sources, can degrade signals en route to the sense amplifiers.


The industry has traditionally classified memory errors by the number of bits affected and the causes
of the errors. But for systems with large memory footprints, it's more meaningful to classify errors as
correctable or uncorrectable. The following sections explain this distinction.
Traditional memory error classifications
Memory errors are commonly classified according to the number of bits affected in a 64-bit data
word. An error in one bit of a data word is a single-bit error. An error in more than one bit of a data
word is a multi-bit error.
Memory errors are also classified as hard or soft depending on what caused them. DRAM defects,
bad solder joints, and data pin issues cause hard errors so that the device consistently returns
incorrect results. For example, a stuck memory cell returns the same bit value, even when a different
bit is written to it. In contrast, soft errors are transient and non-repeating. They can be caused by an
electrical disturbance inside the memory array or on the memory interface.
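
The traditional classifications above can be illustrated with a few lines of Python. The sketch compares the 64-bit word that was written with the word that was read back and counts the differing bits. It also models a stuck cell as a simple example of a hard error. Both helper functions are hypothetical illustrations, not part of any HP firmware.

```python
WORD_MASK = 2 ** 64 - 1   # errors are classified within a 64-bit data word


def classify_by_bit_count(written: int, read_back: int) -> str:
    """Classify an error by the number of bits that differ in a 64-bit word."""
    flipped = (written ^ read_back) & WORD_MASK
    count = bin(flipped).count("1")
    if count == 0:
        return "no error"
    return "single-bit error" if count == 1 else "multi-bit error"


def read_with_stuck_bit(written: int, bit: int, stuck_at: int) -> int:
    """Model a hard error: one cell always returns stuck_at, whatever was written."""
    if stuck_at:
        return written | (1 << bit)
    return written & ~(1 << bit)


print(classify_by_bit_count(0b1010, 0b1011))   # single-bit error
print(classify_by_bit_count(0b1010, 0b0101))   # multi-bit error

# A cell stuck at 1 only produces a visible error when a 0 is written to it.
print(classify_by_bit_count(0b0000, read_with_stuck_bit(0b0000, 3, 1)))   # single-bit error
print(classify_by_bit_count(0b1000, read_with_stuck_bit(0b1000, 3, 1)))   # no error
```

A soft error would show up as a mismatch on one read that does not recur when the same location is rewritten and read again.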
Correctable and uncorrectable errors
The outcome of a memory error depends on whether it can be corrected. Some row failures and
column failures are correctable depending on both the DIMM configuration (x4 or x8) and the error
correction capability of the system. ECC can correct single-bit errors within a single x4 or x8 DRAM
chip, but ECC can only detect a multi-bit error. Only x4 DRAM chips allow the use of advanced
error-correction control technologies (see footnote 1) in server environments.
Advanced error-correction control technologies can detect and correct multi-bit failures in a single x4
DRAM chip. Their algorithms can correct any single-bit or multi-bit errors in a 4-bit symbol, also
known as a symbol error. This allows recovery from a x4 DRAM chip failure. The algorithms can also
detect two symbol errors across two x4 DRAM chips.
Intel Xeon- and AMD Opteron-based systems use advanced error-correction control technologies
to correct one 4-bit symbol error and detect two symbol errors (single-symbol correct, double-symbol
detect). If there is an error in more than two symbols, the technologies may not be able to detect them.
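
To make the symbol-based view concrete, the sketch below counts how many 4-bit symbols of a 72-bit block are in error and maps that count to the single-symbol correct, double-symbol detect behavior described above. It does not implement an actual error-correcting code. The bit-to-chip mapping (symbol i occupies bits 4i to 4i+3 and comes from chip i) is an assumption made for illustration.

```python
SYMBOL_BITS = 4    # one symbol per x4 DRAM chip
BLOCK_BITS = 72    # 64-bit data word plus 8 ECC bits


def symbols_in_error(expected: int, actual: int) -> list:
    """Return the indexes of the 4-bit symbols that differ between two 72-bit blocks."""
    diff = expected ^ actual
    mask = (1 << SYMBOL_BITS) - 1
    return [
        index
        for index in range(BLOCK_BITS // SYMBOL_BITS)
        if (diff >> (index * SYMBOL_BITS)) & mask
    ]


def sscdsd_outcome(bad_symbol_count: int) -> str:
    """Outcome under single-symbol correct, double-symbol detect, as described in the text."""
    if bad_symbol_count == 0:
        return "no error"
    if bad_symbol_count == 1:
        return "corrected"
    if bad_symbol_count == 2:
        return "detected, not corrected"
    return "may go undetected"


block = 0
whole_chip_failure = block ^ 0b1111                  # all 4 bits of symbol 0 flipped
two_chip_error = block ^ (0b0001 | (0b0001 << 4))    # one bit flipped in symbols 0 and 1

print(sscdsd_outcome(len(symbols_in_error(block, whole_chip_failure))))   # corrected
print(sscdsd_outcome(len(symbols_in_error(block, two_chip_error))))       # detected, not corrected
```

Note that the four-bit error confined to one symbol is correctable, while the two-bit error spread across two symbols is not. This is why the number of affected symbols matters more than the number of affected bits.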
Another technology known as Double Device Data Correction (DDDC) can correct errors in two
symbols and detect errors in three symbols (double-symbol correct, triple-symbol detect). This means
that if one DRAM chip fails, but the DIMM remains in operation, DDDC will continue to work even if a
second chip has an error or fails. Intel Xeon systems support DDDC in lockstep memory mode. In
lockstep mode, two channels operate as a single channel so that each write and read operation
moves a cache line two channels wide. Both channels split the cache line to provide 2x 8-bit error
detection and 8-bit error correction within a single DRAM.
Why memory errors are increasing
Two trends increase the likelihood of memory errors in servers:
• Server memory capacity is increasing.
• DRAM technology is changing to meet the demand for higher DIMM storage capacity.
Server memory capacity is increasing
The growth of high-performance computing (HPC) and virtualized IT environments is driving operating
systems to address more memory. This is causing manufacturers to expand the memory capacity of
servers. In the last 5 years, the average memory capacity per server has grown nearly sixfold,
from 5.6 GB to 33 GB per server across all HP ProLiant server lines.

Footnote 1: Intel Single Device Data Correction and IBM Chipkill


Maximum server memory capacity is also increasing to meet the demands of HPC and virtualization.
For example, an HP ProLiant DL580 G7 server fully populated with 32 GB DIMMs contains 2 TB of
system memory, which translates to 18 trillion memory cells.
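
The cell count quoted above can be checked with a short calculation. The sketch assumes binary units (1 TB taken as 2^40 bytes) and one memory cell per stored bit. The ECC figure at the end is an extra estimate based on the 72-bit rank width and is not a number from the text.

```python
# Rough cell count for a server with 2 TB of memory built from 32 GB DIMMs.
TIB = 2 ** 40
GIB = 2 ** 30

memory_bytes = 2 * TIB
dimm_count = memory_bytes // (32 * GIB)
data_cells = memory_bytes * 8             # assumption: one cell per data bit
cells_with_ecc = data_cells * 72 // 64    # estimate: adds the 8 ECC bits per 64 data bits

print(dimm_count)                 # 64
print(f"{data_cells:,}")          # 17,592,186,044,416 (about 18 trillion)
print(f"{cells_with_ecc:,}")      # 19,791,209,299,968
```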
DRAM technology is changing
Memory manufacturers increase DIMM storage capacity by decreasing DRAM feature size
(increasing chip density). As DRAM cells become smaller, manufacturers lower the operating voltage
to increase the memory speed and decrease power use. Memory manufacturers have lowered the
operating voltage for standard DIMMs from 2.5 V to 1.8 V to 1.5 V, and eventually to 1.25 V.
Smaller feature sizes and higher operating frequencies equate to fewer stored charges in the
capacitors. This smaller number of stored charges reduces tolerance to noise sources and makes it
more difficult for sense amplifiers to interpret the bit value of a capacitor's charge accurately. Also,
reducing the number of stored charges makes it easier to change the state of a cell. This, combined
with higher bit density, increases the number of bits that may be affected by an ionizing event, such
as an alpha particle.
HP Advanced Memory Error Detection Technology
Because of higher memory error frequency, some server administrators are unnecessarily shutting
down servers to replace DIMMs that experience correctable errors. The best way to prevent
unnecessary DIMM replacements is to filter out superfluous errors and identify critical errors that can
lead to a shutdown. That's the goal of HP Advanced Memory Error Detection Technology.
Enhancements
The HP Advanced Memory Error Detection Technology algorithm analyzes multiple parameters of
correctable memory error events and intelligently detects when the system is at increased probability
of a non-recoverable, uncorrectable memory error condition.
The algorithm performs calculations on 4-bit and 8-bit symbols instead of analyzing individual bits. It
tracks multiple parameters of correctable memory errors and, after considering several properties of
the DIMM, it decides when to notify the administrator to replace the DIMM. The algorithm does not
prematurely alert customers to replace DIMMs based on single-bit errors because they negligibly
increase the probability of an uncorrectable error.
The algorithm considers different parameters of correctable memory errors for x8 DIMMs than for
x4 DIMMs. This is because advanced error-correction control technologies cannot protect x8
DIMMs against a complete DRAM chip failure. The algorithm also detects bank failures for x4 or x8
DIMMs because these failures may increase the probability of an uncorrectable memory error.
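
The details of the HP algorithm are not published in this brief, so the sketch below is not HP's implementation. It is a purely hypothetical policy, with made-up thresholds and field names, that shows how the three ideas described above could fit together: ignore isolated single-bit errors, treat x8 DIMMs more strictly than x4 DIMMs, and weigh errors concentrated in one bank more heavily.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CorrectableError:
    bank: int           # bank in which the corrected error occurred
    bits_flipped: int   # number of bits in error within one symbol


def should_alert(errors, chip_width, x4_limit=4, x8_limit=2, same_bank_limit=2):
    """Hypothetical replacement policy; all thresholds are illustrative assumptions."""
    # Single-bit errors are ignored: they barely raise the risk of an uncorrectable error.
    multi_bit = [e for e in errors if e.bits_flipped > 1]

    # x8 DIMMs get a lower threshold because a whole-chip failure is not correctable.
    limit = x8_limit if chip_width == 8 else x4_limit
    if len(multi_bit) >= limit:
        return True

    # Multi-bit errors clustered in one bank hint at a bank failure.
    per_bank = Counter(e.bank for e in multi_bit)
    return any(count >= same_bank_limit for count in per_bank.values())


single_bit_noise = [CorrectableError(bank=b % 8, bits_flipped=1) for b in range(100)]
spread_out = [CorrectableError(bank=0, bits_flipped=3), CorrectableError(bank=5, bits_flipped=2)]
same_bank = [CorrectableError(bank=3, bits_flipped=2), CorrectableError(bank=3, bits_flipped=4)]

print(should_alert(single_bit_noise, chip_width=4))   # False
print(should_alert(spread_out, chip_width=4))         # False
print(should_alert(spread_out, chip_width=8))         # True
print(should_alert(same_bank, chip_width=4))          # True
```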
The HP iLO3 management processor sends an alert to the server's administrator when a DIMM
exceeds a predefined threshold for correctable memory errors or experiences an uncorrectable
memory error. The administrator can view a log of correctable and uncorrectable memory error
events through the Integrated Management Log (IML) as shown in Figures 3A and 3B. The
administrator can access the IML using a supported browser, even when the server is off. The
administrator's ability to view the event log when the server is off can be beneficial when
troubleshooting remote host server problems.



Figure 3A: Example of IML event log with Correctable Memory Error alerts


Figure 3B: Example of IML event log with an Uncorrectable Memory Error alert


Advantages
The HP Advanced Memory Error Detection Technology algorithm is better at pinpointing critical
memory errors that can shut down a server. It reduces server downtime by alerting server
administrators only when the server is truly at a higher risk of receiving a non-recoverable
uncorrectable memory error. Server administrators can then better plan downtime to replace
degraded DIMMs, avoiding the unplanned downtime associated with a non-recoverable memory
error.
HP ProLiant servers supported
HP Advanced Memory Error Detection Technology is introduced in the System ROM upgrade (May
2011 or later) for certain Intel Xeon-based ProLiant G6 and G7 platforms and for certain AMD
Opteron-based ProLiant G7 platforms. For a list of specific servers, go to the "For more information"
section. The technology will be implemented in future generations of ProLiant servers.
Conclusion
Since 1999, the HP Pre-Failure Alert notification system has alerted customers to potential failures in
DIMMs that exceed a predefined threshold for correctable memory errors. This has allowed
administrators to schedule server maintenance to replace a DIMM and avoid unexpected interruption
of business operations.
But over the past few years, the number of reported memory errors has increased due to the growth in
server memory capacity and the increase in DRAM chip density. These reported memory errors
include particular errors that do not significantly increase the probability of a non-recoverable memory
condition. As a result, administrators have unnecessarily or prematurely replaced good DIMMs at a
cost of unnecessary downtime and repairs.
In the ProLiant System ROM upgrade (version May 2011 or later), we have enhanced memory error
protection with HP Advanced Memory Error Detection Technology. This innovation monitors several
memory parameters and seeks out specific defects that either cause performance degradation or
significantly increase the probability of a non-recoverable memory condition. By improving the
prediction of critical memory error conditions, this technology prevents unnecessary DIMM
replacement and increases server uptime.

Copyright 2011 Hewlett-Packard Development Company, L.P. The
information contained herein is subject to change without notice. The only
warranties for HP products and services are set forth in the express warranty
statements accompanying such products and services. Nothing herein should
be construed as constituting an additional warranty. HP shall not be liable for
technical or editorial errors or omissions contained herein.

Intel and Intel Xeon are trademarks of Intel Corporation in the United States
and other countries. AMD and AMD Opteron are trademarks of Advanced
Micro Devices, Inc.

TC0000818, July 2011


For more information
Visit the URLs listed below if you need additional information.
Resource description: Certain ProLiant G7-Series Servers - SYSTEM ROM UPGRADE REQUIRED for Certain ProLiant G7-Series Servers Configured with Intel Xeon 5500 Series Processors or Intel Xeon 5600 Series Processors
Web address: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c02914487

Resource description: ProLiant Servers - SYSTEM ROM UPGRADE REQUIRED for ProLiant G6 Servers Configured with Intel Xeon 5500 Series Processors, Intel Xeon 5600 Series Processors, or Intel Xeon 3500 Series Processors
Web address: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c02914394

Resource description: ProLiant Servers - SYSTEM ROM UPGRADE REQUIRED - HP Advanced Error Detection Technology Increases Server Uptime and Is Available Via the May 2011 (or Later) System ROM Upgrade for Certain HP ProLiant G6 and G7 Servers
Web address: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c02914486

Resource description: Memory technology evolution: an overview of system memory technologies
Web address: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00256987/c00256987.pdf

Resource description: DDR3 memory technology
Web address: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02126499/c02126499.pdf


Send comments about this paper to TechCom@HP.com
Follow us on Twitter: http://twitter.com/ISSGeekatHP
