Exadata X10M Serviceability
X10M Serviceability
The following is intended to outline our general product direction. It is intended for
information purposes only and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied
upon in making purchasing decisions.
Copyright © 2023, Oracle and/or its affiliates | Oracle Confidential | For Internal and Authorized
OPN Partners Use Only
HALRTs
X10M Events vs. X9M Events
PREFIX   TYPE    X10M  X9M
SPAMD    ALERT   1     N/A
SPAMD    DEFECT  0     N/A
SPAMD    FAULT   81    N/A
SPINTEL  ALERT   N/A   60
SPINTEL  DEFECT  N/A   42
SPINTEL  FAULT   N/A   196
SPENV    ALERT   22    11
SPENV    DEFECT  2     3
SPENV    FAULT   21    15
Events Breakdown
SPENV Events
MESSAGE        KA Doc ID  RESOURCE   DESCRIPTION
SPENV-8002-2U  2926525.1  DIMM PMIC  DIMM has experienced an overtemperature failure.
SPENV-8002-18  2926524.1  DIMM PMIC  DIMM has experienced an output overvoltage failure.
SPENV-8002-0M  2926492.1  DIMM PMIC  DIMM has experienced an input overvoltage failure.
SPENV-8001-RV  2892977.1  Fan        PSU fan failure.
SPENV-8001-S0  2892966.1  PSU        The output current from the supply exceeded its supported range.
SPENV-8001-TL  2892978.1  PSU        Output voltage from the supply exceeded its supported range.
SPENV-8001-UG  2892979.1  PSU        Output voltage from the supply was below its supported range.
SPENV-8001-M7  2868202.1  CPU/DIMM   Excessive processor or DIMM throttling.
SPENV-8001-QM  2892975.1  PSU        PSU overvoltage input exceeded.
SPENV-8001-VN  2892564.1  PCIe       PCIe device was unexpectedly removed from the system.
SPENV Events (continued)
PCIe Events
X5->X8          >=X9                              AMD            Fault Class
Our Guidance (according to ILOM Code) follows each row.

SPX86A-8002-RK  SPINTEL-8005-YX                   SPAMD-8001-UX  pcie.fatal
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. Escalate (and collect the snapshot)

SPX86A-8009-3J  SPINTEL-8005-WJ                   SPAMD-8001-SJ  data-link-layer-inactive
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. If problem persists, then replace the motherboard
    6. Escalate

SPX86A-8004-FN  SPINTEL-8006-1H                   SPAMD-8001-XC  link-degraded-width
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. If problem persists, then replace the motherboard
    6. Escalate

SPX86A-8004-E6  SPINTEL-8006-0D                   SPAMD-8001-WR  link-degraded-speed
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. If problem persists, then replace the motherboard
    6. Escalate

SPX86A-800A-8S  SPINTEL-8005-X2                   SPAMD-8001-T2  device-init-failed
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. If problem persists, then replace the motherboard
    6. Escalate

SPX86A-8004-TG  SPINTEL-8008-FK, SPINTEL-8008-N0  SPAMD-8001-RE  pcie.correctable
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace the motherboard
    5. If problem persists, then replace only the processor associated with the PCIe device.
    6. Escalate
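The mapping in the table above can be captured as a small lookup, which may help when triaging a message ID from any platform generation. This is a minimal sketch, not an Oracle tool: the names `PCIE_FAULT_CLASSES`, `FAULT_CLASS_BY_MESSAGE`, and `fault_class_for` are hypothetical, and only the message IDs and fault classes come from the table.

```python
from typing import Optional

# Fault class -> message IDs, copied from the PCIe Events table.
# The dictionary layout itself is an illustrative assumption.
PCIE_FAULT_CLASSES = {
    "pcie.fatal": ("SPX86A-8002-RK", "SPINTEL-8005-YX", "SPAMD-8001-UX"),
    "data-link-layer-inactive": ("SPX86A-8009-3J", "SPINTEL-8005-WJ", "SPAMD-8001-SJ"),
    "link-degraded-width": ("SPX86A-8004-FN", "SPINTEL-8006-1H", "SPAMD-8001-XC"),
    "link-degraded-speed": ("SPX86A-8004-E6", "SPINTEL-8006-0D", "SPAMD-8001-WR"),
    "device-init-failed": ("SPX86A-800A-8S", "SPINTEL-8005-X2", "SPAMD-8001-T2"),
    "pcie.correctable": (
        "SPX86A-8004-TG", "SPINTEL-8008-FK", "SPINTEL-8008-N0", "SPAMD-8001-RE",
    ),
}

# Invert to message ID -> fault class for direct lookup.
FAULT_CLASS_BY_MESSAGE = {
    message: fault_class
    for fault_class, messages in PCIE_FAULT_CLASSES.items()
    for message in messages
}


def fault_class_for(message_id: str) -> Optional[str]:
    """Return the fault class for a message ID, or None if it is not in the table."""
    return FAULT_CLASS_BY_MESSAGE.get(message_id.strip().upper())
```

For example, `fault_class_for("SPAMD-8001-UX")` resolves to the same fault class as its Intel counterparts, so the guidance steps can be chosen by fault class rather than by platform.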
Serviceability Assessment Issue Summary
FRU          BUG #     Description
Motherboard  34849371  The black power connector on the 12-disk backplane is not properly secured to
                       the disk backplane. The power connector shroud on the 4-disk and 12-disk
                       backplanes comes loose when power cables are removed.
Motherboard  34849342  The two green captive Torx screws that secure the motherboard tray to the
                       chassis bottom (the hold-down screws in the stiffener) are too long, which
                       makes removing the motherboard with its tray difficult. Bill is phasing in a
                       shorter screw, the same part number as on the CEM riser.
CHASSIS      34849402  The front (FIM area) thermal sensor cable attached to the 4-disk backplane is
                       too short and difficult to detach because of limited access. The suggested
                       fix was to lengthen the cable so that the 4-disk backplane can be removed
                       from the chassis front before the cable is detached. The thermal cable is
                       being removed from the server design.
CEM          34849443  The CEM uses 2 Phillips screws and 1 Torx screw to secure the Ortano card to
                       the CEM; they should all be the same type. The screws will all be of the
                       same type (Torx).
Serviceability Assessment Unresolved Summary
FRU          Description
MOTHERBOARD  The silk screen on the motherboard is missing for half of the memory DIMM
             slots. Will not be changed, as there is no room for the silk screen on the
             motherboard. The memory DIMM map is located on the air baffle.
MOTHERBOARD  The "Fault Remind" button is beneath the air shroud. Will not be changed.
             The baffle must be removed to access the Fault Remind button.
Serviceability Highlights
• DIMM PMIC
• The PMIC (power management IC) is designed for typical DDR5 RDIMM, DDR5 LRDIMM
and DDR5 NVDIMM applications.
• Aura 10
• NVME 30.71TB SFF SSD
• 2RU Servers
• DB Nodes are no longer 1U servers.
• CPU
• The FRU part number supplied in the fault is the Family Model, which does not resolve to the actual FRU part number
• To get the FRU part number, parse the description field
• ASR
• AMD (E5-2L) is the first architecture other than Intel and SPARC.
• No new setup requirements
RAS: Intel versus AMD
The following slides are taken from the presentation “Comparison of Intel X8/X9
and AMD E2/E4 RAS capabilities” by Dave Rudy in May of 2021
CPU/Memory Correctable Errors
• Intel can collect correctable errors via sideband PECI interface
• AMD BIOS, using platform-first error handling (PFEH), will collect correctable errors
and forward to ILOM
• By default, BIOS will send a CE on first occurrence.
• For repeat errors, BIOS will switch to error polling to avoid overrun of the FPGA
RASport.
• Polled errors can be batched, up to 4064 CEs in a single packet.
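The behavior described above (report the first CE immediately, then poll and batch repeats) can be illustrated with a toy model. This is a hedged sketch of the idea only, not the BIOS implementation: the class `CeForwarder`, its methods, and the choice to track "first occurrence" per error source are all assumptions; only the 4064-CE packet limit comes from the slide.

```python
from dataclasses import dataclass, field

MAX_CES_PER_PACKET = 4064  # batch limit quoted on the slide


@dataclass
class CeForwarder:
    """Toy model: the first CE per source is sent at once, repeats are polled."""

    seen: set = field(default_factory=set)       # sources that already reported once
    pending: list = field(default_factory=list)  # repeat CEs waiting for the next poll
    sent: list = field(default_factory=list)     # packets "forwarded to ILOM"

    def on_error(self, source: str) -> None:
        if source not in self.seen:
            # First occurrence: forward immediately as a single-CE packet.
            self.seen.add(source)
            self.sent.append([source])
        else:
            # Repeat error: queue it so the reporting channel is not overrun.
            self.pending.append(source)

    def poll(self) -> None:
        # Drain the queue in packets of at most MAX_CES_PER_PACKET entries.
        while self.pending:
            batch = self.pending[:MAX_CES_PER_PACKET]
            self.pending = self.pending[MAX_CES_PER_PACKET:]
            self.sent.append(batch)
```

In this model, one initial error followed by 5000 repeats from the same (hypothetical) source yields three packets after a poll: the immediate single-CE report, one full packet of 4064, and one packet with the remaining 936.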
CPU/Memory Fatal Errors
CPU/Memory Deconfiguration
PCIe Correctable Errors
• Intel can directly read PCIe AER Registers via PECI to detect and diagnose CEs
• AMD has neither sideband access to the AER registers nor a firmware-first equivalent
that would allow BIOS to intercept errors.
• As a result, there is no support for PCIe correctable error diagnosis from ILOM.
PCIe Fatal Errors
• Intel can directly read PCIe AER Registers via PECI to detect and diagnose fatal
errors.
• On AMD systems, BIOS scrapes for prior-boot PCIe fatal errors and forwards the
results to ILOM.
• Requires that the host is able to boot after the fatal error
• See notes about identifying the failing device on the next pages.
PCIe Inventory
PCIe link up, speed and width checks
• Intel can directly read the PCIe inventory from PECI and check all devices
• On AMD systems, BIOS sends the PCIe inventory to ILOM at boot, over the FPGA
RASport.
• If BIOS does not send the information for a device, ILOM has nothing to check.
• Implication: ILOM cannot report when a device completely fails to train. It can
only report when a device trains at lower speed/width than expected.
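The implication above can be made concrete with a short sketch of a check that only sees what BIOS forwarded. This is an illustration under stated assumptions, not ILOM code: `Link`, `check_links`, and the slot names are hypothetical, and the finding strings reuse the fault class names from the PCIe Events slide.

```python
from typing import NamedTuple


class Link(NamedTuple):
    speed_gts: float  # expected or negotiated link speed in GT/s
    width: int        # lane count


def check_links(expected: dict, reported: dict) -> list:
    """Return findings for devices present in the BIOS-forwarded inventory.

    Both arguments map a slot name to a Link.
    """
    findings = []
    # Only devices that BIOS reported can be checked at all.
    for slot, actual in reported.items():
        want = expected.get(slot)
        if want is None:
            continue  # no expectation recorded for this slot
        if actual.speed_gts < want.speed_gts:
            findings.append(f"{slot}: link-degraded-speed")
        if actual.width < want.width:
            findings.append(f"{slot}: link-degraded-width")
    return findings
```

A device that is expected but absent from `reported` (for example, a card that never trained) produces no finding, which mirrors the limitation described on the slide.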
High Level Comparison of RAS Features
Feature                         Intel X8/X9                     AMD E2/E4
CPU/MEM CEs                     Immediate PECI scrape           Immediate BIOS (PFEH) scrape, fwd to ILOM over FPGA RASport
CPU/MEM fatal UEs               Immediate PECI scrape           Post-reboot BIOS scrape, fwd to ILOM over IPMI
CPU/MEM deconfiguration         Supported                       Memory deconfiguration after fault diagnosis not supported
CPU temperature                 via PECI                        via APML
DIMM temperature                via PECI                        direct i2c to DIMMs
PCIe CEs                        Immediate PECI scrape           Undiagnosed
PCIe fatal UEs                  Immediate PECI scrape           Post-reboot BIOS scrape, fwd to ILOM over IPMI
PCIe inventory                  Collected via PECI              Collected by BIOS, fwd to ILOM over FPGA RASport
PCIe link, speed, width checks  Performed using PECI data       Performed using BIOS forwarded data*
Debugging CPU init failures     MRC codes sent by BIOS to ILOM  ABL codes sent by BIOS to ILOM
Bugs
Thank You
TERMS
• ABL (AGESA Boot Loader) initializes system main memory, locates the compressed BIOS image in
the SPI ROM, and decompresses it into DRAM.
• AER (Advanced Error Reporting) is an optional PCI Express feature that allows more detailed
reporting and control of errors than the basic error reporting scheme.
• AGESA (AMD Generic Encapsulated Software Architecture) is a procedure library developed by
AMD, used to perform the Platform Initialization (PI) on mainboards using their AMD64 architecture.
• AMD (Advanced Micro Devices)
• APML (Advanced Platform Management Link) is an SMBus v2.0 compatible 2-wire processor
slave interface. APML is also referred to as the sideband interface (SBI).
• ADDDC (Adaptive Double DRAM Device Correction) dynamically maps out the failing DRAM
device and continues to provide SDDC ECC coverage on the DIMM, which extends DIMM
life.
• IPMI (Intelligent Platform Management Interface) is a set of standardized specifications for
hardware-based platform management systems that makes it possible to control and monitor
servers centrally.
TERMS (continued…)
• PECI (Platform Environment Control Interface) is an Intel proprietary single-wire serial interface
that provides a communication channel between Intel processors and chipset components and
external system management logic and thermal monitoring devices.
• PFEH (Platform-First Error Handling) is a hardware enforcement mechanism that gives
platform-specific firmware visibility into an error state ahead of the operating system (OS).
Control logic lets the platform-specific firmware decide if and when the error state is visible to
the OS, so the firmware, rather than the OS, makes decisions about the replacement of faulty
components in the system.
• PMIC (Power Management IC) is a power management chip designed for typical DDR5 RDIMM,
DDR5 LRDIMM and DDR5 NVDIMM applications.