
Exadata X10M TOI

X10M Serviceability

OHD Serviceability Engineering


March 06, 2023
Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for
information purposes only and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied
upon in making purchasing decisions.

The development, release, timing, and pricing of any features or functionality
described for Oracle’s products may change and remains at the sole discretion of
Oracle Corporation.

Copyright © 2023, Oracle and/or its affiliates | Oracle Confidential | For Internal and Authorized
OPN Partners Use Only
HALRTs

There are no new HALRT events!

X10M Events vs. X9M Events

AMD  INTEL  Dictionary
Yes  Yes    ILOM - Self-diagnosis events specific to the ILOM. These are utilized by
            the X5-X9 generations of X86 platforms.
Yes  Yes    ISTOR - Storage specific events detected by the ILOM. These are common
            across both X86 and SPARC platforms.
Yes  Yes    SPENV - This common dictionary will be carried forward for all platforms.
            Intel, AMD and ARM based servers will use this dictionary to call out
            environmental events for things such as fans, power, and thermal.
Yes  No     SPAMD - This dictionary was created for AMD based servers to have their
            platform specific events diagnosed to this dictionary. This will include
            CPU, DIMM, PCIe.
No   Yes    SPINTEL - Dictionary for Intel based servers (X9 and forward) to have their
            platform specific events diagnosed to this dictionary. This will include
            CPU, DIMM, PCIe.
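
The dictionary applicability in the table above can be written as a small lookup. This is a minimal sketch for illustration only, not ILOM code; the names `DICTIONARIES` and `dictionaries_for` are hypothetical.

```python
# Which fault dictionaries apply to each platform family, copied from the
# table above. DICTIONARIES and dictionaries_for are illustrative names only.
DICTIONARIES = {
    "ILOM":    {"AMD": True,  "INTEL": True},
    "ISTOR":   {"AMD": True,  "INTEL": True},
    "SPENV":   {"AMD": True,  "INTEL": True},
    "SPAMD":   {"AMD": True,  "INTEL": False},
    "SPINTEL": {"AMD": False, "INTEL": True},
}


def dictionaries_for(platform):
    """Return the dictionaries whose events can appear on the given platform.

    platform is "AMD" or "INTEL" (case-insensitive); any other value raises
    KeyError.
    """
    key = platform.upper()
    return [name for name, used in DICTIONARIES.items() if used[key]]
```

Calling `dictionaries_for("AMD")` returns the four dictionaries marked Yes in the AMD column, so an X10M fault from any other dictionary would be unexpected.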

X10M Events vs. X9M Events

Dictionary  Event type  X10M (E5) events  X9M events

SPAMD       ALERT       1                 N/A
SPAMD       DEFECT      0                 N/A
SPAMD       FAULT       81                N/A
SPINTEL     ALERT       N/A               60
SPINTEL     DEFECT      N/A               42
SPINTEL     FAULT       N/A               196
SPENV       ALERT       22                11
SPENV       DEFECT      2                 3
SPENV       FAULT       21                15
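
To compare dictionary sizes between the two generations, the counts in the table above can be totalled per dictionary. The sketch below only restates the table; `EVENT_COUNTS` and `totals` are hypothetical names, and `None` stands in for "N/A".

```python
# Event counts copied from the table above as (X10M, X9M) pairs.
# None stands for "N/A". EVENT_COUNTS and totals are illustrative names.
EVENT_COUNTS = {
    "SPAMD":   {"ALERT": (1, None),  "DEFECT": (0, None),  "FAULT": (81, None)},
    "SPINTEL": {"ALERT": (None, 60), "DEFECT": (None, 42), "FAULT": (None, 196)},
    "SPENV":   {"ALERT": (22, 11),   "DEFECT": (2, 3),     "FAULT": (21, 15)},
}


def totals(dictionary):
    """Return (X10M total, X9M total) for one dictionary.

    A column whose entries are all N/A totals to None rather than 0, so
    "not applicable" stays distinguishable from "no events".
    """
    rows = list(EVENT_COUNTS[dictionary].values())
    result = []
    for column in (0, 1):
        values = [row[column] for row in rows if row[column] is not None]
        result.append(sum(values) if values else None)
    return tuple(result)
```

For example, `totals("SPENV")` sums the three SPENV rows for each generation, while `totals("SPAMD")` has no X9M side at all.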

Events Breakdown

• The SPAMD events are as follows:


• CPU = 66
• ILOM = 1
• MEMORY = 8
• PCIe = 8
• Power = 1
• The emphasis is on CPU events.

SPENV Events
MESSAGE        KA Doc ID  RESOURCE   DESCRIPTION
SPENV-8002-2U  2926525.1  DIMM PMIC  DIMM has experienced an overtemperature failure.
SPENV-8002-18  2926524.1  DIMM PMIC  DIMM has experienced an output overvoltage failure.
SPENV-8002-0M  2926492.1  DIMM PMIC  DIMM has experienced an input overvoltage failure.
SPENV-8001-RV  2892977.1  Fan        PSU Fan Failure
SPENV-8001-S0  2892966.1  PSU        The output current from the supply exceeded its supported range.
SPENV-8001-TL  2892978.1  PSU        Output voltage from the supply exceeded its supported range.
SPENV-8001-UG  2892979.1  PSU        Output voltage from the supply was below its supported range.
SPENV-8001-M7  2868202.1  CPU/DIMM   Excessive processor or DIMM throttling
SPENV-8001-QM  2892975.1  PSU        PSU Overvoltage Input Exceeded
SPENV-8001-VN  2892564.1  PCIe       PCIe device was unexpectedly removed from the system

SPENV Events (continued)

MESSAGE        KA Doc ID  RESOURCE   DESCRIPTION
SPENV-8001-Y6  2916278.1  SSD        SSD Cable Incorrect
SPENV-8001-XT  2892548.1  PSU        PSU Mix Models Unidentified
SPENV-8001-W9  2892528.1  PSU        PSU Mix Models Identified

PCIe Events
X5->X8          >=X9             AMD            Fault Class
(The numbered steps under each row are Our Guidance, according to ILOM Code.)

SPX86A-8002-RK  SPINTEL-8005-YX  SPAMD-8001-UX  pcie.fatal
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. Escalate (and collect the snapshot)

SPX86A-8009-3J  SPINTEL-8005-WJ  SPAMD-8001-SJ  data-link-layer-inactive
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. If problem persists, then replace the motherboard
    6. Escalate

SPX86A-8004-FN  SPINTEL-8006-1H  SPAMD-8001-XC  link-degraded-width
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. If problem persists, then replace the motherboard
    6. Escalate

SPX86A-8004-E6  SPINTEL-8006-0D  SPAMD-8001-WR  link-degraded-speed
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. If problem persists, then replace the motherboard
    6. Escalate

SPX86A-800A-8S  SPINTEL-8005-X2  SPAMD-8001-T2  device-init-failed
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace only the CPU associated with the PCIe device.
    5. If problem persists, then replace the motherboard
    6. Escalate

SPX86A-8004-TG  SPINTEL-8008-FK  SPAMD-8001-RE  pcie.correctable
                SPINTEL-8008-N0
    1. Validate SW/FW
    2. Re-seat the PCIe card, perform a visual inspection of the connectors and cables.
    3. If problem persists, then replace the PCIe card in the suspect list.
    4. If problem persists, then replace the motherboard
    5. If problem persists, then replace only the processor associated with the PCIe device.
    6. Escalate
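
Because the same PCIe fault class carries a different message ID on each platform generation, a cross-reference lookup can help when reading mixed fleets. The following is a minimal sketch that only restates the IDs listed above; `PCIE_EVENTS` and `fault_class` are hypothetical names, not part of ILOM or any Oracle tool.

```python
from typing import Optional

# Message IDs and fault classes copied from the PCIe Events table above.
# Tuple layout: (X5->X8 ID, >=X9 Intel IDs, AMD ID, fault class).
PCIE_EVENTS = [
    ("SPX86A-8002-RK", ("SPINTEL-8005-YX",), "SPAMD-8001-UX", "pcie.fatal"),
    ("SPX86A-8009-3J", ("SPINTEL-8005-WJ",), "SPAMD-8001-SJ",
     "data-link-layer-inactive"),
    ("SPX86A-8004-FN", ("SPINTEL-8006-1H",), "SPAMD-8001-XC",
     "link-degraded-width"),
    ("SPX86A-8004-E6", ("SPINTEL-8006-0D",), "SPAMD-8001-WR",
     "link-degraded-speed"),
    ("SPX86A-800A-8S", ("SPINTEL-8005-X2",), "SPAMD-8001-T2",
     "device-init-failed"),
    ("SPX86A-8004-TG", ("SPINTEL-8008-FK", "SPINTEL-8008-N0"),
     "SPAMD-8001-RE", "pcie.correctable"),
]


def fault_class(message_id: str) -> Optional[str]:
    """Return the fault class for a message ID, or None if it is not listed."""
    wanted = message_id.strip().upper()
    for legacy_id, intel_ids, amd_id, klass in PCIE_EVENTS:
        if wanted in (legacy_id, amd_id, *intel_ids):
            return klass
    return None
```

A lookup such as `fault_class("SPAMD-8001-UX")` maps an X10M message ID to the fault class whose guidance applies; IDs not in the table return `None` rather than a guess.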
Serviceability Assessment Issue Summary
FRU          BUG #     Description

Motherboard  34849371  The black power connector on the 12-disk backplane is not properly
                       secured to the disk backplane. The power connector shroud on the
                       4-disk and 12-disk backplanes comes loose when removing power cables.

Motherboard  34849342  The two green captive Torx screws that secure the motherboard tray to
                       the chassis bottom are too long, making it challenging to remove the
                       MB with its tray. The motherboard hold-down captive screws in the
                       stiffener are too long. Bill is phasing in a shorter screw, same part
                       number as on the CEM riser.

CHASSIS      34849402  The thermal cable attached to the 4-disk backplane is difficult to
                       detach due to accessibility. Increase the length of the thermal cable
                       to allow the 4-disk backplane to be removed from the chassis front
                       before detaching the cable. The front (FIM area) thermal sensor cable
                       is too short. The thermal cable is being removed from the server
                       design.

CEM          34849443  The CEM uses 2 Phillips screws and 1 Torx screw to secure the Ortano
                       card to the CEM. They should all be the same type. The screws will
                       all be of the same type (Torx).

Serviceability Assessment Unresolved Summary

FRU          Description
MOTHERBOARD  Silk screen on the motherboard is lacking for half of the memory DIMM
             slots. Will not be changed, as there is no room to place silk screen on
             the motherboard. The memory DIMM map is located on the air baffle.
MOTHERBOARD  The "Fault Remind" button is beneath the air shroud.
             Will not be changed. The baffle will need to be removed when
             accessing the Fault Remind button.

Serviceability Highlights

• DIMM PMIC
• The PMIC (Power Management IC) is designed for typical DDR5 RDIMM, DDR5 LRDIMM
and DDR5 NVDIMM applications.
• Aura 10
• NVMe 30.71TB SFF SSD
• 2RU Servers
• DB Nodes are no longer 1U servers.
• CPU
• To get the FRU part number, the description field has to be parsed.
• The FRU part number supplied in the fault is the Family Model and does not lead to the
actual FRU part number.
• ASR
• AMD (E5-2L) is the first architecture other than Intel and SPARC.
• No new setup requirements

RAS: Intel versus AMD

The following slides are taken from the presentation “Comparison of Intel X8/X9
and AMD E2/E4 RAS capabilities” by Dave Rudy in May of 2021

Thank You Dave!

CPU/Memory Correctable Errors
• Intel can collect correctable errors via sideband PECI interface
• AMD BIOS, using platform-first error handling (PFEH), will collect correctable errors
and forward them to ILOM
• By default, BIOS will send a CE on first occurrence.
• For repeat errors, BIOS will switch to error polling to avoid overrun of the FPGA
RASport.
• Polled errors can batch up to 4064 CEs into a single packet.
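
The effect of batching on RASport traffic can be estimated with simple arithmetic. The sketch below is an illustration under stated assumptions, not a description of the BIOS implementation: it assumes the first CE is sent in its own packet and that every later CE is polled and packed as fully as the 4064-CE limit allows, so the result is a lower bound. `min_packets` is a hypothetical helper name.

```python
MAX_CES_PER_POLLED_PACKET = 4064  # batch limit quoted in the bullet above


def min_packets(total_ces: int) -> int:
    """Lower bound on packets needed to report total_ces correctable errors.

    Assumption (not from the source): the first CE travels alone, and all
    later CEs are polled and batched into completely full packets.
    """
    if total_ces < 0:
        raise ValueError("total_ces must be non-negative")
    if total_ces == 0:
        return 0
    polled = total_ces - 1
    # Integer ceiling division avoids floating-point rounding.
    polled_packets = -(-polled // MAX_CES_PER_POLLED_PACKET)
    return 1 + polled_packets
```

Under these assumptions a burst of 10,000 CEs needs at least `min_packets(10000)` packets instead of 10,000 individual sends, which is the overrun the switch to polling is meant to avoid.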

CPU/Memory Fatal Errors

• Intel can collect fatal errors via sideband PECI interface
• AMD must wait for the host to reboot and BIOS to scrape error registers.
• Implication: if the CPU “dies” an unrecoverable death where it cannot later boot,
there may be limited telemetry on AMD systems
• However, the failure-to-boot diagnosis will still be available to indicate that the chip is dead.

CPU/Memory Deconfiguration

• Intel has multiple ways to deconfigure memory:
• BIOS itself can invoke ADDDC (no longer used)
• ILOM policy can retire a DIMM; retirement is performed by interaction with BIOS
on next boot.
• AMD systems do not have any Oracle policy to deconfigure memory
• However, AMD’s AGESA Boot Loader (ABL) may deconfigure memory that does
not train.
• ABL deconfigures down to a supported memory configuration.

PCIe Correctable Errors

• Intel can directly read PCIe AER Registers via PECI to detect and diagnose CEs
• AMD has neither a sideband address to AER nor a firmware-first equivalent to allow
BIOS to intercept errors.
• As a result, there is no support for PCIe correctable error diagnosis from ILOM.

PCIe Fatal Errors

• Intel can directly read PCIe AER Registers via PECI to detect and diagnose fatal
errors.
• AMD systems have BIOS scrape for prior-boot PCIe fatal errors and forward the
results to ILOM.
• Requires that the host is able to boot after the fatal error
• See notes about identifying the failing device on the next pages.

PCIe Inventory

• Intel can directly probe the PCIe inventory from PECI
• AMD systems have BIOS send the PCIe inventory to ILOM at boot.
• Note that physical chip PCIe port numbering on AMD is different from the logical bus
numbering. A translation step is required.
• PCIe trees can assign dynamic bus numbers, but the error reporting registers
available to BIOS only report the port #, not the logical bus #, making some
errors hard to isolate.

PCIe link up, speed and width checks

• Intel can directly read the PCIe inventory from PECI and check all devices
• AMD systems have BIOS send the PCIe inventory to ILOM at boot, over the FPGA
RASport.
• If BIOS does not send the information for a device, ILOM has nothing to check.
• Implication: ILOM cannot report when a device completely fails to train. It can
only report when a device trains at lower speed/width than expected.
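
The implication above follows from the shape of the check itself: it can only compare devices that BIOS actually reported. The sketch below illustrates that logic and is not ILOM code; `LinkState`, `check_links`, and the slot names used with it are hypothetical, while the two finding labels reuse the fault class names from the PCIe Events table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LinkState:
    speed_gts: float  # link speed in GT/s
    width: int        # number of lanes


def check_links(expected: Dict[str, LinkState],
                reported: Dict[str, LinkState]) -> Dict[str, List[str]]:
    """Compare expected link parameters with the BIOS-forwarded inventory.

    A slot missing from `reported` is skipped: with no data from BIOS there
    is nothing to compare, so a device that never trained goes unreported.
    """
    findings: Dict[str, List[str]] = {}
    for slot, want in expected.items():
        got = reported.get(slot)
        if got is None:
            continue  # BIOS sent no entry for this device
        problems = []
        if got.speed_gts < want.speed_gts:
            problems.append("link-degraded-speed")
        if got.width < want.width:
            problems.append("link-degraded-width")
        if problems:
            findings[slot] = problems
    return findings
```

With an expected device that is absent from the reported inventory, `check_links` returns no finding for it, which mirrors the gap described in the last bullet.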

High Level Comparison of RAS Features
Feature                         Intel X8/X9                     AMD E2/E4
CPU/MEM CEs                     Immediate PECI scrape           Immediate BIOS (PFEH) scrape, fwd to ILOM over FPGA RASport
CPU/MEM fatal UEs               Immediate PECI scrape           Post-reboot BIOS scrape, fwd to ILOM over IPMI
CPU/MEM deconfiguration         Supported                       Memory deconfiguration after fault diagnosis not supported
CPU temperature                 via PECI                        via APML
DIMM Temperature                via PECI                        direct i2c to DIMMs
PCIe CEs                        Immediate PECI scrape           Undiagnosed
PCIe fatal UEs                  Immediate PECI scrape           Post-reboot BIOS scrape, fwd to ILOM over IPMI
PCIe inventory                  Collected via PECI              Collected by BIOS, fwd to ILOM over FPGA RASport
PCIe link, speed, width checks  Performed using PECI data       Performed using BIOS forwarded data*
Debugging CPU init failures     MRC codes sent by BIOS to ILOM  ABL codes sent by BIOS to ILOM

Bugs

• 35136975 - EM Incorrectly Processing HALRT faults from Compute Nodes
Tests indicate the first HALRT is processed correctly, but subsequent HALRT faults
are not processed into incidents, even though they are received by the EM
agent.
New HALRT events are eventually processed (usually the next day), but the time frame is
indeterminate.
• The issue is an SNMP4J issue on the Exadata side.
• Good news: it affects SNMPv3 only, so exposure is limited
• Bad news: it will affect ASR
• Fixed in the April sustaining release.

Thank You
TERMS
• ABL (AGESA Boot Loader) initializes system main memory, locates the compressed BIOS image in
the SPI ROM, and decompresses it into DRAM.
• AER (Advanced Error Reporting) is an optional PCI Express feature that allows for enhanced
reporting and control of errors compared with the basic error reporting scheme.
• AGESA (AMD Generic Encapsulated Software Architecture) is a procedure library developed by
AMD, used to perform the Platform Initialization (PI) on mainboards using their AMD64 architecture.
• AMD (Advanced Micro Devices)
• APML (Advanced Platform Management Link) is an SMBus v2.0 compatible 2-wire processor
slave interface. APML is also referred to as the sideband interface (SBI).
• ADDDC (Adaptive Double DRAM Device Correction) dynamically maps out the failing DRAM
device and continues to provide SDDC ECC coverage on the DIMM, translating to longer DIMM
life.
• IPMI (Intelligent Platform Management Interface) is a set of standardized specifications for
hardware-based platform management systems that makes it possible to control and monitor
servers centrally.

TERMS (continued…)
• PECI (Platform Environment Control Interface) is an Intel proprietary single wire serial interface
that provides a communication channel between Intel processors and chipset components and
external system management logic and thermal monitoring devices.
• PFEH (Platform-First Error Handling) is a hardware enforcement mechanism that gives
platform-specific firmware visibility into an error state ahead of the operating system (OS). A
system includes one or more processor cores, control logic, a plurality of registers,
platform-specific firmware, and an OS. The control logic allows the platform-specific firmware to
decide if and when the error state is visible to the OS, and enables the platform-specific firmware,
rather than the OS, to make decisions about the replacement of faulty components in the system.
• PMIC (Power Management IC) is an IC designed for typical DDR5 RDIMM,
DDR5 LRDIMM and DDR5 NVDIMM applications.

