Session – 3
Date: 29th Mar ’17
Presentation_ID © 2010 Cisco Systems, Inc. All rights reserved. CAE Bootcamp 2
Server Hardware Components
A server can be divided into three major categories:
CPU
Compute
Memory
Storage controller
Storage
HDD/SSD
POST diagnostics & LEDs
At blade start-up, the POST diagnostics test the
CPUs, DIMMs, HDDs, and adapter cards.
Any failure notifications are sent to Cisco UCS Manager.
You can view these notifications in the system event log
(SEL) or in the output of the show tech-support
command.
If errors are found, an amber diagnostic LED lights up
next to the failed component.
The HDD status LEDs are on the front of each HDD. Faults
on the CPUs, DIMMs, or adapter cards also cause the
server health LED to light up solid amber for minor
error conditions or blinking amber for critical error
conditions.
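The LED behaviour described above can be sketched as a small lookup. This is only an illustration of the rule on the slide, not Cisco UCS Manager code; the `Severity` enum and `server_health_led` function are hypothetical names, and the no-fault label is a placeholder because the slide does not say what the LED shows when there is no fault.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity levels for a CPU, DIMM, or adapter card fault."""
    NONE = "none"
    MINOR = "minor"
    CRITICAL = "critical"


def server_health_led(severity: Severity) -> str:
    """Return the server health LED behaviour the slide describes."""
    if severity is Severity.CRITICAL:
        return "blinking amber"
    if severity is Severity.MINOR:
        return "solid amber"
    # Placeholder: the slide does not state the no-fault LED state.
    return "no fault indication"
```

For example, `server_health_led(Severity.MINOR)` returns `"solid amber"`.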
DIMM Issues
DIMM troubleshooting best practices
Correctable ECC
Un-Correctable ECC
Correctable memory error management
enhancements
Feature details
A multiple-year study of field memory failures on Cisco UCS servers concluded
that there is no correlation between uncorrectable and correctable errors.
Single-bit (correctable) ECC errors do not cause the DIMMs to go into a
degraded/inoperable state.
Sensors for correctable faults have been removed; these events are now recorded
in the SEL (system event log).
Uncorrectable ECC errors cause the blade to show an inoperable state, as they
have a sensor threshold of 1.
Customer benefits
Avoids unnecessary server disruption caused by replacing memory in
systems that show only correctable errors.
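The policy on this slide can be sketched as follows: correctable errors are only logged, while a single uncorrectable error is enough to mark the DIMM inoperable. This is a minimal sketch and not the actual UCS firmware logic; `DimmErrorCounts`, `dimm_operability`, and `logged_to_sel_only` are hypothetical names, and only the threshold value of 1 comes from the slide.

```python
from dataclasses import dataclass

# From the slide: uncorrectable ECC errors have a sensor threshold of 1.
UNCORRECTABLE_THRESHOLD = 1


@dataclass
class DimmErrorCounts:
    """Hypothetical per-DIMM ECC error counters."""
    correctable: int = 0
    uncorrectable: int = 0


def dimm_operability(counts: DimmErrorCounts) -> str:
    """Only uncorrectable errors change the reported state."""
    if counts.uncorrectable >= UNCORRECTABLE_THRESHOLD:
        return "inoperable"
    # Correctable errors never degrade the DIMM, however many there are.
    return "operable"


def logged_to_sel_only(counts: DimmErrorCounts) -> bool:
    """True when the DIMM has correctable errors that are logged but raise no fault."""
    return counts.correctable > 0 and counts.uncorrectable < UNCORRECTABLE_THRESHOLD
```

With these assumptions, a DIMM with 500 correctable errors and no uncorrectable errors is still reported as `"operable"`, which is the case where a replacement would be unnecessary.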
Correct Installation of DIMMs
CPU issues
All Cisco UCS servers support 1–2 or 1–4 CPUs. A
problem with a CPU can cause a server to fail to boot,
run very slowly, or cause serious data loss or
corruption.
If the CPU was recently replaced or upgraded, make
sure the new CPU is compatible with the server and
that a BIOS supporting the CPU was installed.
When replacing a CPU, make sure to correctly
thermally bond the CPU and the heat sink.
Video: Removing and Installing a CPU and Heat Sink
on an Intel Xeon Processor E5-2400 Series
Video: Removing and Installing a CPU and Heat Sink
on an Intel Xeon Processor E5-2600 Series
Cisco UCS B200 M4 blade server to support Intel
E5-2600 v4 Series CPUs
Cisco UCS B200 M4 Server Upgrade Guide for E5-2600 v4 Series CPUs
http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/hw/blade-servers/install/CPU_Upgrade_Guide_v4_Series.html
CPU CATERR_N Details
Disk Drive and RAID Issues
Identifying and troubleshooting Disk
issues
Components of the storage subsystem: HDDs/SSDs,
drive backplane, expander, SAS cables, and
storage/RAID controller.
Drive issues may be any of the following:
Disk failure.
Media errors.
Other errors.
Do NOT reseat a failed HDD, as this may cause bad
blocks to be rebuilt in the virtual volume.
Interpreting the Status of a Monitored Disk
Drive
Operability—The operational state of the drive.
Presence—The presence of the disk drive, and whether it can be detected in the
server drive bay, regardless of its operational state.
Operability Status | Presence Status | Interpretation
Operable | Equipped | No fault condition. The disk drive is in the server and can be used.
Inoperable | Equipped | Fault condition. The disk drive is in the server, but one of the following could be causing an operability problem: (1) the disk drive is unusable due to a hardware issue such as bad blocks; (2) there is a problem with the IPMI link to the storage controller.
N/A | Missing | Fault condition. The server drive bay does not contain a disk drive.
N/A | Equipped | Fault condition. The disk drive is in the server, but one of the following could be causing an operability problem: (1) the server is powered off; (2) the storage controller firmware is the wrong version and does not support disk drive monitoring.
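The status combinations above lend themselves to a simple lookup. The sketch below is only a transcription of the table into Python for illustration; `DISK_STATUS` and `interpret_disk_status` are hypothetical names, not part of any Cisco tool, and the status strings are assumed to arrive as plain text.

```python
# Keys are (operability, presence) in lower case; values summarize the table.
DISK_STATUS = {
    ("operable", "equipped"):
        "No fault: the disk drive is in the server and can be used.",
    ("inoperable", "equipped"):
        "Fault: hardware issue such as bad blocks, or a problem with "
        "the IPMI link to the storage controller.",
    ("n/a", "missing"):
        "Fault: the server drive bay does not contain a disk drive.",
    ("n/a", "equipped"):
        "Fault: the server is powered off, or the storage controller "
        "firmware does not support disk drive monitoring.",
}


def interpret_disk_status(operability: str, presence: str) -> str:
    """Look up the interpretation for a monitored disk drive."""
    key = (operability.strip().lower(), presence.strip().lower())
    # Combinations outside the table are reported rather than guessed at.
    return DISK_STATUS.get(key, "Combination not covered by the table.")
```

For example, `interpret_disk_status("N/A", "Missing")` returns the message for an empty drive bay.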
Cisco UCS Manager Reports More Disks in
Server than Total Slots Available
Troubleshooting Virtual Drives
The write-back (WB) write cache policy on the virtual drive
uses the cache module on the storage controller.
The WB policy makes writes to the VD faster because it
reduces the number of write operations to the main disk.
The battery on the controller facilitates retention of data
in the cache module in case of power loss.
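To see why write-back reduces disk writes, the toy model below compares it with a write-through policy that sends every write straight to disk. It is a deliberately simplified sketch: the class names are hypothetical, the explicit `flush()` call stands in for whatever schedule a real controller uses, and the battery-backed retention is not modelled at all.

```python
class WriteThroughCache:
    """Every write goes to the disk immediately."""

    def __init__(self):
        self.disk = {}
        self.disk_writes = 0

    def write(self, block, data):
        self.disk[block] = data
        self.disk_writes += 1


class WriteBackCache:
    """Writes are held in the cache and reach the disk only on flush."""

    def __init__(self):
        self.disk = {}
        self.dirty = {}  # blocks modified in cache but not yet on disk
        self.disk_writes = 0

    def write(self, block, data):
        # Overwriting a dirty block costs no extra disk write.
        self.dirty[block] = data

    def flush(self):
        for block, data in self.dirty.items():
            self.disk[block] = data
            self.disk_writes += 1
        self.dirty.clear()


def compare_policies(updates=10):
    """Write the same block repeatedly and count the resulting disk writes."""
    wt, wb = WriteThroughCache(), WriteBackCache()
    for value in range(updates):
        wt.write(0, value)
        wb.write(0, value)
    wb.flush()
    return wt.disk_writes, wb.disk_writes
```

In this model, ten updates to the same block cost ten disk writes with write-through and one with write-back. Until `flush()` runs, the latest data exists only in the cache, which is why retention across a power loss matters.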
Consistency Check & Patrol Read
Troubleshooting Server Power issues.
Q&A