UCS Partner Webinar Series Part 3

UCS Partner Webinar
Session – 3
Date: 29th Mar’17
By: BGL-SV TAC team

- Yogesha MG
- Rahul Kumar
- Saurabh Kalra
- Navneet Gupta
- Harish Jamakhandi
Presentation_ID © 2010 Cisco Systems, Inc. All rights reserved. CAE Bootcamp 1
Agenda
• Troubleshooting Server Hardware Issues

• Troubleshooting IOM Issues
• Troubleshooting SAN Boot and SAN Connectivity Issues
• Q&A
Server Hardware Components
 A server can be divided into 3 major categories:
 Compute: CPU, memory
 Storage: storage controller, HDD/SSD
 Network: mezzanine cards (Cisco VIC)
POST diagnostics & LEDs
 At the blade start-up, the POST diagnostics test the
CPUs, DIMMs, HDDs, and adapter cards.
 Any failure notifications are sent to Cisco UCS Manager.
 You can view these notifications in the system event log
(SEL) or in the output of the show tech-support
command.
 If errors are found, an amber diagnostic LED lights up
next to the failed component.
 The HDD status LEDs are on the front of the HDD. Faults
on the CPU, DIMMs, or adapter cards also cause the
server health LED to light up as a solid amber for minor
error conditions or blinking amber for critical error
conditions.
DIMM Issues
DIMM troubleshooting best practices
Correctable ECC
Un-Correctable ECC
DIMM troubleshooting Best Practices

http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-manager/whitepaper-c11-736116.pdf
Correctable memory error management
enhancements
Feature details
 A multiple-year study of field memory failures on Cisco UCS servers concluded
that there is no correlation between uncorrectable and correctable errors.
 Single-bit (correctable) ECC errors do not cause the DIMMs to go into a
degraded/inoperable state.
 Sensors for correctable faults have been removed; these events are now
recorded in the SEL (system event log).
 Uncorrectable ECC errors cause the blade to show an inoperable state, as they
have a sensor threshold of 1.
Customer benefits
 Avoids unnecessary server disruption from replacing memory in
systems that show only correctable errors.
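The fault behaviour described above can be sketched in a few lines of Python. This is only an illustration: the `EccEvent` record and the DIMM locators are hypothetical and do not reflect the real SEL format. The only value taken from the slide is the uncorrectable-error sensor threshold of 1.

```python
from dataclasses import dataclass

# Hypothetical record type; real SEL entries have a different, richer format.
@dataclass
class EccEvent:
    dimm: str          # DIMM locator, e.g. "A1" (illustrative)
    correctable: bool  # True for single-bit/correctable ECC errors

UNCORRECTABLE_THRESHOLD = 1  # from the slide: sensor threshold of 1


def dimm_states(events):
    """Map each DIMM seen in `events` to 'operable' or 'inoperable'.

    Correctable errors are only logged and never change the state;
    a DIMM becomes inoperable once its uncorrectable count reaches the threshold.
    """
    uncorrectable = {}
    for ev in events:
        uncorrectable.setdefault(ev.dimm, 0)
        if not ev.correctable:
            uncorrectable[ev.dimm] += 1
    return {
        dimm: "inoperable" if count >= UNCORRECTABLE_THRESHOLD else "operable"
        for dimm, count in uncorrectable.items()
    }


events = [EccEvent("A1", True), EccEvent("A1", True), EccEvent("B2", False)]
print(dimm_states(events))
```

In this sample, DIMM A1 stays operable despite two correctable errors, while a single uncorrectable error marks B2 inoperable.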
Correct Installation of DIMMs
CPU issues
 All Cisco UCS servers support 1–2 or 1–4 CPUs. A
problem with a CPU can cause a server to fail to boot,
run very slowly, or cause serious data loss or
corruption.
 If the CPU was recently replaced or upgraded, make
sure the new CPU is compatible with the server and
that a BIOS supporting the CPU was installed.
 When replacing a CPU, make sure to correctly
thermally bond the CPU and the heat sink.
 Video: Removing and Installing a CPU and Heat Sink
on an Intel Xeon Processor E5-2400 Series
 Video: Removing and Installing a CPU and Heat Sink
on an Intel Xeon Processor E5-2600 Series
Cisco UCS B200 M4 blade server to support Intel
E5-2600 v4 Series CPUs
Software or Firmware | Minimum Version
Cisco UCS Manager | Release 3.1(1e) with 3.1(1g) ucs-catalog.3.1.1g.T.bin, or 2.2(7b)
Server CIMC | Release 3.1(1g) or 2.2(7b)
Server BIOS | Release 3.1(1g) or 2.2(7b)
Cisco UCS B200 M4 Server Upgrade Guide for E5-2600 v4 Series CPUs
http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/hw/blade-servers/install/CPU_Upgrade_Guide_v4_Series.html
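As a rough aid for checking a running version against the minimums listed above, the comparison could be sketched as follows. The function names are hypothetical, and the sketch assumes version strings of the form `major.minor(maintenance + letters)` such as `3.1(1g)`. It compares only within the same release train and deliberately reports any unlisted train as not meeting the minimum so that it gets checked against the upgrade guide by hand.

```python
import re

# Minimum Server CIMC / Server BIOS releases from the table above,
# one entry per release train.
MINIMUMS = ["2.2(7b)", "3.1(1g)"]


def parse_version(text):
    """Parse e.g. '3.1(1g)' into (3, 1, 1, 'g')."""
    m = re.fullmatch(r"(\d+)\.(\d+)\((\d+)([a-z]*)\)", text.strip())
    if m is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maint, patch = m.groups()
    return int(major), int(minor), int(maint), patch


def meets_minimum(version, minimums=MINIMUMS):
    """True if `version` is at or above the minimum listed for its release train.

    Assumption: trains that are not listed return False (verify manually).
    Patch letters are compared as plain strings, which is adequate for
    single-letter patches like 'b' or 'g'.
    """
    v = parse_version(version)
    for minimum in minimums:
        m = parse_version(minimum)
        if v[:2] == m[:2]:           # same release train, e.g. 2.2 or 3.1
            return v[2:] >= m[2:]    # compare (maintenance, patch letter)
    return False


for candidate in ["2.2(6c)", "2.2(7b)", "3.1(1e)", "3.1(1g)"]:
    print(candidate, meets_minimum(candidate))
```

For Cisco UCS Manager itself the 3.1 entry would be `3.1(1e)`, and the separate 3.1(1g) catalog requirement is not modelled here.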
CPU CATERR_N Details
 The CATERR_N signal indicates that one or more of the
processors experienced a catastrophic error.
 Possible causes include:
 An uncorrectable fault on one or more memory units.
 A bus communication fault (communication between the
CPU and the system board).
 An error on the QPI link.
 An uncorrectable fault on the CPU.
 Untested or mismatched drivers.
 Issues in the OS, or running an untested OS.
Troubleshooting CATERR issues
 Collect a CIMC/server tech support.
 Capture a screenshot of the console screen if
PSOD/BSOD/server freeze symptoms are seen.
 Collect a CIMC tech support/OS tech support after a hard
reset is performed from the UCS GUI or a warm reboot
from the vKVM screen.
 Always run tested drivers as listed on the Cisco
UCS HCL: https://ucshcltool.cloudapps.cisco.com/public/
Disk Drive and RAID Issues
 Each disk drive has an activity LED that indicates an
outstanding I/O operation to the drive and a health LED
that turns solid amber if a drive fault is detected. Drive
faults can be detected in the BIOS POST.
 The CIMC/BMC holds information on the disk drives.
 Use OS tools regularly to detect and correct drive
problems (for example, bad sectors).
 StorCli and MegaCli are examples of such OS tools. They
can also be used to manually put a drive into rebuild
without entering the controller WebBIOS, which may
require a server reboot.
Identifying and troubleshooting Disk
issues
 Components of the storage subsystem: HDDs/SSDs,
drive backplane, expander, SAS cables, and
storage/RAID controller.
 Drive issues may be any of these:
 Disk failure.
 Media errors.
 Other errors.
 Do NOT reseat a failed HDD, as this may cause bad
blocks to be rebuilt in the virtual volume.
Interpreting the Status of a Monitored Disk
Drive
 Operability—The operational state of the drive.
 Presence—The presence of the disk drive, and whether it can be detected in the
server drive bay, regardless of its operational state.
Operability Status | Presence Status | Interpretation
Operable | Equipped | No fault condition. The disk drive is in the server and can be used.
Inoperable | Equipped | Fault condition. The disk drive is in the server, but one of the following could be causing an operability problem:
  • The disk drive is unusable due to a hardware issue such as bad blocks.
  • There is a problem with the IPMI link to the storage controller.
N/A | Missing | Fault condition. The server drive bay does not contain a disk drive.
N/A | Equipped | Fault condition. The disk drive is in the server, but one of the following could be causing an operability problem:
  • The server is powered off.
  • The storage controller firmware is the wrong version and does not support disk drive monitoring.
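The status combinations above lend themselves to a simple lookup. The following sketch transcribes the table into a Python dictionary. The function name and the shortened interpretation strings are illustrative, not part of any Cisco tool.

```python
# (operability, presence) -> (is_fault, interpretation), transcribed from the table above.
DISK_STATUS = {
    ("Operable", "Equipped"): (
        False,
        "Disk drive is in the server and can be used.",
    ),
    ("Inoperable", "Equipped"): (
        True,
        "Hardware issue such as bad blocks, or a problem with the "
        "IPMI link to the storage controller.",
    ),
    ("N/A", "Missing"): (
        True,
        "The server drive bay does not contain a disk drive.",
    ),
    ("N/A", "Equipped"): (
        True,
        "Server is powered off, or the storage controller firmware "
        "does not support disk drive monitoring.",
    ),
}


def interpret_disk(operability, presence):
    """Return (is_fault, interpretation) for a monitored disk drive."""
    try:
        return DISK_STATUS[(operability, presence)]
    except KeyError:
        raise ValueError(
            f"combination not covered by the table: {operability!r}, {presence!r}"
        ) from None


is_fault, meaning = interpret_disk("N/A", "Equipped")
print(is_fault, meaning)
```

Combinations that the table does not list raise an error rather than being guessed at.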
Cisco UCS Manager Reports More Disks in
Server than Total Slots Available
 Possible cause: this problem is typically caused by a
communication failure between Cisco UCS Manager and the
server that reports the inaccurate information.
 Resolution: decommission and re-acknowledge the server.
Troubleshooting Virtual Drives
 A VD may be degraded for the following reasons:
 A failed physical drive may cause a virtual volume to go
degraded or offline, depending on the number of drive
failures the RAID level tolerates. For example, RAID 5
tolerates 1 drive failure and RAID 6 tolerates 2.
 A virtual drive can report a cache degraded state when
the cache on the storage controller is not being utilized.
 A failed battery supporting the cache, or a battery in
relearn, can cause the cache degraded state.
 The write back (WB) write cache policy on the virtual drive
utilizes the cache module on the storage controller.
 The WB policy makes writing data to the VD faster, as it
reduces the number of write operations to the main disk.
 The battery on the controller facilitates retention of data
in the cache module in case of power loss.
 The write through policy does not utilize the cache module;
write I/O operations are stored directly on the
storage array.
 If the system is susceptible to power loss, it is
advisable to use write through.
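The relationship between failed drives and virtual drive state described above can be sketched as follows. This is a simplified model: the RAID 5 and RAID 6 tolerances come from the slide, the RAID 0 and two-drive RAID 1 values are added as well-known figures, and the state names and function name are illustrative rather than the exact strings a controller reports.

```python
# Drive failures each RAID level tolerates before the volume goes offline.
# RAID 5 and RAID 6 are from the slide; RAID 0 and RAID 1 (two-drive mirror)
# are added here as well-known values.
FAULT_TOLERANCE = {0: 0, 1: 1, 5: 1, 6: 2}


def vd_state(raid_level, failed_drives):
    """Return 'optimal', 'degraded', or 'offline' for a virtual drive."""
    if failed_drives < 0:
        raise ValueError("failed_drives must not be negative")
    if raid_level not in FAULT_TOLERANCE:
        raise ValueError(f"RAID level {raid_level} is not covered by this sketch")
    if failed_drives == 0:
        return "optimal"
    if failed_drives <= FAULT_TOLERANCE[raid_level]:
        return "degraded"   # still serving I/O, but with reduced or no redundancy
    return "offline"        # more failures than the RAID level tolerates


for level, failed in [(5, 1), (5, 2), (6, 2), (0, 1)]:
    print(f"RAID {level}, {failed} failed drive(s): {vd_state(level, failed)}")
```

Nested levels such as RAID 10 or RAID 50 depend on which drives fail, so they are left out of this sketch.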
Consistency Check & Patrol Read
 Consistency check (CC) checks the consistency of the
virtual volume.
 Patrol read checks the state of sectors on the physical
drive.
 Both are background processes that generally run at
non-peak hours; by default they are scheduled for
Saturday 02:00 AM PST.
 If both processes start concurrently, newer controller
firmware enables a restart of the process that was not
executed at that time.
Troubleshooting Server Power issues.
 For issues related to “Unable to change blade
power state”:
 The blade may not be receiving power from the
chassis backplane.
 Open a CLI session to the UCSM IP, connect to the blade
CIMC, and run the command “power”.
 Check the sensor readings on the power rails on the
motherboard.
Q&A
