
UCS Partner Webinar

Session – 3
Date: 29th Mar ’17

By: BGL-SV TAC team
- Yogesha MG
- Rahul Kumar
- Saurabh Kalra
- Navneet Gupta
- Harish Jamakhandi
Presentation_ID © 2010 Cisco Systems, Inc. All rights reserved. CAE Bootcamp 1
Agenda

• Troubleshooting Server Hardware Issues
• Troubleshooting IOM Issues
• Troubleshooting SAN Boot and SAN Connectivity Issues
• Q&A

Server Hardware Components
• A server can be divided into three major categories:
  - Compute: CPU, memory
  - Storage: storage controller, HDD/SSD
  - Network: mezzanine cards (Cisco VIC)

POST diagnostics & LEDs
• At blade start-up, the POST diagnostics test the CPUs, DIMMs, HDDs, and adapter cards.
• Any failure notifications are sent to Cisco UCS Manager.
• You can view these notifications in the system event log (SEL) or in the output of the show tech-support command.
• If errors are found, an amber diagnostic LED lights up next to the failed component.
• The HDD status LEDs are on the front of the HDD. Faults on the CPU, DIMMs, or adapter cards also cause the server health LED to light solid amber for minor error conditions or blinking amber for critical error conditions.
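The health LED behaviour described above can be summarised as a small lookup. This is only an illustrative sketch: the function name and the string labels are hypothetical (they are not UCS Manager values), and treating any other LED state as "no fault indicated" is an assumption.

```python
# Illustrative mapping of the server health LED patterns described above.
# Keys and labels are hypothetical names chosen for this sketch.
HEALTH_LED_SEVERITY = {
    "solid amber": "minor",
    "blinking amber": "critical",
}


def severity_from_health_led(led_state: str) -> str:
    """Return the error severity implied by the server health LED."""
    key = led_state.strip().lower()
    # Assumption: any other pattern means POST flagged no CPU/DIMM/adapter fault.
    return HEALTH_LED_SEVERITY.get(key, "no fault indicated")
```

For example, `severity_from_health_led("Blinking Amber")` maps to the critical condition, which is the case to escalate first.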
DIMM Issues
• Correctable ECC
• Uncorrectable ECC
• DIMM troubleshooting best practices:
  http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-manager/whitepaper-c11-736116.pdf

Correctable memory error management enhancements
Feature details
• A multiple-year study of field memory failures on Cisco UCS servers concluded that there is no correlation between uncorrectable and correctable errors.
• Single-bit (correctable) ECC errors do not cause the DIMMs to go into a degraded or inoperable state.
• The sensors for correctable faults have been removed; these errors are now recorded in the SEL (system event log).
• Uncorrectable ECC errors cause the blade to show an inoperable state, because they have a sensor threshold of 1.
Customer benefits
• Avoids unnecessary server disruption caused by replacing memory in systems that have only correctable errors.
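The rule above (correctable errors are only logged, while a single uncorrectable error trips the sensor) can be sketched as follows. The class, function names, and returned labels are hypothetical and only illustrate the logic; they are not how UCS Manager is implemented.

```python
from dataclasses import dataclass

# From the slide: uncorrectable ECC errors have a sensor threshold of 1.
UNCORRECTABLE_THRESHOLD = 1


@dataclass
class DimmErrorCounts:
    """Hypothetical per-DIMM error counters used only for this sketch."""
    correctable: int
    uncorrectable: int


def dimm_operability(counts: DimmErrorCounts) -> str:
    """Classify a DIMM; correctable errors alone never change its state."""
    if counts.uncorrectable >= UNCORRECTABLE_THRESHOLD:
        return "inoperable"
    return "operable"


def logged_to_sel_only(counts: DimmErrorCounts) -> bool:
    """True when the DIMM has only correctable errors, which are just logged."""
    return counts.correctable > 0 and counts.uncorrectable == 0
```

A DIMM with many correctable errors and no uncorrectable ones stays operable in this model, which is why it would not be a replacement candidate on that basis alone.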

Correct Installation of DIMMs

CPU issues
• All Cisco UCS servers support 1–2 or 1–4 CPUs. A problem with a CPU can cause a server to fail to boot or to run very slowly, or can cause serious data loss or corruption.
• If the CPU was recently replaced or upgraded, make sure the new CPU is compatible with the server and that a BIOS supporting the CPU was installed.
• When replacing a CPU, make sure to correctly thermally bond the CPU and the heat sink.
• Video: Removing and Installing a CPU and Heat Sink on an Intel Xeon Processor E5-2400 Series
• Video: Removing and Installing a CPU and Heat Sink on an Intel Xeon Processor E5-2600 Series
Cisco UCS B200 M4 blade server to support Intel E5-2600 v4 Series CPUs

Software or Firmware    Minimum Version
Cisco UCS Manager       Release 3.1(1e) with 3.1(1g) ucs-catalog.3.1.1g.T.bin, or 2.2(7b)
Server CIMC             Release 3.1(1g) or 2.2(7b)
Server BIOS             Release 3.1(1g) or 2.2(7b)

Cisco UCS B200 M4 Server Upgrade Guide for E5-2600 v4 Series CPUs:
http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/hw/blade-servers/install/CPU_Upgrade_Guide_v4_Series.html
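The minimum versions in the table above can be checked with a small version comparison. This is a sketch under assumptions: the component keys, function names, and the approach of comparing only within the same release train (2.2 or 3.1) are illustrative choices, and the catalog requirement for Cisco UCS Manager 3.1(1e) is not checked here.

```python
import re

# Minimum versions from the table above, keyed by component and release train.
# Each value is (maintenance number, patch letter), e.g. 3.1(1e) -> (1, "e").
# Note: UCS Manager 3.1(1e) additionally needs the 3.1(1g) catalog (not checked).
MINIMUMS = {
    "ucsm": {(3, 1): (1, "e"), (2, 2): (7, "b")},
    "cimc": {(3, 1): (1, "g"), (2, 2): (7, "b")},
    "bios": {(3, 1): (1, "g"), (2, 2): (7, "b")},
}

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_version(text):
    """Split a string such as '3.1(1g)' into ((3, 1), (1, 'g'))."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor)), (int(maintenance), patch)


def meets_minimum(component, version):
    """Compare a version against the table minimum for the same release train."""
    train, build = parse_version(version)
    minimums = MINIMUMS[component]
    if train not in minimums:
        # Assumption: trains not listed in the table are left to the upgrade guide.
        raise ValueError(f"release train {train} is not covered by the table")
    return build >= minimums[train]
```

For instance, `meets_minimum("cimc", "3.1(1e)")` is false because CIMC needs 3.1(1g), while the same string passes for `"ucsm"`.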

CPU CATERR_N Details
• The CATERR_N signal indicates that one or more of the processors experienced a catastrophic error.
• Possible causes:
  - An uncorrectable fault on one or more memory units.
  - A bus communication fault (communication between the CPU and the system board).
  - An error on the QPI link.
  - An uncorrectable fault on the CPU.
  - Untested or mismatched drivers.
  - Issues in the OS, or running an untested OS.
Troubleshooting CATERR issues
• Collect a CIMC/server tech support.
• Capture a screenshot of the console screen if PSOD, BSOD, or server freeze symptoms are seen.
• Collect a CIMC tech support and an OS tech support after a hard reset is performed from the UCS GUI, or after a warm reboot from the vKVM screen.
• Always run tested drivers as listed on the Cisco UCS HCL: https://ucshcltool.cloudapps.cisco.com/public/

Disk Drive and RAID Issues
• Each disk drive has an activity LED that indicates an outstanding I/O operation to the drive, and a health LED that turns solid amber if a drive fault is detected. Drive faults can be detected in the BIOS POST.
• The CIMC/BMC holds information on the disk drives.
• Use OS tools regularly to detect and correct drive problems (for example, bad sectors).
• StorCLI and MegaCLI are examples of such OS tools. They can also be used to manually put a drive into rebuild without entering the controller WebBIOS, which may require a server reboot.

Identifying and troubleshooting Disk issues
• Components of the storage subsystem: HDDs/SSDs + drive backplane + expander + SAS cables + storage/RAID controller.
• Drive issues may be any of these:
  - Disk failure.
  - Media errors.
  - Other errors.
• Do NOT reseat a failed HDD, as this may cause bad blocks to be rebuilt into the virtual volume.

Interpreting the Status of a Monitored Disk Drive
• Operability: the operational state of the drive.
• Presence: the presence of the disk drive, and whether it can be detected in the server drive bay, regardless of its operational state.

Operability Status   Presence Status   Interpretation
Operable             Equipped          No fault condition. The disk drive is in the server and can be used.
Inoperable           Equipped          Fault condition. The disk drive is in the server, but one of the following could be causing an operability problem:
                                       • The disk drive is unusable due to a hardware issue such as bad blocks.
                                       • There is a problem with the IPMI link to the storage controller.
N/A                  Missing           Fault condition. The server drive bay does not contain a disk drive.
N/A                  Equipped          Fault condition. The disk drive is in the server, but one of the following could be causing an operability problem:
                                       • The server is powered off.
                                       • The storage controller firmware is the wrong version and does not support disk drive monitoring.
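The status table above is effectively a lookup keyed on the (operability, presence) pair. A minimal sketch of that lookup follows; the dictionary and function names are hypothetical, and the cause strings are abbreviated from the table.

```python
# (operability, presence) -> (is_fault, possible causes), taken from the table above.
DISK_STATUS_TABLE = {
    ("operable", "equipped"): (False, []),
    ("inoperable", "equipped"): (True, [
        "disk drive unusable due to a hardware issue such as bad blocks",
        "problem with the IPMI link to the storage controller",
    ]),
    ("n/a", "missing"): (True, [
        "server drive bay does not contain a disk drive",
    ]),
    ("n/a", "equipped"): (True, [
        "server is powered off",
        "storage controller firmware does not support disk drive monitoring",
    ]),
}


def interpret_disk_status(operability, presence):
    """Return (is_fault, possible_causes) for a monitored disk drive."""
    key = (operability.strip().lower(), presence.strip().lower())
    if key not in DISK_STATUS_TABLE:
        # Combinations not listed in the table are not interpreted here.
        raise ValueError(f"combination not covered by the table: {key}")
    is_fault, causes = DISK_STATUS_TABLE[key]
    return is_fault, list(causes)
```

Calling `interpret_disk_status("N/A", "Equipped")` returns a fault with two candidate causes, which is a reminder to check the server power state before suspecting the drive.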
Cisco UCS Manager Reports More Disks in Server than Total Slots Available
• Possible cause: this problem is typically caused by a communication failure between Cisco UCS Manager and the server that reports the inaccurate information.
• Resolution: decommission and re-acknowledge the server.

Troubleshooting Virtual Drives
• A virtual drive (VD) may be degraded for the following reasons:
  - A failed physical drive may cause a virtual volume to go degraded or offline, depending on the number of drive failures the RAID level can tolerate. For example, RAID 5 tolerates 1 drive failure and RAID 6 tolerates 2.
  - A virtual drive can report a cache degraded state when the cache on the storage controller is not being utilized.
  - A failed battery supporting the cache, or a battery in relearn, can cause the cache degraded state.
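The relationship between drive failures and virtual drive state can be sketched as below. Only the RAID 5 and RAID 6 tolerances come from the slide; the function name and the state labels ("optimal", "degraded", "offline") are assumptions for illustration, and the cache degraded condition is not modelled.

```python
# Drive failures each RAID level can tolerate, as stated on the slide.
FAILURE_TOLERANCE = {5: 1, 6: 2}


def virtual_drive_state(raid_level, failed_drives):
    """Estimate the VD state from the number of failed physical drives."""
    if failed_drives < 0:
        raise ValueError("failed_drives cannot be negative")
    if raid_level not in FAILURE_TOLERANCE:
        raise ValueError(f"RAID level {raid_level} is not covered by this sketch")
    if failed_drives == 0:
        return "optimal"
    if failed_drives <= FAILURE_TOLERANCE[raid_level]:
        # Still serving data, but running with reduced or no redundancy.
        return "degraded"
    return "offline"
```

Two failed drives give "offline" for RAID 5 but only "degraded" for RAID 6 in this model.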

• The write back (WB) write cache policy on the virtual drive utilizes the cache module on the storage controller.
  - The WB policy makes writing data to the VD faster, as it reduces the number of write operations to the main disk.
  - The battery on the controller facilitates retention of data in the cache module in case of power loss.
• The write through (WT) policy does not utilize the cache module; write I/O operations are stored directly on the storage array.
  - If the system is susceptible to power loss, it is advisable to use this method.
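Why write back reduces writes to the main disk can be seen in a toy model. This is a deliberately simplified sketch, not controller behaviour: it assumes the cache is large enough to hold every dirty block and is flushed once at the end, and the function names are hypothetical.

```python
def disk_writes_write_through(block_writes):
    """Write through: every write I/O goes straight to the storage array."""
    return len(block_writes)


def disk_writes_write_back(block_writes):
    """Write back: writes land in the controller cache and are flushed later."""
    dirty_blocks = set()
    for block in block_writes:
        # A repeated write to the same block only overwrites the cached copy.
        dirty_blocks.add(block)
    # Assumption: one disk write per dirty block when the cache is flushed.
    return len(dirty_blocks)
```

With five writes that touch only two distinct blocks, the write through count is five and the write back count is two. The data held in cache before the flush is what the controller battery protects during a power loss.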

Consistency Check & Patrol Read
• Consistency check (CC) checks the consistency of the virtual volume.
• Patrol read checks the state of the sectors on the physical drive.
• Both are background processes that generally run during off-peak hours; by default they are scheduled for Saturday 02:00 AM PST.
• If both processes start concurrently, newer controller firmware can restart the process that was not executed at that time.

Troubleshooting Server Power Issues
• For issues related to “Unable to change blade power state”:
  - The blade may not be able to receive power from the chassis backplane.
• Open a CLI session to the UCSM IP, connect to the blade CIMC, and run the command “power”.
• Check the sensor readings on the power rails on the motherboard.

Q&A
