
UCS 4+ Years of Lessons Learned

by the TAC
BRKCOM-3010

Ken Krzyzewski - Technical Leader


Cisco TAC Data Center & Virtualization Technical Leadership Team
Agenda
• Introduction
• “What went wrong?” interactive scenarios
• UCS Admin. Best Practices Overview
• Avoiding the Avoidable
• Maximizing UCS TAC interactions
• Conclusion

BRKCOM-3010 © 2014 Cisco and/or its affiliates. All rights reserved. Cisco Public 3
Session Goals
• Learn how to avoid the avoidable
– Leverage UCS best practices
• Reinforce “good habits” (for those already following best practices)
– Learn what can happen when you take shortcuts
• Manage UCS TAC Cases more effectively
– Insider tips for quicker issue resolution

A real UCS customer saga…

Release Notes

Release Notes
Unofficial survey: “Do you read the Release Notes?”

[Pie charts, Yes vs. No: “Customers Response” (slices 7 and 93) and “TAC's Perception” (slices 20 and 80)]

Release Notes
Pay particular attention to:
• Open Caveats
– Resolved Caveats are the typical reason for an upgrade

• New Features
• Internal Dependencies

Release Notes
Mixed Cisco UCS Release Support
• Supported in release 2.1(x) and above
• Allows for independent infrastructure upgrades
• Consult the release notes for details

Release Notes
Mixed Cisco UCS Release Support matrix example

Release Notes
Mixed Release
• Refer to the “Minimum B/C Bundle…Features” section

• New features may require B-bundle upgrade

Release Notes
Caveat details if running in a mixed firmware environment

• Resolved caveats may require B-bundle upgrade

UCS Firmware Upgrades

UCS Firmware Upgrades
Treat them like elective surgery
• Pre-op check-up = Pro-active TAC SR
• The operation = The upgrade
• Recovery room = Verify functionality
• Released from surgical center = Resume production

UCS Firmware Upgrades
Pre-Upgrade Check list
• Consult Release Notes, work with your account team
• Back-up your system
• Review Compatibility Matrices
• Eliminate Critical/Major Faults
• Watch our Video Upgrade Guides
• Check Cisco’s online community and support forums
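
The "back up" and "eliminate Critical/Major faults" items lend themselves to a simple go/no-go gate. The following is a minimal sketch, assuming the open faults have already been exported as (severity, description) pairs; the function names and the data shape are hypothetical and are not a UCSM API.

```python
# Hypothetical pre-upgrade gate: hold the upgrade while Critical/Major faults are open.
BLOCKING = {"critical", "major"}

def blocking_faults(faults):
    """Return the faults whose severity should stop an upgrade."""
    return [f for f in faults if f[0].lower() in BLOCKING]

def ready_for_upgrade(faults, backup_taken):
    """Ready only if a backup exists and no Critical/Major fault is open."""
    return backup_taken and not blocking_faults(faults)

# Example data is made up for illustration.
faults = [("major", "example fault A"), ("warning", "example fault B")]
print(ready_for_upgrade(faults, backup_taken=True))  # False: a Major fault is open
```

A gate like this only covers two checklist items; the release notes and compatibility matrices still need a human reader.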

UCSM Firmware Upgrades
Frequently forgotten or missed items
• Updating OS drivers to meet the compatibility matrix
• Backing up the system prior to upgrade
• Upgrade of blade BIOS & Board Controller

UCSM Firmware Upgrade
UCS HW and SW Interoperability example

UCSM Firmware Upgrade (cont.)
Results (continued)

Adapter Driver = 1.5.0.45 (FNIC) / 2.1.2.38 (ENIC)


Adapter Firmware = 2.1(3)
Boot Code / BIOS =

UCSM Upgrades
Host Firmware Package – Simple option

UCS Firmware Upgrade
New features in 2.2(2)
• During the Auto-Upgrade process:
– System back-up reminder
– Critical/Major fault presence alerts

Maintenance Windows

Maintenance Windows
TAC customer example
• Customer opens a case regarding a critical fault
• We explain how to resolve it and, because the fix is potentially service
impacting, strongly suggest a maintenance window
• One hour later, the same customer is back with a P1 SR!

Maintenance Windows
• Better safe than sorry
• An industry-standard best practice for Data Centers
• Especially critical for Fabric Interconnect changes

System Back Up

System Backups
TAC Case example
• During an attempted upgrade, the configuration database became corrupted
• UCSM was in a degraded state, allowing neither back-ups nor show tech
• An 11-month-old ‘show tech ucsm’ was found in our TAC SR database
• Painful, but successful, reconfiguration (4+ hour effort) of the entire UCS domain

UCSM Back-ups
• UCS Back-up Types
1. Full State
2. System Configuration
3. Logical Configuration
4. All Configuration

Backup Types
Full State
• Binary file
• Encrypted (passwords and sensitive data not in clear text)
• Intended for Disaster Recovery
• Ideal for pre-Upgrade

Backup Types
System Configuration
• XML file
• System configuration such as username/roles
• Exportable to external Fabric Interconnects

Backup Types
Logical Configuration
• XML file
• Service Profiles, VLANs, VSANs, pools & policies
• Exportable to external Fabric Interconnects

Backup Types
All Configuration
• XML file
• Includes System & Logical configuration settings
• Exportable to external Fabric Interconnects
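
The four backup types can be summarized as a small lookup table. This is only a sketch of the slide content: the dictionary layout and helper name are illustrative, and treating Full State as covering both system and logical settings is an assumption that goes beyond the slide text.

```python
# Backup types as listed on the slides; the dict layout itself is illustrative.
BACKUP_TYPES = {
    # Assumption: Full State is modeled as covering everything.
    "Full State":            {"format": "binary", "areas": {"system", "logical"}},
    "System Configuration":  {"format": "XML",    "areas": {"system"}},
    "Logical Configuration": {"format": "XML",    "areas": {"logical"}},
    "All Configuration":     {"format": "XML",    "areas": {"system", "logical"}},
}

def xml_backups_covering(area):
    """Names of the XML backup types that include the given area."""
    return sorted(
        name for name, info in BACKUP_TYPES.items()
        if info["format"] == "XML" and area in info["areas"]
    )

print(xml_backups_covering("logical"))
```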

Fibre-Channel Port Channels

FC Port Channels
TAC Case Example
• Hosts were reporting high storage latency
• The customer added 3 FC uplinks (doubling bandwidth)
• No change in latency! What happened?

FC Port Channels
Individual FC Uplink behavior

[Diagram: Fabric Interconnect connected to the FC Switch over individual FC uplinks. Labels show FDISC logins, FCIDs fcid1 through fcid6 spread across the uplinks, and a LOGO.]

FC Port Channels
The power of the bundle

One logical link composed of the individual physical links

Frames sent round-robin, per link, per Src/Dst/OxID

[Diagram: FC Switch and Fabric Interconnect joined by a port-channel bundle]

FC Port Channels
Details
• Requires MDS or Nexus upstream switch
• Dynamically modify Port Channel link membership
• Load Balancing amongst member links is inherent
– No need to be concerned about multiple high b/w hosts pinned to same link
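
The difference between individual uplinks and a port channel can be illustrated with a toy model. It assumes, as the TAC case suggests, that a host stays on the uplink it logged in on until it logs in again, and that a port channel maps each exchange (Src/Dst/OxID) to one member link. The modulo "hash", the host count, and the frame counts are illustrative only and do not describe the actual hardware algorithm.

```python
from collections import Counter

def pinned_load(host_uplink, frames_per_host):
    """Individual uplinks: every frame from a host uses the uplink it logged in on."""
    load = Counter()
    for host, uplink in host_uplink.items():
        load[uplink] += frames_per_host[host]
    return load

def port_channel_load(exchanges, n_links):
    """Port channel: each exchange (src, dst, oxid) is mapped to one member link."""
    load = Counter()
    for src, dst, oxid, frames in exchanges:
        load[(src + dst + oxid) % n_links] += frames  # illustrative hash only
    return load

# Six hosts logged in while only uplinks 0-2 existed; uplinks 3-5 were added later.
host_uplink = {h: h % 3 for h in range(6)}
frames_per_host = {h: 600 for h in range(6)}
print(sorted(pinned_load(host_uplink, frames_per_host).items()))
# Uplinks 3-5 never appear: existing logins stay on their original uplinks.

# Same hosts, same total traffic, spread per exchange over a 6-link port channel.
exchanges = [(h, 100, oxid, 10) for h in range(6) for oxid in range(60)]
print(sorted(port_channel_load(exchanges, 6).items()))
```

In this model the three added uplinks carry nothing until hosts log in again, while the port channel uses all six members immediately.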

FC Port Channels
Back to our TAC case…

[Bar chart: per-port values for Port 1 through Port 6, series “No PC”; y-axis 0 to 25000]

FC Port Channels
Port Channel vs. individual FC uplinks
[Bar chart: per-port values for Port 1 through Port 6, series “PC” vs. “No PC”; y-axis 0 to 25000]

FC Port Channels
Conclusion
• Port Channels are preferred wherever possible
• They provide optimal traffic distribution
• They allow dynamic PC membership changes
• No known downside

FC Topologies
FC Topologies
Best Practice Model
[Diagram: FC Switches (A-Side, B-Side) above the Fabric Interconnects, with physical separation between the A and B sides]

FC Topologies
Common Mistake 1 (ISL)
[Diagram: FC Switches (A-Side, B-Side) above the Fabric Interconnects, with an ISL between the A and B sides]

FC Topologies
Common Mistake 2 (crossing)
[Diagram: FC Switches (A-Side, B-Side) above the Fabric Interconnects, with links crossing between the A and B sides]

FC Topologies
Rare Mistake (ISL + Cross)
[Diagram: FC Switches (A-Side, B-Side) above the Fabric Interconnects, with both an ISL and crossed links between the A and B sides]

3rd Party Transceivers

3rd Party Transceivers
TAC Case example
• We recently encountered “Cisco Compatible” non-Cisco twin-ax cables installed in a
relatively large UCS B-Series deployment
• What we found:
– The Cisco PID was spoofed, which prevented the “unsupported transceiver” fault
– A low percentage of frames had CRC errors
– Fibre Channel performance was severely impacted by the dropped frames
• Please keep our experience in mind when choosing 3rd Party transceivers

DIMM Faults
Degraded DIMM alerts
Background
• Probability of ECC errors increases as DIMM geometries shrink
• ECC threshold monitoring can lead to DIMMs being marked “Degraded”
• Not to be confused with DIMMs marked “Inoperable”

Degraded DIMMs
Impact
• Per UCS Engineering studies:
– UCS servers handle ECC errors without impact to the server
– No performance impact with DIMMs in a degraded state
– Our thresholds for marking a DIMM degraded were deemed too conservative

Degraded DIMMs
Resolution
• New firmware will change thresholds to practical values
• Fixed in 2.2(1b) and 2.1(3c)
• Workaround:
– You can safely ignore the ‘degraded DIMM’ faults until you upgrade, or
– RMA the degraded DIMM

DIMM Blacklisting
New in 2.2(1b)
• With DIMM Blacklisting feature enabled, if uncorrectable DIMM errors are
encountered:
– CIMC records location of faulty DIMM
– During next boot sequence, the faulty DIMM gets mapped out

• Benefits
– Allows server to safely remain in production
– Allows for RMA of faulty DIMM when convenient
• Please note that the feature is disabled by default
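
The record-then-map-out sequence can be modeled in a few lines. This is a conceptual sketch only: the class, method names, and slot labels are hypothetical and do not reflect how CIMC or the BIOS implement the feature.

```python
class Server:
    """Toy model of DIMM blacklisting (hypothetical names, not a CIMC interface)."""

    def __init__(self, dimms, blacklisting_enabled=False):  # disabled by default
        self.dimms = set(dimms)
        self.blacklisting_enabled = blacklisting_enabled
        self.recorded = set()  # locations noted after uncorrectable errors

    def uncorrectable_error(self, slot):
        """Record the faulty DIMM's location, but only if the feature is enabled."""
        if self.blacklisting_enabled:
            self.recorded.add(slot)

    def boot(self):
        """Return the DIMMs visible to the host after this boot."""
        return self.dimms - self.recorded

server = Server(["A0", "A1", "B0"], blacklisting_enabled=True)
server.uncorrectable_error("A1")
print(sorted(server.boot()))  # A1 is mapped out on the next boot
```

With the default of `blacklisting_enabled=False`, the same error leaves all DIMMs in use, which is why the last bullet matters.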

C-Series
C-Series
CIMC configuration
• Please configure your CIMC!
• Troubleshooting is nearly impossible without it

C-Series
TAC Case example
• Large C-series standalone deployment
• Disk performance degraded on a number of servers
• TAC found failed BBUs
• The RAID controllers had fallen back from write-back to write-through caching
• Recommendation:
– Employ SNMP or IPMI monitoring, especially in large deployments

C-Series
Monitoring feature
• We expect to have a UCSM-like feature to facilitate standalone C-Series
monitoring available in a future release.

C-Series
C-Series Integration
• Please be sure to update the rack servers
– Bundle C is used to upgrade integrated rack servers

• Caution: A re-ack of a FEX requires a re-ack of all associated rack servers

Working with the TAC
Working with the TAC
General Best Practices
• Generate appropriate ‘show tech’ dumps ASAP

• Have a complete Topology Diagram available

• Invite TAC SR owner to 3rd Party vendor calls when appropriate

• Always include attach@cisco.com in all emails

Working with the TAC
Why a great Problem Description is worth the effort
• Speeds up the problem resolution process
• An engineer familiar with the symptoms is more likely to grab it
• Helps the SR reviewer ensure that the case is progressing as expected

Working with the TAC
What to include in your Problem Descriptions
• Firmware Version & type(s) of equipment
• Error messages
• Clear and concise explanation of the problem
• Pertinent details, such as:
– New installation vs. production environment
– Any changes that may have led to the problem
– Impact to your business
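
One way to make sure none of these items is forgotten is to fill in a fixed template before opening the SR. The sketch below is only an illustration: the field names are hypothetical, the values are placeholders, and nothing here is a TAC-mandated format.

```python
from dataclasses import dataclass, field

@dataclass
class ProblemDescription:
    firmware: str
    equipment: str
    summary: str
    error_messages: list = field(default_factory=list)
    environment: str = "production"      # or "new installation"
    recent_changes: str = "none known"
    business_impact: str = "not stated"

    def render(self):
        """Produce a plain-text block suitable for pasting into a case."""
        lines = [
            f"Firmware: {self.firmware}",
            f"Equipment: {self.equipment}",
            f"Problem: {self.summary}",
            f"Environment: {self.environment}",
            f"Recent changes: {self.recent_changes}",
            f"Business impact: {self.business_impact}",
        ]
        lines += [f"Error: {msg}" for msg in self.error_messages]
        return "\n".join(lines)

# Placeholder values for illustration only.
sr = ProblemDescription(
    firmware="2.2(1b)",
    equipment="example blade chassis",
    summary="placeholder symptom text",
    error_messages=["placeholder error text"],
)
print(sr.render())
```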

Working with the TAC
Upgrades
• Engage the TAC if you have any questions prior to an upgrade
• Engage the TAC if you encounter problems during an upgrade
– The sooner, the better
• Please note that TAC engineers are unable to:
– Perform upgrades
– Recommend Firmware versions
– Perform bug scrubs

Working with the TAC
Upgrades (continued)
• Cisco Services are available to assist with upgrades and bug scrubs
• Consult your Cisco account team for details

Working with the TAC
Hardware Failures & RMAs
• Expect requests for logs, and allow time for analysis

• Hardware components should only be replaced when warranted

• Incompatible hardware and software can appear as failed hardware


– Examples:
• B200-M3 with Ivy Bridge processors requires 2.1(3) or higher
• Inoperable DIMMs may be due to a missing Voltage Regulator update
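
A minimum-version requirement such as "2.1(3) or higher" can be checked before suspecting hardware. The sketch below assumes version strings of the form major.minor(maintenance plus optional letters), as they appear in this deck; the helper names are hypothetical, and the release notes remain the authority on what a given release supports.

```python
import re

# Matches strings such as "2.1(3)", "2.1(3c)" or "2.2(1b)".
_VERSION = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")

def parse(version):
    """Split a version string into a tuple that compares in release order."""
    m = _VERSION.match(version)
    if not m:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, maint, patch = m.groups()
    return (int(major), int(minor), int(maint), patch)

def meets_minimum(running, minimum):
    """True if the running version is at or above the stated minimum."""
    return parse(running) >= parse(minimum)

print(meets_minimum("2.1(1f)", "2.1(3)"))  # False
print(meets_minimum("2.2(1b)", "2.1(3)"))  # True
```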

Working with the TAC
Hardware Failures & RMAs (continued)
• Fully assembled blades are an option
• Typically available Next Business Day (NBD)
• Subject to part(s) availability at the assembly depot

Working with the TAC
SR Ownership
• Only re-queue when immediate assistance is required
• Re-queuing for a status update can be counter-productive
• Behind every UCS TAC Engineer:
– Colleagues & Mentors
– Subject Matter Experts
– Team Leads
– Managers
– Technical Leaders
– Escalation engineers

Conclusion

Wrapping it up, the rest of the story…

Wrap Up
Release Notes – What was missed (from the 2.0 Release Notes)
After an upgrade from a prior release to 2.0(1), a critical fault may be raised about
an overlapping or matching FCoE VLAN ID used for a vSAN and an Ethernet
VLAN ID under the same fabric as the FCoE VLAN.

The fault can be avoided by changing either the FCoE VLAN ID or the Ethernet
VLAN ID so that they have two different IDs prior to the upgrade.

Resolving the problem after the upgrade may lead to down time for the system.
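
The overlap described in this caveat is easy to look for ahead of time. This is a minimal sketch, assuming the FCoE VLAN IDs and Ethernet VLAN IDs have already been collected per fabric; the function name, data shape, and example IDs are hypothetical.

```python
def overlapping_vlan_ids(fcoe_vlans, ethernet_vlans):
    """Return, per fabric, VLAN IDs used both as an FCoE VLAN and an Ethernet VLAN."""
    clashes = {}
    for fabric, fcoe_ids in fcoe_vlans.items():
        common = set(fcoe_ids) & set(ethernet_vlans.get(fabric, ()))
        if common:
            clashes[fabric] = sorted(common)
    return clashes

# Made-up IDs for illustration.
fcoe = {"A": [100], "B": [200]}
eth = {"A": [10, 100], "B": [10, 20]}
print(overlapping_vlan_ids(fcoe, eth))  # {'A': [100]}
```

An empty result means no ID needs to change before the upgrade; any entry names the fabric and the IDs to separate first.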

Wrap Up
The rest of the story
• His boss joins the conference bridge and asks: “Did we have a maintenance
window in place for the change?”
• Awkward silence follows after he admits that there was no maintenance window in
place.
• This didn’t have to happen. Please don’t let it happen to you.

Key Takeaways
• Utilize Maintenance Windows
• Read and understand Release Notes
• Maintain good Compatibility Matrix hygiene
• Leverage the TAC efficiently & effectively
• When in doubt, call the TAC

Complete Your Online Session Evaluation
• Give us your feedback and you could win fabulous prizes. Winners announced daily.
• Complete your session evaluation through the Cisco Live mobile app or visit one of the
interactive kiosks located throughout the convention center.

Don’t forget: Cisco Live sessions will be available for viewing on-demand after the event at
CiscoLive.com/Online

Continue Your Education
• Demos in the Cisco Campus
• Walk-in Self-Paced Labs
• Table Topics
• Meet the Engineer 1:1 meetings

I2C – A brief history

I2C
What is I2C?
• I2C = Inter-Integrated Circuit, around for many decades
• Master/Slave bus technology
• Employed by UCS to facilitate IO Module communication with chassis
components:
– Fan and PSU readings
– Chassis SEEPROM (a.k.a Shared Storage) access
– Blade readings

I2C Bus in the UCS Chassis

I2C
Typical Defect Symptoms
• Fan faults
• Shared Storage faults
• Temperature warnings
• Fans spinning at 100%

I2C
Improving over time
• I2C issues improve as firmware progresses:
– CSCue49366 : Midplane I2C norxack, c.ms errors cause budget-MC Error(-5)
– CSCtx52556 : No UCSM fault yet one or more Fan LEDs indicate a fault
– CSCtx49686 : Chassis thermal faults do not clear – Fan related
– CSCua10675 : FAN kernel driver bug fix
– CSCtl43716 : 9541 device error. Fan Modules reported inoperable, running 100%

