.NEXT Conference
PT203: SOS! Nutanix Troubleshooting
Bassem Rezkalla, Sr. Systems Reliability Engineer
@nutanix #nextconf #PT203
— Disclaimer
Forward-Looking Statement Disclaimer
This presentation and the accompanying oral commentary may include express and implied forward-looking statements, including but not limited to statements concerning our business plans and objectives, product features and technology that are under development or in process and capabilities of such product features and technology, our plans to introduce product features in future releases, the implementation of our products on additional hardware platforms, strategic partnerships that are in process, product performance, competitive position, industry environment, and potential market opportunities. These forward-looking statements are not historical facts, and instead are based on our current expectations, estimates, opinions and beliefs. The accuracy of such forward-looking statements depends upon future events, and involves risks, uncertainties and other factors beyond our control that may cause these statements to be inaccurate and cause our actual results, performance or achievements to differ materially and adversely from those anticipated or implied by such statements, including, among others: failure to develop, or unexpected difficulties or delays in developing, new product features or technology on a timely or cost-effective basis; delays in or lack of customer or market acceptance of our new product features or technology; the failure of our software to interoperate on different hardware platforms; failure to form, or delays in the formation of, new strategic partnerships and the possibility that we may not receive anticipated results from forming such strategic partnerships; the introduction, or acceleration of adoption of, competing solutions, including public cloud infrastructure; a shift in industry or competitive dynamics or customer demand; and other risks detailed in our Annual Report on Form 10-K for the fiscal year ended July 31, 2017, filed with the SEC (the Securities and Exchange Commission). These forward-looking statements speak only as of the date of this presentation and, except as required by law, we assume no obligation to update forward-looking statements to reflect actual results or subsequent events or circumstances. Any future product or roadmap information is intended to outline general product directions, and is not a commitment, promise or legal obligation for Nutanix to deliver any material, code, or functionality. This information should not be used when making a purchasing decision. Further, note that Nutanix has made no determination as to if separate fees will be charged for any future product enhancements or functionality which may ultimately be made available. Nutanix may, in its own discretion, choose to charge separate fees for the delivery of any product enhancements or functionality which are ultimately made available.
Certain information contained in this presentation and the accompanying oral commentary may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this presentation, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.
Trademark Disclaimer
© 2017 Nutanix, Inc. All rights reserved. Nutanix, the Enterprise Cloud Platform, the Nutanix logo and any other Nutanix products and features mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names and logos mentioned herein are for identification purposes only and are the property of their respective holder(s). Nutanix may not be associated with, or sponsored or endorsed by such holder(s).
— Agenda
MANAGING NUTANIX ENVIRONMENTS
Cluster Monitoring
NCC overview
Prism Analysis (and Prism Central)
TROUBLESHOOTING NUTANIX ENVIRONMENTS
General Troubleshooting
Troubleshooting Scenarios
Engaging support best practices
Additional Resources
— Monitoring
SNMP
Splunk
Prism Alerts
— Prism Alerts & Pulse HD
Reports
Deep Analytics
and Inventory
Cluster Phone Home
— Auto-case Generation
Example:
Description: Block Serial Number:
Tue Mar 22 2016 18:54:51 GMT-0700 (PDT)
PowerSupplyDown
1046: Bottom power supply is down on block
cluster_id:
alert_body: No Alert Body Available
New Alerts Appended
Block Serial Number:
cluster_id:
alert_body: No Alert Body Available
Resolution: Scheduled Maintenance. As advised by customer
— Auto-case Generation
THESE ALERTS WILL AUTO GENERATE SUPPORT CASES:
+ Stargate process is down for more than 3 hours (StargateTemporarilyDown)
+ Curator scan fails (CuratorScanFailure)
+ Running out of space on the cluster
+ Running out of space on CVMs
+ Hardware Clock Failure (HardwareClockFailure)
+ Faulty RAM module (RAMFault)
+ Power Supply failure (PowerSupplyDown)
If you want up-to-date information, check
http://portal.nutanix.com/kb/1959 on the portal (KB 1959)
For customers running on our partners' hardware platforms, we will
generate software-based alerts, which trigger auto support cases.
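As a rough illustration of how the list above might be used when triaging alerts, the sketch below sorts alert names into those that the slide says open a support case automatically and those that need manual follow-up. The helper names (`opens_support_case`, `split_alerts`) are hypothetical and not part of any Nutanix tool, and the two "running out of space" alerts are left out because the slide gives no identifier for them. KB 1959 remains the authoritative list.

```python
# Alert identifiers copied from the slide. The two "running out of space"
# alerts are omitted because the slide does not give their identifiers.
AUTO_CASE_ALERTS = frozenset({
    "StargateTemporarilyDown",
    "CuratorScanFailure",
    "HardwareClockFailure",
    "RAMFault",
    "PowerSupplyDown",
})


def opens_support_case(alert_name):
    """Return True if the alert name is on the slide's auto-case list."""
    return alert_name.strip() in AUTO_CASE_ALERTS


def split_alerts(alert_names):
    """Partition alert names into (auto-case, manual follow-up) lists."""
    auto, manual = [], []
    for name in alert_names:
        (auto if opens_support_case(name) else manual).append(name)
    return auto, manual
```

Calling `split_alerts` on the alert names seen in Prism gives a quick view of which ones should already have a case open and which ones still need someone to look at them.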
— Working with Prism Alerts
— Working with Prism Central Alerts Dashboard
— NCC Health Checks
NCC Checks
Diagnose cluster health
Default checks are non-disruptive
KB article for each NCC check
Helps get a baseline
NCC can be upgraded with no impact to cluster operation
Check output fields: Check Name, Affected CVMs, Troubleshooting information
PASS: The tested aspect of the cluster is healthy and no further
action is required
... cannot be evaluated as PASS/FAIL
— Prism Analysis
— Entity & Metric Charts
— Prism Central Analysis
— Troubleshooting Nutanix Environments: A Framework
+ Problem Isolation
+ Fixes and Mitigations
+ Root Cause Analysis
+ Product Improvement
— Troubleshooting by Layers
APPLICATION
+ SQL, VDI, Oracle RAC, etc.
CVM
+ Stargate, Curator, Cassandra, etc.
HYPERVISOR
+ AHV, ESXi, Hyper-V, XenServer
HARDWARE
+ NVMe, SSD, HDD, Memory, NIC, Processor, etc.
NETWORK
+ OVS, vSwitch, Physical Switch, etc.
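One way to put the layer model to work is to keep it as a lookup table, so that a component named in an alert or log can be mapped back to the layer to investigate first. The sketch below is a minimal illustration: the table repeats the slide's examples only (the "etc." entries are not expanded), and `layer_of` is a hypothetical helper name.

```python
# Layers and example components as listed on the slide.
LAYERS = {
    "APPLICATION": ["SQL", "VDI", "Oracle RAC"],
    "CVM": ["Stargate", "Curator", "Cassandra"],
    "HYPERVISOR": ["AHV", "ESXi", "Hyper-V", "XenServer"],
    "HARDWARE": ["NVMe", "SSD", "HDD", "Memory", "NIC", "Processor"],
    "NETWORK": ["OVS", "vSwitch", "Physical Switch"],
}


def layer_of(component):
    """Return the layer a component belongs to, or None if it is not listed."""
    wanted = component.strip().lower()
    for layer, components in LAYERS.items():
        if wanted in (name.lower() for name in components):
            return layer
    return None
```

The lookup is case-insensitive, so `layer_of("stargate")` and `layer_of("Stargate")` resolve to the same layer.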
— Troubleshooting: Problem Isolation
+ Rapidly reduce failure domain scope, achieve faster resolution.
+ Any recent changes in the environment?
IMPACT
+ Is storage available?
+ Are there performance issues?
+ Can you reach Prism?
USE BUILT-IN REPORTING
+ Prism Alerts
+ Cluster Health
+ NCC
+ Cluster logs
+ User Reports
— Troubleshooting: Problem Isolation - Cluster Status
cluster status | grep -v UP
— Troubleshooting: Problem Isolation - allssh, hostssh, NCC, Logging
+ allssh
+ NCC
+ /home/nutanix/data/logs and sysstats
+ INFO, WARN, ERROR, FATAL
+ allssh 'ls -ltr data/logs/*.FATAL'
+ If FATALs are actively occurring and you're experiencing issues, they may be related
+ hostssh "vmware -vl" instead of allssh 'ssh -l root 192.168.5.1 "vmware -vl"'
+ If you're seeing an error, check the Nutanix Knowledge Base!
— Problem Isolation - Data Resiliency Status
Data Resiliency Status
OK: Data Resiliency possible
Critical: Data Resiliency not possible
+ ncli cluster get-domain-fault-tolerance-status type=node
— Typical Troubleshooting Scenarios
UPGRADE IS NOT PROGRESSING
+ Logging: genesis.out, host_upgrade.out, firmware_upgrade.out
+ upgrade_status
+ host_upgrade_status
+ firmware_upgrade_status
STORAGE UNAVAILABLE
+ Do all CVMs have connectivity to each other and to the hypervisor?
+ Recent Stargate FATALs?
+ Cassandra status?
REPLICATION, SNAPSHOTTING, AND METRO RELATED ISSUES
+ Logging: Cerebro logs
NCC // HEALTH CHECKS FAILING
+ Running NCC should indicate the nature of the issue and give a KB describing
how to resolve the issue.
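For the storage-unavailable checks above, two of the earlier one-liners can be mirrored in a few lines of Python, which may help when working from collected output rather than a live shell. This is a sketch under stated assumptions: `non_up_lines` imitates the `grep -v UP` filter from the Cluster Status slide, `fatal_logs_oldest_first` imitates an oldest-to-newest listing of `*.FATAL` files in a log directory such as /home/nutanix/data/logs, and `SAMPLE_STATUS` is an invented sample whose layout is an assumption, not captured output. Both function names are hypothetical.

```python
from pathlib import Path

# Invented sample text; the real `cluster status` layout may differ.
SAMPLE_STATUS = """CVM: 10.0.0.1 Up
    Zeus UP [1234]
    Stargate DOWN []
CVM: 10.0.0.2 Up
    Zeus UP [2345]
"""


def non_up_lines(status_text):
    """Keep lines that do not contain 'UP' (case-sensitive), like `grep -v UP`."""
    return [line for line in status_text.splitlines() if "UP" not in line]


def fatal_logs_oldest_first(log_dir):
    """Return *.FATAL files in log_dir sorted by modification time, oldest first."""
    return sorted(Path(log_dir).glob("*.FATAL"), key=lambda p: p.stat().st_mtime)
```

Because the match is case-sensitive, per-CVM header lines ending in "Up" survive the filter, so each service that is not UP stays next to the CVM it belongs to. The newest FATAL file is the last element of the list returned by `fatal_logs_oldest_first`.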
— Root Cause Analysis - Log Collection
Summary Checks
Passed
— Best Practices for Engaging Support
Update your break/fix contact via My Nutanix Portal
Upgrade to the latest NCC and start a health_check
Provide a clear problem description
What steps have you already taken?
Keep components on the recommended version levels (Compatibility Matrix)
Press the Escalate button in the portal for immediate attention
Provide feedback after case closure. Surveys matter!
— Additional Resources
The Nutanix Bible - Architecture details
portal.nutanix.com - Nutanix Support Portal, KBs, Documentation, Software, etc.
portal.nutanix.com/kb/4530 - Additional troubleshooting details for Acropolis File Services
IF YOU LIKED THIS SESSION, YOU MAY ALSO LIKE:
+ Nutanix Architecture Deep Dive and the Deep Dive Super Session
+ Getting the Network Right (The First Time)
+ Fail Fast and Never Again
+ AHV — Virtualization You Always Wanted
— Thank You