.NEXT Conference
PT203: SOS! Nutanix Troubleshooting
Bassem Rezkalla, Sr. Systems Reliability Engineer
@nutanix #nextconf #PT203

Disclaimer

Forward-Looking Statement Disclaimer
This presentation and the accompanying oral commentary may include express and implied forward-looking statements, including but not limited to statements concerning our business plans and objectives, product features and technology that are under development or in process and capabilities of such product features and technology, our plans to introduce product features in future releases, the implementation of our products on additional hardware platforms, strategic partnerships that are in process, product performance, competitive position, industry environment, and potential market opportunities. These forward-looking statements are not historical facts, and instead are based on our current expectations, estimates, opinions and beliefs. The accuracy of such forward-looking statements depends upon future events, and involves risks, uncertainties and other factors beyond our control that may cause these statements to be inaccurate and cause our actual results, performance or achievements to differ materially and adversely from those anticipated or implied by such statements, including, among others: failure to develop, or unexpected difficulties or delays in developing, new product features or technology on a timely or cost-effective basis; delays in or lack of customer or market acceptance of our new product features or technology; the failure of our software to interoperate on different hardware platforms; failure to form, or delays in the formation of, new strategic partnerships and the possibility that we may not receive anticipated results from forming such strategic partnerships; the introduction, or acceleration of adoption of, competing solutions, including public cloud infrastructure; a shift in industry or competitive dynamics or customer demand; and other risks detailed in our Annual Report on Form 10-K for the fiscal year ended July 31, 2017, filed with the Securities and Exchange Commission (SEC). These forward-looking statements speak only as of the date of this presentation and, except as required by law, we assume no obligation to update forward-looking statements to reflect actual results or subsequent events or circumstances.

Any future product or roadmap information is intended to outline general product direction, and is not a commitment, promise or legal obligation for Nutanix to deliver any material, code, or functionality. This information should not be used when making a purchasing decision. Further, note that Nutanix has made no determination as to if separate fees will be charged for any future product enhancements or functionality which may ultimately be made available. Nutanix may, in its own discretion, choose to charge separate fees for the delivery of any product enhancements or functionality which are ultimately made available.

Certain information contained in this presentation and the accompanying oral commentary may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this presentation, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.

Trademark Disclaimer
© 2017 Nutanix, Inc. All rights reserved. Nutanix, the Enterprise Cloud Platform, the Nutanix logo and any other Nutanix products and features mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names and logos mentioned herein are for identification purposes only and are the property of their respective holder(s). Nutanix may not be associated with, or sponsored or endorsed by, such holder(s).

Agenda
MANAGING NUTANIX ENVIRONMENTS
+ Cluster Monitoring
+ NCC overview
+ Prism Analysis (and Prism Central)
TROUBLESHOOTING NUTANIX ENVIRONMENTS
+ General Troubleshooting
+ Troubleshooting Scenarios
+ Engaging support best practices
+ Additional Resources

Monitoring
[Diagram: monitoring integrations; legible labels include SNMP, Splunk, and Prism Alerts]

Prism Alerts & Pulse HD
[Diagram: legible labels include Reports, Deep Analytics, and Inventory]

Auto-case Generation Example
Description
+ Block Serial Number: [illegible]
+ Tue Mar 22 2016 18:54:51 GMT-0700 (PDT) PowerSupplyDown
+ 1046: Bottom power supply is down on block [illegible]
+ cluster_id: [illegible]
+ alert_body: No Alert Body Available
New Alerts Appended
+ Block Serial Number: [illegible]
+ cluster_id: [illegible]
+ alert_body: No Alert Body Available
Resolution
+ Scheduled Maintenance. As advised by customer

Auto-case Generation
THESE ALERTS WILL AUTO GENERATE SUPPORT CASES:
+ Stargate process is down for more than 3 hours (StargateTemporarilyDown)
+ Curator scan fails (CuratorScanFailure)
+ Running out of space on the cluster
+ Running out of space on CVMs
+ Hardware Clock Failure (HardwareClockFailure)
+ Faulty RAM module (RAMFault)
+ Power Supply failure (PowerSupplyDown)
For up-to-date information, check KB 1959 on the portal: http://portal.nutanix.com/kb/1959
For customers using our partners' hardware platforms, we generate software-based alerts, which trigger auto support cases.

Working with Prism Alerts
[Screenshot: Prism alerts view]

Working with Prism Central Alerts Dashboard
[Screenshot: Prism Central alerts dashboard]

NCC Health Checks
+ Diagnoses cluster health
+ Default checks are non-disruptive
+ KB article for each NCC check
+ Helps get a baseline
+ NCC can be upgraded with no impact to cluster operation
+ Troubleshooting information (including relevant [illegible])
+ PASS: The tested aspect of the cluster is healthy and no further action is required
[Screenshot: NCC Checks table with columns Check Name and Affected CVMs; one legible result fragment reads "10.1.181.106 cannot be evaluated as PASS/FAIL"]

Prism Analysis
[Screenshot: Prism Analysis page]
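
The auto-case alert list from the Auto-case Generation slide can be expressed as a simple lookup. This is a minimal sketch, not Nutanix code: the function names `generates_support_case` and `split_alerts` are hypothetical, and only the five alert names printed on the slide are included. The two "running out of space" items are omitted because the slide gives no alert name for them, and KB 1959 remains the authoritative, current list.

```python
# Alert names copied from the slide's list. The two "running out of space"
# items are left out because the slide does not name their alerts.
AUTO_CASE_ALERTS = frozenset({
    "StargateTemporarilyDown",
    "CuratorScanFailure",
    "HardwareClockFailure",
    "RAMFault",
    "PowerSupplyDown",
})


def generates_support_case(alert_name: str) -> bool:
    """Return True if the slide lists this alert as auto-generating a case."""
    return alert_name.strip() in AUTO_CASE_ALERTS


def split_alerts(alert_names):
    """Partition names into (auto_case, manual_follow_up), preserving order."""
    auto, manual = [], []
    for name in alert_names:
        (auto if generates_support_case(name) else manual).append(name)
    return auto, manual
```

A triage script could call `split_alerts` on the names of open alerts to see which ones likely already have a case open and which still need a manual ticket.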

Entity & Metric Charts

Prism Central Analysis

Troubleshooting Nutanix Environments: A Framework
+ Problem Isolation
+ Fixes and Mitigations
+ Root Cause Analysis
+ Product Improvement

Troubleshooting by Layers
APPLICATION
+ SQL, VDI, Oracle RAC, etc.
CVM
+ Stargate, Curator, Cassandra, etc.
HYPERVISOR
+ AHV, ESXi, Hyper-V, XenServer
HARDWARE
+ NVMe, SSD, HDD, Memory, NIC, Processor, etc.
NETWORK
+ OVS, vSwitch, Physical Switch, etc.

Troubleshooting: Problem Isolation
+ Rapidly reduce failure domain scope, achieve faster resolution.
+ Any recent changes in the environment?
IMPACT
+ Is storage available?
+ Are there performance issues?
+ Can you reach Prism?
USE BUILT-IN REPORTING
+ Prism Alerts
+ Cluster Health
+ NCC
+ Cluster logs
+ User Reports

Troubleshooting: Problem Isolation - Cluster Status
+ cluster status | grep -v UP

Troubleshooting: Problem Isolation - allssh, hostssh, NCC, Logging
+ allssh
+ NCC
+ /home/nutanix/data/logs and sysstats
+ INFO, WARN, ERROR, FATAL
+ allssh 'ls -ltr data/logs/*.FATAL'
+ If FATALs are actively occurring and you're experiencing issues, they may be related
+ hostssh "vmware -vl" instead of allssh 'ssh -l root 192.168.5.1 "vmware -vl"'
+ If you're seeing an error, check the Nutanix Knowledge Base!

Problem Isolation - Data Resiliency Status
[Screenshot: Data Resiliency Status widget showing OK]
+ ncli cluster get-domain-fault-tolerance-status type=node

Typical Troubleshooting Scenarios
UPGRADE IS NOT PROGRESSING
+ Logging: genesis.out, host_upgrade.out, firmware_upgrade.out
+ upgrade_status
+ host_upgrade_status
+ firmware_upgrade_status
STORAGE UNAVAILABLE
+ Do all CVMs have connectivity to each other and to the hypervisor?
+ Recent Stargate FATALs?
+ Cassandra status?
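
The "recent FATALs" check from the problem isolation steps can be sketched for a single log directory as follows. This is an illustrative sketch under stated assumptions rather than a Nutanix tool: the function name `recent_fatals` and the 15-minute default window are hypothetical choices, and the code looks at one local directory instead of fanning out across CVMs as `allssh` does.

```python
import time
from pathlib import Path


def recent_fatals(log_dir, window_seconds=900, now=None):
    """Return names of *.FATAL files in log_dir modified within the window.

    Results are ordered oldest first, newest last, mirroring `ls -ltr`.
    The 900-second default is an assumed threshold, not a documented one.
    """
    now = time.time() if now is None else now
    hits = []
    for path in Path(log_dir).glob("*.FATAL"):
        mtime = path.stat().st_mtime
        if now - mtime <= window_seconds:
            hits.append((mtime, path.name))
    # Sort by modification time so the most recent FATAL appears last.
    return [name for _, name in sorted(hits)]
```

An empty result suggests that any FATAL files present are old and less likely to be related to the current issue; a non-empty result points at the services whose logs to read first.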
REPLICATION, SNAPSHOTTING, AND METRO RELATED ISSUES
+ Logging: Cerebro logs
NCC HEALTH CHECKS FAILING
+ Running NCC should indicate the nature of the issue and give a KB describing how to resolve the issue.

Root Cause Analysis - Log Collection
[Screenshot: log collection output; legible labels include Summary and Checks Passed]

Best Practices for Engaging Support
+ Update your break/fix contact via My Nutanix Portal
+ Compatibility Matrix
+ Upgrade to the latest NCC and start a health_check
+ Clear problem description
+ What steps have you already taken?
+ Keep components on the recommended version levels
+ Press the Escalate Button in portal for immediate attention
+ Provide feedback after case closure. Surveys matter!

Additional Resources
+ The Nutanix Bible - Architecture details
+ portal.nutanix.com - Nutanix Support Portal, KBs, Documentation, Software, etc.
+ portal.nutanix.com/kb/4530 - Additional troubleshooting details for Acropolis File Services
IF YOU LIKED THIS SESSION, YOU MAY ALSO LIKE:
+ Nutanix Architecture Deep Dive and the Deep Dive Super Session
+ Getting the Network Right (The First Time)
+ Fail Fast and Never Again
+ AHV - Virtualization You Always Wanted

Thank You
