Hardware and Down Storage System Troubleshooting for Partners
Offered by: Global Support, Learning & Performance
Student Guide
NetApp Internal Only
Do NOT Distribute
Course Title: Hardware and Down Storage System Troubleshooting for Partners
Student Guide
ATTENTION
The information contained in this guide is intended for training use only. This guide contains information and
activities that, while beneficial for the purposes of training in a closed, non-productive environment, can result in
downtime or other severe consequences and therefore are not intended as a reference guide. This guide is not a
technical reference and should not, under any circumstances, be used in production environments. To obtain
reference materials, please refer to the NetApp product documentation at www.now.com for product information.
COPYRIGHT
Copyright © 1994–2013 NetApp, Inc. All rights reserved. Printed in the U.S.A.
No part of this document covered by copyright may be reproduced in any form or by any means— graphic,
electronic, or mechanical, including photocopying, recording, taping, or storage in an electronic retrieval system
without prior written permission of the copyright owner. Software derived from copyrighted NetApp material is
subject to the following license and disclaimer:
THIS SOFTWARE IS PROVIDED BY NETAPP "AS IS" AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE,
WHICH ARE HEREBY DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
NetApp reserves the right to change any products described herein at any time, and without notice. NetApp
assumes no responsibility or liability arising from the use of products described herein, except as expressly
agreed to in writing by NetApp. The use or purchase of this product does not convey a license under any patent
rights, trademark rights, or any other intellectual property rights of NetApp. The product described in this manual
may be protected by one or more U.S.A. patents, foreign patents, or pending applications.
RESTRICTED RIGHTS LEGEND: NetApp training documentation is commercial off-the-shelf data developed
entirely at private expense and is provided to the U.S. Government with LIMITED RIGHTS as defined under FAR
52.227-14 Alternative I (December 2007). Use, duplication, or disclosure by the U.S. Government is subject to
the restrictions as set forth in the Commercial Computer Software - Restricted Rights clause at FAR 52.227-19. In
the event licensee is a U.S. DoD agency, the Government's rights in software, supporting Documentation, and
technical data are governed by the restrictions in the Technical Data Commercial Items clause at DFARS
252.227-7015 and the Commercial Computer Software and Commercial Computer Software
Documentation clause at DFARS 227-7202.
TRADEMARK INFORMATION
All applicable trademark attribution is listed here. NetApp; the NetApp logo; the Network Appliance logo; Bycast;
Cryptainer; Cryptoshred; DataFabric; Data ONTAP; Decru; Decru DataFort; FAServer; FilerView; FlexCache;
FlexClone; FlexShare; FlexVol; FPolicy; gFiler; Go further, faster; Manage ONTAP; MultiStore; NearStore;
NetCache; NOW (NetApp on the Web); ONTAPI; RAID-DP; SANscreen; SecureShare; Simulate ONTAP;
SnapCopy; SnapDrive; SnapLock; SnapManager; SnapMirror; SnapMover; SnapRestore; SnapValidator;
SnapVault; Spinnaker Networks; Spinnaker Networks logo; SpinAccess; SpinCluster; SpinFlex; SpinFS; SpinHA;
SpinMove; SpinServer; SpinStor; StorageGRID; StoreVault; SyncMirror; Topio; vFiler; VFM; and WAFL are
registered trademarks of NetApp, Inc. in the U.S.A. and/or other countries. Network Appliance, Snapshot, and
The evolution of storage are trademarks of NetApp, Inc. in the U.S.A. and/or other countries and registered
trademarks in some other countries. The StoreVault logo, ApplianceWatch, ApplianceWatch PRO, ASUP,
AutoSupport, ComplianceClock, DataFort, Data Motion, FlexScale, FlexSuite, Lifetime Key Management,
LockVault, NOW, MetroCluster, OpenKey, ReplicatorX, SecureAdmin, Shadow Tape, SnapDirector, SnapFilter,
SnapMigrator, SnapSuite, Tech OnTap, Virtual File Manager, VPolicy, and Web Filer are trademarks of NetApp,
Inc. in the U.S.A. and other countries. Get Successful and Select are service marks of NetApp, Inc. in the U.S.A.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. A complete and current list of other IBM trademarks is
available on the Web at http://www.ibm.com/legal/copytrade.shtml. Apple is a registered trademark and
QuickTime is a trademark of Apple, Inc. in the U.S.A. and/or other countries. Microsoft is a registered trademark
and Windows Media is a trademark of Microsoft Corporation in the U.S.A. and/or other countries. RealAudio,
RealNetworks, RealPlayer, RealSystem, RealText, and RealVideo are registered trademarks and RealMedia,
RealProxy, and SureStream are trademarks of RealNetworks, Inc. in the U.S.A. and/or other countries.
All other brands or products are trademarks or registered trademarks of their respective holders and
should be treated as such.
NetApp, Inc. is a licensee of the CompactFlash and CF Logo trademarks.
NetCache is certified RealSystem compatible.
Change History
Date            Comment
August 2012     Restructure of course based on new layout of 3-day format;
                addition of Hardware Overview
February 2013   Tech refresh and format update
June 2013       Minor edits
GS Learning and Performance - Hardware and Down Storage System Troubleshooting for Partners
© Copyright 2013 NetApp, Inc. all rights reserved. Company confidential – for internal use only Student Guide - 1
Hardware and Down Storage System Troubleshooting for Partners
Logistics
– Introductions
– Schedule (start time, breaks, lunch, close)
– Telephones and messages
– Food and drinks
– Restrooms

Safety
– Alarm signal
– Evacuation route
– Assembly area
– Electrical safety
Who are you?
What is your job title and responsibilities?
What would you like to gain from taking this class?
How long have you been working with this technology?
Did you take the prerequisite course? If not, why?
Course Objectives
By the end of this course, you should be able to:
– Describe the storage controller boot process
– Explain the purpose of the special boot menus
– Discuss differences between WAFL_check and wafliron
– Describe how RAID handles disk errors
– Identify the proper method of core acquisition
Course Agenda
Day 1
– Introductions and overview
– Module 1: Hardware Overview
– Module 2: High Availability Hardware Basics
Day 2
– Module 3: Boot Process
– Module 4: Special Boot Menu
– Module 5: Common Storage Controller Problems
Day 3
– Module 6: WAFL_check and wafliron
– Module 7: Troubleshooting Loop Issues
Introduction to NetApp Products
Data ONTAP 7.3 Fundamentals
NetApp Certified Data Management Administrator
(NCDA)
1 – 2 years experience supporting NetApp hardware
and software
Fibre Channel SAN Troubleshooting (web-based)
Select Comment > Add Sticky Note
Thank You!
Hardware and Down Storage System Troubleshooting for Partners
Hardware Overview
Module Objectives
By the end of this module, you should be able to:
– Identify various platforms and associated storage
– Demonstrate use of the NetApp Hardware Universe
– Discuss the newest available platforms
– Identify shared features and components of various platforms
– Discuss FC-AL technology and related components
– Discuss SAS technology and related components
– Discuss remote management options
Module Topics
Types of Storage Systems
NetApp Hardware Universe
Support Status Codes
Determining Shelf Type
Fibre Channel Arbitrated Loop (FC-AL)
Serial Attached SCSI (SAS)
Remote System Management
Hardware References
Hardware References
Hardware Universe
Product Library - Hardware Pages
EOA / EOS Page
TOI, TSB, CSB Archives
Supported Data ONTAP versions
Compatible PCI cards, including allowed and preferred slots
Compatible shelves and shelf modules
Storage capacity and maximum spindle count
System memory and NVRAM memory
SCSI ports and supported tape backup devices
The NetApp Hardware Universe is a menu-based tool that shows the supported configurations for different storage controller models, based on the controller model and the desired Data ONTAP version.
http://hardware.netapp.com
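The menu-driven lookup above can be thought of as a keyed query over configuration records. A minimal sketch follows; every field name and value in it is an illustrative stand-in, not authoritative Hardware Universe data:

```python
# Toy model of a Hardware Universe query: (platform, Data ONTAP
# version) -> supported-configuration details. Values are made up
# for illustration; consult http://hardware.netapp.com for real data.
hwu = {
    ("FAS6210", "8.1"): {
        "nvram": "NVRAM8 (4 GB)",
        "memory_gb": 24,
        "shelves": ["DS2246", "DS4243"],   # illustrative list
    },
}

def lookup(platform: str, ontap_version: str) -> dict:
    """Return the record for a platform/version pair, or {} if unknown."""
    return hwu.get((platform, ontap_version), {})

print(lookup("FAS6210", "8.1")["nvram"])   # NVRAM8 (4 GB)
```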
Hardware Universe
http://hardware.netapp.com

To query supported configurations in the Hardware Universe:
– Choose a hardware category
– Choose a Data ONTAP release
– Choose a platform
– Choose an adapter type
– Choose information to display
– Click Show Results
Hardware References
Product Library - Hardware Pages
– Guides for setting up and upgrading storage systems
– Action plans for replacing individual components
NetApp Hardware Universe
– Includes part numbers
– Ability to perform platform comparisons
EOA / EOS page
– Shows dates for end of availability and end of support for all platforms, shelves, PCI cards, etc.
Hardware References
- Product Library:
  NetApp Support Online --> Documentation --> Product Documentation
  https://support.netapp.com/documentation/productsatoz/index.html
- NetApp Hardware Universe:
  https://hardware.netapp.com
- EOA / EOS Page:
  NetApp Support Online --> Documentation --> End of Availability
  http://support.netapp.com/NOW/products/eoa/
FCS – First Customer Shipment
EOA – End of Availability (stop selling the product)
EOS (End of Support) – Software
– End of new code for this product
– For feature releases: 2 years after EOA date
– For patch releases: up to 3 years after the feature release end
EOS (End of Support) – Hardware
– Last support date for hardware replacement; typically 5 years after EOA date
Current list can be found at:
– http://support.netapp.com/NOW/products/eoa/
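The lifecycle milestones above follow simple offsets from the EOA date, which can be sketched as a small calculation. Function names are illustrative, and the authoritative dates are always the ones published on the EOA/EOS page:

```python
from datetime import date

def eos_software(eoa: date, is_patch_release: bool = False) -> date:
    """End of new code: 2 years after EOA for feature releases;
    patch releases run up to 3 more years beyond the feature-release end."""
    years_after_eoa = 2 + (3 if is_patch_release else 0)
    return eoa.replace(year=eoa.year + years_after_eoa)

def eos_hardware(eoa: date) -> date:
    """Last hardware-replacement support date: typically 5 years after EOA."""
    return eoa.replace(year=eoa.year + 5)

eoa = date(2013, 6, 30)
print(eos_software(eoa))                          # 2015-06-30
print(eos_software(eoa, is_patch_release=True))   # up to 2018-06-30
print(eos_hardware(eoa))                          # 2018-06-30
```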
Labs
Lab 1-1
Storage Controller Overview
Active storage for user and application data
Uptime and performance are critical
FC-AL, SAS, or SSD disks typically used
ATA disks can be used to lower cost but are not as fast or reliable
Standard
• FAS9xx
• FAS30x0
• FAS60x0
• FAS32x0 *
• FAS62x0 *
Shrunken-head
• FAS2x0
• FAS20x0
• FAS2240-x

Two-in-a-box
• FAS31x0
• FAS32x0 *
• FAS62x0 *

*FAS62x0 and FAS32x0 are only Standard when not in C-C configuration (only available on FAS3210 and FAS6210)
Also known as a NearStore Controller
Data ONTAP optimized for secondary storage
Higher number of concurrent SnapMirror and SnapVault backups
Typically uses ATA disks
Often not clustered and sometimes has single-path cabling
NearStore systems typically use ATA disks, which are larger and less expensive per GB but have lower performance and higher failure rates.
Single attached: Choosing not to dual-attach shelves is more cost-effective because it uses fewer HBAs, shelf modules, and cables, but it makes a system much less resilient to failures.
R200 (Legacy product)
FAS3xx0 *
FAS6xx0 *
Controller head is separate from other
controllers and disk shelves
– FAS9x0
– FAS30x0
– FAS60x0
– FAS32x0 *
– FAS62x0 *
*FAS62x0 and FAS32x0 are only Standard when not in C-C configuration
*Note: V-Series and N-Series are the same hardware with a different personality, which SHOULD NOT be changed in the field.
Storage System head is a module in the first
shelf
– FAS2x0
– FAS20x0
– FAS22x0
Typically used for smaller implementations
Lower performance
Less expandability
More affordable
All use NVMEM
NVMEM: NVRAM chip is onboard, system memory is battery-backed, and a portion of the memory is
used for NVRAM functionality. This means that a motherboard replacement results in a changed sysid.
Both Storage Controllers of a cluster pair are
in one chassis
– FAS31x0
– FAS32x0 *
– FAS62x0 *
Interconnect through the chassis backplane
Each head has its own set of PCI cards, I/O
ports, and management ports
Both heads share power
Advanced Systems Overview
FAS62x0, FAS32x0, FAS2240-x
Overview of FAS62x0 and FAS32x0 platforms
– Overview and internals of each system
– Supported configurations
– Specifics on new NVRAM8 and NVMEM
Common features
– Storage support
– New boot device
– Replaceable parts
– Hardware upgrades
– SP
– sldiag
V / FAS62x0
Section Overview
Basic Information
Supported Configurations
Hardware Overview (ports, slots, etc.)
Hardware Internals
NVRAM8
Unit LEDs
Basic Information
New two-in-a-box high-end / high-performance platform
Replaces FAS60x0 and FAS31x0 series
Up to 3PB capacity, plus 2x more PCIe connectivity, built-in
10Gb Ethernet, 8Gb FC, and 6Gb SAS
Single chassis supports 1 controller, 1 controller plus IOXM,
or 2 controllers
IOXM = IO eXpansion Module (new for FAS62x0)
Chassis Details:
– 6U
– 6 Fan FRUs
– 2 PSUs
– 1 mid-plane
(not a separate FRU)
Three Sizes
•Low – FAS6210
•Mid – FAS6240
•High – FAS6280
Three Sizes
V: Value
L: Low
M: Mid
FAS6210 Configurations
Controller/blank
– Known as Cb
Controller/Controller
– Known as CC
Controller/IOXM
– Known as CI
Configuration                                   Internal Term   FAS6210   FAS6240   FAS6280
Single Chassis, Single Controller, no IOXM      Cb              Yes       No        No
Single Chassis, Single Controller and IOXM      CI              No        Yes       Yes
Single Chassis, Dual Controller                 C-C             Yes       No        No
Dual Chassis, Single Controller, no IOXM each   Cb-Cb           Yes       No        No
Dual Chassis, Single Controller and IOXM each   CI-CI           No        Yes       Yes
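The support matrix above maps directly to a lookup, which makes the rules easy to check programmatically. A sketch; the dictionary mirrors the table and the function name is illustrative:

```python
# Supported FAS62x0 chassis configurations, encoded from the table
# above. Keys are the internal configuration terms; values are the
# models on which each configuration is supported.
SUPPORTED_CONFIGS = {
    "Cb":    {"FAS6210"},
    "CI":    {"FAS6240", "FAS6280"},
    "C-C":   {"FAS6210"},
    "Cb-Cb": {"FAS6210"},
    "CI-CI": {"FAS6240", "FAS6280"},
}

def is_supported(config: str, platform: str) -> bool:
    """True if the internal configuration term is valid on the platform."""
    return platform in SUPPORTED_CONFIGS.get(config, set())

print(is_supported("C-C", "FAS6210"))   # True
print(is_supported("CI", "FAS6210"))    # False
```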
Configuration Notes
All configurations also supported for V-Series
Supported MetroCluster configurations:
– FAS6210: Single controller (no IOxM) in each chassis
– FAS6240 / FAS6280: Single controller and IOxM in
each chassis
– No MetroCluster support for C-C configuration
Controller:
• 2 x 1GbE (e0a, e0b)
• 4 x 10GbE (e0c, e0d, e0e, e0f)
• 4 x 8Gb FC (0a, 0b, 0c, 0d)
• 1 x Vertical I/O (slot 1)
• 1 x NVRAM8 (slot 2)
• 4 x Full Length PCIe (slots 3-6)
• 1 x 10/100 SP/Management (wrench)
• 1 x 10/100 ACP (wrench w/lock)
• 1 x Serial console ( |0|0| )
• Unused/Covered USB port

IOXM:
• 8 x Full Length PCIe (slots 7-10, 13-16)
• 2 x Vertical I/O (slots 11-12)
• 4 x Unused/Covered Ethernet ports
• Not hot removable or swappable
• Controller will panic if removed
• If inserted into a running single-node chassis, the IOXM will not be recognized until the controller is rebooted
                          FAS6210        FAS6240        FAS6280
Processor cores
  per controller          8 @ 2.26 GHz   8 @ 2.53 GHz   12 @ 2.93 GHz
Memory per controller     24 GB          48 GB          96 GB
NVRAM8 memory             4 GB           4 GB           4 GB
*Processors:
-FAS6210: 2 x Intel Nehalem 4-core @ 2.26GHz, E5520
-FAS6240: 2 x Intel Nehalem 4-core @ 2.53GHz, E5540
-FAS6280: 2 x Intel Westmere 6-core @ 2.93GHz, X56
*Memory:
-12 sockets on motherboard
-All use DDR3-1066
-FAS6210: 6 x 4GB (X3204-R6)
-FAS6240: 12 x 4GB (X3204-R6)
-FAS6280: 12 x 8GB (X3205-R6)
FAS62x0 Controller
FAS62x0 IOXM
FAS62x0 Chassis
Same as FAS31x0
– Midplane and controller
are keyed to prevent
installation of incorrect
hardware
Front:
– 6 x Fan Assembly
Rear:
– 2 x Controller / IOxM
– 2 x PSU
FAS62x0 Chassis
-Midplane contains a guide pin which prevents installation of a FAS/V31x0 PSU into a FAS/V62x0 System.
This map is included with every controller in the field, so onsite engineers can easily determine where parts are located. It shows which pieces need to be removed in order to reach certain parts, and which bank of DIMMs is 1-6 and which is 7-12.
10GbE riser: X3217-R
Left PCIe riser card (for IOXM only): X3216-R
Right PCIe riser card (for controller or IOXM): X3211-R
Left and right are as seen when looking at the system from the rear.
Can be installed only in slots 1, 11, or 12
PCIe, custom form factor (non-standard height)
4 x 3/6Gb SAS
4 x 2/4/8Gb FC
– Can be configured as initiator or target
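The slot restriction above reduces to a one-line membership check. An illustrative sketch; the function name is an assumption:

```python
# The FAS62x0 vertical I/O card fits only slot 1 (controller)
# and slots 11-12 (IOXM), per the list above.
VERTICAL_IO_SLOTS = {1, 11, 12}

def can_install_vertical_io(slot: int) -> bool:
    """True if the vertical I/O card may be installed in this slot."""
    return slot in VERTICAL_IO_SLOTS

print(can_install_vertical_io(1))   # True
print(can_install_vertical_io(3))   # False: slots 3-6 are full-length PCIe
```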
Custom form-factor PCIe card
Installed in slot 2 only
DIMM is a separate FRU
Battery is a separate FRU
No tools are needed to remove the card, but a Phillips screwdriver is needed to remove the battery
Uses standard SAS/QSFP connectors
Hardware Internals: NVRAM8
*Be sure to label cables when servicing, especially when SAS I/O cards are in use, because the cables are functionally the same.
*SAS cables are 6Gb! These are the same cables supported with the DS2246 but NOT the DS4243.
After a dirty shutdown, a de-stage operation is performed
– DRAM contents are moved to flash
– Better than being battery-backed for a finite number of days
Card uses a Smart Battery
– High current for a short period of time
– Charger does not turn on until the battery is discharged 25%
No external cable needed for an HA pair in C-C configuration
– The INT LNK LED on the card lights up when in this configuration
-The contents of DRAM are moved to flash components within a minute of the power loss, and then the card shuts down completely.
-NVRAM8 requires a high current for a short period of time (1.5A for 1 minute) instead of a low current for a long period of time (<50 mA for a minimum of 3 days).
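The charger behavior described in the notes is a simple hysteresis loop: charging starts only once the battery has discharged 25%, and stops when full. A minimal sketch, assuming a 75%/100% threshold pair per the notes; the class and method names are illustrative:

```python
# Hysteresis sketch of the NVRAM8 smart-battery charger: the charger
# does not turn on until the battery has discharged 25%, then charges
# until full. Thresholds follow the notes above; the class is a model.
class BatteryCharger:
    def __init__(self):
        self.charging = False

    def update(self, charge_pct: float) -> bool:
        """Return whether the charger is on at this charge level."""
        if charge_pct <= 75.0:        # discharged 25% -> start charging
            self.charging = True
        elif charge_pct >= 100.0:     # full -> stop charging
            self.charging = False
        return self.charging

c = BatteryCharger()
print(c.update(90.0))   # False: not yet 25% discharged
print(c.update(70.0))   # True: charger turns on
print(c.update(85.0))   # True: keeps charging until full (hysteresis)
print(c.update(100.0))  # False: fully charged, charger off
```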
A clean shutdown results in NVRAM being flushed properly
A dirty shutdown may result in unwritten data in NVRAM
When replacing the board, the STATUS button on the bottom of the card should be pressed to activate the LED on the top side of the card
– Green: Clean shutdown, no customer data
– Red: Customer data in NVRAM, replay required
– Amber: Not good / unknown status. Card should be replaced
– No light: Battery may be bad or in shutdown mode. Power cycle the system with the card installed and try again
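The LED decision table above maps directly to a lookup. A sketch; the strings paraphrase the guidance above and are not CLI output:

```python
# NVRAM8 STATUS-LED interpretation per the list above. "no_light"
# is an illustrative stand-in key for the no-LED case.
LED_MEANING = {
    "green": "Clean shutdown; no customer data in NVRAM.",
    "red": "Customer data in NVRAM; replay required.",
    "amber": "Not good / unknown status; replace the card.",
    "no_light": "Battery may be bad or in shutdown mode; power-cycle "
                "the system with the card installed and try again.",
}

def status_action(led: str) -> str:
    """Look up the recommended interpretation for an observed LED state."""
    return LED_MEANING.get(led, "Unrecognized state; re-check the LED.")

print(status_action("red"))   # Customer data in NVRAM; replay required.
```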
Controller (lower left)
IOXM (lower left)
PSU (near the top of each)
V / FAS32x0
V/FAS32x0 Overview
Basic Information
Supported Configurations
Hardware Overview (ports, slots, etc.)
Hardware Internals
NVMEM
Unit LEDs
Platform Upgrade Comparison
Basic Information
New two-in-a-box midrange platform
Single chassis supports 1 controller, 1 controller plus IOXM, or 2 controllers
IOXM = I/O eXpansion Module (new for FAS32x0)
Chassis details:
– 3U
– 3 fan FRUs
– 2 PSUs
– 1 midplane (not a separate FRU)
Three Sizes
• Low – FAS3210
• Mid – FAS3240
• High – FAS3270
Configuration                                 Internal Term   FAS3210   FAS3240   FAS3270
Single Chassis, Single Controller (no IOXM)   Cb              Yes       No        No
Single Controller
(FAS3210 only)
Controller + IOXM
(FAS3240 and FAS3270 only)
Dual Controller
Controller:
• 2 x 6G SAS (0a, 0b)
• 2 x 1GbE (e0a, e0b)
• 2 x 4Gb FC (0c, 0d)
• 2 x 10GbE – HA IC only (c0a, c0b)
• 1 x 10/100 SP/Management (wrench)
• 1 x 10/100 ACP (wrench w/lock)
• 1 x Serial console ( |0|0| )
• 1 x Full Length PCIe (slot 1)
• 1 x 3/4 Length PCIe (slot 2)
• Unused/Covered USB port
IOXM:
• 4 x Full Length PCIe (slots 3-6)
• 4 x Unused/Covered Ethernet ports
• Not hot removable or swappable
• Controller will panic if the IOXM is removed
• If inserted into a running single-node chassis, the IOXM will not be recognized until the controller is rebooted
All ports and slots (including the IOXM) are included in sysconfig output.
sysconfig also includes the power status, firmware version, and serial number of the IOXM.
          FAS3210   FAS3240   FAS3270
Memory    8 GB      16 GB     32 GB
NVMEM     1         2         4
*Processors:
-FAS3210: 1 x Intel E5220, 2.3GHz (dual core)
-FAS3240: 1 x Intel L5410, 2.3GHz (quad core)
-FAS3270: 2 x Intel E5240, 3.0GHz (dual core)
*Max spindles requires Data ONTAP 8.0+; otherwise these numbers are halved.
*Max capacity is calculated using 3TB SATA disks, which require Data ONTAP 8.0.2. If using Data ONTAP 8.0.1 or 7.3.x, the largest supported disk is 2TB.
*FAS3210:
X3131-R6: 1GB,DDR2,ECC,REG,PC667,X8 (Single Channel) (DIMM-NV1)
X3133-R6: 2GB,DDR2,ECC,REG,PC667,X8 (Single Channel) (DIMM-1, DIMM-2)
*FAS3240:
X3199-R6: 2GB,DDR2,ECC,REG,PC667,X4 (Dual Channel) (DIMM-NV1, DIMM-NV2, DIMM-1, DIMM-2)
*FAS3270:
X3199-R6: 2GB,DDR2,ECC,REG,PC667,X4 (Dual Channel) (DIMM-NV1, DIMM-NV2)
X3250-R6: 4GB,DDR2,ECC,REG,PC667,X4 (Dual Channel) (DIMM-1, DIMM-2, DIMM-3, DIMM-4)
The DIMM names in the graphic are exactly how they are named in 'sysconfig -M'. Note that this is the first time that NVMEM systems have distinguished between DIMMs that are battery backed and DIMMs that are used only for system memory.
Controller is embedded (not a separate FRU)
Uses a battery-backed portion of system memory
The battery is a FRU
The battery can hold NVMEM contents for a minimum of 72 hours
The external LED flashes green once every 2 seconds when the system is off and NVMEM contains data
In HA configurations, controllers synchronize NVMEM contents with the partner:
– C-C config: 10GbE over the midplane (external 10GbE ports are disabled)
– C-I config: 10GbE external cabling
– 2 paths for redundancy
Not all of the battery-backed DIMM capacity is used for NVMEM
All space used for NVMEM is battery backed
Memory reserved for system and NVMEM per model:
– FAS3210: ~4.5GB system + ~0.5GB NVMEM = 5GB total
– FAS3240: ~7GB system + ~1GB NVMEM = 8GB total
– FAS3270: ~18GB system + ~2GB NVMEM = 20GB total
sysconfig output shows the total memory and the amount dedicated to NVMEM:
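The per-model split above can be summarized in a small table. This sketch uses the approximate figures from the slide (they are rounded training values, not exact sysconfig numbers):

```python
# Approximate system/NVMEM memory split per FAS32x0 model, from the
# figures listed above. Values are rounded slide figures.

MEMORY_SPLIT_GB = {          # model: (system GB, NVMEM GB)
    "FAS3210": (4.5, 0.5),
    "FAS3240": (7.0, 1.0),
    "FAS3270": (18.0, 2.0),
}

for model, (system, nvmem) in MEMORY_SPLIT_GB.items():
    total = system + nvmem
    print(f"{model}: {total:g} GB total, {nvmem:g} GB battery-backed "
          f"for NVMEM ({nvmem / total:.1%})")
```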
Front LEDs
Systems ship with the current generally available (GA) release
Only 7.3.5 and above is supported in the 7.x code line
If the system has 8.0.1 and 7.3.x is desired, a fresh installation will not work
Must use revert_to
Revert to 7.3.5:
https://kb.netapp.com/support/index?page=content&id=101260
Releases earlier than 7.3.5 *will not* work due to driver dependencies.
The revert_to command, and NOT a fresh install, must be used to get to 7.3.5 because 7.3.x cannot read and work with the RAID labels written by Data ONTAP 8.x.
V / FAS2240-x
Shrunken-head storage controller
Based on existing SAS-based storage shelves
– FAS2240-2: Based on DS2246 (2U, 6Gb, 24 x 2.5" internal disk drives)
– FAS2240-4: Based on DS4243 (4U, 3Gb, 24 x 3.5" internal disk drives)
Common controller (PCM) used for each version of the platform
Uses new mezzanine cards for additional I/O capabilities
FAS2240: Features
Battery-backed system memory used for NVMEM
Native ACP support
Integrated Service Processor (SP)
HA interconnect utilizes shelf chassis midplane
Ability to add one expansion shelf stack
– 10 x DS4243 or DS2246
– 6 x DS14-mk4-FC or DS14-mk2-AT
Disk shelf conversion ability
NVMEM: The 2GB DIMM with white tabs is battery backed, but only 768MB of that DIMM is actually used for NVMEM.
FAS2240: Specifications
                        FAS2240-2   FAS2240-4
Form Factor             2U          4U
Chassis Depth           20 in       24 in
Onboard GbE             4           4
Other Ports Supported   2 x 8Gb FC or 2 x 10GbE via an optional I/O card (both models)
Controller modules and power supplies (FAS2240-2 and FAS2240-4)
IOM6 looks the same as IOM3 but runs at 6Gbps
FAS2240 FRUs
Hot-Swappable Components
• Fans
• Power Supplies
Not Hot-Swappable
• Processor Control Module
• Memory DIMMs (2GB/4GB)
• NVMEM battery
• Coin Cell (RTC) battery
• USB Boot Media
• PCIe card
• Mezzanine Cards
*DIMMs:
-4GB (X3208A-R6): Black tabs, shows up in 'sysconfig -M' as DIMM-1
-2GB (X3209A-R6): White tabs, shows up in 'sysconfig -M' as DIMM-NV
*While the SFPs for the different mezzanine cards (8Gb FC / 10Gb Eth) are both SFP+, they are DIFFERENT FRUs:
-8 Gb FC SFP+: X6588-R6
-10 Gb Eth SFP+: X6589-R
FAS2240 - Internals
Two options currently available:
– 2 x 8Gb FC
– 2 x 10Gb Ethernet
Note: A mezzanine card is a non-essential option that simply adds I/O capabilities. It is not required for the system to function properly.
Shared Features and Information
FAS62x0, FAS32x0, and FAS2240-x
Note: Everything discussed as FAS in this section also applies to the V-Series model equivalents.
Storage Support
USB Boot Device
Hardware Upgrades
FRUs
Service Processor
sldiag
Disk Shelves
              Data ONTAP 7.3.5+   Data ONTAP 8.0.1+   Data ONTAP 8.1
DS2246        Supported           Supported           Supported
DS4243        Supported           Supported           Supported
DS14-mk4      Supported           Supported           Supported
DS14-mk2-AT   Supported           Supported           Supported
DS14-mk2      Supported           Supported           Supported*
DS14          Not Supported       Not Supported       Not Supported
*Data ONTAP 7.3.5+ refers to 7.3.5 and anything newer than this release in the 7.x family.
*Data ONTAP 8.0.1+ refers to 8.0.1 and anything newer than this release in the 8.x family.
*The mk2 shelf is supported in Data ONTAP 8.1 ONLY with ESH4 modules.
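The support matrix above can be encoded as a simple lookup. The table data comes from the slide; the dictionary structure and function are only a training illustration:

```python
# Shelf-support matrix from the table above, keyed by shelf model.
# Release-family labels match the table columns; the ESH4-only caveat
# for DS14-mk2 on 8.1 is noted but not modeled.

SHELF_SUPPORT = {
    "DS2246":      {"7.3.5+", "8.0.1+", "8.1"},
    "DS4243":      {"7.3.5+", "8.0.1+", "8.1"},
    "DS14-mk4":    {"7.3.5+", "8.0.1+", "8.1"},
    "DS14-mk2-AT": {"7.3.5+", "8.0.1+", "8.1"},
    "DS14-mk2":    {"7.3.5+", "8.0.1+", "8.1"},  # 8.1 only with ESH4 modules
    "DS14":        set(),                         # not supported anywhere
}

def is_supported(shelf: str, release_family: str) -> bool:
    """True if the shelf appears as Supported for that release family."""
    return release_family in SHELF_SUPPORT.get(shelf, set())

print(is_supported("DS14", "8.1"))      # False
print(is_supported("DS2246", "8.1"))    # True
```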
Hot-Swappable Components
• Fans
• Power Supplies
Non-Hot-Swappable Components:
• CPU Module Tray
• Memory DIMMs
• NVMEM Battery and DIMM (FAS32x0)
• NVRAM8 Card, Battery, and DIMM (FAS62x0)
• PCI Cards
• PCI Risers
• USB Boot Device
• IOXM
Hardware Upgrades
(FAS32x0 & FAS62x0 only)
Single-controller FAS3210 systems can be upgraded to a dual-node configuration
Single Controller plus IOXM and Dual Controller configurations:
*The enclosure PROM is programmed to a specific type and cannot be changed
*A new enclosure is required in order to change the system personality
The same basic rules apply to V-Series and N-Series systems
Exception: V-Series, N-Series, and FAS controllers are not interchangeable
Hardware Upgrades
(FAS32x0 & FAS62x0 only)
Attempting to hot-change the configuration will cause a shutdown
The system will not be able to boot
Error messages:
https://kb.netapp.com/support/index?page=content&id=201135
New device – replaces the compact flash
– Boot device for Data ONTAP
– Holds environment variables
Same resiliency requirements as CF
Current density is 2GB
Replaceable FRU
Information shows up under slot 0 (u0a) in sysconfig -v:
New, improved system management tool
Replaces RLM / BMC
Fully integrated; not a separate FRU
Independent processor, memory, network
Access: ssh (wrench port) or Ctrl+g from console
– Login: naroot, Password: <system root pwd>
Functions:
– System Sensor Monitoring
– Event Logging
– System Management
– Self Management
– Send AutoSupport message
– RSA
Functions:
• System Sensor Monitoring: Listing of all sensors, details for a single sensor, generate AutoSupport messages when sensors are out of the normal operating range
• Event Logging: Tracks events in a persistent log, view event history, search events, get information regarding the state of the event log
• System Management: Access the Data ONTAP console, power-cycle the Data ONTAP processor, generate core dumps, FRU inventory and information, battery information
• Self Management (management of the SP itself): Configure networking (IPv4 and IPv6), update firmware for the Service Processor and devices on the storage system, view status, reset and reboot, online help
• Send AutoSupport message: Done when the system goes down; send a short AutoSupport message with 'sp test autosupport' from the SP interface; mail settings are based on options set in Data ONTAP.
All system diagnostic tests are now accessed via maintenance mode
LOADER> boot_ontap, then option 5 from the 1-5 / 1-8 menu, will result in limited functionality
LOADER> boot_diags for full functionality
Not menu driven; all CLI
Diagnostic processes are threaded and run in the background, which allows:
– Running multiple tests concurrently
– Checking the status of tests as they are running
– Running other maintenance mode commands as tests are running
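The threaded, run-in-background model described above can be sketched conceptually. The test names, timings, and result states here are invented for illustration; this is not the actual sldiag implementation:

```python
# Conceptual model of background diagnostic tests: each test runs in
# its own thread, so status can be polled and other work done while
# tests execute concurrently.
import threading
import time

results = {}  # test name -> "running" / "passed"

def run_test(name, duration):
    """Stand-in for one diagnostic test running in the background."""
    results[name] = "running"
    time.sleep(duration)          # placeholder for the real test work
    results[name] = "passed"

# Start two tests concurrently, as sldiag allows.
threads = [
    threading.Thread(target=run_test, args=("mem", 0.2)),
    threading.Thread(target=run_test, args=("nic", 0.1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()                      # wait for both background tests

print(results)
```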
Labs
Lab 1-2
Storage Hardware
FC-AL
Shelf Hardware
DS-14, DS-14 HA
DS-14 mk2, DS-14mk2 AT
DS-14mk4
DS14 Family
Hot-Swappable Components
• Power Supply Unit (2 fans in each)
• Disk Drive (one at a time)
• Shelf Module (if multi-pathed or clustered)
Non-Hot-Swappable Components
• Entire Shelf
• More than 1 disk drive at a time per RAID group
Power Supplies:
All DS-14 models use the same power supplies.
End of Support:
DS-14 mk1 shelves, LRC modules, and ESH(1) modules are not supported in Data ONTAP 7.3.x and newer code.
Modules in Shelves:
Module A and B can be LRC, ESH, ESH2, ESH4, or AT-FCX (as long as it is supported by the shelf type); see the sysconfig guide for details. AT-FC and AT-FC2 are only supported in the B position. LRCs are pictured here. Also note that module A is in the top slot of the shelf and is upside down.
LEDs
Refer to the hardware guides available on NOW for the translation of LEDs to a problem description. Also, DS14 shelves come with a pull-out LED guide at the bottom center of the front of the shelf. However, this is easily removed and may be missing.
The Mk1 Disk Canister will not plug into the newer Mk2 shelf
Reference Shelf Identification and Replacement job aid
DS14 Disk Shelf (rear view)
• Fibre Channel drives only
• No Gb switch (1 Gb only)
Disk Identification
The disk ID is composed of 2 parts
– The loop ID
– The device ID
Loop ID
The Loop ID is assigned according to the slot in which the storage controller card is physically installed. For example:
– A storage controller in slot 4 has a loop ID of 4. Some of the disks may be 4.16, 4.17, 4.18
– A dual-port controller in slot 9 supports 2 loops, 9a and 9b. The disks may be 9a.16, 9a.17
– A controller on the motherboard of a storage system has a loop ID beginning with 0, such as 0a, 0b, 0c
Shelf ID: starts at 1
Device ID: starts at 16 and skips 30-31, 46-47, 62-63, 78-79, 94-95, 110-111
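Assuming the numbering above (device IDs begin at 16, and each shelf's 16-ID block skips its last two values because a DS14 holds only 14 disks), the device ID for a given shelf and bay can be sketched as follows. This is a training illustration, not an official formula:

```python
# FC-AL device-ID sketch: shelf numbering starts at 1, device IDs
# start at 16, and each 16-ID block skips its last two values
# (30-31, 46-47, ...) since a DS14 shelf holds only 14 disks.

def device_id(shelf, bay):
    """Shelf numbering starts at 1; bay is 0-13 within the shelf."""
    if shelf < 1 or not 0 <= bay <= 13:
        raise ValueError("shelf must be >= 1 and bay in 0..13")
    return shelf * 16 + bay

print(device_id(1, 0))    # first disk on shelf 1 -> 16
print(device_id(1, 13))   # last disk on shelf 1  -> 29 (30-31 skipped)
print(device_id(2, 0))    # first disk on shelf 2 -> 32
```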
Speed: 1 Gb
Termination: Switch
Connectors:
• 1st shelf: SC in / DB9 out
• Other shelves: DB9 in / DB9 out
Passive component – it cannot detect errors
Not switched; hub technology. One malfunctioning disk can take down the loop.
Not compatible with ESH modules in the same loop
Compatible shelves:
• DS-14
• DS-14mk2
Speed: 1 or 2 Gb (set by switch on shelf)
Termination: Switched
Connectors:
• 1st shelf: LC in / HSSDC2 out
• Other shelves: HSSDC2 in / HSSDC2 out
Active component – it can detect errors
Switched technology to better isolate a single bad disk from the loop.
Compatible shelves:
• DS-14
• DS-14mk2
ESH2
Speed: 1 or 2 Gb (set by switch on shelf)
Termination: Automatic
Connectors: SFP
Better resiliency against a single disk taking down the loop than LRC or ESH
Compatible shelves:
• DS-14
• DS-14mk2
ESH4
Speed: 1, 2, or 4 Gb
Termination: Automatic
Connectors: SFP
Compatible shelves:
• DS-14mk2
• DS-14mk4
•NOTE: When 2Gb disks are inserted in a shelf operating at 4Gb speeds, the 2Gb disks will not work. It is common to see the 2Gb disks show as 'BYP/PCYCL' in STORAGE output.
AT-FC (SCM)
Speed: 2Gb
Termination: Switched
Connectors:
– 1st shelf: LC in / HSSDC2 out
– Other shelves: HSSDC2 in / HSSDC2 out
Does not support dual-attached shelves.
– Supported only in the B slot on the shelf
Compatible shelves:
– DS-14mk2-AT
AT-FC2 (SCM2)
Speed: 2Gb
Termination: Switched
Connectors: SFP
Does not support dual-attached shelves.
– Supported only in the B slot on the shelf
Compatible shelves:
– DS-14mk2-AT
AT-FCX
Speed: 1Gb or 2Gb capable
– Set by a jumper on the module
Termination: Automatic
Connectors: SFP
Supports dual-attached shelves in single-head or clustered environments.
Compatible shelves:
– DS-14mk2-AT
Storage Hardware
SAS
DS4243 (Sequoia)
– 4U, 24 drives, 3 Gb/s
DS2246 (Hackberry)
– 2U, 24 drives, 6 Gb/s
-Modules within the SAS family are NOT supported in different shelves (no IOM6 in DS4243, no IOM3 in DS2246).
-Both disk shelves also support the Vespa controller (FAS2240-2 and FAS2240-4).
-Drives are numbered left to right, starting at the upper left corner (0-23).
-The left-side hubcap includes the shelf ID display and LEDs for Power, Activity, and Warning.
-The shelf ID selector is beneath the hubcap.
© Copyright 2013 NetApp, Inc. all rights reserved. Company confidential – for internal use only Student Guide - 117
GS Learning and Performance - Hardware and Down Storage System Troubleshooting for Partners
Do NOT Distribute
DS4243 rear view (4U): PSU 1, PSU 2, PSU 3, PSU 4; IOM A, IOM B
DS2246 rear view (2U): IOM A, IOM B; PSU 1, PSU 2
IOM (SAS)
IOM3 (Sable)
IOM6 (Badger)
Intermixing DS4243 and DS2246 shelves is NOT supported in the same stack
DS4243 supports only IOM3 modules
DS2246 supports only IOM6 modules
Shelves must contain drives of the same type
– Different capacities of the same drive type can be mixed
DS4243 SAS and SATA shelves can be mixed within a stack
– Recommended to limit to a single crossover point
DS4243 SSD stacks must be homogeneous
– Cannot mix with SAS or SATA shelves
Although multiple crossover points (transitions between SAS and SATA shelves) are supported, it is recommended to limit stacks to a single crossover point.
DS4243 and DS2246 shelves must not be intermixed in the same stack.
Cables include an electronic cable type ID (part number) and an electronic cable unique ID (serial number), which can be viewed in Data ONTAP
QSFP
• Quad Small Form-factor Pluggable. A larger SFP connector with presence-detect intelligence to denote position in the shelf stack. Includes an electronic cable type ID (part number) and an electronic cable unique ID (serial number).
*Exception: The FAS2050 supports a dual-port, half-height PCIe adapter with Mini-SAS connectors.
The square port on one shelf should be connected to the circle port on the next
Multipath HA is required for supported DS4243 and DS2246 connectivity
– The system will log a "not multipathed" message if one path is broken
– The exception is the FAS2040
Quad-port SAS HBAs: The same shelves should not be attached to ports A and B, or to ports C and D
Maximum 10 shelves per channel
For quad-port SAS HBAs, ports A and B are on one ASIC chip and ports C and D are on a second ASIC chip. Therefore, by using ports A and C to connect to the top shelf in each stack and ports B and D to connect to the bottom shelf in each stack, the controller maintains connectivity to the disk shelves if an ASIC chip fails or misbehaves.
The FAS2040 is the exception to the MPHA rule because it only has one external SAS port and no PCI slots.
SAS can actually address thousands of devices per channel, but 10 shelves is a limit set within Data ONTAP for performance reasons.
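The quad-port cabling rule above (A/B share one ASIC, C/D share the other, so a stack's two paths should land on different ASICs) can be checked with a tiny sketch. The port-to-ASIC mapping follows the slide; the function and stack examples are hypothetical:

```python
# Check the quad-port SAS HBA cabling rule: a stack cabled to two
# ports on the same ASIC loses both paths if that ASIC fails.
# Port letters A-D and the A/B vs C/D pairing are from the slide.

ASIC_OF = {"A": 0, "B": 0, "C": 1, "D": 1}

def stack_survives_asic_failure(ports):
    """True if the stack's paths use ports on more than one ASIC."""
    return len({ASIC_OF[p] for p in ports}) > 1

print(stack_survives_asic_failure(["A", "C"]))  # recommended pairing -> True
print(stack_survives_asic_failure(["A", "B"]))  # same ASIC, avoid    -> False
```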
Power must be on to change the address on the OPS panel!
Push and hold the ID switch for 2-3 seconds to enter set mode
The 1's digit will flash; push the button repeatedly to select 0-9
Hold the ID switch for another 2-3 seconds and the 10's digit will flash; push the button repeatedly to select 0-9
Hold the ID switch for another 2-3 seconds to exit set mode
Halt the storage system
Power-cycle the shelf
Boot the storage system
If two shelves on a stack have the same shelf ID, functionality is not impacted. The shelf with the duplicate shelf ID will have its serial number used in place of the shelf ID in AutoSupport output.
Use the sysconfig output to determine your shelf
type:
Shelf Type Sysconfig output
FC7 Shelf 0: EDM Kernel Version : 0.4
FC8 Shelf 0: EDM Kernel Version : 1.0.A
FC9 Shelf 0: VEM Kernel Version : 2.5 App. Version : 3.1
DS14mk1/2-FC Shelf 1: LRC Firmware rev. LRC A: 11 LRC B: 11 *
DS14mk2/4-FC Shelf 1: ESH2/4 Firmware rev. ESH A: 19 ESH B: 19 *
DS-14mk2-AT Shelf 1: AT-FCX Firmware rev. AT-FCX A: 27 AT-FCX B: 27
DS4243 slot 4: SAS Host Adapter 4b (PMC-Sierra PM8001 rev. C, SAS...
Shelf 0: IOM3 Firmware rev. IOM3 A: 0102 IOM3 B: 0102
DS2246 slot 2: SAS Host Adapter 2c (PMC-Sierra PM8001 rev. C, SAS, <UP>)
Shelf 0: IOM6 Firmware rev. IOM6 A: 0104 IOM6 B: 0104
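The table above amounts to a lookup on the module keyword in the sysconfig line. As a minimal sketch (the sample line is modeled on the DS4243 row above):

```shell
# Map a sysconfig shelf line to its shelf family by module keyword,
# following the table above. Unknown modules fall through.
shelf_family() {
    case "$1" in
        *IOM6*)   echo "DS2246 (SAS)" ;;
        *IOM3*)   echo "DS4243 (SAS)" ;;
        *AT-FCX*) echo "DS14mk2-AT" ;;
        *ESH*)    echo "DS14mk2/4-FC" ;;
        *LRC*)    echo "DS14mk1/2-FC" ;;
        *)        echo "unknown shelf module" ;;
    esac
}

shelf_family "Shelf 0: IOM3  Firmware rev. IOM3 A: 0102 IOM3 B: 0102"
```

In practice you would feed this from `sysconfig -a` output line by line rather than from a literal string.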
Goals of ACP
– Provides new options for non-disruptive recovery of shelf modules such
as:
SAS expander reset
SAS expander power cycle
– Enhanced First-Time Data Capture for shelf issues
– Provide the infrastructure for future innovation
– ACPP firmware update
Non-disruptive to I/O
Downloaded over the Ethernet link
Image delivery
– Packaged in Data ONTAP
– Available on NOW site for download
What ACP is not
– NOT a way to do enclosure services
– NOT a replacement for in-band shelf functionality (such as data)
NetApp Confidential — Limited Use
Initiated by Data ONTAP
– Integrated into I/O error recovery chain
Delivered independently of the data path
Intended to be non-disruptive
– Quick recovery
– No effect on shelf peer I/O module
– Single-Failure & MPHA: recover without I/O delays
– Multiple Failures (or Single-Path): recover with I/O delays
The alternative is system panic
Special logic to prevent back-to-back recovery events
– ACPP-enforced 10-second waiting period
– Avoids tug-of-war between nodes in an HA pair
Single-Path
The FAS2050 is the only configuration that will be single-path in a clustered environment with SAS shelves. All other configurations should be shipped MPHA.
Private Ethernet network
– User configures one Ethernet port
on each node
– Future platforms will have a
dedicated port
IP addresses automatically assigned
by Data ONTAP
Daisy-Chain Topology (see figure)
Pros :
– Simplicity, Cost
Cons :
– Single points of failure in the
Ethernet cables
Although Data ONTAP will keep
running
– FRU isolation requires diagnosis
Daisy-Chain ACP Cabling
ACP is configured during setup
Setup Menu
ACP comes after RLM in the setup sequence
Primary tool for diagnosing ACP connectivity issues
Summarizes a controller’s view of the ACP network
Sample output:
Alternate Control Path: enabled
Ethernet Interface: e0b
ACP Status: Active
ACP IP address: 198.15.1.212
ACP domain: 198.15.1.0
ACP netmask: 255.255.255.0
ACP Connectivity Status: Full Connectivity
Alternate Control Path: enabled or DISABLED
Ethernet Interface: port assigned to this ACPA
ACP status: Active or Inactive
ACP Connectivity status
– No Connectivity – no ACPP connected
– Full Connectivity – data path matches control path
– Partial Connectivity – some IOMs seen only on data path (not ACP)
– Additional Connectivity – some IOMs seen only on ACP (not on data path)
– NA - ACP state is Inactive
Alternate Control Path: enabled
Ethernet Interface: e0b
ACP Status: Active
ACP IP address: 198.15.1.212
ACP domain: 198.15.1.0
ACP netmask: 255.255.255.0
ACP Connectivity Status: Full Connectivity
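The status values above can drive a simple triage step. The hints below paraphrase this guide's definitions; the parsing helper is a sketch over `storage show acp` output, not a Data ONTAP feature:

```shell
# Extract the ACP Connectivity Status line from `storage show acp`
# output and print the interpretation given in the list above.
acp_hint() {
    status=$(printf '%s\n' "$1" | sed -n 's/^ACP Connectivity Status: //p')
    case "$status" in
        "Full Connectivity")       echo "OK: data path matches control path" ;;
        "Partial Connectivity")    echo "Check: some IOMs seen only on the data path" ;;
        "Additional Connectivity") echo "Check: some IOMs seen only on ACP" ;;
        "No Connectivity")         echo "Check: no ACPP connected" ;;
        *)                         echo "ACP inactive or status unknown" ;;
    esac
}

sample='Alternate Control Path: enabled
Ethernet Interface: e0b
ACP Status: Active
ACP Connectivity Status: Partial Connectivity'
acp_hint "$sample"
```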
ACP Connectivity Status: Full Connectivity
ACPP Summary
Alternate Control Path: enabled
Ethernet Interface: e0b
ACP Status: Active
ACP IP address: 198.15.1.212
ACP domain: 198.15.1.0
ACP netmask: 255.255.255.0
ACP Connectivity Status: Partial Connectivity
Example: 7c.001.A OK
7c.001.B ACP connection lost
7c.002.A, Never connected to ACP
7c.002.B
storage download acp [<adapter_name>.<shelf_id>.<module_number>]
bovard> Wed Apr 22 08:19:48 GMT [acp.command.sent:info]: sent firmware download (image: ACP-IOM3-01.00.tgz)
command to 198.15.1.218.
Wed Apr 22 08:20:17 GMT [acp.command.response:info]: Command firmware download to 198.15.1.218 was
successful.
acpadmin list_all
Lists all the ACPPs seen by Data ONTAP
– For ACPPs that are not accessible in-band, there will be no shelf S/N or Inband ID
acpadmin expander_reset
acpadmin expander_reset <adapter_name>.<shelf_id>.<module_number>
Shelf logs can be collected from some shelf modules
– ESH2
– ESH4
– AT-FCX
A reboot erases most logs (AT-FCX and ESH4 have some persistent logging, but it is less helpful).
Connect a special console cable to the Module
Different cable for each module type
Directions for collecting the shelf logs can be supplied
by NetApp support
NetApp engineering can analyze the logs
* NOTE: AutoSupport messages also contain shelf log information, found in the SHELF-log.gz output. The ASUP copy contains only the most recent 5 MB of logs, even though the actual logs may be larger. The full log is located in /etc/log/shelflog.
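The 5 MB cap above means an ASUP shelf log can be silently incomplete. A minimal sizing check (the path and threshold come from the note above; the helper itself is illustrative):

```shell
# Decide whether the ASUP copy of the shelf log is complete, given the
# size in bytes of the on-box /etc/log/shelflog file. The ASUP copy is
# capped at the most recent 5 MB, per the note above.
needs_full_log() {
    size=$1
    cap=$((5 * 1024 * 1024))
    if [ "$size" -gt "$cap" ]; then
        echo "ASUP copy is truncated; collect /etc/log/shelflog directly"
    else
        echo "ASUP SHELF-log.gz contains the complete log"
    fi
}

needs_full_log 7340032   # a 7 MB shelf log exceeds the ASUP cap
```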
Cable Connectors
Common use of Optical and Copper Cables:
Optical cable from head to the first shelf
– LC, SC and SFP (with LC connector)
Copper cable from shelf to shelf
– DB9, HSSDC2 and SFP
To determine the cable you need
– Storage System Side:
sysconfig –a to see the adapter
Parts finder: match connector with adapter
– Shelf side: Parts finder to match connector
with shelf module
Cable Connectors
Generally, an optical cable connects the Storage System head to the first shelf and copper cables are used from shelf
to shelf
Storage System Connector
Shelf Module   1st Shelf IN   Shelf-to-shelf connectors
LRC            SC             DB9
ESH            LC             HSSDC2
AT-FC
MultiPath HA
With Multi-path HA we have an optical connection at the end of the loop as it connects back to the Storage Controller.
DB9 - FCAL
Used for shelf to shelf connections
between LRC modules
SC - FCAL
Used to connect to the IN on the LRC
module on shelf one.
LC - FCAL
Connect to IN on ESH on
1st shelf
HSSDC2 - FCAL
Used for shelf to shelf ESH
modules.
Connect ESH2, ESH4, AT-FCX
Optical SFP
Copper SFP
Shortwave optical adapter
To use an Optical SFP connection, use an LC connected optical cable attached to a SFP to LC Shortwave adapter.
The SFP to LC Shortwave adapter inserts into the SFP slot.
QSFP - SAS
Copper or optical
For use in SAS shelves (Sequoia)
Supports four independent channels per cable
Labs
Lab 1-3
Remote System Management
Allows logging of all console messages
Allows management and logging with no
network connectivity
Necessary in down storage system situations
– View boot process including any problems
– Maintenance mode
– Diagnostics
The RLM is for remote administration:
– Accessed over SSH
– Provides remote console access
– Force a core dump
– Power cycle system
– View error logs (system log and events all)
– Remote Support (RSE/RSA)
– Is powered by the storage system’s power
supplies
– Network connectivity may be necessary when
configuring
The Remote LAN Module (RLM) is available in the FAS3000 and above.
-Was an option in FAS3020/50
-Has been standard in every system since then (FAS3040/70, FAS31x0, FAS60x0)
The RLM is accessed via an SSH connection (login: naroot) and then can provide console access to the
system. This is especially needed if the storage system will not boot or is hung.
Remote Support: Allows NetApp Support to trigger ASUPs, collect core files (not create), collect data from
/etc/log and /etc/crash.
http://support.netapp.com/NOW/download/tools/rsa/
rlm setup
– Used to assign an IP address, network mask and gateway
– Can configure for DHCP
rlm status
– Displays IP address, network mask, and gateway
– Check RLM Firmware version
rlm reboot
– Reboot the RLM
– Does not affect storage system status
– Approximately 1 minute to complete
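When checking the RLM firmware version with `rlm status`, the version can be pulled out of the command output for scripted comparison. The exact field label ("Firmware Version") in the sample below is an assumption for illustration; adjust the pattern to match your system's actual output:

```shell
# Extract the firmware version field from captured `rlm status` output.
# The "Firmware Version:" label is assumed; verify it on your system.
rlm_fw() {
    printf '%s\n' "$1" | sed -n 's/^[[:space:]]*Firmware Version:[[:space:]]*//p'
}

sample='Remote LAN Module   Status: Online
        Firmware Version:   4.0
        Mgmt MAC Address:   00:A0:98:00:00:00'
rlm_fw "$sample"
```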
Display all events logged by the RLM
– RLM system1> events all
Display console log information
– RLM system1> system log
Display the system hardware sensor list
– RLM system1> system sensors
Display the RLM configuration
– RLM system1> rlm status
Display a summary of information about the records in the events log
– RLM system1> events info
Dump the system core and reset the storage system
– RLM system1> system core
The RLM is most useful for triaging cases where the controller has experienced an ‘unexplained
takeover’. To ensure best possible means for diagnosing the problem it is important to collect ‘events all’
and ‘system log’ immediately following such an event (BEFORE running diags or entering maintenance
mode).
Access: ssh (wrench port) or Ctrl+g from console
– Login: naroot, Password: <system root pwd>
Provides remote console access
Call home and down controller notifications (ASUP)
Remote power cycle, system coredump, appliance reset
Access to system logs from a down appliance
Non-volatile hardware system event logs
Captured console log history
New, improved system management tool
Replaces RLM / BMC
Fully integrated; not a separate FRU
Independent processor, memory, network
Access: ssh (wrench port) or Ctrl+g from console
– Login: naroot, Password: <system root pwd>
Functions:
– System Sensor Monitoring
– Event Logging
– System Management
– Self Management
– Send AutoSupport message
– RSA
Functions:
• System Sensor Monitoring: Listing of all sensors, details for a single sensor, generate AutoSupport
messages when sensors out of normal operating range
• Event Logging: Tracks events in persistent log, view event history, search Events, get information
regarding state of the event log
• System Management: Access Data ONTAP console, power-cycle Data ONTAP processor, generate
core dumps, FRU inventory and information, battery information
• Self Management (Mgmt of SP itself): Configure networking (IPv4 and IPv6), update firmware for the
Service Processor and devices on the storage system, view status, reset and reboot, on-line help
• Send AutoSupport message: Done when system goes down, send short AutoSupport message with ‘sp
test autosupport’ from SP interface, mail settings are based on options set in Data ONTAP.
Configurable from Data ONTAP or LOADER:
> sp setup
Check status and show configuration using
sysconfig -v or sp status:
events all
system log
system sensors all
system power cycle
system core
sp status
Feature SP RLM BMC
Sensor Management Yes No No
Environment sensors for things like temperature, power-supply faults and battery status.
Integrated into device Yes No No
The management device is part of the storage controller’s motherboard and cannot be removed.
Remote Data ONTAP Console Yes Yes No
A way of reaching Data ONTAP’s serial console via Ethernet.
Remote Power Cycle Yes Yes No
Power-down and then power-up the storage system remotely.
AutoSupport messages Yes Partial No
Automated support messages based on “trouble” spots in the device. For example, when a sensor is out of range, an
AutoSupport message will be generated.
Fan / Cooling Management Yes No No
Control over the fan speeds that is driven by information collected by onboard sensors.
SNMP Yes Yes No
Generated traps for exceptional events for management applications that use SNMP for device management.
System Event Logs (SEL) Yes Yes No
A persistent log of events so you can see what happened to a storage controller when diagnosing a problem.
Detailed FRU Information Yes No No
Get an inventory of the system’s FRUs and see data such as the serial number and FRU firmware version.
IPv6 Support Yes Yes No
Support for IPv6 as a method for communicating with the service processor remotely.
Battery Information Yes Partial No
Detailed information regarding the type and firmware of the installed batteries.
Self-test Yes No No
A self-test to check for hardware, software, or configuration problems that would interfere with the operation of the management
processor.
Ping and Traceroute Yes No No
Tests to see if you can reach a host. Great for basic network connectivity debugging.
View and Control System LEDs Yes No No
Inspect the state of various LEDs. Change the state to on or off.
Module Summary
You should now be able to:
Identify various platforms and associated storage
Demonstrate use of the NetApp Hardware Universe
Discuss newest available platforms
Identify shared features and components of various
platforms
Discuss FC-AL technology and related components
Discuss SAS technology and related components
Discuss remote management options
Hardware and Down Storage System Troubleshooting for Partners
High Availability Hardware Basics
Module Objectives
By the end of this module, you should be able to:
Describe what protection high availability provides
Describe high availability (HA) components
Perform high availability cabling
Perform Multi-path CFO cabling
Module Topics
This module contains the following sections:
Controller Failover Option (CFO) Technology
Describe the HA components
Describe HA Cabling
Describe multi-path CFO cabling
High Availability is designed to protect against Storage Controller (head) failure, providing transparent automatic takeover.
Both Storage Controllers are active in their respective normal state, but are passive to each other.
Clustering provides:
• High availability, using a single controller to provide all services (data) of the failed controller
• Data that is completely consistent, using mirrored NVRAMs through the Cluster Interconnect (IC)
[Diagram: two HA nodes joined by the IC, with ports 0a/0b cross-cabled to Shelf 1 and Shelf 2 on each side; a second panel shows one node failed (X) with its partner serving its shelves.]
Note: Disk shelves are shared in clusters and should never be powered off.
The NetApp initial high availability product uses the high speed interconnect to pass heartbeat signals
and information about changes to the file systems between two co-operating Storage Controllers.
Heartbeats will also pass via the network and the Fibre Channel disks. Each Storage Controller will own a set of file systems. NVRAM information from each Storage Controller will be copied to its partner. If a
Storage Controller goes down, the survivor will sense it. The survivor will grab the disks from the downed
Storage Controller, mount the filesystems, take over the network address (IP) and hardware address
(MAC) and automatically begin serving the data.
All of this will be transparent to NFS clients using hard mounts. Clients of the downed Storage Controller
will never know that the surviving Storage Controller is ghosting for it.
Note: There are two distinct Storage Controllers here. Each Storage Controller has an address and a set
of files. Even when a switchover occurs, it appears to the clients that there are two different Storage
Controllers running. To access a particular file, the client must know which Storage Controller it is on.
The failover is equivalent to a reboot. Therefore, stateful connections (CIFS or TCP) will behave the same as during a reboot. Client applications that don't automatically reconnect will have to be restarted.
High Availability requires two nodes to be
connected by either:
– Point-to-point cabling using a cluster
interconnect card between two Storage
Controllers
– The backplane in integrated storage
systems such as the FAS270c, FAS2000
and FAS3100 family, which is a single
chassis
HA Components
FC-AL controller cards are used to connect
each storage controller head to each set of
shelves:
– shelves owned by itself
– shelves owned by the partner
Shelf with two modules installed that support
clustering: LRC, ESH, ESH2, ESH4, AT-FCX,
IOM3
Cluster interconnect card
Mailbox disks
https://kb.netapp.com/support/index?page=content&id=1010888
There are two mailbox disks in each node. They store HA information such as the time since last partner contact and the number of disks each storage controller sees. Which disks are being used as mailbox disks can be seen with the command "cf monitor all" from diag or advanced mode.
There are occasions when the mailbox disks on one node will be out of sync with the mailbox disks on the other node, due to one head being down or another anomaly. They will usually resolve this within two minutes. The mailbox disks will be recreated at boot automatically, assuming good access to all paths.
Interconnect Interface
Differences in high availability cards
NVRAM IV
– Requires a separate interconnect card such as
Mellanox
NVRAM V has cluster-interconnect built in
NVRAM VI has cluster-interconnect built in
NVRAM VII built into FAS3100 motherboard
(interconnect is via internal backplane)
Interconnect Interface
NVRAM and HA
Each Storage Controller dedicates half of its NVRAM to a synchronously updated copy of its partner's NVRAM. If a
takeover occurs, the takeover Storage Controller uses the cluster data in the part of the NVRAM dedicated to the
failed Storage Controller to ensure that no information is lost.
Sysconfig Guide
Use the Sysconfig Guide to verify what adapters are compatible and what slot it goes in.
NVRAM DIMMs
On NVRAMs that are built into the Motherboard such as in the FAS2000s and FAS3100s, DIMM 0 refers to the
NVRAM DIMM. On the FAS3100 this DIMM is replaceable.
Cabling Diagrams
By defining a method of drawing cabling diagrams, it is easier to visualize the physical layout.
[Diagram: two heads with ports 0a/0b, each cabled to its own Shelf 1 stack.]
The reason for recommending TSEs follow this method when drawing cabling diagrams is to have a common method
so everyone understands a diagram without additional explanation. Of course when a TSE draws these diagrams
they won’t be in color, but they don’t need to be. The important points are, the Primary loops are on the outside and
the Partner Loops are on the inside.
Cabling a Single FAS Unit (typical example)
With single-head cabling, we connect each shelf with 2 FC-AL loops for path redundancy.
If one loop fails, the disks can still be accessed over the other path and no outage will occur.
Cable each shelf using (0a or 0b) with (0c or 0d)
– Either combination is acceptable
[Diagram: ports 0a and 0c from a single head cabled to the shelf stack.]
Termination
Each loop is terminated at shelf 2
In the wild
Note: in the field you may see head #1 0a and head #2 0a go to the same stack of shelves. This may
work but it is not correct and is not supported for hardware disk ownership (HDO). This would need to be
re-cabled.
Cabling an HA Storage System with Hardware Disk Ownership (HDO)
Connect both cluster interconnect cables without crossing them between Node A and Node B (P1-P1, P2-P2).
Cabling the 0a ports from both nodes to the same shelf is an illegal configuration with hardware disk ownership.
We have cabled 0a on each head to the disks it owns
– This is module A on each shelf
We have cabled 0b on each head to its partner's disks
– This is module B on each shelf
[Diagram: IC between the nodes; ports 0a and 0b from each head cabled to Shelf 1 and Shelf 2 on each side.]
Termination
Each loop is terminated at shelf 2
In the wild
Note: in the field you may see head #1 0a and head #2 0a go to the same stack of shelves. This may
work but it is not correct and is not supported for hardware disk ownership (HDO). This would need to be
re-cabled.
Limitations of HA Failovers
Normal cluster
– An A loop problem causes a failover
– Data stays online, which is the goal
Problems with an HA failover:
– Brief outage for protocols
– Performance concerns as all processes for
both nodes are running on one physical head
– Must plan to perform a giveback, which will
also cause a brief outage
Limitations of HA Failovers
Normal Cluster
With normal HA, a failure of a component on the A loop means the head loses access to its owned disks. This
requires a cluster failover to keep data online. Keeping data online is the goal and this is achieved. However a
failover can cause problems such as brief outage for protocols, performance concerns and sometimes a brief outage
during a giveback.
Resiliency Multi-Path HA
MPHA provides an additional path from the
storage controller to the disk shelves for
active/active configurations for improved
resiliency and performance
Both cluster nodes have an A and a B path to
each set of shelves
If the Primary loop goes down, there is a
backup loop from the primary node, so a
cluster failover is not required
Some examples of problems Multi-path HA can withstand without a failover that a normal cluster could not:
•Module failure on A loop
•Cable failure on A loop
•Storage controller and ASIC problem on A loop
Supported hardware platforms starting with 7.1.1*
– FAS9XX, FAS3020/3050
Supported hardware platforms starting with 7.2.1*
– FAS3070 and FAS6XXX systems
* MPHA is NOT supported on 7.1 or 7.2
ESH2 and higher or AT-FCX storage shelves
– AT-FCX storage shelves must be CPLD version 24+ (RoHS
compliant) and firmware version must be 32 or higher
– The environment command can be used to find the CPLD version
Software Disk Ownership MUST be enabled (SANOWN)
Twice the number of disk ports compared to classical
active/active storage configuration
No additional license is required
RoHS Compliant
If the AT-FCX module is RoHS compliant then it will have a minimum CPLD version of 24.
Note: CPLD version 23 and earlier do not support HBA connections to the out port.
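The compatibility rule above can be reduced to a one-line check. The following sketch is illustrative only, not a NetApp tool; the function name is invented.

```python
# Hypothetical check for the AT-FCX MPHA requirement described above:
# CPLD version 24+ (RoHS compliant) and shelf firmware 32 or higher.

def at_fcx_supports_mpha(cpld_version, fw_version):
    return cpld_version >= 24 and fw_version >= 32

print(at_fcx_supports_mpha(24, 32))   # True
print(at_fcx_supports_mpha(23, 37))   # False: CPLD 23 and earlier do not
                                      # support HBA connections to the out port
```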
(Diagram: both controllers' ports 0a/0b/0c/0d cabled to the Shelf 1/Shelf 2 stacks, with the IC between the heads.)
1. Start with normal cluster cabling.
2. Cable the 2nd primary path to the shelves. This will attach to the end of the last B module on the owned shelf stack. We use the B module because this is to be redundant to the A loop.
3. Cable the 2nd partner path to the shelves. This will attach to the last A module on the partner shelf stack.
(Diagram: two controllers joined by the IC, ports 0a/0b and 0c/0d cabled to the Shelf 1/Shelf 2 stacks.)
Adapters with the same ID connect to the same stack of shelves.
Device IDs are unique per HA configuration, not only per controller.
1. On one node, get the disk serial number of the 1st disk on the 0a loop.
2. Search the rest of the sysconfig -a output to see if that disk appears in another loop.
   If so, record the matching loops (i.e., 0a=0c).
3. Search the sysconfig -a output on the partner to see if that disk appears in another loop.
   If so, record the matching loops (i.e., system1 0a=system2 0b).
4. Perform steps 1-3 for all loops.
5. Check sysconfig -r to see which disks are owned by which node (i.e., system1 owned loops are 0a and 0c, and its partner loops are 0b and 0d).
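The serial-number matching in steps 1-3 can be sketched in Python. Everything below is hypothetical, standing in for values you would read out of sysconfig -a.

```python
# Hypothetical sketch of the loop-matching procedure above: given, for each
# (node, loop), the serial number of the first disk seen on that loop,
# loops reporting the same disk are two paths to the same shelf stack.

def match_loops(first_disk_serials):
    """first_disk_serials: {(node, loop): serial} -> list of matched loop pairs."""
    by_serial = {}
    for key, serial in first_disk_serials.items():
        by_serial.setdefault(serial, []).append(key)
    # Two loops that see the same first disk share a stack.
    return [tuple(sorted(keys)) for keys in by_serial.values() if len(keys) == 2]

seen = {
    ("system1", "0a"): "3KT1ABCD",
    ("system1", "0c"): "3KT1ABCD",   # same disk -> 0a and 0c share a stack
    ("system1", "0b"): "3KT9WXYZ",
    ("system2", "0d"): "3KT9WXYZ",   # partner path to system1's 0b stack
}
print(match_loops(seen))
```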
If you have an HA system, the results may look something like (system1 loop = system2 loop):
0a = 0d
0b = 0c
0c = 0b
0d = 0a
(Diagram: system1 and system2 joined by the IC, with the loops cross-connected as listed.)
With HA and MPHA, shelves remain a single
point of failure
Some failures that would bring down the loop
are:
– Shelf hardware problems such as backplane
failure
Eliminate SPoF
Shelves can be eliminated as SPoF with SyncMirror or MetroCluster with SyncMirror.
Two disk stacks comprise different plexes owned by one head, and two disk stacks (plexes) are owned by the partner
Each write from the owning node has to be written to both plexes
If one set of shelves (plex) goes offline, data is still available on the 2nd set of shelves (plex) with no downtime
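The plex behavior described above can be modeled as a toy class; class, method, and plex names are invented for illustration, not NetApp internals.

```python
# Toy model of SyncMirror behavior: every write lands on both plexes, and if
# one plex goes offline, reads are served from the surviving plex.

class MirroredAggregate:
    def __init__(self):
        self.plexes = {"plex0": {}, "plex1": {}}
        self.online = {"plex0": True, "plex1": True}

    def write(self, block, data):
        for name, plex in self.plexes.items():
            if self.online[name]:
                plex[block] = data          # each write goes to every online plex

    def read(self, block):
        for name, plex in self.plexes.items():
            if self.online[name]:
                return plex[block]          # any online plex can serve the read
        raise IOError("all plexes offline")

agg = MirroredAggregate()
agg.write(0, "data")
agg.online["plex0"] = False                 # simulate a shelf/plex failure
print(agg.read(0))                          # still served, from plex1
```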
SyncMirror (Local)
(Diagram: two heads joined by the IC; on each head, Plex 0 is mirrored to Plex 1.)
MetroCluster uses SyncMirror. However, the 2nd plex
can be in a remote location where the partner is
located.
Because of the extended distance between the
cluster nodes, SAN switches are often used to
connect the storage system to the remote storage.
Stretch MetroCluster does not use FC switches and
therefore can cover only a limited distance.
HA Considerations
NVRAM is split between the storage
controllers, even before takeover
After takeover, the total workload of both
storage controllers runs on one controller (head)
Disk load remains the same for both storage
controllers
Properly configured /etc/rc files are
required for proper network failover
It is important to understand the reason for an
unplanned takeover
CFO options can affect when takeover occurs
Pre-8.1.2, HA auto-giveback defaults to off;
in 8.1.2 and later, the default is giveback after five
minutes
Giveback should not occur until the reason for
the takeover is understood and the problem(s)
corrected
Auto Giveback
There is an option to turn on auto giveback. Pre-8.1.2 it is off by default, but it can be enabled with
"options cf.giveback.auto.enable on". To adjust the timing, use "options
cf.giveback.auto.delay.seconds <number of seconds>".
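The version-dependent default above can be summarized in a small helper. This is a sketch: the function name is invented, and the 300-second delay is an assumption based on the "five minutes" in the text (it is tunable via cf.giveback.auto.delay.seconds).

```python
# Sketch of the auto-giveback defaults described above. Version handling is
# simplified to dotted "major.minor.patch" strings for illustration.

def autogiveback_default(ontap_version):
    """Return (enabled_by_default, delay_seconds) for a Data ONTAP version."""
    major, minor, patch = (int(x) for x in ontap_version.split("."))
    if (major, minor, patch) >= (8, 1, 2):
        return True, 300    # 8.1.2+: giveback after five minutes by default
    return False, 300       # pre-8.1.2: off unless cf.giveback.auto.enable on
                            # (300 s is an assumed delay once enabled)

print(autogiveback_default("8.1.1"))  # (False, 300)
print(autogiveback_default("8.1.2"))  # (True, 300)
```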
Command           Description
cf takeover       Normal takeover command
cf takeover -f    Allows a takeover to proceed even if it will abort a coredump
                  on the other system
cf takeover -n    For non-disruptive upgrades (partner node was running an
                  incompatible version of Data ONTAP)
cf forcetakeover  Can lead to data inconsistency, as NVRAM contents are
                  discarded
cf giveback       Normal giveback command
cf giveback -f    Giveback even if outstanding CIFS sessions, active system
                  dump processes, or other system operations make a giveback
                  dangerous or disruptive
cf forcegiveback  Can lead to data inconsistency, as NVRAM contents are
                  discarded for the node being given back; it may be needed when
                  the system panics during normal giveback
Problem        Option                    Default
Hung system    cf.takeover.on_failure    On
If one or more NICs or VIFs that are enabled for negotiated failover on a node fail, takeover will
occur. The options that need to be set are:
cf.takeover.on_network_interface_failure set to on
cf.takeover.on_network_interface_failure.policy set to any_nic
Interfaces are enabled for negotiated failover using the ifconfig command with the nfo option.
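The negotiated-failover decision can be sketched as a predicate. The option names come from the text above; the all_nics value and the function itself are assumptions for illustration, not Data ONTAP internals.

```python
# Sketch of the negotiated-failover policy described above: with
# cf.takeover.on_network_interface_failure on, the policy controls whether
# takeover needs any one nfo-enabled interface down, or all of them.

def should_takeover(nfo_option_on, policy, failed_ifaces, nfo_ifaces):
    if not nfo_option_on or not nfo_ifaces:
        return False
    failed = [i for i in nfo_ifaces if i in failed_ifaces]
    if policy == "any_nic":
        return len(failed) >= 1          # any single nfo interface failing triggers takeover
    if policy == "all_nics":             # assumed alternative policy value
        return len(failed) == len(nfo_ifaces)
    return False

print(should_takeover(True, "any_nic", {"e0a"}, ["e0a", "e0b"]))   # True
print(should_takeover(True, "all_nics", {"e0a"}, ["e0a", "e0b"]))  # False
```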
Monitoring CF
cf status
– Is cluster enabled?
– Is partner up, down, waiting for giveback, or in
takeover?
cf monitor
– Minimal stats in basic mode
– In diag mode, cf monitor all gives a *lot* of data
cf hw_assist
– Monitors the health of the partner node
– Can speed the start of takeover in some hardware
failure situations
Monitoring CF
Configuring hw_assist
Requires an RLM or SP and Data ONTAP 7.3 or later. Not supported on any system with a BMC (such as FAS20xx).
https://kb.netapp.com/support/index?page=content&id=1010145&locale=en_US
Output examples:
node1> cf status
Cluster enabled, node2 is up.
node1> cf monitor
current time: 30May2013 22:14:12
UP 1+02:13:21, partner 'node2', cluster monitor enabled
VIA Interconnect is up (link 0 up, link 1 up), takeover capability on-line
partner update TAKEOVER_ENABLED (30May2013 22:14:12)
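Fields like partner name and uptime can be pulled out of the cf monitor banner line shown above. This is an illustrative parser only: the exact output format varies by release, and the regex matches this one sample.

```python
# Hypothetical parser for the 'cf monitor' banner line in the sample output.
import re

SAMPLE = "UP 1+02:13:21, partner 'node2', cluster monitor enabled"

def parse_cf_monitor(line):
    m = re.match(r"UP (\S+), partner '([^']+)', cluster monitor (\w+)", line)
    if not m:
        return None
    return {"uptime": m.group(1), "partner": m.group(2), "monitor": m.group(3)}

print(parse_cf_monitor(SAMPLE))
```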
Module Summary
You should now be able to:
Describe what protection high availability provides
Describe high availability components
Perform high-availability cabling
Perform Multi-path CFO cabling
Module Review
What single points of failure exist in a normal HA configuration?
Shelf chassis
How much additional hardware is needed to
configure Multi-path CFO?
2 FCAL ports per head, 4 cables, & 4 SFPs in OUT
shelf module ports on last shelf if not already
installed.
Labs
Lab 2-1
Lab 2-2
Hardware and Down
Storage System
Troubleshooting for
Partners
Boot Process
Module Objectives
By the end of this module, you should be able to:
Describe the storage controller boot sequence
Explain the different methods of booting
Know that the boot-device environment variable determines the device
from which to boot (view it with printenv boot-device)
Describe the use of setenv auto-boot
Describe the differences between booting with boot vs. bye
State the role of the compact flash card at boot time
Define the RAID Tree - plex, RAID group, logical volume, disks
Identify how RAID groups work with mixed disk sizes
Describe the different Data ONTAP, RAID and WAFL versions
Analyze the console logs during the boot process
Evaluate when to run diags
Module Topics
This module includes the following topics:
Boot sequence
Boot methods and environment variables
Boot versus Bye
Flash Card Partitions
Data ONTAP, RAID and WAFL versions
Following the boot sequence
The RAID tree
Diagnostics
Boot Sequence – 7G
BIOS
RAID
WAFL
Data ONTAP
Boot Sequence 7G
(Diagram: Motherboard firmware (OFW/CFE/LOADER) loads the Data ONTAP kernel from flash into system memory; RAID and WAFL start, then the storage, network, and protocol layers read the /etc config files: rc, hosts, exports, registry.)
BIOS
FreeBSD Kernel
7-Mode Module
RAID
WAFL
Data ONTAP
Firmware
OFW, CFE, and LOADER are types of firmware used by NetApp. Firmware provides access to
motherboard commands; it is the equivalent of the BIOS on a PC. For example, diagnostics and
environment variables are accessed from the firmware prompt.
At this point in the boot process, the platform module is started, which determines whether 7-Mode is
required. A virtualized 7G module is started.
RAID
When RAID starts up, the disks are accessed and disk labels are read, but no data. At this point,
commands are system-level, accessing information about the aggregates or traditional volumes.
WAFL
WAFL is the filesystem of Data ONTAP. Once WAFL is loaded, data on the disks is accessible.
One of the few times when WAFL is loaded without loading Data ONTAP is during WAFL_check.
Data ONTAP
After WAFL is loaded, the rest of Data ONTAP is loaded from the disks. Processes that were not loaded
by the mini-kernel are loaded and started. Configuration information that is specific to this storage
controller is read and applied here.
(Diagram: Motherboard firmware (OFW/CFE/LOADER) loads the virtualization layer and Data ONTAP kernel into system memory; RAID and WAFL start, then the storage, network, and protocol layers read the /etc config files: rc, hosts, exports, registry.)
Firmware – Booted from motherboard
– LOADER
Newest method
Native motherboard firmware boots and then starts
LOADER
LOADER looks similar to a CFE system
Uses many, but not all, of the same commands and syntax as CFE
Netboot HTTP
Boot Process
BIOS
RAID
WAFL
Data ONTAP
Motherboard Firmware
Purposes of motherboard firmware:
Initialize the system (POST)
Point to where to load the OS mini-kernel
Contains persistent storage of environment
variables
OFW – Open Firmware
It is a published standard
Code can be written to be hardware independent
Firmware prompt: ok
During the boot process, the environment
variable boot-device is checked to find the
device from which to boot Data ONTAP
Old OFW storage controllers look for a floppy
disk:
ok printenv boot-device
boot-device = floppy fcal
New OFW storage controllers look for a CF card:
ok printenv boot-device
boot-device = c fcal
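The boot-device values above are an ordered fallback list: the firmware tries each device in turn. The sketch below is an illustrative model of that behavior, not OFW code.

```python
# Toy model of how the OFW boot-device list is consulted: devices are tried
# in order until one is present (behavior simplified for illustration).

def pick_boot_device(boot_device_var, present_devices):
    for dev in boot_device_var.split():
        if dev in present_devices:
            return dev
    return None

print(pick_boot_device("floppy fcal", {"fcal"}))  # 'fcal' (no floppy present)
print(pick_boot_device("c fcal", {"c", "fcal"}))  # 'c' (CF card wins)
```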
Firmware Update
If you upgrade motherboard firmware along with Data ONTAP (DOT), run set-defaults to reset the
environment variables. This will resolve a lot of upgrade problems where the storage system will not boot
following upgrade.
CFE – Common Firmware Environment
Different command syntax than OFW
NetApp re-writes the native firmware on the
motherboard using CFE
Standard commands in firmware across hardware
platforms
Supports Compact Flash boot:
– boot_ontap
– boot_primary
– boot_backup
– boot_diags
Firmware Prompt: CFE>
The decision was made to standardize on one firmware command set for all storage systems. CFE gave
us the functionality we needed and also has the functionality to be installed on any motherboard firmware
in place of the manufacturer’s firmware. If CFE was not the default firmware, such as on x86
motherboards, we would create a new version of CFE firmware that would run on that motherboard. This
gave the users a consistent look, feel and command set.
CFE/LOADER storage controllers check the
variable AUTOBOOT_FROM
Note that newer storage controllers with CFE or LOADER firmware cannot boot
to FCAL
FAS6030:
PRIMARY_KERNEL_URL fat://ide0.0/x86_64/kernel/primary.krn
BACKUP_KERNEL_URL fat://ide0.0/backup/x86_64/kernel/primary.krn
GX_PRIMARY_KERNEL_URL fat://ide0.0/x86_64/freebsd/image1/kernel
GX_BACKUP_KERNEL_URL fat://ide0.0/x86_64/freebsd/image2/kernel
ntap.init.kernelname x86_64/freebsd/image1/kernel
DIAG_URL fat://ide0.0/x86_64/diag/diag.krn
FIRMWARE_URL fat://ide0.0/x86_64/firmware/EXCELSIO/firmware.img
AUTOBOOT true
AUTOBOOT_FROM PRIMARY
BIOS_INTERFACE 9FC3
BOOT_FLASH flash0a
CF_BIOS_VERSION 1.4.0
CF_LOADER_VERSION 1.2.3
BOOTED_FROM OTHER
boot_ontap autoboot ide0.0
boot_primary setenv BOOTED_FROM PRIMARY; boot -elf64 $GX_PRIMARY_KERNEL_URL $PRIMARY_KERNEL_URL
boot_backup setenv BOOTED_FROM BACKUP; boot -elf64 $GX_BACKUP_KERNEL_URL $BACKUP_KERNEL_URL
LOADER
Native Motherboard firmware (BIOS) boots and then it
starts LOADER
LOADER looks a lot like a CFE system
Many of the same commands as CFE
Because it has much of the same functionality as
CFE, some NetApp docs may list storage systems
with LOADER firmware as CFE devices
Firmware Prompt: LOADER>
But we still wanted a standardized command set across all storage systems, no matter the type of
motherboard. We wanted the same basic command set that CFE motherboards have, but without
having to rewrite the firmware for every new motherboard and processor we use. Enter LOADER,
which runs on top of the BIOS. The motherboard loads its native BIOS firmware, which then
starts LOADER. This way NetApp still has a common command set for firmware but does not have to
rewrite the native firmware for every new motherboard and chipset.
Environment Variables 7G
Stored in firmware on the motherboard
Read at boot time
Point to where to load Data ONTAP from
Control whether the system auto boots
Store the sysid of the HA Partner
Environment Variables 7G
Environment variables are case sensitive. On systems that run LOADER firmware a reboot is needed to
set the environment variables. CFE firmware does not require a reboot.
CFE> printenv
CPU_TYPE Intel Xeon
CPU_SPEED 2800
CPU_REVISION 9
NETAPP_BOARD_TYPE DEUX
MOBO_SERIAL_NUM 0385999
MOBO_REV A2
MOBO_MODEL 1
MOBO_PART_NUM 110-00084
SYS_SERIAL_NUM 1084260
SYS_REV B1
SYS_MODEL FAS3050
SYS_PART_NUM 104-00041
CPU_NUM_CORES 4
ntap.init.cfdevice /dev/ad4s1
CFE_VERSION 3.1.0
CFE_BOARDNAME DEUX
CFE_ARCH X86_ELF
CFE_MEMORYSIZE 3072
NETAPP_PRIMARY_KERNE fat://ide0.0/X86_ELF/kernel/primary.krn
NETAPP_BACKUP_KERNEL fat://ide0.0/backup/X86_ELF/kernel/primary.krn
NETAPP_NG_KERNEL_URL fat://ide0.0/x86/freebsd/image1/kernel
NETAPP_DIAG_URL fat://ide0.0/X86_ELF/diag/diag.krn
NETAPP_FIRMWARE_URL fat://ide0.0/X86_ELF/firmware/DEUX/firmware.img
BOOTED_FROM OTHER
boot_ontap autoboot ide0.0
boot_primary setenv BOOTED_FROM PRIMARY; boot -elf $NETAPP_PRIMARY_KERNEL_URL
boot_backup setenv BOOTED_FROM BACKUP; boot -elf $NETAPP_BACKUP_KERNEL_URL
boot_diags boot -elf $NETAPP_DIAG_URL
netboot setenv BOOTED_FROM NETWORK; boot -elf
update_flash flash flash0a flash0b && flash $NETAPP_FIRMWARE_URL flash0a
version printenv CFE_VERSION
CFE>
Environment variable names differ by firmware type.
set-defaults – Set to factory defaults
fc-non-array-adapter-list 0a 1a 2b
fc-port-0c 9
fc-port-0d 9
partner-sysid 0118044627
PRIMARY_KERNEL_URL fat://ide0.0/x86_64/kernel/primary.krn
BACKUP_KERNEL_URL fat://ide0.0/backup/x86_64/kernel/primary.krn
DIAG_URL fat://ide0.0/x86_64/diag/diag.krn
GX_DIAG_URL fat://ide0.0/x86_64/diag/kernel
FIRMWARE_URL fat://ide0.0/x86_64/firmware/DRWHO/firmware.img
GX_PRIMARY_KERNEL_URL fat://ide0.0/x86_64/freebsd/image1/kernel
GX_BACKUP_KERNEL_URL fat://ide0.0/x86_64/freebsd/image2/kernel
boot_ontap autoboot ide0.0
boot_primary setenv BOOTED_FROM PRIMARY; boot -elf64
$GX_PRIMARY_KERNEL_URL $PRIMARY_KERNEL_URL
boot_backup setenv BOOTED_FROM BACKUP; boot -elf64
$GX_BACKUP_KERNEL_URL $BACKUP_KERNEL_URL
netboot setenv BOOTED_FROM NETWORK; boot -elf64
boot_diags boot -elf64 $GX_DIAG_URL $DIAG_URL
AutoBoot
The autoboot variable determines whether the storage controller performs a normal boot to Data ONTAP or
stops the boot process at the firmware prompt. With autoboot disabled, the storage controller boots to
the firmware prompt. This is a useful option if you do not want the storage controller to boot up, for
example, if you are performing maintenance, or do not want its data accessible to users. This value is not
automatically set back to normal, so if you have a storage controller that consistently boots to firmware,
check this setting.
printenv
Used to query the system for the values of the variables.
setenv
setenv is the command used to change the value of an environment variable. If the variable does not
exist, the setting will not be echoed.
set-defaults
The set-defaults command returns the environment variables to Data ONTAP defaults. The
defaults are the recommended settings for all variables other than auto-boot? unless there is a specific
reason for a change.
Firmware – Booted from MB
– OFW – Floppy, CF, FCAL
– CFE/LOADER
Will boot Data ONTAP from Compact Flash or Netboot
Will not boot from CD, floppy, SCSI, or FCAL
LOADER looks similar to CFE system and uses many, but
not all, the same commands and syntax as CFE
Netboot TFTP
Netboot HTTP
Boot Methods
OFW CFE/LOADER
(Table columns: Platform, Boot Path, Chipset, Firmware, Boot Image.)
Allows you to choose your boot location
– Boot to Primary Partition on CF card
boot or boot c
– Boot to Secondary Partition on CF card
boot d
– Boot to disks
boot fcal
If you had previously skipped auto-boot and use boot now:
– POST will not be run
– Memory will not be cleared again
– Devices will not be re-probed
The boot command is run from the firmware prompt. It is available for OFW firmware only; it is not on
CFE or LOADER firmware systems.
Allows you to choose your boot location
– Boot to primary partition on CF card
boot_primary
– Boot to secondary partition on CF card
boot_backup
– Normal boot
boot_ontap
If you had previously skipped auto-boot and use boot now:
– POST will not be run
– Memory will not be cleared again
– Devices will not be re-probed
This is a clean power up of the storage system
Intel Open Firmware by FirmWorks
Copyright 1995-2004 FirmWorks, Network Appliance. All Rights Reserved.
Firmware release 4.3_i1
Press Del to abort boot, Esc to skip POST
Compact Flash
The compact flash card contains four partitions:
Primary (C): Contains the current Data ONTAP version
Secondary (D): Contains the previous Data ONTAP version
Service: Contains diagnostics
Firmware
The download command moves the current 'C'
partition contents to the 'D' partition, and the new Data
ONTAP kernel is put in the new 'C' partition
Compact Flash
All the currently shipping storage systems have a compact flash card.
When a new Data ONTAP version is installed on the Storage Controller, the old primary partition is
moved to secondary and the new Data ONTAP version becomes the new primary.
Backup Partition
We can boot from D partition, which contains the previous version of Data ONTAP, but we might not be
able to read the disks due to WAFL and RAID version differences. More on this later.
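The partition rotation performed by download can be modeled as a tiny state change. The dictionary and function below are a toy illustration, not the actual download implementation.

```python
# Toy model of the compact-flash layout described above: download rotates the
# current primary (C) image to the secondary (D) slot, and the new Data ONTAP
# kernel becomes the primary. Version strings are invented.

cf = {"C": "ontap-8.1.2", "D": "ontap-8.1.1",
      "service": "diagnostics", "firmware": "loader"}

def download(cf, new_kernel):
    cf["D"] = cf["C"]       # previous version kept as the backup boot image
    cf["C"] = new_kernel    # new Data ONTAP version becomes primary
    return cf

download(cf, "ontap-8.1.3")
print(cf["C"], cf["D"])     # ontap-8.1.3 ontap-8.1.2
```

Note that, as the text says, booting the D partition may still fail to read the disks if the on-disk WAFL/RAID versions have moved ahead of that older kernel.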
OFW firmware can allow booting from FC-AL disks
– (CFE and LOADER will not allow booting from FC-AL disks)
Pre-compact-flash storage systems booted to disk as
their primary boot method (floppy was secondary)
Newer OFW storage systems (with compact flash)
can still boot straight to disk if needed
Booting to disk is slower than booting to compact
flash
To avoid booting to CF
To boot a storage controller such as the FAS900 series to disk instead of compact flash:
• Skip auto-boot by pressing the DEL key during the boot process, or disable autoboot in the environment
variables.
• At the ok prompt, run the command: boot fcal
• The mini-kernel and the rest of Data ONTAP will be loaded from disk
1. Configure the working storage controller to
be a TFTP server
2. Collect interface name, IP address, gateway
and subnet mask.
3. Configure the interface on the down system
4. Netboot the controller
LOADER> netboot
tftp://10.61.69.75/<path>/<kernel>
The storage controller can netboot from any TFTP server. Here we provide the steps to configure the
HA partner to be a TFTP server, and then netboot from it.
“NOTE: It is not possible to use a management port for netbooting. This includes SP, RLM and BMC
ports, as well as ACP and other management ports.”
ifconfig -a
fas2050cl1-rtp> ifconfig -a
e0a: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
inet 10.61.69.70 netmask 0xffffff00 broadcast 10.61.69.255
partner e0a (not in use)
ether 00:a0:98:05:d0:80 (auto-1000t-fd-up) flowcontrol full
e0b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:98:05:d0:81 (auto-unknown-cfg_down) flowcontrol full
/etc/rc
fas2050cl1-rtp> rdfile /etc/rc
#Auto-generated by setup Fri Oct 5 11:22:59 EDT 2007
hostname fas2050cl1-rtp
ifconfig e0a `hostname`-e0a mediatype auto flowcontrol full netmask 255.255.255.0 partner e0a
route add default 10.61.69.1 1
routed on
options dns.domainname rtp2k3dom.csslabs.netapp.com
options dns.enable on
options nis.enable off
Savecore
•NOTE: If step 4 fails, ensure that you have booted the storage system ‘cleanly’ to the CFE/LOADER
prompt. It may be necessary to set the AUTOBOOT environment variable to ‘false’ (setenv AUTOBOOT
false) temporarily to prevent booting past the CFE/LOADER prompt. Once you are booted ‘cleanly’, set
AUTOBOOT back to ‘true’ (setenv AUTOBOOT true).
NetBoot HTTP
1. Configure the working storage controller to
be an HTTP server
2. Gather the interface name, IP address,
gateway and subnet mask
3. Configure the interface on the down system
4. Netboot the controller from na_admin
NetBoot HTTP
STEP 2:
Gather the interface, IP address, gateway, and subnet mask for the storage controller that will be
netbooted. Get this info the same way we did in the TFTP netboot on the previous page.
STEP 4:
LOADER> netboot http://10.61.69.75/na_admin/722L1_netboot.e
Loading:.....................................................................
...
................................0x200000/33051932 0x218551c/31318852
0x3f63860/2
557763 0x41d3fa3/5 Entry at 0x00200000
Closing network.
Starting program at 0x00200000
cpuid 0x80000000: 0x80000004 0x0 0x0 0x0
Press CTRL-C for special boot menu
Mon Oct 22 17:32:26 GMT [cf.nm.nicTransitionUp:info]: Interconnect link 0 is
UP
•NOTE: If step 4 fails, ensure that you have booted the storage system ‘cleanly’ to the CFE/LOADER
prompt. It may be necessary to set the AUTOBOOT environment variable to ‘false’ (setenv AUTOBOOT
false) temporarily to prevent booting past the CFE/LOADER prompt. Once you are booted ‘cleanly’, set
AUTOBOOT back to ‘true’ (setenv AUTOBOOT true).
Labs
Lab 3-1
Boot Process
BIOS
RAID
WAFL
Data ONTAP
RAID
Requires storage layer
Accessed via maintenance
mode or Data ONTAP (7.2+)
Queries and assimilates all
storage
Updates labels on assimilation
Version changes between most
major releases of Data ONTAP
(7.1 -> 7.2)
Maintenance mode queries RAID directly. Once Data ONTAP is booted, you are talking to Data ONTAP,
not RAID directly.
RAID labels are updated/upgraded during RAID group assimilation. Labels *should* be updated only
when a RAID group is complete enough to mount, but there has been at least one instance where this
was not true, so be aware. This includes aggr status -r in maintenance mode. disk show *should* not
trigger label updates.
Maintenance mode does not assimilate until you ask it for aggregate-level information.
RAID Tree
[Diagram: RAID tree. A logical volume (traditional or aggregate) contains one or two plexes.]
RAID Versions
Each version of Data ONTAP has a version of RAID and a version of WAFL
Versions are backward compatible for most commands
Versions are never forward compatible
New RAID version is written to disk as soon as the disk is accessed by the kernel
– i.e. maintenance mode
Why is this important?

Data ONTAP   RAID
6.5          6
7.0          7
7.1          7
7.2          8
7.3          9
8.0          10
8.1          11
10 (GX)      8
RAID Versions
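The compatibility rules above can be sketched as a small table lookup. This is an illustration only: the version numbers come from the slide, and the helper name is my own, not a NetApp API:

```python
# RAID version per Data ONTAP release, per the table on this slide.
RAID_VERSION = {"6.5": 6, "7.0": 7, "7.1": 7, "7.2": 8, "7.3": 9,
                "8.0": 10, "8.1": 11, "10 (GX)": 8}

def can_read_labels(kernel_ontap, labels_written_by_ontap):
    """Backward compatible: a newer kernel can read older labels.
    Never forward compatible: an older kernel cannot read newer labels."""
    return RAID_VERSION[kernel_ontap] >= RAID_VERSION[labels_written_by_ontap]

print(can_read_labels("7.3", "7.2"))  # True: backward compatible
print(can_read_labels("7.1", "7.2"))  # False: never forward compatible
```

This is why booting even maintenance mode on a newer release matters: the newer kernel writes the newer RAID version to any disk it touches, after which an older release can no longer read those labels.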
This is from a situation where a customer added spare disks that came from another system.
D D D D P S S S
D X D D P D S S
• When a 72G disk fails, a 144G spare becomes a data disk and is right-sized: RAID uses it as a
72G disk.
• The system can be put back to normal by inserting a 72G disk and failing the 144G disk that is
now in the RAID group.
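The right-sizing behavior above amounts to a min(): a replacement disk contributes only the capacity of the disk it replaces, never its full physical capacity. A minimal sketch (illustrative only; the function name is my own, not NetApp code):

```python
# Right-sizing, as described above: a replacement disk contributes only the
# capacity of the disk it replaces, never its full physical capacity.
def usable_capacity_gb(raid_group_disk_gb, replacement_disk_gb):
    return min(raid_group_disk_gb, replacement_disk_gb)

print(usable_capacity_gb(72, 144))  # 72: the 144G spare is used as a 72G disk
```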
D D D D D P S S
D D D D D P D S
Boot Process
BIOS
RAID
WAFL
Data ONTAP
WAFL
Requires RAID
Accessed via Data ONTAP
Views the volume as a range of virtual
block numbers (vbn)
Version changes between every major
release (7.0 -> 7.1)
WAFL version can only change when Data
ONTAP successfully boots
WAFL
VBN
WAFL Versions
New WAFL version is written to disk as soon as the disk is accessed by Data ONTAP

ONTAP      WAFL
7.0        54
7.1        57
7.2        72
7.3        77
8.0 (BR)   82
8.1 (RR)   87
10 (GX)    72 (75?)

VBN
vbn = one level above RAID
Does not update with WACK/iron until 7.3(.2-ish): prior to 7.3(.2?), WAFL will not update/upgrade
during WACK/iron.
Boot Process
BIOS
RAID
WAFL
Data ONTAP
Check aggregates and volumes for the root flag
Check the root volume for the /etc/rc file
Load licenses
Load configuration files from /etc
/etc/rc
CIFS configuration
Load other processes
The full version of Data ONTAP is installed to disk during the Data ONTAP installation or upgrade
process. If you have a blank machine and have TFTP booted, you need to follow the installation
process to get the full Data ONTAP version onto disk. The download command, during the install,
writes the mini-kernel files to your CompactFlash. The full version of Data ONTAP is installed onto
disk in the root volume, in the /etc directory.
Boot of a FAS960c
PCI Problems
If the system won't boot due to a PCI device issue, such as a failed card, it will hang on "Probing
devices". Try removing PCI devices one by one and retrying the boot to determine which device is
problematic.
Slide callouts: the special boot menu is the 1-5 menu; Interconnect is OK; software disk ownership
is enabled; mailbox disks are identified; RAID labels are upgraded due to the Data ONTAP upgrade.

Press CTRL-C for special boot menu
.....................
Tue Oct 23 12:29:01 GMT [nvram.battery.state:info]: The NVRAM battery is currently ON.
Tue Oct 23 12:29:05 GMT [cf.nm.nicTransitionUp:info]: Interconnect link 0 is UP
NetApp Release 7.2.1.1: Tue Jan 23 01:01:31 PST 2007
Copyright (c) 1992-2006 Network Appliance, Inc.
Starting boot on Tue Oct 23 12:28:58 GMT 2007
Tue Oct 23 12:29:11 GMT [diskown.isEnabled:info]: software ownership has been enabled for this system
Tue Oct 23 12:29:16 GMT [cf.noDiskownShelfCount:info]: Disk shelf count functionality is not
supported on software based disk ownership configurations.
Tue Oct 23 12:29:16 GMT [fmmbx_instanceWorke:info]: normal mailbox instance on local side
Tue Oct 23 12:29:16 GMT [fmmb.current.lock.disk:info]: Disk 2b.17 is a local HA mailbox disk.
Tue Oct 23 12:29:16 GMT [fmmb.current.lock.disk:info]: Disk 2b.23 is a local HA mailbox disk.
Tue Oct 23 12:29:19 GMT [raid.assim.label.upgrade:info]: Upgrading RAID labels.
Tue Oct 23 12:29:20 GMT [fmmbx_instanceWorke:info]: normal mailbox instance on partner side
Tue Oct 23 12:29:20 GMT [fmmb.current.lock.disk:info]: Disk 2a.18 is a partner HA mailbox disk.
Tue Oct 23 12:29:20 GMT [fmmb.current.lock.disk:info]: Disk 2a.17 is a partner HA mailbox disk.
Tue Oct 23 12:29:20 GMT [cf.fm.partner:info]: Cluster monitor: partner 'fas960cl1-ca'
Tue Oct 23 12:29:20 GMT [cf.fm.timeMasterStatus:info]: Acting as cluster time master
WARN: nv_init changing partner log (0x0 0x0) to (0x8000000 0x8000000)
Slide callouts: Data ONTAP checked NVRAM and found no replay needed, which means it was most
likely a clean shutdown; HA takeover is still disabled; inode-to-parent initialization; how long
the system was down.

Tue Oct 23 12:29:21 GMT [raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Tue Oct 23 12:29:21 GMT [raid.stripe.replay.summary:info]: Replayed 0 stripes.
Tue Oct 23 12:29:22 GMT [wafl.vol.guarantee.fail:error]: Space for volume vol0 is NOT guaranteed
Tue Oct 23 12:29:22 GMT [localhost: cf.fm.launch:info]: Launching cluster monitor
Tue Oct 23 12:29:22 GMT [localhost: cf.fm.partner:info]: Cluster monitor: partner 'fas960cl1-ca'
Tue Oct 23 12:29:22 GMT [localhost: cf.fm.notkoverClusterDisable:warning]: Cluster monitor:
cluster takeover disabled (restart)
Tue Oct 23 12:29:23 GMT [localhost: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor:
takeover of fas960cl1-ca disabled (cluster takeover disabled)
Tue Oct 23 12:29:24 GMT [localhost: wafl.scan.start:info]: Starting inode to parent
initialization on volume vol2.
add net 127.0.0.0: gateway 127.0.0.1
Tue Oct 23 12:29:24 GMT [localhost: wafl.scan.start:info]: Starting inode to parent
initialization on volume vol0.
Tue Oct 23 12:29:24 GMT [localhost: vol.language.unspecified:info]: Language not set on
volume vol2. Using language config "C". Use vol lang to set language.
Tue Oct 23 12:29:24 GMT [localhost: vol.language.unspecified:info]: Language not set on
volume vol0. Using language config "C". Use vol lang to set language.
Tue Oct 23 12:29:24 GMT [localhost: rc:notice]: The system was down for 56 seconds
Tue Oct 23 12:29:25 GMT [localhost: cf.fm.takeoverDetectionSeconds.Default:warning]:
option cf.takeover.detection.seconds is set to 10 seconds which is below the NetApp
advised value of 15 seconds. False takeovers and/or takeovers without diagnostic
core-dumps might occur.
The option timed.min_skew is deprecated: ignoring attempt to change the value of this
option.
Space Guarantee
Note in this example that the space guarantee is not enabled for vol0. This can be problematic:
if the volume becomes full, the Storage Controller can hang or panic. Always make sure the root
volume does not run out of space.
No Replay of Stripes
"Replayed 0 stripes" shows that there are no NVRAM contents to replay to disk. This probably means
there was a clean shutdown, but it could also mean NVRAM was cleared due to 3 consecutive failed
reboots.
Slide callouts: this is the 1st message with EST; before this, the boot process showed GMT. Note
the configuration warnings.

Tue Oct 23 07:29:27 EST [fas960cl2-ca: dfu.firmwareUpToDate:info]: Firmware is up-to-date
on all disk drives
Tue Oct 23 07:30:07 EST [fas960cl2-ca: 10/100-IV/e0:info]: Ethernet e0: Link up
Tue Oct 23 07:30:07 EST [fas960cl2-ca: 10/100-IV/e0:warning]: Autonegotiation failed,
interface e0 operating at 100 Mbps half-duplex
add net default: gateway 10.41.70.1
Tue Oct 23 07:30:08 EST [fas960cl2-ca: iscsi.service.startup:info]: iSCSI service startup
Tue Oct 23 07:30:08 EST [fas960cl2-ca: rc:ALERT]: timed: time daemon started
sysconfig: 100/1000 Ethernet III card (PN X1027) in slot 9 is not supported on model
FAS960.
sysconfig: Unless directed by Network Appliance Global Services volume vol0 should have the
volume option create_ucode set to On.
Tue Oct 23 07:30:08 EST [fas960cl2-ca: rc:error]: sysconfig: 100/1000 Ethernet III card (PN
X1027) in slot 9 is not supported on model FAS960. sysconfig: Unless directed by
Network Appliance Global Services volume vol0 should have the volume option
create_ucode set to On.
Tue Oct 23 07:30:09 EST [fas960cl2-ca: mgr.boot.disk_done:info]: NetApp Release 7.2.1.1
boot complete. Last disk update written at Tue Oct 23 07:28:25 EST 2007
Tue Oct 23 07:30:09 EST [fas960cl2-ca: mgr.boot.reason_ok:notice]: System rebooted.
Tue Oct 23 07:30:09 EST [fas960cl2-ca: main_proc:notice]: Processor 1 (APIC ID 6) started.
Tue Oct 23 07:30:09 EST [fas960cl2-ca: asup.post.badUrl:warning]: Autosupport was not
posted because there was an invalid or missing url specified (CLUSTER TAKEOVER COMPLETE
MANUAL)
CIFS local server is running.
Configuration Warnings
The configuration warnings here are some of the same ones you may see in ASUP cases.
The 1st warning, "sysconfig: Unless directed by Network Appliance Global Services
volume vol0 should have the volume option create_ucode set to On", is a best-practices
warning. The customer should consider fixing this, but it is not required.
The 2nd warning, "Tue Oct 23 07:30:08 EST [fas960cl2-ca: rc:error]: sysconfig:
100/1000 Ethernet III card (PN X1027) in slot 9 is not supported on model
FAS960", is important. It describes an unsupported configuration, which you can verify against the
sysconfig guide. The customer should fix this.
Relog messages:
Note that if this boot sequence were due to a crash, there might be relog messages such as:
Tue Oct 23 12:13:45 EST [fas960cl2-ca: rc:info]: relog syslog Tue Oct 23
12:11:26 EST [fas960cl2-ca: rc:debug]: root:IN:console shell:panic
Tue Oct 23 12:13:45 EST [fas960cl2-ca: rc:ALERT]: relog syslog PANIC: reboot
-d panic without any arguments in process idle_thread0 on release NetApp
Release 7.2.1.1 on Tue Oct 23 1
Relog messages are replays of syslog messages that were cached when the Storage Controller crashed;
they are replayed when the Storage Controller boots. Note that these messages may not appear in the
logs in a sequentially correct spot. Use the dates and times to understand when these messages
occurred.
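Because relog entries embed the original timestamp inside the message body, a small parser can recover the crash-time ordering. This is a sketch, assuming the "relog syslog <original message>" shape shown above:

```python
import re

# Extract the original (crash-time) timestamp embedded in a relog line so
# replayed messages can be placed on the correct timeline. The pattern
# assumes the "relog syslog <original message>" shape shown above.
RELOG = re.compile(r"relog syslog (\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \w+)")

line = ("Tue Oct 23 12:13:45 EST [fas960cl2-ca: rc:info]: relog syslog "
        "Tue Oct 23 12:11:26 EST [fas960cl2-ca: rc:debug]: root:IN:console shell:panic")
m = RELOG.search(line)
print(m.group(1))  # the leading timestamp is the replay time; this is the crash time
```

Sorting a log file on the extracted timestamps puts the replayed messages back where they belong chronologically.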
1) BIOS loads
2) Boot loader starts from EPROM
3) Environment variables read; boot begins
4) Loader tells us where CF card is and loads it
5) Kernel and platform files read from CF card; specified in
env
6) EMS is initialized
7) Watchdog is started
8) Mini root file system loaded
9) CF card mounted and consistency-checked
10) setenv settings saved to CF card
Boot Sequence
1. BIOS loads
bootarg.init.rootimage="/cfcard/x86/freebsd/image2/rootfs.img"
bootarg.init.cfimagebase="/cfcard/x86/freebsd"
11) BSD root is mounted ( / )
12) Config files are read and loaded (such as /etc/rc in BSD)
13) BSD booted and boot menu available
14) System asks if var is on CF card or if we need update flash
15) If no, we load varfs.tgz from CF
16) NVRAM mount begins
17) Varfs is restored from CF card
18) CDB is read for local config data
19) BSD apps load
20) Modules load and start in the following order:
– Kmod, Dblade, Spinvfs, Sldiag, Vserver, Nblade
21) CDB/env sets the local blade UUIDs
22) SCSI blade loads (8.1 and later)
23) RDB syncs; mroot mounts
24) Swap is created
25) Protocols enabled
26) Login prompt available
Loading x86_64/freebsd/image1/kernel:....0x100000/8170520 0x8cac18/1257368
Entry at 0x80157e20
Loading x86_64/freebsd/image1/platform.ko:.0x9fe000/601008 0xb2a6d0/625608
0xa90bc0/38696 0xbc3298/40680 0xa9a2e8/79548 0xaad9a4/60204 0xabc4e0/127904
0xbcd180/145344 0xadb880/1488 0xbf0940/4464 0xadbe50/280 0xbf1ab0/840
0xadbf68/1624 0xbf1df8/4872 0xadc5c0/960 0xbf3100/2880 0xadc980/152
0xbf3c40/456 0xadca20/448 0xb04300/12412 0xb2a5df/237 0xb07380/78864
0xb1a790/65103
Starting program at 0x80157e20
6. EMS is initialized
7. Watchdog started
watchdog: initializing...
12. Config files are read and loaded (such as /etc/rc in BSD)
*******************************
* *
* Press Ctrl-C for Boot Menu. *
* *
*******************************
14. System asks if var is on CF card or if we need to prompt for update flash
ontap_varfs: sysvar.nvr_init: 1
Creating varfs on /dev/nvrd1
/dev/nvrd1: 5.0MB (10240 sectors) block size 16384, fragment size 2048
using 4 cylinder groups of 1.27MB, 81 blks, 192 inodes.
super-block backups (for fsck -b #) at:
160, 2752, 5344, 7936
Setting hostname: cluster2-01.
Enabling ipfilter.
net.link.ether.inet.log_arp_movements: 1 -> 0
net.inet.ip.portrange.lowfirst: 1023 -> 899
kern.ipc.maxsockbuf: 262144 -> 2621440
net.inet.tcp.local_slowstart_flightsize: 4 -> 10
net.inet.tcp.nolocaltimewait: 0 -> 1
lo0: flags=80c9<UP,LOOPBACK,RUNNING,NOARP,MULTICAST> metric 0 mtu 8232
inet 127.0.0.1 netmask 0xff000000 LOOPBACKLIF Vserver ID: 0
route: writing to routing socket: Network is unreachable
add net 0.0.0.0: gateway 10.61.94.1: Network is unreachable
route: writing to routing socket: Network is unreachable
add net 0.0.0.0: gateway 10.61.94.1: Network is unreachable
route: writing to routing socket: Network is unreachable
add net 0.0.0.0: gateway 10.61.94.1: Network is unreachable
route: writing to routing socket: Network is unreachable
add net 0.0.0.0: gateway 10.61.94.1: Network is unreachable
Additional routing options:.
Starting devd.
Generating host.conf.
Additional IP options:.
Mounting NFS file systems:.
ELF ldconfig path:
32-bit compatibility ldconfig path:
Initial amd64 initialization:.
Additional ABI support:.
Starting rpcbind.
Setting NIS domain: unset.
Starting ypbind.
Clearing /tmp (X related).
merge of /etc/apache2/httpd-custom.conf.default and /var/etc/httpd-
custom.conf.old into /var/etc/httpd-custom.conf, completed.
Starting process manager
Starting management subsystems
Starting common_kmod module...
ck_malloc_init_sizes: Initializing ck_malloc for cluster size of 4
csm_refill (small) thread started
csm_refill (large) thread started
kmeminit: Will NOT use ontap kmem_alloc
a. Dblade loads
b. Spinvfs is loaded
c. Sldiag loads
21. CDB/env sets the local blade UUIDs
Starting scsi-blade...
Starting scsi-blade module...
Boot Process
sysvar.boottimes.boottimes sysctl variable records the boot sequence
systemshell -command "sysctl sysvar.boottimes"
Boot Process
The sysvar.boottimes.boottimes sysctl variable records the boot sequence. This information can be
gathered from the systemshell after the storage system has finished booting.
This command can be used for two purposes: to provide an overview of a normal boot or to troubleshoot
a failed boot. The time entry begins at the moment power is applied to the CPU during start. By
examining the output, you can see the amount of time it takes the LOADER/CFE to initialize and transfer
control to FreeBSD. Typically this is about 30 to 45 seconds depending upon the amount of time it takes
to perform I/O against the CF card holding the software image.
At the beginning of the boottimes output there are many entries that are marked FBSD_rc. These entries
are emitted for each rc script executed while FreeBSD starts. Until the script netapp_mgmt is executed
the start sequence is essentially serial. The initial portion of boot will mount the NVRAM partition on /var
and perform upgrade processing if required. There are many FreeBSD scripts that run that can be
ignored. The magic (or Data ONTAP) begins with netapp_mgmt.
When troubleshooting boot process problems, the boot arg bootarg.init.console_muted can be set to
false before booting. The result will be much more output sent to the console during boot. Normally this
boot arg is unset and the boot sequence does not emit significant output to the console.
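When working through boottimes output, the usual question is where a slow or hung boot is spending its time. The sample below uses an invented name=milliseconds listing, since the real sysvar.boottimes output format may differ; the idea is simply to rank the phases:

```python
# Invented sample listing; the real sysvar.boottimes output format may
# differ. The point: rank boot phases by elapsed time to find the
# bottleneck in a slow or hung boot.
sample = """FBSD_rc_devd=300
FBSD_rc_netif=1200
FBSD_rc_netapp_mgmt=34000"""

def slowest_phase(text):
    pairs = [line.split("=", 1) for line in text.splitlines() if "=" in line]
    return max(((name, int(ms)) for name, ms in pairs), key=lambda kv: kv[1])

print(slowest_phase(sample))  # ('FBSD_rc_netapp_mgmt', 34000)
```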
Data written to /var is backed up to several
locations
– At run-time is backed by NVRAM
– During a clean shutdown is backed up to
/mroot/etc/varfs.tgz
– On a dirty shutdown, data stays in NVRAM
– Is backed up to the boot device after node or HA specific
changes are made
Can be auto restored from boot device on boot
Can be manually restored from root volume with
syncflash or update flash
Backing-Up /var
Because the data in /var is critical to the storage system, the data written to /var is eventually stored
in several locations.
At run-time it is backed by NVRAM. During a clean shutdown (reboot or halt), the contents are backed up
to the mroot and the boot device in a file called varfs.tgz.
There is a default varfs.tgz file that is stored inside rootfs.img. This file is used only once,
during system initialization (e.g., boot menu options "4) init", "24/7) fake_init", etc.), and will
never be used again, even after a Data ONTAP upgrade.
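The backup-and-restore cycle described above can be sketched with an ordinary gzip tarball. This is a conceptual illustration only; the paths below are scratch directories, not the real /var or /mroot/etc/varfs.tgz:

```python
import os
import tarfile
import tempfile

# Conceptual sketch of the varfs.tgz mechanism: archive a config tree on
# clean shutdown, extract it again at boot. Scratch paths only.
root = tempfile.mkdtemp()
var = os.path.join(root, "var")
os.makedirs(var)
with open(os.path.join(var, "rc"), "w") as f:
    f.write("hostname demo\n")

backup = os.path.join(root, "varfs.tgz")
with tarfile.open(backup, "w:gz") as tgz:   # "clean shutdown": back up /var
    tgz.add(var, arcname="var")

restore = os.path.join(root, "restore")
with tarfile.open(backup) as tgz:           # "boot": restore the backup
    tgz.extractall(restore)

restored = open(os.path.join(restore, "var", "rc")).read().strip()
print(restored)  # hostname demo
```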
Phoenix TrustedCore(tm) Server
Copyright 1985-2005 Phoenix Technologies Ltd. All Rights Reserved
Slide callouts: the Special Boot Menu is the 1-8 menu; Interconnect is OK.

Starting program at 0x80148250
NetApp Data ONTAP 8.0.1P4D9 7-Mode
Copyright (C) 1992-2010 NetApp.
All rights reserved.
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************
Slide callouts: software ownership enabled; mailbox disks; filter sync'd; HA takeover disabled.

Thu May 12 13:11:18 GMT [netif.linkUp:info]: Ethernet e0a: Link up.
Thu May 12 13:11:19 GMT [diskown.isEnabled:info]: software ownership has been enabled for this
system
Thu May 12 13:11:23 GMT [fmmb.current.lock.disk:info]: Disk 5a.00.11 is a local HA mailbox disk.
Thu May 12 13:11:23 GMT [fmmb.current.lock.disk:info]: Disk 5a.00.21 is a local HA mailbox disk.
Thu May 12 13:11:23 GMT [fmmb.instStat.change:info]: normal mailbox instance on local side.
Thu May 12 13:11:23 GMT [fmmb.current.lock.disk:info]: Disk 5c.01.15 is a partner HA mailbox
disk.
Thu May 12 13:11:23 GMT [fmmb.current.lock.disk:info]: Disk 5c.01.20 is a partner HA mailbox
disk.
Thu May 12 13:11:23 GMT [fmmb.instStat.change:info]: normal mailbox instance on partner side.
Thu May 12 13:11:23 GMT [cf.fm.partner:info]: Failover monitor: partner 'fas6040cl1-rtp'
Thu May 12 13:11:23 GMT [cf.fm.timeMasterStatus:info]: Acting as time master
Thu May 12 13:11:24 GMT [raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Thu May 12 13:11:24 GMT [raid.stripe.replay.summary:info]: Replayed 0 stripes.
Thu May 12 13:11:25 GMT [netif.linkDown:info]: Ethernet e0b: Link down, check cable.
Thu May 12 13:11:25 GMT [netif.linkDown:info]: Ethernet e0c: Link down, check cable.
Thu May 12 13:11:25 GMT [netif.linkDown:info]: Ethernet e0d: Link down, check cable.
Thu May 12 13:11:25 GMT [netif.linkDown:info]: Ethernet e0e: Link down, check cable.
Thu May 12 13:11:25 GMT [netif.linkDown:info]: Ethernet e0f: Link down, check cable.
Thu May 12 13:11:26 GMT [localhost: cf.fm.launch:info]: Launching failover monitor
Thu May 12 13:11:26 GMT [localhost: cf.fm.partner:info]: Failover monitor: partner 'fas6040cl1-
rtp'
Thu May 12 13:11:27 GMT [localhost: cf.fsm.takeoverOfPartnerDisabled:notice]: Failover monitor:
takeover of fas6040cl1-rtp disabled (Controller Failover takeover disabled).
Thu May 12 13:11:28 GMT [localhost: rc:notice]: The system was down for 179 seconds
NetApp Confidential — Limited Use
Thu May 12 13:11:28 GMT [localhost: tar.csum.match:info]: Stored checksum matches, not extracting
local://tmp/prestage/mroot.tgz.
Thu May 12 13:11:29 GMT [fas6040cl2-rtp: fcp.service.startup:info]: FCP service startup
Thu May 12 13:11:30 GMT [fas6040cl2-rtp: dfu.firmwareUpToDate:info]: Firmware is up-to-date on all
disk drives
Thu May 12 13:11:30 GMT [fas6040cl2-rtp: iscsi.service.startup:info]: iSCSI service startup
Thu May 12 13:11:30 GMT [fas6040cl2-rtp: snmp.agent.msg.access.denied:warning]: Permission denied for
SNMPv3 requests from root. Reason: Password is too short (SNMPv3 requires at least 8 characters).
Thu May 12 13:11:30 GMT [fas6040cl2-rtp: scsitarget.ispfct.linkUp:notice]: Link up on Fibre Channel
target adapter 0a.
Thu May 12 13:11:31 GMT [fas6040cl2-rtp: cmds.sysconf.logErr:error]: sysconfig: Unless directed by
NetApp Global Services volumes a, luis_64, vol_new, luis_source, vol0, and esx_iscsi_luns should have
the volume option create_ucode set to On. .
Thu May 12 13:11:31 GMT [fas6040cl2-rtp: callhome.sys.config:error]: Call home for SYSTEM
CONFIGURATION WARNING
Thu May 12 13:11:31 GMT [fas6040cl2-rtp: mgr.boot.disk_done:info]: NetApp Release 8.0.1P4D9 7-Mode
boot complete. Last disk update written at Thu May 12 13:08:28 GMT 2011
Thu May 12 13:11:31 GMT [fas6040cl2-rtp: cf.hwassist.notifyEnableOn:info]: hw_assist: hw_assist
functionality has been enabled by user.
Thu May 12 13:11:31 GMT [fas6040cl2-rtp: mgr.boot.reason_ok:notice]: System rebooted after a reboot
command.
Thu May 12 13:11:31 GMT [fas6040cl2-rtp: callhome.reboot.reboot:info]: Call home for REBOOT (reboot
command)
Thu May 12 13:11:31 GMT [fas6040cl2-rtp: cifs.startup.local.succeeded:info]: CIFS: CIFS local server
is running.
Slide callouts: CIFS local server is running; boot completion.

Data ONTAP (fas6040cl2-rtp.ccslabs.netapp.com)
login:
Boot Process
Troubleshooting
Possible Problem
– CompactFlash could fail due to a hardware problem
CF card failed
CF reader failed
How to Troubleshoot
– version -b
How to Resolve
– Replace CF card or possibly the motherboard
– Re-install the CF image if needed
CF Card Issues
version -b
version -b displays the version information of Data ONTAP, diagnostics, and firmware contained on
the primary boot device (flash). If the CompactFlash card or the reader is broken, then this
command will fail or you will receive garbage output. Also, sysconfig -a may show CF card reader
problems.
You can use a CompactFlash writer to fix CompactFlash problems
If the CompactFlash image is damaged, you can re-create the image:
– Re-create the partitions with the bootfs format command
– Use the download command to write to the re-created partitions
Knowledge Base Article ntapcs18181 explains how to clear and re-create the compact flash image for
different Storage system models.
https://kb.netapp.com/support/index?page=content&id=2013213
Error Message: Failed to open download script file /etc/boot/x86_elf/kernel_0.cmds: No such file or
directory.
https://kb.netapp.com/support/index?page=content&id=3012382
Complete preparation
– HTTP server
– TFTP server (no tftp with clustered Data ONTAP)
node1(takeover)> options tftpd.enable off
node1(takeover)> options tftpd.rootdir /etc
node1(takeover)> options tftpd.enable on
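For the HTTP-server preparation step, any host with Python can stand in as the file server for lab purposes. This is a generic static file server, not the na_admin interface; file names below are invented:

```python
import functools
import http.server
import os
import tempfile
import threading
import urllib.request

# Generic stand-in for the HTTP-server preparation step: serve a netboot
# image over HTTP from any host. File names here are invented.
docroot = tempfile.mkdtemp()
with open(os.path.join(docroot, "netboot.img"), "wb") as f:
    f.write(b"kernel-bytes")

handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=docroot)
srv = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=srv.serve_forever, daemon=True).start()

# The down system would then use: netboot http://<this-host>/netboot.img
url = f"http://127.0.0.1:{srv.server_address[1]}/netboot.img"
data = urllib.request.urlopen(url).read()
print(data)
srv.shutdown()
```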
**************************************************************
* Restore Backup Configuration *
* This procedure only applies to storage controllers that *
* are configured as an HA pair. *
* *
* Choose Yes to restore the 'varfs' backup configuration *
* from a TFTP server. Refer to the Boot Device Replacement *
* guide for more details. *
* Choose No to skip the back up recovery and return to the *
* boot menu. *
**************************************************************
Do you want to restore the backup configuration now? {y|n} y
Enter the IP address of the server: 10.97.102.96
Checking network link... success.
Checking route to host "10.97.102.96"... success.
Attempting to reach 10.97.102.96... success.
Checking CF device file system... success.
Mounting CF device to /tmp/mnt... success.
Checking CF mount point... success.
Restoring backup configuration...
Received 82481 bytes in 0.1 seconds Backup Configuration successfully
restored
If no (online) aggregate has the root volume flag set, the
storage controller will not boot and will post “No root
volume found”
To create a temporary root volume:
1. Boot to maintenance mode
2. Set an aggregate to root
> aggr options <aggr_name> root
3. Halt and reboot
4. On boot, a new FlexVol named ‘AUTOROOT’ will be
created on the chosen aggregate
5. Setup will start
To switch root back to the original root volume
1. > vol options <volname> root
2. Reboot
NetApp Confidential — Limited Use
Root Flag
The root flag denotes the aggregate and volume on which the Data ONTAP files are installed. The mini-kernel checks
the volumes and aggregates for the root flag during boot. If no aggregate is marked with the root flag, then
Data ONTAP will not boot and will post the error “No Root Volume Found.” This can be resolved by:
• Bringing the root aggregate back online (if it is offline) and rebooting.
• Creating a new root aggregate and marking it as root (next slide), or marking an existing aggregate as root.
If an aggregate is marked as root, but there is no volume marked as root, then a new root volume is
created named AUTOROOT.
/etc/rc missing?
If the /etc/rc file is missing, then setup will automatically be started following boot.
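The boot-time decision described in these notes can be sketched as a small model. This is an illustrative sketch only; the function name, data shapes, and error string are invented for this example and are not Data ONTAP internals:

```python
def pick_root(aggregates):
    """Return the (aggregate, volume) pair to boot from, mirroring the
    root-flag checks described above."""
    # Only an online aggregate carrying the root flag qualifies
    root_aggrs = [a for a in aggregates if a["root"] and a["online"]]
    if not root_aggrs:
        # No online root aggregate at all: boot stops with this error
        raise RuntimeError("No root volume found")
    aggr = root_aggrs[0]
    root_vols = [v for v in aggr["volumes"] if v["root"]]
    if root_vols:
        return aggr["name"], root_vols[0]["name"]
    # Aggregate is marked root but no volume is flagged: a new root
    # volume named AUTOROOT is created on it at boot
    aggr["volumes"].append({"name": "AUTOROOT", "root": True})
    return aggr["name"], "AUTOROOT"
```

For instance, an online root aggregate whose volumes carry no root flag yields AUTOROOT, matching the behavior described above.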
Diagnostics
Diagnostics are used to test hardware
This requires downtime of a single storage controller
The downtime can be extended, depending on the tests run
From the main diagnostic menu, you can go into sub-menus to
run specific diagnostics for that sub-system
Disk write tests write to the non-data section of disks
– They can destroy core files
– They should not be done in takeover mode
Loop tests may require loopback plugs
No disk tests should be performed in takeover mode
Without a current version of diagnostics, some tests may fail
Disk Tests
Launch Diagnostics
1. Using a console connection, begin booting the storage
controller
2. Skip auto-boot to get to the firmware prompt
3. Start the diagnostics using the proper command
Main Menu choices
all Run all system diagnostics
mb FAS960 motherboard diagnostic
mem Main memory diagnostic
nvram NVRAM diagnostic
gbe GBE controller diagnostic
cf-card CompactFlash controller diagnostic
scsi SCSI controller diagnostic
fcal FCAL controller diagnostic
stress System wide stress diagnostic
Enter Diag, Command or Option: mb
FAS960 Motherboard Diagnostic
-----------------------------
Data ONTAP-based command that allows diagnostic testing of
FAS3200 and FAS6200 (and future platforms)
Replaces the SYSDIAG tool that was used as the diagnostic on
previous platforms
sldiag is a command in Data ONTAP (maintenance mode)
rather than a separate binary
sldiag has a user interface that is similar to Data ONTAP 8.x,
rather than the menu tables of the previous tool
Specific setup requirements for the following devices:
– NIC
– SAS
– FCAL
– Interconnect
– Metrocluster (FCVI)
Resource
https://kb.netapp.com/support/index?page=content&id=1013307&actp=search&viewlocale=en_US&searchid=1349348922229
sldiag (Cont.)
Command Format:
sldiag [ version ] [ show ]
sldiag device [
(show|modify|types|run|stop|status|clearstatus|rescan)]
sldiag [ test ] [ (show|modify) ]
sldiag [ option ] [ (show|modify|help) ]
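In practice the subcommands combine roughly as follows. This sequence is assembled from the command format above as an illustration only; it was not captured from a live system, and option flags (which vary by release) are omitted:

```
*> sldiag device show      # list devices and whether each is selected
*> sldiag device modify    # enable or disable devices for testing
*> sldiag device run       # start tests on the selected devices
*> sldiag device status    # poll until the tests report completion
```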
sldiag (Cont.)
*> sldiag device show
DEV NAME SELECTION
---------------- ---------------- ----------------
nvmem nvmem enabled
mem mem enabled
nic e0M enabled
nic e0P enabled
nic e0a enabled
nic e0b enabled
bootmedia bootmedia enabled
acp acpview enabled
serviceProcessor serviceProcessor enabled
environmentals environmentals enabled
cna 1a enabled
cna 1b enabled
fcal 0d enabled
sas 0a enabled
sas 0b enabled
storage 0c disabled
Module Summary
You should now be able to:
Describe the storage controller boot sequence
Explain the different methods of booting
Know that the boot-device environment variable (shown with
printenv, set with setenv) selects which device to boot from
Describe the use of setenv auto-boot
Describe the differences between booting with boot vs. bye
State the role of the compact flash card at boot time
Define the RAID tree - plex, RAID group, logical volume, disks
Identify how RAID groups work with mixed disk sizes
Describe the different Data ONTAP, RAID, and WAFL versions
Analyze the console logs during the boot process
Evaluate when to run diags
Module Review
What are the three types of firmware used by NetApp?
OFW, CFE, LOADER
What command is used to set your environment variables back
to the NetApp default?
set-defaults
How many locations are on the compact flash? What are they?
4 locations: primary and secondary boot images, diagnostics, and
firmware.
Do you need to use the revert_to command when
downgrading from Data ONTAP 7.1.2 to 7.1.1?
No. They use the same WAFL and RAID versions.
Labs
Lab 3-2
Hardware and Down Storage System Troubleshooting for Partners
Special Boot Menus
Module Objectives
By the end of this module, you should be able to:
Explain the 1-5 and 1-8 menus
Explain the 22/7 menu
Describe the 22/7 menu's hidden options in 7.0+
Define the vol commands available from the 1-5
menu
Define when to use ignore medium errors
Explain what maintenance mode is, when to use it,
and what you can do from it
Module Topics
This module includes the following topics:
Booting to the Special Boot Menus
1-5 and 1-8 menu
Maintenance Mode
22/7 menu
Marking labels clean
Ignore medium errors
Safety Boots: prev_cp and readonly
Vol commands from 22/7
Phoenix TrustedCore(tm) Server
[.........Normal boot process......]
Boot Loader version 1.2.1
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2005 Network Appliance Inc.
CPU Type: Dual Core AMD Opteron(tm) Processor 265
Starting AUTOBOOT press Ctrl-C to abort...
Loading:...............0x200000/36518072 0x24d38b8/33842616 0x4519e70/2371096 Entry at 0x00202018
Starting program at 0x00202018
Press CTRL-C for special boot menu
[..........normal boot messages........]
Tue Oct 23 19:26:31 GMT [fci.initialization.failed:error]: Initialization failed on Fibre Channel adapter 0d.
Selection (1-5)?
Note: On CFE/LOADER machines, the “Starting AUTOBOOT press Ctrl-C to abort...” prompt skips auto-boot; this is NOT the special boot menu. Press CTRL-C at “Press CTRL-C for special boot menu” and the special boot options menu will be available.
On an OFW storage system, the DEL key is used to skip auto-boot; CTRL-C is only used to get to the
special boot menu. Some emulation programs need Shift-Delete or CTRL-Backspace to emulate the
CTRL-C sequence. Another option is to netboot the storage system, which goes to the 1-5 menu by
default.
Phoenix TrustedCore(tm) Server
Copyright 1985-2004 Phoenix Technologies Ltd.
All Rights Reserved
BIOS version: 2.5.0
Portions Copyright (c) 2006-2009 NetApp All Rights Reserved
CPU= AMD Opteron(tm) Processor 250 X 2
Testing RAM
512MB RAM tested
4096MB RAM installed
Fixed Disk 0: NACF1GBJU-B11
NetApp Release 7.2.3: Thu Jul 5 10:18:07 PDT 2007
Copyright (c) 1992-2007 Network Appliance, Inc.
Starting boot on Tue Oct 23 19:26:16 GMT 2007
Selection (1-5)?
Please choose one of the following:
(1) Normal boot.
(2) Boot without /etc/rc.
Option 1
This is a completely normal boot.
Option 2
Boot, but do not load /etc/rc or the registry. The storage controller boots with default options;
however, configuration changes can be made from the command line (for example, license and
ifconfig). Without the registry, CIFS will not run and cannot be set up. NFS and FTP are still available if
the network is configured.
(3) Change password.
(4) Initialize owned disks (28 disks are owned by
this filer).
(4a) Same as option 4, but create a flexible root
volume.
Option 3
This option is useful if your customer forgot the root password.
TIP: In a standard HA configuration, you may remove one chassis PSU for two minutes, which will cause
the system to shut down automatically. This can help avoid a ‘dirty’ shutdown and allow the customer to
access the 1-5 menu gracefully.
Option 4
Destroys all data on the disks, but does not sanitize the disks. It then makes a two-disk traditional volume
using RAID 4. Two disks take about twenty minutes to initialize (depending on the disk type and other factors).
The system runs setup at the end of this option.
Option 4a
This is the FlexVol version of option 4. The default is to create a three-disk aggregate with a FlexVol
(vol0) using RAID-DP. However, if only two disks are available, the system uses RAID 4. The size of
the root volume will be the available space in the aggregate (minus snapshot space, and so on).
Allows testing and maintenance to be
performed without Data ONTAP running
Data in volumes is not accessible
Protocols and other processes are not started
Storage system is not accessible over
network
Be cautious running maintenance mode
commands in an HA takeover situation; for
example, some FCAL diags could be
hazardous while data is being served by the
partner
HINT: During read tests, a file is written to the space on disk allocated for cores and then read during the
test. This may cause a panic of the up node that has control of the disks. It may also destroy any
core that has not been written to /etc/crash.
What commands are safe to run from maintenance mode while in takeover?
Most commands that do not attempt to read the layout of the volume/aggregate or environment are safe,
such as ‘disk_list’, ‘disk show –v’, ‘fcadmin device_map’, and ‘sasadmin’.
Update flash from backup configuration
Install new software first
Reboot node
? fcstat sasstat
aggr fctest sata
disk halt scsi
disk_list help sesdiag
diskcopy led_off storage
disktest led_on version
environment raid_config vol
fcadmin sasadmin
fcdiag
? environment led_off storage
acorn faltest led_on sysconfig
acpadmin fcadmin nv8 systemshell
aggr fcstat raid_config version
disk fctest sasadmin vmservices
disk_list ha-config sasstat vol
disk_mung halt sata vol_db
diskcopy help scsi vsa
disktest ifconfig sesdiag xortest
dumpblock key_manager sldiag
*> aggr ?
The following commands are available; for more
information type "aggr help <command>"
online
Pre-7.0
On pre-7.0 machines, aggr commands are not available, because aggregates were not introduced until
Data ONTAP 7.0. For pre-7.0 releases, use the vol commands instead. In Data ONTAP 7.0 and later, the
vol commands are no longer available.
Aggr Commands
aggr status –r gives about the same output as sysconfig –r.
aggr rewrite_fsid – aggregates can end up with duplicate IDs. Duplicate IDs can prevent an HA takeover.
aggr options ignore_inconsistent – allows the aggregate containing the root volume to be
brought online at boot, even though it is inconsistent. Bringing it online prior to running WAFL_check or
wafliron may result in further file system damage, so you need to run wafliron or WAFL_check and be
sure you have corrected the original problem before bringing the file system back online. This needs
approval of EE and Sustaining before use.
aggr options <aggr_name> root – assigns the root flag to that aggregate or traditional volume
when the original root volume is unavailable or corrupt. Discussed more later.
aggr destroy – you can destroy aggregates from maintenance mode, but not create them.
*> aggr status -r
Aggregate aggr1 (online, raid_dp) (zoned checksums)
Plex /aggr1/plex0 (online, normal, active)
RAID group /aggr1/plex0/rg0 (normal)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- --------------
dparity 7a.17 7 2 1 FC:A - FCAL 10000 34500/70656000
parity 8a.9 8 1 1 FC:B - FCAL 10000 34500/70656000
data 7a.2 7 0 2 FC:A - FCAL 10000 34500/70656000
data 8a.10 8 1 2 FC:B - FCAL 10000 34500/70656000
aggr status –r
Disks are listed in the same order as they are in sysconfig –r, with the following exception: due to a
BURT in some versions of 6.5 and 7.0, volumes over 8 TB raw size will show the disks in the wrong order.
Note: Physical block information is included in the output. It was removed from the slide for readability. It
would be the last column of the output.
aggr options <aggr_name> root
Root Option
‘aggr options <aggr> root’ assigns the root volume flag to that aggregate or traditional volume. It
is normally used when the original root volume is unavailable or corrupt. If you assign this flag to an
aggregate, you will then need to go back to the 1-5 menu and use the vol_pick_root command to assign
which FlexVol in the aggregate will be the root volume. If a FlexVol root is not chosen, a FlexVol called
AUTOROOT is created at boot. Traditional volumes require only the ‘aggr options’ step.
One way RAID keeps track of failed disks
Not dependent on the label
The registry is kept across multiple disks
If the FDR shows a disk as failed, the disk will be failed
by Data ONTAP
When a disk fails, Data ONTAP updates the FDR and
the failure bytes on the disk
Unfailing disks
To recover disks back into a RAID group, they must be unfailed (to remove the failure byte) and removed
from the FDR while in maintenance mode.
NOTE: This type of recovery requires EE-level assistance.
NOTE: Do not return failed disks until all reconstructions are completed or you are sure that they will no
longer be needed for recovery efforts.
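As a memory aid for why recovery needs both steps, the two failure records can be modeled as independent flags. This is a hypothetical illustration, not NetApp code; the function name is invented:

```python
def disk_is_failed(in_fdr, failure_byte_set):
    """Data ONTAP treats the disk as failed if EITHER record marks it."""
    return in_fdr or failure_byte_set

# Clearing only the FDR entry (raid_config info deletefdr) is not enough:
assert disk_is_failed(in_fdr=False, failure_byte_set=True) is True
# Clearing only the failure bytes (disk unfail) is not enough either:
assert disk_is_failed(in_fdr=True, failure_byte_set=False) is True
# Both records must be cleared before the disk can rejoin a RAID group:
assert disk_is_failed(in_fdr=False, failure_byte_set=False) is False
```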
Failure Bytes
Space on the disk that keeps the failure code for that disk
Pre-7.0
– Data ONTAP did not read the bytes
Failure Bytes
Pre-7.0
Before Data ONTAP 7.0, the failure bytes were only used by NetApp when a disk was returned as failed.
Data ONTAP did not read the bytes, so a disk could fail in one system and be used in another.
KB12312 discusses removing failure bytes before upgrading from pre-7.x to 7.x to avoid this possible
problem.
There are two ways to remove failure bytes: update the disk firmware, or unfail the disk. We will
discuss these methods shortly.
Do not unfail disks from within Data ONTAP
– Unless the disk is not needed for recovery
Unfailing a disk can lead to the zeroing of the
disk and permanent loss of data
Safe to unfail if the fix for BURT 95606 is in
place.
– After release 7.2, use with EXTREME caution
Have Escalations Engineer(s) involved when
unfailing disks for system recovery
Un-Failing a Disk
From Maintenance Mode:
– Remove from FDR:
*> raid_config info deletefdr
– Clear Failure Bytes:
*> disk unfail [disk_id | all]
raid_config info [ showfdr | deletefdr ]
showfdr option
This command shows the disks in the FDR (Failed Disk Registry).
deletefdr option
Attempts to delete a disk from the FDR; each removal request is queued until the commit prompt. When
queueing disks for deletion from the FDR, limit to 10 disks at a time.
The results of deletefdr may be unpredictable – the disk may end up back in the volume (if it was properly
unfailed in 7.0 and later and is in good health), or it may not. As of 7.2, the results are based on gen and
time values: identical gen and times result in the disk back in the volume; close gen and times result in a
warning and the disk back in the volume; out-of-date gen or times result in an orphaned disk (a
non-zeroed spare).
For example: if a disk fails on Monday and is unfailed on Saturday, the time difference will be too large
for reassimilation. If a disk fails on Monday, and many things happen to the volume in the next fifteen
minutes, the gen difference will be too large for reassimilation.
When Used:
Used after too many disks in a RAID group have failed. Aggregates will not assimilate until the disks are
removed from the FDR. Remove the disks from the FDR to get them back into their RAID group and the
aggregate. !! Remember to unfail the disk from maintenance mode to remove the failure byte; otherwise,
the disk will return to the FDR at boot !!
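The gen-and-time comparison above can be sketched as follows. This is a hedged illustration: the threshold value and function name are invented for this example; the real limits live inside RAID assimilation and are not documented here:

```python
def reassimilation_result(gen_diff, time_diff_s, close_limit_s=300):
    """Rough model of the 7.2 deletefdr outcomes described above."""
    if gen_diff == 0 and time_diff_s == 0:
        return "back in volume"
    if gen_diff <= 1 and time_diff_s <= close_limit_s:
        # Close, but not identical: the disk rejoins with a warning
        return "back in volume (with warning)"
    # Too out of date to rejoin: orphaned (a non-zeroed spare)
    return "orphaned"

# Failed Monday, unfailed Saturday: the time gap is far too large
assert reassimilation_result(gen_diff=0, time_diff_s=5 * 24 * 3600) == "orphaned"
```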
*> raid_config info showfdr
1) vendor=NETAPP , model=X235_HJURD073F10,
serialno=304Q2854, uniqueid=a9a7e701-2c22755f-
c84e561e-8c0a87cf
timefailed=1155233634 (10Aug2006 18:13:54),
timelastseen=1155259432 (11Aug2006 01:23:52),
device=NotPresent
2) vendor=NETAPP , model=X235_HJURD073F10,
serialno=30515997, uniqueid=b7728fd9-4172b6c3-
f6178b36-fc6c2caa
timefailed=1157316767 (03Sep2006 20:52:47),
timelastseen=1157316817 (03Sep2006 20:53:37),
device=7a.22
Example output:
NotPresent - this disk has been removed from the system.
raid_config deletefdr
*> raid_config info deletefdr
1) vendor=NETAPP , model=X235_HJURD073F10, serialno=304Q2854,
uniqueid=a9a7e701-2c22755f-c84e561e-8c0a87cf
timefailed=1155233634 (10Aug2006 18:13:54), timelastseen=1155259432
(11Aug2006 01:23:52), device=NotPresent
2) vendor=NETAPP , model=X235_HJURD073F10, serialno=30515997,
uniqueid=b7728fd9-4172b6c3-f6178b36-fc6c2caa
timefailed=1157316767 (03Sep2006 20:52:47), timelastseen=1157316817
(03Sep2006 20:53:37), device=7a.22
Enter the entry to delete (0 for done) 2
1) vendor=NETAPP , model=X235_HJURD073F10, serialno=304Q2854,
uniqueid=a9a7e701-2c22755f-c84e561e-8c0a87cf
timefailed=1155233634 (10Aug2006 18:13:54), timelastseen=1155259432
(11Aug2006 01:23:52), device=NotPresent
2) (DELETED) vendor=NETAPP , model=X235_HJURD073F10,
serialno=30515997, uniqueid=b7728fd9-4172b6c3-f6178b36-fc6c2caa
timefailed=1157316767 (03Sep2006 20:52:47), timelastseen=1157316817
(03Sep2006 20:53:37), device=7a.22
This command shows the same list as showfdr, but here there is a prompt to choose the disk to remove
from the FDR. Before removing disks from the Failed Disk Registry, please consult your EE if you have
any questions or concerns.
After choosing the disk to remove, the next set of output shows the disk as DELETED from the FDR,
although the deletion is not completed until you commit the changes.
Enter the entry to delete (0 for done) 0
The following records will be deleted:
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- --------------
dparity 7a.1 7 0 1 FC:A - FCAL 10000 34500/70656000
parity 8a.8 8 1 0 FC:B - FCAL 10000 34500/70656000
data 7a.22 7 2 0 FC:A - FCAL 10000 34500/70656000
Once committed, you can use aggr status –r to check where the disk is placed: in the volume or in
the spares pool.
Note: deletefdr removes the disk from the Failed Disk Registry but does not remove the failure bytes
from the disk. To remove them, use disk unfail.
disktest - Disk Test Environment
fcdiag - Diagnostic for troubleshooting loop
instability
fcstat - Fibre Channel stats functions
fcadmin – View Fibre Channel-attached
devices
scsi – Send SCSI inquiry commands to disks
SAS commands will be covered in a later
module
fcstat
‘fcstat’ is the same as fcadmin link_stats in Data ONTAP; it shows the link statistics maintained
for all drives on a Fibre Channel loop.
fcadmin
‘fcadmin device_map’ – verify connected shelves/disks/loops
‘fcadmin link_stats’ – view loop statistics for diagnosing instability
‘fcadmin config’ – display or manipulate embedded FCAL HBA personality
disk unfail
disk_fw_update_fix
disk_list
disk show -v
Maintenance mode:
disk unfail
disk unfail from maintenance mode clears the failure bytes from the disk's NVRAM (storage). You will
still need to use the raid_config command to remove the disk from the FDR.
disk_fw_update_fix
This command is only used to fix an old bug with spin-up issues on SEAGATE 18-GB half-height FC disks
(model ST118202FC) and 9-GB low-profile SCSI disks (model ST39173WC).
disk_list
Lists all disks on all loops.
disk show –v
Displays ownership of all disks (only available if software-based disk ownership is in use).
diskcopy
– diskcopy –s <source> -d
<destination>
– diskcopy –i –s <source> -d
<destination>
Diskcopy is used for a block-level copy from one disk to another. It can be used when online
reconstruction is not possible. This is a last-resort type of effort before going to third-party recovery
methods. An EE needs to be involved before this option is used. The approximate copy time is about 4
hours for a 68-GB disk, depending on the number of errors. WAFL_check needs to be run following
the diskcopy.
When diskcopy hits an unreadable block, it prompts for abort, retry, or skip. If you choose to skip the bad
block, you must specify how many blocks to skip (this is a cumulative total). It is recommended that you
start small and increment in small steps until a good block is found. If you choose one, the next increment
must be at least 2.
The copy includes both data and labels; be sure to remove the source disk promptly or the storage
system will not be able to bring the aggregate/volume online due to two disks with the same label.
Another use is to check the integrity of a suspected bad disk. This process reads every block, so it
verifies that the disk is OK.
-i
After 7.0.1.1, ‘-i’ is available, which allows you to skip blocks one by one with only one prompt. It gives
a report at the end of the blocks skipped. Work with your EE or Engineering to decide whether the skipped
blocks are in a crucial user data area or a non-critical area on disk.
diskcopy - Example
*> diskcopy -s 7a.8 -d 10a.14
– diskcopy finds a bad sector
1 - skip sector
2 - retry i/o operation
3 - quit diskcopy
1
how many sectors [1-153517031] to skip?
1
Diskcopy Example
We choose option 1 to skip the sector, then choose how many sectors to skip.
diskcopy attempts to continue after skipping one
sector but cannot read the next sector either:
1 - skip sector
2 - retry i/o operation
3 - quit diskcopy
1
how many sectors [1-153517031] to skip? 2
Here we enter 2 to skip two sectors: the previous one plus the new one makes a cumulative total of two.
Entering 1 would force a retry of the second sector and another prompt for this sector.
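The cumulative arithmetic at this prompt trips people up, so here is a small sketch of it. The function is invented for illustration; diskcopy itself only asks for the running total:

```python
def sectors_newly_skipped(previous_total, entered_total):
    """The number you type is a running total from the first bad sector,
    not a per-prompt increment; this returns what one answer adds."""
    if entered_total <= previous_total:
        # Re-entering the old total (or less) just retries the bad sector
        raise ValueError("enter more than the running total to skip further")
    return entered_total - previous_total

# First prompt: entering 1 skips one sector
assert sectors_newly_skipped(0, 1) == 1
# Next sector is also bad: entering 2 skips only one more (1 old + 1 new)
assert sectors_newly_skipped(1, 2) == 1
```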
After skipping the 3rd sector we read a good one
Large to Small
If you decide to work from skipping a large number of blocks down to a small number, choose N at this point and enter a smaller number of sectors to skip. An example of going from large to small would look like this:
diskcopy -i -s 7.27 -d 7.18
You are about to copy over disk 7.18 with the contents of disk 7.27.
Retries at the SCSI layer are: ENABLED
I/O size is 2048 sectors auto-skip mode is DISABLED
Any data on disk 7.18 will be lost!
Are you sure you want to continue with diskcopy? Y
Copying from disk 7.27 to disk 7.18.
NOTE: No sectors copied yet, immediate skip mode requested for
disk 7.27.
1 - skip sector
2 - retry I/O operation
3 - auto-skip mode
4 - quit diskcopy
3
WARNING: Sectors [0x0:0x1] (1 sectors) of disk 7.27 NOT copied --
please make a note of this!!
300 MB copied -
The confirmation is very clear about the upcoming actions. Even after choosing -i, the system will prompt before skipping the first sector. If auto-skip is selected, there will be no prompts for additional media errors.
After the disk copy completes there are two identical disks in
the system
After the disk copy completes there are two identical disks in the system – 7.27 and 7.18 – causing a failure to assimilate. The storage controller cannot tell which disk should be included in the RAID group and will not allow the volume to go online until one disk is removed or its label is changed.
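Conceptually, assimilation refuses duplicate labels. A minimal sketch follows; the single-id label is a deliberate simplification, since real RAID labels carry far more state than one identifier:

```python
def assimilate(disks):
    """disks: list of (disk_name, label_id) pairs.

    Builds the RAID group member list, refusing to proceed if two
    disks carry the same label: the controller cannot tell which
    one belongs in the group.
    """
    seen = {}
    for name, label in disks:
        if label in seen:
            raise ValueError("duplicate label %r on %s and %s"
                             % (label, seen[label], name))
        seen[label] = name
    return [name for name, _ in disks]
```

Removing either duplicate disk (or rewriting its label) makes the conflict disappear, which is exactly the resolution described on the next slide.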
To resolve this, physically remove 7.27 from the system
This allows the RAID group to come online
halt – Halt the storage controller
help – List commands
led_on or led_off – Light LED on the disk
Halt
Use halt to get out of maintenance mode and bring the system to the ok/CFE/LOADER prompt. There is no reboot command, and there is no direct route to Data ONTAP.
Help
Lists the available commands.
Led_on/led_off
Turns on the LED on the disk so it can be identified by the site contact.
If led_on does not work for the bad disk, turn on the LEDs of the adjacent disks and ask the site contact to pull the disk in the middle.
storage - Commands for managing the disk
subsystem
mailbox – Destroy or recreate mailbox disks
– Used in stale mailbox condition (KB ID:
2011715)
– Not for a missing or uncertain mailbox disk; that is likely a loop issue
storage
This is the same as the storage command from Data ONTAP.
storage release disks – releases disk reservations; it is for HA configurations stuck in an awkward state (one node is waiting for giveback, but no takeover has occurred). Both nodes should be halted prior to releasing disk reservations with this command; otherwise the UP node may panic, or data loss can occur.
Also, ‘storage release disks’ can be used with ‘disk remove_ownership all’ from maintenance mode to remove software-based disk ownership.
Mailbox
mailbox destroy local
mailbox destroy partner
It may be necessary to destroy and re-create the mailbox disks in the event that this data becomes stale. The mailbox disks will be re-created automatically at boot, assuming good access to all paths.
https://kb.netapp.com/support/index?page=content&id=2011715&locale=en_US
(22/7) Print this secret list.
(42/2) DEBUG ONLY: REMOVED! (Used to be manf. test)
(readonly) Readonly boot
(prev_cp) Boot from previous CP
(boot_snap_delete) Interactively delete snapshots from the root aggregate and contained volumes during boot.
Hidden Menu
Note: the vol commands were added in 7.0 with the introduction of …
Clearing Errors
25/7 – mark labels clean
29/7 – ignore disk errors
These are included for your information, not for your use. There is a potential for data loss, data inconsistency, and repeated trips to your manager’s office.
Why would you mark labels clean? If you have damaged NVRAM, if you need to reboot with offline volumes after a dirty shutdown, or if you have a “new” NVRAM card that has contents from a different node.
Why use the ignore-disk-errors option? If a good disk hits a bad sector during a reconstruct. This is not as prevalent with the newer error-control routines.
For example, during a RAID4 reconstruct, a 2nd disk hits a media error, so it fails itself, causing a double disk failure and a panic. Try again and the same thing occurs. Each time that 2nd disk hits a media error it will fail, causing an MDP (multi-disk panic). By ignoring the media error, the disk will not fail itself (if possible) and the reconstruct will complete. Note that the media errors mean those blocks may not have been read, so parity could not be calculated to reconstruct that stripe, and you may need to WAFL_check (WACK) the volume when complete.
28/7 – label summary
28/8 – label examine
30/7 – OBSOLETE: REMOVED
– (Was: edit disk labels)
These options were used in recovery scenarios. They are also available in maintenance mode and in
Data ONTAP. There is no harm in running label summary or label examine.
Safety Boots
prev_cp – boot from a previous consistency point (CP)
readonly – boot read-only from a previous CP
Safety Boots
Consistency Point
A consistency point is like a snapshot, but it does not have a name. Like all snapshots, a consistency point is a completely self-consistent image of the entire file system at a particular point in time. WAFL avoids the need for file system consistency checking after an unclean shutdown by creating these consistency points every few seconds. When WAFL restarts after an unclean shutdown, it simply reverts to the most recent consistency point. All writes between the last CP and the unclean shutdown are logged in NVRAM and can be replayed, so they are not lost. Refer to section 3.5 of TR3002: File System Design for an NFS File Server Appliance for more on consistency points.
"r": regular file system inofile: Mon Mar 3 00:58:56 GMT 2007
"c": inofile from previous CP: Mon Mar 3 00:58:55 GMT 2007
"7": snapshot with mod time Wed Aug 15 17:43:07 GMT 2006
"8": snapshot with mod time Wed Aug 15 17:43:38 GMT 2006
"9": snapshot with mod time Wed Aug 15 18:01:35 GMT 2006
"162": snapshot with mod time Sat Mar 1 03:00:01 GMT 2007
"165": snapshot with mod time Sat Mar 1 19:00:01 GMT 2007
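The recovery model described above (revert to the newest CP, then replay the NVRAM log) can be sketched like this; the timestamped dict-as-filesystem is an illustrative simplification, not WAFL internals:

```python
def recover(checkpoints, nvlog):
    """checkpoints: list of (timestamp, fs_state) in time order,
    where fs_state maps path -> contents.
    nvlog: list of (timestamp, (path, contents)) write records.

    Revert to the newest consistency point, then replay every logged
    write newer than it, so nothing since the last CP is lost.
    """
    cp_time, state = checkpoints[-1]    # newest consistency point
    state = dict(state)                 # work on a copy of the CP image
    for t, (path, contents) in nvlog:
        if t > cp_time:                 # older writes are already in the CP
            state[path] = contents      # replay the logged write
    return state
```

Writes logged before the newest CP are skipped because the CP already incorporates them; only the tail of the log matters.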
Use prev_cp if data is corrupt (possibly including data in snapshots) and there is no backup, or if the data was corrupted during a period of low change so the data loss is acceptable. *Always* explain to the customer that data will be irretrievably lost, and only proceed with their permission. This will *not* recover from improper Data ONTAP reverts.
2. WAFL_check prev_cp
This method will WAFL_check a CP or snapshot, and then you can consider committing the changes and booting. This decision (made by an EE or Sustaining Engineering) will be based on the errors seen. It is possible the storage system will not boot, and then you need to try something else, such as another WAFL_check of an older snapshot or CP.
These are available as of 7.0+
vol_offline
vol_quota_off
vol_cancel_snaprestore
vol_pick_root
vol_rewrite_fsid
vol_show_ops
vol_remove_op
•vol_offline – takes a flexible volume offline, but not the whole aggregate. There is no maintenance-mode equivalent of this command. NVRAM will not be able to replay to this volume if the shutdown was not clean. This can be used if a single flexible volume is corrupt or problematic: you could boot without that single volume, then run wafliron or otherwise resolve the volume issues.
•vol_quota_off – avoids possible quota-related issues. There is usually no need for this.
•vol_cancel_snaprestore – cancels a snaprestore on a flexible volume.
•vol_pick_root – when the original root is damaged, a new one can be chosen. If the new root does not have an /etc, one will be created. If the previous root volume is readable, licenses will be transferred to the new root.
•vol_rewrite_fsid – rewrites a volume’s file system ID (FSID). There can be instances where there are duplicate FSIDs between HA nodes, which can prevent failover. Use the rewrite to create a new FSID on the down head.
•vol_show_ops – lists the queued vol commands.
•vol_remove_op – removes queued commands last in, first out. Since invalid commands aren’t run, it may be easier to leave them in place and re-enter the command correctly.
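The duplicate-FSID condition that vol_rewrite_fsid resolves is easy to picture. This is a hypothetical check, not a NetApp tool; each mapping represents one node's volumes:

```python
def duplicate_fsids(local_vols, partner_vols):
    """Each argument maps volume name -> FSID (the integer identity
    the filesystem presents to clients).  Returns the FSIDs present
    on both HA nodes; any hit here can block failover until one side
    is rewritten."""
    return sorted(set(local_vols.values()) & set(partner_vols.values()))
```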
Clear flags
These are available as of 7.3
vol_clear_inconsistent
vol_clear_invalid
Clear flags
Module Summary
You should now be able to:
Explain the 1-5 and 1-8 menu
Explain the 22/7 menu
Describe the 22/7 menu's hidden options in 7.0+
Define the vol commands available from the 1-5
menu
Define when to use ignore medium errors
Explain what maintenance mode is, when to use it, and what you can do from it
Module Review
How do you get to the 1-5 Special Boot Menu?
CTRL-C during boot
What is option 4 used for?
Destroy all data; initialize all disks and create a traditional root volume.
In Data ONTAP 7.2 in maintenance mode, what command do
you use to see the volume status?
aggr status
What are the two places that are used to mark a disk as failed?
Failed Disk Registry and Failure Bytes
What is the difference between a prev_cp and a readonly boot?
Prev_cp rolls you back and boots to the previous CP in read/write mode. Readonly boots a previous CP in read-only mode but does not delete newer CPs.
NetApp Confidential — Limited Use
Labs
Lab 4-1
Lab 4-2
Lab 4-3
Hardware and Down Storage System Troubleshooting for Partners
Common Storage Controller Problems
Module Objectives
By the end of this module, you should be able to:
Discuss core file types and process to analyze them
Explain what to do when you have an unresponsive storage
system
Describe a HA takeover and determine when appropriate to
giveback
Explain how to deal with upgrades that have gone bad
Describe watchdog resets, and how to handle them
Illustrate how to use the Panic Message Analyzer (PMA)
Module Topics
This module contains the following sections:
Cores
Unresponsive system
Machine Check Errors
Failed Upgrades
Watchdog resets
Core Files
Contains the contents of write cache and memory at the time of the core dump
Analyzed to find the reason for a panic or hang
Written across reserved areas on the disks belonging to the storage system that went down
Moved to /etc/crash when the system boots (savecore)
Pre-7.2 – cannot be retrieved from /etc/crash until the down head is brought back up
7.2+ – the core can be retrieved while in HA failover mode
Core Files
Core Files can be generated from a panic or manually generated via the sync core process.
In a cluster, if a takeover occurs, you cannot get the core file until a giveback is performed (Pre-7.2).
Savecore
The savecore command is what writes the core to the /etc/crash directory. The savecore
command should always be in the /etc/rc file and therefore runs on boot.
Savecore Command
savecore [ -i | -l | -s ]
savecore [ -f | -k | -w ] [ core id ]
Savecore Command
savecore -f
If savecore fails to save an unsaved core too many times, it will stop trying. This flag tells savecore to
ignore the previous attempts, and try again. This attempt will be made synchronously.
savecore -i
Displays information about the type of coredump that would be performed, if the storage system were to
panic right now. If unsaved cores are present on the disks, this information will be included in the output.
savecore -k
Invalidates the special core area that resides on each of the disks. Typically this is used when a
savecore command cannot complete due to an error such as a disk read error.
savecore -l
Lists all of the saved core files currently in the coredump directory, along with the panic string, panic time,
and OS version. This information, along with a core id, is printed for unsaved cores.
savecore -s
Displays the progress of the current savecore operation. If no save is in progress, but an unsaved core
is present, a warning will be printed if the unsaved core cannot fit in the root filesystem.
savecore -w
Starts savecore in a synchronous mode, which blocks other processing until the save has completed.
Periodic progress updates will be displayed on the console.
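The retry bookkeeping behind -f can be summarized as follows. The attempt limit and function name here are hypothetical; the real command's internals are not documented:

```python
MAX_ATTEMPTS = 3   # hypothetical give-up threshold, for illustration only

def should_try_save(previous_failures, force=False):
    """savecore stops retrying a core it has repeatedly failed to
    save; -f (modeled by force=True) tells it to ignore that history
    and try the save again."""
    return force or previous_failures < MAX_ATTEMPTS
```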
Preferred Method: NMI button or from RLM
Creates a core with no alteration of memory
NMI Button
The NMI button creates a core file with no alteration of memory; this is the preferred method to generate a core file.
Method to get a good core in Data ONTAP 8 7-Mode
Use in instances where there is no RLM/SP or NMI button
Go into the systemshell and run the following:
– sudo sysctl debug.debugger_on_panic=0
– sudo sysctl debug.panic=1
Systemshell
For more information on how to enter the systemshell …
https://kb.netapp.com/support/index?page=content&id=1012484
Deprecated Methods:
halt –d
panic
reboot -d
There may be other activities to perform prior to
collecting a core
If the customer is going to reboot due to protocol
access issues
If the storage controller is not serving data and is non-
responsive to the console
There is a memory leak and a reboot is the only way
to alleviate the problem
At the direction of Escalations or Sustaining
Engineering
Use this KB for more detailed information:
https://kb.netapp.com/support/index?page=content&id=1010200
Pre-Core Collection
When the storage controller is serving data –
seek direction if unsure
When there are loop issues
Customer retrieves core from /etc/crash
Customer uploads core file to NetApp support
If the customer has RSE/RSA, NetApp can retrieve the core ourselves
Core is attached to case by SAP
TSE/EE completes the sync core template (not
necessary for a panic generated core)
Sustaining Engineering’s tools attempt to auto-
analyze the Core (PMA)
If no success, Sustaining analyzes core manually
The TSE/EE completes the sync core template and attaches it to the case. NMI/multidisk cores will not be analyzed without a sync core template!
If the automatic tool does not find the cause, then Sustaining will manually analyze the core.
It is always quicker to try the Panic Message Analyzer first and to investigate the cause on our own.
Sustaining Engineering’s SLA to analyze a core is based on the highest priority field in SAP:
P1 core – 4 hours
P2–P4 – up to 3 business days
Engage an EE or email Sustaining if these times are not acceptable.
Panics
When the storage controller has a serious problem that it cannot gracefully recover from, it panics and reboots to avoid causing file system damage
Note that you can use PMA and BTA to find the reason for a panic, and if those do not produce useful
results, then upload the core.
PMA - Example
Current Release - Include P release, D release, etc
Panic Message: Include the line from syslog starting with “Panic String:”
Panic string: Protection Fault accessing Kernel text address
0x00d49a1f from EIP 0x99e7ab in process ndmp_session_0 on release
NetApp Release 7.0.2P2
PMA – Example
If the full panic string does not give a good hit, then you can pare down the panic string to find more
possible matches.
Machine check errors are detected and reported by processors when hardware errors occur
Machine check errors can be correctable or
uncorrectable
Example:
– Uncorrectable Machine Check Error at
CPU1: status 0xb200001080200e0f. BUS
Error: Val, UnCor, Enable, PCC, Response
parity error, ErrExt(BQ_DCU_RFO_TYPE,
BQ_ERR_HARD_TYPE), ErrCode(Gen, NTO,
Gen, Gen, Gen)
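The high-order flag bits named in that message (Val, UnCor, Enable, PCC) follow the architectural IA-32 MCi_STATUS layout and can be decoded mechanically; the model-specific ErrExt/ErrCode fields vary per CPU and are not decoded in this sketch:

```python
# Architectural flag bits of an IA-32 MCi_STATUS register.
MCA_FLAGS = {63: "Val", 62: "Over", 61: "UnCor", 60: "Enable",
             59: "MiscV", 58: "AddrV", 57: "PCC"}

def decode_mca_flags(status):
    """Return the names of the set flag bits, highest bit first."""
    return [name for bit, name in sorted(MCA_FLAGS.items(), reverse=True)
            if (status >> bit) & 1]

# decode_mca_flags(0xb200001080200e0f) -> ['Val', 'UnCor', 'Enable', 'PCC'],
# matching the flags spelled out in the console message above.
```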
Bootlog contains valuable information
Mismatch of RAID label versions
– Reboot without revert_to
– New CF does not contain desired image
CF card issues
– Bad, corrupt, incorrect image on flash
– Bad flash hardware
HA and Non-Disruptive Upgrades (NDU)
– Errors in /etc/rc or /etc/host files
– Misunderstanding of cf takeover -n
When the RAID version on the disks is newer than (and not supported by) the version of Data ONTAP that is booting from the mini-kernel
Can occur when
– Data ONTAP is downgraded and revert_to is not used
– Data ONTAP is upgraded and the download command is not used
Console log shows:
– [Disk] has raid label with version (7), which is not within the currently supported range (5 - 6).
– No usable root volume found!
– Disk label processing failed!
Each version of Data ONTAP has a version of RAID and a version of WAFL.
Versions are backward compatible for most commands.
Versions are never forward compatible.
Why is this important?

  ONTAP   RAID   WAFL
  6.5     6      38
  7.0     7      54
  7.1     7      57
  7.2     8      72
  7.3     9      77
  8.0     10     82
  8.1     11     87
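Forward incompatibility is mechanical: a kernel refuses labels newer than its supported range. A sketch of that check, with a range matching the console message shown earlier (illustrative, not the ONTAP source):

```python
def check_label(disk, label_version, supported=range(5, 7)):
    """Return None if the label is usable, else the complaint text.
    range(5, 7) models a kernel that supports label versions 5-6."""
    if label_version in supported:
        return None
    return ("%s has raid label with version (%d), which is not within "
            "the currently supported range (%d - %d)"
            % (disk, label_version, supported[0], supported[-1]))
```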
This is from a situation where a customer added spare disks that
came from another system
Alternate boot the storage controller to the
version of Data ONTAP on the disks
– Boot backup image from CF
– Boot FCAL (OFW based systems)
– Netboot
Run the revert_to or download command to update the Data ONTAP version on the mini-kernel to match that of the disks
What is a Non-Disruptive Upgrade (NDU)?
Human Error
– Not following procedure
– Failing to run download
– /etc/rc and /etc/hosts errors
Use Upgrade Advisor to determine if NDU is advised
Customer was not expecting downtime for CIFS
Customer SAN environment was not set to proper
timeout values
Improper HA setup, so failover does not takeover all
processes
NFS hard mounts are guaranteed, but soft mounts
may experience problems
/etc/rc errors
Typically non-existent/incorrect partner statements
Interfaces not defined in /etc/hosts
Interfaces not defined in /etc/rc
Multi-Disk panic
If a RAID4 aggr/vol has a double disk failure, or a RAID-DP aggr/vol has a triple disk failure within a RAID group, the aggregate cannot come online.
To recover, at least one of the failed disks must be brought back into the RAID group.
Many of us have been taught that we must bring the most recently failed disk back into the RAID group.
This was true before Burt 34481 was fixed.
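The online/offline rule above reduces to a parity-count check. This is a simplification for illustration; real assimilation also weighs per-disk label state, which is what the Burt 34481 behavior concerned:

```python
def raid_group_can_be_online(raid_type, failed_disks):
    """RAID4 survives one missing disk, RAID-DP two; one more failure
    than that and the group (and its aggregate) stays offline until a
    failed disk is brought back."""
    tolerance = {"raid4": 1, "raid_dp": 2}[raid_type]
    return failed_disks <= tolerance
```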
Which disks can be re-used
When and where to unfail disks
– Why NOT to use disk unfail -s
Zeroing spares
Watchdog Resets
Internal check – if the storage system believes it is hung, it will panic and reboot
Watchdog Resets
1st level
Attempts to dump core. If the storage system is too hung to create a core, then the 2nd-level watchdog kicks in.
2nd level
This is like a power off and power on. No core file, no messages in the logs. If you feel this will recur, then log the console 24x7 to see if you can catch any messages on the console.
If an RLM is available, it will log an Abnormal Reboot and will send an AutoSupport. The RLM logs will provide detailed information.
Watchdogs can be nearly impossible to diagnose. The cause can be the motherboard or other hardware, or it could be a software issue, so do your KB searches or engage an EE before replacing hardware.
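The two levels can be summarized as a fallback. The names and return values here are illustrative, not the firmware's:

```python
def watchdog_fire(try_dump_core):
    """try_dump_core: callable that returns True if a core was written.

    Level 1 panics and attempts a core dump; if the system is too hung
    even for that, level 2 acts like a power cycle: no core file and
    no log messages to work from afterwards."""
    if try_dump_core():
        return "level1-panic-with-core"
    return "level2-hard-reset-no-core"
```

This is why a second-level watchdog case so often ends with "log the console 24x7": the reset itself leaves nothing behind.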
If the watchdog reset created a core – have it analyzed
If no core:
– Review Sustaining’s watchdog template
– Check storage system history for:
Other Watchdog resets
Other messages about hardware problems
– Work with Escalations who may suggest
No action if this is an isolated event
Verify configuration in sysconfig guide
Check if watchdog occurred during a panic or shutdown
Reseat or replace I/O cards, Memory, and/or RLM
Replace motherboard
Log console to file for monitoring
Disable WatchDog Timeouts
What does the customer mean by hung storage
system?
– Very slow or standing still?
– Is only the console hung?
– Is the storage controller pingable?
– Are other processes running?
What does sysstat -x 1 show?
If console is hung, can you access via other means:
rsh, ssh, nfs, cifs?
Is it hung due to memory leak or other Data ONTAP
issues?
What is Unresponsive?
Understand the difference between a hung storage system and perceived slow performance.
Unresponsive means the storage system is showing no signs of life. Slow Performance means it is
serving data and responding to commands but at a very slow speed. It is necessary to understand the
difference so the problem can be worked properly.
History
In all situations you should check for previous cases for clues. Also review previous AutoSupport for
precedent.
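The questions above amount to a rough triage rule. The categories are this course's rules of thumb, not a NetApp tool:

```python
def classify(serving_data, pingable, console_responsive):
    """Separate a genuinely down system from a slow one so the case
    is worked as down-system vs. performance."""
    if serving_data:
        return "slow-performance"      # responding, just slowly
    if pingable or console_responsive:
        return "partially-responsive"  # e.g. console hung but host up
    return "unresponsive"              # no signs of life
```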
There are three potential causes of an
unresponsive Storage System
– 100% CPU
– CPU Deadlock
– Memory Leak
In a normal case 100% busy CPU should NOT be
alarming
Slow storage system performance in conjunction
with high or 100% CPU can be alarming
sysstat -M should be used to collect data for multi-CPU storage controllers
CPU profile (gmon) could be gathered during
perfstat collection
If a performance issue is determined, then this is not a down-storage-system issue – it is a performance issue
100% Busy
In a normal case 100% busy CPU should not be alarming. The CPU is a resource and the storage
system will use all the resources to achieve the best efficiency. See
https://kb.netapp.com/support/index?page=content&id=3011266
The deswizzler, container block reclamation, and quota init are some of the most resource-intensive WAFL scanners that can utilize free CPU cycles.
Profile
A Profile (gmon) could be gathered during perfstat collection. A CPU Profile measures the CPU
utilization by process within one CPU domain or across all CPU domains. Discussion of CPU domains is outside the scope of this course and is covered in depth in the Performance classes, but here is a quick overview: related storage controller processes are grouped into CPU domains, and one domain may use 100% of one CPU. If one domain shows 100% usage, that may tell you which process is causing the high-CPU problem.
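The idea of spotting a saturated domain can be sketched as follows. This is a hypothetical illustration only, not the actual gmon output format: the domain and process names below are invented, and a real profile should be interpreted with your EE.

```python
# Hypothetical sketch: given per-process CPU samples tagged with a CPU
# domain, sum utilization per domain to spot one pegged near 100%.
# Domain/process names are illustrative, not real gmon output.
from collections import defaultdict

def busiest_domain(samples):
    """samples: iterable of (domain, process, cpu_percent) tuples."""
    totals = defaultdict(float)
    for domain, _process, pct in samples:
        totals[domain] += pct
    # Return the domain with the highest aggregate utilization.
    return max(totals.items(), key=lambda kv: kv[1])

samples = [
    ("kahuna", "wafl_lopri", 40.0),
    ("kahuna", "wafl_exempt", 55.0),
    ("network", "nfs", 20.0),
    ("raid", "raid_scrub", 15.0),
]
domain, pct = busiest_domain(samples)
print(domain, pct)  # kahuna 95.0
```

A domain summing to ~100% points at the group of processes worth investigating further.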
Analyze a Profile
Work with your EE to analyze the profile. This may involve digging into the DOT code to determine which events trigger the high-CPU processes.
The storage controller shows no sign of life in the subsystem where the deadlock occurred
Follow the sync core process
Module Summary
You should now be able to:
Discuss core file types and process to analyze them
Explain what to do when you have an unresponsive storage
system
Describe an HA takeover and determine when it is appropriate to perform a giveback
Explain how to deal with upgrades that have gone bad
Describe watchdog resets and how to handle them
Illustrate how to use the panic message analyzer
Module Review
What is the preferred method for generating a core file?
Pressing the NMI button or using the RLM
What tool is available to a TSE to analyze the reason for a
panic?
Panic Message Analyzer
– How do you resolve a RAID Label Mismatch
problem?
Alternate boot and run download or revert_to
What is a watchdog reset?
A reboot initiated by hardware when the storage controller is hung.
When is it appropriate to perform an HA giveback?
When you understand and have resolved the reason for the
takeover.
NetApp Confidential — Limited Use
Labs
Lab 5-1
Hardware and Down Storage System Troubleshooting for Partners
WAFL_check and wafliron
Module Objectives
Discuss data inconsistency and some
causes
Describe the differences between
WAFL_check and wafliron
Explain how to use WAFL_check and
wafliron
Discuss RAID Error Propagation (REP)
Module Topics
What is data inconsistency
How inconsistency is caused
WAFL_check
wafliron
REP
The next few slides cover serious recovery issues that may be:
– Rare
– Complicated
– Capable of causing permanent damage to the users' data if not resolved properly
It is crucial that NetApp Escalations Engineering is involved
in deriving an action plan
Remember that at the core, NetApp products are all about
protecting customer data – this has to be the top priority
– Data recovery may take time – DO NOT RUSH THIS!
Unintended changes to data
It can be caused by hardware, firmware or
software
Why we never use the word “corruption”
Data Inconsistency
When data is inconsistent, it is not the data that is expected, or perhaps the checks and balances don’t
match. It could be the checksums don’t match, or parity doesn’t match, or some other mismatch.
Inconsistencies are accidental, and often correctable. On the other hand, corrupted data has been
intentionally changed by a user or external program, and is usually harder to recover.
Pre-corrective Actions
Stabilize System BEFORE WAFL_check/wafliron
– Otherwise there is a risk of additional damage during or
after correction – even during WAFL_check commit
phase
– When hardware stability is in doubt, check the first 20
minutes of a WAFL_check for errors
Some Stabilization Techniques
– Stabilize disk subsystem
Update disk/shelf Firmware
Stabilize loops / shelves
Evaluate removing disks with bad data
Replace CPU/Motherboard, memory, shelf module, disk, and
so forth if necessary
Hardware Problems can be a cause of
filesystem inconsistency
– Storage Problems
– CPU / Memory Glitches
– Network
Storage Problems
Bad (“lost”) read/write to disk
Misdirected write to disk(s)
Lost write
Unexpected data (disk not properly zeroed)
Storage errors with degraded or no RAID parity
protection
– 7.3.1+ REP will prevent inconsistency on non-metadata
Faulty FC or SAS adapter
Normal I/O Errors, MDPs, and so forth, do not cause
inconsistency
– RAID is resilient – that is our business!
– Find the real cause and fix it!
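As a reminder of why RAID is resilient to a single failure, single-parity reconstruction can be sketched in a few lines. This is an illustrative simplification (XOR parity over equal-sized blocks), not Data ONTAP's actual RAID implementation:

```python
# Illustrative single-parity (RAID 4-style) sketch: parity is the XOR of
# the data blocks, so any ONE lost block can be rebuilt from the rest.
# This simplifies away checksums, stripes, and real RAID geometry.
from functools import reduce

def xor_blocks(blocks):
    # XOR corresponding bytes across all blocks (blocks must be equal size).
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"\x01\x02", b"\x0f\x00", b"\xff\x10"]
parity = xor_blocks(data)

# Simulate losing disk 1 and rebuilding it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print(rebuilt.hex())  # 0f00
```

This is also why degraded mode matters: with one disk already missing, there is no remaining redundancy, and a further storage error can cause inconsistency.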
Storage Problems
Faulty FC HBA
• A faulty FC adapter can use a wrong address as a DMA source or destination.
Lost write
• A lost write is when a write is reported as committed but the data on disk was not changed; this is usually due to a failing disk. This type of condition is very rare by nature, unless artificially induced by other means.
• Checksums are self-consistent, because the data and checksum have not actually changed.
• WAFL sanity checking was added to find this problem.
If a medium error is ignored, Data ONTAP will serve the "data" rather than panic. WAFL_check or wafliron will be needed if this occurs.
DESCRIPTION: An aggregate or volume can become inconsistent if the nvfile is missing while in degraded mode with a missing parity disk.
• The filer had an unclean shutdown (e.g., a multi-disk panic).
• On the next boot or takeover, there can be two situations:
- The aggregate appears as missing, or it is in a failed state; NVRAM replay cannot be performed, so the replay data should be saved into the nvfile, but the save fails for some reason.
- There is no unclean aggregate, so there is no need to create the nvfile; in that case there is no message, like the one below, to indicate an nvfile save operation for the aggregate: [raid.preserve.nvram:info]: Raid replay detected nvram entries for a non-existent aggregate or traditional volume having raidtreeID ...
• During the reboot/takeover, the filer detected that the aggregate is degraded and has dirty parity. Look for this message: [raid.assim.tree.degradedDirty:error]: Aggregate aggr_sata_0 is degraded and has dirty parity. You must run WAFL_check... In addition, the aggregate appears to be missing a parity disk.
Storage
How to tell if a storage issue is present
– Checksum errors
Indicative of lost or misdirected writes
– Parity Inconsistency (PI) errors
Usually lost writes; sometimes misdirected writes
Presence of Parity Inconsistency log files
– WAFL sanity check errors
– History of storage related errors
Checksum
A checksum is a count of the number of bits in a transmission unit that is included with the unit so that the
receiver can verify whether the correct number of bits has arrived.
Checksum errors – RAID computes parity on the data blocks and compares it against the parity information in the block. If they don't match, there is a checksum error. These are normally seen during a RAID scrub.
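The store-then-verify idea behind block checksums can be sketched as follows. This is a toy model using CRC-32, not the real on-disk checksum format; the helper names are invented for illustration:

```python
# Toy sketch of per-block checksum protection: store a checksum with each
# block at write time, recompute it on read, and flag any mismatch.
# CRC-32 stands in for the real on-disk checksum scheme.
import zlib

def write_block(data: bytes) -> dict:
    # Store the data together with its checksum (hypothetical format).
    return {"data": data, "cksum": zlib.crc32(data)}

def verify_block(block: dict) -> bool:
    # Recompute and compare; False indicates a checksum error.
    return zlib.crc32(block["data"]) == block["cksum"]

blk = write_block(b"customer data")
assert verify_block(blk)

# A misdirected or partial write changes the data under the checksum:
blk["data"] = b"customer dat?"
print(verify_block(blk))  # False
```

Note this model also shows why a lost write is harder to catch: if neither the data nor the checksum changed, the pair is still self-consistent, which is why WAFL sanity checking exists.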
History
There is often a history (recent or prolonged) of disk errors or storage instability, as has been evident since the introduction of lower-cost SATA drives.
Review and update disk and shelf FW
– In a controlled manner
– Before running WAFL_check or wafliron
– FW Updates do not fix existing inconsistency on disk(s)
TSB-0709-11: Pre-FW34 Risk Minimization Configuration
Recommendations
If the problem impacts many disks
– Stabilize the storage subsystem before running WAFL_check or wafliron
If the problem is isolated to a specific disk
– Remove that disk prior to WAFL_check or wafliron
– Once the data has been rebuilt from parity, WAFL_check or wafliron may run clean
Shelf firmware upgrades are not always non-disruptive. Most AT module upgrades are disruptive (such as AT-FC and AT-FC2); AT-FCX upgrades are now NDU (non-disruptive upgrades), with FW37, Data ONTAP 7.3.2, and MPHA as prerequisites. For FC shelves, firmware upgrades are non-disruptive.
Update Firmware and stabilize any existing loop issues before running WAFL_check or wafliron
Address any firmware related issues and stabilize the loops prior to running WAFL_check or wafliron.
This will prevent further damage to the filesystem while wafliron is running or when WAFL_check is
committed. In pre-7.3.1 Data ONTAP, wafliron commits changes as it runs. If wafliron misreads or
miswrites while it is running, this will likely result in more damage to the filesystem. If the underlying issue
is not addressed and WAFL_check is used, this will be evident in the WAFL_check log data and changes
can be discarded (not committed).
NOTE: As of Data ONTAP 7.3.1, wafliron comes with an optional commit feature for online recovery, to minimize the risk of false changes.
CPU/Memory
Some causes
– CPU glitch
– Memory inconsistency – Memory problems can
change data, and then write the changed data to
disk, causing inconsistency
Might be silent or might appear to be a disk error
– Before the checksum calculation, it will be a silent error
– After the checksum calculation, it will look like a bad disk
– Might not be on disk (buffer post read/write)
Check the history for ECC errors and FW / CPU microcode levels
Note that it is imperative to be current on system firmware to prevent hard-to-isolate, CPU-induced issues
A problematic CPU can misinterpret block data or miscalculate checksum values. Core file and PI data is
useful for determining if CPU might be at fault. Replace the motherboard/CPU to correct the issue.
Network
Network packets are CRC protected
Some types of damage will “pass” the CRC check
– Statistically possible with pathological link
– Ex: a faulty DS3 line card corrupted bitmaps
Snap* targets can accrue inconsistencies in the data
when receiving such packets
153223 is a known example: SnapMirror and
SnapVault do not detect network packet errors that are
undetected by TCP checksums
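One concrete reason damage can pass the TCP checksum: the Internet checksum (RFC 1071) is a ones-complement sum of 16-bit words, so it cannot detect two 16-bit words being swapped. A minimal sketch (the payloads below are invented examples):

```python
# Sketch of why corruption can slip past the TCP checksum: the Internet
# checksum (RFC 1071) is a ones-complement sum of 16-bit words, so it is
# insensitive to word reordering -- two different payloads can checksum
# identically.
def inet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"          # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF

good = b"\xde\xad\xbe\xef"
damaged = b"\xbe\xef\xde\xad"    # the two 16-bit words swapped in transit
assert good != damaged
print(inet_checksum(good) == inet_checksum(damaged))  # True
```

This is why Snap* destinations can silently accrue inconsistencies from a pathological link even though every packet "passed" its checksum.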
WAFL - WAFL bugs can directly damage
the file system
Snap* - SnapMirror bugs can damage user
data
RAID – Out of Order disks due to label
problems
WAFL
WAFL owns the file system, so bugs at this layer can directly damage the file system.
WAFL_check or wafliron should attempt to resolve these issues. In the unlikely event that they do not, engage EE support.
Volume SnapMirror (VSM) can transfer problems from the SnapMirror source to the destination.
VSM will transfer many types of file system damage from the source to the target if the damage exists on the source. Corrections on the target will be undone (overwritten) by the next transfer. If this happens, it may be necessary to correct the source (WAFL_check or wafliron) and then resync or rebuild the target.
Correcting RAID configuration might not eliminate errors if the aggregate was ever brought up in
read/write mode
WAFL info may have been “corrected” wrongly.
WAFL_check or wafliron will not complete in these situations.
WAFL_check or wafliron?
The Recovery Advisor tool should be consulted for information
specific to your situation (this is required and needs to be
documented in the case)
Run WAFL_check, wafliron or wafliron w/ optional
commit (IOC) to resolve inconsistency
Run aggr/vol scrub (RAID)
– FW upgrade only fixes the cause of inconsistency, not the
inconsistency
– Scrub re-calculates and fixes parity due to any changes
made by WAFL_check or wafliron
– If errors are found, re-run the disk scrub to confirm that no further errors are found
Run backup to /dev/null (WAFL)
– Scans all data (including user data)
– Creates snapshot - won’t trigger panic on 7.2.3 if there is
inconsistency
disk scrub
After a lost/misdirected write issue, a disk scrub must be done to update RAID with any changes made by WAFL_check. The FW upgrade only fixes the cause of filesystem damage but does not fix the actual inconsistency. Disk scrub verifies all disks against the WAFL metadata and recalculates and fixes parity where necessary.
If you see checksum errors after running scrub following WAFL_check or wafliron, rerun the scrub to
ensure that these errors are not caused by new issues. If errors continue on the second pass, determine
if the loop issue may be causing this or if there is a problematic disk.
Now that all contributing factors have been
corrected…
The snapmirror check subcommand is used to start a verification session in the background for a
SnapMirror destination. It allows the user to verify the validity of mirrored volumes and qtrees. The final
report of a check session will be recorded in /etc/log/snapmirror log file. The progress during a
check session is available in snapmirror status -l output for the corresponding relationship.
Mismatches, if detected, will be logged in /etc/log/snapmirror on the destination filer. Look at
na_snapmirror(5) for more details about /etc/log/snapmirror log file.
WAFL_check
WAFL_check [Aggr/Trad Vol]
WAFL_check is run from the special boot menu
Makes the file system self-consistent again
Cannot detect or repair inconsistency to user
data
Storage system is offline
No changes made until commit is chosen
Can be run on one or more aggregates and
traditional volumes
Deprecated after 8.0
WAFL_check
WAFL_check fixes inconsistencies in the filesystem while the filer is offline. No changes are committed
during the WAFL_check process. Once WAFL_check has been completed, it will prompt to commit
changes. WAFL_check can be run on one aggregate or all aggregates.
NOTE: WAFL_check can be run against aggregates or traditional volumes but not against individual flex
volumes. If you want to WAFL_check a flex vol then you run WAFL_check against the containing
aggregate. When no aggregate is specified, user is prompted for each aggregate at the beginning of
WAFL_check.
Options to WAFL_check
-lost_blocks
Does not collect lost blocks into lost+found, but discards them. This is to be used only when there is
strong reason to believe that the WAFL_check will not be successful without it. Such a case would be
if the Storage Controller hung or panicked while collecting lost blocks in a previous WAFL_check.
-f
This controls whether detailed checking of nfs name translations will be done.
-prev_cp
For each aggregate, presents a list of snapshots in that aggregate and asks the administrator to
choose which snapshot to revert to and check. If changes from such a run are committed, this will be
similar to a SnapRestore. This should only be used if a WAFL_check on the active file system
appeared to have too much damage.
WAFL_check (cont.)
-snapshots
Kills problematic snapshots. For each volume, the administrator is queried whether to delete each snapshot. Any snapshots chosen are removed. WAFL_check does not attempt to be clever about disposing of the snapshots and thus leaves many inconsistencies for the remainder of the scan to repair. Consequently, the scan will report errors related to the snapshots, the snapshot directory, the summary and spacemaps, and blocks used.
-stats
Does the equivalent of statit -b and wafl_susp -z prior to long running phases and the
equivalent of statit -e and wafl_susp -w after those phases. This provides interesting statistics
that are primarily useful to NetApp Engineering.
-verbose
Prevents any messages from being suppressed from the console. The additional messages may be
useful, but may slow down the scan significantly. Should only be used under escalations supervision.
-all
Check all volumes.
-noestimates
Disable phase specific time estimates.
Mean size of files in the file system being checked
Number of inodes in use
Layout of the data on a volume/aggregate
Size of the volume/aggregate
Number of file system inconsistencies if they exist
Storage appliance's CPU speed
Storage appliance's Memory
Speed of the disk drives (i.e. 5400/7200/10000/15000 RPM)
Data ONTAP version
Number of flexvols contained in the aggregate being checked
Can be limited by the speed of the console connection – (prior to
DOT 7.2.4)
Presence of LUNs in the volume
Speed of WAFL_check
In general, WAFL_check can take several minutes, hours or days depending on the above
factors. WAFL_check runs through several phases during the file system check. In most cases, phase
4b (Scan inode file normal files) will take the most amount of time since this comprises the bulk of the
data to be checked.
You should never guarantee when a WAFL_check will complete, even though you may be asked. There are initiatives within Engineering to improve the speed and the predictability of the time to run; this is also true of wafliron. So far, all initiatives have been unsuccessful.
Typically 6 phases
– Phase 1: Verify fsinfo blocks
– Phase 2: Verify metadata indirect blocks
– Phase 3: Inode scan (3a and 3b)
– Phase 4: Directory scan
– Phase 5: Check each Flex-Vol
Phase 5.1: Verify fsinfo blocks
Phase 5.2: Verify metadata indirect blocks
Phase 5.3: Inode scan (5.3a and 5.3b)
Phase 5.4: Directory scan
Phase 5.5: skipped
Phase 5.6: Cleanup
– Phase 6: Cleanup
Phases of WAFL_check
Phases 1-4 check the aggregate. Phase 5 loops through the flexvols in the chosen aggregate, checking each one. Phases 3b and 4 often take the most time. The first 70% goes the fastest, since WAFL_check begins with level 0 inodes.
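When reviewing a saved WAFL_check console log, a small helper like the following can summarize where the time went. This is a hypothetical convenience script, not a NetApp tool; the phase-timing lines match the output format shown later in this module:

```python
# Hypothetical helper: pull per-phase timings out of a saved WAFL_check
# console log to see which phase dominated the run time.
import re

PHASE_RE = re.compile(r"Phase \[?([\w.]+)\]? time in seconds: (\d+)")

def phase_times(log_text: str) -> dict:
    # Map phase name -> seconds, matching both "Phase 3a ..." and
    # "Phase [5.3b] ..." styles of line.
    return {m.group(1): int(m.group(2)) for m in PHASE_RE.finditer(log_text)}

log = """\
Phase 3a time in seconds: 23
Phase 3b time in seconds: 12118
Phase [5.3b] time in seconds: 24752
Phase [5.6d] time in seconds: 486
"""
times = phase_times(log)
slowest = max(times, key=times.get)
print(slowest, times[slowest])  # 5.3b 24752
```

As the notes above suggest, the inode-file scan phases (3b / 5.3b) usually dominate, since they cover the bulk of the data.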
Commit Changes?
Committing changes from WAFL_check should minimally have EE review if any errors are found or
changes are to be made. The EE will use great caution and in most cases consult Sustaining.
WAFL_check example
Selection (1-5)? WAFL_check aggr0
Checking aggr0...
WAFL_check NetApp Release 7.0.4
Starting at Tue Jul 3 00:13:30 GMT 2007
Phase 1: Verify fsinfo blocks.
Phase 2: Verify metadata indirect blocks.
Phase 3: Scan inode file.
Phase 3a: Scan inode file special files.
Phase 3a time in seconds: 23
Phase 3b: Scan inode file normal files.
(inodes 5%)…
…(inodes 100%)
Phase 3b time in seconds: 12118
Phase 3 time in seconds: 12141
Phase 4: Scan directories.
Phase 4 time in seconds: 0
Phase 5: Check volumes.
Phase 5a: Check volume inodes
Phase 5a time in seconds: 0
Phase 5b: Check volume contents
Slide callouts: check the aggregate first (the aggregate is checked before the flexvols); are you logging the console to a text file?
WAFL_check Example
Volume boot: Descriptor in a non-root aggregate has root volume flag set (0x101). Clearing.
Checking volume boot...
Phase [5.1]: Verify fsinfo blocks.
Phase [5.2]: Verify metadata indirect blocks.
Phase [5.3]: Scan inode file.
Phase [5.3a]: Scan inode file special files.
Phase [5.3a] time in seconds: 17
Phase [5.3b]: Scan inode file normal files.
(inodes 5%)
Phase [5.3b] time in seconds: 8
Phase [5.3] time in seconds: 25
Phase [5.4]: Scan directories.
(dirs 17%)
(dirs 93%)
Phase [5.4] time in seconds: 3
Phase [5.6]: Clean up.
Phase [5.6a]: Find lost nt streams.
Phase [5.6a] time in seconds: 0
Phase [5.6b]: Find lost files.
Phase [5.6b] time in seconds: 9
Phase [5.6c]: Find lost blocks.
Phase [5.6c] time in seconds: 0
Phase [5.6d]: Check blocks used.
Phase [5.6d] time in seconds: 43
Phase [5.6] time in seconds: 52
Volume boot WAFL_check time in seconds: 81
WAFL_check output will be saved to file /vol/boot/etc/crash/WAFL_check
Slide callouts: checking of volume "boot" ends with no errors.
Checking volume backup1...
Phase [5.1]: Verify fsinfo blocks.
Phase [5.3b]: Scan inode file normal files.
(inodes 5%)
(inodes 95%)
Tue Jul 3 09:40:26 GMT [raid.cksum.ignore.error:error]: Invalid checksum entry on aggregate aggr0, Disk /aggr0/plex0/rg0/0a.22 Shelf 1 Bay 6 [NETAPP X267_MGRIZ500SSX H3DD] S/N [H814NK6H], block 79150113. Ignoring unrecoverable error.
Inode 98: indirect block 375929670, level 1, has bad magic number 0x34d1fd99. Clearing block.
Inode 98: indirect block 375929670, level 1, has too many errors. Clearing.
Inode 98: block count is 537532090. Setting to 537531580. Errors found.
(inodes 100%)
Phase [5.3b] time in seconds: 24752
Phase [5.6d]: Check blocks used.
FS info shows reserved hole blocks used to be 133320. Setting to 133830.
FS info shows reserved overwrite blocks used to be 537532092. Setting to 537531582.
Phase [5.6d] time in seconds: 486
Phase [5.6] time in seconds: 532
Volume backup1 WAFL_check time in seconds: 26226
510 error messages discarded
Indirect blocks cleared: 1
Block counts corrected: 1
Reserved holes blocks corrected.
Reserved overwrite blocks corrected.
510 lost blocks collected into 510 files in lost+found.
Slide callouts: errors found on volume "backup1"; engage EE. The messages state that errors were corrected, but they are not corrected until changes are committed.
Checking volume vm3...
Phase [5.1]: Verify fsinfo blocks.
Phase [5.2]: Verify metadata indirect blocks.
Phase [5.3]: Scan inode file.
Phase [5.3a]: Scan inode file special files.
Phase [5.3a] time in seconds: 677
Phase [5.3b]: Scan inode file normal files.
(inodes 5%) … (inodes 95%)
Inode 98, indirect level 2, block 248395500, index 55: vbn 1188408166, vvbn 182692401, doesn't match container vbn 0. Clearing.
Inode 98, indirect level 1, block 39421980, index 79: vbn 841039069, vvbn 182691840, doesn't match container vbn 0. Clearing.
…[5 pages of these errors]…
Inode 98, indirect level 1, block 39423000, index 476: vbn 191173090, vvbn 182692374, doesn't match container vbn 0. Clearing.
Inode 98, indirect level 1, block 39423000, index 477: vbn 191173151, vvbn 182692846, doesn't match container vbn 0. Clearing.
(inodes 100%)
Phase [5.3b] time in seconds: 11411
…
Phase [5.6] time in seconds: 204
Volume backup2 WAFL_check time in seconds: 11927
(No filesystem state changed.)
Slide callouts: errors on volume vm3.
Phase 6: Clean up.
Phase 6a: Find lost nt streams.
Phase 6a time in seconds: 0
Phase 6b: Find lost files.
Phase 6b time in seconds: 4
Phase 6c: Find lost blocks.
Phase 6c time in seconds: 102
Phase 6d: Check blocks used.
FS info shows total blocks used to be 1006256127. Setting to 1006255103.
FS info, plane 0: shows blocks used to be 1006256127. Setting to 1006255103.
Phase 6d time in seconds: 3
Phase 6 time in seconds: 110
Clearing inconsistency flag on aggregate aggr0.
WAFL_check total time in seconds: 68305
21 error messages suppressed from console
1024 error messages discarded
Indirect blocks cleared: 1
Block counts corrected: 1
Spacemap entries corrected: 21
Total blocks used corrected.
Blocks used for 1 planes corrected.
1024 lost blocks freed because "lost_blocks" used.
Commit changes for aggregate aggr0 to disk? ? Please answer yes or no.
Commit changes for aggregate aggr0 to disk?
Slide callouts: Phase 6 finishes checking the flex-vols. The message states that the inconsistency flag is being cleared, but it is not really cleared until you choose to commit changes. Finally, the console prompts to commit changes.
Commit Prompt
Note that the commit prompt will wait forever for a choice of Y or N. This is important if you leave WAFL_check running overnight: if it completes, it will NOT default to yes or no and continue. Also, if it takes Escalations a while to analyze the WAFL_check logs, the prompt will wait. If the WAFL_check has some serious errors, it may take Sustaining some time (hours?) to analyze the data and choose how to proceed. Keep the customer informed of progress.
Note that if you have a noisy console port, random characters may be seen as choices. If you accidentally hit "N", the storage controller will not make changes and you will have to re-run the WAFL_check from the beginning.
NOTE: Commits are captured in a consistency point. That consistency point is only written to disk when
the node has been rebooted at least once with the NVRAM card present that was present during the
WAFL_check.
Same as WAFL_check except for:
Run from Data ONTAP or the special boot menu
Storage appliance is online
– Data for the aggregate being waflironed is unavailable during phase 1 (mounting), which can last a long time
Changes are automatically committed unless optional commit is used
Check status: aggr wafliron status -s
It is recommended to always use "-o" (to be discussed later in this module) to avoid unwanted changes
wafliron
system*> priv set diag
system*> aggr wafliron start [Aggr/Trad Vol] -v
system*> vol wafliron start [Trad Vol]
Wafliron
Wafliron can be run from the 1-5 menu, but normally it is run while Data ONTAP is up, allowing access to the volumes and aggregates that are not inconsistent.
The same factors that affect the speed of WAFL_check also affect wafliron. Another factor specific to phase 1 (the mounting phase) is the need to check all files in the root of each volume. If you put LUNs in the root of a volume, it will significantly increase the amount of time this phase takes to complete.
Wafliron (cont.)
Try not to use this option, as it can cause severe performance problems that affect access to user
data, especially if it is left adjusted after wafliron completes. Consult EE when considering this option.
The rate at which WAFL performs scans such as wafliron is governed by the WAFL scan speed. By
default, this speed is set to 2000. In 7.0.5 and later, the speed is automatically tuned by Data ONTAP
based on available system resources, so it normally does not need to be altered. However, it
can also be set manually using the advanced "wafl scan speed" command, which disables the
automatic tuning. To set the value back to the default and re-enable automatic tuning, set the WAFL
scan speed to 0.
NOTE: WAFL scan speed does not affect wafliron while it is in phase 1! In fact, setting the WAFL
scan speed too high during phase 1 can have a negative impact on phase 1 time to completion.
Manually increasing the WAFL scan speed value from the default allows the scanners to run faster
during phases 2 and 3, but it may have a negative performance impact on the storage controller, as the
WAFL scanners will require more system resources. The WAFL scan speed setting accepts values
of 1 - 100000, but the maximum recommended setting is 10000. Setting the speed to the maximum value
will adversely affect storage controller performance, up to and including loss of console control and
inaccessibility of data.
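As a rough illustration, the rules above (0 restores auto-tuning, 1-100000 is the accepted range, anything above 10000 is not recommended) can be expressed as a small validation helper. This is a hypothetical Python sketch of the documented behavior, not ONTAP code; the function name is ours.

```python
# Sketch of the documented "wafl scan speed" rules (not NetApp source code):
# 0 re-enables automatic tuning; 1-100000 sets a manual speed and disables
# auto-tuning; values above 10000 work but are not recommended.

def validate_scan_speed(speed):
    """Return (accepted_value, auto_tuning_enabled, warning_or_None)."""
    if speed == 0:
        return 0, True, None                  # back to default / auto-tuned
    if not 1 <= speed <= 100000:
        raise ValueError("wafl scan speed accepts 0 or 1-100000")
    warning = None
    if speed > 10000:
        warning = "above recommended maximum; may impact controller performance"
    return speed, False, warning
```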
Wafliron example
f825-rtp*> aggr wafliron start aggrtm
Wed Oct 31 16:32:34 GMT [wafl.iron.start:notice]: Starting wafliron on
aggregate aggrtm.
Wed Oct 31 16:32:34 GMT [wafl.iron.start:notice]: Starting wafliron on
volume voltm.
Wed Oct 31 16:32:35 GMT [wafl.iron.start:notice]: Starting wafliron on
volume voltms.
Wafliron Errors
If there are issues during or after wafliron completes, send the wafliron log to Sustaining
Engineering for review.
system*> priv set diag
system*> aggr wafliron start [Aggr/Trad Vol] -o
system*> vol wafliron start [Trad Vol]
Wafliron status
Filer*> aggr wafliron status -s
Optional commit mode is enabled
Status files redirected to volume vol0
15 files, 1326378 blks total, 454055 blks used
Total mount phase of aggregate aggr2_500 took 4h 7m 27s 74ms.
Wafliron scan stats for aggregate: aggr2_500
20 files done
12 directories done
32718 inodes done
1895094869 blocks done
Scanning 86% done
Wafliron is active on aggregate aggr2_500 : Checking Inodes
Wafliron status
This percentage is not a good indicator of how quickly wafliron will complete; it can sit at 100% complete for
hours. Again: do not tell the customer how long, or how much longer, it will take.
Primary options:
Run WAFL_check or wafliron from Special
Boot Menu
– Cannot run wafliron on the root aggr while
booted into DOT
Boot to an alternate root volume and then
wafliron the original root
You can run wafliron on a root volume or aggregate by starting it from the special boot menu. After
Phase 1 completes and the volumes can be mounted, the storage controller will boot to Data ONTAP
and finish the wafliron.
Benefits compared across the tools:
Tools: WAFL_check; wafliron pre-7.3.1 (or 7.3.1+ without optional commit); wafliron 7.3.1+ with optional commit (IOC); wafliron with optional commit on 8.0+ and 8.1+
Benefits:
– Speed to get data consistent
– Speed to get data online
– Choose to commit or not commit changes
– General recommendation from NGS
– Minimal impact to node uptime
Use the recovery advisor to assist in making the decision between WAFL_check and wafliron.
WAFL_check is going away, to be replaced by wafliron with optional commit.
After wafliron-ing a SnapMirror source
metafile scan is run on flex vols
SnapMirror will not run updates until scans
complete
Check scan status with wafl scan status
SnapMirror and SnapVault Destinations are read-only
To run wafliron, the -f option must be used
– Relationships will be broken-off if changes are made
– Attempt to resync relationships after wafliron is complete
(reinitialization may be necessary)
WAFL_check can be used
– Relationships will be broken-off if changes are made
– A "block type initialization" scan is performed following
WAFL_check
– Resync cannot occur until scan completes
Or - destroy the SM destination and re-initialize
wafliron -f
-f is a “best attempt” option to preserve the SnapMirror relationship. There will be cases where
re-initialization is required after wafliron completes.
Internal Note: The official position from EPS is to discuss “destroy or re-init” with the customer FIRST.
Cannot repair snapshot corruption, or L0 user data corruption
– This is functionally no different than wafliron
– Metadata only!
– You can use WAFL_check to delete snapshots, but…
In order to do so, you have to allow it to WAFL_check the
entire file system!
As of version 7.3.1, you can delete snapshots from the
volumes in the root aggregate from the Special Boot
Menu… one less reason to run WAFL_check.
– boot_snap_delete
Similar capabilities exist for non-root aggregates in 7.3.1
and beyond, but they are NOT executed from the Special Boot
Menu; they are run from the normal Data ONTAP CLI
– priv set diag; vol online volXXX -r or priv set diag; aggr online
aggrXXX -r
Changes in Tools!
NEW: Wafliron with optional commit (a.k.a. IOC)
– “aggr wafliron start -o”
– Can only be done when the root aggregate and volume are online at this time
“IOC on Root aggregate” reportedly available in 7.3.3
– Stores changes in root volume instead of memory
– Changes are committed or rejected in response to command line options (human decision
driven)
– This is the “new WAFL_check”, for use when you need or want to confirm changes with
human/manual intervention (and in place of WAFL_check as of Data ONTAP 8.1)
Wafliron “on demand” for files:
– Instead of the entire file (all 2TB of that LUN) needing to be ironed before presenting to
the client, it can do this on demand after Phase 1 (7.3.3/8.0.1)
wafliron (and IOC) are the preferred tools, per Engineering
– WAFL_check will be deprecated and removed as of Data ONTAP 8.1
REP tools
– aggr block check (7.3.2)
– vol vbncheck
– file check
– These are specific to files… and help identify files that contain corrupt user data blocks
that WAFL_check cannot repair.
You may currently see messages like this in 7.3.1+
– Sun Oct 12 03:15:21 PDT [wafl.raid.incons.userdata:error]: WAFL
inconsistent: bad user data block 1662157 (vvbn:0 fbn:0 level:0) in
inode (fileid:586005 snapid:0 file_type:1 disk_flags:0x202) in volume
vol1.
Details
Additional samples:
Level = 0 … userdata
Level ≠ 0 … metadata
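When scanning message logs, the level rule above can be applied mechanically by pulling the level: field out of a wafl.raid.incons.* message. This is an illustrative Python sketch, not a NetApp tool; the function name is ours.

```python
import re

# Illustrative helper: classify a wafl.raid.incons.* message as user data
# vs. metadata corruption, based on its "level:" field (0 = user data).

def classify_incons(message):
    m = re.search(r"level:(\d+)", message)
    if not m:
        return "unknown"
    return "userdata" if int(m.group(1)) == 0 else "metadata"

# sample message from this module
msg = ("[wafl.raid.incons.userdata:error]: WAFL inconsistent: bad user data "
       "block 1662157 (vvbn:0 fbn:0 level:0) in inode (fileid:586005 "
       "snapid:0 file_type:1 disk_flags:0x202) in volume vol1.")
```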
Why panic the filer if the tools can’t fix it?
– This includes corrupted snapshots
– Another example: L0 user data corruption (damage is
not to metadata) is addressed via “RAID Error
Propagation”
REP V1: Instead of panic, we present the client with a
zeroed block
– 7.3.1.x
REP V2: REP V1, with enhancements, plus we start to
notify the affected client of the event. We also report
the failure a bit better, and add more tools to recover
– 7.3.2+
REP will react without using “WAFL inconsistency” to
remove access to the file system…but you WILL “see”
this if you’re paying attention to your filer…
REP Impact
Results:
– Starting in 7.2 releases, panics due to corrupt
blocks have been decreasing.
– Starting in 7.3 releases, panics due to
corruptions of various types should be unexpected,
except in cases of severe corruption.
– Starting in 7.3.1, you may see
messages that say “WAFL inconsistent”, but there
will be no file systems (volumes or aggregates) marked
“offline” or “restricted”, and they may still be
serving data!
In these cases, there is no requirement, suggestion,
or recommendation to run WAFL_check. Doing so
will only cause user downtime and embarrassment for NetApp.
Module Review
What are some possible causes of data inconsistency?
Hardware: Storage, Memory, CPU, Network
Software: WAFL, SnapMirror/SnapVault, Memory, RAID
What ways can you use to correct Data Inconsistency?
WAFL_check and wafliron, rebuild from backup, SnapMirror fixer
What should be done before running WAFL_check or wafliron?
Make sure the cause of the problem has been corrected.
Corrupt L1 block … should WAFL_check be run for this error? (True or False)
Should WAFL_check be run for this error? (True or False)
– Sun Oct 12 03:15:21 PDT [wafl.raid.incons.userdata:error]:
WAFL inconsistent: bad user data block 1662157 (vvbn:0 fbn:0
level:0) in inode (fileid:586005 snapid:0 file_type:1
disk_flags:0x202) in volume vol1.
Labs
Lab 6-1
Lab 6-2
NetApp Confidential — Limited Use
Hardware and Down Storage System Troubleshooting for Partners
Troubleshooting Loop Issues
Module Objectives
By the end of this module, you should be able to:
Describe how RAID handles disk errors
Show how to look up and use SCSI codes
Identify disk and media errors
Compare the various statistics and data that can
be used to solve FC-AL problems
Module Topics
This module contains the following sections:
Disks
Media Errors
SCSI Codes
Maintenance Center
FC-AL theory
FC-AL components
Resources to analyze FC-AL issues
Steps in troubleshooting a loop problem
Disks
Disks are commodities
We use disks from several vendors
Data ONTAP uses software RAID to protect data
Disks may have errors
Disks will eventually break
Disks will normally run for tens of thousands of hours without problems, but if they do fail we do not
attempt to fix them; we simply replace them. We use disks from Seagate, Maxtor, Hitachi, and other
vendors, just like other storage companies.
Software RAID
Data ONTAP uses software (not hardware) RAID to protect data. RAID keeps track of errors and can
correct errors. So if a disk or two have errors or fail, RAID protects us from any data loss.
We understand that from time to time a disk may fail. We simply reconstruct onto a spare and life goes
on. We do not need to troubleshoot every disk failure. But if there is a high number of disk failures, or
certain types of failures, we can use the disk errors to troubleshoot the reason for the failures. We can
also use the errors to understand the order of disk failures and to find which disk may need to be
brought back to life to allow a reconstruction to complete.
Media Errors
When a read or write cannot occur on the first
attempt, it may be classified as a media error
An internal problem that causes a disk to be
unable to perform the requested read or write
More common on reads than writes
Can occur on write if disk cannot locate position
to write data
Recoverable or unrecoverable
A small number of media errors are expected
Media Errors
Definition:
A media error is an event where the SCSI type disk (that implements the upper level protocol of SCSI
over either a parallel bus or over Fibre Channel) was unable to perform the requested I/O operation
because of problems accessing the stored data.
More common than unrecoverable errors
2 types of recovered errors are:
– A retry may have succeeded
– Error correction may have been performed
If the error rates exceed a certain threshold, then
the Storage Health Monitor (SHM) will generate
an AutoSupport
The sector involved in a recovered error is not
reassigned
End result: disk was able to provide the
requested data with no intervention by RAID
Recovered Errors
Note that disks always record error correction codes with data. These correction codes are used to
reliably reproduce missing data bits. As a disk reads data, it computes a CRC for the data being read; it
then compares this CRC to that stored with the data. If they do not match, then error correction codes
may be applied to reproduce the missing bits.
Data recovery generally is a series of steps from retrying, to repositioning and retrying, and applying error
correction codes. If one of these steps is successful, then the disk considers this a recovered data
operation. If all measures of retrieving or reproducing data fail, then this becomes an unrecoverable data
operation.
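The recovery sequence described above (retry, then reposition and retry, then apply error correction codes) can be modeled as a simple cascade. This is a conceptual Python sketch; the step names and the read_fn interface are illustrative, not drive firmware.

```python
# Conceptual model of in-drive data recovery: each escalating step is tried
# in order; success at any step yields a "recovered error", and exhausting
# every step yields an "unrecovered error" that RAID must repair from parity.

def attempt_read(read_fn, steps=("retry", "reposition_retry", "apply_ecc")):
    """read_fn(step) returns data on success or None; step None = first try."""
    data = read_fn(None)
    if data is not None:
        return data, "ok"                    # no recovery needed
    for step in steps:
        data = read_fn(step)
        if data is not None:
            return data, "recovered error"   # the drive fixed it itself
    return None, "unrecovered error"         # RAID must rebuild from parity
```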
Example 1:
line 1: [ispfc_main:error]: Disk 8a.22: op
0x28:0002c118:0080 sector 180549 recovered error - (1 17, 3)
line 2: [scsi.cmd.checkCondition:error]: Device 8a.22: Check
Condition: CDB 0x28:0002c118:0080: Sense Data recovered
error - (0x1 - 0x17 0x3 0xe4).
Example 2:
Sun Nov 25 04:18:31 CET [filer: ispfc_main:error]: Disk
7a.35: op 0x28:06119c60:0080 sector 101817489 recovered
error - (1 16, 0)
Sun Nov 25 04:18:31 CET [filer:
scsi.cmd.checkCondition:error]: Device 7a.35: Check
Condition: CDB 0x28:06119c60:0080: Sense Data recovered
error - (0x1 - 0x16 0x0 0xd2).
Example Explanation:
line 1: The adapter driver reports a recovered error (sense data 1 17, 3) during a read operation (op
0x28) on drive 8a.22, sector 180549. This means the drive had to do some additional work to read the
data in this sector. The sector remains valid and is not reassigned. The three-part code (1 17, 3) is the
sense data reported by the drive (sense key, ASC, ASCQ). This code is translated into the human-readable
meaning "recovered error" by the system.
line 2: The SCSI layer also reports the same error. The SCSI layer reports the sense data in hex and
adds a fourth code called a FRU. The FRU is used by the drive vendor.
RAID handles disk errors by computing parity
Computed data is re-assigned to new block on
disk
If this re-write of data fails – the disk is failed
If the reassignment is successful, then the new
physical location associated with the logical block
address is rewritten
Since media faults are expected they can be correctly handled when disks are part of a RAID storage
system. The Storage Controller handles these events in the following manner:
Firstly, the Storage Controller’s software looks at the occurrence of recoverable and unrecoverable block
errors with respect to the data transfer rate. If the error rates exceed a certain threshold, then a module
called the Storage Health Monitor (SHM) will generate an AutoSupport. This SHM actually checks a
variety of parameters besides these error rates, looking for things like excessive time to completion of
I/O's, or excessive timeouts.
If the management function of reassigning the Logical Block Address (LBA) to a different physical location
fails for any reason, an error is reported. The Storage Controller responds to this error by failing
the disk. A disk that cannot successfully reassign a bad block is no longer used.
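RAID's ability to recompute a bad block rests on single parity being the XOR of the data blocks in a stripe, so a lost block equals the XOR of parity with the surviving blocks. The following toy Python sketch shows the arithmetic on byte strings; it is an illustration of the principle, not real RAID code.

```python
# Minimal illustration of single-parity reconstruction: parity is the XOR
# of the stripe's data blocks, so any one missing block can be recomputed
# by XOR-ing parity with the surviving blocks.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild_block(surviving_blocks, parity):
    """Recompute one missing data block from parity plus the other blocks."""
    result = parity
    for block in surviving_blocks:
        result = xor_bytes(result, block)
    return result

# a stripe of three toy data blocks plus its parity
d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
parity = xor_bytes(xor_bytes(d0, d1), d2)
```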
Unrecovered Error
line 1: [ispfc_main:error]: Disk 7a.38: op
0x28:023d35b0:0008 sector 37565872 medium error -
Unrecovered read error (3 11, 0)
line 2: [ispfc_main:notice]: Disk 7a.38: sector
37565872 will be reassigned
line 3: [ispfc_main:error]: Medium err for disk 7a.38
(serial no. 3EK02X9K000072011ACT).
line 4: [scsi.cmd.checkCondition:error]: Device
7a.38: Check Condition: CDB 0x28:023d35b0:0008: Sense
Data medium error - Unrecovered read error (0x3 -
0x11 0x0 0xe4).
line 5: [raid_stripe_owner:info]: read error from
disk 7a.38 (S/N 3EK02X9K000072011ACT) block 4695734
line 6: [raid_stripe_owner:notice]: Rewriting bad
block from parity on disk 7a.38, block #4695734
line 7: [ispfc_main:notice]: Disk 7a.38: sector
37565872 was reassigned
Example Explanation
line 1: The adapter driver reports an unrecovered read error (sense data 3 11, 0), during a read operation
(op 0x28) on drive 7a.38, sector 37565872. This means the drive was not able to provide the data
requested from this sector. The three-part code (3 11, 0) is the sense data reported by the drive (sense
key, ASC, ASCQ). This code is translated into the human-readable meaning "unrecovered read error" by the system.
line 2: Reports the sector in question will be reassigned.
line 3: The adapter driver reports the serial number of the drive with the unrecovered read error.
line 4: The SCSI layer also reports the same error. The SCSI layer reports the sense data in hex and
adds a fourth code called a FRU. The FRU is used by the drive vendor.
line 5: The RAID layer reports a read error on this disk in block 4695734. This is the block that was stored
on sector 37565872.
line 6: The RAID layer reports that the data in the bad block was rewritten from parity.
line 7: The adapter driver reports the sector was successfully reassigned. The bad sector, 37565872, will
not be used again.
Note: Once the bad block is rewritten from parity, no further action is required. The drive will continue to
run normally and the data is safe. Do NOT fail a drive for this error.
Reconstruction of data occurs using parity
Parity cannot be computed if:
– RAID-4 in reconstruction and a media error occurs
on 2nd disk
– RAID-DP in double reconstruction and a media
error occurs on a 3rd disk
– RAID-DP in single reconstruction and a media error
occurs on the same block on a 2nd and 3rd disk
simultaneously
If parity cannot be calculated, a Multi-Disk Panic
(MDP) occurs
Multi-Disk Panic
An internal halt (panic) is the end result of a situation where parity is needed but cannot be calculated.
This is why failing a disk for media faults is not a good idea. While a disk may have a media fault on a
sector, it is extremely unlikely that another disk will have a media fault at the same block address. But if
you fail a disk, a media fault on *any* sector of another disk effectively becomes a double disk failure.
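The conditions above reduce to a simple rule: a stripe is unrecoverable when the number of simultaneously unreadable elements in it exceeds the parity protection (one for RAID-4, two for RAID-DP). A hypothetical Python sketch of that rule follows; the names are ours, not ONTAP code.

```python
# Sketch of the MDP rule above: recovery is possible only while the count
# of simultaneously missing elements in a stripe (failed disks plus media
# errors hitting that stripe) stays within the parity protection.

PARITY_DISKS = {"raid4": 1, "raid_dp": 2}

def stripe_recoverable(raid_type, missing_in_stripe):
    """missing_in_stripe counts failed disks plus simultaneous media
    errors on the same stripe."""
    return missing_in_stripe <= PARITY_DISKS[raid_type]
```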
SCSI Codes
SCSI Codes are returned by SCSI devices such as disks. They
explain the nature of a failure.
Error in messages:
[scsi.cmd.notReadyCondition:notice]: Device
2b.22: Device returns not yet ready: CDB
0x2f:0012a200:0480: Sense Data SCSI:not ready
-0x2 - 0x4 0x7 0x2a)(2661).
SCSI Codes
Sense Key Description
00h NO SENSE
01h RECOVERED ERROR
02h NOT READY
03h MEDIUM ERROR
04h HARDWARE ERROR
05h ILLEGAL REQUEST
06h UNIT ATTENTION
08h BLANK CHECK
0Bh ABORTED COMMAND
0Dh VOLUME OVERFLOW
http://en.wikipedia.org/wiki/Key_Code_Qualifier
00h NO SENSE Indicates that there is no specific sense key information to be reported for the
designated device. This would be the case for a successful command.
01h RECOVERED ERROR Indicates that the last command completed successfully with some
recovery action performed. Read operations - An ECC correction or retry was done.
Write operations - Not enough blocks were sent to the device to meet the minimum track length (300
blocks).
02h NOT READY Indicates that the device addressed cannot be accessed. Operator intervention
may be required to correct this condition.
03h MEDIUM ERROR Indicates that the command terminated with a non-recovered error
condition that was probably caused by a flaw in the medium or an error in the recorded data.
04h HARDWARE ERROR Indicates that the device has detected a hardware failure (e.g. controller
or device failure, parity error, etc) while performing the command or during a self-test.
05h ILLEGAL REQUEST Indicates that there was an illegal parameter in the Command Descriptor
Block or in the additional parameters supplied as data for some commands (e.g. MODE SELECT)
06h UNIT ATTENTION Indicates that either the disc or the drive operating parameters may have
been changed (by a MODE SELECT command from another initiator or reset) since the last command
was issued by this initiator.
08h BLANK CHECK Indicates that the drive encountered blank medium or format-defined end of data
indication while reading or the drive encountered a non-blank medium while writing.
0Bh ABORTED COMMAND Indicates that the device aborted the command. The host may be able to
recover by trying the command again.
0Dh VOLUME OVERFLOW Indicates that the device has reached the end-of-volume during a write or
read operation.
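The sense key table above can be kept at hand as a lookup dictionary when reading logs. An illustrative Python snippet (subset only; see the Key Code Qualifier reference for the full list):

```python
# The sense-key table from this section as a lookup dict. Subset only;
# keys not listed here are reported as unknown/reserved.

SENSE_KEYS = {
    0x00: "NO SENSE",
    0x01: "RECOVERED ERROR",
    0x02: "NOT READY",
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
    0x05: "ILLEGAL REQUEST",
    0x06: "UNIT ATTENTION",
    0x08: "BLANK CHECK",
    0x0B: "ABORTED COMMAND",
    0x0D: "VOLUME OVERFLOW",
}

def describe_sense_key(key):
    return SENSE_KEYS.get(key, "UNKNOWN/RESERVED")
```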
ASC ASCQ Description
00h 00h NO ADDITIONAL SENSE INFORMATION
00h 06h I/O PROCESS TERMINATED
00h 11h AUDIO PLAY OPERATION IN PROGRESS
00h 12h AUDIO PLAY OPERATION PAUSED
00h 13h AUDIO PLAY OPERATION SUCCESSFULLY COMPLETED
00h 14h AUDIO PLAY OPERATION STOPPED DUE TO ERROR
00h 15h NO CURRENT AUDIO STATUS TO RETURN
00h 16h OPERATION IN PROGRESS
00h 17h CLEANING REQUESTED
02h 00h NO SEEK COMPLETE
04h 00h LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE
04h 01h LOGICAL UNIT IS IN PROCESS OF BECOMING READY
04h 02h LOGICAL UNIT NOT READY, INITIALIZING CMD. REQUIRED
04h 03h LOGICAL UNIT NOT READY, MANUAL INTERVENTION REQUIRED
There are many more than listed here. See here for a more complete list:
http://en.wikipedia.org/wiki/Key_Code_Qualifier
04h 04h LOGICAL UNIT NOT READY, FORMAT IN PROGRESS
04h 07h LOGICAL UNIT NOT READY, OPERATION IN PROGRESS
04h 08h LOGICAL UNIT NOT READY, LONG WRITE IN PROGRESS
04h 09h LOGICAL UNIT NOT READY, SELF-TEST IN PROGRESS
05h 00h LOGICAL UNIT DOES NOT RESPOND TO SELECTION
06h 00h NO REFERENCE POSITION FOUND
07h 00h MULTIPLE PERIPHERAL DEVICES SELECTED
08h 00h LOGICAL UNIT COMMUNICATION FAILURE
08h 01h LOGICAL UNIT COMMUNICATION TIME-OUT
08h 02h LOGICAL UNIT COMMUNICATION PARITY ERROR
08h 03h LOGICAL UNIT COMMUNICATION CRC ERROR (ULTRA-DMA/32)
08h 04h UNREACHABLE COPY TARGET
09h 00h TRACK FOLLOWING ERROR
09h 01h TRACKING SERVO FAILURE
09h 02h FOCUS SERVO FAILURE
09h 03h SPINDLE SERVO FAILURE
Error Message from logs:
Sun Nov 25 04:18:31 CET [filer:
scsi.cmd.checkCondition:error]: Device
7a.35: Check Condition: CDB
0x28:06119c60:0080: Sense Data recovered
error - (0x1 - 0x16 0x0 0xd2).
Syslog translator:
Device [deviceName]: Check Condition: CDB
[cdb]: Sense Data [sSenseKey] -
[sSenseCode] (0x[iSenseKey] - 0x[iASC]
0x[iASCQ] 0x[iFRU])([DTime]).
SenseKey Charts
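Given the syslog translator format above, the sense fields can be pulled out of a checkCondition line mechanically. The following is an illustrative Python parser; the regex and names are ours, not NetApp's.

```python
import re

# Illustrative parser for the checkCondition log format shown above; the
# field layout follows the syslog translator (sense key, ASC, ASCQ, FRU).

PATTERN = re.compile(
    r"Device (?P<dev>\S+): Check Condition: CDB (?P<cdb>\S+): "
    r"Sense Data (?P<desc>.+?) - \(0x(?P<key>[0-9a-f]+) - "
    r"0x(?P<asc>[0-9a-f]+) 0x(?P<ascq>[0-9a-f]+) 0x(?P<fru>[0-9a-f]+)\)"
)

def parse_check_condition(log_line):
    """Return a dict of fields, with the hex codes as integers, or None."""
    m = PATTERN.search(log_line)
    if not m:
        return None
    d = m.groupdict()
    for field in ("key", "asc", "ascq", "fru"):
        d[field] = int(d[field], 16)
    return d

# example line from this module
line = ("Device 7a.35: Check Condition: CDB 0x28:06119c60:0080: "
        "Sense Data recovered error - (0x1 - 0x16 0x0 0xd2).")
```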
Maintenance Center
Data ONTAP has a very aggressive disk failing policy
Maintenance Center purpose:
– Improve storage reliability
– Reduce the number of disk returns due to transient
errors
Maintenance Center qualifications
– The list of errors and thresholds is based on
the Data ONTAP version
– If errors are known to be fatal, the disk is
immediately failed
Maintenance Center
The Maintenance Center has a very minimal performance impact on the NetApp Storage Controller. Many of the
Maintenance Center diagnostics tests are executed directly by the drive instead of requiring CPU resources from the
head.
Data ONTAP has a defined set of errors and thresholds, which are used to select disks for maintenance. This set of
thresholds and errors may vary between releases as they are modified based on new information. Disks that receive
errors, which are known fatal errors, will not go into maintenance testing and will be failed.
User-data is migrated from the disk onto a spare
through reconstruction or Rapid RAID recovery
Disk is removed from the RAID group
Disk is sent to the Maintenance Center
Messages sent to console (no ASUP)
Disk is tested in the background
– If transient errors are repaired – disk returns to
spares pool
– If it can not be repaired – disk is failed
Drive can only go through Maintenance Center
once
Instead of the disk being failed and an AutoSupport RMA case being generated, the disk is removed from the current
aggregate and sent to the Maintenance Center.
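The dispatch rules in this section (known-fatal errors fail the disk immediately; a drive goes through the Maintenance Center at most once) can be sketched as a small decision function. This is hypothetical Python with invented names, purely to restate the logic.

```python
# Sketch of the Maintenance Center dispatch logic described above (the
# threshold sets and names are illustrative, not ONTAP code).

def dispatch_disk(error, fatal_errors, visited_mc_before):
    """Decide what happens to a misbehaving disk."""
    if error in fatal_errors:
        return "fail disk"                   # known-fatal: no testing
    if visited_mc_before:
        return "fail disk"                   # only one MC visit is allowed
    return "send to maintenance center"      # migrate data off, then test
```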
What is FC-AL?
Fibre Channel Arbitrated Loop
Loop or Ring -- devices attached to nearest neighbors
in a loop or ring via a single link
– ingress wire upstream
– egress wire downstream
Shared infrastructure must be arbitrated for
Virtual circuit established between endpoints before
communication can occur
Distributed Loop Initialization Primitive (LIP) to assign
addresses and determine device presence
LIP and normal traffic cannot coexist
No separately addressable traffic cop or management entity to make every device play by the rules
Any device can disrupt inbound or outbound traffic, even if it isn't part of the virtual circuit
A LIP, if long-running or too frequent, can be fatal to I/O progress
ESHs (Embedded Switching Hubs) do help a lot with the above
No Traffic Cop
There is no separately addressable "traffic cop" or management entity to make every device play by the rules. This means that any device can disrupt inbound or outbound traffic, even if it is not part of the virtual circuit. Because all communication travels through each device (store-and-forward), a device that is neither initiating a command nor responding to one is still involved: it must relay the data, so a malfunctioning device can cause loop problems. ESHs (Embedded Switching Hubs) help considerably by introducing intelligent circuitry that can monitor the loop and bypass misbehaving disks to stop them from causing problems.
Loop Initialization
A LIP is a normal and good event, but if long-running or too frequent it can be fatal to I/O progress.
True FC-AL has no escalation process to reset
devices
FC-AL only has an Abort Transaction Sequence
(ABTS) to abort an operation
FC-AL devices have an Abort Transaction Sequence (ABTS), which can be issued from the host to terminate an operation. Originally there was no other escalation process to reset devices.
The FC-AL standard was modified to allow for a device reset during loop initialization.
LIP is a normal part of our FC-AL
LIP sends a number of initialize commands to set all ports to the monitoring state; no activity can occur during this time
After loop initialization, activity on the loop can return to normal
Invalid characters in transmission are not passed along the loop by a loop port
A LIP may be sent to reset all devices on the loop, which can lead to a LIP storm and similar issues
LIP
The Loop Initialization Primitive is used to initialize the loop prior to the start of loop operations or when configuration changes are detected. The primary functions performed by loop initialization are:
• Temporarily suspend loop operations and set the Loop Port State Machines in all loop ports to a known state (open init state).
• Determine if loop-capable ports are connected to an operational loop environment.
• Manage the assignment of AL_PA values and, consequently, Loop IDs.
• Provide notification of loop failures or potential hang conditions.
• Fill the loop with initial IDLEs and return all loop ports to the monitoring state.
Loop initialization does not perform a reset action unless specifically requested. This allows initialization to occur on an active loop: it temporarily suspends any operations in progress, performs loop initialization, then allows resumption of the suspended operations. The Loop Initialization Primitive (LIP) sequence and a series of loop initialization frames are used to accomplish loop initialization. LISM frames are transmitted through the loop to determine a temporary loop master. The port with the lowest WWN becomes temporary loop master.
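The loop-master selection step above can be sketched in a few lines of Python. This is only an illustration of the rule "lowest WWN wins"; the WWN values below are invented for the example:

```python
# Simplified sketch of LISM loop-master selection: during loop
# initialization each port transmits LISM frames carrying its WWN,
# and the port with the lowest WWN becomes temporary loop master.
# The WWNs below are invented for illustration.

def select_loop_master(ports):
    """Return the WWN of the port that wins temporary loop mastership."""
    # Equal-length hex strings compare correctly lexicographically.
    return min(ports)

ports = [
    "50:05:0c:c0:02:00:0f:a3",  # shelf module
    "21:00:00:e0:8b:12:34:56",  # host adapter
    "22:00:00:04:cf:ab:cd:ef",  # disk drive
]
print(select_loop_master(ports))  # → 21:00:00:e0:8b:12:34:56
```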
LIP Storm
If, following a LIP event, a device initiates another LIP, no traffic can flow. If this occurs continuously, the situation is known as a LIP storm; LIP storms were a highly visible problem for NetApp in the past. These problems have largely been resolved by hardware and firmware improvements, and LIP storms are rarely seen today.
Loop Arbitration
An idle sequence is replaced with the arbitration primitive (ARB). When the sender receives its own ARB, it is clear to send.
[Diagram: a loop of devices with AL_PAs 7, 9, 13, 15, 23, and 31; device 7 transmits ARB(7) around the loop.]
Loop Arbitration
In the diagram above, a storage controller with an arbitrated loop physical address of 7 sends an arbitration. That
arbitration goes to each device in the loop. Once it is received back at the storage controller, the arbitration is won
and the node is clear to send frames.
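The round trip of the ARB primitive can be modeled with a short Python sketch (a simplified illustration, not real FC-AL signaling; the AL_PA values match the diagram):

```python
# Minimal sketch of FC-AL arbitration: a node replaces IDLEs with its
# ARB primitive, each device forwards it, and when the sender receives
# its own ARB back it has won arbitration.

def arbitrate(loop, sender):
    """Forward ARB(sender) around the loop; return the number of hops
    until the sender sees its own ARB come back (arbitration won)."""
    hops = 0
    position = loop.index(sender)
    while True:
        position = (position + 1) % len(loop)   # ARB moves to next device
        hops += 1
        if loop[position] == sender:            # ARB returned to sender
            return hops

loop = [7, 31, 9, 23, 13, 15]  # AL_PAs in loop order, as in the diagram
print(arbitrate(loop, 7))      # → 6 (the ARB traverses all six devices)
```

Note that the hop count always equals the loop size: every device on the loop handles the ARB, which is exactly why a single misbehaving device can disrupt arbitration for everyone.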
Loop Arbitration
Once arbitration is won, the node transmits a SCSI read request.
[Diagram: the same loop; device 7 sends the read request, and disk 13 arbitrates with ARB(13) to return the read response.]
Loop Arbitration
The storage controller has won arbitration and sends a SCSI read request. Once the read is sent, the storage
controller relinquishes arbitration by sending IDLE primitives. The disk drive must then win arbitration to respond to
the read.
[Diagram: the controller (AL_PA 7) issues the read request; disk 13 arbitrates with ARB(13) and sends the read response back around the loop.]
Signal begins at the Storage Controller HBA and flows to the IN port on the first shelf module.
Signal flows from IN port to the first disk in the shelf and then back to the IN port.
Signal travels to each of the next disks in the shelf, one at a time, and back to the IN port each time.
When the end of the shelf is reached the signal goes from the IN port on the module to the OUT port.
If the OUT port is terminated the signal travels back to the IN port on the module so it can be sent to the
Storage Controller HBA.
If the OUT port is not terminated the signal travels from the OUT port to the IN port on the module in the
next disk shelf in the loop and the same process is repeated.
[Diagram: a read request for Disk 2 traveling the loop.]
This is how signals actually flow in an FC-AL loop. Every signal passes through every device, no matter which device is being read from or written to.
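The signal path described in the steps above can be sketched as a traversal in Python (a simplified model of one shelf, with disk IDs borrowed from the ESH2 A-loop example that follows):

```python
# Sketch of the per-shelf signal path: the signal enters at the IN
# port, visits every disk in turn (returning to the IN port after
# each), then exits at the OUT port — looping back to the HBA if the
# OUT port is terminated, or on to the next shelf if not.

def signal_path(disks, out_terminated=True):
    """Return the ordered list of stops a frame makes in one shelf."""
    path = ["IN"]
    for disk in disks:            # IN port hands the signal to each disk
        path += [disk, "IN"]      # ...and it returns to the IN port
    path.append("OUT")            # end of shelf: IN to OUT port
    if out_terminated:
        path.append("back to HBA")    # terminated OUT loops back
    else:
        path.append("next shelf IN")  # otherwise on to the next shelf
    return path

disks = [f"2a.{n}" for n in range(16, 30)]   # 14 disks, lowest to highest
print(len(signal_path(disks)))               # → 31 stops (14*2 + 3)
```

The point of the model: the stop count grows with every disk, terminated or not, which is why any single component can corrupt traffic for the whole loop.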
A Loop
[Diagram: the HBA connects to the IN port of an ESH2; disks 2a.16 through 2a.29 occupy bays 1-14 of the shelf, and the signal exits at the OUT port.]
•Signal begins at the HBA and flows to the IN port on the ESH2
•Signal flows from the IN port to the disks (lowest to highest) within the shelf
•Then flows to the ESH2 OUT port
•The OUT port is terminated, so the signal travels across the DS14MK2 internal RX (receive) line from the OUT port to the IN port
•Finally, it flows back to the Storage Controller HBA via the IN port, on the external cable's loop RX line
[Diagram: two shelves daisy-chained on the A loop — the HBA connects to the first shelf's ESH2 IN port, and that shelf's OUT port feeds the second shelf's IN port. Shelf 1 holds disks 2a.16-2a.29; shelf 2 holds disks 2a.32-2a.45.]
[Diagram: one shelf with both modules — the ESH2 A-loop module (OUT/IN) and the ESH2 B-loop module (IN/OUT) each connect to disks 2a.16-2a.29.]
•The B loop signal flows from “highest to lowest” disk numbers within the shelf
Troubleshooting Resources
From AutoSupport:
– /etc/messages (EMS if needed)
– sysconfig -a
– sysconfig -r
– fcadmin device_map (the ASUP shows "fc device map")
– storage (ESH/ESH2/ESH4), including the commands:
  storage show hub
  storage show disk -p
– environment (especially AT-FCX)
Other options:
– Visual inspection
– fcadmin link_stats (the ASUP shows "fc link stats")
– Analyze shelf logs
Troubleshooting Resources
Statistics
AutoSupport shows this output under "fc link_stats"; the command that generates it is fcadmin link_stats.
AutoSupport shows this output under "fc device_map"; the command that generates it is fcadmin device_map.
Visual Inspection
Module lights
Crimped or loose cables
Disk Lights
Loop Map for channel 1c:
Translated Map: Port Count 86
0 97 98 99 96 100 101 102 103 104 106 107 105 108 109 81
82 83 80 84 85 86 87 88 90 91 89 92 93 65 66 67
64 68 69 70 71 72 74 75 73 76 77 49 50 51 48 52
53 54 55 56 58 59 57 60 61 33 34 35 32 36 37 38
39 40 42 43 41 44 45 17 18 19 16 20 21 22 23 24
26 27 25 28 29 7
Shelf mapping:
Shelf 1: 29 28 27 26 25 24 23 22 21 20 19 18 17 16
Shelf 2: 45 44 43 42 41 40 39 38 37 36 35 34 33 32
Shelf 3: 61 60 59 58 57 56 55 54 53 52 51 50 49 48
Shelf 4: 77 76 75 74 73 72 71 70 69 68 67 66 65 64
Shelf 5: 93 92 91 90 89 88 87 86 85 84 83 82 81 80
Shelf 6: 109 108 107 106 105 104 103 102 101 100 99 98 97 96
Loop Map for channel 1d:
Translated Map: Port Count 98
0 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
61 7
Shelf mapping:
Shelf 1: 29 28 27 26 25 24 23 22 21 20 19 18 17 16
Shelf 2: 45 44 43 42 41 40 39 38 37 36 35 34 33 32
Shelf 3: 61 60 59 58 57 56 55 54 53 52 51 50 49 48
SES devices
The SES devices show at the bottom of the output. In the case above, SES device 30 is missing. This is one of the A modules (14, 30, 46), not one of the B modules (15, 31, 47). Note that the alignment of the columns is skewed when an A module device is missing.
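The missing-module check above can be automated with a small Python sketch. This assumes the three-shelf layout shown, where the expected SES device IDs are 14/30/46 (A modules) and 15/31/47 (B modules); the parsing of real ASUP output is left out:

```python
# Hedged sketch: given the set of device IDs present on a loop (as seen
# in an "fcadmin device_map"-style output), report which expected SES
# (shelf module) IDs are absent. The expected IDs follow the three-shelf
# example in the text.

def missing_ses(present, a_modules=(14, 30, 46), b_modules=(15, 31, 47)):
    """Return the sorted list of expected SES device IDs not present."""
    return sorted(i for i in a_modules + b_modules if i not in present)

# Devices seen on the loop: disks 16-61 plus all SES devices except 30
present = set(range(14, 62)) - {30}
print(missing_ses(present))  # → [30], i.e. an A module is missing
```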
Shelf State Description
ONLINE Shelf is fully configured and operational
INIT REQD Shelf needs to configure one or both ESH modules
OFFLINE Contact has been lost with the shelf (SES drive access down)
MISSING Shelf has been removed from the system entirely (all paths)
FAILED A failure has occurred on the shelf
system4a> storage show disk -p
PRIMARY PORT SECONDARY PORT SHELF BAY
------- ---- --------- ---- ---------
0a.16 A 0c.16 A 1 0
0c.17 A 0a.17 A 1 1
0a.18 A 0c.18 A 1 2
0a.19 A 0c.19 A 1 3
0a.20 A 0c.20 A 1 4
0a.21 A 0c.21 A 1 5
0a.22 A 0c.22 A 1 6
0a.23 A 0c.23 A 1 7
0c.24 A 0a.24 A 1 8
0c.25 A 0a.25 A 1 9
0c.26 A 0a.26 A 1 10
Use this command to map out your cabling. Is the 0a loop connected to ESH module A or ESH module B? For MPHA configurations, what is the secondary loop for 0a? This can be especially useful when there are many loops. This output also shows the shelf and bay for each disk.
Note that this slide shows a misconfiguration. For a correct MPHA configuration, the 0a and 0c loops should be separated: if 0a is connected to ESH module A, then 0c should be connected to ESH module B.
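The misconfiguration check described above can be expressed as a short Python sketch. The sample lines are a stripped-down version of the `storage show disk -p` output on the slide (headers and dashes removed); a real parser would have to skip those:

```python
# Hedged sketch of the MPHA sanity check: parse simplified
# "storage show disk -p" lines and flag disks whose primary and
# secondary paths land on the same ESH module port (A/A or B/B) —
# the misconfiguration shown on the slide.

SAMPLE = """\
0a.16 A 0c.16 A 1 0
0c.17 A 0a.17 A 1 1
0a.18 A 0c.18 A 1 2
"""

def misconfigured(output):
    """Return the primary-path IDs of disks with both paths on one module."""
    bad = []
    for line in output.splitlines():
        primary, p_port, secondary, s_port, shelf, bay = line.split()
        if p_port == s_port:          # both paths on the same module port
            bad.append(primary)
    return bad

print(misconfigured(SAMPLE))  # → ['0a.16', '0c.17', '0a.18']
```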
State Description
OK Port is functioning normally
EMPTY No drive is present in bay
BYP/TBI Port failed loop test before insert and was not allowed into loop
BYP/XMIT Port bypassed due to transmitter fault
BYP/LIPF8 Port bypassed due to drive generating LIP F8’s
BYP/DTO Port bypassed due to Data Timeout errors
BYP/RLOS Port bypassed due to receiver loss of signal
BYP/CLOS Port bypassed due to comma loss of signal
BYP/RPRT Port bypassed due to redundant port connection
BYP/STALL Port bypassed due to excessive stall errors
BYP/WRD Port bypassed due to excessive word errors
BYP/CRC Port bypassed due to excessive CRC errors
BYP/CLK Port bypassed due to excessive clock delta
State Description
BYP/MIR Port bypassed due to cluster Mirror bit being set (check partner)
BYP/LIPF7 Port bypassed due to drive transmitting LIP F7's
BYP/GEN Port bypassed due to a "generic" error
BYP/MAN Port was manually bypassed (Mfg test only)
BYP/INIT Port bypassed due to firmware initialization at Start of Day
BYP/SELF Port bypassed due to drive self-bypass (disk detected a fault and takes itself off the loop)
BYP/GEN Port bypassed due to FW generic fault (ESH firmware issue)
BYP/MAN Port manually bypassed (esh diag commands)
BYP/LIP Port bypassed due to excessive LIPs from drive (faulty disk or unstable loop)
BYP/OSC Port bypassed due to excessive port oscillations
BYP/BZR Port bypassed due to bad zone recovery
BYP/FLT Port bypassed due to drive fault
State Description
BYP/PWR Port bypassed due to drive power down
BYP/PCYCL Port bypassed due to drive power cycle
WARN/LIP Port is seeing high number of initiated LIPs
WARN/WRDB Port has seen a burst of word errors
WARN/WRD Port is seeing high number of word errors
WARN/CRC Port is seeing high number of CRC errors
WARN/CLK Port is seeing high Clock Delta
TERM-ERR Port termination error (Switch on, but shelf connected)
TERM Port terminated by switch
AUTOTERM Port autoterminated (last in shelf stack)
???:0xXX ESH Admin unable to decode port state XX.
???:0x50 see BYP/SELF (Bug 168572)
???:0x44 see BYP/BRZ (Bug 218312)
Environment
Channel: 7a
Shelf: 1
SES device path: local access: 7a.16
Module type: LRC; monitoring is active
Shelf status: unrecoverable condition
SES Configuration, via loop id 16 in shelf 1:
logical identifier=0x50050cc002000fa3
vendor identification=XYRATEX
product identification=DiskShelf14
product revision level=0x00003131
Environment
Shelf Status:
If you see a shelf status of “Information Condition”, perform a historical analysis and check Burt 262840.
(This is not the condition shown in the slide.)
Environment (Cont.)
Temperature Sensor installed element list: 1, 2, 3; with error: 1
Shelf temperatures by element:
[1] Unavailable (ambient)
[2] 4 C (39 F) Normal temperature range
[3] 38 C (100 F) Normal temperature range
Temperature thresholds by element:
[1] High critical: 50 C (122 F); high warning 40 C (104 F)
Low critical: 0C (32 F); low warning 10 C (50 F)
[2] High critical: 63 C (145 F); high warning 53 C (127 F)
Low critical: 0C (32 F); low warning 10 C (50 F)
[3] High critical: 63 C (145 F); high warning 53 C (127 F)
Low critical: 0C (32 F); low warning 10 C (50 F)
ES Electronics installed element list: 1, 2; with error: 1
ES Electronics reporting element: 2
ES Electronics serial numbers by element:
[1] LRC0_NO_SERNUM_
[2] IMS316230001411
Environment (Cont.)
When either the A (element 2) or B (element 3) module shows as 'Unavailable' in the environment output, it has likely encountered an I2C (enclosure) bug. This condition requires a power cycle of the shelf to clear; reseating the shelf module WILL NOT correct it. The problem is considered cosmetic and does not inhibit access to the disks. However, attempting to update an affected shelf module while it is in this state can cause a multi-disk panic.
Always clear any shelf issue prior to updating shelf firmware.
FC-Link Stats
===== FC LINK STATS =====
This is the same case as in the previous slides (Case 2430031). You can see that the errors here are incrementing on the HBA. Per Sustaining, the bad part is the cable or GBIC.
Loss of sync for a long enough period of time that it results in a LIP
Due to a component prior to the disk reporting the error, up to and including the previous active component
The drive will note a link failure event if it cannot synchronize its receiver PLL for a time greater than R_T_TOV, usually on the order of milliseconds. A link failure is a loss of sync that lasted long enough that the drive initiated a Loop Initialization Primitive (LIP). Refer to the loss of sync count below.
When a link failure occurs, it is generally due to a component prior to the disk reporting the error, up to and including the previous active component in the loop (disk drives and the HA are the only active components).
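The localization rule above ("prior to the reporter, up to and including the previous active component") can be sketched in Python. The component names are invented, and the loop is simplified to an ordered list in signal-flow order:

```python
# Sketch of the link-failure localization rule: the suspect span runs
# backward from the reporting device up to and including the previous
# *active* component. Only disk drives and the HA are active; cables,
# GBICs, and ESH ports in between are passive.

def suspects(loop, reporter):
    """loop: ordered (name, is_active) pairs in signal-flow order.
    Return the suspect components, in loop order."""
    idx = next(i for i, (name, _) in enumerate(loop) if name == reporter)
    span = []
    i = idx - 1
    while True:
        name, active = loop[i % len(loop)]  # wrap around the loop
        span.append(name)
        if active:               # stop at the previous active component
            break
        i -= 1
    return list(reversed(span))

loop = [("HA", True), ("cable1", False), ("ESH-in", False),
        ("disk16", True), ("disk17", True)]
print(suspects(loop, "disk16"))  # → ['HA', 'cable1', 'ESH-in']
```

In other words, when disk16 reports the error, everything between it and the previous active component (the HA) is suspect, including the HA itself.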
Read from Disk 7
[Diagram: two DS14 shelves of 14 disks with ESH2 modules. The FCP_CMND frame travels from the HBA around the loop to disk 7; the FCP_DATA and FCP_RSP frames return through every downstream component, where a frame can be corrupted in transit.]
Underruns are detected by the Host Adapter (HA) during a read request. The disk sends data to the HA through the loop; if any frames are corrupted in transit they are discarded, so the HA receives less data than expected. The driver reports the underrun condition and retries the read. The cause of the underrun is downstream in the loop, after the disk being read and before the HA.
When an underrun occurs, it means there's a problem in the loop AFTER the disk being read from.
Loss of Sync
Loss of sync for a short period of time
Does not result in a LIP
Disks on shelf borders are subject to higher loss of sync counts than disks not on a border
Due to a component prior to the disk reporting the error, up to and including the previous active component
The drive will note a loss of sync event if it loses PLL synchronization for a period less than R_T_TOV and thereafter manages to resynchronize. This event generally occurs when a component before the disk reports loss of sync, up to and including the previous active component in the loop. Disks on shelf borders are subject to higher loss of sync counts than disks not on a border.
When a loss of sync occurs, it is generally due to a component prior to the disk reporting the error, up to and including the previous active component in the loop (disk drives and the HA are the only active components).
Write to disk 7
[Diagram: the same two-shelf loop. The FCP_CMND travels from the HBA to disk 7; the disk answers with XFER_RDY, the HBA sends FCP_DATA, and the disk returns FCP_RSP. A frame corrupted anywhere before it reaches the disk triggers the CRC error.]
Every frame received by a drive contains a checksum that covers all data in the frame. If, upon receiving the frame, the checksum does not match, the invalid CRC counter is incremented and the frame is "dropped". Generally, the disk that reports the CRC error is not at fault; rather, a component between the Host Adapter (which originated the write request) and the reporting drive corrupted the frame.
When a CRC error occurs, it means there is a problem in the loop BEFORE the disk being written to.
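The two direction rules — underruns point AFTER the disk being read, CRC errors point BEFORE the disk being written — can be captured in a small Python sketch. The loop is simplified to a linear list of invented component names starting at the HA:

```python
# Sketch of the two fault-direction rules from the text:
#  - underrun (read):  corruption is downstream, after the disk,
#    anywhere back to the HA
#  - crc (write):      corruption is upstream, between the HA and
#    the disk being written

def suspect_segment(loop, disk, error):
    """loop: component names in signal-flow order starting at 'HA'."""
    i = loop.index(disk)
    if error == "underrun":       # read: corruption after the disk
        return loop[i + 1:] + ["HA"]
    if error == "crc":            # write: corruption before the disk
        return loop[1:i]          # everything between HA and the disk
    raise ValueError(error)

loop = ["HA", "disk1", "disk2", "disk3", "disk4"]
print(suspect_segment(loop, "disk2", "underrun"))  # → ['disk3', 'disk4', 'HA']
print(suspect_segment(loop, "disk2", "crc"))       # → ['disk1']
```

Combining the two rules in practice (errors on reads from one disk, CRC errors on writes to another) lets you intersect the suspect segments and narrow the fault to a short stretch of the loop.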
AutoSupports may not have all shelf log data
– ASUPs limit the shelf log to 5MB (SHELF-log.gz)
– The actual shelf log may be over 100MB
Have the customer zip the full shelf logs
– Located in the /etc/log/shelflog folder
Force a shelf log dump
– From the Data ONTAP CLI, the environment shelf_log command outputs the full shelf log to the console
Shelf Log
When the shelf log has been truncated, you will see the following output at the TOP of the log
Search for the following:
– exception
– persistent log
– panic
– Non crit
– DNR
Some shelf log data may require investigation by storage engineering or the vendor
--------------------------------------------------------------
Channel: 0c Shelf: 4 Module type: AT-FCX Firmware rev: 34
Shelf Serial Number: OPS8248721BC5E8
Module A Serial Number: IMS8289900293E2
Timestamp: Mon Jul 7 16:00:00 EDT 2008
--------------------------------------------------------------
--------------------[persistent log]--------------------------
3A1221A0 INFO I2CMGR 4 00 00842 0837 023 20015955 04000B00 20000013 20000013
In this case, given AT-FCX firmware 34 and 'INFO I2CMGR 4 023', this translates to "I2C_DRIVER_HUNG_RESET".
Performing a BURT search on this information will lead you to further information about the issue discovered in the logs.
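The decode step above can be sketched as a lookup in Python. Only the one mapping shown in this guide is taken from the text; the field positions are inferred from the single example line and should be treated as an assumption, not a complete reference:

```python
# Hedged sketch of decoding a persistent shelf log entry: map
# (firmware rev, subsystem, code) to a symbolic name. Only the one
# mapping shown in the guide is real; the table and field positions
# are illustrative assumptions.

DECODE = {
    # (module fw rev, subsystem, code) -> symbolic meaning
    ("34", "I2CMGR", "023"): "I2C_DRIVER_HUNG_RESET",
}

def decode(fw_rev, line):
    fields = line.split()
    # assumed layout: "3A1221A0 INFO I2CMGR 4 00 00842 0837 023 ..."
    subsystem, code = fields[2], fields[7]
    return DECODE.get((fw_rev, subsystem, code), "unknown - escalate")

entry = ("3A1221A0 INFO I2CMGR 4 00 00842 0837 023 "
         "20015955 04000B00 20000013 20000013")
print(decode("34", entry))  # → I2C_DRIVER_HUNG_RESET
```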
Application of Stats
Data sources, by module type (ESHX, AT-FC/AT-FC2, AT-FCX):
– Analyze shelf logs
– storage show
– Link stats
– Messages
– Environment
– sysconfig -a
1. Note the hardware in use
– System model
– Shelf type and modules
– Firmware of modules and disks
– Check firmware BURTs
– Any recent changes
– ASUP history (when did the event start?)
2. Physical Inspection
3. Eject any broken disks to reduce noise on the loop
4. Analyze new data for changes in error counts
5. Use the data to narrow the problem to a subset of the system,
such as one shelf
6. Replace parts if required – do not shotgun parts
Physical Inspection
Are there any broken or crimped cables?
Do you have proper power?
Are there any amber LEDs?
Use the data to narrow the problem to a subset of the system, such as one shelf
Divide and conquer.
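Step 4 above ("analyze new data for changes in error counts") amounts to diffing two counter snapshots. A minimal Python sketch, with counter names and values invented for illustration:

```python
# Sketch of snapshot diffing: compare two captures of per-device error
# counters (e.g. taken from "fcadmin link_stats" at different times)
# and report only the counters that are still incrementing.

def incrementing(before, after):
    """Return {device: {counter: delta}} for counters that grew."""
    deltas = {}
    for device, counters in after.items():
        grown = {name: value - before.get(device, {}).get(name, 0)
                 for name, value in counters.items()}
        grown = {n: d for n, d in grown.items() if d > 0}
        if grown:
            deltas[device] = grown
    return deltas

before = {"0a.16": {"link_fail": 2, "loss_of_sync": 10, "crc": 0}}
after  = {"0a.16": {"link_fail": 2, "loss_of_sync": 25, "crc": 3}}
print(incrementing(before, after))
# → {'0a.16': {'loss_of_sync': 15, 'crc': 3}}
```

A stable counter (like link_fail above) is historical noise; only counters that keep climbing between snapshots point at a live problem worth chasing.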
Module Summary
You should now be able to:
Describe how RAID handles disk errors
Show how to look up and use SCSI codes
Identify disk and media errors
Compare the various statistics and data that can be used to solve FC-AL problems
Module Review
A disk's first attempt to read data failed, but the disk was able to perform error correction and then provide the data. Is this a recovered or unrecovered media error?
– Recovered, because the disk provided the data without the help of RAID.
When an HBA talks to a disk in an FC-AL environment, how many disks are involved in the conversation?
– All the disks on the loop.
Do Link Failure and Loss of Sync errors show that an error occurred on the loop prior to, or after, the disk reporting the error?
– Prior to, for both.
Labs
Lab 7-1
Lab 7-2
Lab 7-3