
Storage Life-Cycle Management Strategy

April 2019 – 2Q

Jim Olson TI&A Storage SSA DE/Global SSL DA & ACB


GTS Technology and Innovation (T&I) Group

URL: http://ibm.biz/BdrACA
Education:
https://learning.atlanta.ibm.com/hr/global/edvisor/gdf_edvisor.nsf/Start?OpenAgent&Login&id=85257E5200583DEA
IBM Internal Use Only © 2015 IBM Corporation
IBM Global Technology Services

Agenda
• DT&E Overview – Lifecycle Management
• Code Strategy
• Firmware/Microcode best practices
• Prioritization of code upgrades
• Security
• Device Adapter High Performance
• Technology Refresh
• Configuration Database and analytics
• Heartbleed
• Tech Alerting
• Interoperability
• Storage Automation Tooling (SAT)
• Call home
• Clock Synchronization
• Entitlement
• Backup
  • Change log
  • XIV ECAs
  • Language to help encourage code upgrades
  • DS8 ECAs that we have sunset
  • zHPF doc/recommendation


GTS Delivery Technology and Engineering (DT&E) Group overview


Storage Lifecycle Management

Mission: Lead, develop, execute, implement and govern a global technology strategy for GTS – Lifecycle Management
In Scope:

• Setting quarterly microcode/firmware targets for all storage technology using a minimum
acceptable and target level strategy.

• Driving strategy associated to ECAs

• Prioritization associated to microcode upgrades

• Strategies associated to mandated technology refreshes/EOL guidance

• Communicating to global Storage Service Line owners

• Partnering with STG associated to priorities and code strategies

• Interoperability Strategies, call home, global technical alerting to subscribers

• Strategies associated to data collection and analytics



Microcode/Firmware Storage Strategy

• We use a minimum acceptable and target code strategy. Under this strategy, any code level on any storage
device is acceptable if it is at or above our defined minimum acceptable level. If the device is below our
defined minimum acceptable level, it is required to have a code upgrade and move to our defined target
level for that technology. This strategy removes the decision making associated with code upgrades and
target levels from the global storage administrators. Deviations are noted in the spreadsheet (occasionally
a version above the minimum is not acceptable).
• We update our minimum acceptable and target code levels on a quarterly basis. Changes can occur
more frequently if a known code quality issue requires a change.
• We encourage a year between minimum acceptable and target levels (to help eliminate chasing code).
• Our GTS global code levels are posted in pRAM and approved via the Storage Service Line Design
Authority. Our global levels are posted here -> http://ibm.biz/BdrACA
• Storage administrators should always understand the position of the hardware for the account they support
with respect to Lifecycle Management. Changes (both H/W and S/W) should be scheduled to stay in
compliance.
• Deviations should only occur via GTS TI&A Storage SSA approval.
• Risks should be written for accounts/clients not allowing us to proceed with our recommended upgrades.
• Call home should be checked/modified per new best practices. Confirm it is enabled and that SO ACCOUNT
is listed in the searchable field.
• It is the responsibility of the GEO SSLs to track and act on this guidance.
• Follow support guidance (L2/L3) with respect to recommended ECAs that we are not officially tracking to
completion. Also, if there is a valid issue with a piece of technology and support recommends a course of
action, guidance is to follow it. This applies to other devices of the same model in that account. In short,
you are allowed to upgrade all of the same technology in your account to that fix level, as the issue is
sometimes workload related.
• All storage technology is in scope except tape drives. Follow your local CE/SSR for guidance associated
with tape drive code.
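
The minimum acceptable / target rule described above can be sketched in a few lines of Python. This is only an illustration: the dotted numeric level format, the `CodePolicy` name and the sample levels are assumptions made for the example, not values from the GTS spreadsheet. Real code levels (for example XIV levels with letter suffixes) would need a per-technology parser.

```python
from dataclasses import dataclass


def parse_level(text):
    """Turn a dotted numeric level such as '7.8.1.5' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class CodePolicy:
    minimum: str                             # minimum acceptable level
    target: str                              # level to move to when an upgrade is required
    not_acceptable: frozenset = frozenset()  # deviations: levels above minimum that are still rejected


def required_action(installed, policy):
    """Return 'acceptable' or an upgrade instruction for one device."""
    below_minimum = parse_level(installed) < parse_level(policy.minimum)
    if below_minimum or installed in policy.not_acceptable:
        return "upgrade to " + policy.target
    return "acceptable"


# Hypothetical levels, for illustration only.
policy = CodePolicy(minimum="7.8.1.0", target="8.1.3.2",
                    not_acceptable=frozenset({"8.1.0.0"}))
print(required_action("7.8.1.5", policy))  # at or above minimum
print(required_action("7.6.0.4", policy))  # below minimum
print(required_action("8.1.0.0", policy))  # listed deviation
```

A device that is at or above the minimum is left alone even if it is below target, which is what keeps teams from chasing code.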

Firmware/Microcode best practices

• A review of the associated firmware/microcode release notes is required before upgrade. Please act on any
specific items called out.
• DS8s need a pre-check performed by the CE a few days in advance of the code upgrade. Please ensure you
open a change record and request this.
• XIVs/A9Ks need a pre-check performed. The process is to open a change record for the CE/SSR to run the TA
tool 48 hours prior to the start of the microcode upgrade change window. For the pre-check, the CE is to
reply NO when prompted to proceed with code load. Any issues raised by this review need to be
addressed through support so that when we go into the change window, the box is in a healthy state for the
microcode upgrade.
• FlashSystems need a pre-check performed. The process is to open a change record and then run the
FlashSystem Software Upgrade Test Utility tool 48 hours prior to the start of the microcode upgrade
change window. For the pre-check, the CE is to reply CANCEL when prompted to proceed with code
load. Any issues raised by this review need to be addressed through support so that when we go into the
change window, the box is in a healthy state for the microcode upgrade.
• When doing an SVC or V7K upgrade, please check the following link to ensure you know the upgrade
process. Check whether hops are necessary to get to new code families instead of staying on
old families (EOS for the 7.4 family is April 2018, EOS for 7.5 is Sept 2018, EOS for 7.6 is April 2019).

• SVCs and V7Ks need to use a pre-code upgrade health check process. Please use these resources and
plan your upgrades accordingly:
  – Concurrent Compatibility and Code Cross Reference: https://ibm.biz/BdZs7Y
  – Software Upgrade Test Utility for use with SAN Volume Controller, Storwize V7000, V5000, V3500,
    V3700, FlashSystem V9000 and FlashSystem V840 update: https://ibm.biz/BdZs7M
  – Pre and Post Firmware Upgrade Checklist: ibm.biz/globalsvcbp
  – SAN Volume Controller (2145) welcome page: https://ibm.biz/BdZs7n
  – Click on the code level and then on upgrading to see the process
• These pre-checks look for a variety of items that could cause problems during a code upgrade. While
very beneficial to do, they do not cover interoperability. See interoperability slide for details in this area.
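
The 48-hour pre-check timing for XIV/A9K and FlashSystem can be worked out from the change window start. The sketch below is illustrative only; the dictionary keys and the `plan_precheck` name are assumptions for this example, while the lead times, tools and prompt replies come from the bullets above.

```python
from datetime import datetime, timedelta

# Lead times and prompt replies taken from the bullets above; the dictionary
# keys are hypothetical labels chosen for this example.
PRECHECK_RULES = {
    "xiv_a9k": {"tool": "TA tool",
                "lead": timedelta(hours=48), "reply": "NO"},
    "flashsystem": {"tool": "FlashSystem Software Upgrade Test Utility",
                    "lead": timedelta(hours=48), "reply": "CANCEL"},
}


def plan_precheck(technology, window_start):
    """Return when to run the pre-check and how to answer the code-load prompt."""
    rule = PRECHECK_RULES[technology]
    return {
        "tool": rule["tool"],
        "run_at": window_start - rule["lead"],
        "reply_to_prompt": rule["reply"],
    }


plan = plan_precheck("xiv_a9k", datetime(2019, 4, 20, 22, 0))
print(plan["tool"], plan["run_at"], plan["reply_to_prompt"])
```

DS8 is left out on purpose because the guidance only says "a few days in advance" without a fixed lead time.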

Firmware/Microcode best practices - continued

• Complete interoperability checking needs to be finished before a code upgrade can proceed (see
interoperability slide later in deck). Multi-path validation is required as well (for all servers before upgrade).
• Validate backups before the change.
• If you are directed to a code level that is not GTS approved, please question why. If you believe support
has identified a known defect that is fixed in a newer code level and the recommendation appears
justified, you should contact the IBM GTS SO Global Code level team via Karen Haberli.
• GTS has documented a high-level process that should be reviewed prior to code upgrades. It provides good
reminders associated with the process: http://ibm.biz/StorFWupgradeprocedure

• Field experience of SVC upgrades shows that rebooting an SVC node and running the BIOS POST tests
appears to be the most common time to find hardware problems with the SVC nodes. For many
customers, the only time that an SVC node is rebooted is during a software upgrade. If a hardware
problem is detected during a software upgrade, it will disturb the upgrade process and may require rolling
the upgrade back to the original level, depending on which node experiences the hardware fault.
  • We therefore strongly recommend manually rebooting any node that has not been booted in 14 months
    or longer before doing code upgrades. This allows a controlled reboot and also permits testing of
    the multi-pathing drivers in a more controlled reboot cycle, all of which improves the likelihood of
    a successful upgrade.

• For SVC upgrades that plan two code family hops in the same upgrade window, it is recommended that the
first upgrade be done manually rather than automatically.

• It is recommended to have a USB stick at the site where the upgrade is occurring in case the CE needs to
reinstall the SVC node software if there is an issue with the upgrade.
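
The 14-month reboot recommendation for SVC nodes can be checked from each node's last boot date. The following is a minimal sketch; the function names are hypothetical, and how the last boot date is obtained from a node is deliberately left out.

```python
from datetime import date


def months_since(boot, today):
    """Whole calendar months elapsed between two dates."""
    months = (today.year - boot.year) * 12 + (today.month - boot.month)
    if today.day < boot.day:
        months -= 1  # the current month has not completed yet
    return months


def needs_manual_reboot(last_boot, today, threshold_months=14):
    """True when a node has gone 14 months or longer without a boot."""
    return months_since(last_boot, today) >= threshold_months


print(needs_manual_reboot(date(2018, 1, 15), date(2019, 3, 15)))
print(needs_manual_reboot(date(2018, 1, 15), date(2019, 3, 14)))
```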


Firmware/Microcode best practices - continued



Storwize devices may require disk drive firmware upgrades as part of the code upgrade process. If you see
a disk drive firmware warning during the pre-code upgrade process, we require you to follow up by
executing the disk drive upgrades per best practices. The warning that will come up when needed is…

‘This tool has found the internal disks of this system are not running the recommended firmware versions’.
When you see this, follow the upgrade directions.


NEVER do SVC cluster upgrades concurrently. The only exception is if servers are 100% isolated, with none
connected to both SVC Clusters.
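
The isolation condition for concurrent SVC cluster upgrades is a simple set intersection over the hosts attached to each cluster. This sketch assumes the host lists have already been collected; the function names and host names are hypothetical.

```python
def hosts_on_both_clusters(cluster_a_hosts, cluster_b_hosts):
    """Hosts that are connected to both SVC clusters."""
    return sorted(set(cluster_a_hosts) & set(cluster_b_hosts))


def concurrent_upgrade_allowed(cluster_a_hosts, cluster_b_hosts):
    """Only allowed when no server is connected to both clusters."""
    return not hosts_on_both_clusters(cluster_a_hosts, cluster_b_hosts)


# Hypothetical host names, for illustration only.
cluster_a = ["db01", "app01", "app02"]
cluster_b = ["app02", "web01"]
print(hosts_on_both_clusters(cluster_a, cluster_b))
print(concurrent_upgrade_allowed(cluster_a, cluster_b))
```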


NetApp and Nseries code upgrade best practices are similar to DS8/SVC/FlashSystems in that there are best
practices to apply ahead of time to ensure you have a system that is free of hardware issues and
configuration issues that would prevent a non-disruptive upgrade.

Please use the following process in preparation for doing NetApp and Nseries code upgrades ->
https://ibm.biz/BdR7PE

• It is your responsibility to update HW/SW Currency when making changes in your environment (adding
or removing hardware, upgrading firmware/microcode, etc). This is imperative for ensuring you have a clear
understanding of your account's position with respect to code currency. Link here --->
https://hwsw.boulder.ibm.com:8443/hscms//welcome.pro

• Always use approved account change management processes before executing any changes (applies to
everything in the deck).

• SVC node swap out process for EOL SVCs - https://ibm.biz/BdZs7b


Firmware/Microcode best practices - continued


All storage device (excluding backup/tape devices) code upgrades must be performed during a change
window that targets or schedules low I/O (quiet time). Often, a customer's line of business dictates when
'quiet time' actually is, so take this into account when planning, noting that this may not align directly with
the "lowest I/O period". Performing code upgrades during prime business hours is NOT recommended or
safe, as it significantly increases the odds of business impact. While prime business hours are very often
8am to 5pm Monday through Friday, this can vary by industry and geography, so it is best to determine it
with the account DPE/PE.

Environments that are performance sensitive should schedule code upgrades during a time that is
acceptable for possible impact. Storage devices, by design, can have upwards of a 30 second delay during
a code upgrade, and that can negatively affect performance-sensitive applications (like
Oracle RAC, etc).

If a code upgrade change must occur during a client's prime business hours, the customer will need to
provide their approval with understanding of potential impact.
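
A planned change window can be checked against prime business hours before it is submitted. The sketch below uses the 8am to 5pm, Monday through Friday default mentioned above; the function name and parameters are assumptions for this example, and the hours should be replaced with whatever the account DPE/PE confirms.

```python
from datetime import datetime, time, timedelta


def overlaps_prime_hours(start, end, first_hour=8, last_hour=17,
                         workdays=(0, 1, 2, 3, 4)):
    """True if any part of the window [start, end) falls in prime business hours.

    workdays uses datetime.weekday() numbering (0 = Monday).
    """
    day = start.date()
    while day <= end.date():
        if day.weekday() in workdays:
            prime_start = datetime.combine(day, time(first_hour))
            prime_end = datetime.combine(day, time(last_hour))
            if max(start, prime_start) < min(end, prime_end):
                return True
        day += timedelta(days=1)
    return False


# Saturday night window versus a Tuesday afternoon window.
print(overlaps_prime_hours(datetime(2019, 4, 20, 22, 0), datetime(2019, 4, 21, 2, 0)))
print(overlaps_prime_hours(datetime(2019, 4, 16, 16, 0), datetime(2019, 4, 16, 18, 0)))
```

A window that overlaps prime hours would then need the customer approval described above.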


LCM Simplified View – (Process, Players, Cadence, and Pain Points)

[Process diagram: Set Strategy (1); Communicate Strategy and Attainment (2, 3, 4); Prioritize resources:
Capital, Labor (5); Perform actions: Microcode upgrades, Enable call-home, Replace/retire EOL hardware (6);
Facilitators (Tools, Process, etc…) (7); End-to-End Ownership (8, 9). Numbers refer to the role table below.]

Pain Points

1. Priority of Resources (capital/labor) – Work with each IOT to determine how to best partner to influence
   and drive priority for the decision makers in each IOT.
2. IBM Device Serviceability – Code stability, Concurrent upgrades, etc…
3. Storage Tools (HW/SW) – Data integrity, direction and prioritization.

Global Cadence

1. Weekly – Auto-email is sent from HW/SW to DPEs with details on Callhome, Code, and EOL/EOS.
   (Additional detail slide 8)
2. Bi-Monthly – HRM – Status reported in HRM's bi-monthly calls. HRM approaches HRM-Pro accounts which
   have focal points within HRM-team and account. In addition, HRM focuses on the top 10-15 non-HRM-Pro
   accounts per IOT.
3. Monthly – WEX Team sends monthly report to the Global Quality IOT and Storage IOT Leader. Quality
   teams use material as part of quality cadence meetings.
   • "# of devices out of criteria for either Code or EOL from HW/SW" and "time remaining on contract
     based on CHIP"
4. Monthly – Storage Service Engineering MOR to communicate current LCM status.
   • Data points tuned to apply heatmap (Additional Detail in backup 9-12)
5. Quarterly LCM Strategy Update (Additional Detail in slide 6)
6. Adhoc – Reporting from HW/SW Currency

# | Role | Action
1 | TI&A Storage Domain | Sets Strategy
2 | Global Service Engineering Team | Communicates Strategy and Attainment
3 | Storage IOT Leaders | Facilitate execution/communication of strategy, best practice, and attainment across the Storage Delivery Teams
4 | HRM | Assists with replacement of the EOL hardware. HRM Team contacts Account Focal points (Chief Architect or delegate), assists with and refines the optimal technical design in line with Storage strategy. Ensures lowest cost solution for IBM, sourcing from GARS, etc. Provides technical approval which is pre-requisite for financial approval in WWCT.
5 | Account Team | Prioritizes resources (capital and labor)
6 | Storage Delivery Teams | Perform actions
7 | Domain | Tooling Strategy/Development and Work with IBM Systems to address serviceability (Code, etc…)
8 | Global & IOT Quality Teams | Facilitate additional focus and communication
9 | Global Service Engineering | Global Process Owner - End-to-End Ownership to drive program

High Level Security Flow

Vendor/owner of tech made aware of issue ----> they develop code fix ----> alert sent to MSS where it is
rated ----> our LCM evaluates rating (if low/med we just move target to fix level, if high we move min and
target to fix level) ----> CIRATs auto-cut ----> global teams notified of code change ----> upgrades occur
and CIRATs closed
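
The level-adjustment step of the security flow (low/medium moves only the target, high moves both minimum and target) can be expressed as a small function. This is a sketch under stated assumptions: the dictionary layout, the function name and the sample levels are hypothetical and not taken from the code spreadsheet.

```python
def apply_mss_rating(levels, fix_level, rating):
    """Return new minimum/target levels after an MSS-rated security fix.

    levels is a dict with 'minimum' and 'target' keys; the input is not modified.
    """
    rating = rating.lower()
    updated = dict(levels)
    if rating in ("low", "medium"):
        updated["target"] = fix_level       # only the target moves
    elif rating == "high":
        updated["minimum"] = fix_level      # minimum and target both move
        updated["target"] = fix_level
    else:
        raise ValueError("unknown MSS rating: " + rating)
    return updated


# Hypothetical levels, for illustration only.
current = {"minimum": "7.8.1.0", "target": "8.1.3.2"}
print(apply_mss_rating(current, "8.1.3.6", "medium"))
print(apply_mss_rating(current, "8.1.3.6", "high"))
```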

LCM closed loop design – Automated health check thru DPE notification
Automated Health Checks run daily on SAT servers (may be co-located with TPC) for IBM managed storage

1. Create configuration backups of all supported storage devices
2. Evaluate hundreds of policies:
   a) Configuration
   b) Security
   c) Status
3. Generate daily report.
4. Sent automatically to HW/SW Repository for Storage Dashboard Reporting and Analytics
   – Trigger an alert to Account DPEs for any high priority findings
   – Weekly Email to Account DPEs with list of deviations
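
The evaluate, report and alert steps of the closed loop can be pictured with a short Python sketch. It is not the SAT implementation; the `Policy` structure, the two sample policies and the configuration keys are all hypothetical and exist only to show the shape of the loop.

```python
from typing import Callable, NamedTuple


class Policy(NamedTuple):
    name: str
    category: str                   # "configuration", "security" or "status"
    high_priority: bool
    passes: Callable[[dict], bool]  # returns True when the device complies


def daily_health_check(device, config, policies):
    """Evaluate every policy; return (daily report lines, high priority alerts)."""
    failed = [p for p in policies if not p.passes(config)]
    report = [f"{device}: {p.category}: {p.name}" for p in failed]
    alerts = [line for line, p in zip(report, failed) if p.high_priority]
    return report, alerts


# Hypothetical policies and device configuration, for illustration only.
policies = [
    Policy("call home enabled", "configuration", True,
           lambda c: c.get("call_home", False)),
    Policy("NTP server configured", "configuration", False,
           lambda c: bool(c.get("ntp_servers"))),
]
report, alerts = daily_health_check(
    "svc01", {"call_home": False, "ntp_servers": []}, policies)
print(report)
print(alerts)
```

In this picture the report would feed the weekly deviation email, while the alerts list corresponds to the immediate notification for high priority findings.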

Prioritization for Microcode/Firmware Upgrades

It is the responsibility of the GEO SSLs to track and act on this guidance.

• Priority #1 – SVC & Storwize (V7K, V5K, V3K)
  • SVC GM vulnerability (posted in spreadsheet) & slide 1
  – FlashSystem
• Priority #2 – Brocade SAN switches
• Priority #3 – XIV, DS8
• Priority #4 – Rest

• As there are risk differences with deviations from minimum acceptable levels, we are
starting a strategy for certain devices to help guide the teams from a priority perspective.
See strategy below (via XIV example)…

Hardware Make / Model (or Software) | Minimum Accepted Code Level | Target Code Level(s) | RED Level: Urgent Need to Upgrade | Yellow Level: Secondary Priority | Document Risk if Device is NOT on Minimum or Recommended Code Levels
XIV GEN 2 | 10.2.4e | 10.2.4e-3 | Devices below 10.2.4c-1 | 10.2.4c-1 | Numerous defects. More significant issues below 10.2.4c-1
XIV GEN 3 (MODEL 114) | 11.1.1 *** | 11.2.0b | Devices below 11.1.1 | Devices running 11.1.1 and are using iSCSI or VMWare | Significant issues below 11.1.1
XIV GEN 3 (MODEL 214) | 11.2.0a | 11.2.0b | N/A | N/A | Significant issues below 11.1.1
New GEN3s being shipped | 11.2.0b | 11.2.0b | N/A | N/A | Significant issues below 11.1.1

Timelines for Security Issues with Firmware/Microcode Risk Management - ITCS104 (IBM internal security document)
Targets by group, system type and severity (notes 1 to 5 are listed below):

Group 1 (note 5) – IGA Infrastructure Storage Firmware/Microcode
• High: 60 days (note 1). A CIRATS record will be created with a 60 day target for IGA Corporate Systems.
  Create risk in risk management system per FIN 166 if above due date is extended or missed.
• Medium: 120 days (note 1). A CIRATS record will be created with a 120 day target for IGA Corporate
  Systems. Create risk in risk management system per FIN 166 if above due date is extended or missed (note 3).
• Low: 12 months (notes 1, 2). A CIRATS record will be created with a 12 month target for IGA Corporate
  Systems. Create risk in risk management system per FIN 166 if above due date is extended or missed.

Group 1 (note 5) – Non-IGA Shared Commercial Infrastructure Storage Firmware/Microcode
• High: 60 days (note 1). A CIRATS record will be created with a 60 day target for a Shared Infrastructure.
  Create risk in risk management system per FIN 166 if due date is extended or missed.
• Medium: 12 months (notes 1, 2). A CIRATS record will be created with a 12 month target for a Shared
  Infrastructure. Create risk in risk management system per FIN 166 if due date is extended or missed.
• Low: 12 months (notes 1, 2). A CIRATS record will be created with a 12 month target for a Shared
  Infrastructure. Create risk in risk management system per FIN 166 if above due date is extended or missed.

Group 3&4 (note 5) – IGA Infrastructure Storage Firmware/Microcode
• High: 180 days (note 1). A CIRATS record will be created with a 180 day target for IGA Corporate Systems.
  Create risk in risk management system per FIN 166 if due date is extended or missed.
• Medium: See Guidance (note 4). An informational CIRATS record will be created with no targets.
• Low: See Guidance (note 4). An informational CIRATS record will be created with no targets.

Group 3&4 (note 5) – Non-IGA Shared Commercial Infrastructure Storage Firmware/Microcode
• High: 180 days (note 1). A CIRATS record will be created with a 180 day target for a Shared Infrastructure.
  Create risk in risk management system per FIN 166 if due date is extended or missed.
• Medium: See Guidance (notes 1, 4). An informational CIRATS record will be created with no targets.
• Low: See Guidance (notes 1, 4). An informational CIRATS record will be created with no targets.

Notes:
1. Target install time. If install time cannot be met, follow the Risk Management Process per FIN 166.
2. Follow Storage Microcode Strategy that states a year between minimum acceptable and target levels. Ensure
   there is a Risk discussion with Client - Firmware upgrade will take place (only) if agreed with Client.
   Set Target installation time 12 months.
3. For accounts using WWBCIT risk system: Target install time is 120 days. If the 120 days cannot be met, then
   a WWBCIT SAI (Self Assessment Issue) record needs to be completed with a new target date for 12 months
   from vulnerability. If the 12 month microcode strategy implementation cannot be met, then a WWBCIT CDD
   (Corporate Directive Deviation) record needs to be completed. The existing SAI can be closed referencing
   the newly opened CDD.
4. Follow Storage Microcode Strategy that states a year between minimum acceptable and target levels. Ensure
   there is a Risk discussion with Client - Firmware upgrade will take place. These will only generate
   informational CIRATS records with no targets.
5. Group 1 refers to production and group 3&4 refers to development and test.

• Overall firmware/mcode process located here - http://ibm.biz/BdZc9c
• If low/medium MSS classified risk, we move target levels to fix level. If high MSS classified risk, we
  move min to fix level. This is the LCM owner's responsibility.
• CIRATs auto-cut post MSS classification provided accounts are subscribed to given technology as they are
  required to be.
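
The ITCS104 targets above lend themselves to a lookup table. The sketch below computes a CIRATS target date from the date a record is opened; the names are hypothetical, and treating "12 months" as 365 days is an assumption made for this example. Combinations that only produce informational records return `None`.

```python
from datetime import date, timedelta

TWELVE_MONTHS = 365  # assumption: "12 months" approximated as 365 days

# (group, scope) -> severity -> target in days; values from the ITCS104 table.
TARGET_DAYS = {
    ("1", "IGA"): {"high": 60, "medium": 120, "low": TWELVE_MONTHS},
    ("1", "non-IGA"): {"high": 60, "medium": TWELVE_MONTHS, "low": TWELVE_MONTHS},
    ("3&4", "IGA"): {"high": 180},      # medium/low: informational, no target
    ("3&4", "non-IGA"): {"high": 180},  # medium/low: informational, no target
}


def cirats_target_date(group, scope, severity, opened):
    """Return the target date, or None for informational records."""
    days = TARGET_DAYS[(group, scope)].get(severity)
    if days is None:
        return None
    return opened + timedelta(days=days)


print(cirats_target_date("1", "IGA", "medium", date(2019, 4, 1)))
print(cirats_target_date("3&4", "IGA", "low", date(2019, 4, 1)))
```

Contractual terms for an individual account would override these defaults, so the table is best kept as data rather than hard-coded logic.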

Timelines for Security Issues with Firmware/Microcode Risk Management - CSD (Customer Security Document)

Targets by group, system type and severity (notes 1 and 2 are listed below):

Group 1 (note 2) – Infrastructure Storage Firmware/Microcode
• High: 60 days. A CIRATS record will be created with a 60 day target. If the above due date is extended or
  missed the CIRATS record must be extended.
• Medium: See Guidance (note 1). A CIRATS record will be created with a 12 month target. If the above due
  date is extended or missed the CIRATS record must be extended.
• Low: See Guidance (note 1). A CIRATS record will be created with a 12 month target. If the above due
  date is extended or missed the CIRATS record must be extended.

Group 3&4 (note 2) – Infrastructure Storage Firmware/Microcode
• High: 180 days. A CIRATS record will be created with a 180 day target. If the above due date is extended
  or missed the CIRATS record must be extended.
• Medium: See Guidance (note 1). A CIRATS record will be created with a 12 month target. If the above due
  date is extended or missed the CIRATS record must be extended.
• Low: See Guidance (note 1). A CIRATS record will be created with a 12 month target. If the above due
  date is extended or missed the CIRATS record must be extended.

Notes:
1. Guidance is to follow the Storage Microcode Strategy. This strategy ensures that we are at the documented
   minimum acceptable code level.
2. Group 1 refers to production and group 3&4 refers to development and test.

• The above dates are the IBM recommended values. Each individual account may have different values based
  on contractual agreements.

TPC/TSM timelines will be different because they are software updates, not firmware updates, and therefore
align to server timelines/techspecs.

CIRATs that are automatically opened for storage firmware/microcode security issues need to be updated with
a target date for addressing with the fix. Close after the fix has been applied.

Security Advisories and Ratings can be reviewed here ->
https://advisories.secintel.ibm.com/api/v0/storage_advs.php?fmt=html

Risk Management process that teams need to use ->
https://w3-connections.ibm.com/activities/service/html/mainpage#activitypage,dc6d5eed-f279-4632-83ca-b2d4af32f12f

See slide in backup for more on classifications.

Problematic DDMs – Storwize and DS8K Guidance


We have seen devices declared by support as having problematic DDMs (due to age, firmware level, or
excessive errors). These storage devices actually lost data due to these drives. The best way to avoid this is
to align to our global strategy, which is RAID6/DRAID6. All DS8s and Storwize devices configured in
2016 and later should be configured this way; RAID5 is not approved for use with this technology (the
strategy changed in 2016). We highly recommend accounts work to retrofit existing RAID5 devices to
RAID6/DRAID6.

If your account has had a data loss event due to confirmed problematic drives, you need to notify the global
service line immediately (Karen Haberli/Hartford/IBM). If that device is EOL/EOS you will need to vacate /
replace it immediately.

If the device is not EOL/EOS (or near to it), your action plan will be:
• Immediately vacate the device
• Replace all the problematic disk drives (work with Systems)
• Upgrade the DDM and Device code to current
• Reformat the device as RAID6/DRAID6
The affected device cannot be reused until all 4 steps are complete.

Predictive Analytics (HWSW Reporting) will continue to improve at identifying faulty DDMs so that we can
find and address the problem storage devices before they cause an account outage. These analytics provide a
summarized action plan to address the errors identified. The recommendations should be worked immediately.

HWSW Risk Scoring Dashboard SPURT:
https://w3-connections.ibm.com/wikis/home?lang=en-us#!/wiki/W167a02b2c1e2_4e32_9840_91e11c0c947a/page/Storage%20Spurts%20Volume%20142%20-%20HWSW%20Risk%20Scoring%20Dashboard

Double DDM failure process – USE THIS with PFE/L3 involvement if encountered -
https://w3-01.ibm.com/services/pram/assetDetail/generalDetails.faces?guid=50D8BA26-20CF-0281-97F7-04FEEB0E5A99

Technology Refresh (GTS EOL Guidance)

• Our overall strategy is that all storage technology should be refreshed at the 5 year mark.
  – Encourage Account Teams to enforce contractual agreements associated with technology refresh.

• The GTS Strategy is based on average usable age lifecycles as well as official product support dates.
These lifecycle times consider average refresh cycle times as well as the time required for data
migration, and typically begin before the product is out of support to avoid the risk of slow refresh periods
and to allow time for proper refresh planning and execution.
• New deployments of GTS EOL/EOS hardware are not authorized. (*Note: Exercise caution with GARS
purchases. As a general rule, devices that are End of Marketing/End of Sale will have limited useful
life and should not be purchased.)

• Account risk process to be used for GTS EOL/EOS devices that are still supported via TSS.
  – Risks are to be filed with the Account Team and/or Project Office, NOT the end-client.
  – Risks are owned by the account team; this is a wholly internal process. Although it is IBM internal,
    GTS accounts must follow it.
  – DS5X devices are in this category. They are technically supported via TSS; however, IBM Systems has
    marked them EOL and has ceased all code and security work (they do not even issue PSIRTs against
    this technology anymore).

• Client risk process to be used for TSS EOL/EOS devices
  – Risks are to be filed with the Account Team and/or Project Office and the end-client.

• It is the responsibility of the GEO SSLs to track and act on this guidance.

*Note: Delivery/Account teams should work to ensure proper risk mitigation plans are in place. This should
take the form of evaluating secondary controls for callhome and possibly adding SNMP alerting, SMTP email
alerts, engaging with local in-country TSS, manual health checking, updating account and client risks, and
any other possible approaches that mitigate the risks to accounts/customers.
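
The 5 year refresh mark can be tracked per device with a simple date calculation. This sketch assumes the mark is measured from the install date, which the slide does not state explicitly; the function names are hypothetical.

```python
from datetime import date


def refresh_due(installed, years=5):
    """Date on which a device reaches the refresh mark (assumed: install date + 5 years)."""
    try:
        return installed.replace(year=installed.year + years)
    except ValueError:
        # Installed on 29 February and the target year is not a leap year.
        return installed.replace(year=installed.year + years, day=28)


def past_refresh_mark(installed, today):
    return today >= refresh_due(installed)


print(refresh_due(date(2014, 4, 1)))
print(past_refresh_mark(date(2015, 1, 1), date(2019, 4, 15)))
```

Because the guidance says refresh planning should begin before the product is out of support, a real report would likely flag devices some months ahead of this date rather than on it.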

Technology Refresh (GTS EOS/EOL Guidance)

• Current focus needs to be on (see next slide for more details)

EOL/EOS Priority 1
• SHARK
• McData
• DS3K 1726&1814/DS4/DS6
• SVC models 4F2, 8F2, 8F4, 8A4, 8G4
• Nseries
• DS81/DS83/DS87
• 3494 tape libraries
• TS7700 Virtualization Engines: Models 3956, 3957. Types – CC6, CX6, CC7, CS7, VEA, V06

EOL/EOS Priority 2 (mix TSS and GTS EOS)
• 3590, LTO1 & LTO2 tape drives. All are TSS EOS.
• Flashsystem 710/720/810/820. Not TSS rather GTS EOL
• CISCO GEN1 MODULES AND GEN2 4G MODULES
• Brocade EOS/EOL (M48, 2498-B40, see next slide for all)
• XIV Gen2
• DS88

EOL/EOS Priority 3 (mix TSS and GTS EOS)
• DS5s. These are not TSS EOS rather GTS EOL/EOS

• EMC: Any remaining hardware officially listed as EOL/EOS
  Hardware: https://support.emc.com/docu47424_EMC_Hardware_Release_and_End_of_Service_Life_Notifications.xlsx?language=en_US&language=en_US
  Software: https://support.emc.com/docu47426_EMC_Software_Release_and_End_of_Service_Life_Notifications.xlsx?language=en_US&language=en_US
  Firmware: https://support.emc.com/docu47425_EMC_Firmware_Release_and_End_of_Service_Life_Notifications.xlsx?language=en_US&language=en_US
• Brocade EOL/EOS:
  https://my.brocade.com/wps/myportal/!ut/p/b1/04_Sj9S1sDQyNTezNDDRj9CPykssy0xPLMnMz0vMAfGjzOKd3BzDjE2MjQ39vbycDTzdXYJCLb18jQx8zIAKIoEKDHAARwNC-sP1o_ArMYEqwGOFn0d-bqp-blSOpaeuoyIA__aT2w!!/dl4/d5/L2dJQSEvUUt3QS80SmtFL1o2X0JGQVYzNDMzMU9KSkMwSUdEUlU5Sk0yMDcx/
• IBM Brocade EOL/EOS: See McData tab and Brocade EOL tab in code spreadsheet.
• Hitachi EOL/EOS: https://www.hds.com/assets/pdf/hitachi-data-systems-end-of-service-life-matrix.pdf
• IBM EOL/EOS Hardware: http://www-935.ibm.com/services/us/its/html/maintsvcwithdrawal.html
• NetApp - http://mysupport.netapp.com/info/web/ECMP1110975.html
• Cisco EOL/EOS: See Cisco EOL/EOS Tab in Code Spreadsheet

Technology Refresh (GTS EOL Guidance)


DS8100/DS8300 (all)
• i) Machine Type 2107 Models 921, 922, 92E, 931, 932, 9A2, 9AE and 9B2; ii) Machine Type 2421 Models
  931, 932, 92E, 9B2 and 9AE; iii) Machine Type 2422 Models 931, 932, 92E, 9B2 and 9AE; iv) Machine Type
  2423 Models 931, 932, 92E, 9B2 and 9AE; and v) Machine Type 2424 Models 931, 932, 92E, 9B2 and 9AE

DS8700 (all)
• 2421-941, 2421-94E, 2422-941, 2422-94E, 2423-941, 2423-94E, 2424-941, 2424-94E

DS8800 (TSS EOS 03/31/19)
• 2421-951, 2421-95E, 2422-951, 2422-95E, 2423-951, 2423-95E, 2424-951, 2424-95E

DS3K/DS4K/DS5K/DS6K (all)

SVC models 4F2, 8F2 and 8F4
• 2145-4F2
• 2145-8F2
• 2145-8F4
• 2145 - 8A4 & 8G4
• 2145 – CF8 and CG8 (TSS 07/01/2019)

FlashSystem
• IBM FlashSystem 710 (9830-AS1)
• IBM FlashSystem 810 (9830-AE1)
• IBM FlashSystem 720 (9831-AS2)
• IBM FlashSystem 720 (9832-XS1)
• IBM FlashSystem 720 (9832-XS2)
• IBM FlashSystem 820 (9831-AE2)
• IBM FlashSystem 820 (9832-XE1)
• IBM FlashSystem 820 (9832-XE2)

CISCO
• (IBM 2061-420) Cisco MDS 9020 Series Fabric Switch
• (IBM 2061-020) Cisco MDS 9120 Multilayer Fabric Switch
• (IBM 2061-040) Cisco MDS 9140 Intelligent Fabric Switch
• (IBM 2054-D1A\D1H, 2062 D1A\D1H) Cisco MDS 9216, 9216A, 9216i Multilayer Fabric Switch
• (IBM 2053-S34\434) Cisco MDS 9134 Multilayer Fabric Switch

Mcdata (all)
• IBM Branded models 2026, 2027, 2031, 2032
• McData Branded M6140, Mi10K, M3232, M6064, M4500

Brocade
• Withdrawn from sale and unsupported: 2005-5KB, 2005-B5K, 2005-B16, 2005-B32, 2005-B64, 2005-R04,
  2005-R18, 59Y1987, 59Y1993, FC3450, FC3850, 2005-H08, 2005-H16, 2109-F16, 2109-F32, 2109-M12, 2109-M14,
  2109-M48, 26K5601, 90P0165, 3534-F08, 2109-S08, 2109-S16, 3534-1RU, SW20X0, SW20X0, SW3000, SW3014,
  SW38X0, SW4012, 3758-L32, 2498-B40, 2498-B80, 2499-192, 2499-384
• Withdrawn from sale, still supported: 2498-E32*, 44X1921, 44X1920, 42C1828, 69Y1909, FC3870, FC3890,
  5410, 32R1813, 32R1812 (*2498-E32 goes EOS 11/30/2019)

Technology Refresh (GTS EOL Guidance) cont …


TS7700 Virtualization Engines
• Models 3956, 3957. Types – CC6, CX6, CC7, CS7, VEA, V06: End of Support June / July 2017 (with
  Exception of Japan)
• Models CC8, CS8, CX7, XS7 – EoS 12/31/18

3494 Tape Libraries (all)

3584 – model D32 and L32

Tape Drives LTO1/LTO2
• 3580-L1 LVD - 3580-L1 HVD - 3580-L1 Fibre
• 18P7270 - 3580 Ultrium 2 Tape Drive Model L23
• 18P7269 – 3580 Ultrium 2 Tape Drive Model H23

Nseries (all)

XIV Gen 2 EoS 12/31/18
• 2810 – A14
• 2812 – A14

Shark (all)
• ESS 2105

Netapp Models – EOS Already

Model Type | End of Availability | End of Support | No of system (each account)
5600 | 02/11/17 | TBA | 1
5660 | 11/24/17 | 01/08/18 | 6
FAS2020 | 05/06/11 | 06/30/16 | 7
FAS2040 | 11/02/12 | 12/31/17 | 31
FAS2050 | 05/06/11 | 06/30/16 | 2
FAS3140 | 02/03/12 | 03/31/17 | 16
FAS3170 | 02/03/12 | 03/31/17 | 6
FAS3270 | 05/07/12 | 06/30/17 | 29
FAS6080 | 03/09/12 | 04/30/17 | 6
V3140 | 02/03/12 | 03/31/17 | 3
V3170 | 02/03/12 | 03/31/17 | 2
VTL1400 | 09/12/10 | 10/31/15 | 13
VTL700 | 09/12/10 | 10/31/15 | 1

DELLEMC EOSL

• AVAMAR: GEN4
• CELERRA: NS120, NS20, NS4-120, NS4-480, NS480, NSG8, NX4
• DMX: SYMM6.0 DMX800, SYMM7.0 950 DMX-3, SYMM7.5 DMX4 424, DMX4-SYS24-3D
• VMAX: 10K
• CLARIION: AX100I, AX150I, AX150SCI, CX3-10C, CX3-20, CX3-20F, CX4-120, CX4-120C, CX4-240, CX4-240C,
  CX4-480, CX4-480C, CX4-960, CX4-960C, CX500, CX700
• CONNECTRIX (column as listed on the slide): AP-7600B, DS-5000B, DS-5100B, ED-48000B, DD610, DD630, DD880,
  ED-48000B, MDL1000
• RECOVERPOINT: GEN4
• VNXe: 3100

Technology Refresh (GTS EOL Guidance) cont …


Specific guidance for EOL/EOS
It is imperative that you as the account storage SME highlight the risks with EOL/EOS Hardware in your
environment to your account team (PE/DPE/SIL/etc).

The perception and expectation is that, regardless of whether a device is EOL/EOS with risks in place,
there will be prompt service from the vendor / 3rd party vendor and the Service Line Delivery Teams to
resolve any issue, and that "somehow" whatever is needed will be found via various escalations - skills, parts,
labor, etc... The reality of the situation is (as experienced first hand by an account this month):
· Extended maintenance contracts typically only cover break/fix on the hardware side (assuming
parts are still available). So any non-hardware event will have NO support. Escalations typically do
not work.
· L2/L3 development level skills needed for complicated device recovery actions are not available to
help with the event.
· There are no maintenance updates or fixes for device defects. If you hit a new bug, no fix is coming.
· There will be escalation delays while trying to find the right skills and/or parts within IBM or
Vendor teams (high probability skills may not exist anymore).
· CRITICAL: admins must do daily health checks of EOL tech, as call home likely no longer works.

Formally submitted Risks must be written and accepted by the account team (Executive Owner for account) and
client (Executive Owner within client for IBM relationship - ie: CIO/CTO/etc), and the risks must be clear that
data loss, performance issues and elongated outages are a strong possibility due to the use of
EOL/EOS hardware and software. You should NEVER have any critical data on EOL/EOS technology,
period. If EOL/EOS HW/SW is in use for critical data when the risk letter is written, the risk letter needs
to be accompanied by a migration plan to be approved at the same time as the risk letter by the same parties.
In many cases using newer hardware to optimize storage will provide a positive business case for all sides
and demonstrate IBM's ability to bring forward innovative solutions for our clients.

Used Hardware Reuse – Acceptable list


Spectrum Virtualize
 2145-DH8s, 2145-SV1s allowed for reuse
 V7K Gen2 & Gen2+ (2076-524 & 624) allowed for reuse

Flash Systems
 FS900s allowed for reuse
 All A9Ks allowed for reuse

XIVs
 XIV GEN3s that are less than 3 years old

DS8Ks
 DS888x

Brocade/IBM Branded GEN5s (16 Gb)


 2498-F96
 2498-F48
 2498-N96
 2498-X24
 2498-24G
 2499-416
 2499-816


"Out of Service" or "End of Service" Products


The following process will be followed when engaging uplevel STG Product Engineering (PE) / Product Field
Engineering (PFE) (L2/L3 Support) to provide support for any "Out of Service" or "End of Service" STG products.

- An "old" product is one that has gone through end-of-service (E10s, E20s, F20s, CVT, etc.). These are just examples and do not
constitute an all-inclusive list.
- Note: SO Teams should always open a PMR against a product, model type and serial number

Process Prime Shift:


 - When PMR comes in on an end-of-service product, PE/PFE will notify the 1st line PFE manager
 - 1st line PFE manager will send an email to the DPE for the SO client with their GEO Financial Focal CCed asking for the following
- Valid account code (Div / Major / Dept / Project)
- Written consent to be billed for support provided (Billing rate is the higher of the following: $5000 or $300/hour)
 - GEO Financial Focals will validate the account code provided and inform 1st line PFE manager of the same
 - 1st line manager will then notify PFE to engage in providing support
 - Upon completion of work effort, provide hours of support to GEO Financial Focal for recovery journal processing
Process Off-Shift:
 If the PMR is not a Sev 1, problem will be worked on the next business day and follow the process for Prime Shift
 If the PMR is Sev 1
- PFE will attempt to contact the 1st line PFE manager, if successful, the 1st line will send an email to the DPE for the SO client (w/GEO Financial
Focal on cc)
- If the 1st line cannot be contacted, the PFE will send an email to the DPE for the SO client (with the 1st line and GEO Financial Focal on cc)
- PFE will engage after either receiving confirmation from the 1st line PFE or written billing consent (rates above) from the DPE with the account
code (Div / Major / Dept / Project)
 Account code will be validated at the earliest possible opportunity by GEO Financial Focal
 Upon completion of work effort, provide hours of support to GEO Financial Focal for recovery journal processing
GEO Financial Focals:
– Americas: John Koster/Rochester/IBM
– EMEA/CMEA: Primary - Diana Schmidt/Germany/IBM Secondary - Anja Mueller/Germany/IBM
– AP: Primary - Rush Lu/China/IBM Secondary - Jun MJ Ma/China/IBM
– Japan: Jun Ashibe/Japan/IBM
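
The billing rule above ("the higher of the following: $5000 or $300/hour") can be read as a flat minimum compared against the hourly total. The sketch below assumes that reading; the function name and parameters are illustrative and not part of any IBM tool.

```python
def eos_support_charge(hours: float,
                       flat_minimum: float = 5000.0,
                       hourly_rate: float = 300.0) -> float:
    """Return the larger of the flat minimum and the hourly total.

    Assumes "the higher of $5000 or $300/hour" means max(5000, 300 * hours).
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return max(flat_minimum, hourly_rate * hours)


print(eos_support_charge(8))   # 8 h at $300 is $2,400, so the $5,000 minimum applies
print(eos_support_charge(20))  # 20 h at $300 is $6,000, which exceeds the minimum
```

Under this reading the hourly rate only matters once an engagement runs longer than about 16.7 hours, which is useful to state when asking a DPE for written billing consent.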


Windows Server 2003 EOS 7/14/2015


As part of best practices, we use tooling to help manage storage environments. As such, we have
storage owned servers that house our tools. This guidance is to ensure we understand current
EOS strategy for Windows and plan/coordinate upgrade prior to EOS date.

ISSUE
 Microsoft has announced that support for Windows Server 2003 will end on July 14, 2015
 http://support.microsoft.com/lifecycle/?c2=1163
 Storage Management applications may be running on Windows Server 2003 making problem resolution and
maintenance difficult
 Many SO accounts still utilize Windows Server 2003
 Storage vendors will not be including support for Windows Server 2003 in their new releases

Actions
 Develop action plans to ensure Storage Management Applications are running on an appropriate operating system
 Windows Server 2008 for upgrades from Windows Server 2003 https://ibm.biz/BdRhwe
 Windows Server 2012 for new installations
 Communicate with SO accounts that vendor support (drivers and problem resolution) of Windows Server 2003 storage
clients will be problematic after EOS
 Communicate with SO accounts that new Storage devices are unlikely to support Windows Server 2003

Windows 2008 goes EOL in 2020. Recommendation is to shift to higher version when opportunity arises.


ECA Section – all marked completed on 10/14/17

The next three slides cover ECAs. An ECA is an Engineering Change Authorization provided by
STG to help improve the reliability, availability and stability of storage technology.
While applying ECAs significantly improves quality, it does not completely remove risk.
These are the only ECAs we are officially tracking in GTS SO. If an SSR or TSS rep
provides other ECA recommendations, we encourage you to follow them. There are other
ECAs handled via STG and TSS that we do not track. More details on
these other ECAs are in backup.
If in any doubt or concern, reach out to Jim Olson or Karen Haberli with questions.

The best way to track devices specific to ECAs (ones needing/ones completed) is via HW/SW
Currency - https://hwsw.boulder.ibm.com:8443/hscms//welcome.pro
Under view inventory, use ECA scope and ECA number to track your
devices
Global Scorecard 2.x report will generate reports on all ECAs

Please work with your local SSR and use change management to get ECAs applied.

It is the responsibility of the GEO SSLs to track and manage their inventory and ECAs.


DS8K ECA1234 (Apache Struts) – DS8K ECA5678 (BASH)


Security Issue: BASH / ShellShock
 Severe security vulnerabilities were found in September 2014 for systems with any use of BASH (Bourne Again SHell). The
vulnerabilities are commonly referred to as the "Bash Bug" or "Shellshock". Every shipped DS8000 system is subject to
this vulnerability -- but only through its associated HMCs (Hardware Management Consoles). The vulnerabilities only apply
to the HMC and not the base DS8000 chassis. Not all access protocols have the same levels of vulnerability.
 See http://www-01.ibm.com/support/docview.wss?uid=ssg1S1004879
 The ICS disk for HMC(s) should be applied (ICS disk CVE_BASH_BUG_PATCH_v1.0). Application is non-disruptive.
Alternatively, bundle 64.36.103.0 (or higher) may be applied.

Security Issue: DS8700 GUI (which uses "Apache Struts") - Completed


 For most recent GUI versions, there are security issues with the GUI code which require remediation. As a workaround,
ensure that the DS8000 HMC is installed behind a firewall that limits access to the ports.
 See notification https://advisories.secintel.ibm.com/adv_database.php?adv_id=54697
 For DS8870, apply ICS Disk ICS_DSGUI_STRUTS_PATCH_v1.0 to each HMC. This is a non-disruptive operation which
does not impact production activity, nor does it disrupt any full-disk-encryption activity. It is normally completed within an
hour per HMC (applied sequentially). Any administrators currently active on the HMC should be notified in advance that
their activities may be interrupted while this ICS disk is being applied. This is NOT a ccl (NOT a firmware update to the
DS8000 CEC). It is very low risk to apply and zero impact on basic production operation. It should be applied as soon as
possible. A maintenance window is not explicitly required for this.
 For DS8700, apply firmware bundle 76.31.105.0 or later. For DS8800, apply firmware bundle 86.31.123.0 or later. The
Apache Struts exposure does not apply to DS8100/DS8300.

Net Summary – code levels call out fix level. See spreadsheet for actual level guidance.
 DS81/DS83 - not affected by Struts. Need BASH ICS applied or min code 64.36.103.0
 DS8700 - BASH ICS can be applied to fix BASH, but you need code level 76.31.105.0 for Struts. Or, 76.31.121.0 or above will
address both security issues
 DS8800 - BASH ICS can be applied to fix BASH, but you need code level 86.31.123.0 for Struts. Or, 86.31.142.0 or above will
address both security issues
 DS8870 - BASH ICS and STRUTS ICS can be applied. Or, 87.21.39.0 addresses both issues.
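
Checking an installed bundle against the Net Summary is easy to get wrong if levels are compared as plain strings. The sketch below compares them numerically, field by field. The table holds only the levels quoted on this slide; the function names are illustrative, and treating "at or above the listed level" as sufficient for DS8870 is an assumption, since the slide lists that level without "or above". The spreadsheet remains the source for actual level guidance.

```python
from typing import Tuple


def parse_level(level: str) -> Tuple[int, ...]:
    """Turn a dotted bundle level such as '76.31.105.0' into (76, 31, 105, 0)."""
    return tuple(int(part) for part in level.split("."))


# Lowest bundle the Net Summary lists as closing every issue that applies to the model.
# DS8100/DS8300 are not affected by Struts, so their entry covers BASH only.
MIN_BUNDLE = {
    "DS8100": "64.36.103.0",
    "DS8300": "64.36.103.0",
    "DS8700": "76.31.121.0",
    "DS8800": "86.31.142.0",
    "DS8870": "87.21.39.0",
}


def bundle_covers_issues(model: str, installed: str) -> bool:
    """True when the installed bundle is at or above the listed level for the model."""
    return parse_level(installed) >= parse_level(MIN_BUNDLE[model])


print(bundle_covers_issues("DS8700", "76.31.105.0"))  # Struts level only; BASH ICS still needed
print(bundle_covers_issues("DS8800", "86.31.142.0"))
# A string comparison misorders these two levels; the numeric comparison does not.
print("76.31.29.0" < "76.31.105.0", parse_level("76.31.29.0") < parse_level("76.31.105.0"))
```

A False result does not by itself mean the device is exposed, because the ICS disks can close the same issues without a bundle upgrade.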

DS8K ECA1111
Call home digital certificates on DS8800 and DS8870 will expire on August 1, 2018

Abstract: Call home requires a digital certificate to successfully report a problem to IBM. DS8800 and
DS8870 machine models will require a new call home certificate by August 1, 2018 to continue to
successfully call home. IBM will need to install a new call home certificate on the HMC(s). DS8880 is not
affected by the change in call home certificates.
Content: The DS8800 and DS8870 use Call Home Certificates when negotiating with the server to Call
Home to IBM. On August 1, 2018, these certificates expire and will need to be replaced with Digital Geo Trust
digital certificates. Installing new Digital Geo Trust digital certificates is a seamless and concurrent operation.
Exposure: Machines that do not have updated call home certificates by August 1st will no longer be able
to report problems to IBM.

Mitigation:
An ICS-CD, CSE_CallHomeCertificates_v1.0.iso, is available to apply the new call home certificate.
It can be installed by remote support or by an SSR. The certificate update
is concurrent and does not require an HMC reboot. Installation will require ~15 min per HMC for remote support
on DS8870 machines running 87.51.93.0 or higher. SSR installation will take 10 min per HMC.

If you have already installed CVE_1Q2018 or CVE_2017-1123 to address security concerns, there is no
need to install the CSE_CallHomeCertificates_v1.0.iso because the new certificate was included.

Exposed Microcode Levels: All current microcode levels for DS8800 and DS8870 products.

Fix: Contact your IBM Service Representative or call IBM Service to open a PMR to have the call home
certificate installed.


DS8K ECA413


Configuration Database and Analytics – Nadeem Malik owner/focal


https://hwsw.boulder.ibm.com:8443/hscms/welcome.pro - HW/SW Currency

 Overall strategy is to use a new product called HW/SW Currency for staging and validation of account based
firmware and inventory data from STG and TPC deployments that can then be eventually warehoused in
GACDW.
– Currently porting data from the STG “heartbeat” and “Call Home” databases for installed currency levels. Additional details
being pulled from SRM (for servers), Inventory and Security databases
– Plan to extend coverage to include STG SmartCare, TSS Tech Services Appliance, ISC Fulfilment, TPC, TAD4D, TEM, USI-
BRS, etc. for a comprehensive validation of account assets by triangulation
– Challenge here is to reliably filter SO assets from STG data. Until automated account feeds can be established to
validate SO assets, it is the SSL GEO teams' responsibility to maintain their current inventory in ProSliM. It is IMPERATIVE
that our storage administrators keep HW/SW Currency updated with accurate code versions. This should be done as
part of your day-to-day activities to ensure accurate information for Call Home, Code Levels, ECAs and more.

 Reporting, Analytics and Management System in HW/SW Currency for tracking compliance to code levels,
ECAs and EOL
– On demand analytics and push/pull reports on code levels, ECAs and historical trending
– Ongoing automated tracking of assets at the Account and Machine Type/Model and Serial number granularity for adherence
to minimum code and ECA patch levels
– Management system support for views by Delivery Center and Pools, Geographic Account hierarchy, Sectors and
configurable set of focus accounts
– Configurable rules for GTS SO Delivery operational policies for compliance with code levels

 Interoperability Analysis of interconnected server and storage systems


– Automated interoperability validation on code upgrades to check compatibility between server multipath drivers and storage
firmware by leveraging STG System Storage Interoperation Center (SSIC) for tested configurations
– Requires TPC rollout across the accounts to help with automated data collection for the topology and configuration of
devices connected to a storage system


Tech Alerting where we were and where we are today.

Where we were in 2012
 Sent via email
 Current repository not ideal
 Delay between alert sent and posting in Global clearing house.
 Search ability is dependent on each admin's email organization skills

Where we are today
 Repository is easy to navigate and has robust search functionality
 User community can subscribe to receive notification when alerts are posted.
 Email notification can be sent to the Community at the touch of a button to ensure all community
members are notified of urgent alerts. Notifications are sent to 800+ SMEs when alerts are posted.
 Robust search functionality within the Lotus Notes Community allows for search of all titles and
rich text for each alert.
 Easy to add users via Blue Group, Distribution lists, or Self Registration
 Easy to access and edit, Widespread audience
 Built in Comments Section allows feedback on posted alerts
 Maintains edit history
 Link to tech alert dbase below

https://w3-connections.ibm.com/communities/service/html/communityview?communityUuid=72dbfef0-8c6f-4dd5-9e51-738902fe1944

Interoperability – Overall Guidance


 Process (using SVC as an example)
– Storage admin contacts server admin and communicates upgrade plan for SVC to R6.2.0.5
– Server admin needs to login to each server and gather information
• HBA firmware levels
• Driver levels
• OS levels
• Multi-path software levels
• Multi-path health
– Server admin needs to check each level associated to SVC R6.2.0.5 on SSIC interop website ->
http://www-03.ibm.com/systems/support/storage/ssic/interoperability.wss for IBM gear or this for vendor gear ->
https://w3-connections.ibm.com/wikis/home?lang=en-us#!/wiki/SSOstorage/page/6.%20Storage%20Microcode%20%26%20Technology%20End%20of%20Life%20Information
– Manual tracking associated to all of this work is required
– Any code level that is below the minimum acceptable version for the interoperability check will
require an upgrade
– All servers that need upgrades require outages (except virtual which can be moved)
– 30 minutes per server is a good average to use for server remediation validation (interoperability check)
 It is the responsibility of the Storage Service Line to gain confirmation associated to server remediation completion
before proceeding to upgrade. You should never upgrade storage devices without this confirmation.
 Vendor interoperability info -> http://w3.tap.ibm.com/w3ki2/display/Storage/Storage+Interoperability+Matrix
 SVC Storage interoperability info -> http://www-01.ibm.com/support/docview.wss?uid=ssg1S1004111

*** Future plan is to have the storage team assist with this area. We are working with IBM Research to see if
there is a standalone version of HW/SW (aka ProSlim) to help with this interoperability work prior to account
deployments. ***
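
The manual check described above (gather levels per server, compare each against the minimum for the target SVC level, confirm multipath health, and budget 30 minutes per server) can be sketched as follows. The minimum levels, server names and component names are placeholders, not SSIC data, and the sketch assumes levels are purely numeric dotted strings, which real HBA firmware levels often are not.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def parse_level(level: str) -> Tuple[int, ...]:
    """Turn a dotted level such as '8.1.0' into (8, 1, 0) for numeric comparison."""
    return tuple(int(part) for part in level.split("."))


@dataclass
class ServerLevels:
    name: str
    levels: Dict[str, str]      # component name -> installed level
    multipath_healthy: bool


def components_below_minimum(server: ServerLevels, minimums: Dict[str, str]) -> List[str]:
    """List the components on this server that sit below the required minimum."""
    return [
        component
        for component, minimum in minimums.items()
        if parse_level(server.levels[component]) < parse_level(minimum)
    ]


def validation_hours(server_count: int, minutes_per_server: int = 30) -> float:
    """Planning estimate using the 30 minutes per server average."""
    return server_count * minutes_per_server / 60


# Placeholder minimums; real values come from the SSIC lookup for the target SVC level.
minimums = {"hba_firmware": "2.0.0", "driver": "8.1.0", "multipath": "1.4.2"}
servers = [
    ServerLevels("srv01", {"hba_firmware": "2.0.3", "driver": "8.1.0", "multipath": "1.4.2"}, True),
    ServerLevels("srv02", {"hba_firmware": "1.9.12", "driver": "8.1.0", "multipath": "1.4.0"}, True),
]

gaps = {s.name: components_below_minimum(s, minimums) for s in servers}
ready_to_upgrade = all(not g for g in gaps.values()) and all(s.multipath_healthy for s in servers)
print(gaps)
print(ready_to_upgrade, validation_hours(len(servers)))
```

Keeping the result per server gives the Storage Service Line a concrete list to confirm with the server team before any storage upgrade proceeds.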

Interoperability – Spectrum Virtualize (SVC) Upgrade Guidance

 New SVC Interop Guidance (confirmed with development):


– If you are running in a supported configuration at SVC level X and upgrade to level Y then you are
still supported and can expect correct operation unless there are specific exceptions highlighted on
the interop website.
– This is a statement about SVC firmware updates. This is not a statement about
HBA/driver/multipath/OS or switch firmware or storage array updates.

 GTS Delivery policy:


– Always check the SVC interop website and SVC release notes for any exceptions that may affect a
planned update. Links on previous slide for SSIC.
– Always follow SVC upgrade best practices. Links and guidance on slide 5.
– Even if no exceptions are listed, plan on regular HBA firmware / driver / multipath compatibility
checks and updates (Server team)
– SVC Firmware Change Only: If only planning to update SVC firmware, then broader checks on
switch fabric or storage firmware levels, or host levels or HBA firmware/driver/multipath are only
required to confirm host side kit (server) is not EOS/EOL.
– Confirm multi-pathing active (server) and working prior to SVC firmware update. This policy is
NOT being changed.
– SAN and DISK: Interoperability checking and remediation still required.
– Net: Interop checking is not required host side (server) if you follow all of the above guidance.

26 July 2013

Storage Automation Tool (SAT) – Required on all accounts


Overview
 Storage Automation Tool provides:
 Pre Go-Live health checking of storage devices (TPC HC)
 Automated health checking on an ongoing basis in production
 A priority-ordered list of best practice findings showing items that need to be addressed
 Data collection for centralized reporting and analytics
 Runs either stand-alone or with TPC
 Available from: http://bldgsa.ibm.com/projects/s/storage_automation/sat/Downloads.html

Objective
 Install SAT at all SO Accounts
 Configure SAT with appropriate SMTP infrastructure and admin email addresses
 Add all existing storage devices to SAT for routine health checking
 Perform a Pre Go-Live Health Check prior to production on any new storage devices coming into the environment
 Maintain SAT at current recommended version. See code strategy spreadsheet for acceptable levels. Spreadsheet can be
found at: http://ibm.biz/BdrACA
Actions
 If SAT is not yet installed, download and install it from the link above
 Check at least once per quarter for new SAT version releases
 New versions of SAT are generally released at the end of each quarter
 Install the latest SAT version for new installations.
 Existing SAT installations may continue to run the prior quarter’s version of SAT or may install the latest version.
 You must install the latest SAT version if the version you are running is more than 2 levels behind the latest
 Most Important: Check output from SAT daily, then document and drive a plan to remediate findings in priority order:
Fatal, then Critical, then Major, then Minor errors. Ensure issues and actions are shared/tracked with the account team (SIL
as an example).
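
Two of the rules above lend themselves to a small sketch: ordering findings Fatal, Critical, Major, Minor, and deciding when an installation is more than 2 levels behind the latest release. The finding records and release names below are hypothetical; SAT's actual output format is not described on this slide.

```python
from typing import Dict, List

SEVERITY_ORDER = ["Fatal", "Critical", "Major", "Minor"]


def remediation_order(findings: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort findings by severity; sorted() is stable, so ties keep their input order."""
    rank = {severity: position for position, severity in enumerate(SEVERITY_ORDER)}
    return sorted(findings, key=lambda finding: rank[finding["severity"]])


def must_upgrade(releases: List[str], installed: str) -> bool:
    """True when the installed release is more than 2 levels behind the latest.

    `releases` is ordered oldest to newest and must contain `installed`.
    """
    levels_behind = len(releases) - 1 - releases.index(installed)
    return levels_behind > 2


findings = [
    {"device": "array-b", "severity": "Minor"},
    {"device": "switch-a", "severity": "Fatal"},
    {"device": "array-a", "severity": "Major"},
]
releases = ["r1", "r2", "r3", "r4"]  # hypothetical release names, oldest first

print([f["device"] for f in remediation_order(findings)])
print(must_upgrade(releases, "r1"), must_upgrade(releases, "r3"))
```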

SAT - Storage Dashboard via HW/SW Currency


Storage Dashboard:
– Central repository for global storage inventory and global best practice position
– HW/SW Currency: Storage Inventory and Firmware data
– Storage Automation: Configuration Health Check and Problem Determination data
– Storage Capacity and Utilization
– Interface with GACDW (NA only) – feed to it from HW/SW Currency
– Admin responsibility: MUST Implement (see below)

Dashboard value:
– Measure/report on device configuration Best Practices
– Measure/report on new device configuration problems identified
– Measure/report on device configuration problems solved
– Measure/report on device configuration issues not being resolved
– Use these measures to drive quality initiatives
– Global Analytics for exec reporting (2015 plan)
– Auto-email notification to DPEs for non-compliance (2015 plan)

To be seen in Dashboard, account must have SAT installed & configured:


– Download and follow TPCHC user guide, located at
http://bldgsa.ibm.com/projects/s/storage_automation/sat/TPC.html
– 24 hours after TPCHC is installed & configured, account data will show up in Dashboard

Call Home – Strategy http://ibm.biz/BdrAQL


Priority 1 – SVC & FlashSystem, DS8, and XIV; Priority 2 – Hydra; Priority 3 – N/A; Priority 4 – CISCO and Brocade

 All storage devices eligible for calling home need to be configured to call home. Exceptions should be
limited only to clients that do not allow this capability. Each exception must be recorded in HW/SW, noting
that the device is not allowed to call home and who from the account team approved the exception.
 The purpose of the Call-Home guideline is to ensure that all devices (in scope) operated by GTS SO
Delivery
– have call-home correctly set up and tested
– have verified functionality at least once a year (during mandatory code upgrade)
– have added an appropriate identifier (“<some account name> (SO ACCOUNT)”)
 Identify systems that never call home and have no heart beat
 Strategy is to develop guidelines for the in-scope devices
– add a field descriptor to all arrays
– verify the information added gets through to call-home and is added to PMR
– identify storage arrays for SO accounts in PMRs with searchable descriptor
– feed the information into existing centralized reporting (HW/SW Currency)
 Any device not being allowed to call home MUST be configured for SNMP alerting to SYSOPs.
 A RISK needs to be written for any device that is not allowed to call home for a reason other than a client
security issue. As an example, if someone states it’s a funding issue, then a RISK needs to be written.
Specifically use the account risk process (not CIRATs).
 Strategy will be to include ‘SO ACCOUNT’ in a specific field for call home so that we will be able to
search the call home databases in the future for SO hardware. See link for best practices for this entire
process.
 Closed loop process via auto-email to DPEs is strategy (via HW/SW Currency). Deployed for XIV and
DS8 currently. SVC and FlashSystem coming 1Q2015.
 Overall strategy is to leverage VPN vs modem. Modem support is going EOL within next 12 months.
 Key item for teams responsible for storage code upgrades: the strategy is to use HW/SW Currency to
validate heartbeat during the code upgrade process. If the heartbeat is not current (30 days old or less),
call home must be tested to validate that it is working, and any problem found must be resolved so that
both heartbeat and call home are working.
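
The 30-day heartbeat rule can be expressed as a simple date check. The sketch below assumes the last heartbeat date per device has already been exported from HW/SW Currency into a dictionary; that export step, the serial numbers and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def heartbeat_is_current(last_heartbeat: date, today: date, max_age_days: int = 30) -> bool:
    """A heartbeat counts as current when it is 30 days old or less."""
    return today - last_heartbeat <= timedelta(days=max_age_days)


def needs_call_home_test(heartbeats: Dict[str, Optional[date]], today: date) -> List[str]:
    """Return serials whose heartbeat is stale or missing (None means never seen)."""
    return sorted(
        serial
        for serial, last in heartbeats.items()
        if last is None or not heartbeat_is_current(last, today)
    )


today = date(2015, 3, 31)
heartbeats = {
    "75ABC01": date(2015, 3, 20),   # 11 days old, current
    "75ABC02": date(2015, 1, 5),    # stale
    "78XYZ09": None,                # never called home
}
print(needs_call_home_test(heartbeats, today))
```

Devices returned by the check are the ones where call home should be tested as part of the code upgrade.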

Call Home – DS81/DS83/DS87/DS88 updates

• DS8100/DS8300 Call Home Function has been deemed unreliable by TSS. Storage teams
should ensure SMTP based email alerting, SNMP based event alerting, and daily health
checks are run on these boxes until they are decommissioned!
• DS8700 Call Home Function can be at risk, unless you are at code level
76.31.29.0 or higher. Please test the call home on your DS8700 devices to
ensure it’s working. You should have SMTP based email alerting, and SNMP
based event alerting functional.
• DS8800 Call Home Function can be at risk, unless you are at code level
86.31.11.0 or higher. You should have SMTP based email alerting and
SNMP based event alerting functional.


SAN/Storage Clock Synchronization – Strategy


http://ibm.biz/Bd4ict

 It is extremely important that every device in the data center writes log entries that are
time synchronized, so that log files / events can be correlated between devices. If
the devices are not working from a common time source, we need to calculate the deltas
(differences) between each device's time stamps. This is a manual process that can lead to bad
conclusions about which device was responsible for a particular issue. The solution that prevents
this mix-up is a common clock source using the Network Time Protocol (NTP).
 The purpose of this guideline is to ensure that all devices (in scope) operated by GTS SO
Delivery
– have a common clock source correctly set up and tested
– use a common offset to UTC (Coordinated Universal Time) that needs to be defined per
account and per device
– have verified functionality at least once a year (during mandatory code upgrade)
– have a common clock source per datacenter
– ideally, have servers use the same synchronized source as well

 Details about NTP can be found at http://www.ntp.org/

 Application job runtimes will not be affected, as those use system clocks. This is a hardware
clock discussion. However, we should still check jobs, as things like post processing could be
affected.
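
When devices log in local time with different offsets, events have to be normalized to UTC before they can be ordered. The sketch below shows that normalization and the delta calculation using only the Python standard library; the timestamps, offsets and device roles are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_stamp: str, utc_offset_hours: float) -> datetime:
    """Convert a device-local 'YYYY-MM-DD HH:MM:SS' stamp to UTC using its known offset."""
    device_zone = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(local_stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=device_zone)
    return local.astimezone(timezone.utc)


def delta_seconds(later: datetime, earlier: datetime) -> float:
    """Seconds between two normalized events."""
    return (later - earlier).total_seconds()


# A switch logging at UTC-5 and an array logging at UTC+1 (illustrative values).
switch_event = to_utc("2015-03-01 10:00:05", -5)
array_event = to_utc("2015-03-01 16:00:00", +1)

print(array_event.isoformat(), switch_event.isoformat())
print(delta_seconds(switch_event, array_event))  # array event came first, 5 seconds earlier
```

This only corrects for known offsets. Clock drift on a device without NTP still shifts its events, which is why a common clock source is the actual fix.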


Entitlement – AG only (rest of global T&M)


ISSUE
 All hardware in AG requires software and hardware maintenance (called entitlement)
 This is an account team responsibility. DPE owns this.
 Many SO accounts have maintenance contract gaps causing significant delays in problem resolution during outages
– Support teams are unable to work through product entitlement when maintenance is not in place.
– 16 documented cases in a 12-month analysis period where multiple hours or even days have been added to
resolution times
– Significant impact on SLAs and customer satisfaction
– This issue applies to both IBM and third party hardware

Objective
 Perform analyses on all SO accounts to identify maintenance contract gaps
 Develop action plans to address gaps and/or put risk letters in place for potential SLA penalties
 Develop closed loop process to ensure maintenance is set up properly at engagement and maintained throughout the
contract lifecycle
 Partner with the NSO to ensure their maintenance contract offering has deeper penetration across SO accounts

Actions
 Analysis work completed to show that only 31% of existing SO accounts are utilizing the NSO Offering
 Executive sponsorship garnered to communicate the criticality of this issue to sectors
 Approached accounts about maintenance gaps through Account Health Dashboards
 Approached Framework blue development to have NSO offering included in engagement cost cases
 Working with NSO to have the offering presented to industries through staff meetings, newsletters and presentations
 Working with account teams to have documented risk letters in place when gaps are identified
 The goal for 2019 is to have all applicable accounts retrofitted with the offering and to include it in all new engagements

Storage Lifecycle Management – Priority Summary


Priority 1
– Microcode/Firmware (Slide 8): SVC & Storwize (V7K, V5K, V3K); FlashSystem (A9K and FS)
– Technology Refresh (Slide 18): SHARK; McData; DS3K/DS4/DS6; SVC models 4F2, 8F2, 8F4, 8A4, 8G4; Nseries;
DS81/DS83/DS87; 3494 tape libraries; TS7700 Virtualization Engines: Models 3956, 3957. Types – CC6, CX6, CC7, CS7, VEA, V06
– Call Home (Slide 29): SVC; FlashSystem; DS8K; XIV

Priority 2
– Microcode/Firmware (Slide 8): Brocade SAN switches
– Technology Refresh (Slide 18): 3590, LTO1 & LTO2 tape drives. All are TSS EOS; Flashsystem 710/720/810/820. Not TSS rather GTS EOL;
CISCO GEN1 MODULES AND GEN2 4G MODULES; Brocade EOS/EOL (M48, 2498-B40, see next slide for all); XIV Gen2; DS88
– Call Home (Slide 29): Hydra

Priority 3
– Microcode/Firmware (Slide 8): XIV, DS8
– Technology Refresh (Slide 18): DS5s. These are not TSS EOS rather GTS EOL/EOS; Any remaining hardware officially listed as EOL/EOS;
SVC 2145-CF8 & CG8 – TSS EOL 07/01/19
– Call Home (Slide 29): N/A

Priority 4
– Microcode/Firmware (Slide 8): Remainder
– Technology Refresh (Slide 18): N/A
– Call Home (Slide 29): Cisco (1Q start); Brocade (1Q start)

See Global Scorecard for reporting

Backup


Changes
Changes | Date | Author | Why are we doing?
Reprioritized priorities on page 6 | 06/05/2013 | Jim Olson | We have made a lot of progress with DS8 and XIV and need more focus on Brocade (due to code quality issues)
DS8 ECA870 (RAID10) added to page 9 | 06/05/2013 | Jim Olson | Code level was already a part of our strategy. STG decided they wanted to track via an ECA.
XIV ECA306 added (page 11) | 06/05/2013 | Jim Olson | STG asked to push this ECA.
DS6K added to tech refresh priority page 14 | 06/05/2013 | Jim Olson | Parts limitation.
Code updates. Key updates to DS8 and XIV. | 06/05/2013 | Jim Olson |
URL updated page 1 | 06/05/2013 | Jim Olson | Hyperlink as iRAM has had link issues.
Added page 22 with XIV ECA information | 06/05/2013 | Jim Olson | So team can see all XIV ECAs.
Added changes slide | 06/05/2013 | Jim Olson | Mike and Karen request. Good idea.
ECA867 added (RAID10) | 06/11/2013 | Jim Olson | Were already doing under RAID10 guidance
Changed min on DS8800 ECA867 to 86.20.130 from 86.20.114 | 08/01/2013 | Jim Olson | Per STG Guidance
Included new SVC interop strategy | 08/27/2013 | Jim Olson, Mark Chitti, Kirby Dahman | Included new SVC interop strategy via Hursley guidance
ECA876 added | 08/12/2013 | Jim Olson | Added data protection per STG. Low box count medium priority.
Added TS77XX to code strategy | 09/15/2013 | Jim Olson, Charlie Hayden | Needed
Added new call home strategy | 11/12/2013 | Jim Olson, Thomas Brachahn | Improving our call home position for SVC, DS8 and XIV
Updated EOL/EOS tech page 16 | 01/22/2014 | Jim Olson | EOL tech
Added page 17 | 03/22/2014 | |
Updated ECA306 XIV slide with code levels where fix is | 02/07/2014 | Jim Olson | Fixed at later levels
Added in FlashSystem pre-code upgrade | 04/04/2014 | Jim Olson | New process

Changes
Changes Date Author Why are we doing?
Added slide 19 for Window 2003 EOS 04/25/2014 Jim Olson/Dave Schustek We have servers that run our tools.
Important we ensure we are running on
supported OS’s.

Updated EOL/EOS tech page 16 04/25/2014 Jim Olson Bent Braum Holst asked for clarification
Added page 17 on Brocade EOL.

Slide 21 added for Heartbleed 05/01/2014 Jim/Karen/Mark Security issue


Added bullet to slide 6 associated to 05/27/2014 Jim/Karen/Keith/Mark Improve success of code upgrades.
reboots for SVC prior to code upgrades.
Aged based.

Added SVC upgrade process link to page 06/03/2014 Jim/Karen/Chuck Needed


5
Added page 27 associated to time 06/15/2014 Leandro Torolho Global process developed.
synchronization best practices for storage
Added page 7. Nseries and Netapp 06/22/2014 Glendon Lowder New process to ensure successful code
upgrade best practices upgrades for Netapp and Nseries.
Modified DS8 EOL to include age. 06/24/2014 Jim Olson/Rich Oubre Data Loss and outage prevention
Modified DS8 EOL – moved to 7 years 07/14/2014 Mark Chitti/Stanley Wood Global ACB believed 7 years was better
from 8 per Global ACB discussion than 8. Going thru review process.
Moved ECA306 to backup as old and 08/06/2014 V72 Jim Olson/Raul Estrada
lower priority now.
Added Firmware Implementation 08/15/2014 V73 Alan Skinner Added compliance slide per recent
Timelines for Security Alerts slide security work stream.

DS8K ECA876 moved to backup 08/25/2014 V74 Jim Olson Low priority that has been out for some
time.
TS7700 tape unit ECA009 added 08/25/2014 V74 Jim Olson Per STG Guidance.
Added some CISCO EOL devices to page 09/10/2014 V75 Jim Olson/Karen Haberli New STG announcement
17
Added cover page for ECA section 09/12/2014 V76 Jim Olson Better way to address ECA section
Added ECA899 09/12/2014 V76 Jim Olson New ECA for DS8 600GB drives

Changes
Changes Date Author Why are we doing?
Updated Call Home slide (currently 09/16/2014 V76 Jim Olson STG Direction
page 27). Modem going EOL.

Updated bottom of page 4 to remove tape 09/23/2014 V78 Jim Olson Way too difficult to manage.
drives from code management strategy.
Recommendation is to follow CE/SSR
guidance for tape drives

DS8K Apache Struts and BASH added into 10/22/2014 V79 Jim Olson/Kirby Dahman Added to HW/SW so we can now track
ECA section (ECA1234 and ECA5678)
Minor update on page 8 12/03/2014 V81 Jim Olson Added more specifics on ECA

Added slide for Storage Automation 12/04/2014 V82 Jim Olson, Jason, Stanley Needed

Security section added. Old page 9 01/07/2015 V83 Jim Olson, Rodney Results from work with Security teams in
removed and now we have 9 & 10 Mulrooney, Alan Skinner IBM
DS5K added to Tech Refresh Section/GTS 01/07/2015 V84 Global Design Authority Major CIEs in 2014
EOL
Added slide 30 on DashBoard 01/15/2015 V84 Jason, Stanley, Jim T&I Strategy in 2015
Some tuning to security slides 9 & 10 01/27/2015 V84 Rod, Jim Per SARM

Tweaked slide 20. Tech did not change just 02/12/2015 V85 Jim Partnering with TSS.
verbiage so no SSL ACB review.
Minor verbiage changes to slides 9 & 10. 02/12/2015 V85 Steve Biles

Minor tuning to 21/22 for CISCO EOL 02/17/2015 V86 Jim/Art Scrimo More kit going EOS

Tweaked model numbers on EOL slide 03/01/2015 V87 Jim Clearer


Tuned priorities on slides 8, 20 and 32 03/31/2015 V88 Jim/DA Needed tuning

Updated ECA section. Marked complete all 04/03/2015 V89 Jim/IBM Systems/Rich Enough done on older ones to mark
but ECA826, ECA899 and ECA009 Oubre complete


Changes
Changes Date Author Why are we doing?
Updated slide 8. Moved FlashSystem up to 04/07/2015 V90 Jim Olson Total Data loss on one account and major
priority #1 due to major code bugs as of code bugs (timer issues).
late.

Added slide 15 for SVC Global Mirror issue 04/13/2015 V91 Patrick Keyes Critical bug
Added slide 32 to summarize current 04/27/2015 V91 Jim/Ken Morgan Good idea
priorities
Updated slide 10 to clarify differences 05/11/2015 V91 Jim/Steve Biles Security guidance
between storage code and SW like TPC
SVC interop guidance changed 05/29/2015 V92 Jim/Hursley/DA approved Easier planning for SVC upgrades
Updated slide 19 to include Brocade kit 08/07/2015 V93 Jim/Kirby M48 marked EOL
Updated EMC with new EOL links 09/25/2015 V94 Jim/Karen New EOL links
Added V7000 HDD issue 10/01/2015 V95 David Schustek New high impact tech alert
Updated EMC EOL links on slide 20 10/20/2015 V96 Francesco/Karen Better links
Made some updates to slide 20 for CISCO 11/09/2015 V96 Lyle Ramsey New info
Removed ECA826 and ECA899 due to 11/10/2015 V97 Jim Olson Age
age. Slides moved to backup

Added slide 38 in backup. Calls out all new 11/10/2015 V97 Jim Olson/Keith Williams Systems directive
DS8 ECAs. Blue GTS chases while rest
PFE and Systems will chase

Slide 12 and 13 now reflect two new ECAs 11/10/2015 V97 Jim Olson/Keith Williams Systems directive
we will chase for DS8 – ECA714 and
ECA715

ECA826 and ECA009 moved to backup. 11/30/2015 V97 Jim Olson Aged

ECA021 added for XIV 11/30/2015 V97 Patrick Keyes New one to chase per Systems

Update on slide 5 related to concurrent 12/05/2015 V97 Jim Olson Crit. Lesson learned
upgrades on SVC clusters


Changes

Changes Date Author Why are we doing?

Added slide 8. Overview of roles and 02/01/2016 V98 Ken Morgan/Alan Skinner Needed
responsibilities.

Completed ECA section (714). Added in 03/01/2016 V99 Jim Olson/Charlie Hayden New DS8 ECAs
two more ECAs as well. See ECA section.

Minor tweaking 03/08/2016 V100 Jim Olson

Moved DS5s to priority 3 03/14/2016 V101 Jim Olson DS5s are not TSS EOL/EOS so reducing
priority due to some pushback

Updated slide 22 Cisco EOS\EOL section 03/29/2016 V103 Lyle Ramsey Some devices listed were incorrect

FlashSystems added to EOL. See slides 06/01/2016 V103 Jim Olson Old and Not many..
21/22.

Updated slide 11 with new Risk Mgmt 06/04/2016 V103 Steve Biles Needed
process

Updated slide 31 & 34 XIV and DS8k to 06/27/2016 V103 Ken Morgan Extra focus on DS8K and XIV
Priority 1 for callhome/heartbeat Callhome/Heartbeat

Changes

Changes Date Author Why are we doing?

3494 tape libraries added to EOL 11/30/2016 2Q2016 V2 Jim Olson Aged. Official EOL 1/31/2017

Added in a few new Nseries for EOL Ditto Ditto ditto

Added TS7700 EOS data (page 21 & 23) 1/18/17 Karen EOS data provided for Geo’s

Added DS8700 1/30/2017 Ken Morgan EOS NA Dec 31, 2017, other IOT’s to be
announced soon per systems

Added Storwize (V7, V5, V3) with SVC as 1/30/2017 Ken Morgan Improved clarity Storwize not just SVC
priority 1 code

Updated some Brocade EOL kit 2/14/2017 Clancy Obrien New info

Updated slide 7 with new SVC Storwize 5/1/2017 Jim Olson New info
disk drive upgrade process

Updated slide 5 on release notes and 5/14/2017 Jim Olson New Info
usage for code upgrades

Added Nseries to EOL priority 2 (list) 5/14/2017 2Q17 Final V2 Jim Olson Needed
Added slide 24 for EOL

Added to code upgrade guidance for quiet 6/12/2017 Jim/DA Post outage at EMEA account
time and potential impact to slide 8

Move FS 710/720/810/820 to priority 2 9/11/2017 Jim/DA New full data loss events

Changes

Changes Date Author Why are we doing?

Marked all Nseries EOL 10/2/2017 Jim All are TSS EOL in 8 months and we
have had part and support issues. So
giving teams time to vacate before official
TSS beyond parts and support issues

Moved all ECAs to backup as they are 10/13/2017 4Q17 final3 Jim Old and considered done per support
deemed old and completed per support

Added call home details for DS8s 10/13/2017 Karen/Patrick Worked with Systems

Added links to security slides to process 01/15/2018 Jim Needed

Added XIV Gen2, Additional Hydras, 3/15/18 Karen Going EOS 12/31/18
Tape Drives & SVC models to EOS

Updated links in microcode upgrade 3/15/18 Karen Links outdated.
info – pages 4-8

Added ECA1111 for DS8 call home issue 05/09/18 Jim/Ken IBM Systems Guidance

Added ECA413 same same same

Added 3584 07/22/18 Jim EOL

Clarified v7k and Brocade on reuse slide 9/6/18 Dave Missing details

Added in Netapp & DS8800 EOL & DS3K 10/4/18 Santnana/Karen Was missing

SVC 2145 CF8 and CG8 12/05/18 Jim TSS EOL/EOS 07/19

Changes

Changes Date Author Why are we doing?

Added slide with guidance on problematic 01/31/2019 Jim Outages and Data loss events
drives.

Updated slide 18 with EMC EOL 02/26/2019 Lyle Ramsey


Logic behind our code strategy

Our storage code strategy is to maintain a range of acceptable versions for many reasons...

- All versions of code have additional fixes, as part of continuous improvement process
(addressing past defects that all code versions have)
- We balance the value of those included fixes vs the risk of a newly released code (running
bleeding edge code has risk)
- Code upgrades are not always 100% successful, and many that fail result in an
outage (the success rate for DS8K is 99.9%)
- Coordinating large change windows is a challenge. DS8K code upgrades are very time
consuming; most run 6 plus hours per device.
- There is significant work across the entire account when doing code upgrades (many
towers involved)

As such, we try hard to keep 1 year of acceptable code versions (in partnership with
Systems/vendors). By design, the latest published version of code does not become a
recommended level for 3 to 4 months.
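
The aging rule above can be sketched in Python. This is a minimal illustration only, not a tool the team uses: the function name, the 105-day soak period (an assumed midpoint of "3 to 4 months") and the 365-day window are assumptions made for the example. Real minimum and target levels are agreed with Systems/vendors and are not set by age alone.

```python
from datetime import date

# Hypothetical thresholds for illustration only.
SOAK_DAYS = 105        # assumed midpoint of the "3 to 4 months" soak period
ACCEPTABLE_DAYS = 365  # "1 year of acceptable code versions"


def classify_code_level(released: date, today: date) -> str:
    """Classify a code level by its age under the assumed thresholds."""
    age_days = (today - released).days
    if age_days < 0:
        raise ValueError("release date is in the future")
    if age_days < SOAK_DAYS:
        return "too new"      # published, but not yet eligible to be recommended
    if age_days <= ACCEPTABLE_DAYS:
        return "acceptable"   # inside the roughly one-year acceptable window
    return "aged out"         # older than the acceptable window


if __name__ == "__main__":
    today = date(2019, 4, 1)
    for released in (date(2019, 2, 1), date(2018, 9, 1), date(2017, 12, 1)):
        print(released, classify_code_level(released, today))
```

The two constants carry the whole policy, so a different soak period or window only needs a change in one place.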


DS8K ECA714 – marked completed on 10/14/17


ABSTRACT: ECA 714 - Issue with Viper C 600GB 15k RPM DDMs failing due to pivot bearing
outgassing.

Summary:
ECA 714 is a mandatory FBM. Its purpose is to provide drive firmware that detects and rejects
DDMs exhibiting pivot bearing outgassing conditions, which could lead to multiple drive failures and
cause loss of access or data loss. The Viper C 600 GB 15k DDM is a large form factor (LFF, 3.5 in)
drive that shipped in the DS8700 system. To minimize the impact of this issue, SSRs must order and
apply ICS CD level DS8k_DDM_SSD_FW_Update_v1.10.iso to the affected machines in the machine
list. The ICS CD can be ordered from the Super Shippers DB or downloaded from Fix Central with
the CDA4TP tool at the link below:
https://port.rchland.ibm.com/support/fixcentral/ac/options

Checkpoint:
Use the HMC to check the drive FirmwareLevel for ddmFamily H5FH. If the FirmwareLevel is F811 or
higher, no further action is required. Otherwise, obtain ICS CD
DS8k_DDM_SSD_FW_Update_v1.10.iso and install ECA 714.

How long does installing the ICS CD take and what if it fails?
It takes approximately 15 minutes to load the ICS CD into the HMC, and the firmware update then
starts automatically in the background. If the firmware update fails for any reason, the machine will
call home to notify IBM service personnel so that we can take immediate action to correct the
problem.

DS8K ECA737 – marked completed on 10/14/17


ABSTRACT: ECA 737 - IBM released microcode with improved error handling for DS8870 High
Performance Flash Enclosure (HPFE) flash drive errors.

Summary:

IBM has developed microcode enhancements for error handling in HPFE. DS8870 (R7.5 SP2.3 )
87.51.23.4 microcode levels contain the changes to improve the reliability and availability of High
Performance Flash Enclosures. This enhancement streamlines error detection and isolation when a
failing Flash Drive exhibits excessive errors.

This change is designed to be concurrently installable on DS8870 presently running the R7.x
families of microcode.

Recommendation: A mandatory ECA 737 is being released to the field. We recommend upgrading to
code bundle R7.5 SP2.3 87.51.23.4 as soon as possible.


DS8K ECA712 – marked completed on 10/14/17


ABSTRACT: ECA 712 - Global Mirror suspends caused by a microcode logic error introduced in
R7.4 that results in a Track Format Descriptor mismatch.

Summary:

DS8870 Global Mirror suspends caused by a microcode logic error introduced in R7.4 that results in
a Track Format Descriptor mismatch. Microcode is improperly setting a flag in a PPRC control block.
This problem is pervasive in Global Mirror environments. R7.4 code levels below 87.41.44.0 and
R7.5 levels below 87.51.23.4 are exposed to this issue.
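
The exposure rule above can be checked mechanically when reviewing an inventory list. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the tens digit of the second bundle field identifies the release family (87.4x = R7.4, 87.5x = R7.5). It looks only at the code level and does not check whether Global Mirror is in use.

```python
# Fixed bundle per DS8870 release family, taken from the summary above.
# Key: tens digit of the second bundle field (assumed to identify the family).
FIXED_LEVELS = {
    4: (87, 41, 44, 0),  # R7.4
    5: (87, 51, 23, 4),  # R7.5
}


def parse_bundle(bundle: str) -> tuple:
    """Turn '87.41.17.0' into (87, 41, 17, 0) so levels compare numerically."""
    parts = tuple(int(p) for p in bundle.strip().split("."))
    if len(parts) != 4:
        raise ValueError(f"expected four dotted fields, got {bundle!r}")
    return parts


def exposed_to_eca712(bundle: str) -> bool:
    """True if the bundle is an R7.4 or R7.5 level below its fixed level."""
    level = parse_bundle(bundle)
    if level[0] != 87:  # not a DS8870 bundle
        return False
    fixed = FIXED_LEVELS.get(level[1] // 10)
    # Families not listed (for example R7.3 and earlier) predate the defect.
    return fixed is not None and level < fixed
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "87.41.9.0" sorting after "87.41.44.0".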

Recommendation:

A mandatory ECA 712 is being released to the field. We recommend upgrading to code bundle R7.5
87.51.23.4.


XIV ECA021 – GEN3 Seagate Drive Firmware Issue


Tracking is owned by Storage Service Line GEO leaders once lists are produced
• Abstract: There is a firmware issue on 4 specific models of Seagate disk drives that can be installed in XIV GEN3s. IBM
support knows which XIVs are affected as long as they call home. The possible impact is a data corruption risk when the
disk drives detect unreadable data (in rare cases Development/Support is unable to determine which partition contains the
correct data). IBM will work to apply a non-disruptive disk drive firmware update, which should take around 2 hours
depending on the number of drives that need to be addressed within the system.

• Affected Machines: GEN3s with specific models of Seagate disk drives inside. IBM has a list of affected XIVs from
call-home data. GTS is also working to get affected XIVs highlighted in the HW SW CMS tool.
The disk_list command on the XIV Gen3 can be used to see the disks installed.
Affected disk model | Capacity | Firmware level containing fix for this issue
ST2000NM0043 | 2 TB | EC5C
ST3000NM0043 | 3 TB | EC5C
ST4000NM0043 | 4 TB | EC5C
ST6000NM0054 | 6 TB | EC6D
• Problem Description
– There is a risk of data corruption. The issue appears when the disk drives detect unreadable data. Should the issue occur,
it will result in a 512-byte block difference between the primary and secondary partitions; in effect, undetected data
corruption during a specific drive error recovery sequence.
– The issue is rare: according to the vendor, an affected drive is expected to hit the issue once in every 3,412 years. For a
system with all 180 affected drives it means once in every 38 years. The XIV scrubbing mechanism detects this issue.

• Action: If you have the affected model disks, contact IBM support to update the disks’ firmware to a fixing level. IBM
support is also proactively contacting affected accounts; if an account has an XIV Technical Advisor, the XIV Technical
Advisor will contact the account. The fix process is non-disruptive (there is a performance impact of a few seconds to a
physical disk drive while its firmware is upgraded, but since most I/O is expected to go via cache, this impact is
expected to be trivial).
• Open change records and work with the CE team to apply the disk firmware fix to the necessary machines. A global list has been
provided to GTS from STG. Geo SSL leaders need to track all devices to completion. It is a low-risk concurrent activity.

V7000 HDD Firmware issue


http://www-01.ibm.com/support/docview.wss?rs=591&uid=ssg1S1005289

Problem:
Specific hard disk drive models supported by the Storwize family of products may be exposed to possible
undetected data corruption.

Remediation:
A firmware update that remediates against future occurrences of this issue is now available. IBM
recommends that all customers with the affected drives apply these latest levels of code.

Solution (Procedure):
The Systems Group has provided excellent instructions in the link above; follow their guidance for all
V7000s:
1. Use the utility provided via the link to determine if the exposure exists
2. Determine which remediation path applies to your environment if it does
3. Follow the set of instructions associated with the remediation path.

Quarterly Microcode Updates:
• Provide code recommendations based on discussions with technical specialists, watching for
code releases, code defect alerts and more.
• Partner with focals to ensure collaborative discussion (with specific goals of not moving min
levels often and not having target level being the latest level of code). For example, using
Brocade to vet Brocade levels; your focals should agree with changes in writing.
New Alerts in New Code Family Releases:
• Determine new alerting capabilities, and interlock with the Standard Alerting Team for
automatic alert integration. As an example, XIV got compression capabilities with the R6 family
of code, which means new alerts are now available with that family. The focal for
Standardized Alerting is Rodrigo Dias.
Ask A Technical Question:
• Subscribe to the platform(s) you’re assigned to. You’ll receive notifications via email when a
new question is submitted. Please work to ensure all questions are answered within 48
hours.
Security:
• You must own the Tech Spec Topic for the Tech Platform(s) you’re assigned to:
https://w3-connections.ibm.com/forums/html/topic?id=7af4b838-55ca-4783-9538-ddc6a216aa61&ps=25
• Select your platform from the table listed in the Storage Community
• Then choose to “Follow” your assigned topic(s); you will automatically be emailed when a
tech spec topic is posted.
Stake Holders
# | Role | Action | Primary Escalation | Executive Escalation
1 | TI&A Storage Domain | Sets Strategy | Jim Olson | Richard Baird
2 | Global Service Engineering | Communicates Strategy and Attainment | Ken Morgan | Alan Skinner
3 | Storage IOT Leaders | Facilitate execution/communication of strategy, best practice, and attainment across the Storage Delivery Teams | NA – Steve Poss, EU – Mike Guy, AP – Harish Soni / Dong Min Kim, GCG – Harish Soni / Dong Min Kim, JP - Yohsuke Tohkairin, LA – Felix Nofal, MEA – Ajay Malik/Yalcin Ozsoy | NA – Jack Heberlig, EU – Mark Thomas (DS) & Francesco Silveri (MF), AP – Mahesh Tayal, GCG - Ley Cheng Lee, JP - Tomosuke Senta, LA – Caio Briski, MEA - Stef Stangret
4 | HRM | Assists with replacement of the EOL hardware. • HRM Team contacts Account Focal points (Chief Architect or delegate), assists with and refines the optimal technical design in line with Storage strategy. Ensures lowest cost solution for IBM, sourcing from GARS, etc. • Provides technical approval which is pre-requisite for financial approval in WWCT. *Note: involving HRM early removes any need for explanations or reworking later | Mukesh K Gupta | Stephen Ward (1st) or Jeff Cummings
5 | Account Team | Prioritizes resources (capital and labor) | *Leverage Global and IOT Quality Leads to work with account DPE/PE along with HRM team | *Leverage Global and IOT Quality Leads to work with account DPE/PE along with HRM team
6 | Storage Delivery Teams | Perform actions | Multiple - Leverage IOT Leaders | Multiple - Leverage IOT Leaders
7 | Tooling Strategy/Development and Domain | Work with IBM Systems to address serviceability (Code, etc…) | Jim Olson | Richard Baird
8 | Global & IOT Quality Teams | Facilitate additional focus and communication | Global - Ty Youngs, NA – Christopher Peterson, EU – Stephen Muir, AP – Jennifer Jim, GCG – Li Mei Zhao, JP - Yasuko Okazaki, LA - Alejandro Fontana, MEA – Nishant Sharma | Global - Dave Lowrie, NA – Jim Batterton, EU - Elizabeth Franjian, AP - Vijay Chaudhari, GCG - Ley Cheng Lee, JP - Tomosuke Senta, LA - Ricardo Gazoli, MEA - Stef Stangret
9 | Global Service Engineering | Global Process Owner - End-to-End Ownership to drive program | Ken Morgan | Alan Skinner

DS8000 ECAs 11/2015 – will chase blue


ECA Title | ECA | ECO | CMVC | Risk | Stale adjusted results and mfgr information | Costing | Targets
Updated Poodle ICS, Enables SSLV3 (As required; DS8700, DS8800) | 700 | J16444 | N/A | Loss of access | 77 | $16,170.00 | Launched 10/12; TMAN/EPOC – 10/09; II’s/Notification; Tracking
Viper C ICS (Mandatory; DS8700) | 714 | J16413 | 307442 | Addresses dual DDM failures, loss of access, data loss | 359 | $75,390.00 | TMAN/EPOC – 10/16; II’s/Notification; Tracking
Seagate Della ICS (Mandatory; DS8700/DS8800) | 715 | J16445 | 308135 | Addresses a data loss | 928 | $194,800.00 | TMAN/EPOC – 10/16; II’s/Notification; Tracking
Bluehawk DA pair 8 (CCL) (Mandatory; DS8870) | 707 | J16446 | 309406 | Addresses a loss of access. | 200 | $186,340.00 (CCL time: 4 hrs used for calculation) | TMAN/EPOC – 9/24; Fis available – formulating path for reduced customer risk for approaching freeze/
R75 Upgrade with GM (ICS/CCL) (mandatory; DS8870) | 712 | J16457 | 310099 and 310103 | Loss of Access | 108 | $22,680 (CCL or ICS, depending on current level) | TMAN/EPOC – 10/13; II’s/Notification; Tracking
Total | | | | | 1694 | $495,380.00 |


Security Classifications
https://advisories.secintel.ibm.com/faq.php

What's the difference between High, Medium, Low, E-Fix and FYI ratings? Why do FYI ratings
show up as High, Medium, Low in the database?
Ratings are an assessment of the severity of the vulnerability. They are used to calculate due dates
(according to criteria specified in ITCS104) for patch implementation.
FYI and E-Fix designations are not ratings. They are an indication that no compliance activity is required.
The advisory is being issued for awareness (usually the vendor includes workaround information) but
there's no mandatory action associated with it due to the lack of available supported patches.
However, even in the case where the vendor has not released a patch, there's still a vulnerability that the
vendor is reporting. As part of our FYI communication, we assess the severity of that vulnerability and
assign it a rating. The rating and the compliance activity are two separate pieces of the process.


Smart Rebuild
Smart Rebuild was developed for problematic 450 GB 3.5-inch drives, but has since been expanded to all
drive types. The initial release of Smart Rebuild checked twice a day for drives exceeding 3 media errors;
the check frequency was later increased to hourly.

SMART REBUILD | RAID 5 Rebuild | Smart Rebuild
How does it work? | Parity calculation for all data. | Data copy with parity calculation for unreadable data only.
What starts a rebuild? | Solid drive failure | 3 media errors in a week
How long does it take? | 3-4 hours for 450 GB drive | ~1 hour
Am I vulnerable while rebuilding? | Can NOT handle any additional failures. | Can handle an additional failure.

There has been a 3.5x reduction in dual disk failures since Smart Rebuild was introduced.
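
The "3 media errors in a week" trigger can be pictured as a sliding-window count. The Python sketch below is an illustration of that idea only, not the DS8000 implementation: the class name, the trailing 7-day window and the "3 or more" comparison are assumptions drawn from the table above.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # "in a week"
THRESHOLD = 3               # media errors that start a Smart Rebuild (assumed: 3 or more)


class MediaErrorTracker:
    """Tracks media-error timestamps for one drive (illustrative only)."""

    def __init__(self):
        # Timestamps are assumed to be recorded in chronological order.
        self._errors = deque()

    def record(self, when: datetime) -> None:
        self._errors.append(when)

    def should_rebuild(self, now: datetime) -> bool:
        # Drop errors that have fallen out of the trailing one-week window.
        while self._errors and now - self._errors[0] > WINDOW:
            self._errors.popleft()
        return len(self._errors) >= THRESHOLD
```

An hourly check, as described above, would call should_rebuild once per drive per hour; old errors age out on their own, so a drive with occasional errors spread over months is never flagged.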


Protective Code Features: BGMS / eBGMS


Background Media Scanning is designed to reduce the risk from “latent” media errors that exist on drives
but have not been discovered yet via normal I/O.

BGMS / eBGMS | RAID 5 | BGMS | eBGMS
How does it work? | Calculates data from Parity and reassigns to a new sector. | Calculates data from Parity and reassigns to a new sector. | Calculates data from Parity and reassigns to a new sector.
How are bad sectors discovered? | Client I/O | Proactive scanning of the DDMs by the DA. | DA reads DDM error logs, prioritizes bad sector reassignments.
How long could a bad sector exist? | Days, weeks, months… until I/O is requested for that sector. | Up to 4 days max | ~ 1 hour


XIV ECAs 2014

• ECA 001 (GEN2) - Apply UPS power switch guard and regulatory labels which are missed items
at GA
• ECA 006 (GEN2) - Manufacturing shipped 29 systems with duplicate serial numbers to clients.
EC corrects the VPD in the box and replaces the serial number label.
• ECA 007 (GEN2) - Apply a UPS circuit breaker guard as UPS switch is too sensitive to incidental
contact that can cause the UPS to power down resulting in loss of access to data.
• ECA 008 (GEN2) - Apply a UPS pigtail retention clip.
• ECA 009 (GEN2) - Apply a 4-line cord retention clip.
• ECA 014 (GEN2) - ATS Monitoring cables
• ECA 110 (GEN2) - Software fix to perform a UPS self test work around
• ECA 116 (GEN2) - Perform commands at software level which update VPD on R2.2 systems
• ECA 132 (GEN2) - Missing Prevent Services Invocation file for all XIVs at code level 10.1.x
through 10.2.4.b
• ECA 135 (GEN2) - Code release level 10.2.4.e and 10.2.4.e-3 after 5/15/13
• ECA 304 (GEN3) - Code release level 11.1.1
• ECA 305 (GEN3) - SAS firmware rolling upgrade with 11.1.1 on 126 XIVs
• ECA 306 (GEN3) - SM Memory Leak Patch

** Purple are ECAs that align to our global XIV storage strategy
** Blue is the only non-code ECA we are globally chasing.

Language to Help encourage code upgrades

• Microcode and firmware should be updated on a yearly basis. The key reasons for this are that newer versions of code bring…
– Increased stability
– Reduction in client impacting events
– Overall performance improvements
– New feature benefits.
– Advanced copy services/Replication improvements.

• Not following this strategy makes upgrades more difficult: the further past
a year you get, the more complex the upgrades become…
– More often than not requiring multi-hops to get to new target code level.
– Increased interoperability work when not complying
– Overall increased risk of code upgrades when they do occur.

• Out of support code


DS8000 ECAs 2014


ECA # / Name | Category | MT Affected | Release Date | Impact / Reason
ECA 010-4 | PE CONTROLLED | 242X models 92x, 93x, 9Ax (DS8100, DS8300) | Jun 2012 actual | Eliminate risk of loss of access due to dual CEC reboot caused by FSP memory leak. About 190 systems added due to approaching 700 day uptime dual HMC config, or 1400 day uptime single HMC config, threshold
ECA 825 | Communicated | 242X models 92x, 93x, 9Ax (DS8100, DS8300) | Feb 2012 actual | Reduce unnecessary DDM rejections and better handle those DDMs being rejected to prevent loss of access or data loss CIEs (focus on 450GB DDMs) – approximately 1283 systems are targeted; minimum ICS CD version is 4.4 (available now), R4.2 code systems need version 4.6 or higher (available in March)
ECA 826 | Communicated | 242X models 94x (DS8700) | Mar 2012 actual | ~2400 systems, to better address DS8700 fabric and DDM errors; launched March 20 after R6.2 SP1 exited test Feb 15 (this service pack also fixes two HIPER defects: the RBC zHPF problem, and the Mizuho quiesce / failback problem)
ECA 986 | Communicated | 242X models 92x, 93x, 9Ax (DS8100, DS8300) | Jan 2012 actual | As Required ECA focused on Hitachi B – rare data loss (CKD) or data corruption (open host) scenario – Nonconcurrent solution available now, concurrent solution available to 2100 boxes (expect no more than 140 to be interested in fix) by March (High Impact / Non-pervasive) – Fix delivered via ICS CD for code levels from 3.1x through R4.3 - extremely rare data error exposure and not being pushed to field; provided only per client request
ECA 850 | PE CONTROLLED | 242X models 92x, 93x, 9Ax (DS8100, DS8300) | Nov 2012 actual | Load ICS CD V4.7 to 5+ year old systems already on R4.2 or R4.3 code, to expand refresh rate / refresh count monitoring and handling to other vintages and capacities of DDMs to further reduce CIEs – 1529 systems
ECA 860 | PE CONTROLLED | 242X models 94x (DS8700) | Dec 2012 actual | Load R6.2 code on DS8700s on R6.2 code below bundle having DL exposure fix (76.20.90.0) to further reduce CIEs – 102 systems
ECA 861 | PE CONTROLLED | 242X models 9xx (DS81/83/87/8800) | Jan 2013 actual | Disable eBGMS on RAID 6 systems on designated bundles – 105 systems


DS8K ECA826 – Disk Drive Smart Rebuild – DS8700


Older ECAs are listed here but no longer tracked. ECA826 is the only one in scope here.
• ECA 825 (old) – provides smart rebuild and other DDM error handling improvements for DS8100s and DS8300s having 450 GB capacity DDMs –
this is now over 93% complete

• ECA 850 (new) – expands the capabilities of ECA 825 for approximately 1500 5+ year old DS8100s and DS8300s already on release 4.2 or 4.3
code, covering earlier vintages of DDMs and capacities other than 450 GB. SSRs are being encouraged to combine this very short and simple update
with other scheduled repair actions for systems qualifying for this ECA (since these older systems have periodic repairs anyway).

• To make tracking and managing our ‘Smart Rebuild’ DS8100/DS8300 ECAs (ECA825/ECA850) easier, we are changing the strategy to
the following:
– Any device where ICSv4.4 was applied per ECA825 tracking is good.
– Any device where ICSv4.7 was applied per ECA850 tracking is good.
– All other DS81/DS83s need ICSv4.7 applied.

• ECA826 – updates all DS8700s to release 76.20.90.0 code to benefit from DDM
error handling improvements, PCIe / fabric error handling and other improvements
– Working with HW/SW team to get included. Do not have clear % complete.
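
The simplified ECA825/ECA850 tracking rule above reduces to a small decision function. The Python sketch below is hypothetical (the function name, argument names and return strings are invented for the example) and treats only the exact ECA and ICS version pairs named on this slide as compliant.

```python
from typing import Optional

# (tracked ECA, applied ICS CD version) pairs the slide calls "good".
COMPLIANT = {("ECA825", "4.4"), ("ECA850", "4.7")}


def smart_rebuild_action(tracked_eca: Optional[str], ics_version: Optional[str]) -> str:
    """Return the follow-up for one DS8100/DS8300 under the simplified rule."""
    if (tracked_eca, ics_version) in COMPLIANT:
        return "no action"
    # Everything else, including devices with no ICS CD applied, needs v4.7.
    return "apply ICS v4.7"


if __name__ == "__main__":
    print(smart_rebuild_action("ECA825", "4.4"))
    print(smart_rebuild_action(None, None))
```

Keeping the compliant pairs in one set makes it easy to extend the rule if another ICS level is later accepted.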


DS8K ECA899
ABSTRACT: ECA899 - 600 GB 15K DDM Code Update

SUMMARY:
• ECA899 is a mandatory FBM. The purpose of this ECA is to improve field performance and
resiliency of systems which have 600 GB 15k DDMs. ECA899 provides enhancements which identify and
'fence' DDMs before a secondary event occurs that might affect clients' IT operations. EBGMS
(Enhanced Background Media Scan) is firmware that is part of the DA (Device Adapter); it
was first introduced in Microcode Bundle 76.31.55.0 (released July 2, 2013). ECA899 requires a
code load in order to pick up this firmware.
CHECKPOINT:
• If your machines are on code level R6.3.1 (76.31.55.0) or higher, no action is required. Otherwise,
use the SuperShipper or follow the established procedure in your GEO to order the microcode bundle
(R6.3 SP6) - 76.31.79.0 and install ECA899.
Evaluation Order:
• ECA899 (600GB DDM on DS8700) fixes the issue for ECA861 (eBGMS). Therefore, the need for
ECA899 should be evaluated on DS8700 before considering ECA861, to avoid applying ECA861
unnecessarily.
Installation instructions:
• Pull the latest code installation instructions from the following PFE DB:
https://ssgtech10.tucson.ibm.com/cress/TestDS8K.nsf/b3b63faf91dd65cb0725760900762dff/da6480bc7aa2d21707257b3b007fce48?OpenDocument&TableRow=5.1#5


DS8K zHPF - Potential zOS I/O Timeout on SRC Problem Events and PPRC
Link State Changes when zHPF is Enabled– High Priority (low box count)
Tracking is owned by Storage Service Line GEO leaders once lists are produced
 Abstract
- Potential zOS I/O Timeout on SRC Problem Events and PPRC Link State Changes when zHPF is Enabled
 Problem Description
- IBM has identified a problem on R63 in zHPF processing where SRC events and PPRC link state change notifications
can cause an I/O hang for a volume until the I/O Timeout (MIH) is reached and the host cleans up the hung operation.
This will result in a IOS071I "START PENDING" message to be seen at the zOS console indicating that an I/O
timeout has occurred. This problem is possible on DS8800s on 86,3x.xx.x below 86.31.49.0 and DS8700s on
76.3x.xx.x below 76.31.32.0. The following conditions must be present to have the possibility to hit the problem…
- zOS connected DS8700 or DS8800 on R63 code
- An SRC (Problem) or a PPRC Link state change notification must be generated from the LPAR, which causes a
SIM to be sent from the DS8000. Typical PPRC link state change notifications would be for high failure rate and
path loss notifications for PPRC Links.
- A collision must occur between the SIM going to the host and a zHPF I/O.
 Mitigation
- The mitigation to this problem is to enable the Multi-Host System Information Message (SIM) function. This function
sends SIM events to multiple connected zOS hosts instead of just one zOS host. When the function
is enabled, SIM communication goes through a different code path, which bypasses the affected area of code.
The Multi-Host SIM function is turned on through an SSR-accessible method; please contact the next level of
support for instructions.
 Resolution/Support
- Firmware that addresses this problem is now available for the DS8000. With the fix, SIM presentation is correctly
offloaded to zHPF I/O without causing an I/O timeout. The fix is available in the following code bundles:
- DS8800 on release 6.3 code: 86.31.49.0 or higher
- DS8700 on release 6.3 code: 76.31.32.0 or higher
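
The code-level part of the exposure can be checked mechanically. The following is a minimal sketch, assuming bundle levels are available as dotted strings such as "86.31.49.0"; the function and table names are hypothetical and not part of any IBM tool. It covers only the bundle-level condition, not the zOS attachment or zHPF enablement conditions listed in the Problem Description.

```python
# Fix levels taken from the Resolution/Support list above.
ZHPF_FIX_LEVELS = {
    "DS8800": (86, 31, 49, 0),
    "DS8700": (76, 31, 32, 0),
}


def parse_bundle(bundle):
    """Turn a bundle string such as '86.31.49.0' into a tuple of ints."""
    return tuple(int(part) for part in bundle.split("."))


def exposed_to_zhpf_timeout(model, bundle):
    """Return True if the bundle is on R6.3 code below the fix level."""
    fix = ZHPF_FIX_LEVELS.get(model)
    if fix is None:
        return False  # only DS8700 and DS8800 are named in this alert
    level = parse_bundle(bundle)
    # R6.3 bundles are written as 86.3x.xx.x / 76.3x.xx.x in the alert,
    # so the second field is assumed to fall in the range 30..39.
    on_r63 = level[0] == fix[0] and 30 <= level[1] <= 39
    return on_r63 and level < fix
```

For example, `exposed_to_zhpf_timeout("DS8800", "86.31.40.0")` flags the box, while a DS8800 already at 86.31.49.0 is not flagged.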

XIV ECA306 – XIV-InfiniBand Master SM memory Leak issue


Tracking is owned by Storage Service Line GEO leaders once lists are produced
 Abstract: The InfiniBand Master SM memory leak issue can create a condition where a module may fail
after a specific timeframe (224 days after a code load, for example). This is NOT an impacting event
that would lead to a loss of access. Development and PFE are creating a long-term action plan, but the
following action plan will be used by IBM support until the full plan is available.

 Affected Machines: All GEN3 code level systems currently out in the field.

 Problem Description
 The signature of the issue is the following events sequence:
 The critical event NODE_FAILED with a description saying the cache has failed (something like: Node
#<n> of type cache on 1:Module:<n> failed because of <reason>)
 Approximately a week before that the following event MASTER_SM_CHOSEN was emitted.
 If the above sequence occurs, the action plan is to power cycle the failed module and phase it in. There is
NO need to replace the failed module in this case.

 Action: The fix is contained in 11.2.0.b, which is the minimum level; 11.3.1 or 11.4.1.a is recommended,
depending on the features and/or fixes required in the client environment. Follow code guidance in the
spreadsheet.

 Open change records and work with the CE team to apply the patch to the machines that need it. A global list is
provided to the GTS community from STG. Geo SSL leaders need to track all devices to completion. This is a low-risk
concurrent activity.
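
The event signature from the Problem Description can be sketched as a simple scan over an exported event list. This is an illustrative sketch only: the tuple layout, the function name, the "of type cache" substring match, and the 5 to 9 day window used to represent "approximately a week" are all assumptions, not an IBM-provided detection rule.

```python
from datetime import datetime, timedelta


def matches_sm_leak_signature(events,
                              min_gap=timedelta(days=5),
                              max_gap=timedelta(days=9)):
    """Check a list of (timestamp, event_code, description) tuples.

    Returns True when a NODE_FAILED event for a cache node was preceded,
    roughly a week earlier, by a MASTER_SM_CHOSEN event.
    """
    sm_times = [ts for ts, code, _ in events if code == "MASTER_SM_CHOSEN"]
    for ts, code, description in events:
        if code != "NODE_FAILED" or "of type cache" not in description:
            continue
        # Look for a MASTER_SM_CHOSEN event inside the assumed window.
        if any(min_gap <= ts - sm_ts <= max_gap for sm_ts in sm_times):
            return True
    return False
```

A match only suggests that the power cycle and phase-in action plan applies; confirm with IBM support before acting on it.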

DS8K ECA876 – Improvements for Data Protection


Medium Priority (low box count) - closed
Tracking is owned by Storage Service Line GEO leaders once lists are produced

 ABSTRACT: Enhanced Thresholding & Error Recovery for a focused set of DS8700s containing 450GB
DDMs.

 SUMMARY: With the evolution of Smart Rebuild (SMRB) algorithms and insight gained into DDM failure
modes from field analytics, we have developed enhancements to thresholding and the DDM sparing process.
For these specific subsystems defined by the ECA876, we expect a 2x improvement in DDM error handling
robustness.

 DETAILS: Two major enhancements include:


– 1) Tighter threshold & processing for DDM soft errors that indicate potential media errors. Additional
integration of device adapter processing to DDM error log analysis.
– 2) Enhancements for specific failure modes to start the sparing process.

 The subsystems chosen for this ECA were identified through full field analytics. The changes in algorithms
and thresholds have the highest value for the DDM failure modes specific to these subsystems. As part of
the field quality improvement process, we will continue to identify tail of the distribution & unique failure
mode opportunities to improve overall quality.

 Resolution/Support: Included in 76.31.55.0 for DS8700. A full subsystem CCL is required to
install the code.

DS8K RAID 10 Fixes: ECA860 (DS8700), ECA870 (DS8100/DS8300),
ECA867 (DS8800) – closed – completed – agreed to with Systems
 Abstract
- RAID 10 configured DS8700 systems on release 6.2 code below bundle 76.20.90.0 are being
provided ECA860 to prevent potential RAID 10 data loss events (EC J16337, Retain Tip H207153). The same
exposure is being addressed in DS8100 and DS8300 systems below bundle 64.36.48.0 with ECA870.
 Problem Description
- Specific DS8000 systems are being provided the Mandatory FBM ECA860/ECA867/ECA870, concurrent code
load updates that will help improve reliability and reduce the risk of potential Raid 10 data loss events. This
update is to be installed on RAID 10 configured DS8700 systems running release level R6.2 bundles below
( 76.20.90.0 ) and DS8100/DS8300 systems running bundles below (64.36.48.0).
 Root Cause
- Following a DDM non-correctable media error, both DA Adapters initiate a repair of the same area that is marked
component in doubt. The first adapter obtains a lock, completes the repair, clears the component in doubt (CiD),
releases the lock and notifies the other adapter. The second adapter obtains the lock, sees the CiD is cleared,
and selects the mirrored copy from which the repair can be completed. Not seeing the CiD of the DDM with the
media error, the repair is completed using the mirrored copy of the failing DDM, resulting in a killsector (Data
Loss) event.
 Mitigation
- This EC releases a Mandatory FBM (ECA860) to upgrade DS8700 systems via CCL to R6.2 level code bundle
(76.20.107.0) or higher. For DS8100/DS8300 systems, ECA870 upgrades via CCL to bundle (64.36.68.0) or
higher (it will upgrade to a current shipping level – avoid 64.36.68.0 and prefer target 64.36.90.0). For DS8800
systems, ECA867 upgrades via CCL to bundle (86.20.130.0). The update will reduce potential RAID 10 data
loss events, which impact our client operations. The CCL upgrade process is estimated to complete in 4-6
hours, but actual durations will depend on the machine's configuration.


DS8K ECA861 - Potential loss of access / data loss exposure


High Priority (low box count) - closed per agreement with Systems

 Abstract
- Potential loss of access / data loss exposure when running the new Enhanced
Background DDM Media detection algorithms and RAID6.

 Problem Description
- Due to a potential RAID6 issue, IBM has identified that a rare disk-triggered loss of
access / data loss exposure exists when running the new enhanced background DDM
media detection algorithms and RAID6 on certain DS8000 code bundles.
- Customers using RAID6 are exposed when running R4.3H (DS8100/8300 bundles 64.36.35.0
or higher, but less than target level 64.36.89.0), R6.3 (DS8700 bundles 76.30.42.0 or higher
but less than 76.31.55.0; DS8800 bundles 86.30.50.0 or higher but less than 86.31.70.0), or
R7.0 (DS8870 bundles 87.x.x.x).

 Mitigation
- It is recommended that clients with exposed DS8000 systems arrange with their IBM
Service Representative to install ECA 861 to disable the new Enhanced Background
DDM Media detection algorithms until a fix is available.
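
The exposure ranges in the Problem Description can be expressed as a small lookup. This is a sketch under stated assumptions: bundle levels are dotted strings, the table and function names are hypothetical, and the bounds are copied from this slide (lower bound inclusive, upper bound exclusive). It does not account for ECA899 or any later fix level.

```python
# (lowest exposed bundle, first bundle that is no longer exposed)
ECA861_EXPOSED_RANGES = {
    "DS8100": ((64, 36, 35, 0), (64, 36, 89, 0)),
    "DS8300": ((64, 36, 35, 0), (64, 36, 89, 0)),
    "DS8700": ((76, 30, 42, 0), (76, 31, 55, 0)),
    "DS8800": ((86, 30, 50, 0), (86, 31, 70, 0)),
}


def eca861_exposed(model, bundle, uses_raid6):
    """Return True if a system falls in the exposed ranges listed above."""
    if not uses_raid6:
        return False  # the exposure applies only to RAID6 configurations
    level = tuple(int(part) for part in bundle.split("."))
    if model == "DS8870":
        return level[0] == 87  # the slide lists all 87.x.x.x bundles
    bounds = ECA861_EXPOSED_RANGES.get(model)
    if bounds is None:
        return False
    low, high = bounds
    return low <= level < high
```

For example, a RAID6 DS8700 at 76.31.40.0 is reported as exposed, while the same system at 76.31.55.0 is not.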


DS8K ECA688
ABSTRACT: ECA 688 - DDM F/W update

Summary:
The ECA 688 is a mandatory FBM. The purpose of this ECA is to provide drive firmware
updates to prevent potential data loss.
These firmware codes affect the following drive types:

Lightning 450 GB 10k 2.5" DDM
Lightning 600 GB 10k 2.5" DDM
Megalodon 3 TB 10k SED 2.5" DDM
Megalodon 3 TB 10k base 2.5" DDM
Megalodon 4 TB 7.2k SED 2.5" DDM

Checkpoint:
Use the HMC to check Firmware Level for the following DDM types:

Lightning SED: 450GB E806 ddmFamily: S0ZG
Lightning SED: 600GB E806 ddmFamily: S0ZH
Seagate Megalodon SED 3TB drive FW E073 ddmFamily: S7CQ
Seagate Megalodon base 3TB drive FW A063 ddmFamily: S7DQ
Seagate Megalodon SED 4TB drive FW E073 ddmFamily: S7CR

If the drive Firmware Level on any of the DDMs is lower than the minimum level indicated above, then SSRs must order an ICS CD
(DS8k_DDM_SSD_FW_Update_v1.12.iso) and install the ECA 688.
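
The checkpoint above can be sketched as a lookup from ddmFamily to minimum firmware level. This is an illustration only: the input layout and function name are hypothetical, and the comparison assumes firmware levels within a family are fixed-width strings that sort in release order, which is not confirmed by the source.

```python
# Minimum firmware level per ddmFamily, from the Checkpoint list above.
MIN_DDM_FIRMWARE = {
    "S0ZG": "E806",  # Lightning SED 450GB
    "S0ZH": "E806",  # Lightning SED 600GB
    "S7CQ": "E073",  # Seagate Megalodon SED 3TB
    "S7DQ": "A063",  # Seagate Megalodon base 3TB
    "S7CR": "E073",  # Seagate Megalodon SED 4TB
}


def ddms_needing_eca688(ddms):
    """Return IDs of DDMs whose firmware is below the listed minimum.

    ddms is an iterable of (ddm_id, ddm_family, firmware_level) tuples,
    for example as transcribed from the HMC.
    """
    below_minimum = []
    for ddm_id, family, firmware in ddms:
        minimum = MIN_DDM_FIRMWARE.get(family)
        # Assumption: same-family levels compare correctly as strings.
        if minimum is not None and firmware.upper() < minimum:
            below_minimum.append(ddm_id)
    return below_minimum
```

Families not listed in the table are ignored, so an empty result means only that none of the five listed drive types is below its minimum.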

TS7700 ECA009 – High Priority (GRID Solutions Only)


http://snjlnt02.tucson.ibm.com/tape/tapetec.nsf/pages/TS7700_Grid_replication_potential_issue
 Problem:
A potential TS7700 problem has been identified where Grid replication might inadvertently skip replication for a given volume, which could
result in out-of-date volumes being mounted.

Background:
The following applies only to IBM Virtualization Engine TS7700s implementing replication services. TS7700 stand-alone configurations
(i.e., configurations not in a grid) are not exposed.

When an IBM Virtualization Engine TS7700 is running any release level from R2.1 (8.21.0.63) through and including R2.1pga4a
(8.21.0.145) in a TS7700 Grid configuration, it is possible for a replication target to inadvertently skip the replication for a given volume and
view the prior replicated content for the same volume serial at the same target location as valid. In the event that a skipped volume is read
by a host and the prior level instance within the TS7700 Grid is chosen, System z mount or open processing should detect a “dataset
mismatch failure.” If label bypass processing were utilized, then previously written, but now potentially out of date, content for said volume
could be returned to the host application. If the previous use of the volume had its contents successfully deleted while in a scratch state,
the copy target can still be inadvertently skipped but there is no potential to surface out of date content given said stale content was
previously deleted. If R2.1 through R2.1pga4a (8.21.0.145) is currently installed or has been installed in the past, an exposure may have
occurred. Once the release level R2.1pga5 (8.21.0.155) or later is installed on all members of the Grid configuration, the risk of additional
exposure is eliminated.

IBM has created a tool that can detect whether such an error possibly occurred, which must be followed by a manual process to determine
whether it actually did occur. If one or more cases are detected, the down level instances can typically be corrected through automatic
TS7700 Grid replication. The tool uses minimal system resource and can be run concurrently with a production workload. Please contact
your local support team to schedule an opportunity for IBM service personnel to run this tool on your TS7700 if you have or feel there is a
potential for you to encounter this problem.

Solution (Procedure):
To verify whether a TS7700 has been affected by this issue, vtd_exec.171 needs to be run against one of the clusters in the grid. It is not
necessary to repeat the check on all clusters separately. All clusters must be online when running vtd_exec.171. The exec can be run at
multiple chosen intervals until R2.1pga5 (8.21.0.155) or higher is installed, at which point the risk of exposure is eliminated. If a higher level of
code is already installed, the exec only needs to be run once to determine whether a past exposure occurred while the affected
code levels were installed.


Heartbleed Guidance – Additional Measures


The replacement of the device certificate is needed to overcome the Heartbleed exposure
post code upgrade w/fix (see spreadsheet).

The new certificate will mean that users will need to accept a new certificate when they
next login - any live GUI sessions will close.

There is no impact on SSH keys - they are not impacted by Heartbleed and there is no need
to regenerate those keys.

For SVC/V7000/V5000/V3700:
1) Replace your SSL Certificates:
Regenerate the system's private key and SSL certificate by issuing the command line interface (CLI)
command "svctask chsystem -regensslcert".

2) Reset User Credentials:
Change all user passwords using either the graphical user interface (GUI) or by issuing the
CLI command "svctask chcurrentuser -password".

No action is necessary for session-related cookies.


You should ensure that GUI access is functional before and after the SSL certificate fix.

Warning: Your environment may require additional fixes for other products, including non-IBM
products. Please replace the SSL certificates and reset the user credentials after applying the
necessary fixes to your environment.


Interoperability – SVC Guidance – past guidance


 New statement (from Hursley): http://www-01.ibm.com/support/docview.wss?uid=ssg1S1004620
– If you are running in a supported configuration at SVC level X and upgrade to level Y then you are still supported
and can expect correct operation unless there are specific exceptions highlighted on the interop website.
– This is a statement about SVC firmware updates. This is not a statement about HBA/driver/multipath/OS or
switch firmware or storage array updates.
 GTS Delivery policy:
– Always check the SVC interop website for any exceptions that may affect a planned update.
– Even if no exceptions are listed, plan on regular HBA firmware / driver / multipath compatibility checks and
updates if appropriate and recommended by the server team
– SVC Firmware Change Only: If only planning to update SVC firmware, then broader checks on switch fabric or
storage firmware levels, or host levels or HBA firmware/driver/multipath are only required by GTS when upgrading
to a new major family and the last full check was the major family prior to the current one. Simply stated: full
checks are required only at every 2nd major family increment (e.g. the last check was done at 5.x, the code is
currently at 6.y, and the intended update is to 7.z)
– Always run the latest version of the SVC Update Test Utility prior to any change. (See link below). Always ensure
all server multipathing redundancy and storage pathing redundancy is operational. Always run a TPC/HC health
check against the SVC cluster in question to uncover any configuration non-compliance that may impact update
(cascaded node failover) success (such as pathing oversubscription)
 Compatibility websites:
– http://www-01.ibm.com/support/docview.wss?uid=ssg1S4000585 SVC Upgrade Test Utility
– http://www-03.ibm.com/systems/storage/software/virtualization/svc/interop.html (4,5, 6.x compat website)
– http://www-01.ibm.com/support/docview.wss?uid=ssg1S1001707 (concurrent compat website)
– http://www-01.ibm.com/support/docview.wss?uid=ssg1S1004392 (detailed 7.3.x compat website – SAN/SVC/DISK)
– http://www-03.ibm.com/systems/support/storage/ssic/interoperability.wss (SSIC – System Storage Interoperation
Center)
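
The "SVC Firmware Change Only" rule in the GTS Delivery policy can be sketched as a small predicate. This is a minimal sketch, assuming major families are tracked as integers (5 for 5.x, 6 for 6.y, and so on); the function name is hypothetical. It covers only the question of when the broader checks are required and does not replace the steps the policy marks as "Always".

```python
def full_interop_check_required(last_full_check_major, current_major,
                                target_major):
    """Apply the 'every 2nd major family increment' rule.

    The broader switch, storage, host and HBA checks are required only
    when the update moves to a new major family and the last full check
    was done on a family older than the one currently installed.
    """
    upgrading_family = target_major > current_major
    check_is_stale = last_full_check_major < current_major
    return upgrading_family and check_is_stale
```

Using the slide's own example, a last check at 5.x with code currently at 6.y and an intended update to 7.z gives `full_interop_check_required(5, 6, 7)`, which evaluates to True.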
26 July 2013

Interoperability – SVC Guidance continued – past guidance


 SVC firmware updates.
– Check compatibility of new SVC level with existing host/hba/driver/multipath/OS levels, as well as switch
and storage for every major sublevel jump (e.g. 6.3.x to 6.4.x). This is being superseded with the new
policy (previous page)
– Confirm multipathing active and working prior to SVC firmware update. This policy is NOT being
changed.
 Storage interoperability with SVC is independent of server interoperability
– You can update the firmware on a disk array behind an SVC and not concern yourself with the host-side HBA
or driver levels. The host sees the vdisk, not the back-end array (even in image mode)
– You can change the back-end array (through migration or mirroring, for example), and it has no impact on
host compatibility
 Server interoperability with SVC is independent of storage interoperability
– Server specifics are relatively unimportant, for the most part. Most server annotations are being removed
from the official interop charts
– Server specifics may play a role with switch compatibility
 Switch interoperability must be checked.
– When updating switch firmware or SVC firmware, ensure that the switch / SVC combination is supported. If
not explicitly listed, then check with Delivery SSO. Establish proper LSAN zones. Modify zoning as needed
during migrations
– Switch interoperability must be confirmed with HBA / Driver / Multipath / OS. Multipathing should always be
verified active and compatible prior to switch firmware update and prior to SVC firmware update.
 Host HBA / Driver / Multipath Driver / OS updates. Always confirm mutual compatibility prior to updates.
Confirm compatibility with switch and SVC.

 Not applicable to EOL server operating systems.

26 July 2013

Spectrum Virtualize (SVC) Global Mirror Issue

Problem:
• Accounts using SVC / V7000 standard Global Mirror on certain levels of Primary and Secondary controller code are
exposed to the risk that Global Mirror Source Data May Be Incompletely Replicated to Target Volumes

Background:
• Note that this applies only to standard Global Mirror; if you run only Global Mirror with Change volumes (cycling mode)
then you are not affected.
• Note that your current code level is not the only relevant factor; you may still be exposed if you were previously running on
an affected code level. If the Secondary GM Cluster was ever on code level 7.2... up to 7.2.0.10, or 7.3.. up to 7.3.0.8, or if the
Primary GM Cluster was ever on code level 7.2.. up to 7.2.0.7, then you are at risk.
• If you are affected by this, follow the GTS Delivery specific code guidance and the instructions in the IBM support FLASH.
If you are affected by this, it is very important NOT to start running off your Global Mirror target volumes or reverse Global
Mirror direction. If you have done this, contact us immediately.
• The official IBM Flash is here: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005053
• GTS delivery also has this FAQ page:
https://w3-connections.ibm.com/wikis/home?lang=en-us#!/wiki/We1a54e872660_4f1d_a6ea_3e76a494b6a3/page/FAQs

Solution (Procedure):
• Upgrade source and target controllers to a safe level which means no further “data holes” will be created, see above links
for code levels.
• Then address the possible data holes:
• By a full fresh synch of all Global Mirror data (this is the GTS preferred approach)
• Or by installing checksum servers and running the IBM-provided checksum scripts to check for and address any
inconsistencies on target Global Mirror volumes (details in the IBM Flash above)


Device Adapter High Performance


 For DS8100, DS8300, and DS8700
– Default is DAHP is off. Leave it off
• Exception is where original purpose of guaranteed maximum timeout is to be very low.
Japanese banks had contractual agreement of no more than 8 second response time delay
on IO. DAHP guarantees that marginal drive is dropped within 5 seconds and traditional
sparing is initiated (IO completed via parity with other disks)
– Rationale:
• Drive FC-AL loop instabilities can cause response delays that DAHP will pick up as multiple
HDD failures, falsely flagging DDMs that should not be failed. Mitigated for DS8700 at
R6.3 sp4 code.
 For DS8800
– Default is DAHP is off. Leave it off
• Exception is where original purpose of guaranteed maximum timeout is to be very low – as
above
– Rationale:
• Drive SAS infrastructure no longer has loop instability issue. However, DAHP must be
turned off before any upgrade that updates drive firmware (still confirming this is not more
general). If not, then DAHP may easily be triggered by the firmware update delay itself –
rejecting the drive incorrectly. DAHP would need to be re-enabled post CCL. DAHP
modification is a PFE-only activity – expensive to invoke for every CCL.
 For DS8870
– Default is DAHP is on. Leave it on.
– Rationale:
• Drive SAS infrastructure no longer has loop instability issue. DS8870 automatically
switches DAHP off before CCLs and back on afterward.