Professional Documents
Culture Documents
April 2019 – 2Q
URL: http://ibm.biz/BdrACA
Education:
https://learning.atlanta.ibm.com/hr/global/edvisor/gdf_edvisor.nsf/Start?OpenAgent&Login&id=85257E5200583DEA
Agenda
• DT&E Overview – Lifecycle Management
• Heartbleed
• Setting quarterly microcode/firmware targets for all storage technology using a minimum
acceptable and target level strategy.
Field experience of SVC upgrades shows that rebooting an SVC node and running the BIOS POST tests
appears to be the most common time to find hardware problems with the SVC nodes. For many
customers, the only time that an SVC node is rebooted is during a software upgrade. If a hardware
problem is detected during a software upgrade, it will disturb the upgrade process and may require rolling
the upgrade back to the original level, depending on which node experiences the hardware fault.
• We therefore strongly recommend manually rebooting any node that has not been rebooted in 14 months or longer before starting a code upgrade. This allows a controlled reboot and also permits testing of the multi-pathing drivers in a more controlled reboot cycle, all of which improves the likelihood of a successful upgrade.
For SVC upgrades planning two family code hops in the same upgrade window, it is recommended that the first upgrade be done manually rather than automatically.
It is recommended to have a USB stick at the site where the upgrade is occurring, in case the CE needs to reinstall the SVC node software if there is an issue with the upgrade.
NEVER do SVC cluster upgrades concurrently. The only exception is if servers are 100% isolated – none connected to both SVC clusters.
NetApp and Nseries code upgrade best practices are similar to DS8/SVC/FlashSystems: there are best practices to apply ahead of time to ensure you have a system free of hardware and configuration issues that would prevent a non-disruptive upgrade.
Please use the following process in preparation for doing NetApp and Nseries code upgrades ->
https://ibm.biz/BdR7PE
It is your responsibility to update HW/SW Currency when making changes in your environment (adding or removing hardware, upgrading firmware/microcode, etc.). This is imperative for maintaining a clear understanding of your account's position on code currency. Link here --->
https://hwsw.boulder.ibm.com:8443/hscms//welcome.pro
Always use approved account change management processes before executing any changes (applies to
everything in the deck).
If a code upgrade change must occur during a client's prime business hours, the customer needs to provide approval with an understanding of the potential impact.
End-to-End Ownership

Roles and responsibilities:
1. Determine how to best partner to influence and drive priority for the decision makers in each IOT.
2. Global Service Engineering Team – Communicates strategy and attainment.
3. Storage IOT Leaders – Facilitate execution/communication of strategy, best practices, and attainment across the Storage Delivery Teams.
4. HRM – Assists with replacement of the EOL hardware.
• The HRM team contacts account focal points (Chief Architect or delegate), assists with and refines the optimal technical design in line with Storage strategy, and ensures the lowest cost solution for IBM (sourcing from GARS, etc.).
• Provides the technical approval which is a pre-requisite for financial approval in WWCT.
5. Account Team – Prioritizes resources (capital and labor).

Communication cadence:
1. Communicate strategy and attainment.
2. IBM Device Serviceability – code stability, concurrent upgrades, etc.
3. Monthly – WEX Team sends a monthly report to the Global Quality IOT and Storage IOT Leader; quality teams use the material as part of quality cadence meetings.
• "# of devices out of criteria for either Code or EOL from HW/SW" and "time remaining on contract based on CHIP"
4. Monthly – Storage Service Engineering MOR to communicate current LCM status.
• Data points tuned to apply heatmap (additional detail in backup 9-12)
5. Quarterly LCM Strategy Update (additional detail in slide 6)
6. Adhoc – reporting from HW/SW Currency
LCM closed loop design – automated health check through DPE notification
Automated health checks run daily on SAT servers (which may be co-located with TPC) for IBM managed storage
As there are risk differences among deviations from minimum acceptable levels, we are starting a strategy for certain devices to help guide the teams from a priority perspective. See the strategy below (via an XIV example)…
Hardware Make/Model (or Software) | Minimum Accepted Code Level | Target Code Level(s) | RED Level: Urgent Need to Upgrade | Yellow Level: Secondary Priority | Document Risk if Device is NOT on Minimum or Recommended Code Levels
XIV GEN 3 (MODEL 214) | 11.2.0a | 11.2.0b | N/A | N/A | Significant issues below 11.1.1
New GEN3s being shipped | 11.2.0b | 11.2.0b | N/A | N/A | Significant issues below 11.1.1
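Illustration only – a minimal sketch of how a reporting script could classify a device against the minimum/target strategy above. The table values mirror the XIV example; the classify function and its naive string comparison are assumptions, not part of any GTS tool.

    # Classify a device against the minimum/target code strategy (sketch).
    # Values mirror the XIV example above; a real tool needs a proper
    # version parser instead of the naive string comparison used here.
    STRATEGY = {
        "XIV GEN 3 (MODEL 214)": {"min": "11.2.0a", "target": "11.2.0b"},
        "New GEN3s being shipped": {"min": "11.2.0b", "target": "11.2.0b"},
    }

    def classify(model, installed):
        levels = STRATEGY[model]
        if installed < levels["min"]:
            return "below minimum - document risk and plan upgrade now"
        if installed < levels["target"]:
            return "at/above minimum but below target - schedule upgrade"
        return "at target - compliant"

    print(classify("XIV GEN 3 (MODEL 214)", "11.1.1"))  # below minimum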
Timelines for Security Issues with Firmware/Microcode Risk Management - ITCS104 (IBM internal security document)
Group | System Type/Operating System | Severity
1. Target install time. If the install time cannot be met, follow the Risk Management Process per FIN 166.
2. Follow the Storage Microcode Strategy, which states a year between minimum acceptable and target levels. Ensure there is a risk discussion with the client – the firmware upgrade will take place (only) if agreed with the client. Set the target installation time to 12 months.
3. For accounts using the WWBCIT risk system: the target install time is 120 days. If the 120 days cannot be met, then a WWBCIT SAI (Self Assessment Issue) record needs to be completed with a new target date of 12 months from the vulnerability. If the 12-month microcode strategy implementation cannot be met, then a WWBCIT CDD (Corporate Directive Deviation) record needs to be completed. The existing SAI can be closed referencing the newly opened CDD.
4. Follow the Storage Microcode Strategy, which states a year between minimum acceptable and target levels. Ensure there is a risk discussion with the client – the firmware upgrade will take place. These will only generate informational CIRATS records with no targets.
5. Group 1 refers to production; groups 3 & 4 refer to development and test.
Overall firmware/mcode process located here - http://ibm.biz/BdZc9c
If a risk is MSS-classified low/medium, we move the target level to the fix level. If it is MSS-classified high, we move the minimum level to the fix level. The LCM owner is responsible. CIRATs are auto-cut post MSS classification, provided accounts are subscribed to the given technology, as they are required to be.
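A hedged sketch of that rule – low/medium MSS classification moves the target level to the fix level, high moves the minimum. The function and dictionary shape are illustrative assumptions, not an actual LCM tool API.

    # Apply the MSS-classification rule described above (illustrative only).
    def apply_mss_fix(levels, mss_severity, fix_level):
        # levels is {"min": ..., "target": ...} for one device type
        if mss_severity in ("low", "medium"):
            levels["target"] = fix_level   # low/medium risk: target moves
        elif mss_severity == "high":
            levels["min"] = fix_level      # high risk: minimum moves
        return levels

    print(apply_mss_fix({"min": "11.2.0a", "target": "11.2.0b"}, "high", "11.3.1"))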
Timelines for Security Issues with Firmware/Microcode Risk Management - CSD (Customer Security Document)
1. Guidance is to follow the Storage Microcode Strategy. This strategy ensures that we are at the documented minimum acceptable code level.
2. Group 1 refers to production; groups 3 & 4 refer to development and test.
The above dates are the IBM recommended values. Each individual account may have different values based on contractual agreements.
TPC/TSM timelines will be different as they are software updates, not firmware updates, and therefore align to server timelines/techspecs.
CIRATs that are automatically opened for storage firmware/microcode security issues need to be updated with a target date for addressing with a fix. Close them after the fix has been applied.
Risk Management process that teams need to use ->
https://w3-connections.ibm.com/activities/service/html/mainpage#activitypage,dc6d5eed-f279-4632-83ca-b2d4af32f12f
If your account has had a data loss event due to confirmed problematic drives, you need to notify the global service line immediately (Karen Haberli/Hartford/IBM). If that device is EOL/EOS, you'll need to vacate / replace it immediately.
If the device is not EOL/EOS (or near to it), your action plan will be:
• Immediately vacate the device
• Replace all the problematic disk drives (work with Systems)
• Upgrade the DDM and Device code to current
• Reformat the device as RAID6/DRAID6
The affected device cannot be reused until all 4 steps are complete.
Predictive Analytics (HWSW Reporting) will continue to improve at identifying faulty DDMs so we can identify and address the problem storage devices in the field and prevent an account outage. These analytics include a summarized action plan to address the errors identified; the recommendations should be worked immediately.
The GTS strategy is based on average usable age lifecycles as well as official product support dates. These lifecycle times consider average refresh cycle times as well as the time required for data migration, and they typically begin before the product is out of support, to avoid the risk of slow refresh periods and to allow time for proper refresh planning and execution.
New deployments of GTS EOL/EOS hardware are not authorized. (*Note: Exercise caution with GARS purchases – as a general rule, devices that are End of Marketing/End of Sale will have limited useful life and should not be purchased.)
Account risk process to be used for GTS EOL/EOS devices that are still supported via TSS.
– Risks are to be filed with the Account Team and/or Project Office, NOT the end-client.
– Risks are owned by the account team; this is a wholly internal process. While IBM internal, GTS accounts must follow it.
– DS5X devices are in this category: technically supported via TSS, however IBM Systems has marked them EOL and has ceased all code and security work (they do not even issue PSIRTs against this technology anymore).
*Note: It is the responsibility of the GEO SSLs to track and act on this guidance. Delivery/Account teams should work to ensure proper risk mitigation plans are in place. This should take the form of evaluating secondary controls for callhome and possibly adding SNMP alerting, SMTP email alerts, engaging with local in-country TSS, manual health checking, updating account and client risks, and any other possible approaches that mitigate the risks to accounts/customers.
The perception and expectation is that, regardless of whether a device is EOL/EOS with risks in place, there will be prompt service from the vendor / 3rd-party vendor and the Service Line Delivery Teams to resolve any issue, and that "somehow" whatever is needed will be found via various escalations – skills, parts, labor, etc. The reality of the situation is (as experienced first hand by an account this month):
· Extended maintenance contracts typically only cover break/fix on the hardware side (assuming parts are still available), so any non-hardware event will have NO support. Escalations typically do not work.
· Complicated device recovery actions requiring L2/L3 development-level skills are not available to help with the event.
· There is no maintenance and there are no fixes for device defects. If you hit a new bug, no fix is coming.
· There will be escalation delays while trying to find the right skills and/or parts within IBM or vendor teams (high probability the skills no longer exist).
· CRITICAL: admins must do daily health checks of EOL technology, as call home likely no longer works.
Formally submitted risks must be written and accepted by the account team (Executive Owner for the account) and the client (Executive Owner within the client for the IBM relationship – i.e. CIO/CTO/etc.), and the risks must make clear that data loss, performance issues and elongated outages are a strong possibility due to the use of EOL/EOS hardware and software. You should NEVER have any critical data on EOL/EOS technology, period. If EOL/EOS HW/SW is in use for critical data when the risk letter is written, the risk letter needs to be accompanied by a migration plan approved at the same time as the risk letter by the same parties. In many cases, using newer hardware to optimize storage will provide a positive business case for all sides and demonstrate IBM's ability to bring forward innovative solutions for our clients.
Flash Systems
– FS900s allowed for reuse
– All A9Ks allowed for reuse
XIVs
– XIV GEN3s that are less than 3 years old
DS8Ks
– DS888x
- An "old" product is one that has gone through end-of-service (E10s, E20s, F20s, CVT, etc.; these are just examples and do not constitute an all-inclusive list).
- Note: SO Teams should always open a PMR against a product, model type and serial number
ISSUE
Microsoft has announced that support for Windows Server 2003 will end on July 14, 2015
http://support.microsoft.com/lifecycle/?c2=1163
Storage Management applications may be running on Windows Server 2003 making problem resolution and
maintenance difficult
Many SO accounts still utilize Windows Server 2003
Storage vendors will not be including support for Windows Server 2003 in their new releases
Actions
Develop action plans to ensure Storage Management Applications are running on an appropriate operating system
Windows Server 2008 for upgrades from Windows Server 2003 https://ibm.biz/BdRhwe
Windows Server 2012 for new installations
Communicate with SO accounts that vendor support (drivers and problem resolution) of Windows Server 2003 storage
clients will be problematic after EOS
Communicate with SO accounts that new Storage devices are unlikely to support Windows Server 2003
Windows Server 2008 goes EOL in 2020. The recommendation is to shift to a higher version when the opportunity arises.
The next three slides cover ECAs. ECAs (Engineering Change Authorizations) are provided by STG to help improve reliability, availability and stability of storage technology. While an ECA significantly improves quality, it does not completely remove risk.
These are the only ECAs we are officially tracking in GTS SO. If an SSR or TSS rep provides other ECA recommendations, we encourage you to follow them. There are other ECAs handled via STG and TSS that we do not track; more details on these other ECAs are in the backup.
If any doubt or concern, reach out to Jim Olson or Karen Haberli with questions.
The best way to track devices specific to ECAs (ones needing / ones completed) is via HW/SW Currency - https://hwsw.boulder.ibm.com:8443/hscms//welcome.pro
Under "view inventory", use ECA scope and ECA number to track your devices.
Global Scorecard 2.x report will generate reports on all ECAs
Please work with your local SSR and use change management to get ECAs applied.
It is the responsibility of the GEO SSLs to track and manage their inventory and ECAs.
Net summary – code levels call out the fix level. See the spreadsheet for actual level guidance.
DS81/DS8 – not affected by Struts. Needs the BASH ICS applied, or minimum code 64.36.103.0.
DS8700 – the BASH ICS can be applied as a fix, but you need code level 76.31.105.0 for Struts; alternatively, 76.31.121.0 or above will address both security issues.
DS8800 – the BASH ICS can be applied as a fix, but you need code level 86.31.123.0 for Struts; alternatively, 86.31.142.0 or above will address both security issues.
DS8870 – the BASH ICS and Struts ICS can be applied; alternatively, 87.21.39.0 addresses both issues.
DS8K ECA1111
Call home digital certificates on DS8800 and DS8870 will expire on August 1, 2018
Abstract: Call home requires a digital certificate to successfully report a problem to IBM. DS8800 and
DS8870 machine models will require a new call home certificate by August 1, 2018 to continue to
successfully call home. IBM will need to install a new call home certificate on the HMC(s). DS8880 is not
affected by the change in call home certificates.
Content: The DS8800 and DS8870 use call home certificates when negotiating with the server to call home to IBM. On August 1st, these certificates expire and will need to be replaced with Digital Geo Trust digital certificates. Installing the new Digital Geo Trust digital certificates is a seamless and concurrent operation.
Exposure: Machines that do not have updated call home certificates by August 1st will no longer be able to report problems to IBM.
Mitigation:
An ICS-CD, CSE_CallHomeCertificates_v1.0.iso, is available to update the call home certificate and can apply the new certificate. It can be installed by remote support or by an SSR. The certificate update is concurrent and does not require an HMC reboot. Installation will require ~15 min per HMC for remote support on DS8870 machines running 87.51.93.0 or higher; SSR installation will take ~10 min per HMC.
If you have already installed CVE_1Q2018 or CVE_2017-1123 to address security concerns, there is no
need to install the CSE_CallHomeCertificates_v1.0.iso because the new certificate was included.
Exposed Microcode Levels: All current microcode levels for DS8800 and DS8870 products.
Fix: Contact your IBM Service Representative or call IBM Service to open a PMR to have the call home
certificate installed.
DS8K ECA413
Overall strategy is to use a new product called HW/SW Currency for staging and validation of account based
firmware and inventory data from STG and TPC deployments that can then be eventually warehoused in
GACDW.
– Currently porting data from the STG “heartbeat” and “Call Home” databases for installed currency levels. Additional details
being pulled from SRM (for servers), Inventory and Security databases
– Plan to extend coverage to include STG SmartCare, TSS Tech Services Appliance, ISC Fulfilment, TPC, TAD4D, TEM, USI-BRS, etc. for a comprehensive validation of account assets by triangulation
– The challenge here is to reliably filter SO assets from STG data. Until automated account feeds can be established to validate SO assets, it is the SSL GEO teams' responsibility to maintain their current inventory in ProSliM. It is IMPERATIVE that our storage administrators keep HW/SW Currency updated with accurate code versions. This should be done as part of your day-to-day activities to ensure accurate information for call home, code levels, ECAs and more.
Reporting, Analytics and Management System in HW/SW Currency for tracking compliance to code levels,
ECAs and EOL
– On demand analytics and push/pull reports on code levels, ECAs and historical trending
– Ongoing automated tracking of assets at the Account and Machine Type/Model and Serial number granularity for adherence
to minimum code and ECA patch levels
– Management system support for views by Delivery Center and Pools, Geographic Account hierarchy, Sectors and
configurable set of focus accounts
– Configurable rules for GTS SO Delivery operational policies for compliance with code levels
Current repository (sent via email) is not ideal:
– Delay between an alert being sent and its posting in the global clearing house
– Search ability is dependent on each admin's email organization skills

New community repository:
– Easy to navigate and has robust search functionality
– User community can subscribe to receive notification when alerts are posted
– Email notification can be sent to the community at the touch of a button to ensure all community members are notified of urgent alerts; notifications go to 800+ SMEs when alerts are posted
– Robust search functionality within the Lotus Notes community allows search of all titles and rich text for each alert
– Easy to add users via Blue Group, distribution lists, or self registration
– Easy to access and edit, widespread audience
– Built-in comments section allows feedback on posted alerts
– Maintains edit history
Link to the tech alert database below:
https://w3-connections.ibm.com/communities/service/html/communityview?communityUuid=72dbfef0-8c6f-4dd5-9e51-738902fe1944
*** The future plan is to have the storage team assist with this area. We are working with IBM Research to see if there is a standalone version of HW/SW (aka ProSlim) to help with this interoperability work prior to account deployments. ***
Objective
Install SAT at all SO Accounts
Configure SAT with appropriate SMTP infrastructure and admin email addresses
Add all existing storage devices to SAT for routine health checking
Perform a Pre Go-Live Health Check prior to production on any new storage devices coming into the environment
Maintain SAT at current recommended version. See code strategy spreadsheet for acceptable levels. Spreadsheet can be
found at: http://ibm.biz/BdrACA
Actions
If SAT is not yet installed, download and install it from the link above
Check at least once per quarter for new SAT version releases
New versions of SAT are generally released at the end of each quarter
Install the latest SAT version for new installations.
Existing SAT installations may continue to run the prior quarter’s version of SAT or may install the latest version.
You must install the latest SAT version if the version you are running is more than 2 levels behind the latest
Most important: check output from SAT daily, and document and drive a plan to remediate findings in priority order: Fatal, then Critical, then Major, then Minor errors. Ensure issues and actions are shared/tracked with the account team (SIL as an example).
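To illustrate that remediation ordering, a small hypothetical sketch; SAT's actual output format is not reproduced here, so the finding records below are assumptions.

    # Order SAT findings by the remediation priority named above:
    # Fatal, then Critical, then Major, then Minor (sketch only).
    PRIORITY = {"Fatal": 0, "Critical": 1, "Major": 2, "Minor": 3}

    findings = [
        {"device": "SVC-01", "severity": "Major", "msg": "fabric port flapping"},
        {"device": "DS8-02", "severity": "Fatal", "msg": "callhome not configured"},
        {"device": "XIV-07", "severity": "Minor", "msg": "NTP drift"},
    ]

    for f in sorted(findings, key=lambda f: PRIORITY[f["severity"]]):
        print(f["severity"], f["device"], f["msg"])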
Dashboard value:
– Measure/report on device configuration Best Practices
– Measure/report on new device configuration problems identified
– Measure/report on device configuration problems solved
– Measure/report on device configuration issues not being resolved
– Use these measures to drive quality initiatives
– Global Analytics for exec reporting (2015 plan)
– Auto-email notification to DPEs for non-compliance (2015 plan)
All storage devices eligible for calling home need to be configured to call home. Exceptions should be limited only to clients that do not allow this capability. Exceptions require an update in HW/SW Currency noting that the device is not allowed to call home and who from the account team approved the exception.
The purpose of the Call-Home guideline is to ensure that all devices (in scope) operated by GTS SO
Delivery
– have call-home correctly set up and tested
– have verified functionality at least once a year (during mandatory code upgrade)
– have added an appropriate identifier (“<some account name> (SO ACCOUNT)”)
Identify systems that never call home and have no heartbeat
Strategy is to develop guidelines for the in-scope devices
– add a field descriptor to all arrays
– verify the information added gets through to call-home and is added to PMR
– identify storage arrays for SO accounts in PMRs with searchable descriptor
– feed the information into existing centralized reporting (HW/SW Currency)
Any device not being allowed to call home MUST be configured for SNMP alerting to SYSOPs.
A RISK needs to be written for any device not being allowed to call home where the reason is not a client security issue. As an example, if someone states it's a funding issue, then a RISK needs to be written. Specifically use the account risk process (not CIRATs).
Strategy will be to include ‘SO ACCOUNT’ in a specific field for call home so that we will be able to
search the call home databases in the future for SO hardware. See link for best practices for this entire
process.
The closed loop process via auto-email to DPEs is the strategy (via HW/SW Currency). It is currently deployed for XIV and DS8; SVC and FlashSystem are coming 1Q2015.
The overall strategy is to leverage VPN vs modem; modem support is going EOL within the next 12 months.
Key item for teams responsible for storage code upgrades: the strategy is to use HW/SW Currency and validate the heartbeat during the code upgrade process. If the heartbeat is not current (30 days or less), a call home test needs to be executed to validate it is working. Resolution needs to occur to ensure heartbeat and call home are working.
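A minimal sketch of that 30-day heartbeat check, assuming last-heartbeat dates are available; the record format is an assumption, not the HW/SW Currency schema.

    from datetime import date, timedelta

    # Flag devices whose last heartbeat is older than 30 days so a call home
    # test can be scheduled in the upgrade window (illustrative only).
    HEARTBEAT_MAX_AGE = timedelta(days=30)

    def needs_callhome_test(last_heartbeat, today=None):
        today = today or date.today()
        return (today - last_heartbeat) > HEARTBEAT_MAX_AGE

    print(needs_callhome_test(date(2019, 2, 1), today=date(2019, 4, 1)))  # True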
It is extremely important that every device in the data center shows time-synchronized log entries, so that log files / events can be correlated between devices. If the devices are not working from a common time source, we have to calculate the deltas (differences) between each device's timestamps – a manual process that can lead to wrong conclusions about which device was responsible for a particular issue. The way to prevent this mix-up is a common clock source using the Network Time Protocol (NTP).
The purpose of this guideline is to ensure that all devices (in scope) operated by GTS SO
Delivery
– have a common clock source correctly set up and tested
– use a common offset to UTC (Coordinated Universal Time) that needs to be defined per account and per device
– have verified functionality at least once a year (during mandatory code upgrade)
– have a common clock source per datacenter
– It is better when servers use the same synchronized source as well
Application job runtimes will not be affected, as those use system clocks; this is a hardware clock discussion. Still, we should check jobs, as things like post-processing could be affected.
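As an illustration of the manual delta calculation this guideline avoids: a hypothetical sketch normalizing per-device timestamps with known offsets before correlating events. The device names and offsets are invented.

    from datetime import datetime, timedelta

    # When devices are NOT on a common NTP source, event times must be
    # corrected with per-device offsets before correlation (sketch only).
    offsets = {"svc01": timedelta(seconds=-42), "ds8k01": timedelta(seconds=117)}

    events = [
        ("svc01", datetime(2019, 4, 1, 10, 0, 7), "port logout"),
        ("ds8k01", datetime(2019, 4, 1, 10, 2, 1), "link error"),
    ]

    # Sort by corrected (common-clock) time, then print corrected timestamps.
    for device, ts, msg in sorted(events, key=lambda e: e[1] - offsets[e[0]]):
        print(device, (ts - offsets[device]).isoformat(), msg)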
Objective
Perform analyses on all SO accounts to identify maintenance contract gaps
Develop action plans to address gaps and/or put risk letters in place for potential SLA penalties
Develop closed loop process to ensure maintenance is set up properly at engagement and maintained throughout the
contract lifecycle
Partner with the NSO to ensure their maintenance contract offering has deeper penetration across SO accounts
Actions
Analysis completed showing that only 31% of existing SO accounts are utilizing the NSO offering
Executive sponsorship garnered to communicate the criticality of this issue to sectors
Approached accounts about maintenance gaps through Account Health Dashboards
Approached Framework blue development to have NSO offering included in engagement cost cases
Working with NSO to have the offering presented to industries through staff meetings, newsletters and presentations
Working with account teams to have documented risk letters in place when gaps are identified
The goal for 2019 is to have all applicable accounts retrofitted with the offering and to include it in all new engagements
Priority 2:
– Brocade SAN switches (TSS)
– 3590, LTO1 & LTO2 tape drives – all are EOS (TSS)
– Hydra (TSS)
– FlashSystem 710/720/810/820 (not TSS, rather GTS EOL)
– Cisco GEN1 modules and GEN2 4G modules
– Brocade EOS/EOL (M48, 2498-B40; see next slide for all)
– XIV Gen2
– DS88
Priority 3:
– XIV, DS8
– DS5s (these are not TSS EOS, rather GTS EOL/EOS)
– Any remaining hardware officially listed as EOL/EOS
– SVC 2145-CF8 & CG8 (TSS EOL 07/01/19)
Backup
Changes
Changes | Date | Author | Why are we doing?
Reprioritized priorities on page 6 | 06/05/2013 | Jim Olson | We have made a lot of progress with DS8 and XIV and need more focus on Brocade (due to code quality issues)
DS8 ECA870 (RAID10) added to page 9 | 06/05/2013 | Jim Olson | Code level was already a part of our strategy. STG decided they wanted to track via an ECA.
XIV ECA306 added (page 11) | 06/05/2013 | Jim Olson | STG asked to push this ECA.
DS6K added to tech refresh priority page 14 | 06/05/2013 | Jim Olson | Parts limitation.
Code updates. Key updates to DS8 and XIV. | 06/05/2013 | Jim Olson |
URL updated page 1 | 06/05/2013 | Jim Olson | Hyperlink, as iRAM has had link issues.
Added page 22 with XIV ECA information | 06/05/2013 | Jim Olson | So team can see all XIV ECAs.
Added changes slide | 06/05/2013 | Jim Olson | Mike and Karen request. Good idea.
ECA867 added (RAID10) | 06/11/2013 | Jim Olson | Were already doing under RAID10 guidance
Changed min on DS8800 ECA867 to 86.20.130 from 86.20.114 | 08/01/2013 | Jim Olson | Per STG guidance
Included new SVC interop strategy | 08/27/2013 | Jim Olson, Mark Chitti, Kirby Dahman | Via Hursley guidance
ECA876 added | 08/12/2013 | Jim Olson | Added data protection per STG. Low box count, medium priority.
Added TS77XX to code strategy | 09/15/2013 | Jim Olson, Charlie Hayden | Needed
Added new call home strategy | 11/12/2013 | Jim Olson, Thomas Brachahn | Improving our call home position for SVC, DS8 and XIV
Updated EOL/EOS tech page 16 | 01/22/2014 | Jim Olson | EOL tech
Added page 17 | 03/22/2014 | |
Updated ECA306 XIV slide with code levels where fix is | 02/07/2014 | Jim Olson | Fixed at later levels
Added in FlashSystem pre-code upgrade | 04/04/2014 | Jim Olson | New process
Changes
Changes | Date | Author | Why are we doing?
Added slide 19 for Windows 2003 EOS | 04/25/2014 | Jim Olson/Dave Schustek | We have servers that run our tools. Important we ensure we are running on supported OS's.
Updated EOL/EOS tech page 16. Added page 17 on Brocade EOL. | 04/25/2014 | Jim Olson | Bent Braum Holst asked for clarification
DS8K ECA876 moved to backup | 08/25/2014 V74 | Jim Olson | Low priority that has been out for some time.
TS7700 tape unit ECA009 added | 08/25/2014 V74 | Jim Olson | Per STG guidance.
Added some CISCO EOL devices to page 17 | 09/10/2014 V75 | Jim Olson/Karen Haberli | New STG announcement
Added cover page for ECA section | 09/12/2014 V76 | Jim Olson | Better way to address ECA section
Added ECA899 | 09/12/2014 V76 | Jim Olson | New ECA for DS8 600GB drives
Changes
Changes | Date | Author | Why are we doing?
Updated Call Home slide (currently page 27). Modem going EOL. | 09/16/2014 V76 | Jim Olson | STG direction
Updated bottom of page 4 to remove tape drives from code management strategy. Recommendation is to follow CE/SSR guidance for tape drives. | 09/23/2014 V78 | Jim Olson | WAY too difficult to manage.
DS8K Apache Struts and BASH added into ECA section (ECA1234 and ECA5678) | 10/22/2014 V79 | Jim Olson/Kirby Dahman | Added to HW/SW so we can now track
Minor update on page 8 | 12/03/2014 V81 | Jim Olson | Added more specifics on ECA
Added slide for Storage Automation | 12/04/2014 V82 | Jim Olson, Jason, Stanley | Needed
Security section added. Old page 9 removed and now we have 9 & 10 | 01/07/2015 V83 | Jim Olson, Rodney Mulrooney, Alan Skinner | Results from work with Security teams in IBM
DS5K added to Tech Refresh Section/GTS EOL | 01/07/2015 V84 | Global Design Authority | Major CIEs in 2014
Added slide 30 on DashBoard | 01/15/2015 V84 | Jason, Stanley, Jim | T&I strategy in 2015
Some tuning to security slides 9 & 10 | 01/27/2015 V84 | Rod, Jim | Per SARM
Tweaked slide 20. Tech did not change, just verbiage, so no SSL ACB review. | 02/12/2015 V85 | Jim | Partnering with TSS.
Minor verbiage changes to slides 9 & 10. | 02/12/2015 V85 | Steve Biles |
Minor tuning to 21/22 for CISCO EOL | 02/17/2015 V86 | Jim/Art Scrimo | More kit going EOS
Updated ECA section. Marked complete all but ECA826, ECA899 and ECA009 | 04/03/2015 V89 | Jim/IBM Systems/Rich Oubre | Enough done on older ones to mark complete
Changes
Changes | Date | Author | Why are we doing?
Updated slide 8. Moved FlashSystem up to priority #1 due to major code bugs as of late. | 04/07/2015 V90 | Jim Olson | Total data loss on one account and major code bugs (timer issues).
Added slide 15 for SVC Global Mirror issue | 04/13/2015 V91 | Patrick Keyes | Critical bug
Added slide 32 to summarize current priorities | 04/27/2015 V91 | Jim/Ken Morgan | Good idea
Updated slide 10 to clarify differences between storage code and SW like TPC | 05/11/2015 V91 | Jim/Steve Biles | Security guidance
SVC interop guidance changed | 05/29/2015 V92 | Jim/Hursley/DA approved | Easier planning for SVC upgrades
Updated slide 19 to include Brocade kit | 08/07/2015 V93 | Jim/Kirby | M48 marked EOL
Updated EMC with new EOL links | 09/25/2015 V94 | Jim/Karen | New EOL links
Added V7000 HDD issue | 10/01/2015 V95 | David Schustek | New high impact tech alert
Updated EMC EOL links on slide 20 | 10/20/2015 V96 | Francesco/Karen | Better links
Made some updates to slide 20 for CISCO | 11/09/2015 V96 | Lyle Ramsey | New info
Removed ECA826 and ECA899 due to age. Slides moved to backup | 11/10/2015 V97 | Jim Olson | Age
Added slide 38 in backup. Calls out all new DS8 ECAs. Blue GTS chases while PFE and Systems will chase the rest | 11/10/2015 V97 | Jim Olson/Keith Williams | Systems directive
Slides 12 and 13 now reflect two new ECAs we will chase for DS8 – ECA714 and ECA715 | 11/10/2015 V97 | Jim Olson/Keith Williams | Systems directive
ECA826 and ECA009 moved to backup | 11/30/2015 V97 | Jim Olson | Aged
ECA021 added for XIV | 11/30/2015 V97 | Patrick Keyes | New one to chase per Systems
Update on slide 5 related to concurrent upgrades on SVC clusters | 12/05/2015 V97 | Jim Olson | Critical lesson learned
Changes
Added slide 8. Overview of roles and responsibilities. | 02/01/2016 V98 | Ken Morgan/Alan Skinner | Needed
Completed ECA section (714). Added in two more ECAs as well. See ECA section. | 03/01/2016 V99 | Jim Olson/Charlie Hayden | New DS8 ECAs
Moved DS5s to priority 3 | 03/14/2016 V101 | Jim Olson | DS5s are not TSS EOL/EOS, so reducing priority due to some pushback
Updated slide 22 Cisco EOS/EOL section | 03/29/2016 V103 | Lyle Ramsey | Some devices listed were incorrect
FlashSystems added to EOL. See slides 21/22. | 06/01/2016 V103 | Jim Olson | Old, and not many.
Updated slide 11 with new Risk Mgmt process | 06/04/2016 V103 | Steve Biles | Needed
Updated slides 31 & 34, XIV and DS8K to Priority 1 for callhome/heartbeat | 06/27/2016 V103 | Ken Morgan | Extra focus on DS8K and XIV callhome/heartbeat
Changes
3494 tape libraries added to EOL | 11/30/2016 2Q2016 V2 | Jim Olson | Aged. Official EOL 1/31/2017
Added TS7700 EOS data (pages 21 & 23) | 1/18/17 | Karen | EOS data provided for Geos
Added DS8700 | 1/30/2017 | Ken Morgan | EOS NA Dec 31, 2017; other IOTs to be announced soon per Systems
Added Storwize (V7, V5, V3) with SVC as priority 1 code | 1/30/2017 | Ken Morgan | Improved clarity: Storwize, not just SVC
Updated some Brocade EOL kit | 2/14/2017 | Clancy Obrien | New info
Updated slide 7 with new SVC Storwize disk drive upgrade process | 5/1/2017 | Jim Olson | New info
Updated slide 5 on release notes and usage for code upgrades | 5/14/2017 | Jim Olson | New info
Added Nseries to EOL priority 2 (list). Added slide 24 for EOL. | 5/14/2017 2Q17 Final V2 | Jim Olson | Needed
Added code upgrade guidance for quiet time and potential impact to slide 8 | 6/12/2017 | Jim/DA | Post outage at EMEA account
Moved FS 710/720/810/820 to priority 2 | 9/11/2017 | Jim/DA | New full data loss events
Changes
Marked all Nseries EOL | 10/2/2017 | Jim | All are TSS EOL in 8 months and we have had part and support issues, so giving teams time to vacate before official TSS EOL
Moved all ECAs to backup | 10/13/2017 4Q17 final3 | Jim | Old and considered done per support
Added call home details for DS8s | 10/13/2017 | Karen/Patrick | Worked with Systems
Added XIV Gen2, additional Hydras, tape drives & SVC models to EOS | 3/15/18 | Karen | Going EOS 12/31/18
Added ECA1111 for DS8 call home issue | 05/09/18 | Jim/Ken | IBM Systems guidance
Clarified V7K and Brocade on reuse slide | 9/6/18 | Dave | Missing details
Added in NetApp & DS8800 EOL & DS3K | 10/4/18 | Santnana/Karen | Was missing
SVC 2145 CF8 and CG8 | 12/05/18 | Jim | TSS EOL/EOS 07/19
Changes
Added slide with guidance on problematic drives | 01/31/2019 | Jim | Outages and data loss events
Our storage code strategy is to maintain a range of acceptable versions, for several reasons…
- All versions of code have additional fixes as part of the continuous improvement process (addressing past defects that all code versions have)
- We balance the value of those included fixes against the risk of newly released code (running bleeding-edge code has risk)
- Code upgrades are not always 100% successful, and many that are not successful result in an outage (~99.9% success for DS8K)
- Coordinating large change windows is a challenge. DS8K code upgrades are very time
consuming; most run 6 plus hours per device.
- There is significant work across the entire account when doing code upgrades (many
towers involved)
As such, we try hard to keep one year of acceptable code versions (in partnership with Systems/vendors). By design, the latest published version of code is not promoted to a recommended level for 3 to 4 months.
Summary:
ECA 714 is a mandatory FBM. The purpose of ECA 714 is to provide drive firmware to detect and reject DDMs which exhibit pivot bearing outgassing conditions that could lead to multiple drive failures and cause loss of access / data loss. The Viper C 600 GB 15K DDM is a large form factor (LFF, 3.5 in) drive; we shipped these drives in the DS8700 system. To minimize the impact of this issue, SSRs must order and apply ICS CD level DS8k_DDM_SSD_FW_Update_v1.10.iso to affected machines as stated in the machine list. The ICS CD is available for ordering from the Super Shippers DB or by using the CDA4TP tool to download from Fix Central at the link below:
https://port.rchland.ibm.com/support/fixcentral/ac/options
Checkpoint:
Use the HMC to check for drive FirmwareLevel F811, ddmFamily H5FH. If the drive FirmwareLevel is at F811 or higher, then no further action is required; otherwise obtain ICS CD DS8k_DDM_SSD_FW_Update_v1.10.iso and install ECA 714.
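A tiny sketch of that checkpoint logic – H5FH drives at FirmwareLevel F811 or higher need no action. The drive records below are assumptions; real data comes from the HMC.

    # ECA 714 checkpoint sketch: H5FH drives need firmware F811 or higher.
    drives = [
        {"ddmFamily": "H5FH", "FirmwareLevel": "F80A"},
        {"ddmFamily": "H5FH", "FirmwareLevel": "F811"},
    ]

    needs_ics = [d for d in drives
                 if d["ddmFamily"] == "H5FH" and d["FirmwareLevel"] < "F811"]
    print("apply ICS CD" if needs_ics else "no further action required")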
How long does installing the ICS CD take, and what if it fails?
It takes approximately 15 minutes to load the ICS CD into the HMC, and the firmware update starts automatically in the background. If for whatever reason the firmware update fails, the machine will call home to notify IBM service personnel of the issue so we can take immediate action to correct the problem.
Summary:
IBM has developed microcode enhancements for error handling in HPFE. DS8870 (R7.5 SP2.3) 87.51.23.4 microcode levels contain the changes to improve the reliability and availability of High Performance Flash Enclosures. This enhancement streamlines error detection and isolation when a failing flash drive exhibits excessive errors.
This change is designed to be concurrently installable on DS8870 presently running the R7.x
families of microcode.
Recommendation: A mandatory ECA 737 is being released to the field; we recommend upgrading to code bundle R7.5 SP2.3 87.51.23.4 as soon as possible.
Summary:
DS8870 Global Mirror suspends are caused by a microcode logic error introduced in R7.4 that results in a Track Format Descriptor mismatch. Microcode improperly sets a flag in a PPRC control block. This problem is pervasive in Global Mirror environments. R7.4 code levels below 87.41.44.0 and R7.5 levels below 87.51.23.4 are exposed to this issue.
Recommendation:
A mandatory ECA 712 is being released to the field; we recommend upgrading to code bundle R7.5 87.51.23.4.
Affected machines: GEN3s with specific models of Seagate disk drives inside. IBM has a list of affected XIVs from call-home data. GTS is working to get affected XIVs highlighted in the HW/SW CMS tool as well.
The disk_list command on the XIV Gen3 can be used to see the disks installed.
Affected disk model | Capacity | Firmware level containing fix for this issue
ST2000NM0043 | 2 TB | EC5C
ST3000NM0043 | 3 TB | EC5C
ST4000NM0043 | 4 TB | EC5C
ST6000NM0054 | 6 TB | EC6D
Problem Description
There is a risk of data corruption: the issue appears when the disk drives detect unreadable data. Should the issue occur, it results in a 512-byte block difference between the primary and secondary partitions – in effect, undetected data corruption during a specific drive error recovery sequence.
The issue is rare: according to the vendor, an affected drive is expected to hit the issue once in every 3,412 years. For a system with all 180 affected drives, that means once in every 38 years. The XIV scrubbing mechanism detects this issue.
Action: If you have the affected model disks, contact IBM support to update the disks' firmware to a fixing level. IBM support is also proactively contacting affected accounts – if an account has an XIV Technical Advisor, the Advisor will contact the account. The fix process is non-disruptive (there is a few-seconds performance impact to a physical disk drive while its firmware is upgraded, but since most I/O is expected to go via cache, the impact is expected to be trivial).
Open change records and work with the CE team to apply the disk firmware fix to the necessary machines. A global list has been provided to GTS from STG. Geo SSL leaders need to track all devices to completion. It is a low risk concurrent activity.
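A hypothetical sketch applying the fix-level table above to parsed disk_list output; disk_list is the XIV CLI command named earlier, but the parsed (model, firmware) record format here is an assumption.

    # Fix levels from the table above, keyed by affected Seagate model.
    FIX_LEVEL = {"ST2000NM0043": "EC5C", "ST3000NM0043": "EC5C",
                 "ST4000NM0043": "EC5C", "ST6000NM0054": "EC6D"}

    # (model, firmware) pairs assumed to come from parsed disk_list output.
    disks = [("ST4000NM0043", "EC5A"), ("ST6000NM0054", "EC6D")]

    for model, firmware in disks:
        fix = FIX_LEVEL.get(model)
        if fix and firmware < fix:   # naive compare, fine for same-format levels
            print(model, "at", firmware, "- contact IBM support to update to", fix)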
Problem:
Specific hard disk drive models supported by the Storwize family of products may be exposed to possible
undetected data corruption.
Remediation:
A firmware update that remediates against future occurrences of this issue is now available. IBM
recommends that all customers with the affected drives apply these latest levels of code.
Solution (Procedure):
The Systems Group has provided excellent instructions in the link above; follow their guidance for all V7000s:
1. Use the utility provided via the link to determine if the exposure exists.
2. If it does, determine which remediation path applies to your environment.
3. Follow the set of instructions associated with that remediation path.
2 | Global Service Engineering | Communicates strategy and attainment | Ken Morgan | Alan Skinner
5 | Account Team | Prioritizes resources (capital and labor) | *Leverage Global and IOT Quality Leads to work with account DPE/PE along with HRM team | *Leverage Global and IOT Quality Leads to work with account DPE/PE along with HRM team
6 | Storage Delivery Teams | Perform actions | Multiple – leverage IOT Leaders | Multiple – leverage IOT Leaders
7 | Tooling Strategy/Development and Domain | Work with IBM Systems to address serviceability (code, etc…) | Jim Olson | Richard Baird
9 | Global Service Engineering | Global Process Owner – end-to-end ownership to drive program | Ken Morgan | Alan Skinner
Updated Poodle (enables SSLv3) – DS8700, DS8800 – as required | ICS 700 | J16444 | N/A | Loss of access | 77 | $16,170.00 | Launched 10/12; TMAN/EPOC 10/09; II's/Notification; Tracking
Viper C – DS8700 – mandatory | ICS 714 | J16413 | 307442 | Addresses dual DDM failures, loss of access, data loss | 359 | $75,390.00 | TMAN/EPOC 10/16; II's/Notification; Tracking
Seagate Della | ICS 715 | J16445 | 308135 | Addresses a data loss | 928 | $194,800.00 | TMAN/EPOC 10/16; Tracking
Bluehawk DA pair 8 (CCL) – DS8870 – mandatory | 707 | J16446 | 309406 | Addresses a loss of access | 200 | $186,340.00 (CCL time: 4 hrs used for calculation) | TMAN/EPOC 9/24; fix available – formulating path for reduced customer risk for approaching freeze
R75 Upgrade with GM (ICS/CCL) – DS8870 – mandatory | 712 | J16457 | 310099 and 310103 | Loss of access | 108 | $22,680 (CCL or ICS, depending on current level) | TMAN/EPOC 10/13; II's/Notification; Tracking
Security Classifications
https://advisories.secintel.ibm.com/faq.php
What's the difference between High, Medium, Low, E-Fix and FYI ratings? Why do FYI ratings
show up as High, Medium, Low in the database?
Ratings are an assessment of the severity of the vulnerability. They are used to calculate due dates
(according to criteria specified in ITCS104) for patch implementation.
FYI and E-Fix designations are not ratings; they indicate that no compliance activity is required. The advisory is being issued for awareness (usually the vendor includes workaround information), but there is no mandatory action associated with it due to the lack of available supported patches.
However, even in the case where the vendor has not released a patch, there's still a vulnerability that the
vendor is reporting. As part of our FYI communication, we assess the severity of that vulnerability and
assign it a rating. The rating and the compliance activity are two separate pieces of the process.
Smart Rebuild
Smart Rebuild was developed for problematic 450 GB 3.5-inch drives, but has since been expanded to all drive types. The initial release of Smart Rebuild performed a check twice a day for drives exceeding 3 media errors; this was later increased to hourly.
How does it work? | Standard rebuild: parity calculation for all data. | Smart Rebuild: data copy, with parity calculation for unreadable data only.
How long does it take? | Standard rebuild: 3-4 hours for a 450 GB drive. | Smart Rebuild: ~1 hour.
Am I vulnerable while rebuilding? | Standard rebuild: can NOT handle any additional failures. | Smart Rebuild: can handle an additional failure.
There has been a 3.5x reduction in dual disk failures since Smart Rebuild was introduced.
How does it work? | Calculates data from parity and reassigns to a new sector. | Calculates data from parity and reassigns to a new sector. | Calculates data from parity and reassigns to a new sector.
How are bad sectors discovered? | Client I/O. | Proactive scanning of the DDMs by the DA. | DA reads DDM error logs and prioritizes bad sector reassignments.
How long could a bad sector exist? | Days, weeks, months… until I/O is requested for that sector. | Up to 4 days max. | ~1 hour.
ECA 001 (GEN2) - Apply UPS power switch guard and regulatory labels which are missed items
at GA
ECA 006 (GEN2) - Manufacturing shipped 29 systems with duplicate serial numbers to clients.
EC corrects the VPD in the box and replaces the serial number label.
ECA 007 (GEN2) - Apply a UPS circuit breaker guard as UPS switch is too sensitive to incidental
contact that can cause the UPS to power down resulting in loss of access to data.
ECA 008 (GEN2) - Apply a UPS pigtail retention clip.
ECA 009 (GEN2) - Apply a 4-line cord retention clip.
ECA 014 (GEN2) - ATS Monitoring cables
ECA 110 (GEN2) - Software fix to perform a UPS self test work around
ECA 116 (GEN2) - Perform commands at software level which update VPD on R2.2 systems
ECA 132 (GEN2) - Missing Prevent Services Invocation file for all XIVs at code level 10.1.x
through 10.2.4.b
ECA 135 (GEN2) - Code release level 10.2.4.e and 10.2.4.e-3 after 5/15/13
ECA 304 (GEN3) - Code release level 11.1.1
ECA 305 (GEN3) - SAS firmware rolling upgrade with 11.1.1 on 126 XIVs
ECA 306 (GEN3) - SM Memory Leak Patch
** Purple are ECAs that align to our global XIV storage strategy
** Blue is the only non-code ECA we are globally chasing.
Microcode and firmware should be updated on a yearly basis. The key reasons for this are…
– Newer versions of code bring increased stability
– Reduction in client impacting events
– Overall performance improvements
– New feature benefits
– Advanced copy services/replication improvements
Without following this strategy, upgrades become more difficult – the further away from a year you get, the more complex the upgrades become (see the sketch after this list)…
– More often than not requiring multi-hops to get to the new target code level
– Increased interoperability work when not complying
– Overall increased risk of code upgrades when they do occur
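A rough sketch of the yearly cadence check this implies; the 12-month threshold comes from the strategy above, while the fleet records and month arithmetic are illustrative assumptions.

    from datetime import date

    # Flag devices whose installed code is older than the one-year window
    # described above (sketch; dates would come from HW/SW Currency).
    def months_between(earlier, later):
        return (later.year - earlier.year) * 12 + (later.month - earlier.month)

    fleet = {"DS8870-A": date(2018, 3, 1), "SVC-B": date(2017, 1, 15)}
    today = date(2019, 4, 1)

    for name, installed in fleet.items():
        age = months_between(installed, today)
        if age > 12:
            print(name, ":", age, "months on current code - expect multi-hop planning")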
ECA 825 (Communicated) | 242X models 92x, 93x, 9Ax (DS8100, DS8300) | Feb 2012 actual | Reduce unnecessary DDM rejections and better handle those DDMs being rejected, to prevent loss of access or data loss CIEs (focus on 450GB DDMs) – approximately 1283 systems are targeted; minimum ICS CD version is 4.4 (available now), R4.2 code systems need version 4.6 or higher (available in March)
ECA 826 (Communicated) | 242X models 94x (DS8700) | Mar 2012 actual | ~2400 systems, to better address DS8700 fabric and DDM errors; launched March 20 after R6.2 SP1 exited test Feb 15 (this service pack also fixes two HIPER defects: the RBC zHPF problem and the Mizuho quiesce/failback problem)
ECA 986 (Communicated) | 242X models 92x, 93x, 9Ax (DS8100, DS8300) | Jan 2012 actual | As-required ECA focused on Hitachi B – rare data loss (CKD) or data corruption (open host) scenario – nonconcurrent solution available now, concurrent solution available to 2100 boxes (expect no more than 140 to be interested in the fix) by March (high impact / non-pervasive) – fix delivered via ICS CD for code levels from 3.1x through R4.3 – extremely rare data error exposure and not being pushed to the field; provided only per client request
ECA 850 (PE CONTROLLED) | 242X models 92x, 93x, 9Ax (DS8100, DS8300) | Nov 2012 actual | Load ICS CD V4.7 on 5+ year old systems already on R4.2 or R4.3 code, to expand refresh rate / refresh count monitoring and handling to other vintages and capacities of DDMs to further reduce CIEs – 1529 systems
ECA 860 (PE CONTROLLED) | 242X models 94x (DS8700) | Dec 2012 actual | Load R6.2 code on DS8700s on R6.2 code below the bundle having the DL exposure fix (76.20.90.0) to further reduce CIEs – 102 systems
ECA 861 (PE CONTROLLED) | 242X models 9xx (DS81/83/87/8800) | Jan 2013 actual | Disable eBGMS on RAID 6 systems on designated bundles – 105 systems
ECA 850 (new) – expanding the capabilities of ECA 825 for approximately 1500 5+ year old DS8100s and DS8300s already on release 4.2 or 4.3
code for earlier vintages of DDMs and for capacities other than 450 GB. SSRs are being encouraged to combine this very short and simple update
with other scheduled repair actions for systems qualifying for this ECA (since these older systems have periodic repairs anyway).
All, in trying to make tracking and managing our ‘Smart Rebuild’ DS8100/DS8300 ECAs (ECA825/ECA850) easier, we are changing the strategy to
this.....
• Any device where ICSv4.4 was applied per ECA825 tracking, you are good.
• Any device where ICSv4.7 was applied per ECA850 tracking, you are good
• All other DS81/DS83s need ICSv4.7 applied
ECA826 – updating all DS8700s to release 76.2.90.0 code to benefit from DDM
error handling improvements, PCIe / fabric error handling and other improvements
• Working with HW/SW team to get included. Do not have clear % complete.
DS8K ECA899
ABSTRACT: ECA899 - 600 GB 15K DDM Code Update
SUMMARY:
ECA899 is a mandatory FBM. The purpose of this ECA is to improve field performance and resiliency of systems which have 600 GB 15K DDMs. ECA899 provides enhancements which identify and 'fence' DDMs before a secondary event occurs that might affect clients' IT operations. The EBGMS (Enhanced Background Media Scan) is firmware which is part of the DA (Device Adapter); this firmware was first introduced in Microcode Bundle 76.31.55.0 (released July 2, 2013). ECA899 requires a code load in order to pick up this firmware.
CHECKPOINT:
If your machines are on code level R6.3.1 (76.31.55.0) or higher, then no action is required. Otherwise, use the SuperShipper or follow the established procedure in your GEO to order the microcode bundle (R6.3 SP6) 76.31.79.0 and install ECA899.
Evaluation Order:
ECA899 (600GB DDM on DS8700) fixes the issue for ECA861 (eBGMS). Therefore, the need for
ECA899 should be evaluated on DS8700 before considering applying ECA861 to avoid adding ECA861
unnecessarily.
Installation instructions:
It is advisable to pull the latest code installation instructions from the following PFE DB:
https://ssgtech10.tucson.ibm.com/cress/TestDS8K.nsf/b3b63faf91dd65cb0725760900762dff/da6480bc7aa2d21707257b3b007fce48?OpenDocument&TableRow=5.1#5
DS8K zHPF - Potential zOS I/O Timeout on SRC Problem Events and PPRC
Link State Changes when zHPF is Enabled– High Priority (low box count)
Tracking is owned by Storage Service Line GEO leaders once lists are produced
Abstract
- Potential zOS I/O Timeout on SRC Problem Events and PPRC Link State Changes when zHPF is Enabled
Problem Description
- IBM has identified a problem on R63 in zHPF processing where SRC events and PPRC link state change notifications can cause an I/O hang for a volume until the I/O Timeout (MIH) is reached and the host cleans up the hung operation. This results in an IOS071I "START PENDING" message at the zOS console indicating that an I/O timeout has occurred. This problem is possible on DS8800s on 86.3x.xx.x below 86.31.49.0 and DS8700s on 76.3x.xx.x below 76.31.32.0. The following conditions must be present to have the possibility of hitting the problem…
- zOS connected DS8700 or DS8800 on R63 code
- An SRC (Problem) or a PPRC Link state change notification must be generated from the LPAR, which causes a
SIM to be sent from the DS8000. Typical PPRC link state change notifications would be for high failure rate and
path loss notifications for PPRC Links.
- A collision must occur between the SIM going to the host and a zHPF I/O.
Mitigation
- The mitigation for this problem is to enable the Multi-Host System Information Message (SIM) function. This function enables sending SIM events to multiple connected zOS hosts instead of just one. When the new function is enabled, SIM communication goes through a different code path which bypasses the affected area of code. The Multi-Host SIM function is an SSR-accessible method to turn this feature on; please contact the next level of support for instructions.
Resolution/Support
- The DS8000 now has firmware available that addresses this problem, resulting in SIM presentation being correctly offloaded to zHPF I/O without causing an I/O timeout. This fix is available in the following code bundles:
- DS8800 on release 6.3 code: 86.31.49.0 or higher
- DS8700 on release 6.3 code: 76.31.32.0 or higher
Affected Machines: All GEN3 code level systems currently out in the field.
Problem Description
The signature of the issue is the following event sequence:
– The critical event NODE_FAILED, with a description saying the cache has failed (something like: Node #<n> of type cache on 1:Module:<n> failed because of <reason>)
– Approximately a week before that, the event MASTER_SM_CHOSEN was emitted
If the above sequence occurs, the action plan is to power cycle the failed module and phase it in. There is NO need to replace the failed module in this case.
Action: The fix is contained in 11.2.0.b as the minimum level, but 11.3.1 or 11.4.1.a is recommended based on required features and/or fixes needed in the client environment. Follow the code guidance in the spreadsheet.
Open change records and work with the CE team to apply the patch to the needed machines. A global list was provided to the GTS community from STG. Geo SSL leaders need to track all devices to completion. It is a low risk concurrent activity.
ABSTRACT: Enhanced Thresholding & Error Recovery for a focused set of DS8700s containing 450GB
DDMs.
SUMMARY: With the evolution of Smart Rebuild (SMRB) algorithms and insight gained of DDM failure
modes and field analytics, we have developed enhancements to thresholding and the DDM sparing process.
For these specific subsystems defined by the ECA876, we expect a 2x improvement in DDM error handling
robustness.
The subsystems chosen for this ECA were identified through full field analytics. The changes in algorithms
and thresholds have the highest value for the DDM failure modes specific to these subsystems. As part of
the field quality improvement process, we will continue to identify tail of the distribution & unique failure
mode opportunities to improve overall quality.
Abstract
- Potential loss of access / data loss exposure when running the new Enhanced
Background DDM Media detection algorithms and RAID6.
Problem Description
- Due to a RAID6 potential issue, IBM has identified that a rare disk triggered loss of
access / data loss exposure exists when running the new enhanced background DDM
media detection algorithms and RAID6 on certain DS8000 code bundles.
- Customers running R4.3H (DS8100/8300 bundles 64.36.35.0 or higher, but less than target level 64.36.89.0), R6.3 (DS8700 bundles 76.30.42.0 or higher but less than 76.31.55.0; DS8800 bundles 86.30.50.0 or higher but less than 86.31.70.0), and R7.0 (DS8870 bundles 87.x.x.x) using RAID6 are exposed.
Mitigation
- It is recommended that clients with exposed DS8000 systems arrange with their IBM
Service Representative to install ECA 861 to disable the new Enhanced Background
DDM Media detection algorithms until a fix is available.
DS8K ECA688
ABSTRACT: ECA 688 - DDM F/W update
Summary:
The ECA 688 is a mandatory FBM. The purpose of this ECA is to provide drive firmware
updates to prevent potential data loss.
These firmware codes affect the following drive types:
Checkpoint:
Use the HMC to check Firmware Level for the following DDM types:
If the drive Firmware Level on any of the DDMs is lower than the minimum level indicated above, then SSRs must order ICS CD DS8k_DDM_SSD_FW_Update_v1.12.iso and install ECA 688.
Background:
The following applies only to IBM Virtualization Engine TS7700s implementing replication services. TS7700 stand-alone configurations
(i.e., configurations not in a grid) are not exposed.
When an IBM Virtualization Engine TS7700 is running any release level from R2.1 (8.21.0.63) through and including R2.1pga4a
(8.21.0.145) in a TS7700 Grid configuration, it is possible for a replication target to inadvertently skip the replication for a given volume and
view the prior replicated content for the same volume serial at the same target location as valid. In the event that a skipped volume is read
by a host and the prior level instance within the TS7700 Grid is chosen, System z mount or open processing should detect a “dataset
mismatch failure.” If label bypass processing were utilized, then previously written, but now potentially out of date, content for said volume
could be returned to the host application. If the previous use of the volume had its contents successfully deleted while in a scratch state,
the copy target can still be inadvertently skipped but there is no potential to surface out of date content given said stale content was
previously deleted. If R2.1 through R2.1pga4a (8.21.0.145) is currently installed or has been installed in the past, an exposure may have
occurred. Once the release level R2.1pga5 (8.21.0.155) or later is installed on all members of the Grid configuration, the risk of additional
exposure is eliminated.
IBM has created a tool that can detect whether such an error possibly occurred, which must be followed by a manual process to determine
whether it actually did occur. If one or more cases are detected, the down level instances can typically be corrected through automatic
TS7700 Grid replication. The tool uses minimal system resource and can be run concurrently with a production workload. Please contact
your local support team to schedule an opportunity for IBM service personnel to run this tool on your TS7700 if you have or feel there is a
potential for you to encounter this problem.
Solution (Procedure):
In order to verify whether a TS7700 has been affected by this issue, vtd_exec.171 needs to be run against one of the clusters in the grid. It is not necessary to repeat the check on all clusters separately, but all clusters must be online when running vtd_exec.171. The exec can be run at multiple chosen intervals until R2.1pga5 (8.21.0.155) or higher is installed, at which point the risk of exposure is eliminated. If a higher level of code is already installed, the exec only needs to be run once to determine whether a past exposure occurred while the affected code levels above were installed.
The new certificate means users will need to accept a new certificate when they next log in – any live GUI sessions will close.
There is no impact on SSH keys - they are not impacted by Heartbleed and there is no need
to regenerate those keys.
For SVC/V7000/V5000/V3700:
1) Replace your SSL Certificates:
Regenerate the system's private key and SSL certificate by issuing the command line interface (CLI)
command "svctask chsystem -regensslcert".
Warning: Your environment may require additional fixes for other products, including non-IBM
products. Please replace the SSL certificates and reset the user credentials after applying the
necessary fixes to your environment.
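For scripted fleets, a hedged sketch of running the documented regeneration command over SSH. The host names are placeholders and the use of the system ssh client is an assumption; approved change management applies before running anything like this.

    import subprocess

    # Run the documented command on each cluster via the system ssh client
    # (hosts are placeholders; follow change management before execution).
    clusters = ["svc-cluster-a.example.com", "v7000-b.example.com"]

    for host in clusters:
        subprocess.run(
            ["ssh", "superuser@" + host, "svctask chsystem -regensslcert"],
            check=True,
        )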
Problem:
• Accounts using SVC / V7000 standard Global Mirror on certain levels of Primary and Secondary controller code are
exposed to the risk that Global Mirror Source Data May Be Incompletely Replicated to Target Volumes
Background:
• Note that this applies only to standard Global Mirror; if you run only Global Mirror with Change volumes (cycling mode)
then you are not affected.
• Note that your current code level is not the only relevant factor; you may still be exposed if you were previously running on an affected code level: if the Secondary GM cluster was ever on code level 7.2.x up to 7.2.0.10 or 7.3.x up to 7.3.0.8, or if the Primary GM cluster was ever on code level 7.2.x up to 7.2.0.7, then you are at risk.
• If you are affected by this, follow the GTS Delivery specific code guidance and the instructions in the IBM support FLASH. It is very important NOT to start running off your Global Mirror target volumes or to reverse the Global Mirror direction. If you have done this, contact us immediately.
• The official IBM Flash is here: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005053
• GTS delivery also has this FAQ page:
https://w3-connections.ibm.com/wikis/home?lang=en-us#!/wiki/We1a54e872660_4f1d_a6ea_3e76a494b6a3/page/FAQs
Solution (Procedure):
• Upgrade source and target controllers to a safe level, which means no further "data holes" will be created; see the links above for code levels.
• Then address the possible data holes:
• by a full fresh sync of all Global Mirror data (this is the GTS preferred approach), or
• by installing checksum servers and running the IBM-provided checksum scripts to check for and address any inconsistencies on target Global Mirror volumes (details in the IBM Flash above).