Ask Yourself

Lesson 1: Ask Yourself Overview

Lesson Objectives
After completing this lesson, you will be able to:
* Explain the importance of assessing the potential impact of your work before beginning a task
* Explain why it is more important than ever to avoid preventable outages and operational errors and to reduce risk to the business
* Provide flawless execution and attention to detail, and prevent problems before they occur: "Do It Right, the First Time, and Every Time"
* Be prepared, have a solid plan of execution and a backout plan, and validate the work performed to ensure the change did not have any negative impacts

WHAT IS ASK YOURSELF?
Ask Yourself is a valuable program that requires every person at AT&T to assess the potential impact of their work before beginning a task.

The Ask Yourself (AY) principles provide a means for every AT&T employee, agent, and Non-Payroll Worker to improve AT&T's network, technology and operational service reliability. After all, protecting AT&T's technology capabilities and our customers' service capabilities is everyone's responsibility.

The goal is to provide flawless work execution and to prevent problems before they occur. We should be prepared, have a solid plan of execution, and validate to ensure the work we performed did not cause any negative impacts. We must all be committed to "Do It Right, the First Time, and Every Time".

WHAT IS ASK YOURSELF?
A. A philosophy that requires every person performing work at AT&T to assess the potential impact of that work
B. A directive advocated by AT&T
C. A means to improve AT&T's network reliability through every AT&T employee, agent and Non-Payroll Worker
D. A methodology that ensures work is executed flawlessly

WHO DOES ASK YOURSELF AFFECT?
Organizations impacted by Ask Yourself include the following (but are not limited to):
* AT&T Integrated Cloud (AIC)
* Entertainment Group
* Digital [remainder illegible]
* DIRECTV Field Services (DFS)
* Domain 2.0 Architecture and Planning
* Hosting Application Services (HSAS)
* Internet and Enterprise Data Centers (IDCs/EDCs)
* Managed Services and Outsourcing (MSO)
* [illegible]
* Security
* Technology Design & Architecture (TDA)
* Technology Development (Tech Dev)
* Technology Operations (Tech Ops)
* [illegible]
* Web Hosting Data Centers

WHY IS ASK YOURSELF IMPORTANT?
It is more important than ever that the Ask Yourself principles are followed! Consider the following:
* Consumers have become more reliant on AT&T's broadband, mobility and entertainment networks
* The technology and network environments are more integrated and competitive than ever before
* AT&T's network is handling more data on an hourly basis than ever before, making each outage more impactful
* Businesses and consumers choose AT&T because of the company's advanced technology, operational excellence and strong reputation for network reliability. Customer network devices managed by AT&T run our customers' most critical business applications, including their web presence, manufacturing, customer ordering and transaction processing. A small error can impact millions of customers, and each outage diminishes AT&T's value in the eyes of customers
* Today's environment has us simultaneously touching more parts of the virtual and physical network than ever before. Heightened awareness, good judgment and sound decisions are critical to AT&T's success

HOW DO I PUT ASK YOURSELF INTO PRACTICE?

THE 10 "ASK YOURSELF" PRINCIPLES

Do I have the proper access and appropriate login credentials for the work I am about to do?
* Ensure you have the appropriate login credentials and access permissions for the work you are about to perform
* Ensure you have worked with the local operations supervisor, Corporate Real Estate (CRE) or Information Resource Management (IRM) to obtain the proper IDs required
* Arrange for hotlines/war room and/or onsite coverage as needed
* Do you have your appropriate company-provided ID?
* Are you displaying the required IDs?

Do I know why I'm doing this work?
* Ensure that you have a full understanding of the work and the reason; it is more than just a directive
* Understand the impact the work has on AT&T's network reliability, products, services, applications, and the customers AT&T supports
* Understand the task to be performed and the proper sequence in which the work is to be performed

If the answer to this principle is No, you do not know why you are doing this work:
* Review the document requesting the work, such as a Change Request (CR), Application Interface Design (AID), Methods of Procedure (MOP), Trouble Ticket (TT), Incident Ticket (INC), or Action Request (AR), etc.
* Check the process documentation and applicable local procedures
* Talk with your Supervisor, Co-Workers, Engineering and SMEs

Have I identified and notified all key stakeholders who will be directly and possibly indirectly affected by this work?
* Notify all organizations that may be impacted, such as Technology Operations, Technology Development, Entertainment Group, Technology Reliability Centers (TRCs), Advanced Technical Support (ATS), Global Technology Operations Center (GTOC), etc.
* Identify other organizations that might be impacted by this work (e.g., Customer Facing Operations, Provisioning, and Product Management)
* Notify internal and external customers of planned change activities

Identify the impacted customers and internal groups by:
* Reviewing the document requesting the work, such as a Change Request (CR), Application Interface Design (AID), Methods of Procedure (MOP), Trouble Ticket (TT), Incident Ticket (INC), or Action Request (AR), etc.
* Reviewing the asset/equipment/component database
* Checking with the originator of the document
* Checking local procedures
* Talking with your Change coordinator/requester, Supervisor, Co-Workers, Engineering, Data Scientist and SMEs

Can I prevent or control service interruption or defect introduction?
* Ensure all appropriate elements/processes are monitored during the work activity
* Pay attention to detail: follow all MOP/Specific Methods of Procedure (SMOP)/AID/Standard Operating Procedure (SOP) and implementation/execution plans as documented
* Possess a clearly understood backout/recovery plan
* Survey the work area and make sure all appropriate safety precautions have been taken

Is this the right time to do this work?
* Ensure scheduled work meets AY Handbook or other maintenance/change window requirements
* Anticipate customer impact of possible network failure and/or issues with their products, services and applications
* Ensure conflicts have been resolved
* Perform a health check prior to commencing work
* Ensure technical support resources are available

Am I trained and qualified to do this work?
* Ensure the training you have received enables you to perform the work you will be doing
* Perform a procedural review of the technical documentation to assure a solid understanding of the work to be performed
* Understand the impacts to the customer associated with this change activity

What if I do not understand the work to be performed or lack the necessary training to complete the task(s)?
* Collaborate with a change coordinator/requester, SME or peer
* Obtain the necessary training
* Talk with your supervisor for assistance and/or to re-assign the work
* Stop the work (if still in doubt, stand down and do not perform the work)

Is all relevant documentation for this work complete, current and error-free?
* Verify you have an approved and current document (e.g., vendor documentation, methods and procedures)
* Read through the documentation at least once, verifying the contents, prior to beginning the work
* Verify that the procedure has been certified in the appropriate environment

Do I have everything I need to quickly and safely restore service if something goes wrong?
* Review the change request. Ensure all required tools and components are available
* Know who to contact in the event something goes wrong
* Have the tools available on the job site that may be required to restore the network, products, services, and applications

Have I walked through the entire process or procedure I am about to follow?
* Complete a walk-through at the start of each shift to determine the work to be performed
* Understand the procedures and your responsibilities
* Ensure the procedure to be performed makes sense (timeline, sequence of steps, completeness, testing, safety, etc.)

Consider the potential consequences of NOT walking through the procedure:
* Interruption to our network, products, services and applications
* Customer dissatisfaction
* Damage to equipment/building
* Hours of work to troubleshoot problems
* Lost revenue and costs of repair
* Personal injury (safety hazards)

Have I verified/validated that I have completed all the steps in the process or procedure, obtained proper closure and notified appropriate technology centers, as applicable?
* Where applicable, contact the technology center and have an understanding of what needs to be done when the work is completed
* Do I have all the tools to close the work?
Do I need access to a system to do this, or do I need to call someone?
* Perform validation or post health checks
* Close out your Change Request (CR), Trouble Ticket (TT), Incident Ticket (INC), Action Request (AR), etc. with the most accurate information in a timely manner

WHAT SHOULD YOU DO IF YOU DON'T KNOW WHY YOU ARE DOING THE WORK?
* Review the document requesting the work (Change Request, MOP, Trouble Ticket, Action Request, etc.)
* Perform the task anyway
[Remaining answer options are illegible in the source.]

Preventing and Correcting Process/Procedural Problems
Applying the Ask Yourself principles may uncover process or procedural problems. Applying Best Practices minimizes impacts, preventable outages and operational errors.
* Each individual has the capability and responsibility to identify process/procedural improvements
* Root cause analysis should be completed to prevent recurrence of the same problems in the future

Global Technology Operations Center Change Management (GTOC-CM)
Change management is an integral component of AT&T's overall network, product, service and application reliability. Promoting a culture of inquiry and leading best practices will minimize network risk, reduce unplanned outages, increase service reliability and ultimately improve quality and service.

The Change Management process is intended to drive accountability, encourage collaboration and provide a framework to carry out carefully planned events, as well as to coordinate and prioritize changes in a controlled manner to prevent or minimize outages.

Risk Factors
Risk Factors reflect the probability of disruption to the business environment based on the technical complexities of the change. Risk Factors are assigned by change coordinators/requesters on a scale of 1 to 5. Risk Factor 5 is the highest risk and 1 is the lowest.
1. BUSINESS AS USUAL = Low or no complexity, no impact to service delivery. Change activities will be transparent.
2. LOW = Low complexity, minimal impact to service delivery.
3. MEDIUM = Complexity, multiple devices, service delivery impacted.
4. HIGH = Complex activities, multiple devices, service delivery impacted, high visibility.
5. EXCEPTIONAL = Complex activities, multiple devices, service delivery impacted, high visibility, extended outage.

Impact is defined as a level of deviation from normal service, with considerations such as the number of users and the service impacted.

WHAT ARE THINGS THAT YOU CAN DO TO ENSURE THE ASK YOURSELF PRINCIPLES ARE FOLLOWED?
* Complete end-to-end validation and assurance testing
* Fully implement training and certification paths as appropriate, including vendor training
* Personal Obligation to Excellence

Lesson 2: Case Studies

Lesson 2 Objectives
After completing this lesson, you will be able to:
* Explain how the Ask Yourself principles could be applied in actual on-the-job scenarios
* Describe how manual/human errors, defects and other issues can be prevented

Incident
The Video Operations Center (VOC) reported power being lost to the Super Hub Office (SHO), resulting in U-verse customers experiencing a loss of local and national channels. In addition, U-verse customers experienced the loss of HSIA (High Speed Internet Access) and CVoIP (Consumer Voice over Internet Protocol) during this event.

After the appropriate service recovery teams were engaged to assist with the trouble isolation, the recovery teams identified that the Uninterruptible Power Supply (UPS) was powered down and not functioning.
During a quarterly fire alarm test, a vendor mistakenly pressed the Emergency Power Off (EPO) button.

After the recovery teams engaged the power vendor to restore power back to the UPS units, the affected network equipment began to recover and service began restoring for most of our customers. However, some equipment did not recover from the power down/power up process.

The recovery teams worked to restore the remaining channels by rebooting equipment, rerouting traffic and replacing hardware packs. In addition, the Customer Care teams verified the restoration of U-verse service in the VHO market as well as national channels for all markets.

Root Cause
A vendor was performing an annual fire alarm test at the SHO and activated the Emergency Power Off (EPO) button in error. The vendor mistakenly pushed the EPO button instead of pulling down the fire alarm pull station. Both devices are located near one another and have similar clear protective covers to avoid inadvertent activation.

Prevention
The following would have contributed to preventing this incident:
* The vendor should have been familiar with the layout ahead of time and performed a walk-through of the procedure to ensure the correct procedural steps were taken.

Incident
An outside contracting company, Tower Construction, was performing work to replace the old 24 Volts Direct Current (VDC) power plant at the Microwave hub site when an outage occurred. Tower Construction performed the work contracted by AT&T C&E Implementation.

The power plant swap-out was performed while running a temporary power plant in parallel in order to avoid any downtime.

Following is the timeline of events that occurred:
* At 02:00 MST, the contractor turned power over to the new plant
* At 02:01 MST, an alarm was generated for this incident
* At 02:52 MST, MTRC RAN Tier discovered the outage and started investigating
* At 03:15 MST, Tower completed the work and the National Field Support Desk (NFSD) team logged them out
* At 03:23 MST, MTRC Triage was notified

Root Cause
The root cause for this event has been determined to be an operational error. It was found that during the power plant installation at one of the Microwave hub sites, power was interrupted to one piece of Microwave equipment, causing the Microwave to go into a hung state. The Methods of Procedure (MOP) that was provided for this work did not contain step-by-step procedures, a back-out plan or verification checks at each critical point. A Generic MOP (GMOP) was used for this activity and a Specific MOP (SMOP) had not been developed.

Prevention
The following would have contributed to preventing this incident:
* A specific step-by-step MOP (SMOP) should have been developed.
* As the technician was reviewing the GMOP prior to work, he should have contacted an SME to develop an SMOP.

Incident
The Video Operations Center (VOC) received a call from Blackout Quality Assurance (BOQA) stating that incorrect content was being displayed on the Set Top Box (STB) for national channels. The channel was broadcasting Women's Softball instead of Major League Baseball.
* The Traffic team was called in and determined there was incorrect equipment listed on the Service Change Notification (SCN). Traffic made changes. This corrected the channel for customers' STBs on the Standard Definition (SD) side only. The High Definition (HD) encoder format had to be corrected first, and then the HD channel started to re-stream.

Root Cause
The root cause for this event was determined to be an operational error. The Change Management request contained incorrect equipment and the error caused incorrect programming for a customer.

Prevention
The following would have contributed to preventing this incident:
* The technician should have walked through the procedure and would have found the error in the Change Request.
* The technician was not sufficiently trained, did not walk through the procedure and did not know why the work was being done.

Incident
AT&T implemented a planned load-balancing maintenance Cloud change activity to add three nodes to a database server, in order to level out performance across more nodes and increase capacity.

Incident Management received alerts indicating all BlackFlag APIs were unavailable across all markets. PADS (Cloud Platform Application and Data Layer Support) advised that the issue occurred due to planned maintenance work done to add three nodes under an approved change request.

PADS redirected DynectDNS from the newly added site to the original site at 00:17 CT, thus mitigating impact to run-time/design-time transactions as of that time while continuing to troubleshoot at the affected site.

The new cloud ring was restarted in an effort to bring the site back to normal. A restart of the newly added nodes did not resolve the trouble. A backout of the implementation on these nodes was successfully performed, and the original configuration of three nodes was validated and performing as expected.

Root Cause
The root cause was a version incompatibility between the existing Cassandra database and the three new nodes.

Prevention
The following would have contributed to preventing this incident:
* The MOP should have contained instructions to upgrade the version on the three new nodes prior to connecting to the existing database.
* During the planning of the change, both environments should have been verified to have the same software version.

Incident
AT&T was performing an upgrade on Storage Arrays located at two Data Centers. The maintenance involved upgrading the operating system concurrently on 36 storage units in the data centers, based on disaster recovery and replication requirements.

A change request was created to document the change activity and the implementation plans were included in the MOP.
Maintenance work was performed as part of an ongoing project to upgrade the storage units, and traffic was to be shifted from each device for the duration of the work to avoid customer impact.

The maintenance activities began at 2 am, local data center time, with a 6-hour duration.

During change execution, the implementer noticed error messages that stopped the upgrade. The implementer started the troubleshooting process but was unable to determine the cause of the error messages.

The implementer followed the back-out plan documented in the MOP to successfully back out the change and restore the environment to normal.

Root Cause
The root cause for this event was a known configuration issue that required the MOP to be updated to include steps for pre-work server remediation verification on the audited servers. This known issue from the Audit team was not addressed prior to the start of the OnTap upgrade change.

Prevention
The following activities would all have contributed to preventing this incident:
* Validation that the MOP was updated to include verification steps to ensure that pre-work server remediation from the audit was completed
* Solid process steps for required pre- and post-implementation validation in the MOP
* Checks to ensure that the implementation plan (i.e., MOP) is current, complete and approved for usage prior to implementing the change activity

Case Study 7: Global Customer Service

Incident
One of AT&T's customers is an international airline. AT&T manages the airline's data network, including WAN, LAN, load balancing and firewall.

During the business day, an AT&T WAN engineer worked on a core customer router pair in preparation for a future critical change window. There was a lot of preliminary work that needed to be done for the future window, and the engineer decided to make a small, non-service-disruptive alteration to prepare the router configuration for the change.
The engineer completed the change in each of the router pairs and logged out of the routers.

Customer transactions to the airline's web site began to fail intermittently, and shortly the failures became severe. Airline employee access to critical internal applications such as reservations and check-ins also began to slow down and hang, creating a worldwide backup of boarding and flights.

Root Cause
A severity one trouble ticket was opened by the customer with AT&T. The AT&T technical support team determined that a change had recently been made to the Access Control Lists (ACLs) in each of the routers, and that a critical statement was missing in one of the router configurations. They backed the changes out. The customer needed to reboot multiple critical servers and applications, which had hung due to the traffic overload and transaction drop.

While the intent of the engineer was to prepare for a future change, no change to a customer network device should be performed without following the Ask Yourself checklist. Every change, no matter how simple, should be considered to have potential customer service impact. If the Ask Yourself principles are not followed, the outcome can be much more impactful to the client than if time is taken to do it right.

Lesson 3: Course Summary

Live by the 10 Ask Yourself Questions
Remember to strictly adhere to the ten Ask Yourself Questions in the workplace. Would you like a quick refresher? Download a printable wallet-card version of the AY questions.

Almost Done!
During this training, you have learned:
* Ask Yourself is a directive that requires every person performing work at AT&T to assess the potential impact before beginning a task
* It is vitally important that everyone who has an effect on AT&T's technology capabilities provides flawless work execution and prevents problems before they occur
* Today's environment has AT&T representatives simultaneously touching more parts of the virtual and physical network than ever before. Heightened awareness, good judgment and sound decisions are critical to AT&T's success
* The definitive guide to the Ask Yourself methodology is the Ask Yourself Handbook, which is located on APE®
* There are 10 Ask Yourself principles, which are presented in the form of questions. Each Ask Yourself question should be confidently answered with a "Yes" before undertaking any task related to changing AT&T's network, products, services and applications
* Process and procedure problems can be prevented by following the Ask Yourself principles and MOP Best Practices
* If problems DO occur, proper change management steps should be followed to fix the issue, and root cause analysis should be completed to prevent recurrence of the same problems in the future

Root Cause
The root cause for this incident was the existence of undocumented dependencies between the two servers and their MC applications in the asset database. Although this dependency was in a document attached to the implementation plan, the implementer did not read it, nor did he notify the impacted application end-users that their applications would be unavailable during the change window.
Prevention
The following would have contributed to preventing this incident:
* If the asset database had been up to date and contained the dependency information these two servers had on one another, the implementation plan would have contained the proper steps to reboot both servers.
* If the asset database had contained the dependency information, the application end-users on Server B would have been notified that the applications would be unavailable during the change window and to NOT attempt to process orders during this time.
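
The dependency check described in this case study can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an AT&T tool: the asset records, the server names (server-a, server-b, server-c), the application names and the function name impacted_by_change are all hypothetical. It shows how an up-to-date dependency record lets a planner list every server to include in the implementation plan and every application whose end-users need notification.

```python
from collections import deque

# Hypothetical asset database: each server lists the servers that depend
# on it and the applications it hosts. All names are illustrative only.
ASSET_DB = {
    "server-a": {"dependents": ["server-b"], "applications": ["order-entry-db"]},
    "server-b": {"dependents": [], "applications": ["order-processing"]},
    "server-c": {"dependents": [], "applications": ["reporting"]},
}


def impacted_by_change(target, asset_db):
    """Return (servers, applications) affected by a change to `target`.

    Walks the recorded dependencies breadth-first so that indirectly
    dependent servers are included as well.
    """
    seen = [target]
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for dependent in asset_db[current]["dependents"]:
            if dependent not in seen:
                seen.append(dependent)
                queue.append(dependent)
    applications = [app for server in seen for app in asset_db[server]["applications"]]
    return seen, applications


servers, applications = impacted_by_change("server-a", ASSET_DB)
print("Servers to include in the implementation plan:", servers)
print("Applications whose end-users must be notified:", applications)
```

The lookup is only as good as the recorded data. If the dependency between the two servers is missing from the database, as in this incident, the second server and its end-users never appear in the result.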
