6 STEPS FOR HACKING POST MORTEM REPORTING


"Blameless post-mortems allow us to examine mistakes in a way that focuses on the situational aspects of a failure's mechanism and the decision-making process of individuals proximate to the failure." [1]

WHAT IS A POST MORTEM

The engineers at Google describe a postmortem as a "written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause, and the follow-up actions to prevent the incident from recurring." [2] While the definition of a post mortem makes it sound like a straightforward process, the simplicity can belie some important technical and managerial details that must be handled correctly if the exercise is to be an effective one. The goal of this whitepaper is to provide suggestions on the types of tools and frameworks that need to be introduced in order for IT, Ops or ITSM to institute an effective post mortem culture that is focused on results.

TO THIS END, WE WILL LOOK AT THE FOLLOWING POINTS:
• Why are post mortems necessary
• What do post mortems allow us to achieve
• How can we implement an effective post mortem

[1] The DevOps Handbook, 2016, p. 274
[2] https://landing.google.com/sre/book/chapters/postmortem-culture.html

WHY ARE POST MORTEMS NECESSARY?

Post mortems are necessary as they give us insight into why an incident happened. They allow us to deconstruct a particular incident and see what transpired after the critical event and how that can be improved in the future. Without an effective framework for post mortems, the underlying problem is never really resolved. It's like the definition of insanity, which is described as doing the same thing over and over again and hoping for a different outcome.

Additionally, post mortems are important for DevOps and ITSM professionals as they allow these groups to see what worked, what didn't work and how the team can get better.

Indeed, post mortems are important as they are effective tools for managing the team's SLAs. Was the problem due to a scheduled or unscheduled incident? When the Sev1 incident occurred, was the right team notified? If the team was notified, did they actually hear the alert or did the alert just go off as a ping on their smartphone? Even if your product is sold to other businesses and not to customers, you still have SLAs on keeping the product up and running. This is even truer if you are a cloud-based service. For example, knowing that you have a five nines level SLA (99.999% availability, or roughly five minutes of downtime per year), you know you cannot afford much downtime. With a post mortem, you can see how an incident affected your SLA with your customers.

Post mortems also allow you to more specifically manage MTTA (mean time to acknowledgement) and MTTR (mean time to resolution). These are usually the terms that teams manage as they represent the metrics most tied to resolution effectiveness. Indeed, the greatest contributor to how long it takes for an issue to be resolved is how long it takes until the issue is acknowledged. So, when a service interruption was identified by the monitoring tool, was the incident only sent to one team member, who then in turn needed to identify a number of other team members, which slowed down the time until team members could respond? Alternatively, were all team members alerted when the incident occurred, such that no one knew who was going to respond to the alert? This result is equally problematic as there is always the feeling that some other team member can take care of the issue. In these two cases, as well as many others, the problems come to light in the course of an effective post mortem.

WHAT DO POST MORTEMS ACHIEVE?

Above all, you should make sure your discussions lead to actual change. [3] Post mortems, when carried out correctly, can achieve a whole lot that advances the team in the direction of further progress and IT knowledge. Effective post mortems are not meant to be blame games or cheap talk. Instead, they are meant as management tools to improve the effectiveness of the team. Post mortems are designed to break down sacred cows and reveal points of truth that might not have been previously recognized.

[3] https://zapier.com/blog/project-retrospective-postmortem/

HERE ARE THE 6 STEPS TO HACK POST MORTEM REPORTING

Learning from mistakes is something that's often quite difficult to do. Without a framework to help you do it consistently, it can be haphazard and important details can be overlooked or forgotten. [4] Post mortems are both necessary and important to effective incident management as they bring to the surface how effective your team is at managing critical events.

[4] https://blog.pusher.com/dont-repeat-your-mistakes-conducting-post-mortems/

HACK 1: BRING IN KEY TECHNOLOGIES AND TAKE ADVANTAGE OF THEIR AVAILABILITY

As post mortems have become an important part of IT and DevOps culture, it is important to consider beforehand what technologies team members will need to enable effective post mortems. As opposed to a discussion on monitoring tools, this sort of conversation instead looks into what alerting tools, chat tools, ticketing tools and reporting tools the team will need to remain in contact during an incident.

ALERTING TOOLS will indicate when an incident arrived to the engineering team and who responded to it.

CHAT TOOLS like HipChat or Slack are where engineers conduct business. These tools are where work gets done. Importantly, these tools also have a time stamp that will allow concerned parties to see what happened and when.

TICKETING TOOLS like Jira or ServiceNow are also time stamped, and they become the record of when incidents took place, such as when an incident occurred on the server, as well as any back and forth that occurred during the incident's resolution.

All three of these instruments create time-stamped records, which are critical for post mortems to run effectively.

A REPORTING TOOL that enables managers and stakeholders to review the workloads and busyness of various teams will provide insight into why teams might be less effective than they ideally should be. This could be because a particular team is overloaded with alerts and as a result cannot answer all the alerts they are receiving. Alternatively, a particular faulty piece of infrastructure could be producing an outsized number of alerts that keeps the team unable to respond to other issues.

HACK 2: ENABLE POST MORTEMS AS SOON AFTER THE EVENT AS POSSIBLE

Memories are shaky. So it is best to enable the post mortem as soon after the event as possible. Team leaders need to be rigorous about recording details and sharing information.

HACK 3: BRING IN INSIGHTS OF TEAM

Make sure the relevant stakeholders and participants are at the post mortem meeting. These stakeholders are people who might have contributed to the problem. In addition, you will want to include any people who responded to the problem as well as people who diagnosed the problem. Don't forget to include an invitation to a representative of the group affected by the problem. By bringing in this robust group, you will ensure that the parties at the table are the ones who can identify the relevant issues and help resolve them.

HACK 4: CREATE A TIMELINE

If you don't have things written down, it can be hard to follow up on action items. [5] The first point of action of the post mortem meeting should be to look at the timeline of events. As you were smart and invested in an incident alert management system, a communications management platform, a ticket management platform and a reporting platform, you have all the relevant data you need to view the order in which the events unfolded. Was the incident acknowledged, ignored, forwarded or escalated? Also, how long did it take until the incident was acknowledged? In a robust alert management platform (like OnPage) all this information is captured. The first three tools allow you to see what happened in a step by step manner. With the reporting capabilities, you will be able to see aggregate data that provides context to the timeline.

[5] https://zapier.com/blog/project-retrospective-postmortem/

HACK 5: CREATE A FINAL DIGITAL RECORD

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. [6] It is important to share this information and make it easily available. Post mortems need to be published as widely as possible. Google Drive is a good place to post this information. You need to educate other members of the team as to why the event occurred and commit to changes that will prevent the event from happening again in the future.

[6] https://codeascraft.com/2012/05/22/blameless-postmortems/

HACK 6: ENABLE POST MORTEMS FOR SUCCESSFUL EVENTS AS WELL

While this whitepaper has primarily focused on creating post mortems after critical Sev1 or Sev2 incidents, it is equally important to create post mortems on successful events as well. Whereas post mortems are used to identify things that went wrong and why, post mortems for successful events can highlight what went right and why, which is as valuable. Successful projects, on the other hand, are still rife with errors, but have a positive context, which helps people relax when addressing issues. They are also likely to have best practices and novel ideas. Additionally, in the case of successful events, post mortems should also be used to identify what could have been done to make the outcome even more desirable, inefficiencies, near misses etc. [7]

[7] https://www.linkedin.com/pulse/20141001093119-69047-5-tips-on-running-effective-postmortems

CONCLUSION

Effective post mortems are equal parts technology and management. The technology your team brings on needs to be able to keep track of and log the events that took place from the time the event began until the incident was resolved. On the management side, there need to be processes in place from the acknowledgement of the event to setting up the meeting to bringing in the relevant stakeholders. By combining these components, post mortems stand a chance of being successful.

In the end though, effective post mortems are a process and take time to perfect. Practice will dictate which team members are most effective at providing perspective and insight. Time will also allow teams to determine which technologies take the most time to manage when problems arise.
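As a rough illustration of how the time-stamped records described in Hack 1 and Hack 4 can feed a post mortem, the Python sketch below merges events from several tools into one timeline, derives the time to acknowledge and the time to resolve for a single incident, and works out how much downtime a five nines SLA leaves per year. The Event fields, the function names and the sample data are all hypothetical and chosen only for illustration; they are not the export format of OnPage, Jira, Slack or any other product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One time-stamped record exported from a tool (hypothetical shape)."""
    at: datetime
    source: str       # e.g. "alerting", "chat", "ticketing"
    kind: str         # e.g. "triggered", "acknowledged", "resolved", "note"
    detail: str = ""


def build_timeline(*feeds):
    """Merge per-tool event lists into one list ordered by time stamp."""
    merged = [event for feed in feeds for event in feed]
    return sorted(merged, key=lambda event: event.at)


def first_time(timeline, kind):
    """Return the time stamp of the first event of the given kind, or None."""
    for event in timeline:
        if event.kind == kind:
            return event.at
    return None


def incident_durations(timeline):
    """Return (time to acknowledge, time to resolve) for one incident."""
    triggered = first_time(timeline, "triggered")
    acknowledged = first_time(timeline, "acknowledged")
    resolved = first_time(timeline, "resolved")
    if triggered is None or acknowledged is None or resolved is None:
        raise ValueError("timeline needs triggered, acknowledged and resolved events")
    return acknowledged - triggered, resolved - triggered


def mean_duration(durations):
    """Average a non-empty list of timedeltas (MTTA or MTTR across incidents)."""
    return sum(durations, timedelta()) / len(durations)


def downtime_budget(availability, period=timedelta(days=365)):
    """Downtime allowed by an availability target over the given period."""
    return period * (1 - availability)


# Made-up sample data for a single Sev1 incident.
alerts = [
    Event(datetime(2017, 1, 10, 2, 0), "alerting", "triggered", "Sev1 alert sent"),
    Event(datetime(2017, 1, 10, 2, 12), "alerting", "acknowledged", "on-call engineer responded"),
]
tickets = [
    Event(datetime(2017, 1, 10, 2, 15), "ticketing", "note", "ticket opened"),
    Event(datetime(2017, 1, 10, 3, 5), "ticketing", "resolved", "service restored"),
]

timeline = build_timeline(tickets, alerts)
time_to_ack, time_to_resolve = incident_durations(timeline)

for event in timeline:
    print(event.at.isoformat(), event.source, event.kind, event.detail)
print("Time to acknowledge:", time_to_ack)
print("Time to resolve:", time_to_resolve)
print("Yearly downtime allowed at five nines:", downtime_budget(0.99999))
```

Averaging the per-incident values with mean_duration over every incident in a reporting period gives MTTA and MTTR figures, and comparing an incident's downtime against downtime_budget shows how much of the SLA allowance it consumed.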

The important point, though, is to start practicing post mortems, as they are key to the continued growth of the company and its leaders.

ABOUT ONPAGE

OnPage is a cloud-based, industry leading smartphone application for high-priority, real enterprise messaging. OnPage provides critical alerts to Managed Service Providers based on notifications from an RMM or PSA system for faster incident resolution. Using OnPage you get instant visibility and feedback on alerts. As part of your IT service management, you can track alert delivery, ticket status, and responses. As a result, you will improve MTTR and better manage your clients' ecosystem by decreasing service interruptions. As an organization, you will improve responsiveness to SLAs and lower your and your clients' costs.

TO LEARN MORE, VISIT OUR WEBSITE OR CALL:
ONPAGE.COM/CONTACT-US
781-916-0040

Visit iTunes or Google Play from your smartphone or tablet to download the OnPage app.