The ITSM Review 2014: Availability, Incident and Problem Management


Availability, Incident and Problem Management: The New Holy Trinity? (part 1)
http://www.theitsmreview.com/2014/02/availability-incident-problem-management-holy-trinity-part1/

Published by Vawns Murphy on Feb 4th, 2014 // 2 Comments

So here's the thing. We all know that incident and problem management, if working well, can
reduce interruptions to the end user and improve service quality for the business. From an end
user's perspective though, availability is the name of the game. While most organisations have
the basics covered with incident management, how many use problem & availability
management to look at the underlying cause of incidents at a service as well as a component
level?
Working together effectively, availability, incident & problem management can improve both
quality of service and the business perception of IT. Getting back to basics, incident management
is a purely reactive process. We sort things out so that the business can carry on as usual.
Problem management is both reactive and proactive. We look at what went wrong but also how
to stop it from happening again. Availability management looks at all availability issues at both a
component & service level, and ensures that we consider availability at the point of service design as
well as monitoring uptime during normal operations.
When describing the three processes, I call incident management the superheroes of ITIL. They
save the world several times a day, fighting fires and making people smile. Problem management
are detectives. They get to the root cause and sort it out to stop the same issues from recurring.
Availability management are the scientists of the ITIL world. Like the guys from The Big Bang
Theory, they design the service to keep it up & running as much as possible based on user
requirements.
Today, IT service issues are constantly in the news. With the advent of social media, news of
service downtime can be spread globally in minutes, which is kind of embarrassing, especially if you are
a highly visible entity such as a bank or government department. Putting aside the
embarrassment factor for a minute, what about financial implications such as fines or service
credits? Or regulatory impact, such as failing to comply with any standards mandated by your
management? Let's not forget the angry mob waiting outside to make their dissatisfaction known
if downtime is an own goal such as a poorly managed change. With this in mind, I've put
together some tips on how to use availability, incident and problem management to maximise
service effectiveness, with this article covering the first three of ten.

Tip 1: Getting your facts straight


Have separate records for availability, incident & problem management. Incident management
records ("fix it quick") should focus on getting the user details and a full description of the issue.
Some of the information captured by incident records could include:

When managing an incident, different support teams may need different views, e.g.:

Networks team: by location
Service desk: by customer satisfaction
Desktop support: by hardware
Development: by software application
Capacity management: by resource usage
Service delivery managers: by business impact
Change management: by date / time to compare with the change schedule

Problem management records focus on establishing the root cause and actions to prevent
recurrence. Problem records can contain the following information:

Availability records should look at planning for the appropriate level of availability and ensuring
that availability & recovery criteria are considered when designing new services. Your
availability plan should contain the following information:

Tip 2: Identify roles & responsibilities


Be organised so there's no duplication or wasted effort. In short, the incident manager is
concerned with speed, the problem manager is concerned with investigation and diagnosis and
the availability manager is concerned with the end-to-end service.
Key priorities for the incident manager will include co-ordinating the incident, managing
communications with both technical support teams and business customers, and ensuring that the
issue is fixed ASAP.
The problem manager will focus on root cause investigation, trending (has this issue popped up
before?), finding a fix (interim workarounds and permanent resolution) and ensuring that any
lessons learned are documented & acted on.
The availability manager will look at ensuring the service is designed with the appropriate levels
of availability, working with service operations to tackle issues at both a service and component
level and using the expanded incident lifecycle to look at trends and how the service can be
improved.

Tip 3: Keeping up to date


It's really important to keep an eye on the BAU, as seemingly small incidents can spiral out of
control and have a negative effect on availability levels and customer satisfaction. Simple things
can make a big difference, for example placing a white board near the service desk with a list of
the top ten problems so that it's easy for service desk analysts to link incidents to problems and
trends can be identified later on. If the service desk have a team meeting, ask to attend and
update them on any new problems as well as updates and workarounds on existing problems.
Don't forget to close the loop and let the service desk know when a problem record has been
fixed and closed off. There's nothing worse for a service desk than having to call a list of customers
about an issue that was sorted out months ago!
Get proactive! Work as a team to view service availability throughout the month. Have a
process to automatically raise a new proactive problem record if availability targets are
threatened so that things can be done to prevent further issues. Don't just sit there waiting to fail
the SLA!
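
To make the "availability targets are threatened" check a bit more concrete, here is a minimal Python sketch. It assumes you can get the agreed service minutes and the downtime minutes recorded so far for the period from your own tooling. The function name, the field names and the 75% warning threshold are all hypothetical choices for illustration, not part of any ITIL guidance or product.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityCheck:
    availability_pct: float      # availability achieved so far this period
    remaining_budget_min: float  # downtime still allowed before the target is missed
    raise_problem: bool          # True when a proactive problem record looks warranted


def check_availability(agreed_minutes: float, downtime_minutes: float,
                       target_pct: float, warn_fraction: float = 0.75) -> AvailabilityCheck:
    """Flag a service whose downtime budget for the period is nearly used up.

    agreed_minutes   -- total agreed service time in the reporting period
    downtime_minutes -- downtime recorded so far in the period
    target_pct       -- availability target, e.g. 99.5
    warn_fraction    -- hypothetical threshold: share of the budget that triggers action
    """
    # Downtime the target allows over the whole period.
    budget = agreed_minutes * (100.0 - target_pct) / 100.0
    availability = 100.0 * (agreed_minutes - downtime_minutes) / agreed_minutes
    threatened = downtime_minutes >= warn_fraction * budget
    return AvailabilityCheck(availability, budget - downtime_minutes, threatened)
```

For example, a month of 22 working days at 8 hours a day gives 10,560 agreed minutes. With a 99.5% target, that is a downtime budget of 52.8 minutes, so `check_availability(10560, 45, 99.5)` would flag the service because 45 minutes is already more than three quarters of the budget. What happens next (who gets the record, how it is raised) depends on your own toolset.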

In part two, I will continue with a further seven tips on how to use availability, incident and
problem management to maximise service effectiveness.

Vawns Murphy
Irish mum of 3. ITIL V2 Manager (red badge) and ITIL V3 Expert (purple badge). SDI
Managers certificate. Further qualifications in COBIT, ISO 20000, SAM, PRINCE2 and
Microsoft. Author of itSMF UK collateral on Service Transition, Software Asset Management,
Problem Management & the "How to do CCRM" book. Reviewer for the Service Transition
ITIL 3 2011 publication. When not being pelted with brightly coloured balls in the name of ITIL, I
am a senior ITSM analyst for Enterprise Opinions.

Tags: availability management, incident management, itil, itsm, problem management, service
desk, Vawns Guest

Patric Hogan
Hi Vawns, as a problem manager I couldn't agree more with your article. Availability is
experienced directly by the customer and they will perceive that as service quality. At this
point, no matter how quickly we restore service or how detailed the Root Cause Analysis,
the perception is a much harder challenge to turn around.
It's very easy to look at servers, helpdesk hardware and network firewalls and
lose that connection to the end user. I work with a large number of Scrum
Masters and product owners and have been working to balance the drive for
new functionality against reliability/availability.
I'm looking forward to the next instalment.


Availability, Incident and Problem Management: The New Holy Trinity? (part 2)
Published by Vawns Murphy on Mar 27th, 2014 // No Comment

http://www.theitsmreview.com/2014/03/availability-incident-problem-management-holy-trinitypart-2/

Following on from part one, here are my next seven tips on how to use availability, incident
and problem management to maximise service effectiveness.

Tip 4: If you can't measure it, you can't manage it


Ensure that your metrics map all the way back to your process goals via KPIs and CSFs so that
when you measure service performance you get clear tangible results rather than a confused set
of metrics that no one ever reads let alone takes into account when reviewing operational
performance. In simple terms, your service measurements should have a defined flow like the
following:

Start with a mission statement so that you have a very clearly defined goal. An example could be
something like "to monitor, manage and restore our production environment effectively,
efficiently & safely".
Next come your critical success factors or CSFs. CSFs are the next level down in your reporting
hierarchy. They take the information held in the goal statement and break it down into
manageable chunks. Example CSFs could be:

To monitor our production environment effectively, efficiently & safely
To manage our production environment effectively, efficiently & safely
To restore our production environment effectively, efficiently & safely

KPIs or key performance indicators are the next step. KPIs provide the level of granularity
needed so that you know you are hitting your CSFs. Some example KPIs could be:

Over 97% of our production environment is monitored
98% of all alerts are responded to within 5 minutes
Over 95% of calls to the service desk are answered within 10 seconds
Service A achieves an availability of 99.5% during 9 to 5, Monday to Friday

Ensure that your metrics, KPIs & CSFs map all the way back to your mission statement &
process goals so that when you measure service performance you get clear tangible results. If
your metrics are linked in a logical fashion, then when your performance goes to amber during the month
(e.g. threat of service level breach) you can look at your KPIs and come up with an improvement
plan. This will also help you move towards a balanced scorecard model as your process matures.
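
As a rough illustration of how KPIs can roll up to CSFs and drive a red / amber / green view, here is a small Python sketch. The `Kpi` class, the one-point amber margin and the "a CSF is as bad as its worst KPI" rule are my own assumptions for the example, and a target such as "over 97%" is simplified to "at least 97%". Your own thresholds will differ.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    name: str
    csf: str          # the critical success factor this KPI supports
    target: float     # target value, in percent
    actual: float     # measured value, in percent

    def status(self, amber_margin: float = 1.0) -> str:
        """Return 'green', 'amber' or 'red'; amber_margin is a hypothetical tolerance."""
        if self.actual >= self.target:
            return "green"
        if self.actual >= self.target - amber_margin:
            return "amber"
        return "red"


def csf_status(kpis: list[Kpi]) -> dict[str, str]:
    """Roll KPI statuses up to their CSF: a CSF is as bad as its worst KPI."""
    order = {"green": 0, "amber": 1, "red": 2}
    rolled: dict[str, str] = {}
    for kpi in kpis:
        current = rolled.get(kpi.csf, "green")
        # Keep whichever status is worse according to the ordering above.
        rolled[kpi.csf] = max(current, kpi.status(), key=order.__getitem__)
    return rolled
```

Feeding in the example KPIs above with some made-up actuals (say 98.2% monitored against a 97% target, or 97.4% of alerts answered in time against 98%) gives one status per CSF, which is a reasonable starting point for an improvement plan when something turns amber.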

Tip 5: Attend CAB!


Availability, incident and problem managers should be key and vocal members of the CAB.
70%-80% of incidents can be traced to poorly implemented changes.
Problem management should have a regular agenda item to report on problems encountered,
especially where these are caused by changes. Incident management should also attend so that if
a planned change does go wrong, they are aware and can respond quickly & effectively. In a very
real sense, being forewarned is forearmed, so if a high risk change has been authorised, having
that information can help the service desk manager to plan ahead, for example by having extra
analysts on shift the morning of a major release.
Start to show the effects of poorly planned and designed change with downtime information to
alter the mind-sets of implementation teams. If people see the consequences of poor planning or of not
following the agreed plan, there is a greater incentive to learn from them. By prompting teams
to think about quality, change execution will improve, there will be a reduction in related
incidents and problems, and availability will improve.

Tip 6: Link your information


You must be able to link your information. Working in your own little bubble no longer works;
you need to engage with other teams to add value. The best example of this is linking incidents
to problem records to identify trends, but it doesn't stop there. The next step is to look at the
trends and look at how they can be fixed. This could be reactive, e.g. raising a change record to
replace a piece of server hardware which has resulted in downtime. It could also be proactive, for
example: we launched service A and experienced X, Y and Z faults which caused a hit to our
availability; we're now launching service B, so what can we do to make sure we don't make the
same mistakes? Different hardware? More resilience? Using the cloud?
You need to have control over the quality of the information that can be entered. Out of date
information is harmful so make sure that validation checks are built into your process. One way
to do this is to do a deep dive into your incident information. Look at the details to ensure a
common theme exists and that it is linked to the correct problem record.
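
The linking and deep-dive ideas in this tip can be sketched in a few lines of Python. This assumes each incident can be exported as a pair of incident ID and linked problem ID (or nothing if it has not been linked yet). The IDs and function names are made up for illustration and do not reflect any particular service desk tool.

```python
from collections import Counter
from typing import Optional

# Each incident is (incident_id, linked_problem_id or None).
# This shape is an assumption for the example, not a real tool's export format.
Incident = tuple[str, Optional[str]]


def problem_trends(incidents: list[Incident], top_n: int = 10) -> list[tuple[str, int]]:
    """Count incidents per linked problem record, most frequent first."""
    counts = Counter(problem for _, problem in incidents if problem is not None)
    return counts.most_common(top_n)


def unlinked(incidents: list[Incident]) -> list[str]:
    """Incidents with no problem link: candidates for the deep-dive review."""
    return [incident_id for incident_id, problem in incidents if problem is None]
```

The output of `problem_trends` is effectively the "top ten problems" list, while `unlinked` gives you a queue of incidents to review by hand. Checking that a linked incident really shares a common theme with its problem record is still a job for a person.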
Your information needs to be accessible and easy to read. Your audience sees Google and their
expectation is that all search engines work in the same way.
Talk to people! Ask relationship and service delivery managers what keeps them awake at night,
and if there is no problem record or SIP then raise one. Ask technical teams what their top
ten tech concerns are. I've said it before and I'll say it again: forewarned is forearmed. If you know
there's an issue or potential for risk you can do something about it, or escalate to the manager or
team that can. Ask the customer if there is anything they are worried about. Is there a critical
product launch due? Are the auditors coming? This is where you can be proactive and limit risk,
for example by working with change management to implement a change freeze.

Tip 7: Getting the right balance of proactive and reactive activities


It's important to look at both the proactive and reactive sides of the coin and get a balance
between the two. If you focus on reactive activities only, you never fix the root cause or make it
better; you'll just keep putting out the same fires. If you focus on proactive activities only, you
will lose focus on the BAU and your service quality could spiral out of control.
Proactive actions could include building new services with availability in mind, working with
problem management to identify trends and ensuring that high availability systems have the
appropriate maintenance (e.g. regular patches, reboots, agreed release schedules). Other activities
could include identifying VBFs (more on that later) and SPOFs (single points of failure).
Reactive activities could include working with incident management to analyse service uptime /
downtime in more granularity with the expanded incident lifecycle and acting on lessons learned
from previous failures.
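
To show what "more granularity" can look like, here is a minimal Python sketch that splits a single incident's downtime into the phases usually associated with the expanded incident lifecycle: detection, diagnosis, repair, recovery and restoration. The timestamp names are hypothetical, and it assumes your incident tool records a time for each milestone.

```python
from datetime import datetime

# (phase name, start milestone, end milestone).
# The milestone names are assumptions for this example.
PHASES = [
    ("detect", "occurred", "detected"),
    ("diagnose", "detected", "diagnosed"),
    ("repair", "diagnosed", "repaired"),
    ("recover", "repaired", "recovered"),
    ("restore", "recovered", "restored"),
]


def phase_minutes(timestamps: dict[str, datetime]) -> dict[str, float]:
    """Split one incident's downtime into minutes spent in each phase."""
    return {
        name: (timestamps[end] - timestamps[start]).total_seconds() / 60.0
        for name, start, end in PHASES
    }
```

Adding up the phases gives the total downtime for the incident. Comparing the phases across incidents shows where the time actually goes, for example slow detection versus slow repair, which is the kind of trend problem and availability management can act on.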

Tip 8: Know your VBFs


No, not your very best friends, your vital business functions! Talk to your customers and ask
them what they consider to be critical. Don't assume. That sparkling new CRM system may be
sat in the corner gathering dust. That spreadsheet on the other hand, built on an ancient version
of Excel with tens of nested tables and lots of macros, could be a critical business tool for
capturing customer information. Go out and talk to people. Use your service catalogue. Once you
have a list of things you must protect at all costs you can work through the list and mitigate risk.

Tip 9: Know how to handle downtime


No more hiding under your desk or running screaming from the building! With the best will in
the world, things will go wrong, so plan accordingly. The ITIL service design book makes the point that
when services fail, it is still possible to achieve business, customer & user
satisfaction and recognition: the way a service provider acts in a failure situation has a major
influence on customer & user perception & expectation.
Have a plan for when downtime strikes. Page 1 should have "Don't Panic" written in bright, bold
text. It sounds obvious, but it's amazing how many people panic and freeze in the event of a crisis.
Work with incident and problem management to come up with the criteria for a major incident
that works for your organisation. Build the process and document everything, even the blindingly
obvious (because you can't teach common sense). Agree in advance who will coordinate the fix
effort (probably incident management) and who will investigate the root cause (problem
management). Link in to your IT service continuity management process. When does an incident
become so bad that we need to invoke DR? Have we got the criteria documented? Who makes
the call? Who is their back up in case they're on holiday or off sick? Speak to capacity
management, as they look at performance. At what point could a performance issue become so
bad that the system becomes unusable? Does that count as downtime? Who investigates further?

Tip 10: Keep calm and carry on


Your availability, incident and problem management processes will improve and mature over
time. Use any initial quick wins to demonstrate the value add and get more buy in. As service
levels improve, your processes will gather momentum, as it's human nature to want to jump on the
bandwagon if something is a storming success.
As your process matures, you can look to other standards and frameworks. Agile and lean can be
used to make efficiency savings. COBIT can be used to help you gauge process maturity as well
as for practical guidance on getting to the next level. PRINCE2 can help with project planning and
timescales. You can also review your metrics to reflect greater process maturity, for example you
could add critical to quality (CTQ) and operational performance indicators (OPIs) to your
existing deck of goals, CSFs and KPIs.
Keep talking to others in the service management industry. The itSMF, ISACA and Back2ITSM
groups all have some fantastic ideas for implementing and improving ITIL processes so have a
look!

Final thoughts
I'd like to conclude by saying that availability, incident and problem management processes are
critical to service quality. They add value on their own, but aligning them and running them
together will not only drive improvement but will also reduce repeat (boring) incidents, move
knowledge closer to the front line and increase service uptime.

In conclusion, having availability, incident and problem management working together as a trio
is one of the most important steps in moving an IT department from system management to
service management as mind-sets start to change, quality improves and customer satisfaction
increases.

Tags: #back2itsm, availability management, crm, incident management, ISACA, itil, itsm, itsmf,
problem management, service desk, Vawns Guest
