
Deal with Production Issues

Suggestions from ITIL


Problems to solve

  Long resolution time
  Neglected issues
  Issues we lose track of until our users remind us
  Recurring issues
  Inconsistent response times
  Developers are constantly distracted by having to resolve issues
Goal

  Manage issues in a consistent manner
  Fast resolution
  Reduce client impact
  Proactively resolve issues before they impact clients
Basic Concepts
 Incidents
  Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service
 Problems
 A problem is a condition often identified as the
cause of multiple incidents that exhibit common
symptoms.
 Known Errors
  A known error is a condition identified by successful diagnosis of the root cause of a problem and subsequent development of a workaround
Relationship of the three

  A problem is the root cause of its incidents
  An incident is the manifestation of an underlying problem
  One problem can cause many incidents
  A known error is a problem with a known root cause and a known workaround
Manage Incident vs. Manage Problem
  Different goals
  Incident management focuses on restoring service operation as quickly as possible
  Problem management focuses on finding and eliminating the root cause
  Different actions
  Incident management applies workarounds or temporary fixes to quickly restore services
  Problem management issues a change to fundamentally eliminate the root cause
  Incident management is reactive and problem management is proactive
  Incident management emphasizes speed and problem management emphasizes quality
Common mistakes

  Spending tremendous time and effort on finding the root cause before the service level is recovered
  Stopping the investigation after an incident is fixed by a workaround
  Letting the same incident occur repeatedly without understanding the root cause
Solutions from ITIL

  Separate Incident Management and Problem Management into two independent but related processes
  Handle incidents (restore service) as quickly as possible
  Proactively and independently work on resolving problems
  Wisely manage Known Errors
Incident Management
 Always remember the goal is to
“Restore service level as quickly as
possible”
 How to go fast?
 Classification
 Match known errors and known
workarounds
 Appropriate escalation
  Go fast, but don’t go crazy. Don’t skip:
 Record
 Prioritize
 Follow up
Incident Management Process
Acceptance And Record
 Benefits of recording
  Helps diagnose new incidents based on known incidents
  Helps Problem Management find the root cause
  Makes it easy to determine the impact
  Makes it possible to track and control the issue resolution
 Incident Reporting Channels
 User
 System Monitor/Alert
 IT person
Incident Record
 Unique ID
 Basic diagnosis info
 Timestamp
 Symptoms
 User info (name, contact info)
 Who’s responsible
 Additional information
 Screenshots
 Logs
 Status
 New, Accepted, Scheduled, Assigned, Active,
Suspended, Resolved, Terminated
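The record fields and statuses listed above could be modeled roughly as follows. This is a minimal sketch, not the schema of any particular tool: the class and field names (`IncidentRecord`, `user_contact`, and so on) and the in-memory ID counter are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from itertools import count
from typing import List


class Status(Enum):
    """The incident statuses listed on the slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID source; a real tracking tool would assign IDs itself.
_next_id = count(1)


@dataclass
class IncidentRecord:
    # Basic diagnosis info
    symptoms: str
    user_name: str
    user_contact: str
    responsible: str
    timestamp: datetime = field(default_factory=datetime.now)
    # Additional information
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    status: Status = Status.NEW
    incident_id: int = field(default_factory=lambda: next(_next_id))


# Example values are invented for illustration only.
incident = IncidentRecord(
    symptoms="Login page returns an error",
    user_name="Example User",
    user_contact="user@example.com",
    responsible="Service Desk",
)
incident.status = Status.ACCEPTED
```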
Classification
 Classification
 Possible reasons (application, network,
database, business logic, etc.)
 Supporting group (application group,
database group, infrastructure group,
network group, etc.)
 Prioritize
  Priority = Impact × Urgency
 Determine resolution timeline (resolve
within X hours) based on Service Level
Agreement
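One way to turn Priority = Impact × Urgency into a number is sketched below. The 1 to 3 rating scales (1 = highest) and the mapping from the product to priority levels 1 to 4 are assumptions for illustration; a real Service Level Agreement would define both. The resolution timelines (2, 4, 6 and 8 hours) are the ones shown in this deck's escalation table.

```python
# Resolution timeline in hours per priority, as in the deck's escalation table.
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority(impact: int, urgency: int) -> int:
    """Return a priority from 1 (highest) to 4 (lowest).

    Assumes impact and urgency are each rated 1 (high) to 3 (low);
    the thresholds below are hypothetical, not taken from ITIL.
    """
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must be 1, 2 or 3")
    score = impact * urgency  # possible values: 1, 2, 3, 4, 6, 9
    if score == 1:
        return 1
    if score == 2:
        return 2
    if score <= 4:
        return 3
    return 4


def resolution_hours(impact: int, urgency: int) -> int:
    """Look up the resolution timeline for the computed priority."""
    return RESOLUTION_HOURS[priority(impact, urgency)]
```

For example, under these assumed scales an incident with high impact (1) and medium urgency (2) gets priority 2 and a 4-hour resolution timeline.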
Preliminary Support

 Preliminary Response
  Acknowledge acceptance
 Collect basic info
 Provide basic help to the user
 Service Requests
  A service request is a standard service, such as checking status or resetting a password
  Handle service requests through a standard procedure
Match
 Match known errors
 Known solution
 Known workaround
 Known resolution procedure
 Match existing incidents
 Link the new incident with the existing
incidents
 Increase the impact level of the existing
incident
  If the existing incident is already being worked on, inform the responsible person/group
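Matching a new incident against known errors can be sketched as a simple symptom comparison. The word-overlap scoring below is an assumption chosen for brevity, not an ITIL-prescribed method, and `KnownError`, `match_known_errors` and `min_overlap` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KnownError:
    error_id: str
    symptoms: str
    workaround: str


def _words(text: str) -> set:
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def match_known_errors(symptoms: str, known_errors: List[KnownError],
                       min_overlap: int = 2) -> List[KnownError]:
    """Return known errors sharing at least min_overlap words with the
    incident's symptoms, best match first."""
    incident_words = _words(symptoms)
    scored = []
    for known in known_errors:
        overlap = len(incident_words & _words(known.symptoms))
        if overlap >= min_overlap:
            scored.append((overlap, known))
    # Sort on the overlap count only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [known for _, known in scored]
```

A match gives the Service Desk a known workaround to apply immediately; no match means the incident moves on to investigation and diagnosis.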
Investigate and Diagnose

  Escalation
  Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers
  Hierarchical escalation (management escalation): escalate to a higher-level management team
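Escalation by Priorities
Escalation checkpoints: 0 minutes, 10 minutes, then 30%, 60% and 100% of the resolution timeline
Escalation levels engaged, in order:
Priority 1 (resolution timeline 2 hr): A; B; C, D; E, F
Priority 2 (resolution timeline 4 hr): A; B; C; D; E, F
Priority 3 (resolution timeline 6 hr): A; B; C; D
Priority 4 (resolution timeline 8 hr): A; B; C

  A (Service Desk)
  B (Second Line)
  C (Third Line, Supplier)
  D (Incident Manager)
  E (Division Management)
  F (Corporate Management)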
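A time-bound escalation rule like the one in the table can be sketched as a lookup on elapsed time. The example covers only the priority 2 row (4-hour timeline), and it assumes that the five entries in that row map one-to-one onto the five checkpoints (0 minutes, 10 minutes, 30%, 60%, 100% of the timeline). The function name and structure are hypothetical.

```python
ROLES = {
    "A": "Service Desk",
    "B": "Second Line",
    "C": "Third Line, Supplier",
    "D": "Incident Manager",
    "E": "Division Management",
    "F": "Corporate Management",
}


def escalation_levels(elapsed_minutes: float, timeline_minutes: int = 240) -> list:
    """Return the levels that should be engaged after elapsed_minutes.

    Checkpoints follow the priority 2 row: fixed times at 0 and 10 minutes,
    then percentages of the resolution timeline (assumed column mapping).
    """
    checkpoints = [
        (0, ["A"]),
        (10, ["B"]),
        (timeline_minutes * 30 / 100, ["C"]),
        (timeline_minutes * 60 / 100, ["D"]),
        (timeline_minutes, ["E", "F"]),
    ]
    engaged = []
    for threshold, levels in checkpoints:
        if elapsed_minutes >= threshold:
            engaged.extend(levels)
    return engaged


# Two and a half hours into a priority 2 incident:
involved = [ROLES[level] for level in escalation_levels(150)]
```

The same shape would work for the other priorities once their checkpoint assignments are confirmed.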
Investigation Activities
 Assign dedicated support person
 Collect basic info
 Query historical data
 Recent releases
 Recent changes
 Workload trend
 Analyze
  Again, don’t spend too much time finding the root cause. Find a workaround as soon as possible!
Resolve and recover

 Resolution (workarounds or
permanent fix)
 Create a Request For Change (RFC)
 Approve RFC
 Implement Change.
 Record the analysis, the root cause,
the workaround and the solution
 Leave the incident in Open status
when resolution hasn’t been found
Termination

  Contact the user to confirm the incident is resolved
  Change the incident status to “Closed”
  Update the incident record to reflect the final priority, impact, user and root cause
Track and Monitor

  Assign an owner to each incident. Usually it’s the Service Desk person.
  Provide feedback to the users after a change
  Enforce the escalation based on the priority
Problem Management

 Problem Control
 Find the root cause of a problem
 Turn a problem into a Known Error
 Error Control
 Control and Monitor the Known Errors
until they are appropriately handled
 Proactive Problem Management
 Resolve problems before they cause
any incidents
Problem Control
Identify Problems

  Analyze the trends of incidents
  Likely to recur
  Likely more will occur
  Likely to have larger impact
  Analyze the weaknesses of the infrastructure
  Availability
  Capability
  A significant incident (outage)

Diagnosis

  Recreate the incident in a testing environment
  Link the modules with incidents
  Review the latest changes
  After the root cause of a problem is found, the problem becomes a Known Error
Temporary Fixes
 It’s important to find a temporary fix if
the problem causes significant
incident
 If temporary fix involves changes in
the infrastructure, a Request For
Change must be submitted. (Later,
another RFC may be submitted to
fix the root cause)
  For urgent problems, the Emergency Change Request process should be initiated.
Error Control
Identify and Record Known Error
 Identify
 Find the root cause of a problem
 Link a problem with a known error
 Record
 Assign an ID
 Symptoms
 Root cause
 Status
 Notification
 Notify incident management team. They
can associate new incidents with known
errors
Determine the solution
 Evaluate based on
 Service Level Agreement
 Impact and Urgency
 Cost and benefit
 Possible solutions
 Temporary fixes
 Permanent fixes
 No fix (cost is greater than benefits)
 Record the decision in Problem
Database
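The cost-and-benefit part of this evaluation can be sketched as picking the option with the best net benefit and falling back to "No fix" when even that option costs more than it returns. The numeric cost and benefit values are assumed to be estimates in the same unit; the Service Level Agreement and the impact and urgency criteria are not modeled here, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Option:
    name: str       # e.g. "Temporary fix" or "Permanent fix"
    cost: float     # estimated cost of implementing the option
    benefit: float  # estimated benefit, in the same unit as cost


def choose_solution(options: List[Option]) -> str:
    """Return the name of the option with the highest net benefit,
    or "No fix" if there is no option or its cost exceeds its benefit."""
    best = max(options, key=lambda o: o.benefit - o.cost, default=None)
    if best is None or best.cost > best.benefit:
        return "No fix"
    return best.name
```

Whatever the outcome, the decision itself is what gets recorded in the Problem Database.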
Known Errors from other environments
 Known errors from development
environment
 We may choose to release with some
minor known issues
 Known errors from suppliers
 Usually reported in the release notes
 Record, Monitor and Track those
known errors
 Relate problems with those known
errors
PIR (Post Implementation Review)
 Normal problems
 Confirm all the related incidents are closed
 Verify if the problem record is complete
(symptoms, root cause and solutions)
  Change the problem status to Resolved
 Significant problems
 What went well?
 What went wrong?
 How to do better next time?
  How to prevent similar issues from happening again?
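The checks for a normal problem can be expressed as a small gate before the status changes to Resolved. This is a sketch under assumptions: the problem record is represented as a plain dictionary with hypothetical keys, and related incidents are represented only by their status strings.

```python
from typing import Dict, Iterable

# Fields the review expects to be filled in (key names are hypothetical).
REQUIRED_FIELDS = ("symptoms", "root_cause", "solutions")


def ready_to_resolve(problem: Dict[str, str],
                     related_incident_statuses: Iterable[str]) -> bool:
    """True when the problem record is complete and every related
    incident has been closed."""
    record_complete = all(problem.get(name) for name in REQUIRED_FIELDS)
    incidents_closed = all(s == "Closed" for s in related_incident_statuses)
    return record_complete and incidents_closed
```

A problem that fails either check stays open until the record is completed or the remaining incidents are closed.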
Track and Monitor

  Track the full lifecycle of each known error
  Reevaluate impact and urgency. Adjust the priorities accordingly.
  Monitor the progress of the diagnosis and the implementation of the solution. Monitor the implementation of the RFC.
Proactive Problem Management
 Focus on the quality of the
service and the infrastructure
 Analyze operational trends
  Detect potential incidents and prevent them from happening
 Find out the weak points of the
infrastructure or the overloaded
components
Ideas to improve our Production Support process
  Idea 1: Create an independent Problem Management Team
  Idea 2: Create a Problem Database
  Idea 3: Define the Production Support Procedure
  Idea 4: Review and revise the procedures for using TeamTrack
  Idea 5: Enforce Post Implementation Review
  Idea 6: Proactively manage problems
  Idea 7 (optional): Acquire Service Desk software to facilitate the process
Create an independent Problem Management Team
  Can be a full-time team or a part-time team
  Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different.
  Responsible for managing all the production problems (not incidents) for multiple applications
  Identify problems
  Record problems
  Find and evaluate solutions
  Track progress until closure
  Work closely with the existing Production Support team
Create a Problem Database
  An easy-to-search knowledge database
 Include problems and known errors
 Track symptoms, root causes, temporary
fixes, workarounds, and permanent
solutions
 Include all the known errors in DEV and
unresolved or deferred defects in QA/RATE
environments
 Maintained by the Problem Management
Team
  Will be used by the Production Support team for matching and fast resolution of incidents
Define the Production Support Procedure (Work Instructions)
  Create a formal and detailed document. Train the Production Support team to follow the new procedure
 Start with ITIL Incident Management
Process. Adjust it to our own situation and
tools
 Clearly define how to calculate priorities
 Clearly define the time-bound escalation
procedure
 Clearly define the monitoring and tracking
steps
Review and define the procedure for using TeamTrack
  TeamTrack is our existing incident tracking system
 Review the functions of TeamTrack
 Redefine the incident escalation process
according to ITIL suggestions
 Define the interface between PC Support
and IT Production Support Team
 Communication channel
 Roles and responsibilities
 Escalation
 Track and Control
 Knowledge sharing
Enforce PIR

  Contact each user to confirm all the incidents are closed
  Make sure the Problem record is complete and useful
  Identify issues in the Incident and Problem Management process. Add those to the Problem Database.
Proactively Manage Problems
 Responsibility of the Problem Management
Team.
 Perform the following activities:
  Analyze incidents to find trends
  Analyze infrastructure to identify possible bottlenecks
 Run fail-over and stress tests
 Apply a problem solution across multiple related
applications
 Establish and maintain the Production Monitor
System to proactively detect system anomalies
 Evaluate how many problems are
proactively identified and resolved
Service Desk Software

  Evaluate the existing TeamTrack software and see if it covers our needs
 Other popular options
  HP OpenView Service Desk
  Remedy Strategic Service Suite
  CA Unicenter Service Desk
