Deal With Production Issues

Suggestions from ITIL

Problems to solve
 Long resolution time
 Neglected issues
 Issues we lose track of until our users remind us
 Recurring issues
 Inconsistency in response time
 Developers are constantly distracted by resolving issues
Goal
 Manage issues in a consistent manner
 Fast resolution
 Reduce client impact
 Proactively resolve issues before they impact clients
Basic Concepts
 Incidents
 Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
 Problems
 A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms.
 Known Errors
 A known error is a condition identified by successful diagnosis of the root cause of a problem and the subsequent development of a workaround.
Relationship of the three
 A Problem is the root cause of Incidents
 An Incident is the manifestation of an underlying Problem
 One Problem can cause many Incidents
 A Known Error is a Problem with a known root cause and a known workaround
Manage Incident vs. Manage Problem
 Different goals
 Incident Management focuses on restoring service operation as quickly as possible
 Problem Management focuses on finding and eliminating the root cause
 Different actions
 Incident Management applies workarounds or temporary fixes to restore service quickly
 Problem Management issues a change to eliminate the root cause
 Incident Management is reactive; Problem Management is proactive
 Incident Management emphasizes speed; Problem Management emphasizes quality
Common mistakes
 Spending tremendous time and effort looking for the root cause before the service level is restored
 Stopping the investigation once an incident is fixed by a workaround
 Letting the same incident occur repeatedly without understanding the root cause
Solutions from ITIL
 Separate Incident Management and Problem Management into two independent but related processes
 Handle incidents (restore service) as quickly as possible
 Proactively and independently work on resolving problems
 Wisely manage Known Errors
Incident Management
 Always remember the goal is to “Restore service level as quickly as possible”
 How to go fast?
 Classification
 Match known errors and known workarounds
 Appropriate escalation
 Go fast, but don’t go crazy. Don’t skip:
 Record
 Prioritize
 Follow up
Incident Management Process
Acceptance And Record
 Benefits of recording
 Helps diagnose new incidents based on known incidents
 Helps Problem Management find the root cause
 Makes it easy to determine the impact
 Makes it possible to track and control the issue resolution
 Incident Reporting Channels
 User
 System Monitor/Alert
 IT person
Incident Record
 Unique ID
 Basic diagnosis info
 Timestamp
 Symptoms
 User info (name, contact info)
 Who’s responsible
 Additional information
 Screenshots
 Logs
 Status
 New, Accepted, Scheduled, Assigned, Active,
Suspended, Resolved, Terminated
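The record fields listed above could be captured in a small data structure. The following is a minimal sketch, not the schema of TeamTrack or any other tool: the class name `IncidentRecord`, the field names, and the running-counter ID scheme are all assumptions made for illustration. Only the list of statuses comes from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Status(Enum):
    """Incident statuses as listed on the slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID scheme: a simple in-memory running counter.
_next_id = count(1)


@dataclass
class IncidentRecord:
    # Basic diagnosis info
    symptoms: str
    user_name: str
    user_contact: str
    # Who is responsible (assumed to default to the Service Desk)
    owner: str = "Service Desk"
    # Additional information: names of screenshots and log files
    attachments: list[str] = field(default_factory=list)
    status: Status = Status.NEW
    incident_id: int = field(default_factory=lambda: next(_next_id))
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Example usage with made-up data.
record = IncidentRecord(
    symptoms="Checkout page returns HTTP 500",
    user_name="Jane Doe",
    user_contact="jane.doe@example.com",
    attachments=["checkout-error.png", "app.log"],
)
record.status = Status.ACCEPTED
```

Keeping the status as an enumeration rather than free text makes it harder for two teams to record the same state under different names.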
Classification
 Classification
 Possible causes (application, network, database, business logic, etc.)
 Supporting group (application group, database group, infrastructure group, network group, etc.)
 Prioritize
 Priority = Impact × Urgency
 Determine the resolution timeline (resolve within X hours) based on the Service Level Agreement
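As a rough illustration of the priority formula, the sketch below multiplies impact by urgency and maps the product to a priority level and a resolution timeline. The 1 to 3 rating scale and the score thresholds are assumptions chosen for the example; the slides do not define them. The hours per priority (2, 4, 6, 8) follow the resolution timelines shown in the "Escalation by Priorities" slide.

```python
# Resolution timeline in hours per priority level (1 = most urgent),
# following the timelines in the "Escalation by Priorities" slide.
SLA_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority_score(impact: int, urgency: int) -> int:
    """Priority = Impact x Urgency, each rated 1 (low) to 3 (high).

    The 1-3 scale is an assumption for this sketch.
    """
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must each be 1, 2 or 3")
    return impact * urgency


def priority_level(score: int) -> int:
    """Map a score to a priority level; the thresholds are assumed."""
    if score >= 9:
        return 1
    if score >= 6:
        return 2
    if score >= 3:
        return 3
    return 4


def resolution_deadline_hours(impact: int, urgency: int) -> int:
    """Return the number of hours allowed to resolve the incident."""
    return SLA_HOURS[priority_level(priority_score(impact, urgency))]


# Example: a high-impact, medium-urgency incident.
hours = resolution_deadline_hours(impact=3, urgency=2)
```

Whatever scale is chosen, writing the mapping down once keeps priorities consistent between the people who accept incidents.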
Preliminary Support
 Preliminary Response
 Acknowledge acceptance of the incident
 Collect basic info
 Provide basic help to the user
 Service Requests
 A Service Request is a standard service such as checking a status or resetting a password
 Handle service requests through the standard procedure
Match
 Match known errors
 Known solution
 Known workaround
 Known resolution procedure
 Match existing incidents
 Link the new incident with the existing incidents
 Increase the impact level of the existing incident
 If the existing incident is already being worked on, inform the responsible person or group
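One simple way to match a new incident against known errors is to compare the words in its symptom description with the symptoms recorded for each known error. The sketch below does this with plain word overlap. The `KnownError` class, the `match_known_errors` function, the overlap threshold, and the sample data are all hypothetical; a real Problem Database would likely use a proper search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    error_id: str
    symptoms: str
    workaround: str


def _words(text: str) -> set[str]:
    """Lower-case the text and split it into a set of words."""
    return {w.strip(".,;:!?()").lower() for w in text.split()} - {""}


def match_known_errors(incident_symptoms, known_errors, min_overlap=2):
    """Return known errors sharing at least min_overlap words with the
    incident symptoms, best match first."""
    incident_words = _words(incident_symptoms)
    scored = []
    for known_error in known_errors:
        overlap = len(incident_words & _words(known_error.symptoms))
        if overlap >= min_overlap:
            scored.append((overlap, known_error))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [known_error for _, known_error in scored]


# Made-up known errors for illustration only.
KNOWN_ERRORS = [
    KnownError("KE-1", "login page timeout after password reset",
               "Clear the session cache"),
    KnownError("KE-2", "nightly report job fails with database deadlock",
               "Rerun the job after 02:00"),
]

matches = match_known_errors(
    "User reports login timeout on the login page", KNOWN_ERRORS
)
```

When a match is found, the recorded workaround can be applied immediately, which is what lets Incident Management restore service without waiting for a root cause analysis.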
Investigation and Diagnosis
 Escalation
 Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers
 Hierarchical escalation (management escalation): escalate to a higher-level management team
Escalation by Priorities
Priority  Resolution timeline  0 min  10 min  30% of timeline  60% of timeline  100% of timeline
1         2 hr                 A      B       C, D             E, F             -
2         4 hr                 A      B       C                D                E, F
3         6 hr                 A      B       C                D                -
4         8 hr                 A      B       C                -                -
 A (Service Desk)
 B (Second Line)
 C (Third Line, Supplier)
 D (Incident Manager)
 E (Division Management)
 F (Corporate Management)
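A time-bound escalation rule like this can be expressed as a lookup: given the priority and the time elapsed since acceptance, list who should be involved by now. The sketch below assumes that each row of the escalation table fills its columns from left to right (0 minutes, 10 minutes, then 30%, 60% and 100% of the resolution timeline); the slide's layout does not make this fully certain. The function name `involved_roles` and the data layout are hypothetical.

```python
# Resolution timeline in hours per priority, from the escalation slide.
SLA_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}

ROLES = {
    "A": "Service Desk",
    "B": "Second Line",
    "C": "Third Line, Supplier",
    "D": "Incident Manager",
    "E": "Division Management",
    "F": "Corporate Management",
}

# Each step is (kind, value, roles). "min" is minutes after acceptance,
# "pct" is a percentage of the resolution timeline.
# Assumption: table rows are read with columns filled left to right.
ESCALATION = {
    1: [("min", 0, ["A"]), ("min", 10, ["B"]),
        ("pct", 30, ["C", "D"]), ("pct", 60, ["E", "F"])],
    2: [("min", 0, ["A"]), ("min", 10, ["B"]), ("pct", 30, ["C"]),
        ("pct", 60, ["D"]), ("pct", 100, ["E", "F"])],
    3: [("min", 0, ["A"]), ("min", 10, ["B"]),
        ("pct", 30, ["C"]), ("pct", 60, ["D"])],
    4: [("min", 0, ["A"]), ("min", 10, ["B"]), ("pct", 30, ["C"])],
}


def involved_roles(priority: int, elapsed_minutes: float) -> list[str]:
    """Return the roles that should be involved after elapsed_minutes."""
    timeline_minutes = SLA_HOURS[priority] * 60
    roles = []
    for kind, value, step_roles in ESCALATION[priority]:
        if kind == "min":
            threshold = value
        else:
            threshold = timeline_minutes * value / 100
        if elapsed_minutes >= threshold:
            roles.extend(ROLES[code] for code in step_roles)
    return roles


# Example: a priority 1 incident that has been open for 40 minutes.
# 30% of the 2 hr timeline is 36 minutes, so the Third Line and the
# Incident Manager should already be involved.
current = involved_roles(priority=1, elapsed_minutes=40)
```

Encoding the schedule as data rather than as tribal knowledge is one way to make the escalation enforceable by a tracking tool.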
Investigation Activities
 Assign a dedicated support person
 Collect basic info
 Query historical data
 Recent releases
 Recent changes
 Workload trend
 Analyze
 Again, don’t spend too much time finding the root cause. Find a workaround as soon as possible!
Resolve and recover
 Resolution (workarounds or permanent fix)
 Create a Request For Change (RFC)
 Approve the RFC
 Implement the change
 Record the analysis, the root cause, the workaround and the solution
 Leave the incident in Open status when no resolution has been found
Termination
 Contact the user to confirm the incident is resolved
 Change the Incident status to “Closed”
 Update the Incident record to reflect the final priority, impact, user and root cause
Track and Monitor
 Assign an owner to each incident. Usually it’s the Service Desk person.
 Provide feedback to the users after a change
 Enforce the escalation based on the priority
Problem Management
 Problem Control
 Find the root cause of a problem
 Turn a problem into a Known Error
 Error Control
 Control and Monitor the Known Errors
until they are appropriately handled
 Proactive Problem Management
 Resolve problems before they cause
any incidents
Problem Control
Identify Problems
 Analyze the trends of incidents
 Likely to recur
 Likely more will occur
 Likely to have larger impact
 Analyze the weaknesses of the infrastructure
 Availability
 Capability
 A significant incident (outage)
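Trend analysis can start very simply: count how often incidents fall into the same classification and flag the ones that keep coming back as candidate problems. The sketch below is one such approach under stated assumptions. The function name `candidate_problems`, the threshold of three occurrences, and the sample incidents are all made up for illustration.

```python
from collections import Counter


def candidate_problems(incidents, min_occurrences=3):
    """Return (category, count) pairs for categories that recur.

    incidents is an iterable of (category, symptom) pairs. Categories
    seen at least min_occurrences times are returned, most frequent
    first. The threshold is an assumption for this sketch.
    """
    counts = Counter(category for category, _ in incidents)
    return [
        (category, n)
        for category, n in counts.most_common()
        if n >= min_occurrences
    ]


# Made-up incident history for illustration only.
history = [
    ("database", "deadlock in nightly job"),
    ("network", "VPN drop"),
    ("database", "slow report query"),
    ("application", "checkout page error"),
    ("database", "connection pool exhausted"),
    ("application", "session timeout"),
]

recurring = candidate_problems(history)
```

A recurring category is only a starting point; the Problem Management team still has to confirm that the incidents share a root cause.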

Diagnosis
 Recreate the incident in a testing environment
 Link the modules with incidents
 Review the latest changes
 After the root cause of a problem is found, the problem becomes a Known Error
Temporary Fixes
 It’s important to find a temporary fix if the problem causes a significant incident
 If the temporary fix involves changes to the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause.)
 For urgent problems, the Emergency Change Request process should be initiated.
Error Control
Identify and Record Known Errors
 Identify
 Find the root cause of a problem
 Link a problem with a known error
 Record
 Assign an ID
 Symptoms
 Root cause
 Status
 Notification
 Notify incident management team. They
can associate new incidents with known
errors
Determine the solution
 Evaluate based on
 Service Level Agreement
 Impact and Urgency
 Cost and benefit
 Possible solutions
 Temporary fixes
 Permanent fixes
 No fix (cost is greater than benefit)
 Record the decision in the Problem Database
Known Errors from other environments
 Known errors from the development environment
 We may choose to release with some minor known issues
 Known errors from suppliers
 Usually reported in the release notes
 Record, monitor and track those known errors
 Relate problems to those known errors
PIR (Post Implementation Review)
 Normal problems
 Confirm all the related incidents are closed
 Verify that the problem record is complete (symptoms, root cause and solutions)
 Change the problem status to Resolved
 Significant problems
 What went well?
 What went wrong?
 How to do better next time?
 How to prevent similar issues from happening again?
Track and Monitor
 Track the full lifecycle of each known error
 Reevaluate impact and urgency. Adjust the priorities accordingly.
 Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.
Proactive Problem Management
 Focus on the quality of the service and the infrastructure
 Analyze operational trends
 Detect potential incidents and prevent them from happening
 Find the weak points of the infrastructure or the overloaded components
Ideas to improve our Production Support process
 Idea 1: Create an independent Problem Management Team
 Idea 2: Create a Problem Database
 Idea 3: Define the Production Support Procedure
 Idea 4: Review and revise the procedures for using TeamTrack
 Idea 5: Enforce Post Implementation Review
 Idea 6: Proactively manage problems
 Idea 7 (optional): Acquire Service Desk software to facilitate the process
Create an independent Problem Management Team
 Can be a full-time team or a part-time team
 Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different.
 Responsible for managing all the production problems (not incidents) for multiple applications
 Identify problems
 Record problems
 Find and evaluate solutions
 Track the progress until closure
 Work closely with the existing Production Support team
Create a Problem Database
 An easy-to-search knowledge database
 Includes problems and known errors
 Tracks symptoms, root causes, temporary fixes, workarounds, and permanent solutions
 Includes all the known errors in DEV and unresolved or deferred defects in QA/RATE environments
 Maintained by the Problem Management Team
 Used by the Production Support team for matching and fast resolution of incidents
Define the Production Support Procedure (Work Instructions)
 Create a formal and detailed document. Train the Production Support Team to follow the new procedure
 Start with the ITIL Incident Management Process. Adjust it to our own situation and tools
 Clearly define how to calculate priorities
 Clearly define the time-bound escalation procedure
 Clearly define the monitoring and tracking steps
Review and revise the procedures for using TeamTrack
 TeamTrack is our existing Incident Tracking system
 Review the functions of TeamTrack
 Redefine the incident escalation process according to ITIL suggestions
 Define the interface between PC Support and the IT Production Support Team
 Communication channel
 Roles and responsibilities
 Escalation
 Track and Control
 Knowledge sharing
Enforce PIR
 Contact each user to confirm all the incidents are closed
 Make sure the Problem record is complete and useful
 Identify issues in the Incident and Problem Management process. Add those to the Problem Database.
Proactively Manage Problems
 Responsibility of the Problem Management Team.
 Perform the following activities:
 Analyze incidents to find trends
 Analyze infrastructure to identify possible bottlenecks
 Run fail-over and stress tests
 Apply a problem solution across multiple related applications
 Establish and maintain the Production Monitor System to proactively detect system anomalies
 Evaluate how many problems are proactively identified and resolved
Service Desk Software
 Evaluate the existing TeamTrack software and see if it covers our needs
 Other popular options
 HP OpenView Service Desk
 Remedy Strategic Service Suite
 CA Unicenter Service Desk
