No one person is clearly responsible for responding to the alert; multiple people all jump in and add to the confusion.
2. A record of past changes is not kept, or if it is kept, it is not easily filterable by affected server. In other words, to be useful in a time of crisis, it must be both accurate and accessible.
3. Once a possible cause is identified, there is no clear escalation procedure. This has a twofold impact: one, multiple people may attempt to "fix it", stepping on each other in the process; and two, it is highly unlikely that the "fix" will be captured in the record of changes.
4. After the initial problem is "fixed", the level of confidence in the solution will be quite low, and communication to business stakeholders as to the root causes and future preventative measures will be incomplete or entirely absent.

The solution, then, appears to involve both technology and process:
1. An on-call schedule and rotation must be clearly defined, and participants trained in what is expected of them.
2. A searchable database, or ticketing system, must be set up.
3. On-call participants, engineers, and managers must be trained in both how to use the ticketing system (technical) and when to use it (process).
4. An escalation procedure must be clearly defined, and participants trained in what is expected of them.
5. A standardized form of communication, issued after each outage or impairment incident, must be created, and participants trained in its use.
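
To make the technology half of this list concrete, the following is a minimal sketch in Python of a change record that can be filtered by affected server, together with a weekly on-call rotation and a simple escalation order. Everything in it is an assumption made for illustration: the names `ChangeRecord`, `ChangeLog`, `on_call`, and `escalation_chain`, the weekly shift length, and the "primary, then next in rotation, then manager" escalation order are all hypothetical. A real ticketing system would replace the in-memory list.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """One entry in the record of changes (hypothetical schema)."""
    when: datetime
    server: str
    author: str
    summary: str


class ChangeLog:
    """In-memory stand-in for a searchable database or ticketing system."""

    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def for_server(self, server, since=None):
        """Return changes to one server, newest first, optionally after `since`."""
        hits = [
            r for r in self._records
            if r.server == server and (since is None or r.when >= since)
        ]
        return sorted(hits, key=lambda r: r.when, reverse=True)


def on_call(rotation, start, day):
    """Return who is on call on `day`, assuming weekly shifts beginning at `start`."""
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


def escalation_chain(rotation, start, day, manager):
    """Assumed escalation order: primary, next person in rotation, then manager."""
    primary = on_call(rotation, start, day)
    backup = on_call(rotation, start, day + timedelta(days=7))
    return [primary, backup, manager]
```

During an incident, the responder named by `on_call` would query `for_server` for the affected machine to see what changed recently, and would record any fix with `add` so that it is captured in the record of changes. The other half of the solution, training people in when to use such a system, is a process matter that no code can supply.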