
Nathan W. Lindstrom
Professor Jebaraj
Systems Analysis and Design
10 October 2010

Defining a Problem

When something goes wrong within the information systems of a company, often the first question asked is, "What changed?" It's a good question. According to Donna Scott, Vice President of Research at the Gartner Group, eighty percent of unplanned downtime is caused by people and process issues, including poor change management practices, while technology failures account for only twenty percent (Behr). To further compound the problem, "eighty percent of [the time spent recovering from the outage] is wasted on non-productive activities," chief among these being activities surrounding trying to figure out what changed (Behr). If you ask ten different IT managers why this is, you'll undoubtedly get ten different answers, ranging from vague complaints about process failures to IT workers not documenting their actions, and from statements about the need for tighter change control to gripes about how the system administrators act like cowboys.

But according to Whitten and Bentley in Systems Analysis and Design Methods, none of these ill-defined problems we are likely to hear about will move us any closer to understanding the root cause. Instead, we must (a) clearly define the problem; (b) describe the situation we would like to be in, and recast the problem as a gap analysis between where we are now and where we would like to be; and then (c) document the actions needed to move us from where we are to where we want to be.

We've already described the problematic situation as it currently exists: the monitoring system alerts that a critical server has gone offline at six o'clock on a Friday afternoon, but nobody knows why. Multiple people respond, poking and prodding the unresponsive server. Frantic questions are flung about, including "did anyone change anything on it?" and "what was the last change we made today?" and "could Thursday's release have broken it?"

Here is the same unexpected server failure on a Friday afternoon, except that it takes place within a high-performing IT organization, which is what we'd like to become. The monitoring system alerts the on-call person that the server has gone off the air. That person consults the change management system and pulls up a log of recent changes that impacted the server or the network to which it is attached. System log files for the server show a critical process as having run out of memory seconds before the crash; according to the change log, the software to which the process belongs was updated during Tuesday's release. The on-call person adds this information to the incident ticket and escalates it to a senior engineer. The senior engineer proposes rolling back the software version on the cluster lest all the servers begin to fall, domino-like, as they exhaust their memory. The IT manager concurs, and the rollback is successfully executed. Mean time to recovery was under twenty minutes, and at no time did anyone run in circles, screaming.

So what is the difference, or gap, between the first scenario and the second? Both begin the same way, with a critical server going down and the monitoring service setting off an alert. At this point the two scenarios rapidly diverge. Here are the key differences:

1. No one person is clearly responsible for responding to the alert; multiple people all jump in and add to the confusion.
2. A record of past changes is not kept, or if it is kept, it is not easily filterable by affected server. In other words, to be useful in a time of crisis, it must be both accurate and accessible.
3. Once a possible cause is identified, there is no clear escalation procedure. This has a twofold impact: one, multiple people may attempt to fix it, stepping on each other in the process; and two, it is highly unlikely that the fix will be captured in the record of changes.
4. After the initial problem is fixed, the level of confidence in the solution will be quite low, and communication to business stakeholders as to the root causes and future preventative measures will be incomplete or entirely absent.

The solution, then, appears to involve both technology and process:

1. An on-call schedule and rotation must be clearly defined, and participants trained in what is expected of them.
2. A searchable database, or ticketing system, must be set up; a minimal sketch of what such a change log might look like appears after this list.
3. On-call participants, engineers, and managers must be trained in both how to use the ticketing system (technical) and when to use it (process).
4. An escalation procedure must be clearly defined, and participants trained in what is expected of them.
5. A standardized form of communication, issued after each outage or impairment incident, must be created, and participants trained in its use.
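To make the second and third solution items more concrete, here is a minimal sketch of what a searchable change log might look like. The names and structure below (a single SQLite table of changes, queried by affected server) are illustrative assumptions on my part, not a description of any particular product or of the system the organization would ultimately build:

# Hypothetical sketch: a minimal change log that is searchable by affected server.
# Table and column names are illustrative assumptions, not a prescribed design.
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect("change_log.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS changes (
        id          INTEGER PRIMARY KEY,
        changed_at  TEXT NOT NULL,   -- ISO-8601 timestamp of the change
        server      TEXT NOT NULL,   -- hostname of the affected server
        component   TEXT NOT NULL,   -- software or subsystem that was touched
        description TEXT NOT NULL,   -- what was done and why
        author      TEXT NOT NULL    -- who made the change
    )
""")

def log_change(server, component, description, author):
    """Record a change at the moment it is made, so the log stays accurate."""
    conn.execute(
        "INSERT INTO changes (changed_at, server, component, description, author) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now().isoformat(timespec="seconds"), server, component,
         description, author),
    )
    conn.commit()

def recent_changes(server, days=7):
    """Answer the on-call question: what changed on this server recently?"""
    cutoff = (datetime.now() - timedelta(days=days)).isoformat(timespec="seconds")
    rows = conn.execute(
        "SELECT changed_at, component, description, author FROM changes "
        "WHERE server = ? AND changed_at >= ? ORDER BY changed_at DESC",
        (server, cutoff),
    )
    return rows.fetchall()

# Example: a deployment is recorded at release time, then queried during an incident.
log_change("app-server-03", "billing-service 2.4.1", "Deployed Tuesday release", "n.lindstrom")
for changed_at, component, description, author in recent_changes("app-server-03"):
    print(changed_at, component, description, "by", author)

Even a log this simple turns the frantic question "did anyone change anything?" into a query that answers itself in seconds; the harder part, as the list above suggests, is the process discipline of recording every change at the moment it is made.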

For this to be successful, existing systems and processes need to be analyzed before the solution is designed. Furthermore, business stakeholders and all those individuals expected to participate in the changes should be consulted. JRP (joint requirements planning) is probably overkill for this specific challenge, but care should be taken to fully understand the scope of the issues involved, particularly given that resolving them will involve both technology solutions and BPR (business process redesign).

Works Cited

Behr, Kevin, Gene Kim, and George Spafford. The Visible Ops Handbook. Eugene: ITPI, 2006.

Whitten, Jeffrey, and Lonnie Bentley. Systems Analysis and Design Methods. New York: McGraw-Hill, 2007.
