The Challenge of Human Errors
Incorporating Human Error Monitoring Into the System Management Infrastructure
In spite of the adoption of monitoring platforms, the processes available for troubleshooting service outages and securing IT servers has remained out-of-sync with the natural human analysis of IT administrators.
Service Outage: Smart Users Make “Smart” Mistakes
Recent studies show that human errors account forapproximately 34% of total outages when countingthe number of outages. Factoring in the total effectof each incident (as measured by Mean Time toRepair and amount of lost data), 56% of totalserver outage impact comes as a result of humanerror.
OutageType% of IncidentsTotal Impact(Downtime /Data Loss)System Error
For system errors, the tedious process of loganalysis has been alleviated partially by theadoption of system monitoring platforms andsoftware profiling utilities.But for human errors, administrators remain at aloss for assistance when searching for potentialcauses. There is no indicator that “This is probablya human error”, or even the simple statement that“Someone touched this server around the sametime as the start of the problem.”A corollary of this issue is that smart users makethe “smartest” mistakes. They know the nooks andcrannies of arcane configuration files that mighttweak an extra 5% of performance out of thesystem. These nooks and crannies are subsequentlythe most difficult to find when they go wrong.
Compliance Auditing and Security
In today’s world of compliance regulations, anadditional trend of outsourcing, temporarycontract employees and remote IT support staff creates a significant loophole to compliancy efforts:This rise of outsourcing brings with it a rise inmission-critical data being handled via RemoteDesktop Protocol (RDP).Of course, the bottom-line complianceresponsibilities remain solely in the hands of thecore IT team. You can outsource Support, but youcan’t outsource Compliance. IT managers todaymay know how their platforms are secured fromexternal access, but the question “How can we besure that third-party vendors with RDP access arenot accessing data that they should not betouching?” is more difficult to answer.
How Human Error Differs from System Error
There are a few fundamental differences betweensystem errors and human errors, and the way thatthey each need to be managed.The first basic difference is repeatability. Forsystem errors, finding the problem is equivalent tofixing the problem. If your troubleshooting processleads to a conclusion that a NIC card is not working,then swapping in a new card closes the issue, andyou can sleep well that evening.Human errors are not this direct. If you find animproper configuration file and swap it with thecorrect configuration file, the problem may besolved temporarily. But when you go to bed thatnight, you’re probably still scratching your head,wondering who or what caused this error, andwhether it will happen again tomorrow.