Case Study - RCA - Customer Complaints - Sologic
Problem Statement
Focal Point: Customer Complaints
When
Date: 03/05/2012
Time: 8:44am - 2:31pm
Unique Timing: While database admin was on vacation
Where
System: Company website, Company IT infrastructure
Location: Philadelphia, PA
Impact
Total: $1,500,000
Frequency: Two times overall
On March 5, 2012 we received numerous complaints from customers about our website being down
while they were attempting to use it. The website was down from approximately 8:44am to 2:31pm
EST. Customers were unable to use our site because they were receiving "500"-type errors from our
web server. "500" errors prevent users from accessing the website. The server was returning "500"
errors because the application server which processes requests was timing out, and we have only one
application server.
The application server was timing out because it was receiving requests while the associated database
was not working. The database was not working because the SQL server was not processing queries.
The SQL server could not process queries because the transaction log had stopped growing. The
log couldn't grow because the T:Drive was full and we were using only one database cluster. Only
one database cluster was in use because we have only two, and the other cluster was being used
for UAT. The drive was full because it has fixed capacity, the log file storage grew, and the logs
were not truncated; the logs must be truncated to reduce memory needs. The logs
weren't truncated because the database administrator (DBA) is tasked with manually truncating them,
and he was on vacation. The backup DBA was not aware the logs needed truncating because there
was no process in place to inform the backup DBA of critical tasks.
Note: This is an example only! The main source of information for this report is from a Sologic RCA
client, but specific information has been omitted to ensure anonymity.
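The causal chain in the narrative can be sketched as a small effect-to-causes mapping, which makes it easy to list the causes that have no further explanation in this report. This is an illustrative sketch only: the node wording, the `CAUSES` name, and the `root_causes` helper are our own assumptions, and placing the backup DBA's unawareness under "Logs not truncated" is our reading of the narrative rather than something the report states outright.

```python
# Illustrative only: each effect maps to the causes the narrative gives for it.
# Node labels are paraphrased from the report; the structure is an assumption.
CAUSES = {
    "Customer complaints": ["Web server returned 500-type errors"],
    "Web server returned 500-type errors": [
        "Application server timing out",
        "Only one application server",
    ],
    "Application server timing out": [
        "Application server receiving requests",
        "Database not working",
    ],
    "Database not working": ["SQL server not processing queries"],
    "SQL server not processing queries": ["Transaction log stopped growing"],
    "Transaction log stopped growing": [
        "T:Drive full",
        "Only one database cluster in use",
    ],
    "Only one database cluster in use": [
        "Only two clusters exist",
        "Other cluster in use for UAT",
    ],
    "T:Drive full": [
        "Fixed drive capacity",
        "Log file storage grew",
        "Logs not truncated",
    ],
    "Logs not truncated": [
        "DBA truncates logs manually",
        "DBA on vacation",
        "Backup DBA unaware logs needed truncating",
    ],
    "Backup DBA unaware logs needed truncating": [
        "No process to inform backup DBA of critical tasks",
    ],
}


def root_causes(effect, causes=CAUSES):
    """Return the causes reachable from `effect` that have no listed cause of their own."""
    children = causes.get(effect, [])
    if not children:
        return [effect]
    leaves = []
    for child in children:
        for leaf in root_causes(child, causes):
            if leaf not in leaves:  # keep first occurrence, preserve order
                leaves.append(leaf)
    return leaves
```

Calling `root_causes("Customer complaints")` walks from the focal point down to the end of each branch, including the missing process for informing the backup DBA.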
Solutions
ID Label Detail
1 Cause: No process in place to inform backup DBA
Solution: Implement process to notify backup DBA of critical tasks when taking
over duties.
Assigned: Jennifer Elderberry
Due: No due date assigned – example only!
Term: Medium
Notes: This would be an automated process to notify the backup DBA.
Est. Cost: No estimated cost available – example only!
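As a rough illustration of what an automated handoff notice for Solution 1 might look like, the sketch below groups an absent owner's critical tasks by backup person and builds one message per backup. Everything here is hypothetical: the `CriticalTask` fields, the `handoff_notices` function, and the role names are assumptions for illustration, and the report does not specify how the notification would be delivered.

```python
from dataclasses import dataclass


@dataclass
class CriticalTask:
    """One recurring task that must not lapse while its owner is away (hypothetical schema)."""
    name: str
    owner: str
    backup: str
    instructions: str


def handoff_notices(tasks, absent_owner):
    """Build one plain-text notice per backup covering the absent owner's critical tasks."""
    by_backup = {}
    for task in tasks:
        if task.owner == absent_owner:
            by_backup.setdefault(task.backup, []).append(task)

    notices = {}
    for backup, items in by_backup.items():
        lines = [f"You are covering for {absent_owner}. Critical tasks:"]
        lines += [f"- {t.name}: {t.instructions}" for t in items]
        notices[backup] = "\n".join(lines)
    return notices


# Example with placeholder role names (not taken from the report).
example_tasks = [
    CriticalTask(
        name="Truncate transaction logs",
        owner="primary_dba",
        backup="backup_dba",
        instructions="Truncate manually before the T:Drive fills",
    ),
]
example_notices = handoff_notices(example_tasks, "primary_dba")
```

The returned dictionary maps each backup to the text of their notice; sending it (email, ticket, chat) is left out because the report gives no detail on that step.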
5 Cause: T:Drive at zero bytes free
Solution: Increase space on T:Drives
Assigned: Ted Dezember
Due: No due date assigned – example only!
Term: Medium
Est. Cost: No estimated cost available – example only!
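The cause behind Solution 5 is a drive that reached zero bytes free. A free-space check along the following lines shows how that condition could be measured; it is a minimal sketch, not part of the solution as stated in the report. The function names and the 10% threshold are assumptions chosen for illustration, and `shutil.disk_usage` is the Python standard-library call used to read capacity.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the volume containing `path` that is free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # treat an unreadable or zero-size volume as having no free space
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.10):
    """True when free space on the volume falls below the (assumed) threshold."""
    return free_fraction(path) < min_free_fraction
```

On the affected system the path would be the drive holding the transaction log, for example `is_low_on_space("T:\\")`; the threshold would need to be chosen to suit how quickly the log grows.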
Cause and effect chart (box labels recovered from the diagram; connecting lines not preserved):
Customer Complaints
Customers not able to access our web site
Customers need/want to access site
Customers attempted to access site
Web server returned error ("500"-type)
"500" errors prevent access to website
Time outs result in "500" errors
The application server was timing out
Only one application server exists
Application server processes requests
Requests made of application server
People visiting website
Site was live
Application server relies on working database
Database not working
Functional database relies on working SQL server
SQL trans. log needs to grow to process queries
Transaction log was unable to grow
Transaction log located on T:Drive
Storage required for log to grow
Only one database cluster in use
We only have two SQL clusters
T:Drive damaged
T:Drive at zero bytes free
Database Admin (DBA) was on vacation
Logs are manually truncated by DBA