RCA – Root Cause Analysis

1. EXECUTIVE SUMMARY

Customer: Seadrill
Priority: 2 High

Incident Start: 07/11/2017 20:23
Incident End: 07/11/2017 22:50
Inc. Outage duration: 11h, 24m
CIM escalation duration: 01h, 27m
Short Problem Title: Hyperion - Task flows are not running
Brief Description: Incident Description:
On the 7th of November, the Seadrill Business (Daniel Arciniega) reported to DXC that the Hyperion task flows were not running. At the time, no business impact was confirmed, so the Incident was opened as a P3.
The DXC Hyperion application support team was engaged and asked the Apps DBA team to clear the inactive session, which had been present for more than 1 hour. The Apps DBA team checked the tables that show the completion status of the task flows, but identified no issues.
A check of the Hyperion logs showed that the connection between the HFM module and the Hyperion shared services was broken. The DXC Hyperion application support team therefore decided to reset the communication between HFM and the shared services. If the issue persisted after the reset, a server reboot might be required as a last resort. For this reason the CIM team was engaged and the Incident was upgraded to P2, in case approvals for the server reboot were needed.
After the reset, the issue was resolved, which Daniel Arciniega confirmed. It was agreed to keep the Incident under monitoring until midday on the 8th of November. Since no further issues were observed, the ticket was agreed for closure.

• When the issue was first detected/reported. - Reported by Daniel Arciniega to DXC on 07/11/2017 at 10:30 CET.
(What date/time did the connection break as per the log?) The task flows started accumulating on the 6th of November at 10:00 CET. The logs do not show when the connection broke, but since the task flows started accumulating at that time, the connection would have broken after 10:00 CET.
• When the incident was opened and with what priority. - Incident E2-IM014327023 was opened as a P3 on 07/11/2017 at 13:21 CET.
• When the priority was escalated. - The Incident was escalated to P2 on 07/11/2017 at 20:23 CET.
• When CIM Team was involved. - On 07/11/2017 at 20:23 CET.
• When workaround/perm. Solution was implemented and by which team. - The DXC Hyperion team reset the communication between HFM and Hyperion shared services on 07/11/2017 at 22:50 CET.
• When incident was closed - After the monitoring period, the Incident was marked as Resolved on 08/11/2017 at 22:34 CET and closed automatically from HPSM on the 12th of November.

Saved 23-Nov-2017 Confidential Page 1 of 7
© 2017 DXT Technology, L.P. All rights reserved.
628947358.docx
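For reference, the intervals between the timeline entries above can be recomputed directly from the listed timestamps. The following is only an illustrative sketch: the dictionary keys are labels chosen here, not field names from HPSM, and the summary durations in the header may have been measured from different reference points.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"

# Timestamps (CET) copied from the timeline bullets; the keys are
# illustrative labels, not names used by any ticketing tool.
timeline = {
    "reported": "07/11/2017 10:30",
    "opened_p3": "07/11/2017 13:21",
    "escalated_p2": "07/11/2017 20:23",
    "reset_done": "07/11/2017 22:50",
}


def elapsed(start_key, end_key):
    """Return the time between two timeline entries as 'HHh, MMm'."""
    start = datetime.strptime(timeline[start_key], FMT)
    end = datetime.strptime(timeline[end_key], FMT)
    minutes = int((end - start).total_seconds() // 60)
    return f"{minutes // 60:02d}h, {minutes % 60:02d}m"


print("reported  -> reset:", elapsed("reported", "reset_done"))
print("opened    -> reset:", elapsed("opened_p3", "reset_done"))
print("escalated -> reset:", elapsed("escalated_p2", "reset_done"))
```

Keeping the timestamps in one place makes it easy to check any stated duration against the underlying timeline entries.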

Incident Resolution:
Hyperion team re-registered the HFM applications with shared services.

Customer Impact: Business Impact: There was no business impact for the Hyperion business users.
Service Impact: The corporate application does not receive data and the consolidation schedules do not run.
System Criticality: High
Impacted Locations: No locations were impacted. There was only a service impact on the Hyperion application.
Number of users impacted: N/A

Key Findings (causes):


Initiating Root Cause: During the investigation, the DXC Hyperion support team identified a large number of task flows in the application that were stuck in an active state. The task flows run with the Active Directory ID of a member of the functional team. This ID was locked out, so the task flows could not authenticate and remained in the active state.
How was the task able to be scheduled whilst the ID was locked? The task flows were scheduled before the user account was locked (the user password had expired).
Why are these tasks running with a user ID and not under a generic (service account type) ID? The scheduling user ID requires admin access to the application, so the task flows have always been scheduled with either Dan’s or the DXC FAM Team's user IDs. We need to check with the SOX Team to get an exception to schedule them with a generic account.
The lockout led to the task flows piling up without being cleared correctly, which caused the connection breakdown between the HFM module and the Hyperion shared services.
The ID is an individual ID belonging to a person from the functional team. The password of this ID had expired, which caused the ID to be locked. It is the responsibility of the individual from the functional team to renew such passwords and not let them expire and lock their accounts.
How can there be a situation where a personal user ID is able to break a system? The functionality scheduled with the user ID caused the pileup and broke the connection specific to that functionality, but not the entire system.
Contributory Causes: At the time of the Incident, there was no monitoring in place, so the issue had to be flagged by Seadrill to DXC.
After the Incident was resolved, the DXC Hyperion support team raised an SR (Severity 2 SR 3-16157666521: HFM Task flows progress alert) with the Oracle vendor.
Until the vendor provides guidance on setting up the e-mail monitoring, the DXC Hyperion support team has implemented manual monitoring of the task flows, which is performed every 3 hours.


Once the e-mail monitoring is set up, alerting will be automatic and will notify the team of any issues with the state of the task flows.
Cause Code¹: Application
Sub-Cause Code¹: App: Human Error
Key actions to Eliminate
Initiating Root Cause: Action 1: DXC Hyperion functional support team and the Seadrill Hyperion functional administrator to ensure that the task flows are updated with a correct password every 2 weeks. - I need to understand the issue first before I agree. A pileup of task flows breaks the task flow functionality. In this case the password expiry was the cause of the task flow pileup.
Action 2: Calendar alert to be set in order to re-schedule the task flows every week. - Can we not find an automated solution? As the password expiry e-mail alerts are not delivered to the DXC e-mail IDs, either we have to schedule the task flows with a generic account whose password never expires, or we need this calendar alert.
Contributory Causes: Action 3: DXC Hyperion team to set up an e-mail monitoring to alert the team regarding the status of the task flows. - Please share the design with me. We have steps for receiving e-mail alerts for tasks that have either completed or failed, but the issue occurs when a task flow remains in the running status. We are checking with Oracle for further steps.
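The gap described in Action 3 (alerts exist for completed or failed tasks, but not for tasks that stay in the running state) can be pictured as a check on how long each run has been active. The sketch below is a hypothetical illustration, not the Hyperion or Oracle mechanism: the `TaskFlowRun` record, the state labels and the 3-hour threshold (borrowed from the manual check interval) are all assumptions, and the sample data is made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TaskFlowRun:
    """One task flow execution; field names and state labels are hypothetical."""
    name: str
    state: str            # assumed labels: "ACTIVE", "COMPLETED", "FAILED"
    started_at: datetime


def find_stuck_runs(runs, now, max_active=timedelta(hours=3)):
    """Return runs that are still ACTIVE for longer than max_active."""
    return [
        run for run in runs
        if run.state == "ACTIVE" and now - run.started_at > max_active
    ]


# Made-up sample data for illustration only.
now = datetime(2024, 1, 2, 13, 0)
runs = [
    TaskFlowRun("flow_a", "ACTIVE", datetime(2024, 1, 1, 10, 0)),
    TaskFlowRun("flow_b", "COMPLETED", datetime(2024, 1, 2, 9, 0)),
    TaskFlowRun("flow_c", "ACTIVE", datetime(2024, 1, 2, 12, 0)),
]
stuck = find_stuck_runs(runs, now)
for run in stuck:
    print(f"ALERT: {run.name} has been ACTIVE since {run.started_at:%d/%m/%Y %H:%M}")
```

How the run records would be obtained from the application is left open here, since that depends on the guidance expected from the vendor.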

1 From the standard Cause and Sub-Cause Codes documented in PRBM Cause Codes Sub-Cause Codes.

Related to a Change / No Change Reference ID: N/A


Service Request:
Change Management System: N/A
Account or Service Line: APPS Org.
Root Cause Owner: Tsanov, Spas
Region:
RCA Version Number: 1.0
RCA Version Date: 23/11/2017
Incident Record Number: Service Manager (EMEA) Incident E2-IM014327023
Problem Record Number: E2-PM00051911
RtOP Incident Record Number: N/A

Root Cause Analysis

Name KPE(s) / or Primary CI(s) Involved: sdrlhpprhypwb1p.corp.local, Sdrlhpprhypwb2p.corp.local

Business Application Affected: Hyperion


Application Project Reference(s): N/A
Application Contacts: HP_SDRL_HYPERION@groups.ext.hpe.com


Problem Statement: Task flows in Hyperion application were not running


Q:1 Why were the task flows in the Hyperion application not running?
A: A large number of task flows in the application were stuck in an active state.
Why did this cause them not to run? A pileup of task flows breaks the task flow functionality. In this case the password expiry was the cause of the task flow pileup.
Action:

Q:2 Why were there a lot of task flows in an active state in the Hyperion application?
A: The task flows run with the Active Directory ID of a member of the functional team. This ID was locked out, so the task flows could not authenticate and remained in the active state. This led to the task flows piling up without being cleared correctly, which caused the connection breakdown between the HFM module and the Hyperion shared services.
Action: Action 2: Calendar alert to be set in order to re-schedule the task flows every week. Can we not have an automated solution? As the password expiry e-mail alerts are not delivered to the DXC e-mail IDs, either we have to schedule the task flows with a generic account whose password never expires, or we need this calendar alert.
Q:3 Why did this particular Active Directory ID get locked out?
A: The ID is an individual ID belonging to a person from the functional team. The password of this ID had expired, which caused the ID to be locked.
Action:

Q:4 Why was an individual user ID able to break the system? Why are jobs running under user IDs and dependent upon having active credentials to enable them to run?
A: The functionality scheduled with the user ID caused the pileup and broke the connection specific to that functionality, but not the entire system. As per the product behavior, the active credentials are used to run the jobs.
Q:4 Why did the password expire without being renewed?
A: It is the responsibility of the individual from the functional team to renew such passwords and not let them expire and lock their accounts.
Action: Action 1: DXC Hyperion functional support team and the Seadrill Hyperion functional administrator to ensure that the task flows are updated with a correct password every 2 weeks.

Q:5 How is the DXC Hyperion support team monitoring the task flows?
A: At the time of the Incident there was no monitoring of the state of the task flows. After the Incident was resolved, the DXC Hyperion support team raised an SR (Severity 2 SR 3-16157666521: HFM Task flows progress alert) with the Oracle vendor. - Why raise an SR? Are DXC not able to identify how to monitor task flows? The product functionality can send an alert after a step in the task flow completes or fails, but not when it remains in the running state. We are checking with the vendor whether there is a solution for this.
Until the vendor provides guidance on setting up the e-mail monitoring, the DXC Hyperion support team has implemented manual monitoring of the task flows, which is performed every 3 hours.
Once the e-mail monitoring is set up, alerting will be automatic and will notify the team of any issues with the state of the task flows.
Action:

Q:6 Can we re-schedule the task flows every week to avoid the password expiry issue?
Disagree, “this is the tail wagging the dog”.
As the password expiry e-mail alerts are not delivered to the DXC e-mail IDs, we have to schedule the task flows with a generic account whose password never expires.
A: The DXC Hyperion support team will set a calendar alert in order to re-schedule the task flows every week.
Action: Action 2: Calendar alert to be set in order to re-schedule the task flows every week.
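As one way to picture the automated alternative raised in the comments on this action, a reminder could be derived from the date the password was last changed rather than from a fixed weekly calendar entry. The sketch below is purely illustrative: the 90-day maximum password age, the 14-day warning window and the dates are assumptions, not values from the Seadrill Active Directory policy, and the function names are hypothetical.

```python
from datetime import date, timedelta


def days_until_expiry(last_changed, max_age_days, today):
    """Days left before the password expires (negative once expired)."""
    expiry = last_changed + timedelta(days=max_age_days)
    return (expiry - today).days


def needs_reminder(last_changed, max_age_days, today, warn_days=14):
    """True when the password expires within warn_days or has already expired."""
    return days_until_expiry(last_changed, max_age_days, today) <= warn_days


# Assumed example values: password changed on 01/01/2024, 90-day maximum age.
last_changed = date(2024, 1, 1)
print(days_until_expiry(last_changed, 90, date(2024, 3, 20)))
print(needs_reminder(last_changed, 90, date(2024, 3, 20)))
print(needs_reminder(last_changed, 90, date(2024, 2, 1)))
```

A check like this only reduces the risk of a missed renewal; it does not remove the dependency on a personal account that the service account proposal addresses.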

Q:7 Were the Apps and DB servers checked from a capacity perspective?
A: The DXC Hyperion support team checked both the Apps and DB servers and confirmed that CPU, memory and disk capacity were in a normal state.
Action:

Q:8 How could we have detected this quicker?
A: The DXC Hyperion support team could have detected the issue quicker if monitoring of the task flow state had been in place at the time of the Incident.
Action: Action 3: DXC Hyperion team to set up an e-mail monitoring to alert the team regarding the status of the task flows.

Q:9 How could we have resolved it quicker?
A: If the issue had been detected proactively via monitoring, the DXC Hyperion support team could have taken immediate action to restore the service functionality.
Action:

Q:10 How do we prevent this from happening again?
A: The issue can be prevented by the combined implementation of the following corrective actions.
Action: Action 1: DXC Hyperion functional support team and the Seadrill Hyperion functional administrator to ensure that the task flows are updated with a correct password every 2 weeks.
Action 2: Calendar alert to be set in order to re-schedule the task flows every week.
Action 3: DXC Hyperion team to set up an e-mail monitoring to alert the team regarding the status of the task flows.


Resolution List

# 1
Action Statement: DXC Hyperion functional support team and the Seadrill Hyperion functional administrator to ensure that the task flows are updated with a correct password every 2 weeks.
Action deliverable: Mitigation
Action Owner: nageswararao.korrapati@hpe.com
Target date: 01/12/2017
Completion date:

# 2
Action Statement: Calendar alert to be set in order to re-schedule the task flows every week.
Action deliverable: Corrective
Action Owner: nageswararao.korrapati@hpe.com
Target date: 24/11/2017
Completion date:

# 3
Action Statement: DXC Hyperion team to set up an e-mail monitoring to alert the team regarding the status of the task flows.
Action deliverable: Corrective
Action Owner: nageswararao.korrapati@hpe.com
Target date: 15/12/2017
Completion date:

# 4
Action Statement: DXC Hyperion team to ensure following the INCM process for the correct prioritization of Incidents
Action deliverable: Corrective
Action Owner: nageswararao.korrapati@hpe.com
Target date: 30/11/2017
Completion date:

# 5
Action Statement: Proposal for a service account to be used and the password set to: “never expire”
Action deliverable: Mitigation
Action Owner: nageswararao.korrapati@hpe.com
Target date: 08/12/2017
Completion date:

Checkpoint meeting: Held between DXC Hyperion support team, Scott Ainslie and Daniel Arciniega on the
13th of November
Problem Manager - Spas Tsanov
DXC ADM - Abhranil Dhar
Application Lead - Nageswararao Korrapati
DXC on call person that attended the War room calls – No WAR room was organized
Main CIM representative that managed the War room – Georgi Todorov
Main Seadrill representative that attended the War room calls – No WAR room was organized
