''Confidential'' Rig
ZMS & Network
Operational FMEA
Report
Disclaimer: Kingston Systems LLC assumes no responsibility for any loss, physical or financial, or damage from actions taken or not taken in light of the comments or recommendations given or not given in this or any project communication.
The rigs to be delivered will be variants of the “''Confidential''” model. The design is an advanced
automated rig with a modern control system, including driller's chair, assistant driller control station,
integrated automated equipment, powered catwalk, racking system, and Zone Management System
(ZMS). At least two different vendors, NOV and ''Confidential'', are providing equipment on the drill floor.
''Confidential'' is providing both the Top Drive (TD) and the Rough Neck (RN). The Stand Trailing Vehicle
(STV) is provided by NOV. The drilling rig contractor is ''Client''.
The FMEA workshop was held on January 22-26, 2018 at ''Client'' America's offices in Pearland, TX. The
workshop brought together system and operations experts to identify, prepare for, and mitigate those
control network component failures that are most likely to be of consequence within the operating lifetime
of the rig.
In general the rig is well designed, with the integration between the three main vendors considered. The
design has some built-in redundancy but is not a redundant system and should not be considered as
such. The FMEA identified several by-design single point failures that could leave the rig inoperable.
Given the planned remote and harsh operating conditions, the FMEA team makes the suggestions that follow.
The ZMS system itself is highly reliant on several non-redundant network connections to a single ZMS
PLC, and to critical connections with the ''Confidential'' supplied equipment. The operational range of the
ZMS as illustrated by the Zone Management Matrix appears to capture all potential operational collisions
envisioned by the FMEA team while investigating the ZMS system, zones and equipment interactions. A
few design and pre-operational suggestions were made by the team that could improve the ZMS system.
However, without major redesign, the ZMS, like the rig as a whole, cannot exceed the base redundancy
level of the rig: it is a stable design, but not a dual-redundant network system.
The findings and recommendations from the FMEA are generally separated into 4 main groups.
For each finding where the Ranked Risk was deemed UNACCEPTABLE, mitigation was proposed by the
team and the risk post-mitigation calculated. These mitigations generally reduce the severity of the failure
or improve how easy it is to detect. Combined they reduce the potential ranked risk of the failure. This
BEFORE and AFTER picture is most easily illustrated in the two charts below.
The FMEA findings and mitigation recommendations are discussed in detail through the body of the
report, with suggestions summarized in Chapter 6.
The report describes the FMEA approach and scoring method for those less familiar with the concept and
Kingston Systems workshop approach to Operational FMEAs. Additionally, as Software Management of
Change (SMOC) may be an unfamiliar concept to the rig operator, a section describing it and some case
studies was included in the appendix as a reference.
Although the purpose, terminology and other details can vary according to type (e.g. Process
FMEA, Design FMEA, Operational etc.), the basic methodology is similar for all. Our focus is on
operations. Specifically, we want to answer the question “If part XYZ fails, what is the impact to
controlling the well and safe operations?” As such, this Operational FMEA is not directly intended to
critique the design, but to treat it as relatively set.
Item(s)
Function(s)
Potential Failure(s)
Effect(s) of Failure
Probability and Severity of the Failure (to operations)
Operation actions to be taken in case of failure
Detectability (by the user) of the failure
Mitigation or avoidance recommendations
Most analyses of this type also include some method to assess the risk associated with the issues
identified during the analysis and to prioritize corrective actions.
An offline analysis of the system identified a list of components/functions whose failure/loss will
be considered.
Experts were assembled who understand:
o The design and support of the system.
o The operations the system is used to perform.
For each component/function on the pre-assembled list:
o A discussion to gain an understanding of the impact to operations if the part failed.
o Then, appropriate system experts scored the likelihood of the failure occurring, and made
clear to the operations experts the functional effect of that failure on the system. From
this the Occurrence Score is given (See reference Table 1).
o The group scored the Severity of the consequences to operations (See reference Table
2).
o A key component is understanding the ability of the driller and maintenance crew to
detect the failure. This Detection score (See reference Tables 3 and 4), determined if a
mitigation or action plan should be reviewed.
o The operations experts considered what could/should be done to make the system safe
for people and assets in the event of the failure; multiple operational scenarios may have
been considered, with a bias towards those more likely to involve safety-critical
conditions. They considered what could/should be done to continue operations; multiple
operations scenarios may have been considered, with a bias towards those more likely to
present fewer options for working around the lost function.
o The product of Severity, Occurrence and Detection is called the Risk Ranking and is
used to further prioritize the item.
o The Detection Matrix (Chart X) helps identify if mitigation is generally recommended.
Unless Risk Ranking or Detection Level is low enough that the item is not worth further attention,
the group agreed on a recommendation for corrective action to be taken to reduce likelihood
and/or severity, or to provide additional information if there are unanswered questions.
If a mitigation is recommended, it is reviewed by the team, which then re-assesses the Risk Rank
post-mitigation to aid in communicating the reduced operational risk exposure.
During the process, the group is allowed to make additions and corrections to the pre-assembled
list of components/functions.
Different failure modes were discussed and recorded separately only if they led to different safety
and operational actions, or to significantly different occurrence likelihoods and severities.
Failure causes were not explicitly discussed or recorded, although potential causes may have
been explored during discussions of likelihood, severity, or mitigating actions. This is because
the goal of this FMEA was to understand the frequency and impact of the failures, rather than the
causes.
In this case the Risk Ranking score ranged from 1 to 1000. Items are prioritized relative to each
other and against the Detection Matrix (Reference Table 4) within each FMEA exercise.
Following are definitions of columns in the FMEA Workbook, and other terminology used within this
report:
Parent Part: Identifies the major component within which the failure may occur.
Part Failing / Function Lost: When the Parent Part has multiple components or functions whose
failure or loss must be considered separately, this field identifies the specific component or function.
Failure Type: Needed only when a component part has multiple failure modes that need to be
separately considered.
Effect of Failure: The outcome described in terms of both the state of the system and (if appropriate)
the state of the drilling operations.
Occurrence: A score given for likelihood of occurrence of the failure, see table below.
Severity: A score given for severity of impact of the failure, see table below.
Operational Action: A description of what safety actions must be taken and how operations can
proceed with this failure.
Detection Mode: How the user learns of the failure.
Detection Score: The scoring of the detection mode.
Risk Ranking: The product of Occurrence, Severity and Detection; provides a number which, when
used with the Detection Matrix, can sort the worksheet by relative importance.
Risk Acceptable: Risk evaluation mark based on the Detection Matrix
Mitigating Action: The corrective action. Often these are training and spares-management related;
however, they can also be design- and process-specific.
Responsible: Company required to close the item
Completion Date: Target date due
New Occurrence: Score post implementation of mitigation
New Severity: Score post implementation of mitigation
New Detection: Score post implementation of mitigation
New Rank: Product post mitigation
New Rank Acceptable: Risk evaluation mark post implementation based on the Detection Matrix
% Risk Reduced: Measure of Risk Reduction
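The workbook's derived columns follow directly from the scored ones. As a minimal sketch (the function and variable names here are illustrative, not the workbook's actual column headers), Rank and % Risk Reduced can be computed as:

```python
def rank(occurrence: int, severity: int, detection: int) -> int:
    """Risk Ranking: the product of the three 1-10 scores (range 1 to 1000)."""
    for score in (occurrence, severity, detection):
        assert 1 <= score <= 10, "each score is on a 1-10 scale"
    return occurrence * severity * detection

def pct_risk_reduced(old_rank: int, new_rank: int) -> float:
    """% Risk Reduced: relative drop from the pre- to post-mitigation rank."""
    return 100.0 * (old_rank - new_rank) / old_rank

# Example using Item # 9 from the findings: (O=4, S=8, D=4) -> 128,
# mitigated to (O=3, S=7, D=4) -> 84.
before = rank(4, 8, 4)                        # 128
after = rank(3, 7, 4)                         # 84
reduction = pct_risk_reduced(before, after)   # 34.375
```

The same arithmetic underlies the New Rank and % Risk Reduced columns for every item in the workbook.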
Occurrence Scale
Rank 10 (Very High: failure is almost inevitable): at least once per day (>= 1 in 2)
Rank 9 (Very High): at least once per week (1 in 3)
Rank 8 (High: repeated failures): at least once per month (1 in 8)
Rank 7 (High): at least once per year (1 in 20)
Rank 6 (Moderate: occasional failures): at least once per 2 years (1 in 80)
Rank 5 (Moderate): at least once per 5 years (1 in 400)
Rank 4 (Low: relatively few failures): at least once per 10 years (1 in 2,000)
Rank 3 (Low): at least once per 25 years (1 in 15,000)
Rank 2 (Remote: failure is unlikely): less than once per 25 years (1 in 150,000)
Rank 1 (Remote): never (1 in 1,500,000)
Table 1: Occurrence Scoring Reference
Severity Scale
Rank 10 (Hazardous - without warning): Very high severity ranking when a potential failure mode affects safe operation and/or involves noncompliance with regulations, without warning
Rank 9 (Hazardous - with warning): Very high severity ranking when a potential failure mode affects safe operation and/or involves noncompliance with regulations, with warning
Rank 8 (Very High): Drilling systems inoperable, loss of primary function
Rank 7 (High): Drilling systems operable, reduced functionality; customer dissatisfied
Rank 6 (Moderate): Drilling systems operable, but a comfort/convenience item inoperable; customer experiences discomfort
Rank 5 (Low): Drilling system operable with a minor workaround; reduced level of performance
Rank 4 (Very Low): Defect noticed by customers
Rank 3 (Minor): Loss of redundancy with warning; a second failure would be severity 6 or above
Rank 2 (Very Minor): Loss of redundancy with warning; a second failure would be severity 5 or above
Rank 1 (None): No effect
Table 2: Severity Scoring Reference
These simply-defined failure rates effectively rank likelihood of failure, and can be directly related to the
lifetime of the rig. Systems experts generally find it much easier to score with this method, based on their
own experience, rather than trying to apply probabilities, reliability metrics, or other methods.
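Table 1's rate bands can be read as a simple lookup from an expected failure interval to an Occurrence score. A minimal sketch of that mapping (the thresholds mirror Table 1's time-based column; the function and variable names are ours, not the report's):

```python
# (max interval in years, score) pairs taken from Table 1's time-based column.
_OCCURRENCE_BANDS = [
    (1 / 365, 10),  # at least once per day
    (1 / 52, 9),    # at least once per week
    (1 / 12, 8),    # at least once per month
    (1, 7),         # at least once per year
    (2, 6),         # at least once per 2 years
    (5, 5),         # at least once per 5 years
    (10, 4),        # at least once per 10 years
    (25, 3),        # at least once per 25 years
]

def occurrence_score(mean_years_between_failures: float) -> int:
    """Map an expected failure interval to the 1-10 Occurrence score."""
    for max_interval, score in _OCCURRENCE_BANDS:
        if mean_years_between_failures <= max_interval:
            return score
    # Less often than once per 25 years scores 2; 'never' would be scored 1.
    return 2

occurrence_score(4)    # fails roughly every 4 years -> 5
occurrence_score(100)  # -> 2
```

This is exactly why the method is easy to apply in a workshop: an expert only has to estimate roughly how often a part fails over the rig's lifetime, not a formal probability.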
The detection score helps capture how the users learn of the failure. In some cases it is only possible
to learn of the failure when it is too late; in others the failure will happen immediately and the user will
know while trying to run the equipment.
DETECTION Scale
Rank 10 (Absolute uncertainty): No known control available to detect cause/mechanism of failure or the failure mode
Rank 9 (Very Remote): Very remote likelihood current control will detect cause/mechanism of failure or the failure mode
Rank 8 (Remote): Remote likelihood current control will detect cause/mechanism of failure or the failure mode
Rank 7 (Very Low): Very low likelihood current control will detect cause/mechanism of failure or the failure mode
Rank 6 (Low): Low likelihood current control will detect cause/mechanism of failure or the failure mode
Rank 5 (Moderate): Moderate likelihood current control will detect cause/mechanism of failure or the failure mode
Rank 4 (Moderately High): Moderately high likelihood current control will detect cause/mechanism of failure or the failure mode
Rank 3 (High): High likelihood current control will detect cause/mechanism of failure or the failure mode
Rank 2 (Very High): Very high likelihood current control will detect cause/mechanism of failure or the failure mode
Rank 1 (Almost Certain): Current control almost certain to detect cause/mechanism of failure or the failure mode; reliable detection controls are known with similar processes
Table 3: Detection Scoring Scale
For example, according to the risk ranking table in Figure 4, if Severity = 6 and Occurrence = 5, then
corrective action is required if Detection = 4 or higher. If Severity = 9 or 10, then corrective action is
always required. If Occurrence = 1 and Severity = 8 or lower, then corrective action is never required, and
so on.
Risk Ranking or “Rank” is the product of Occurrence, Severity and Detection, and is used to prioritize
individual FMEA items on the worksheet. By the scales described above, the highest potential value is
1000 and the lowest is one. Because the Occurrence and Severity values used in the calculation combine
elements of attendee knowledge and experience as well as team negotiation, a holistic approach needs
to be taken to the resulting range of values. Generally a value over 200 is too high.
Certainly the higher extremity values need urgent attention. Additionally, all identified failure modes
warrant due consideration in order of importance. There is no ‘cut-off point’ beyond which failure mode
actions or mitigations should be ignored.
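The thresholds above can be stated as simple screening rules. The sketch below encodes only the rules quoted in this section (corrective action is always required for Severity 9-10, never for Occurrence 1 with Severity 8 or lower, and a Rank over 200 is generally too high); the full decision comes from the Detection Matrix itself, which is not reproduced here, and the function name is ours:

```python
def needs_urgent_attention(occurrence: int, severity: int, detection: int) -> bool:
    """Apply the report's stated screening rules to one FMEA item."""
    if severity >= 9:       # corrective action is always required
        return True
    if occurrence == 1:     # with Severity <= 8, action is never required
        return False
    rank = occurrence * severity * detection
    return rank > 200       # generally, a value over 200 is too high

needs_urgent_attention(5, 9, 1)   # True: Severity 9
needs_urgent_attention(1, 8, 10)  # False: Occurrence 1, Severity <= 8
needs_urgent_attention(5, 7, 7)   # True: Rank 245 > 200
```

As the text stresses, failing this screen does not mean an item is ignored; lower-ranked items still warrant consideration in order of importance.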
This document references the FMEA Workbook spreadsheet. This Excel spreadsheet contains the
output and actions of the FMEA session and is expected to be distributed and stored with this report
document.
The FMEA workshop reviewed and analyzed over 345 potential single point failures on the rig's overall
control network. Where possible, a potential mitigation or resolution was identified; for 312 of the 345
potential failure points, the identified mitigation reduced the theoretical risk level by some practical
measure.
In Table 6 we see the BEFORE-mitigation clustering of risks around an Occurrence of 5 and a Severity of
7 or 8. Post-mitigation, risk visibly shifts left and up in the matrix (Table 7), toward Occurrence = 5 and
Severity in the range of 4 to 7. The same risk reduction through mitigation appears in the bar chart
(Chart 1) of Severity "Before" and "After".
In theory at least, the rig is safer for longer with the suggested mitigations implemented.
Occ \ Sev     1    2    3    4    5    6    7    8    9   10
  1           3    0    0    0    0    0    0    0    0    0
  2           0    0    0    3    0    0    2    4    4    0
  3           0    0    0    0    1   10    9    4    8    0
  4           0    1    1    7   11    4   11   27    8    0
  5           1   19    2   22   13    8   30   39    7    0
  6           0    4    3    6    1    5    7    7    0    0
  7           0    0    0    2    0    1    0    1    0    0
  8           0    0    0    0    0    0    0    0    0    0
  9           0    0    0    0    0    0    0    0    0    0
 10           0    0    0    0    0    0    0    0    0    0

Severity      1    2    3    4    5    6    7    8    9   10
Total Before  4   24    6   40   26   28   59   82   27    0
After         3   24    6   45   36   45   75   57    4    0
Table 6: Risk distribution BEFORE mitigation (rows: Occurrence; columns: Severity)
Occ \ Sev     1    2    3    4    5    6    7    8    9   10
  1           2    0    0    0    0    0    0    0    0    0
  2           0    0    0    1    0    6    5    2    1    0
  3           0    1    0    0    4   14   15    2    3    0
  4           0    2    3    7   20    6   17   21    0    0
  5           1   17    3   32   12   17   34   26    0    0
  6           0    4    0    3    0    2    4    5    0    0
  7           0    0    0    2    0    0    0    1    0    0
  8           0    0    0    0    0    0    0    0    0    0
  9           0    0    0    0    0    0    0    0    0    0
 10           0    0    0    0    0    0    0    0    0    0

Severity      1    2    3    4    5    6    7    8    9   10
Total After   3   24    6   45   36   45   75   57    4    0
Before        4   24    6   40   26   28   59   82   27    0
Table 7: Risk distribution AFTER mitigation (rows: Occurrence; columns: Severity)
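The matrices above are simply tallies of the (Occurrence, Severity) pair of every scored item, with the "Total" rows summing each Severity column. A sketch of how such a matrix and its totals are built (the item data here is illustrative, not from the workbook):

```python
from collections import Counter

def risk_matrix(items):
    """Count items at each (occurrence, severity) cell, as in Tables 6 and 7."""
    return Counter((occ, sev) for occ, sev in items)

def severity_totals(matrix):
    """Column totals: number of items at each severity, as in the Total rows."""
    totals = [0] * 11  # index by severity 1..10
    for (occ, sev), count in matrix.items():
        totals[sev] += count
    return totals[1:]

# Hypothetical (occurrence, severity) scores for four items.
items = [(5, 8), (5, 8), (4, 7), (2, 9)]
matrix = risk_matrix(items)
severity_totals(matrix)  # [0, 0, 0, 0, 0, 0, 1, 2, 1, 0]
```

Running the same tally on the pre- and post-mitigation scores of all items yields the Before and After distributions shown above.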
As the product of Occurrence, Severity and Detection, the Risk Ranking gives an indication of overall risk
for every item included in the FMEA. Risk Ranking is calculated both BEFORE and AFTER the
recommended action is taken. Generally a value over 200 is too high and requires immediate action.
The positive impact on system safety and reliability is visible in Chart 5 as a reduced number of items
ranked 100 or more after the FMEA-recommended actions are completed; at the same time, the number
of items ranked below 100 increases.
The scoring results of the Failure Modes and Effects Analysis by section give an indication of risk
distribution across the system.
The most significant number of NOT Acceptable items is recorded in the VFD House, followed by the
Drilling Control Room and the Rig Power Station.
The identified mitigation actions for the VFD House could reduce its total number of NOT Acceptable
items by 75.0%. Similarly, risk is reduced across the board by taking the FMEA-recommended actions,
as illustrated in Chart 6.
Chart 6: Scoring of NOT Acceptable items BEFORE and AFTER by Section
The complete workshop analysis results are captured in the FMEA workbook. This includes failure
effects, scoring details, safety actions, and operational actions as determined by the FMEA team. The
remainder of this section reviews the top failure modes and summarizes the overall results.
Below we discuss the failure effect and potential mitigation of the highest ranked failure modes.
The list was created by sorting the FMEA worksheet by Risk Ranking and taking the most significantly
ranked items from each section, balanced with those with the highest Detection scores and the most
significant Severity rankings.
Item # 7
Part failing: Generator Panel (one of 3 running) (+GEN1 -> +GEN5)
Failure Type: Comms lost or unit failure
Effect of failure: Forced into Power limit; no more equipment can start. If demand is above 90%, the Power limit goes active, ramping back MP, then TD, then DW. If below 90%, no impact.
Occurrence: 5 Severity: 5 Detection: 6
Rank Score: 150
Operational Action: Normal use needs 2 to 3 generators. Drilling Power limit may become active and the driller will have to respond.
Detection Mode: Will receive ''CONFIDENTIAL'' Drilling Power limit alarm; may have ''CONFIDENTIAL'' Communications alarm to Driller.
Recommendation: Suggest implementing an alarm to ET of Modbus status and generator failure. Test by ''CONFIDENTIAL'': test break in Modbus communication line.

Item # 9
Part failing: Single Equipment Controller (VFD) - FPBA-01 - applicable for all VFDs
Failure Type: Comms lost or cable break between units
Effect of failure: Lost comms; could fault all 6 pieces of equipment, though not TD or RN.
Occurrence: 4 Severity: 8 Detection: 4
Rank Score: 128
Operational Action: DW hard stop. Lose all equipment, lose MPs & Rotary; do not lose TD, RN. Consider well control options - WAIT. Repair Profibus.
Detection Mode: Immediate impact; loss of comms alarms.
Recommendation: Consider moving VFD priority by moving DWA VFD to the front. Consider training for repairs of this situation <Moving Termination of Profibus>. Consider redundant designs (2 adaptors) - set up a Modbus ring from Generators to VFD drives. Spares.
New Occurrence: 3 New Severity: 7 New Detection: 4
New Rank Score: 84
Item # 25
Part failing: Electrical drive & control system PLC (-CPU1)
Failure Type: SW version backup on rig
Effect of failure: Regression errors
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245
Operational Action: Potential to lose functionality; potential regression error
Detection Mode: Difficult
Recommendation: ''Client'' needs a Software MOC procedure coordinated with the Vendor
New Occurrence: 4 New Severity: 5 New Detection: 3
New Rank Score: 60

Item # 30
Part failing: Electrical drive & control system PLC (-CPU2), as secondary failure
Failure Type: CPU fault
Effect of failure: Lose control of everything except TD and RN. DW hard stop. MP idle stop.
Occurrence: 2 Severity: 9 Detection: 3
Rank Score: 54
Operational Action: Only have TD and RN; no DW or MPs, etc. Rig down. Manage well.
Detection Mode: Loss of communications, loss of control
Recommendation: Drill SOP, spares, training. Need a minimum of 2 spare PLCs.
New Occurrence: 2 New Severity: 6 New Detection: 3
New Rank Score: 36

Item # 31
Part failing: Electrical drive & control system PLC (-CPU2), as secondary failure
Failure Type: SF (System fault)
Effect of failure: Lose control of everything except TD and RN. DW hard stop. MP idle stop.
Occurrence: 2 Severity: 9 Detection: 3
Rank Score: 54
Operational Action: Only have TD and RN; no DW or MPs, etc. Rig down. Manage well.
Detection Mode: Loss of communications, loss of control
Recommendation: Drill SOP, spares, training. Need a minimum of 2 spare PLCs.
New Occurrence: 2 New Severity: 6 New Detection: 3
New Rank Score: 36

Item # 33
Part failing: Electrical drive & control system PLC (-CPU2), as secondary failure
Failure Type: Power failure
Effect of failure: Lose control of everything except TD and RN. DW hard stop. MP idle stop.
Occurrence: 3 Severity: 9 Detection: 3
Rank Score: 81
Operational Action: Only have TD and RN; no DW or MPs, etc. Rig down. Manage well.
Detection Mode: Loss of communications, loss of control
Recommendation: Drill SOP, spares, training. Need a minimum of 2 spare PLCs.
New Occurrence: 2 New Severity: 6 New Detection: 3
New Rank Score: 36

Item # 34
Part failing: Electrical drive & control system PLC (-CPU2), as secondary failure
Failure Type: SW version backup on rig
Effect of failure: Regression errors
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245
Operational Action: Potential to lose functionality; potential regression error
Detection Mode: Difficult
Recommendation: ''Client'' needs a Software MOC procedure coordinated with the Vendor
New Occurrence: 4 New Severity: 5 New Detection: 3
New Rank Score: 60

Item # 39
Part failing: Ethernet to -CPU2, as secondary failure
Failure Type: Cable break to HUB1
Effect of failure: Lose control of everything except TD and RN. DW hard stop. MP idle stop.
Occurrence: 3 Severity: 9 Detection: 3
Rank Score: 81
Operational Action: Only have TD and RN; no DW or MPs, etc. Rig down. Manage well.
Detection Mode: Loss of comms, loss of control
Recommendation: Drill SOP, spares, training
New Occurrence: 3 New Severity: 6 New Detection: 3
New Rank Score: 54

Item # 46
Part failing: ZMS PLC (CPU4)
Failure Type: ZMS version backup on rig
Effect of failure: Regression errors
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245
Operational Action: Potential to lose functionality; potential regression error
Detection Mode: Difficult
Recommendation: ''Client'' needs a Software MOC procedure coordinated with the Vendor
New Occurrence: 4 New Severity: 5 New Detection: 3
New Rank Score: 60
Test by ''CONFIDENTIAL'':
New Occurrence: 4 New Severity: 5 New Detection: 4
New Rank Score: 80
Item # 71
Part failing: HUB3
Failure Type: HUB failure
Effect of failure: Lose ''Confidential'', lose ET1 and ET2, lose both HMIs; ZMS locks down all.
Occurrence: 4 Severity: 9 Detection: 4
Rank Score: 144
Operational Action: Without HMIs cannot override ZMS; well management situation. Reroute 3 connections on SW3 to SW4, override ZMS, move DW and manage well.
Detection Mode: HMIs lose information; lose control access
Recommendation: Move Port5 to SW4 permanently.

Item # 73
Part failing: HUB4
Failure Type: HUB failure
Effect of failure: Lose joysticks, Tool Push client, Power CW, and AD monitor data
Occurrence: 4 Severity: 9 Detection: 4
Rank Score: 144
Operational Action: Lose control; can control RN via ''Confidential'' screen. Maintain well. Move Port1 SW4 to Port2 SW3, move Port4 SW4 to Port8 SW3, and move Port8 SW4 to the now-open Port5 SW3.
Detection Mode: Alarms and HMIs
Suggestion: Permanently move Port5 SW3 to Port2 SW4. Test by ''CONFIDENTIAL''.
New Occurrence: 4 New Severity: 8 New Detection: 3
New Rank Score: 96

Item # 75
Part failing: HUB3 & HUB4
Failure Type: Ethernet cable break from HUB3 to HUB4 (ETH7 to ETH5)
Effect of failure: No ZMS impact; potential operational failure. Alarm? Might have bus fault.
Occurrence: 4 Severity: 5 Detection: 7
Rank Score: 140
Operational Action: Potential short-term loss; potential stop and restart of equipment; potential DW hard stop.
Detection Mode: No alarm
Suggestion: Verify that the tolerance of the watchdog timers will give an alarm but not stop equipment.
Recommendation: Add an alarm of the failure to expedite action. SOP instructions to reconnect. Test by ''CONFIDENTIAL''.
New Occurrence: 4 New Severity: 5 New Detection: 4
New Rank Score: 80
Item # 121
Part failing: Main Driller HMI - HMI1
Failure Type: Power failure
Effect of failure: Lose all 3 HMIs
Occurrence: 5 Severity: 9 Detection: 7
Rank Score: 315
Operational Action: Make safe
Detection Mode: Loss of alarms and visual
Recommendation: Spares, training. Suggest HMIs on different 24VDC supplies and fuses.
New Occurrence: 5 New Severity: 4 New Detection: 3
New Rank Score: 60

Item # 123
Part failing: Main Driller HMI - HMI1
Failure Type: HMI 1 version backup on rig
Effect of failure: Regression errors
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245
Operational Action: Potential to lose functionality; potential regression error
Detection Mode: Difficult
Recommendation: ''Client'' needs a Software MOC procedure coordinated with the Vendor
New Occurrence: 4 New Severity: 5 New Detection: 3
New Rank Score: 60

Item # 126
Part failing: Main Driller HMI - HMI2
Failure Type: Power failure
Effect of failure: Lose all 3 HMIs
Occurrence: 5 Severity: 9 Detection: 7
Rank Score: 315
Operational Action: Make safe
Detection Mode: Loss of alarms and visual
Recommendation: Spares, training. Suggest HMIs on different 24VDC supplies and fuses.
New Occurrence: 5 New Severity: 4 New Detection: 3
New Rank Score: 60

Item # 128
Part failing: Main Driller HMI - HMI2
Failure Type: HMI 2 version backup on rig
Effect of failure: Regression errors
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245
Operational Action: Potential to lose functionality; potential regression error
Detection Mode: Difficult
Recommendation: ''Client'' needs a Software MOC procedure coordinated with the Vendor
New Occurrence: 4 New Severity: 5 New Detection: 3
New Rank Score: 60

Item # 131
Part failing: Assistant Driller HMI - HMI3
Failure Type: Power failure
Effect of failure: Lose all 3 HMIs
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245
Operational Action: Make safe
Detection Mode: Loss of alarms and visual
Recommendation: Spares, training. Suggest HMIs on different 24VDC supplies and fuses.
New Occurrence: 5 New Severity: 4 New Detection: 3
New Rank Score: 60

Item # 133
Part failing: Assistant Driller HMI - HMI3
Failure Type: HMI 3 version backup on rig
Effect of failure: Regression errors
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245
Operational Action: Potential to lose functionality; potential regression error
Detection Mode: Difficult
Recommendation: ''Client'' needs a Software MOC procedure coordinated with the Vendor
New Occurrence: 4 New Severity: 5 New Detection: 3
New Rank Score: 60
Item # 137
Part failing: Remote I/O interface -ET1/ET2
Failure Type: Power failure
Effect of failure: Lose redundancy, as both are connected to the same 24VDC supply
Occurrence: 5 Severity: 9 Detection: 3
Rank Score: 135
Operational Action: Make well safe, repair
Detection Mode: Alarms
Recommendation: Troubleshoot, training

Item # 139
Part failing: Remote I/O interface -ET2, as second failure
Failure Type: System fault (SF)
Effect of failure: Lose signals from the faulted I/O card. Variety of problems, with CPU1/2/3/4 responding as needed; DW hard stop to minimal impact.
Occurrence: 5 Severity: 9 Detection: 3
Rank Score: 135
Operational Action: No backup; must repair
Detection Mode: Alarms
Recommendation: Training, spares, etc.

Item # 140
Part failing: Remote I/O interface -ET2, as second failure
Failure Type: Bus fault (BF)
Effect of failure: Lose signals from the faulted I/O card. Variety of problems, with CPU1/2/3/4 responding as needed; DW hard stop to minimal impact.
Occurrence: 5 Severity: 9 Detection: 3
Rank Score: 135
Operational Action: No backup; must repair
Detection Mode: Alarms
Recommendation: Training, spares, etc.
New Occurrence: 5 New Severity: 8 New Detection: 3
New Rank Score: 120

Item # 141
Part failing: Remote I/O interface -ET2, as second failure
Failure Type: Power failure
Effect of failure: Lose signals from the faulted I/O card. Variety of problems, with CPU1/2/3/4 responding as needed; DW hard stop to minimal impact.
Occurrence: 5 Severity: 9 Detection: 3
Rank Score: 135
Operational Action: No backup; must repair
Detection Mode: Alarms
Recommendation: Training, spares, etc.
New Occurrence: 5 New Severity: 8 New Detection: 3
New Rank Score: 120

Item # 142
Part failing: Remote I/O interface -ET2, as second failure
Failure Type: I/O card/module fault
Effect of failure: Lose signals from the faulted I/O card. Variety of problems, with CPU1/2/3/4 responding as needed; DW hard stop to minimal impact.
Occurrence: 5 Severity: 9 Detection: 3
Rank Score: 135
Operational Action: No backup; must repair
Detection Mode: Alarms
Recommendation: Training, spares, etc.
New Occurrence: 5 New Severity: 8 New Detection: 3
New Rank Score: 120
Item # 189
Part failing: WR Remote I/O interface 1.158
Failure Type: Unit failure
Effect of failure: Current effect: RN not operable; TD no impact; ZMS potential impact. Required effect: ZMS adjusts to the RN failure.
Occurrence: 4 Severity: 9 Detection: 4
Rank Score: 144
Operational Action: If the wrench is retracted, less impact; if the wrench is extended, the impact is higher. There is potential for a ZMS collision. Drilling ops might be able to continue with a workaround; the RN can be moved hydraulically, but it is tough.

Item # 190
Part failing: WR Remote I/O interface 1.158
Failure Type: Communication interface failure
Effect of failure: Current effect: RN not operable; TD no impact; ZMS potential impact. Required effect: ZMS adjusts to the RN failure.
Occurrence: 4 Severity: 9 Detection: 4
Rank Score: 144
Operational Action: If the wrench is retracted, less impact; if the wrench is extended, the impact is higher. There is potential for a ZMS collision. Drilling ops might be able to continue with a workaround; the RN can be moved hydraulically, but it is tough.
Item # 214
Part failing: ''Confidential'' TD/Wrench PLC
Failure Type: Software version backup on rig
Effect of failure: Unknown regression issue
Occurrence: 6 Severity: 8 Detection: 8
Rank Score: 384
Operational Action: ETs to follow correct SMOC on upgrades, installs, and vendor visits
Detection Mode: Difficult for Ops; ET and ''Confidential'' check date, version, etc.
Recommendation: ''Confidential'' SMOC implementation; ''Client'' PM and SMOC. ''Confidential'' to confirm a checksum for PLC vs. server comparison. Suggest ''Client'' keep a configured spare CPU on site.
New Occurrence: 5 New Severity: 6 New Detection: 3
New Rank Score: 90
Item # 248
Part failing: TD Elevator load sensor active (Pressure Switch 0 or 1)
Failure Type: Sensor failure
Effect of failure: Potentially no alarm; no electrical detection of the failure
Occurrence: 6 Severity: 8 Detection: 8
Rank Score: 384
Operational Action: Uncertain; ''CONFIDENTIAL'' to verify
Detection Mode: Alarm? ''Confidential''/''CONFIDENTIAL'' to verify how to detect the failure and the impact of the failure
Recommendation: Potentially use the Traveling Block load cell in lieu of this sensor
New Occurrence: 6 New Severity: 8 New Detection: 8
New Rank Score: 384

Item # 249
Part failing: TD Elevator load sensor active (Pressure Switch 0 or 1)
Failure Type: Mechanical failure post command
Effect of failure: No positive feedback post command
Occurrence: 6 Severity: 8 Detection: 8
Rank Score: 384
Operational Action: Could damage equipment. Visual verification and training required.
Detection Mode: Visual only
Recommendation: Sensor? Require a drilling-interaction push button? Likely this is not a practical solution. Training.
New Occurrence: 6 New Severity: 8 New Detection: 8
New Rank Score: 384

Item # 252
Part failing: TD Elevator open sensor
Failure Type: Failure of sensor
Effect of failure: No electrical detection of the failure until a command is sent
Occurrence: 6 Severity: 7 Detection: 5
Rank Score: 210
Operational Action: Respond and repair
Detection Mode: Lack of feedback leads to an alarm
Recommendation: 3rd-party spares, maintenance
New Occurrence: 5 New Severity: 7 New Detection: 5
New Rank Score: 175

Item # 254
Part failing: IBOP Open/Closed status
Failure Type: Mechanical failure post command
Effect of failure: Blown seals, blown pop-offs; well control situation
Occurrence: 4 Severity: 9 Detection: 5
Rank Score: 180
Operational Action: Potential well management situation; may have the IBOP open when it is thought to be closed. Can use manual control on the TD.
Detection Mode: Visual indication
Recommendation: Training, maintenance
New Occurrence: 3 New Severity: 8 New Detection: 5
New Rank Score: 120
Item # 321
Part failing: 600VAC BUS
Failure Type: Main breaker (Q8) fault/Trip
Effect of failure: VFDs available?, Via UPS have control, Lose HPU, Computers still ON
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245
Operational Action: Semi-controlled (no hydraulic) shut down to repair. Can start standby or emergency Gen, then start MCC and get safe on well
Detection Mode: None - immediate
Recommendation: Confirm Drilling SOP and training for Blackout/Brownout
New Occurrence: 5 New Severity: 7 New Detection: 6
New Rank Score: 210
Item # 324
Part failing: Transformer -T1
Failure Type: Ground fault
Effect of failure: Potential Blackout
Occurrence: 3 Severity: 9 Detection: 4
Rank Score: 108
Operational Action: Respond to Blackout, Make well safe and repair
Detection Mode: Have monitoring
Recommendation: Confirm Drilling SOP and training for Blackout/Brownout
New Occurrence: 3 New Severity: 7 New Detection: 4
New Rank Score: 84
Item # 325
Part failing: Transformer -T1
Failure Type: High temperature
Effect of failure: Potential Transformer Failure and Damage
Occurrence: 4 Severity: 5 Detection: 7
Rank Score: 140
Operational Action: Maintenance to Respond
Detection Mode: No Alarm
Recommendation: Implement alarm
New Occurrence: 3 New Severity: 7 New Detection: 2
New Rank Score: 40
Item # 327
Part failing: 400VAC BUS
Failure Type: System interlock failure
Effect of failure: Double failure post 600VAC fail; Blackout
Occurrence: 3 Severity: 9 Detection: 7
Rank Score: 189
Operational Action: Blackout, make well safe, Maintenance to Respond
Detection Mode: None - immediate
Recommendation: Confirm Drilling SOP and training for Blackout/Brownout, PMs
New Occurrence: 2 New Severity: 9 New Detection: 4
New Rank Score: 72
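Throughout these records the Rank Score is the standard FMEA Risk Priority Number (RPN): the product of the Occurrence, Severity and Detection ratings. A minimal sketch, checking the arithmetic against the item records above:

```python
# FMEA Risk Priority Number (RPN) as used in the item records above.
# Each factor is a 1-10 rating; the product therefore ranges 1-1000.

def rank_score(occurrence: int, severity: int, detection: int) -> int:
    """Rank Score = Occurrence x Severity x Detection."""
    return occurrence * severity * detection

# Item 248 (TD Elevator load sensor): 6 x 8 x 8
assert rank_score(6, 8, 8) == 384
# Item 327 (400VAC BUS interlock failure), before and after mitigation:
assert rank_score(3, 9, 7) == 189
assert rank_score(2, 9, 4) == 72
```

A lower New Rank Score therefore reflects the recommended actions reducing one or more of the three factors, most often Detection.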
An updated ZMS User’s manual should be available to the drilling crew in at least two
languages (Chinese and English)
Permit To Work (PTW) should be required for any operational or maintenance activity while there
is an active issue with the ZMS
The ZMS Bypass feature has password protection. It is highly recommended that the Invisible
feature also be activated with password protection.
''Confidential'' needs to modify the PLC application to integrate the WR Remote I/O Interface 1.158
unit failure communication bit. ''CONFIDENTIAL'' should use it to shut down the RN in the ZMS.
The ZMS MUST lock down any equipment that could potentially collide with the RN. To continue
operation, the RN should be moved to a safe location out of the zone and marked Invisible.
Operation could then continue using manual tongs.
ZMS Matrix to be corrected and updated
Drilling Contractor should prepare Standing Operational Instructions for all critical operational
situations. Those instructions should be available on site in two languages (Chinese and English).
Review and upgrade all testing procedures in accordance with the FMEA. The most critical part
relates to the response of the ZMS and the entire Drilling Control System to various
Communication faults, Hardware faults and Rig Power Station faults. All involved parties
should have mutual agreement about the required additional tests.
Management of equipment calibration between the Main Driller and Assistant Driller HMIs, Flash
and PLC needs to be clearly defined and tested by ''CONFIDENTIAL'' before delivery.
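One finding elsewhere in this report (item 214, Software Version) recommends a "Checksum for PLC vs. Server compare" to catch version mismatches between the PLC application on the rig and the backup held on the server. A minimal sketch of such a comparison; the file paths and the choice of SHA-256 are illustrative assumptions, not part of any vendor's documented procedure:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a PLC application image so two copies can be compared byte-for-byte."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large images do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def versions_match(server_copy: Path, rig_copy: Path) -> bool:
    # Hypothetical check: the backup on the server must be identical to the
    # application loaded from the rig PLC's flash card.
    return sha256_of(server_copy) == sha256_of(rig_copy)
```

Any mismatch flags a version discrepancy for investigation before the PLC is returned to service.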
The FMEA indicated that while the current design is solid, there are suggested design improvements
for both ''Confidential'' and Hong Hua, as well as clear operational process and training tasks for ''Client''.
With the correction of the identified issues, the stability of the control systems and the readiness of the
crew can be improved. As of this report, there are 51 high-level suggestions for the team to consider
that might impact safe operations.
Kingston Systems would like to thank ''Confidential'', ''Client'' and ''Client'' for their participation and
involvement, and we look forward to future cooperation during testing, Commissioning, OMAN
Acceptance and SMOC deployment.
Case 1: Crash of the sack room terminal. An HMI terminal in the sack room was discovered
unresponsive after the close out of the Permit to Work for installing software updates on the
DrillView system. This was determined to be caused by the restarting of servers during the software
upgrade without properly reinitializing all the terminals afterwards.
Case 2: Software Upgrade Installation failure. A SCR (Software Change Request) was filed and
approved. Time was allocated under the Permit to Work process and other users were locked out of the
network and from access to affected machinery. Unfortunately, the technician was unknowingly provided
with a bad release package. To further complicate the situation, no offsite support was available to the
technician. When contact with the home office was reestablished, the missing files were sent but blocked
by antivirus software. Eventually, an alternate route for software delivery was found. After the
installation was completed, it was found to be incorrectly programmed and was of no use.
The end result was several hours of system lock-out time spent on tasks that should have been done
offline. Because of low software quality and poor vendor testing, it was a completely preventable
waste of time.
Case 4: Failure to lock out tag out: The Drawworks control server was taken off line and rebooted for
backup. This backup was approved by the Chief Engineer. During the backup process, the Driller was
in the chair attempting to move the Drawworks. The Driller and the Technician were in informal
communication but no specific isolation or work stop was in place.
The driller’s chair left touch screen had a menu open which stopped responding during and after the
reboot. The Drawworks came to an unexpected halt and the driller was unable to exit the screen or
operate the chair menu using the “close button”. The chair had to be rebooted before full control was
restored. There was no risk of damage or injury as a non-critical function was being performed and the
equipment automatically stops motion when control is lost.
The following actions were not taken but would be required to satisfy SMOC concerns:
A scheduled time in which the action is to take place. This would prevent management confusion as
to why alarms are being generated and why equipment is off line.
A formal notification of supervisors and operators that the action was about to be performed. This
Job Safety Analysis (JSA) would have prevented the miscommunication and avoided any potential
equipment or personnel harm.
A written set of steps. This would have allowed a supervisor to recognize that a reboot was to occur
and better understand possible side effects.
Equipment Isolation would prevent any changes from causing sudden unexpected movement or
inability to move.
Complete understanding of side effects so that the procedure included resetting of other affected
machinery. This requires the approvers to take additional responsibility in understanding how the
equipment functions and is interconnected.
Lesson: Equipment should be locked out and isolated during control system work and under a permit to
work system. Restart sequences for equipment need to be followed and well understood.
Case 5: Poor Testing: A software upgrade was installed, leading to a collision between the top drive and
the top of the drill pipe because the update was designed for a rig with a shorter derrick. The pipe was bent
out of position and was in danger of popping out of the vertical pipe handler gripper arm. The upper stop
limit set point had been unknowingly changed by the software upgrade.
Lesson: Vendor test scripts are not infallible. Software upgrades sometimes change parameters
unexpectedly and without the knowledge of the technician performing the upgrade. Sometimes hidden
logic changes have unintended consequences that are not discovered until after installation. Following
every change, all limits and interlocks must be checked and tested.
Case 6: Inadequate Testing: A software change was made to zone management settings and was
retested between two machines. The interaction with a third machine was not tested and caused a
collision resulting in injury risk and 2 months of critical machinery down time.
Lesson: It is human nature not to verify that changes work as intended, and checking for
newly introduced errors is often skipped. Following every change, all limits and
interlocks must be checked and tested.
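The lesson shared by Cases 5 and 6 — that every limit and interlock must be re-verified after any software change — can be supported by diffing a parameter snapshot taken before the upgrade against one taken after it. A minimal sketch under stated assumptions: the parameter names below are hypothetical and not taken from the rig's actual tag list.

```python
def changed_setpoints(before: dict, after: dict) -> dict:
    """Return every limit/interlock whose value differs after an upgrade,
    including set points that were silently added or removed."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}

# Hypothetical snapshots around a software upgrade; in Case 5 the upper stop
# limit was changed without the technician's knowledge.
before = {"td_upper_stop_limit_m": 41.5, "zms_zone_3_enabled": True}
after  = {"td_upper_stop_limit_m": 38.0, "zms_zone_3_enabled": True}
assert changed_setpoints(before, after) == {"td_upper_stop_limit_m": (41.5, 38.0)}
```

An empty result does not replace functional testing of the interlocks, but a non-empty one immediately flags the silent parameter changes these cases describe.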
BF Bus Fault
CPU Central Processing Unit
CW Catwalk
DeltaP Pump differential pressure
DFMA Drill Floor Manipulator Arm
DW Drawworks
ET Electronic Technician
Ex Explosion-proof
FAT Factory Acceptance Test
FW Floor wrench
GEN Generator
HMI Human machine interface
HP Horsepower
HUB Network HUB (Old)
I/O Input/output
MB Mud bucket
MCC Motor control center
MP Mud Pumps
PLC Programmable logic controller
PTW Permit to Work
RN Rough Neck, Floor Wrench
ROP Rate of penetration
RPM Revolutions per minute
RTD Resistance temperature detector
SF System Fault
SMOC Software Management of Change
SOP Standard Operating (Drilling) Procedure
STV Stand Transfer Vehicle (pipe racker NOV provided)
SW Network Switch (New)
TD Top drive
TRQ Torque
VDC Volts Direct Current
VFD Variable frequency drive
WOB Weight on bit
ZMS Zone management system
Item # 69
Part failing: PLC HUB2
Failure Type: HUB failure
Effect of failure: Lose ZMS altogether on each piece of equipment
Occurrence: 4 Severity: 7 Detection: 3
Rank Score: 84 (Y)
Operational Action: Temporary lock down. Can override/bypass ZMS
Detection Mode: Alarms
Recommendation: PTW. Suggest: SOP instructions to reconnect ETH1 to SW1
Responsible: ''Client''
New Occurrence: 4 New Severity: 7 New Detection: 3
New Rank Score: 84 (Y)
Improvement: 0.0%
Item # 70
Part failing: PLC HUB2
Failure Type: Power failure
Effect of failure: Lose ZMS altogether on each piece of equipment
Occurrence: 4 Severity: 7 Detection: 3
Rank Score: 84 (Y)
Operational Action: Temporary lock down. Can override/bypass ZMS
Detection Mode: Alarms
Recommendation: PTW, Troubleshooting. Suggest: HMIs on different 24VDC supply and Fuses
Responsible: ''Client''
New Occurrence: 4 New Severity: 7 New Detection: 3
New Rank Score: 84 (Y)
Improvement: 0.0%
Item # 138
Part failing: I/O Card/Module
Failure Type: System Fault
Effect of failure: Lose signals - variety of problems, CPUs response varies. Possible Hard Stop
Occurrence: 5 Severity: 8 Detection: 3
Rank Score: 120 (N)
Operational Action: Manage Well, Troubleshoot / Switch to ET2
Detection Mode: Alarms
Recommendation: Troubleshoot, Training, Switch to ET2 as designed. Test by ''CONFIDENTIAL'' before delivery. Required Operational Action: Move RN to Safety, Make RN Invisible, Use manual tongs. Action Covered
Responsible: ''CONFIDENTIAL''/''Client''
New Occurrence: 5 New Severity: 7 New Detection: 3
New Rank Score: 105 (N)
Improvement: 12.5%
Item # 193
Part failing: Network Switch Stratix 5700 connection to RN (''Confidential'' DCR CIP Ethernet/IP Tap Panel 1783 ETAP1F 1.172)
Failure Type: Cable break
Effect of failure: Lose HMI, Lose Control of TD and RN
Occurrence: 5 Severity: 8 Detection: 3
Rank Score: 120 (N)
Operational Action: Hydraulic control of TD and RN is limited. Manage Well, Make Repairs
Detection Mode: ''CONFIDENTIAL'' gets a comms loss
Recommendation: Spares, Training
Responsible: ''Client''
New Occurrence: 5 New Severity: 7 New Detection: 3
New Rank Score: 105 (N)
Improvement: 12.5%
Item # 194
Part failing: Network Switch Stratix 5700 connection to RN (''Confidential'' DCR CIP Ethernet/IP Tap Panel 1783 ETAP1F 1.172)
Failure Type: Switch failure
Effect of failure: Lose HMI, Lose Control of TD and RN
Occurrence: 5 Severity: 8 Detection: 3
Rank Score: 120 (N)
Operational Action: Hydraulic control of TD and RN is limited. Manage Well, Make Repairs
Detection Mode: ''CONFIDENTIAL'' gets a comms loss
Recommendation: Spares, Training
Responsible: ''Client''
New Occurrence: 5 New Severity: 7 New Detection: 3
New Rank Score: 105 (N)
Improvement: 12.5%
Item # 195
Part failing: Network Switch Stratix 5700 connection to RN (''Confidential'' DCR CIP Ethernet/IP Tap Panel 1783 ETAP1F 1.172)
Failure Type: Power failure
Effect of failure: Lose HMI, Lose Control of TD and RN
Occurrence: 6 Severity: 8 Detection: 3
Rank Score: 144 (N)
Operational Action: Hydraulic control of TD and RN is limited. Manage Well, Make Repairs
Detection Mode: ''CONFIDENTIAL'' gets a comms loss
Recommendation: Spares, Training
Responsible: ''Client''
New Occurrence: 6 New Severity: 7 New Detection: 3
New Rank Score: 126 (N)
Improvement: 12.5%
Item # 196
Part failing: ''Confidential'' DCR CIP Ethernet/IP Tap Panel 1783 ETAP1F 1.175
Failure Type: Switch failure
Effect of failure: HMI on, no impact on operations
Occurrence: 5 Severity: 2 Detection: 3
Rank Score: 30 (Y)
Detection Mode: ''CONFIDENTIAL'' gets a comms loss
Recommendation: Spares, Training
Responsible: ''Client''
New Occurrence: 5 New Severity: 2 New Detection: 3
New Rank Score: 30 (Y)
Improvement: 0.0%
Item # 214
Part failing: TD/Wrench PLC (''Confidential'' DCR CIP Panel)
Failure Type: Software Version
Occurrence: 6 Severity: 8 Detection: 8
Rank Score: 384 (N)
Recommendation: Jesus to confirm backup on rig. Checksum compare for PLC vs. Server. Suggest ''Client'' keep a configured spare on site
Responsible: ''Confidential'', ''Client''
New Occurrence: 5 New Severity: 6 New Detection: 3
New Rank Score: 90 (Y)
Improvement: 76.6%
Item # 215
Part failing: TD/Wrench PLC battery
Failure Type: Battery low/Dies
Effect of failure: No impact, because 'in theory' on PLC boot the application is loaded from the flash card. All settings should be uploaded from the PLC application or Server. Minor Impact
Occurrence: 4 Severity: 2 Detection: 2
Rank Score: 16 (Y)
Detection Mode: PLC alarm, ''Confidential'' HMI Alarm
Recommendation: Spare, Training
Responsible: ''Client''
New Occurrence: 4 New Severity: 2 New Detection: 2
New Rank Score: 16 (Y)
Improvement: 0.0%
Item # 216
Part failing: PLC Flash Card
Failure Type: Memory/program failure
Effect of failure: Nothing until second failure of internal memory/program. Impact could be 3-8
Occurrence: 4 Severity: 3 Detection: 6
Rank Score: 72 (Y)
Detection Mode: PLC fault indicator, but no HMI fault
Recommendation: Spare, Training, Verify PMs and checks of cabinets
Responsible: ''Client''
New Occurrence: 4 New Severity: 3 New Detection: 6
New Rank Score: 72 (Y)
Improvement: 0.0%
Item # 217
Part failing: I/O Card/Module
Failure Type: Module fault
Effect of failure: Depends. Potentially TD Down
Occurrence: 4 Severity: 8 Detection: 5
Rank Score: 160 (N)
Operational Action: Depends
Detection Mode: PLC alarm, ''Confidential'' HMI Alarm, Visual indication of equipment fault
Recommendation: Spare, Training
Responsible: ''Client''
New Occurrence: 4 New Severity: 8 New Detection: 5
New Rank Score: 160 (N)
Improvement: 0.0%
Item # 218
Part failing: ''Confidential'' DCR CIP TD/Wrench PLC Panel Communication
Failure Type: Lost Comm. with CPU3
Effect of failure: Lose TD and RN
Occurrence: 4 Severity: 8 Detection: 3
Rank Score: 96 (Y)
Operational Action: Shut down, Make Well Safe
Detection Mode: Alarms, Equipment Stops, ''CONFIDENTIAL''
Recommendation: Spare, Training
Responsible: ''Client''
New Occurrence: 4 New Severity: 8 New Detection: 3
New Rank Score: 96 (Y)
Improvement: 0.0%
Items # 262-267 (ZMS TD signals)
Parts failing: Drill mode Selected (262), TD Speed Zero (263), TD LWCV Opened (264), TD Elevator Opened (265), Torque Mode Selected (266), BUW Closed (267)
Failure Type: Signal or Sensor Fail
Effect of failure: Not a concern; covered by lost comms or other positive indicator
Rank Score: 0
Item # 320
Part failing: RIG Power station
Failure Type: Ground fault
Effect of failure: Blackout?
Rank Score: 0
Item # 321
Part failing: RIG Power station - 600VAC BUS
Failure Type: Main breaker (Q8) fault/Trip
Effect of failure: VFDs available?, Via UPS have control, Lose HPU, Computers still on
Occurrence: 5 Severity: 7 Detection: 7
Rank Score: 245 (N)
Operational Action: Semi-controlled (no hydraulic) shut down to repair. Can start standby or emergency Gen, then start MCC and get safe on well
Detection Mode: None - immediate
Recommendation: Confirm Drilling SOP and training for Blackout/Brownout
Responsible: ''Client''
New Occurrence: 5 New Severity: 7 New Detection: 6
New Rank Score: 210 (N)
Improvement: 14.3%
Item # 322
Part failing: RIG Power station
Failure Type: >>THD (Total Harmonic Distortion)
Effect of failure: Potential Trip/Blackout & Permanent Damage to Electrical Equipment
Occurrence: 5 Severity: 8 Detection: 7
Rank Score: 280 (N)
Operational Action: Potential Blackout Response
Detection Mode: None - immediate
Recommendation: Suggest ''CONFIDENTIAL'' review filtering and monitoring options
Responsible: ''Client''
New Occurrence: 4 New Severity: 8 New Detection: 2
New Rank Score: 64 (Y)
Improvement: 77.1%
Item # 323
Part failing: RIG Power station - GEN 1 to GEN 5
Failure Type: One or more faulty generators
Effect of failure: Could go into Power limit, but low probability
Occurrence: 7 Severity: 6 Detection: 4
Rank Score: 168 (N)
Operational Action: If power limit, manage and start another Gen
Detection Mode: Alarm on Power limit
Recommendation: Confirm Drilling SOP and Maintenance Training
Responsible: ''Client''
New Occurrence: 5 New Severity: 6 New Detection: 3
New Rank Score: 90 (Y)
Improvement: 46.4%
Item # 324
Part failing: RIG Power station - Transformer -T1
Failure Type: Ground fault
Effect of failure: Potential Blackout
Occurrence: 3 Severity: 9 Detection: 4
Rank Score: 108 (N)
Operational Action: Respond to Blackout, Make well safe and repair
Detection Mode: Have monitoring
Recommendation: Confirm Drilling SOP and training for Blackout/Brownout
Responsible: ''Client''
New Occurrence: 3 New Severity: 7 New Detection: 4
New Rank Score: 84 (Y)
Improvement: 22.2%
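The improvement percentages carried in the matrix (e.g. 12.5, 76.6, 77.1) are the relative reduction in Rank Score achieved by the recommended actions. A quick check of that arithmetic:

```python
def improvement_pct(old_rpn: int, new_rpn: int) -> float:
    """Percentage reduction in Risk Priority Number after mitigation."""
    return round((old_rpn - new_rpn) / old_rpn * 100, 1)

# Values taken from the matrix items above:
assert improvement_pct(384, 90) == 76.6   # item 214, software version
assert improvement_pct(280, 64) == 77.1   # item 322, harmonic distortion
assert improvement_pct(245, 210) == 14.3  # item 321, main breaker trip
```

Items whose recommended actions do not change any of the three factors correctly show 0.0.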